2009.01 ranking queries on uncertain data: a probabilistic threshold approach wenjie zhang, xuemin...

2009.01

Ranking Queries on Uncertain Data: A Probabilistic Threshold Approach

Wenjie Zhang, Xuemin Lin
The University of New South Wales & NICTA

Ming Hua, Jian Pei
Simon Fraser University

Presenter: Wang Liang
Supervisor: Prof. David Cheung


Outline

• Introduction

• Problem

• Algorithms

• Performance of Experiments

• Conclusions


Introduction

• Top-k query in certain database: return the k tuples with maximum scores based on some scoring function.

Query: Top-2 longest durations that a panda stays in a location in a time


Introduction

• If the database is uncertain, what’s the answer of top-k query?

• The uncertain database can be viewed as the summary of a set of possible worlds.

• Each tuple is associated with a probability.

• Multiple tuples can have constraints such as mutual exclusion among them (generation rules):
  • {R2, R3}
  • {R5, R6}

Possible world: deterministic database instance.

Query: Top-2 longest durations that a panda stays in a location in a time


Introduction

• Two problems:
  • What does a probabilistic top-k query mean?
  • How can a probabilistic threshold top-k query be answered efficiently?


Outline

• Introduction

• Problem

• Algorithms



Problem Settings

• Database: Uncertain Database.

• A probabilistic threshold top-k query (PT-k query):
  • Query: Q(k, f(x), T) and threshold: p

• For each possible world W, Q is applied and a set of k tuples Qk(W) is returned.

• The top-k probability of tuple t is the total probability of the possible worlds W in which t is in Qk(W).

• The answer set to a PT-k query is the set of all tuples whose top-k probability values are at least p.

$$\Pr\nolimits^{k}_{Q,T}(t) = \sum_{W \in \mathcal{W},\; t \in Q^{k}(W)} \Pr(W) \tag{1}$$

$$Answer(Q, p, T) = \{\, t \mid t \in T,\ \Pr\nolimits^{k}_{Q,T}(t) \ge p \,\} \tag{2}$$
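Equations (1) and (2) can be checked on tiny inputs by enumerating every possible world. The sketch below is only an illustration under stated assumptions: all tuples are independent (no generation rules), the function name `ptk_brute_force` and the `(name, score, probability)` layout are hypothetical, and the cost is exponential in the number of tuples.

```python
from itertools import product


def ptk_brute_force(tuples, k, p):
    """Answer a PT-k query by enumerating all possible worlds.

    tuples: list of (name, score, probability); assumed independent.
    Returns (top-k probability per tuple, answer set).
    """
    topk_prob = {name: 0.0 for name, _, _ in tuples}
    for present in product([False, True], repeat=len(tuples)):
        # Pr(W): product over tuples of Pr(t) if t is in W, else 1 - Pr(t).
        world_prob = 1.0
        for (_, _, prob), is_in in zip(tuples, present):
            world_prob *= prob if is_in else 1.0 - prob
        # Q^k(W): the k highest-scoring tuples that exist in W.
        world = [t for t, is_in in zip(tuples, present) if is_in]
        world.sort(key=lambda t: t[1], reverse=True)
        for name, _, _ in world[:k]:
            topk_prob[name] += world_prob  # Equation (1)
    answer = {name for name, pr in topk_prob.items() if pr >= p}  # Equation (2)
    return topk_prob, answer
```

For example, `ptk_brute_force([("a", 3, 0.5), ("b", 2, 0.5), ("c", 1, 1.0)], k=1, p=0.3)` enumerates 8 worlds.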


Example

• A probabilistic threshold top-k query (PT-k query):
  • Query: Q(2, Duration(x), Table 3) and threshold: p = 0.3
  • For each possible world W, Q is applied and a set of k tuples Qk(W) is returned.
  • The top-k probability of t is the total probability of the possible worlds W in which t is in Qk(W).
  • The answer set to a PT-k query is the set of all tuples whose top-k probability values are at least p.


Outline

• Introduction

• Problem

• Algorithms



Algorithms

• An Exact Algorithm

• A Sampling Method

• A Poisson Approximation Based Method


An Exact Algorithm

• The Basic Case

• Handling Generation Rules

• Pruning Techniques


The Basic Case

• Assumption: all tuples are independent.

• Scan the tuples in table T in the ranking order.

• Let $L = t_1 \cdots t_n$ be the list of all tuples in table T in the ranking order.

• For tuple $t_i$, $S_{t_i} = \{t_1, \ldots, t_{i-1}\}$ is the dominant set.

• $\Pr(t_i, j)$ is the probability that tuple $t_i$ is ranked at the j-th position in a possible world.

• $\Pr(S_{t_i}, j)$ is the probability that exactly j tuples in $S_{t_i}$ appear in a possible world.

• $\Pr^{k}(t_i)$ is the top-k probability of $t_i$.

$$\Pr(t_i, j) = \Pr(t_i)\,\Pr(S_{t_i}, j-1) \tag{3}$$

$$\Pr\nolimits^{k}(t_i) = \sum_{j=1}^{k} \Pr(t_i, j) = \Pr(t_i) \sum_{j=1}^{k} \Pr(S_{t_i}, j-1) \tag{4}$$

• In the basic case, for $1 \le i, j \le |T|$:

$$\Pr(S_{t_i}, 0) = \Pr(S_{t_{i-1}}, 0)\,(1 - \Pr(t_{i-1})) \tag{5}$$

$$\Pr(S_{t_i}, j) = \Pr(S_{t_{i-1}}, j-1)\,\Pr(t_{i-1}) + \Pr(S_{t_{i-1}}, j)\,(1 - \Pr(t_{i-1})) \tag{6}$$

• Since $S_{t_1}$ is empty, the recursion starts from $\Pr(S_{t_1}, 0) = 1$ and $\Pr(S_{t_1}, j) = 0$ for $j > 0$.
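The recursion in Equations (3) to (6) can be sketched as a single scan over the ranked list. This is a minimal illustration, assuming all tuples are independent and already sorted in ranking order; the function name `topk_probabilities` and the list-of-probabilities input are hypothetical choices, not the authors' implementation.

```python
def topk_probabilities(probs, k):
    """Top-k probability of every tuple in the basic (independent) case.

    probs[i] is the membership probability of the (i+1)-th ranked tuple.
    """
    # dist[j] = Pr(S_t, j) for j = 0..k-1; the first dominant set is empty.
    dist = [1.0] + [0.0] * (k - 1)
    result = []
    for p in probs:
        # Equation (4): Pr^k(t_i) = Pr(t_i) * sum_{j=1..k} Pr(S_{t_i}, j-1).
        result.append(p * sum(dist))
        # Equations (5) and (6): add the current tuple to the dominant set.
        new_dist = [dist[0] * (1.0 - p)]
        for j in range(1, k):
            new_dist.append(dist[j - 1] * p + dist[j] * (1.0 - p))
        dist = new_dist
    return result
```

Only the first k entries of the distribution are kept, because Equation (4) never needs $\Pr(S_{t_i}, j)$ for $j \ge k$; this keeps the scan at O(kn) work.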


Handling Generation Rules

• Rule-Tuple Compression:

• Let $L = t_1 \cdots t_n$ be the list of all tuples in table T in the ranking order.

• For a tuple $t_i$, two situations caused by multi-tuple generation rules complicate the computation:
  • Situation one: $t_i$ is an independent tuple, and some tuples involved in a generation rule R are ranked higher than $t_i$.
  • Situation two: $t_i$ is involved in a generation rule R, and some tuples in R are ranked higher than $t_i$.

$$\Pr(t_i, j) = \Pr(t_i)\,\Pr(S_{t_i}, j-1) \tag{3}$$

$$\Pr\nolimits^{k}(t_i) = \sum_{j=1}^{k} \Pr(t_i, j) = \Pr(t_i) \sum_{j=1}^{k} \Pr(S_{t_i}, j-1) \tag{4}$$


Situation One

• Situation one: ti is an independent tuple. Some tuples involved in generation rule R are ranked higher than ti.

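• Solution:
  • Suppose R: $t_{r_1} \oplus t_{r_2} \oplus \cdots \oplus t_{r_m}$ is in ranking order, and $t_{r_1}, t_{r_2}, \ldots, t_{r_{m_0}}$ are ranked higher than $t_i$.
  • The tuples involved in R can be divided into two parts:
    • $R_{left} = \{t_{r_1}, \ldots, t_{r_{m_0}}\}$, with $\Pr(t_{R_{left}}) = \sum_{j=1}^{m_0} \Pr(t_{r_j})$
    • $R_{right} = \{t_{r_{m_0+1}}, \ldots, t_{r_m}\}$, with $\Pr(t_{R_{right}}) = \sum_{j=m_0+1}^{m} \Pr(t_{r_j})$
  • Equation (4) is replaced by Equation (7), which uses the compressed dominant set of Equation (8):

$$\Pr\nolimits^{k}(t_i) = \sum_{j=1}^{k} \Pr(t_i, j) = \Pr(t_i) \sum_{j=1}^{k} \Pr(S_{t_i}, j-1) \tag{4}$$

$$\Pr\nolimits^{k}(t_i) = \sum_{j=1}^{k} \Pr(t_i, j) = \Pr(t_i) \sum_{j=1}^{k} \Pr(S'_{t_i}, j-1) \tag{7}$$

$$S'_{t_i} = (S_{t_i} - \{t \mid t \in R\}) \cup \{t_{R_{left}}\} \tag{8}$$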


Situation Two

• Situation two: ti is involved in generation rule R, and some tuples in R are ranked higher than ti.

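• Solution:
  • Suppose R: $t_{r_1} \oplus t_{r_2} \oplus \cdots \oplus t_{r_m}$ is in ranking order, and $t_{r_1}, t_{r_2}, \ldots, t_{r_{m_0-1}}$ are ranked higher than $t_i$ (where $t_i = t_{r_{m_0}}$).
  • The tuples involved in R can be divided into two parts:
    • $R_{left} = \{t_{r_1}, \ldots, t_{r_{m_0-1}}\}$, with $\Pr(t_{R_{left}}) = \sum_{j=1}^{m_0-1} \Pr(t_{r_j})$
    • $R_{right} = \{t_{r_{m_0+1}}, \ldots, t_{r_m}\}$, with $\Pr(t_{R_{right}}) = \sum_{j=m_0+1}^{m} \Pr(t_{r_j})$
  • Equation (4) is replaced by Equation (9), which uses the reduced dominant set of Equation (10):

$$\Pr\nolimits^{k}(t_i) = \sum_{j=1}^{k} \Pr(t_i, j) = \Pr(t_i) \sum_{j=1}^{k} \Pr(S_{t_i}, j-1) \tag{4}$$

$$\Pr\nolimits^{k}(t_i) = \sum_{j=1}^{k} \Pr(t_i, j) = \Pr(t_i) \sum_{j=1}^{k} \Pr(S'_{t_i}, j-1) \tag{9}$$

$$S'_{t_i} = S_{t_i} - \{t \mid t \in R\} \tag{10}$$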
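The two situations can be combined in one routine that builds the compressed dominant set $S'_{t_i}$ for each tuple and then applies the basic-case recursion to it. The sketch below is deliberately naive (it rebuilds $S'_{t_i}$ from scratch for every tuple, without the reordering or pruning optimizations mentioned in the talk). The function name, the 0-based indices, and the representation of rules as sets of indices are assumptions made for illustration.

```python
def topk_probabilities_with_rules(probs, rules, k):
    """Top-k probabilities with mutually exclusive generation rules.

    probs: membership probabilities in ranking order (0-based indices).
    rules: list of sets of indices; tuples in one rule are mutually exclusive.
    """
    rule_of = {i: r for r, members in enumerate(rules) for i in members}
    result = []
    for i, p_i in enumerate(probs):
        compressed = []  # probabilities of the entries of S'_{t_i}
        left_mass = {}   # rule id -> Pr(t_{R_left}) for rules with tuples above t_i
        for h in range(i):
            r = rule_of.get(h)
            if r is None:
                compressed.append(probs[h])  # independent tuple, kept as is
            else:
                left_mass[r] = left_mass.get(r, 0.0) + probs[h]
        for r, mass in left_mass.items():
            # Situation two: tuples from t_i's own rule are dropped (Eq. 10).
            # Situation one: other rules contribute one rule-tuple (Eq. 8).
            if r != rule_of.get(i):
                compressed.append(mass)
        # dist[j] = Pr(S'_{t_i}, j), computed with Equations (5) and (6).
        dist = [1.0] + [0.0] * (k - 1)
        for q in compressed:
            dist = [dist[0] * (1.0 - q)] + [
                dist[j - 1] * q + dist[j] * (1.0 - q) for j in range(1, k)
            ]
        result.append(p_i * sum(dist))  # Equations (7) and (9)
    return result
```

Compressing a rule into a single rule-tuple works because at most one tuple of a rule can appear in any possible world, so the rule contributes at most one tuple ahead of $t_i$.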


Example

• Query: top-3 with p = 0.3

• Generation rules: {t1, t2, t8} and {t4, t5}

$$\Pr(t_i, j) = \Pr(t_i)\,\Pr(S_{t_i}, j-1) \tag{3}$$

$$\Pr\nolimits^{k}(t_i) = \sum_{j=1}^{k} \Pr(t_i, j) = \Pr(t_i) \sum_{j=1}^{k} \Pr(S_{t_i}, j-1) \tag{4}$$

• t1:
  • Pr(S_t1, 0) = 1
  • Pr(S_t1, 1) = 0
  • Pr(S_t1, 2) = 0
  • Pr^k(t1) = 0.3 × (1 + 0 + 0) = 0.3

• t2:
  • Pr(S_t2, 0) = 1
  • Pr(S_t2, 1) = 0
  • Pr(S_t2, 2) = 0
  • Pr^k(t2) = 0.3 × (1 + 0 + 0) = 0.3

• t3, with t1 and t2 compressed into the rule-tuple t_{1-2}:
  • Pr(t_{1-2}) = 0.5
  • Pr(S_t3, 0) = (1 - Pr(t_{1-2})) Pr(S_t{1-2}, 0) = 0.5
  • Pr(S_t3, 1) = Pr(t_{1-2}) Pr(S_t{1-2}, 0) + (1 - Pr(t_{1-2})) Pr(S_t{1-2}, 1) = 0.5
  • Pr(S_t3, 2) = 0
  • Pr^k(t3) = Pr(t3) × Σ_j Pr(S_t3, j - 1) = 1 × (0.5 + 0.5 + 0) = 1

• t4:
  • Pr(S_t4, 0) = 0
  • Pr(S_t4, 1) = 0.5
  • Pr(S_t4, 2) = 0.5
  • Pr^k(t4) = 0.3 × (0 + 0.5 + 0.5) = 0.3

• Pr(ti, j) is the probability that tuple ti is ranked at the j-th position in a possible world.

• Pr(S_ti, j) is the probability that exactly j tuples in S_ti appear in a possible world.

• Pr^k(ti) is the top-k probability of ti.


Pruning Techniques

• Four pruning rules:
  • Two of them avoid checking tuples that cannot satisfy the probability threshold.
  • Two of them are stopping conditions.


Time Complexity

• Time complexity: $O\big(kn + k \sum_{R \in \mathcal{R}_T} span(R)\big)$

• $\mathcal{R}_T$ is the set of all generation rules in table T.

• n is the number of tuples in table T.

• span(R) is the number of tuples in generation rule R.


Algorithms





A Sampling Method

• Trade off the accuracy of answers against the efficiency.

• For a tuple t, let Xt be a random variable indicating the event that t is ranked in the top-k list of a possible world:
  • Xt = 1, if t is ranked in the top-k list.
  • Xt = 0, otherwise.

• Then Prk(t) = E[Xt].

• Generate a set of samples S of possible worlds, compute the mean of Xt in S, namely ES[Xt] , as the approximation of E[Xt] .


A Sampling Method

• Scan table T once to generate one sample.

• An independent tuple ti is included in s with probability Pr (ti)

• For a multi-tuple generation rule R: $t_{r_1} \oplus t_{r_2} \oplus \cdots \oplus t_{r_m}$, s includes one tuple involved in R with probability Pr(R). If s takes a tuple in R, the tuple $t_{r_l}$ is chosen with probability $\Pr(t_{r_l}) / \Pr(R)$.

• Compute the top-k tuples in s. For each tuple t in the top-k list, Xt = 1.

• The top-k probability of ti is estimated as Prk(ti) ≈ ES[Xti].

• Stopping condition (Chernoff-Hoeffding bound): for any $\delta$ ($0 < \delta < 1$), $\epsilon$ ($\epsilon > 0$), and a sample S of possible worlds, if

$$|S| \ge \frac{3 \ln \frac{2}{\delta}}{\epsilon^{2}},$$

then for any tuple t,

$$\Pr\{\, |E_S[X_t] - E[X_t]| > \epsilon\, E[X_t] \,\} \le \delta.$$
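The sampling procedure and the sample-size bound can be sketched as follows. This is an illustrative version under stated assumptions: tuples are given in ranking order, rules are sets of 0-based indices, and the names `sample_size` and `sample_topk_probabilities` as well as the fixed `seed` parameter are hypothetical. Each sample scan stops as soon as k tuples have been taken, since lower-ranked tuples cannot be in that sample's top-k list.

```python
import math
import random


def sample_size(epsilon, delta):
    """Number of samples required by the bound |S| >= 3 ln(2/delta) / epsilon^2."""
    return math.ceil(3 * math.log(2 / delta) / epsilon ** 2)


def sample_topk_probabilities(probs, rules, k, num_samples, seed=0):
    """Estimate Pr^k(t_i) as the fraction of sampled worlds with t_i in the top-k."""
    rng = random.Random(seed)
    rule_of = {i: r for r, members in enumerate(rules) for i in members}
    counts = [0] * len(probs)
    for _ in range(num_samples):
        # For each rule, decide which member (if any) appears in this sample.
        chosen = {}
        for r, members in enumerate(rules):
            ordered = sorted(members)
            weights = [probs[i] for i in ordered]
            chosen[r] = None
            if rng.random() < sum(weights):  # include a tuple with Pr(R)
                # Member t is picked with probability Pr(t) / Pr(R).
                chosen[r] = rng.choices(ordered, weights=weights)[0]
        taken = 0
        for i, p in enumerate(probs):  # scan in ranking order
            r = rule_of.get(i)
            present = (chosen[r] == i) if r is not None else (rng.random() < p)
            if present:
                counts[i] += 1  # X_t = 1 for this sample
                taken += 1
                if taken == k:
                    break  # the top-k list of this sample is complete
    return [c / num_samples for c in counts]
```

For instance, `sample_size(0.1, 0.05)` gives the number of sampled worlds needed for a relative error of at most 10% with probability at least 95%.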


Algorithms





A Poisson Approximation Based Method

• Let $X_1, \ldots, X_n$ be a set of independent random variables such that $\Pr(X_i = 1) = p_i$ and $\Pr(X_i = 0) = 1 - p_i$ ($1 \le i \le n$). Let $X = \sum_{i=1}^{n} X_i$. Then $E[X] = \sum_{i=1}^{n} p_i$. If all $p_i$'s are identical, $X_1, \ldots, X_n$ are called Bernoulli trials and X follows a binomial distribution; otherwise, $X_1, \ldots, X_n$ are called Poisson trials, and X follows a Poisson binomial distribution.

$$\Pr(t_i, j) = \Pr(t_i)\,\Pr(S_{t_i}, j-1) \tag{3}$$

$$\Pr\nolimits^{k}(t_i) = \Pr(t_i) \sum_{j=1}^{k} \Pr(S_{t_i}, j-1) \tag{4}$$

• Construct a set of Poisson trials corresponding to $S_{t_i}$ as follows:
  • For each independent tuple $t' \in S_{t_i}$, construct a random trial $X_{t'}$ with $\Pr(X_{t'} = 1) = \Pr(t')$.
  • For each multi-tuple rule R with $R \cap S_{t_i} \neq \emptyset$, combine the tuples in $R \cap S_{t_i}$ into a rule-tuple $t_R$ such that $\Pr(t_R) = \sum_{t' \in R \cap S_{t_i}} \Pr(t')$, and construct a random trial $X_{t_R}$ with $\Pr(X_{t_R} = 1) = \Pr(t_R)$.

• Let $X = \sum_{i=1}^{n} X_i$; then $\Pr(S_{t_i}, j) = \Pr(X = j)$ ($0 \le j \le n$), and

$$\Pr(t_i, k) = \Pr(t_i)\,\Pr(X = k-1) \tag{11}$$

$$\Pr\nolimits^{k}(t_i) = \Pr(t_i)\,\Pr(X < k) \tag{12}$$



• Distribution of the Poisson binomial probability: for a tuple $t_i \in T$, let $\mu = \sum_{t' \in S_{t_i}} \Pr(t')$. Then
  1. $\Pr(t_i, k) = 0$ for $k > |S_{t_i}| + 1$;
  2. $\Pr(t_i, k) < \Pr(t_i, k+1)$ for $k \le \mu - 1$, and $\Pr(t_i, k) > \Pr(t_i, k+1)$ for $k \ge \mu + 1$;
  3. $\arg\max_{1 \le j \le |S_{t_i}|+1} \Pr(t_i, j) \approx \mu + 1$.

$$\Pr(t_i, k) = \Pr(t_i)\,\Pr(X = k-1) \tag{11}$$

$$\Pr\nolimits^{k}(t_i) = \Pr(t_i)\,\Pr(X < k) \tag{12}$$

• If $\mu$ is sufficiently larger than k, then the top-k probability of $t_i$ is small.

• General stopping condition: given a top-k query $Q^{k}(P, f)$ and probability threshold p, for a tuple $t \in T$, let $\mu = \sum_{t' \in S_t} \Pr(t')$. Then $\Pr^{k}(t) < p$ if

$$\mu \ge k + \ln\frac{1}{p} + \sqrt{\ln^{2}\frac{1}{p} + 2k \ln\frac{1}{p}}.$$

• For example: if k = 100 and p = 0.3, then the stopping condition is $\mu \ge 117$.
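The general stopping condition is a closed-form threshold on the accumulated probability mass $\mu$, so it is cheap to evaluate during the scan. A minimal sketch follows; the function names are hypothetical, and `scan_depth` assumes the probabilities of the (compressed) tuples are supplied in ranking order.

```python
import math


def stopping_threshold(k, p):
    """Smallest mu for which the general stopping condition holds."""
    log_term = math.log(1 / p)
    return k + log_term + math.sqrt(log_term ** 2 + 2 * k * log_term)


def scan_depth(probs, k, p):
    """Number of tuples read before the accumulated mass reaches the threshold."""
    threshold = stopping_threshold(k, p)
    mu = 0.0
    for depth, prob in enumerate(probs, start=1):
        mu += prob
        if mu >= threshold:
            return depth
    return len(probs)  # the condition never fired; the whole list was read
```

With k = 100 and p = 0.3, `stopping_threshold` returns a value slightly below 117, matching the example on the slide.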



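• When the success probabilities are small and the number of Poisson trials is large, the Poisson binomial distribution can be approximated well by the Poisson distribution.

$$\Pr\nolimits^{k}(t_i) = \Pr(t_i)\,\Pr(X < k) \tag{12}$$

• For a set of Poisson trials $X_1, \ldots, X_n$ such that $\Pr(X_i = 1) = p_i$, let $X = \sum_{i=1}^{n} X_i$. X follows a Poisson binomial distribution. Let $\mu = E[X] = \sum_{i=1}^{n} p_i$. The probability of $X = k$ can be approximated by

$$\Pr(X = k) \approx f(k; \mu) = \frac{\mu^{k}}{k!}\, e^{-\mu},$$

and the probability of $X \le k$ by

$$\Pr(X \le k) \approx F(k; \mu) = \frac{\Gamma(\lfloor k+1 \rfloor, \mu)}{\lfloor k \rfloor !}.$$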
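Putting the approximation to work only requires the mean $\mu$ of the (compressed) dominant set and a Poisson cumulative distribution. The sketch below sums the Poisson probabilities directly instead of evaluating the incomplete gamma function; the two are equivalent for integer k. The function names are hypothetical, and `dominant_probs` is assumed to already contain one probability per independent tuple or rule-tuple ranked above t.

```python
import math


def poisson_cdf(k, mu):
    """Pr(X <= k) for X ~ Poisson(mu), for an integer k >= 0."""
    term = math.exp(-mu)  # Pr(X = 0)
    total = term
    for j in range(1, k + 1):
        term *= mu / j    # Pr(X = j) from Pr(X = j - 1)
        total += term
    return total


def approx_topk_probability(p_t, dominant_probs, k):
    """Approximate Pr^k(t) = Pr(t) * Pr(X < k) with X ~ Poisson(mu)."""
    mu = sum(dominant_probs)
    return p_t * poisson_cdf(k - 1, mu)
```

Because only the running sum $\mu$ is needed, the estimate for each tuple costs O(k) arithmetic on top of a single pass over the ranked list.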



• Time complexity: O(n'), where n' is the number of tuples read before the general stopping condition is satisfied. n' depends on the parameter k, the probability threshold p, and the probability distribution of the tuples.


Outline

• Introduction

• Problem

• Algorithms



Experiments Setting

• PC: 3.0 GHz Pentium 4 CPU, 1.0 GB main memory, and a 160 GB hard disk, running the Microsoft Windows XP Professional Edition operating system.

• Algorithms: implemented in Microsoft Visual C++ V6.0

• Data sets: a real data set and some synthetic data sets.
  • Real data set: the International Ice Patrol Iceberg Sightings Database, used to show the differences among the answers under different definitions of top-k queries on uncertain data.
  • Synthetic data sets: used to evaluate the algorithms.


Synthetic Data Sets

• 20,000 tuples and 2,000 multi-tuple generation rules.

• The number of tuples involved in each multi-tuple generation rule follows the normal distribution N(5, 2).

• The probability values of independent tuples and multi-tuple generation rules follow the normal distributions N(0.5, 0.2) and N(0.7, 0.2), respectively.

• By default, k = 200 and p = 0.3.

• NB: since ranking queries are extensively supported by modern database management systems, the authors treat the generation of a ranked list of uncertain tuples as a black box and test the algorithms on top of the ranked list.


Scan Depth

• Stopping condition: number of tuples scanned by the Poisson approximation based method.

• Avg sample length: average number of tuples read by the sampling algorithm to generate a sample unit.

• Exact algo: number of tuples scanned by the exact algorithm.

• Answer set: number of tuples in the answer set.


Efficiency

• RC: exact algorithm with rule-tuple compression only.

• RC+AR: exact algorithm with RC and aggressive reordering.

• RC+LR: exact algorithm with RC and lazy reordering.

• Sampling: the sampling method.

• The runtime of the Poisson approximation based method is always less than one second.


The approximation quality

• The recall and precision of Poisson approximation based method are always higher than 85% with runtime less than one second.

• Precision: percentage of tuples returned by sampling method that are in the actual top-k list returned by the exact method.

• Recall: percentage of tuples returned by the exact method that are also returned by the sampling method.


Scalability

• (a): the number of tuples varies from 20,000 to 100,000; the number of multi-tuple rules is 10% of the number of tuples; k = 200 and p = 0.3.

• (b): the number of tuples is fixed; the number of rules varies from 500 to 2,500.

• The runtime increases mildly when the database size increases, due to the pruning rules and the improvement on extracting sample units.


Outline

• Introduction

• Problem

• Algorithms



Conclusions

• Proposed a new definition of top-k query on uncertain data.

• Developed three algorithms: an exact algorithm, a sampling method, and a Poisson approximation based method.


The End


The approximation quality

• Average error rate, where $\hat{\Pr}^{k}(t)$ is the approximate top-k probability of t:

$$\frac{\sum_{\Pr^{k}(t) \ge p} \left| \Pr^{k}(t) - \hat{\Pr}^{k}(t) \right| / \Pr^{k}(t)}{\left| \{\, t \mid \Pr^{k}(t) \ge p \,\} \right|}$$

• Precision: percentage of tuples returned by sampling method that are in the actual top-k list returned by the exact method.

• Recall: percentage of tuples returned by the exact method that are also returned by the sampling method.
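The three quality measures can be computed directly from the exact and approximate top-k probabilities. The sketch below is an illustration only; the function name and the dictionary inputs (tuple id mapped to top-k probability) are assumptions, and empty answer sets are given conventional values to avoid division by zero.

```python
def approximation_quality(exact, approx, p):
    """Return (precision, recall, average error rate) for threshold p.

    exact, approx: dicts mapping tuple id -> top-k probability.
    """
    exact_answer = {t for t, pr in exact.items() if pr >= p}
    approx_answer = {t for t, pr in approx.items() if pr >= p}
    hits = len(exact_answer & approx_answer)
    precision = hits / len(approx_answer) if approx_answer else 1.0
    recall = hits / len(exact_answer) if exact_answer else 1.0
    if exact_answer:
        # Relative error averaged over the tuples in the exact answer set.
        error = sum(
            abs(exact[t] - approx.get(t, 0.0)) / exact[t] for t in exact_answer
        ) / len(exact_answer)
    else:
        error = 0.0
    return precision, recall, error
```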


Pruning Technique

• Let $t_1 \ldots t_m \ldots t_n$ be the tuples in the ranking order. Assume $L = t_1 \ldots t_m$ are read. Let LR be the set of open rules with respect to $t_{m+1}$. For any tuple $t_i$ ($i > m$):

• If $t_i$ is not in any rule in LR, the top-k probability of $t_i$ satisfies

$$\Pr\nolimits^{k}(t_i) \le \sum_{j=0}^{k-1} \Pr(L, j).$$

• If $t_i$ is in a rule in LR, the top-k probability of $t_i$ satisfies

$$\Pr\nolimits^{k}(t_i) \le \max_{R \in LR} \left(1 - \Pr(t_{R_{left}})\right) \sum_{j=0}^{k-1} \Pr(L - t_{R_{left}}, j).$$
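The first bound gives a stopping test: once the bound falls below p, no unread tuple outside the open rules can qualify. The sketch below covers only that first case and assumes the tuples read so far are independent (or already compressed into rule-tuples); the function name `unseen_upper_bound` is hypothetical.

```python
def unseen_upper_bound(read_probs, k):
    """Upper bound sum_{j=0..k-1} Pr(L, j) on Pr^k of any unread tuple.

    read_probs: probabilities of the tuples (or rule-tuples) already read.
    """
    # dist[j] = probability that exactly j of the read tuples appear.
    dist = [1.0] + [0.0] * (k - 1)
    for q in read_probs:
        dist = [dist[0] * (1.0 - q)] + [
            dist[j - 1] * q + dist[j] * (1.0 - q) for j in range(1, k)
        ]
    return sum(dist)
```

A scan could call this after each tuple and stop as soon as the returned value drops below the threshold p.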
