estimating the cost of operations. from l.q.p. to p.q.p having parsed a query and transformed it...

21
Estimating the Cost of Operations

Post on 20-Dec-2015

215 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Estimating the Cost of Operations. From l.q.p. to p.q.p Having parsed a query and transformed it into a logical query plan, we must turn the logical plan

Estimating the Cost of Operations

Page 2: Estimating the Cost of Operations. From l.q.p. to p.q.p Having parsed a query and transformed it into a logical query plan, we must turn the logical plan

From l.q.p. to p.q.p • Having parsed a query and transformed it into a logical query plan, we must

turn the logical plan into a physical plan. – We do so by considering many different p.q.p. that are derived from the

l.q.p., and evaluating or estimating the cost of each.

Example

Initial logical query plan Two candidates for the best logical query plan

Page 3: Estimating the Cost of Operations. From l.q.p. to p.q.p Having parsed a query and transformed it into a logical query plan, we must turn the logical plan

Estimating Costs• How can we estimate the number of tuples in an intermediate

relation?

• We don’t want to execute the query in order to learn the costs. So, we need to estimate them.

Rules about estimation formulas:

1. Give (somehow) accurate estimates

2. Easy to compute

Page 4: Estimating the Cost of Operations. From l.q.p. to p.q.p Having parsed a query and transformed it into a logical query plan, we must turn the logical plan

Projection

• Projection retains duplicates, so the number of tuples in the result is the same as in the input.

• Result tuples are usually shorter than the input tuples.

• The size of a projection is the only one we can compute exactly.

Page 5: Estimating the Cost of Operations. From l.q.p. to p.q.p Having parsed a query and transformed it into a logical query plan, we must turn the logical plan

SelectionLet S = A=c (R)

We can estimate the size of the result as

T(S) = T(R) / V(R,A)

Let S = A<c (R)

On average, T(S) would be T(R)/2, but more properly: T(R)/3

Let S = Ac (R),

Then, an estimate is:

T(S) = T(R)*[(V(R,A)-1)/V(R,A)], or simply

T(S) = T(R)

Page 6: Estimating the Cost of Operations. From l.q.p. to p.q.p Having parsed a query and transformed it into a logical query plan, we must turn the logical plan

Selection ...Let

S = C AND D(R) = C(D(R)) and

U = D(R).

First estimate T(U) and then use this to estimate T(S).

Example

S = a=10 AND b<20(R)T(R) = 10,000, V(R,a) = 50T(S) = (1/50)* (1/3) * T(R) = 67

Note: Watch for selections like:

a=10 AND a>20(R)

Page 7: Estimating the Cost of Operations. From l.q.p. to p.q.p Having parsed a query and transformed it into a logical query plan, we must turn the logical plan

Selection ...• Let S = C OR D(R). • Simple estimate: T(S) = T(C(R)) + T(D(R)).• Problem: It’s possible that T(S)T(R)!

A more accurate estimate• Let:

– T(R)=n, – m1 = size of selection on C, and – m2 = size of selection on D.

• Then T(S) = n(1-(1-m1/n)(1-m2/n)) Why?

• Example: S = a=10 OR b<20(R). T(R) = 10,000, V(R,a) =50• Simple estimation: T(S) = 3533• More accurate: T(S) = 3466

Page 8: Estimating the Cost of Operations. From l.q.p. to p.q.p Having parsed a query and transformed it into a logical query plan, we must turn the logical plan

Natural Join R(X,Y) S(Y,Z)

• Anything could happen!

Extremes

• No tuples join: T(R S) = 0

• All tuples join: i.e. R.Y=S.Y = a.

Then,

T(R S) = T(R)*T(S)

Page 9: Estimating the Cost of Operations. From l.q.p. to p.q.p Having parsed a query and transformed it into a logical query plan, we must turn the logical plan

Two AssumptionsContainment of value sets

• If V(R,Y) ≤ V(S,Y), then every Y-value in R is assumed to occur as a Y-value in S

• When such thing can happen?

• For example when: Y is foreign key in R, and key in S

Preservation of set values

• If A is an attribute of R but not S, then it is assumed that

V(R S, A)=V(R, A)

• This may be violated when there are dangling tuples in R

• There is no violation when: Y is foreign key in R, and key in S

Page 10: Estimating the Cost of Operations. From l.q.p. to p.q.p Having parsed a query and transformed it into a logical query plan, we must turn the logical plan

Natural Join size estimation• Let, R(X,Y) and S(Y,Z), and suppose Y is a single attribute.

• What’s the size of T(R S)?

• Let r be a tuple in R and s be a tuple in S. What’s the probability that r and s join?

• Suppose V(R,Y) V(S,Y)

• By the containment of set values we infer that:

– Every Y’s value in R appears in S.

• So, the tuple r of R surely is going match with some tuples of S, but what’s the probability it matches with s?

• It’s 1/V(S,Y). So, T(S)/V(S,Y) tuples of S would match with tuple r.

• Hence, T(R S) = T(R)*T(S)/V(S,Y)

• By a similar reasoning, for the case when V(S,Y) V(R,Y), we get T(R S) = T(R)*T(S)/V(R,Y).

• Summarizing we have as an estimate:

T(R S) = T(R)*T(S)/max{V(R,Y),V(S,Y)}

Page 11: Estimating the Cost of Operations. From l.q.p. to p.q.p Having parsed a query and transformed it into a logical query plan, we must turn the logical plan

T(R S) = T(R)*T(S)/max{V(R,Y),V(S,Y)}

• Example: R(a,b), T(R)=1000, V(R,b)=20 S(b,c), T(S)=2000, V(S,b)=50, V(S,c)=100 U(c,d), T(U)=5000, V(U,c)=500

• Estimate the size of R S U• T(R S) =

1000*2000 / 50 = 40,000 • T((R S) U) =

40000 * 5000 / 500 = 400,000• T(S U) =

20,000• T(R (S U)) =

1000*20000 / 50 = 400,000

• The equality of results is not a coincidence.

• Note 1: estimate of final result should not depend on the evaluation order

• Note 2: intermediate results could be of different sizes

Page 12: Estimating the Cost of Operations. From l.q.p. to p.q.p Having parsed a query and transformed it into a logical query plan, we must turn the logical plan

Natural join with multiple join attrib.• R(x,y1,y2) S(y1,y2,z)• T(R S) = T(R)*T(S)/m1*m2, where

m1 = max{V(R,y1),V(S,y1)}

m2 = max{V(R,y2),V(S,y2)} • Why?• Let r be a tuple in R and s be a tuple in S. What’s the probability that r and s agree

on y1?

From the previous reasoning, it’s 1/max{V(R,y1),V(S,y1)}• Similarly, what’s the probability that r and s agree on y2?

It’s 1/max{V(R,y2),V(S,y2)}• Assuming that agreements on y1 and y2 are independent we estimate:

T(R S) =

T(R)*T(S)/[max{V(R,y1),V(S,y1)} * max{V(R,y2),V(S,y2)}]Example:

T(R)=1000, V(R,b)=20, V(R,c)=100T(S)=2000, V(S,d)=50, V(S,e)=50

R(a,b,c) R.b=S.d AND R.c=S.e S(d,e,f)T(R S) = (1000*2000)/(50*100)=400

Page 13: Estimating the Cost of Operations. From l.q.p. to p.q.p Having parsed a query and transformed it into a logical query plan, we must turn the logical plan

• Another example: (one of the previous) R(a,b), T(R)=1000, V(R,b)=20 S(b,c), T(S)=2000, V(S,b)=50, V(S,c)=100 U(c,d), T(U)=5000, V(U,c)=500

• Estimate the size of R S U

• Observe that R S U = (R U) S

• T(R U) = 1000*5000 = 5,000,000

• Note that the number of b’s in the product is 20 (=V(R,b)), and the number of c’s is 500 (=V(U,c)).

• T((R U) S) = 5,000,000 * 2000 / (50 * 500) = 400,000

Page 14: Estimating the Cost of Operations. From l.q.p. to p.q.p Having parsed a query and transformed it into a logical query plan, we must turn the logical plan

Size estimates for other operationsCartesian product:

T(R S) = T(R) * T(S)

Bag Union: sum of sizesSet union:

larger + half the smaller. Why?Because a set union can be as large as the sum of sizes or as small as the larger of the two arguments. Something in the middle is suggested.

Intersection: half the smaller. Why?Because intersection can be as small as 0 or as large as the sizes of the smaller. Something in the middle is suggested.

Difference: T(R-S) = T(R) - 1/2*T(S)Because the result can be between T(R) and T(R)-T(S). Something in the middle is suggested.

Page 15: Estimating the Cost of Operations. From l.q.p. to p.q.p Having parsed a query and transformed it into a logical query plan, we must turn the logical plan

Size estimates for other operationsDuplicate elimination in (R(a1,...,an)):

The size ranges from 1 to T(R).

T((R))= V(R,[a1...an]), if available (but usually not available).

Otherwise: T((R))= min[V(R,a1)*...*V(R,an), 1/2*T(R)] is suggested. Why?

V(R,a1)*...*V(R,an) is the upper limit on the number of distinct tuples that could exist

1/2*T(R) is because the size can be as small as 1 or as big as T(R)

Grouping and Aggregation:

similar to , but only with respect to grouping attributes.

Page 16: Estimating the Cost of Operations. From l.q.p. to p.q.p Having parsed a query and transformed it into a logical query plan, we must turn the logical plan

Computing the statistics• Computation of statistics is triggered automatically or manually.

• T(R)’s, and V(R,A)’s are just aggregation queries (COUNT queries).

• However, they are expensive to be computed.

Page 17: Estimating the Cost of Operations. From l.q.p. to p.q.p Having parsed a query and transformed it into a logical query plan, we must turn the logical plan

Incremental computation of statisticsMaintaining T(R):

• Add 1 for every insertion and subtract 1 for every deletion.

– What’s the problem?

• If there is a B-Tree on any attribute of R, then:

Just keep track of the B-Tree blocks and infer the approximate size of the relation.

• Requires effort only when?

• On B-Tree changes, which is relative rare compared with the rate of insertions and deletions.

Page 18: Estimating the Cost of Operations. From l.q.p. to p.q.p Having parsed a query and transformed it into a logical query plan, we must turn the logical plan

Incremental computation of statisticsMaintaining V(R,A):

• If there is an index on attribute A of a relation R, then:

– On insert into R, we must find the A-value for the new tuple in the index anyway, and so we can determine whether there is already such a value for A. If not increment V(R,A).

– On deletion…

• If there isn’t an index on A, the system could in effect create a rudimentary index by keeping a data structure (e.g. B-Tree) that holds every value of A.

• Final option: Sampling the relation.

Page 19: Estimating the Cost of Operations. From l.q.p. to p.q.p Having parsed a query and transformed it into a logical query plan, we must turn the logical plan

Histograms

Equal width

1-2 3-4 4-5 6-7 8-9Most frequent values

4 7 rest

Advantage: more accurate estimate of the size of a join.

Page 20: Estimating the Cost of Operations. From l.q.p. to p.q.p Having parsed a query and transformed it into a logical query plan, we must turn the logical plan

Example (freq. Values histogram)Estimate U = R(a,b) S(b,c)

V(R,b) = 14. Histogram for R.b: 0:150, 1:200, 5:100, rest: 550

V(S,b) = 13. Histogram for S.b:0:100, 1:80, 2:70, rest: 250

• Tuples in U– on 0:

• 100*150 = 15,000

– on 1: • 200*80 = 16,000

– on 2: • 70 * (550/(14-3)) = 3500

– on 5: • 100 * (250/(13-3)) = 2500

– on the 9 other values: • 9*(550/11)*(250/10) = 9*1250

Total T(U) = 15000 + 16000 + 3500 + 2500 + 9*1250 = 48,250 Simple estimate (equal occurrence assumption) T(U) = 1000*500/14 = 35,714

We have 9 values, because V(S,b)<V(R,b), and by the preservation of value sets

assumption, all the 9 values we didn’t consider yet in S, will be in R as well.

Page 21: Estimating the Cost of Operations. From l.q.p. to p.q.p Having parsed a query and transformed it into a logical query plan, we must turn the logical plan

Example (equal width histogram)• Schemas:

Jan(day,temp)

July(day,temp)

• Query: Find the pairs of days in Jan and Jul that had the same temperature.

SELECT Jan.day, July.day

FROM Jan, July

WHERE Jan.temp=July.temp;

• Size of join of each band is T1*T2/Width

– On band 40-49: 10*5/10 = 5

– On band 50-59: 5*20/10 = 10 size of the result is thus 5+10 = 15

• Without using the histogram we would estimate the size as

– 245*245/100 = 600 !!