dbms 2001notes 6: query compilation1 principles of database management systems 6: query compilation...
TRANSCRIPT
DBMS 2001 Notes 6: Query Compilation 1
Principles of Database Management Systems
6: Query Compilation and Optimization
Pekka Kilpeläinen(partially based on Stanford CS245 slide originals by Hector Garcia-Molina, Jeff
Ullman and Jennifer Widom)
DBMS 2001 Notes 6: Query Compilation 2
Overview• We have studied recently:
– Algorithms for selections and joins, and their costs (in terms of disk I/O)
• Next: A closer look at query compilation– Parsing– Algebraic optimization of logical query plans– Estimating sizes of intermediate results
• Focus still on selections and joins
• Remember the overall process of query execution:
DBMS 2001 Notes 6: Query Compilation 3
{P1,P2,…..}
{P1,C1>...}
parse
convert
apply laws
estimate result sizes
consider physical plans estimate costs
pick best
execute
Pi
answer
SQL query
parse tree
logical query plan
“improved” l.q.p
l.q.p. +sizesstatistics
DBMS 2001 Notes 6: Query Compilation 4
Step 1: Parsing• Check syntactic correctness of the query,
and transform it into a parse tree– Based on the formal grammar for SQL– Inner nodes nonterminal symbols
(syntactic categories for things like <Query>, <RelName>, or <Condition>)
– Leaves terminal symbols: names of relations or attributes, keywords (SELECT, FROM, …), operators (+, AND, OR, LIKE, …), operands (10, '%1960', …)
– Also check semantic correctness: Relations and attributes exist, operands compatible with operators, …
DBMS 2001 Notes 6: Query Compilation 5
Example: SQL queryConsider querying a movie database with relations
StarsIn(title, year, starName)MovieStar(name, address, gender, birthdate)
SELECT titleFROM StarsIn, MovieStarWHERE starName = name AND birthdate LIKE
‘%1960’ ;
(Find the titles for movies with stars born in 1960)
DBMS 2001 Notes 6: Query Compilation 6
Example: Parse Tree<Query>
<SFW>
SELECT <SelList> FROM <FromList> WHERE <Condition>
<Attribute> <RelName> , <FromList> AND
title StarsIn <RelName>
<Condition> <Condition>
<Attribute> = <Attribute> <Attribute> LIKE <Pattern>
starName name birthdate ‘%1960’
MovieStar
DBMS 2001 Notes 6: Query Compilation 7
Step 2: Parse Tree Logical Query
Plan• Basic strategy:
SELECT A, B, CFROM R1, R2WHERE Cond ;
becomes
A,B,C[Cond(R1 x R2)]
DBMS 2001 Notes 6: Query Compilation 8
Example: Logical Query Plan
title
starName=name and birthdate LIKE ‘%1960’
StarsIn MovieStar
DBMS 2001 Notes 6: Query Compilation 9
Step 3: Improving the L.Q.P
• Transform the logical query plan into an equivalent form expected to lead to better performance– Based on laws of relational algebra– Normal to produce a single optimized
form, which acts as input for the generation of physical query plans
DBMS 2001 Notes 6: Query Compilation 10
Example: Improved Logical Query Plan
title
starName=name
StarsIn
birthdate LIKE ‘%1960’
MovieStar
DBMS 2001 Notes 6: Query Compilation 11
Step 4: Estimate Result Sizes
• Cost estimates of database algorithms depend on sizes of input relations Need to estimate sizes of intermediate results– Estimates based on statistics about
relations, gathered periodically or incrementally
DBMS 2001 Notes 6: Query Compilation 12
Example: Estimate Result Sizes
Need expected size
StarsIn
MovieStar
DBMS 2001 Notes 6: Query Compilation 13
Steps 5, 6, ...
• Generate and compare query plans– generate alternate query plans P1, …, Pk
by selecting algorithms and execution orders for relational operations
– Estimate the cost of each plan
– Choose the plan Pi estimated to be "best"
• Execute plan Pi and return its result
DBMS 2001 Notes 6: Query Compilation 14
Example: One Physical Plan
Parameters: join order,
memory size, project attributes,...
Hash join
SEQ scan index scan Parameters:Select Condition,...
StarsIn MovieStar
DBMS 2001 Notes 6: Query Compilation 15
Next: Closer look at ...
• Transformation rules
• Estimating result sizes
DBMS 2001 Notes 6: Query Compilation 16
Relational algebra optimization
• Transformation rules(preserve equivalence)
• What are good transformations?
DBMS 2001 Notes 6: Query Compilation 17
Rules: Natural joins
R S = S R (commutative)
(R S) T = R (S T) (associative) • Carry attribute names in results, so
order is not important Can evaluate in any order
DBMS 2001 Notes 6: Query Compilation 18
(R x S) x T = R x (S x T)R x S = S x R
R U (S U T) = (R U S) U TR U S = S U R
Rules: Cross products & union similarly (both associative & commutative):
DBMS 2001 Notes 6: Query Compilation 19
Rules: Selects
p1p2(R) =
p1vp2(R) =
p1 [ p2 (R)]
[ p1 (R)] U [ p2 (R)]
Especially useful (applied left-to-right): Allows compound select conditions to be split and moved to suitable positions (See next)
NB: = AND; v = OR
DBMS 2001 Notes 6: Query Compilation 20
Rules: Products to JoinsDefinition of Natural Join:
R S = L[C(R x S)] ; Condition C equates attributes common to R
and S, and L projects one copy of them out
Applied right-to-left- definition of general join applied
similarly
DBMS 2001 Notes 6: Query Compilation 21
Let p = predicate with only R attribs q = predicate with only S attribs m = predicate with attribs common to R,S
p (R S) =
q (R S) =
Rules: combined
[p (R)] S
R [q (S)]
More rules can be derived...
DBMS 2001 Notes 6: Query Compilation 22
pq (R S) = [p (R)] [q (S)]
pqm (R S) =
m [(p R) (q S)]pvq (R S) =
[(p R) S] U [R (q S)]
Derived rules for
DBMS 2001 Notes 6: Query Compilation 23
Derivation for first one; Others for homework:
pq (R S) =
p [q (R S) ] =
p [(R q (S) ] =
[p (R)] [q (S)]
DBMS 2001 Notes 6: Query Compilation 24
Rules for combined with X
similar...
e.g., p (R X S) = ?
DBMS 2001 Notes 6: Query Compilation 25
p1p2 (R) p1 [p2 (R)]
p (R S) [p (R)] S
R S S R
Some “good” transformations:
• No transformation is always good• Usually good: early selections
DBMS 2001 Notes 6: Query Compilation 26
Outline - Query Processing
• Relational algebra level– transformations– good transformations
• Detailed query plan level– estimate costs– generate and compare plans
next
DBMS 2001 Notes 6: Query Compilation 27
• Estimating cost of query plan
(1) Estimating size of intermediate results
(2) Estimating # of I/Os
(considered last week)
DBMS 2001 Notes 6: Query Compilation 28
Estimating result size
• Maintain statistics for relation R– T(R) : # tuples in R– L(R) : Length of rows,
# of bytes in each R tuple– B(R): # of blocks to hold all R tuples– V(R, A) :
# distinct values in R for attribute A
DBMS 2001 Notes 6: Query Compilation 29
Example R A: 20 byte string
B: 4 byte integerC: 8 byte dateD: 5 byte string
A B C D
cat 1 10 a
cat 1 20 b
dog 1 30 a
dog 1 40 c
bat 1 50 d
T(R) = 5 L(R) = 37V(R,A) = 3 V(R,C) = 5V(R,B) = 1 V(R,D) = 4
DBMS 2001 Notes 6: Query Compilation 30
Size estimates for product W = R1xR2
T(W) =
L(W) =
T(R1) T(R2)
L(R1) + L(R2)
DBMS 2001 Notes 6: Query Compilation 31
L(W) = L(R)
T(W) = ?
Estimates for selection W =
A=a(R)
= AVG number of tuples that satisfy an equality condition on R.A
= …
DBMS 2001 Notes 6: Query Compilation 32
Example R V(R,A)=3
V(R,B)=1V(R,C)=5V(R,D)=4
AVG size of, say, A=val(R)? T(A=cat(R))=2, T(A=dog(R))=2, T(A=bat(R))=1
=> (2+2+1)/3 = 5/3
A B C D
cat 1 10 a
cat 1 20 b
dog 1 30 a
dog 1 40 c
bat 1 50 d
DBMS 2001 Notes 6: Query Compilation 33
Size of W = Z=a(R) in general:
Assume: Only existing Z values a1, a2, … are used in a select expression Z= ai, each with equal probalility 1/V(R,Z)
=> the expected size of Z=val(R) is
E(T(W)) = i 1/V(R,Z) * T(Z=ai (R))
= T(R)/V(R,Z)
DBMS 2001 Notes 6: Query Compilation 34
What about selection W=z val (R) ?
T(W) = ?
• Estimate 1: T(W) = T(R)/2– Rationale: On avg, 1/2 of tuples satisfy the
condition
• Estimate 2: T(W) = T(R)/3– Rationale: Acknowledges the tendency of selecting
"interesting" (e.g., rare tuples) more frequently
<, and < similary
DBMS 2001 Notes 6: Query Compilation 35
Size estimates for W = R1 R2
Let X = attributes of R1 Y = attributes of R2
X Y =
Same as R1 x R2
Case 1
DBMS 2001 Notes 6: Query Compilation 36
W = R1 R2 X Y = AR1 A B C R2 A D
Case 2
Assumption:
V(R1,A) V(R2,A) A(R1) A(R2)
V(R2,A) V(R1,A) A(R2) A(R1)
“Containment of value sets” [Sec. 7.4.4]
DBMS 2001 Notes 6: Query Compilation 37
Why should “containment of value sets” hold?
Consider joining relationsFaculties(FName, …) andDepts(DName, FName, …)
where Depts.FName is a foreign key, and faculties can have 0,…,n departments.
Now V(Depts, FName) V(Faculties, FName), and referential integrity requires that
FName(Depts) FName(Faculties)
DBMS 2001 Notes 6: Query Compilation 38
R1 A B C R2 A D
a
Estimating T(W) when V(R1,A) V(R2,A)
Take 1 tuple Match
Each estimated to match with T(R2)/V(R2,A) tuples ...
so T(W) = T(R1)T(R2) /V(R2, A)
DBMS 2001 Notes 6: Query Compilation 39
• V(R1,A) V(R2,A): T(W) = T(R1)T(R2)/V(R2,A)
• V(R2,A) V(R1,A): T(W) = T(R2)T(R1)/V(R1,A)
[A is the join attribute]=> in general: T(W) = T(R1)T(R2)/max{V(R1,A), V(R2,A)
}
DBMS 2001 Notes 6: Query Compilation 40
With similar ideas, can estimate sizes of:• W = AB (R) ….. [Sec. 7.4.2]
• W =A=aB=b (R) = A=a (B=b (R));Ass. A and B independent =>
T(W) = T(R)/(V(R, A) x V(R, B))
• W=R(A,X,Y) S(X,Y,B); [Sec. 7.4.5]
Ass. X and Y independent => … T(W) = T(R)T(S)/(max{V(R,X), V(S,X)} x
max{V(R,Y), V(S,Y)})
• Union, intersection, diff, …. [Sec. 7.4.7]
DBMS 2001 Notes 6: Query Compilation 41
Note: for complex expressions, need
intermediate estimates.
E.g. W = [A=a (R1) ] R2
Treat as relation U
T(U) = T(R1)/V(R1,A) L(U) = L(R1)
Also need an estimate for V(U, Ai) !
DBMS 2001 Notes 6: Query Compilation 42
To estimate V (U, Ai)
E.g., U = A=a (R) Say R has attributes A,B,C,D
V(U, A) = ?V(U, B) = ?V(U, C) = ?V(U, D) = ?
DBMS 2001 Notes 6: Query Compilation 43
Example R V(R,A)=3
V(R,B)=1V(R,C)=T(R)=5V(R,D)=3
U = A=x (R)
A B C D
cat 1 10 10
cat 1 20 20
dog 1 30 10
dog 1 40 30
bat 1 50 10
V(U, D) = V(R,D)/T(R,A) ... V(R,D)
V(U,A) =1 V(U,B) =1 V(U,C) = T(R) V(R,A)
DBMS 2001 Notes 6: Query Compilation 44
V(U,Ai) for Joins U = R1(A,B)
R2(A,C)
V(U,A) = min { V(R1, A), V(R2, A) }V(U,B) = V(R1, B)V(U,C) = V(R2, C)
[Assumption called “preservation of value sets”, sect. 7.4.4]
Values of non-join attributes preserved
DBMS 2001 Notes 6: Query Compilation 45
Example:
Z = R1(A,B) R2(B,C) R3(C,D)
• R1: T(R1) = 1000 V(R1,A)=50 V(R1,B)=100
• R2: T(R2) = 2000 V(R2,B)=200 V(R2,C)=300
• R3: T(R3) = 3000 V(R3,C)=90 V(R3,D)=500
DBMS 2001 Notes 6: Query Compilation 46
T(U) = T(R1) T(R2)/max{V(R1,B), V(R2,B)}= 10002000/200 = 10,000;
V(U,A) = V(R1,A) = 50V(U,C) = V(R2,C) = 300V(U,B) = min{V(R1,B), V(R2,B)}= 100
Intermediate Result: U(A,B,C)= R1(A,B)
R2(B,C)
DBMS 2001 Notes 6: Query Compilation 47
Z = U(A,B,C) R3(C,D)
T(Z) = T(U)T(R3)/ max{V(U,C), V(R3,C)}
= 10,0003000/300= 100,000
V(Z,A) = V(U,A) = 50V(Z,B) = V(U,B) = 100V(Z,C) = min{V(U,C), V(R3,C)} = 90V(Z,D) = V(R3,D) =500
DBMS 2001 Notes 6: Query Compilation 48
Outline/Summary
• Estimating cost of query plan– Estimating size of results done!– Estimating # of IOs last week
• Generate and compare plans– skip this
• Execute physical operations of query plan– Sketched last week