Factorizing Complex Predicates in Queries to Exploit Indexes
Prasanna Ganesan*Stanford University
Surajit Chaudhuri Sunita Sarawagi*Microsoft Research IIT Bombay
*Work done at Microsoft Research
Motivation
• Complex, redundant WHERE clauses– Application-generated decision-support queries
• May result in unsatisfactory plans– So many candidate plans, so little time– Conversion to normal form doesn’t work– Often end up with a table scan
• Goal: Techniques for efficiently identifying access paths for such complex WHERE clauses
Outline
• Why is the problem challenging?– CNF or DNF does not avoid redundancy– Plan space is large– Formal problem statement– Challenges
• Factorization– Basic factorization: “largest common conjunctive
factor”– Factorization involving union
• Approximate Factorization• Experiments
Basic Primitives(Index Intersection)
• SELECT addr FROM consumers WHERE (income>100000) AND (zipcode=94305)
Index intersect ()
Seek(income>100000) Seek(zipcode=94305)
Lookup(addr)
Basic Primitives(Index Union)
• SELECT addr FROM consumers WHERE (income>100000) OR (zipcode=94301)
Index union()
Seek(income>100000) Seek(zipcode=94301)
Lookup(addr)
A B
A B
Index Intersection and Union
• (A AND B) OR (A AND C) AB+AC
Seek(A) Seek(B) Seek(A)Seek(C)
Data Lookup
A(B+C)
Index Intersection and Union
• (A OR B) AND (A OR C) (A+B)(A+C)
Seek(A) Seek(B) Seek(A)Seek(C)
Data Lookup
A+BC
The Problem
• Given a relation R, find a plan to retrieve πS(σP(R)) using one or more of– Table scan – Index seeks/scans– Index intersections– Index unions– Data lookup from RID lists
• Focus on single-table selection– Naturally extends to arbitrary queries
Challenges
• Understanding the set of all feasible plans– Many equivalent rewritings– Can we rewrite to retrieve a superset?
• Identifying the “best” plan– Different index characteristics
• Impacts access cost
– Different selectivities• Impacts intersection/union costs as well
Roadmap
Plan Complexity
Expn. Format
Exact
vs A
ppro
ximat
e
DNF
CNF
Intersection +One Union
Arbitrary
Arbitrary
Basic Factorization
AND
B ED D FCAA
ANDAND AND
OR
AND
OR
C
{A,B,C} {A,C} {D,E} {D,F}
{A,C} {D}
{A,C,D}
Basic Factorization(Contd.)
• We now have a conjunctive factor (A AND C AND D)
• Use standard optimizer module to find plan for this factor– Table scan, index seek or index intersection– Typically a greedy algorithm based on index
costs and selectivity
• Evaluate remaining conditions as a filter
Introducing the Union
• If query has conjunctive factors, simple factorization usually suffices
• Many queries don’t have such factors– Need to explore index unions
• Consider plans with at most one union– No index intersection above it – Sufficient for large set of practical queries– Limited space allows optimal algorithms
Single-Union Plans ---(1)
• Assume expression in Disjunctive Normal Form(DNF)
• E.g. E = ABC+ACD+ADG+DGH
• Consider factorizing E as f.Q+R– Find intersection plan for f– Recursively find single-union plan for R– Merge the two plans (re-use R’s union if it
exists)
ACDG
A
Single-Union Plans---(2)
• E=ABC+ACD+ADG+DGH = f.Q+R
• Say f=AC. E=AC(B+D)+ADG+DGH
• Recursively factorize R into DG(A+H) Q
R
Seek(A) Seek(G)Seek(D)
Lookup(Filter E)
Lookup(Filter C(B+D))
Lookup(Filter A+H)
Single-Union Plans ---(3)
• Cost(E)~minE=f.Q+R ( cost(f.Q)+cost(R))
– Natural dynamic-programming formulation – Real equation slightly more complex– Cost is exponential
• Use a greedy alternative– Choose the f that provides greatest cost
reduction without further factorizing R
Other Expression Forms
• Conjunctive Normal Form(CNF)– Can just use one term– Multiply terms E.g. (A+B)(A+C) => A+BC– Recursive algorithm in paper
• General AND-OR trees– Bottom-up algorithm– Applies DNF algorithm to OR nodes and CNF
algorithm to AND nodes
Approximate Factoring
• Often, predicates are similar but not identical– A(X BETWEEN 1 AND 100) + B(X BETWEEN
10 AND 110)– Like to exploit similarity of X predicates
• Relax both X predicates to (X BETWEEN 1 AND 110)– Resulting query is more general (assuming no
NOTty problems)
Challenge
• What predicates do we relax?– Trade-off between factoring benefit and cost
of false positives
• Rule 1 of relaxation is:– We do not talk about Fight Club.
• Find “best” set of range predicates to relax for each attribute– Then select the “best” attribute
Don’t relax irrelevant predicates
Finding predicates to relax
• Given expression with range predicates involving attribute X. – Find which predicates to relax for greatest
plan improvement.
• Turns out a greedy algorithm is optimal for many cost functions– Proof in paper appendix– Useful as a heuristic even otherwise
Key Idea
• Relax a pair of predicates if computed to be beneficial
• Repeat treating the relaxed query as the original query
• Trick is in figuring out when a relaxation is beneficial– Original predicates are treated slightly
differently from relaxed predicates– Details in paper
Experiments
• Experiments on SQL Server 2000
• Factorizing done in stand-alone module– Did I hear someone say SQL is declarative?
• Queries on UCI Machine Learning and UCI KDD data.– Table sizes ~ 1 million rows
• 15 workloads– Mostly DNF queries (#terms:1 to >100)
0
10
20
30
40
50
60
70
80
Annea
lU
Austra
lian
Balanc
e_Sca
le
Breas
t
Breas
t2
Diabet
es
Diabet
es2
Hypot
hyro
id
Lette
r
Mus
hroo
omPim
a
Pima2
Shuttle
Soybe
an_L
arge
Soybe
an_L
arge
2
% R
ed
uct
ion
in R
un
nin
g T
ime
%IncrementalReduction(Approx)%Reduction(Exact)
Reduction in Running Time
Impact of Factorization
0
5
10
15
20
25
30
35
40
45
50
Total ChangedPlans
Changes fromExact factoring
Changes fromApprox. factoring
Index-intersection
plans
IIU plans
% o
f Q
ue
rie
s
Related Work
• Optimization of complex WHERE clause– Convert to CNF/DNF [Selinger79, Dayal87]– Using multiple indexes [Mohan90]
• No factorization
– Using smarter indexes [Leslie95]
• Factorization a popular idea in other domains– Compilers [Reinwald66], VLSI Design
[Brayton87]