© ETH Zürich
Eric Lo
ETH Zurich
A joint work with Carsten Binnig (U of Heidelberg), Donald Kossmann (ETH Zurich),
Tamer Ozsu (U of Waterloo) and Peter Haas (IBM Almaden Research Center)
Symbolic Query Processing
Symbolic Query Processing
Treat all data as symbols (think of variables)
E.g., a1 represents any value in the domain of
attribute a
Tables R and S are called symbolic relations
Background – Symbolic Execution 1/3
Borrow the concept from symbolic execution
A well-known program verification technique
Represent values of program variables with
symbolic values instead of concrete data
Manipulate expressions based on those
symbolic values
Background – Symbolic Execution 2/3
1. minsalary = read_input();
2. bensalary = minsalary + 2000;
3. if (bensalary < 80000)
4. output “no kidding!”;
5. else
6. output “that’s right”;
Find a test case for path 1, 2, 3, 6
Symbolic execution – start:
1. minsalary = ben
2. bensalary = ben + 2000
3. bensalary = ben + 2000
and !(bensalary < 80000)   (*)
Symbolic execution – end
Instantiate (*):
ben = 90000 (expected input)
“that’s right” (expected output)
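The walk-through above can be sketched in Python. This is only an illustration under stated assumptions: a real symbolic executor would hand the path condition to a constraint solver, whereas this sketch simply tries candidate inputs. The function names and the candidate range are hypothetical and not part of the talk.

```python
# Hypothetical sketch: names and the candidate range are illustrative only.

def path_condition(ben):
    """Path 1, 2, 3, 6: the branch at line 3 must evaluate to false."""
    bensalary = ben + 2000           # symbolic state after line 2
    return not (bensalary < 80000)   # negated branch condition of line 3

def instantiate(candidates):
    """Return the first candidate input that satisfies the path condition."""
    for ben in candidates:
        if path_condition(ben):
            return ben
    return None

def program(minsalary):
    """The concrete program from the slide, returning its output string."""
    bensalary = minsalary + 2000
    if bensalary < 80000:
        return "no kidding!"
    return "that's right"

# Try inputs 0, 10000, 20000, ... and keep the first one that fits the path.
test_input = instantiate(range(0, 200001, 10000))
test_output = program(test_input)
```

Any input that satisfies the path condition is a valid test case, so the value found here does not have to match the 90000 used on the slide.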
Background – Symbolic Execution 3/3
Has been researched for more than 20 years
Still has many limitations, e.g., it cannot handle highly complex software
However, many large software vendors (e.g., Microsoft Research) still put
hope in this technique for program verification
No progress on database applications,
which involve an external database and SQL
SQP Applications
Extend program verification and symbolic
execution techniques to support database
applications
For DBMS testing (the focus of today)
Symbolic Query Processing
A query manipulates data according
to different needs
R ⋈b=c S
Want the join result to have one
tuple? Set c1 = b1
Want the join result to have four tuples
with a Zipf distribution (t1 joins more, t2 joins
less)?
DBMS Testing
To test a DBMS, we generate a lot of test
databases and execute a lot of test queries
DBMS vendors are looking for a way to control
the intermediate results of a test query such that
we can test an individual component of a DBMS
under a particular test case
DBMS Testing Example
Test the accuracy of the cardinality
estimation component of a query
optimizer under:
a multi-way hash join query
a two-way join query with aggregation
Possible if we can make sure that executing the
test query on the test database gives
the expected answer
DBMS Testing
The test query is given
Physical join ordering can be
fixed (by testers)
Evaluation algorithm (e.g.,
using hash-join) can be fixed
too
However, the size of the
intermediate results cannot be
fixed easily
DBMS Testing Problem
Guarantee that executing a test query on a test
database yields the desired intermediate query
results (e.g., output cardinality, data distribution)
DBMS Testing Problem
A test case T is:
a parametric query Qp
with a set of constraints C on each
intermediate result
A good test database D means Qp(D) satisfies C
- if the set of parameters p is properly
instantiated
D covers test case T
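The coverage definition above can be illustrated with a small sketch. It assumes a hypothetical one-column table R, a single selection as the parametric query, and a constraint on the output cardinality only (with one operator, the only intermediate result is the final one). The helper name find_parameter and all data values are assumptions made for illustration.

```python
import sqlite3

def find_parameter(conn, query, candidates, wanted_cardinality):
    """Return a parameter p such that Qp(D) has the wanted cardinality, else None."""
    for p in candidates:
        cardinality = len(conn.execute(query, (p,)).fetchall())
        if cardinality == wanted_cardinality:
            return p
    return None

# Hypothetical test database D with a single table R(a).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE R (a INTEGER)")
conn.executemany("INSERT INTO R VALUES (?)", [(10,), (20,), (30,)])

# Parametric query Qp: the parameter p is bound at execution time.
query = "SELECT a FROM R WHERE a > ?"

# D covers a test case asking for 1 output tuple if some p exists.
p_found = find_parameter(conn, query, range(0, 40, 5), 1)
# No p can yield 5 tuples from a 3-tuple table, so D does not cover that case.
p_missing = find_parameter(conn, query, range(0, 40, 5), 5)
```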
Trial-and-error
Generate databases 3, 2, and 1 using traditional database generators
such as the IBM Test DB generator, the MSR
DB generator, etc.
Search for parameters
T2 is never covered
The database generation process
does not care about the test queries
Latest approach – Finding query parameters
MSR realized this problem [TKDE06]
Given the test database D + the test query Qp,
search parameter values for p such that Qp(D)
(almost) fits the cardinality requirements
defined on the test case
It is an NP-hard problem
Same as the previous approach, T2 is never
covered
QAGen – Query Aware test database Generator
Based on symbolic query
processing
We can control the output size
of each intermediate query
result (and even more)
QAGen overview – Query Analyzer
Analyze the query and assign
knobs to its operators
A knob is a parameter of an
operator to control the output
(e.g., output cardinality,
distribution)
A knob for an operator is not
always available for tuning
QAGen overview – Query Analyzer
A knob for an operator is not always available for tuning
join distribution? Yes
join distribution? No
QAGen overview – Query Analyzer
The available knob(s) for an operator depend on its input characteristics
Definition: pre-grouping data
Definition: non pre-grouping data
Symbolic Query Engine and Symbolic Database (SDB)
An SQL operator can:
Add predicates to a symbol
Replace a symbol with another
symbol (e.g., joining)
E.g., SELECT a FROM R
WHERE a > p;
[Figure: σa>p with 1 output; symbols are annotated with >p or <=p]
Symbolic Query Engine and Symbolic Database (SDB)
How to physically store the
symbolic data?
Options:
Implement a native symbolic database
Use a relational database
- How to represent “a1 > p”?
- Store all predicates that are associated
with a symbol s in a separate relation called
PTable
PTable:
s     Pred.
a1    a1 > p
a2    a2 <= p
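A minimal sketch of the relational option, using SQLite as a stand-in for the relational database. The column names symbol and predicate, and the helper predicates_for, are assumptions; the slides only name the relation PTable and its columns s and Pred.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Symbolic relation R: the attribute stores symbol names, not concrete values.
conn.execute("CREATE TABLE R (a TEXT)")
conn.executemany("INSERT INTO R VALUES (?)", [("a1",), ("a2",)])

# PTable: one row per (symbol, predicate) pair, kept as plain text.
conn.execute("CREATE TABLE PTable (symbol TEXT, predicate TEXT)")
conn.executemany(
    "INSERT INTO PTable VALUES (?, ?)",
    [("a1", "a1 > p"), ("a2", "a2 <= p")],
)

def predicates_for(conn, symbol):
    """Return all predicates recorded for one symbol, in insertion order."""
    rows = conn.execute(
        "SELECT predicate FROM PTable WHERE symbol = ? ORDER BY rowid",
        (symbol,),
    )
    return [row[0] for row in rows]

a1_predicates = predicates_for(conn, "a1")
a2_predicates = predicates_for(conn, "a2")
```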
Data Instantiation
• Data instantiator uses a constraint solver:
• Input: a (propositional) constraint (e.g., A + B > 50)
• Output: any concrete values that satisfy the constraint (e.g., A=99, B=12)
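To make the solver interface concrete, here is a toy stand-in that enumerates a small integer domain. It is not the solver used by QAGen; the function solve, its signature, and the domain 0..99 are assumptions chosen for illustration.

```python
from itertools import product

def solve(constraint, variables, domain):
    """Return the first assignment over `domain` that satisfies `constraint`, else None."""
    for values in product(domain, repeat=len(variables)):
        assignment = dict(zip(variables, values))
        if constraint(**assignment):
            return assignment
    return None

# Input: the constraint A + B > 50. Output: any concrete values that satisfy it.
solution = solve(lambda A, B: A + B > 50, ["A", "B"], range(0, 100))

# A contradicting constraint has no instantiation.
no_solution = solve(lambda A: A > 5 and A < 3, ["A"], range(0, 10))
```

Any satisfying assignment is acceptable, so the values returned here may differ from the A=99, B=12 shown on the slide.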
Symbolic Query Engine
Iterator-based: open(), getNext(), close()
Assumes no naughty users, i.e., no contradicting knob values
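The iterator interface can be sketched as follows for a symbolic selection with an output-cardinality knob. The class names, the Python-style get_next (standing in for getNext), and the choice to let the first tuples pass the predicate are all assumptions made for this illustration.

```python
class SymbolScan:
    """Leaf operator: returns the symbols of a symbolic relation one by one."""

    def __init__(self, symbols):
        self.symbols = list(symbols)
        self.position = 0

    def open(self):
        self.position = 0

    def get_next(self):
        if self.position >= len(self.symbols):
            return None  # end of input
        symbol = self.symbols[self.position]
        self.position += 1
        return symbol

    def close(self):
        self.position = 0


class SymbolicSelect:
    """Selection a > p whose knob is the desired output cardinality."""

    def __init__(self, child, cardinality, ptable):
        self.child = child
        self.cardinality = cardinality
        self.ptable = ptable  # list of (symbol, predicate) pairs
        self.emitted = 0

    def open(self):
        self.child.open()
        self.emitted = 0

    def get_next(self):
        while True:
            symbol = self.child.get_next()
            if symbol is None:
                return None
            if self.emitted < self.cardinality:
                # This symbol must pass the selection.
                self.ptable.append((symbol, f"{symbol} > p"))
                self.emitted += 1
                return symbol
            # Knob already satisfied: this symbol must fail the selection.
            self.ptable.append((symbol, f"{symbol} <= p"))

    def close(self):
        self.child.close()


ptable = []
select = SymbolicSelect(SymbolScan(["a1", "a2", "a3"]), cardinality=2, ptable=ptable)
select.open()
results = []
while (symbol := select.get_next()) is not None:
    results.append(symbol)
select.close()
```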
SQP – ⋈ operator (with FK constraint)
When the input of the join is
pre-grouped, the world has
changed
It sometimes happens, e.g., in a 2-way join: base tables A, B and C with
foreign key relationships between A and B and between B and C
SQP – ⋈ operator (with FK constraint)
Join distribution is not supported (the knob is disabled by the
analyzer)
Controlling the output cardinality is a subset-sum problem
(weakly NP-hard)
Subset-sum has a
pseudo-polynomial time exact
solution using
dynamic programming
SQP – ⋈ operator (with FK constraint)
Blocking
During open():
Materialize table S in a temporary relation
SELECT COUNT(k)
FROM S
GROUP BY k
Solve the subset-sum problem
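The subset-sum step can be sketched with the standard pseudo-polynomial dynamic program. The sketch assumes the goal is to pick groups of S whose sizes (the counts returned by the GROUP BY query) add up to the requested output cardinality; the function name and the sample group sizes are hypothetical.

```python
def subset_sum(sizes, target):
    """Return indices of groups whose sizes sum to `target`, or None if impossible.

    Runs in O(len(sizes) * target) time, i.e. pseudo-polynomial.
    """
    # parent[s] = (previous sum, index of the group added to reach s)
    parent = {0: None}
    for index, size in enumerate(sizes):
        # Iterate over a snapshot so each group is used at most once.
        for reached in sorted(parent, reverse=True):
            new_sum = reached + size
            if new_sum <= target and new_sum not in parent:
                parent[new_sum] = (reached, index)
    if target not in parent:
        return None
    chosen = []
    current = target
    while parent[current] is not None:
        current, index = parent[current]
        chosen.append(index)
    return sorted(chosen)

# Hypothetical group sizes, as SELECT COUNT(k) ... GROUP BY k might return them.
group_sizes = [3, 1, 4, 2]
chosen_groups = subset_sum(group_sizes, 6)
impossible = subset_sum([3, 4], 5)
```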
SQP – χ operator
Action 1: Aggregation attribute replacement
• o_date3 → o_date1
• o_date4 → o_date2
2nd output group (o_date2)
1st output group (o_date1)
SQP – χ operator
Action 2 (base case version):
- Adding aggregation constraints to PTable, base case:
<l_price1, aggsum1 = l_price1 + l_price2 + l_price3 + l_price4 + l_price7>
<l_price2, aggsum1 = l_price1 + l_price2 + l_price3 + l_price4 + l_price7>
<l_price3, aggsum1 = ‘’>
<l_price4, aggsum1 = ‘’>
<l_price7, aggsum1 = l_price1 + l_price2 + l_price3 + l_price4 + l_price7>
<l_price5, aggsum2 = l_price5 + l_price6 + l_price8>
<l_price6, aggsum2 = l_price5 + l_price6 + l_price8>
<l_price8, aggsum2 = ‘’>
SQP – χ operator
Action 2 (optimized version):
- The cost of a constraint solver call is exponential in the size of the predicates
- Adding 2 aggregation constraints to PTable:
<l_price1, aggsum1 = l_price1 x 5>
<l_price5, aggsum2 = l_price5 x 3>
and do l_price replacement
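The optimized version can be sketched as follows. The sketch assumes that the first symbol of each group is chosen as its representative, which matches the two constraints on the slide but is not stated there; the function name and data layout are hypothetical.

```python
def optimized_sum_constraints(groups):
    """Build one short SUM constraint per group plus a symbol replacement map."""
    constraints = []
    replacement = {}
    for number, members in enumerate(groups, start=1):
        representative = members[0]  # assumption: first symbol represents the group
        for member in members:
            replacement[member] = representative
        constraints.append(
            (representative, f"aggsum{number} = {representative} x {len(members)}")
        )
    return constraints, replacement

# The two aggregation groups from the base case version.
groups = [
    ["l_price1", "l_price2", "l_price3", "l_price4", "l_price7"],
    ["l_price5", "l_price6", "l_price8"],
]
constraints, replacement = optimized_sum_constraints(groups)
```

Each constraint now mentions a single symbol, so the predicates handed to the solver stay small regardless of the group size.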
Data Instantiation
Use a constraint solver to instantiate the symbolic
database:
for each symbolic relation r
  for each tuple t
    for each symbol s
      load the related predicates P
      instantiate P
      cache P
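The loop above might look like this in Python. Several things are assumptions for the sake of a runnable sketch: predicates are modelled as Python callables, the parameter p is fixed to 50, the solver is a toy enumeration over 0..99, and the cache stores the instantiated value per symbol so that a symbol appearing in several tuples gets the same value.

```python
def solve(predicates, domain=range(0, 100)):
    """Toy solver: return the first value in `domain` satisfying all predicates."""
    for value in domain:
        if all(predicate(value) for predicate in predicates):
            return value
    raise ValueError("predicates are unsatisfiable over the domain")

def instantiate_database(symbolic_db, ptable):
    """Replace every symbol in every tuple of every relation by a concrete value."""
    cache = {}      # symbol -> concrete value already computed
    concrete_db = {}
    for name, relation in symbolic_db.items():        # for each symbolic relation r
        rows = []
        for symbolic_tuple in relation:               # for each tuple t
            row = []
            for symbol in symbolic_tuple:             # for each symbol s
                if symbol not in cache:
                    predicates = ptable.get(symbol, [])   # load the predicates P
                    cache[symbol] = solve(predicates)     # instantiate and cache
                row.append(cache[symbol])
            rows.append(tuple(row))
        concrete_db[name] = rows
    return concrete_db

P = 50  # assumed concrete value for the query parameter p
symbolic_db = {"R": [("a1",), ("a2",)]}
ptable = {
    "a1": [lambda v: v > P],    # a1 > p
    "a2": [lambda v: v <= P],   # a2 <= p
}
concrete_db = instantiate_database(symbolic_db, ptable)
```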
Experiment 1 – Operator Performance
Study the performance (and scalability) of:
Individual operators during SQP
The data instantiation phase
Use TPC-dbgen to generate 3 TPCH-DBs: 10M, 100M, 1G
Run Q8(TPCH-DB) to collect the intermediate results
R for each operator
QAGen(Q8, R) produces a Q8 query-aware database
Experiment 2 – Effects of knob values
Use TPCH Q8
6 sets of knob values:
TPCH-Uniform, TPCH-Zipf
Min-Uniform, Min-Zipf
Max-Uniform, Max-Zipf
Related Work, Future Work, Conclusions
Reverse Query Processing (ICDE07): given the result R and the query Q, reversely process Q
to generate D; useful for functional testing of database applications, view maintenance, debugging SQL
Multiple SQL statements (to ACM TSE journal)
Current approach 2 – Stochastically generate many test queries
Based on a given test database,
RAGS/QGen generates many
valid SQL queries to test the
system
No guarantee that T1 can be
covered
Same as the previous approach,
T2 is never covered