© ETH Zürich

Symbolic Query Processing

Eric Lo, ETH Zurich

Joint work with Carsten Binnig (U of Heidelberg), Donald Kossmann (ETH Zurich), Tamer Ozsu (U of Waterloo) and Peter Haas (IBM Almaden Research Center)

Post on 03-Jan-2016

Symbolic Query Processing

Treat all data as symbols (think of variables)

E.g., a1 represents any value in the domain of attribute a

Tables R and S are called symbolic relations

Background – Symbolic Execution 1/3

Borrow the concept from symbolic execution

A well-known program verification technique

Represent values of program variables with symbolic values instead of concrete data

Manipulate expressions based on those symbolic values

Background – Symbolic Execution 2/3

1. minsalary = read_input();

2. bensalary = minsalary + 2000;

3. if (bensalary < 80000)

4.     output "no kidding!";

5. else

6.     output "that's right";

Find a test case for path 1-2-3-6

Symbolic execution – start:

1. minsalary = ben

2. bensalary = ben + 2000

3. bensalary = ben + 2000 and !(bensalary < 80000)   (path condition)

Symbolic execution – end

Instantiate the path condition:

ben = 90000 is the expected input

"that's right" is the expected output
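The walk-through above can be sketched in Python. This is a toy illustration under stated assumptions, not a real symbolic executor: the `Linear` class, the function names, and the brute-force candidate search are all made up for this example, and any input that satisfies the path condition (such as 90000 on the slide) is an equally valid test case.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Linear:
    """Hypothetical symbolic value of the form ben + offset."""
    offset: int = 0

    def __add__(self, k: int) -> "Linear":
        return Linear(self.offset + k)

    def evaluate(self, ben: int) -> int:
        return ben + self.offset


def symbolic_path_1236():
    """Symbolically execute statements 1, 2, 3 and take the else branch."""
    minsalary = Linear()          # 1. minsalary = ben
    bensalary = minsalary + 2000  # 2. bensalary = ben + 2000

    # 3. Path condition for reaching statement 6: !(bensalary < 80000)
    def path_condition(ben: int) -> bool:
        return not (bensalary.evaluate(ben) < 80000)

    return path_condition


def instantiate(condition, candidates):
    """Return the first candidate input that satisfies the path condition."""
    for ben in candidates:
        if condition(ben):
            return ben
    return None


def program(minsalary: int) -> str:
    """The concrete program from the slide."""
    bensalary = minsalary + 2000
    if bensalary < 80000:
        return "no kidding!"
    return "that's right"


cond = symbolic_path_1236()
ben = instantiate(cond, range(0, 200001, 10000))
print(ben, program(ben))
```

Running the concrete program on the instantiated input confirms that the chosen path is actually taken.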

Background – Symbolic Execution 3/3

Has been researched for more than 20 years

Still has many limitations, e.g., it cannot handle highly complex software

However, many large software vendors (e.g., Microsoft Research) still put their hopes in this technique for program verification

No progress on database applications, which involve an external database and SQL

SQP Applications

Extend program verification and symbolic execution techniques to support database applications

For DBMS testing (the focus of today)

Symbolic Query Processing

A query manipulates data according to different needs

R ⋈ b=c S

Want the join result to have one tuple? Set c1 = b1

Want the join result to have four tuples with a Zipf distribution (t1 joins more, t2 joins less)?
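One way to picture how a symbolic join can hit a requested size and skew is the sketch below. It assumes symbols are plain strings and that "t_i joins n times" is expressed as a fan-out list; `symbolic_join` and its arguments are hypothetical names, not part of the talk.

```python
def symbolic_join(r_keys, s_keys, fanout):
    """Map S's join-key symbols onto R's so that R tuple i joins fanout[i] S tuples.

    Returns a replacement map {s_symbol: r_symbol}. S symbols that are not
    in the map stay unbound in this sketch.
    """
    if len(fanout) > len(r_keys):
        raise ValueError("more fan-out entries than R tuples")
    if sum(fanout) > len(s_keys):
        raise ValueError("not enough S tuples for the requested join size")
    replacement = {}
    remaining = iter(s_keys)
    for r_key, count in zip(r_keys, fanout):
        for _ in range(count):
            replacement[next(remaining)] = r_key
    return replacement


# One result tuple: set c1 = b1.
one = symbolic_join(["b1", "b2"], ["c1", "c2", "c3", "c4"], [1])
# Four result tuples, skewed: t1 joins three times, t2 joins once.
skewed = symbolic_join(["b1", "b2"], ["c1", "c2", "c3", "c4"], [3, 1])
print(one)
print(skewed)
```

The size of the join result is simply the number of entries in the replacement map, which is what makes the cardinality controllable.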

DBMS Testing

To test a DBMS, we generate a lot of test databases and execute a lot of test queries

DBMS vendors are looking for a way to control the intermediate results of a test query such that we can test an individual component of a DBMS under a particular test case

DBMS Testing Example

Test the accuracy of the cardinality estimation component of a query optimizer under

a multi-way hash join query

a two-way join query with aggregation

This is possible if we can make sure that executing the test query on the test database gives the expected answer

DBMS Testing

The test query is given

The physical join ordering can be fixed (by testers)

The evaluation algorithm (e.g., hash join) can be fixed too

However, the size of the intermediate results cannot be fixed easily

DBMS Testing Problem

Guarantee that executing a test query on a test database yields the desired intermediate query results (e.g., output cardinality, data distribution)

DBMS Testing Problem

A test case T is:

a parametric query Qp

with a set of constraints C on each intermediate result

A good test database D means that Qp(D) satisfies C

- if the set of parameters p is properly instantiated

D covers test case T

Trial-and-error

Generate databases 1, 2, and 3 using traditional database generators such as the IBM Test DB generator, the MSR DB generator, etc.

Search for parameters

T2 is never covered

The database generation process does not care about the test queries

Latest approach – Finding query parameters

MSR realized this problem [TKDE06]

Given the test database D and the test query Qp, search for parameter values for p such that Qp(D) (almost) fits the cardinality requirements defined in the test case

It is an NP-hard problem

As with the previous approach, T2 is never covered
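The parameter-search idea can be illustrated with a naive sketch. It is a brute-force scan over candidate values on a made-up table `R(a)` using Python's built-in `sqlite3`, not the algorithm from [TKDE06]; the schema, data, and `find_parameter` helper are assumptions for illustration only.

```python
import sqlite3


def find_parameter(conn, target_cardinality, candidates):
    """Scan candidate values for p in 'SELECT ... WHERE a > p'.

    Returns (p, error, cardinality) for the candidate whose result size is
    closest to the target; ties keep the earliest candidate.
    """
    best = None
    for p in candidates:
        (count,) = conn.execute(
            "SELECT COUNT(*) FROM R WHERE a > ?", (p,)
        ).fetchone()
        error = abs(count - target_cardinality)
        if best is None or error < best[1]:
            best = (p, error, count)
    return best


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE R (a INTEGER)")
conn.executemany(
    "INSERT INTO R VALUES (?)", [(v,) for v in (10, 10, 10, 50, 50, 50)]
)

exact = find_parameter(conn, 3, range(0, 60, 10))
approximate = find_parameter(conn, 2, range(0, 60, 10))
print(exact)
print(approximate)
```

With this fixed database the selection can only return 6, 3, or 0 tuples, so a test case that asks for exactly 2 tuples can at best be approximated. This is the kind of test case that a query-unaware database never covers.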

QAGen – Query Aware test database Generator

Based on symbolic query processing

We can control the output size of each intermediate query result (and even more)

QAGen – Generate a query-aware test database for each test case

QAGen overview

QAGen overview – Query Analyzer

Analyze the query and assign knobs to each operator

A knob is a parameter of an operator that controls its output (e.g., output cardinality, distribution)

A knob for an operator is not always available for tuning

QAGen overview – Query Analyzer

A knob for an operator is not always available for tuning

join distribution? Yes

join distribution? No

QAGen overview – Query Analyzer

The available knob(s) for an operator depend on its input characteristics

Definition: pre-grouping data

Definition: non pre-grouping data

QAGen overview – Query Analyzer

Symbolic Query Engine and Symbolic Database (SDB)

An SQL operator can:

Add predicates to a symbol

Replace a symbol with another symbol (e.g., joining)

E.g., SELECT a FROM R WHERE a > p;

[Figure: σ a>p with 1 output; the symbols are annotated with the predicates > p and <= p]

Symbolic Query Engine and Symbolic Database (SDB)

How to physically store the symbolic data?

Options:

Implement a native symbolic database

Use a relational database

- How to represent "a1 > p"?

- Store all predicates that are associated with a symbol s in a separate relation called PTable

PTable

s    Pred.
a1   a1 > p
a2   a2 <= p
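The PTable idea can be sketched on top of an ordinary relational database. The example below uses Python's built-in `sqlite3`; the table layouts, the `symbolic_select` helper, and the rule "the first n symbols satisfy the predicate" are assumptions made for illustration, not the schema or policy used by QAGen.

```python
import sqlite3


def symbolic_select(conn, param, output_cardinality):
    """Symbolic sigma(a > param) over R with a fixed output cardinality.

    The first `output_cardinality` symbols get the predicate '> param' and
    are passed on; all others get '<= param'. Predicates go into PTable.
    """
    symbols = [row[0] for row in conn.execute("SELECT a FROM R ORDER BY rowid")]
    if output_cardinality > len(symbols):
        raise ValueError("not enough input tuples")
    output = []
    for i, s in enumerate(symbols):
        if i < output_cardinality:
            pred = f"{s} > {param}"
            output.append(s)
        else:
            pred = f"{s} <= {param}"
        conn.execute("INSERT INTO PTable (s, pred) VALUES (?, ?)", (s, pred))
    return output


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE R (a TEXT)")              # symbolic relation
conn.execute("CREATE TABLE PTable (s TEXT, pred TEXT)")
conn.executemany("INSERT INTO R VALUES (?)", [("a1",), ("a2",)])

out = symbolic_select(conn, "p", 1)
rows = conn.execute("SELECT s, pred FROM PTable ORDER BY rowid").fetchall()
print(out)
print(rows)
```

Storing predicates as rows keeps the symbolic data queryable with plain SQL, which is the appeal of the relational option over a native symbolic store.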

Data Instantiator

Data Instantiation

• The data instantiator uses a constraint solver:

• Input: a (propositional) constraint (e.g., A + B > 50)

• Output: any concrete values that satisfy the constraint (e.g., A=99, B=12)
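To make the solver's contract concrete, here is a deliberately naive stand-in: an exhaustive search over a small integer domain. A real constraint solver is far more sophisticated; the `solve` function, its signature, and the domain are assumptions for this sketch.

```python
from itertools import product


def solve(constraint, variables, domain):
    """Return the first assignment over `domain` that satisfies `constraint`.

    `constraint` is a callable taking the variables as keyword arguments.
    Returns None if no assignment in the domain satisfies it.
    """
    for values in product(domain, repeat=len(variables)):
        assignment = dict(zip(variables, values))
        if constraint(**assignment):
            return assignment
    return None


solution = solve(lambda A, B: A + B > 50, ["A", "B"], range(0, 100))
unsat = solve(lambda A: A > 5 and A < 3, ["A"], range(0, 10))
print(solution)
print(unsat)
```

Any satisfying assignment is acceptable, so the values found here differ from the A=99, B=12 on the slide while still meeting the constraint.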

Symbolic Query Engine

Iterator-based: open(), getNext(), close()

Assumes no naughty user, i.e., no contradicting knob values

SQP – Table operator

Fill up the table with symbols
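A minimal sketch of a table operator in the iterator style (open, getNext, close) might look as follows. The class name, the snake_case method names, and the symbol naming scheme (attribute name plus row number, as in a1, b1) are assumptions made for this example.

```python
class TableOperator:
    """Iterator-style leaf operator that emits tuples of fresh symbols."""

    def __init__(self, attributes, cardinality):
        self.attributes = attributes
        self.cardinality = cardinality  # knob: number of tuples to produce
        self._row = None

    def open(self):
        self._row = 0

    def get_next(self):
        """Return the next symbolic tuple, or None when exhausted."""
        if self._row is None:
            raise RuntimeError("operator is not open")
        if self._row >= self.cardinality:
            return None
        self._row += 1
        return tuple(f"{a}{self._row}" for a in self.attributes)

    def close(self):
        self._row = None


def run(op):
    """Drive an operator through the open / get_next / close protocol."""
    op.open()
    rows = []
    while (t := op.get_next()) is not None:
        rows.append(t)
    op.close()
    return rows


rows = run(TableOperator(["a", "b"], 3))
print(rows)
```

Other operators would pull tuples from a child operator through the same three calls, which is what lets a symbolic plan be composed like a conventional one.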

SQP – σ operator

SQP – ⋈ operator (with FK constraint)

Action: join key replacement

SQP – ⋈ operator (with FK constraint)

When the input of the join is pre-grouped, the world has changed

This sometimes happens, e.g., in a 2-way join over base tables A, B and C with foreign key relationships between A and B and between B and C

SQP – ⋈ operator (with FK constraint)

Join distribution is not supported (the knob is disabled by the analyzer)

Controlling the output cardinality is a subset-sum problem (weakly NP-hard)

Subset-sum has a pseudo-polynomial time exact solution using dynamic programming

SQP – ⋈ operator (with FK constraint)

Blocking

During open():

Materialize table S in a temporary relation

SELECT COUNT(k) FROM S GROUP BY k

Solve the subset-sum problem
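The subset-sum step can be sketched with the textbook dynamic program. The sketch assumes the group sizes are the counts returned by the GROUP BY query above and that the goal is to pick whole groups whose sizes add up to the requested output cardinality; how QAGen wires this into the operator is not shown on the slides, and `subset_sum` is a hypothetical helper.

```python
def subset_sum(group_sizes, target):
    """Pick groups whose sizes add up exactly to `target`.

    Returns a list of group indices, or None if no subset works.
    Runs in pseudo-polynomial time: O(len(group_sizes) * target) states.
    Assumes all group sizes are positive integers.
    """
    # reachable[s] holds one list of group indices whose sizes sum to s.
    reachable = {0: []}
    for i, size in enumerate(group_sizes):
        # Iterate over a snapshot so each group is used at most once.
        for s, chosen in list(reachable.items()):
            new_sum = s + size
            if new_sum <= target and new_sum not in reachable:
                reachable[new_sum] = chosen + [i]
    return reachable.get(target)


sizes = [3, 5, 2, 4]          # e.g., COUNT(k) per join-key group in S
picked = subset_sum(sizes, 9)
print(picked, [sizes[i] for i in picked])
print(subset_sum([3, 5], 4))
```

The operator is blocking because the group sizes must all be known before the selection of groups can be computed.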

SQP – χ operator

Action 1: Aggregation attribute replacement

• o_date3 → o_date1

• o_date4 → o_date2

1st output group (o_date1)

2nd output group (o_date2)

SQP – χ operator

Action 2 (base case version):

- Adding aggregation constraints to PTable, base case:

<l_price1, aggsum1 = l_price1 + l_price2 + l_price3 + l_price4 + l_price7>

<l_price2, aggsum1 = l_price1 + l_price2 + l_price3 + l_price4 + l_price7>

<l_price3, aggsum1 = ‘’>

<l_price4, aggsum1 = ‘’>

<l_price7, aggsum1 = l_price1 + l_price2 + l_price3 + l_price4 + l_price7>

<l_price5, aggsum2 = l_price5 + l_price6 + l_price8>

<l_price6, aggsum2 = l_price5 + l_price6 + l_price8>

<l_price8, aggsum2 = ‘’>

SQP – χ operator

Action 2 (optimized version):

- The cost of a constraint solver call is exponential in the size of the predicates

- Adding 2 aggregation constraints to PTable:

<l_price1, aggsum1 = l_price1 x 5>

<l_price5, aggsum2 = l_price5 x 3>

and do l_price replacement
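The optimized version can be read as: pick one representative symbol per group, replace the other symbols in the group with it, and state the sum as representative times group size. The sketch below follows that reading; the `optimize_aggregation` helper and its choice of the first symbol as representative are assumptions.

```python
def optimize_aggregation(groups):
    """Build the shortened SUM constraints and the symbol replacement map.

    `groups` is a list of lists of symbols, one inner list per output group.
    Returns (ptable_entries, replacement), where every non-representative
    symbol is mapped to the first symbol of its group.
    """
    ptable = []
    replacement = {}
    for g, symbols in enumerate(groups, start=1):
        representative = symbols[0]
        ptable.append(
            (representative, f"aggsum{g} = {representative} x {len(symbols)}")
        )
        for s in symbols[1:]:
            replacement[s] = representative
    return ptable, replacement


groups = [
    ["l_price1", "l_price2", "l_price3", "l_price4", "l_price7"],
    ["l_price5", "l_price6", "l_price8"],
]
ptable, replacement = optimize_aggregation(groups)
print(ptable)
print(replacement)
```

Each constraint now mentions a single symbol, so the solver input stays small no matter how many tuples fall into a group.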

Data Instantiation

Use a constraint solver to instantiate the symbolic database

for each symbolic relation r

    for each tuple t

        for each symbol s

            load the related predicates P

            instantiate P

            cache P
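The loop above can be sketched as follows. Several details are assumptions: predicates are stored as (operator, constant) pairs, the solver is a naive scan over a small integer domain, and the cache is interpreted as remembering the value chosen for each symbol so that a symbol appearing in several tuples is instantiated consistently.

```python
import operator

# Hypothetical predicate encoding: (operator, constant) pairs per symbol.
OPS = {">": operator.gt, "<=": operator.le, "=": operator.eq}


def solve(predicates, domain=range(0, 100)):
    """Naive stand-in for a constraint solver: scan a small integer domain."""
    for v in domain:
        if all(OPS[op](v, c) for op, c in predicates):
            return v
    raise ValueError("unsatisfiable predicates")


def instantiate_database(symbolic_db, ptable):
    """Replace every symbol with a concrete value that satisfies its predicates."""
    cache = {}      # symbol -> concrete value already chosen
    concrete = {}
    for name, relation in symbolic_db.items():      # for each symbolic relation r
        rows = []
        for tup in relation:                        # for each tuple t
            row = []
            for s in tup:                           # for each symbol s
                if s not in cache:
                    cache[s] = solve(ptable.get(s, []))   # load and instantiate P
                row.append(cache[s])
            rows.append(tuple(row))
        concrete[name] = rows
    return concrete


symbolic_db = {"R": [("a1",), ("a2",)], "S": [("a1",)]}
ptable = {"a1": [(">", 40)], "a2": [("<=", 40)]}
db = instantiate_database(symbolic_db, ptable)
print(db)
```

Because `a1` appears in both R and S, the cache ensures both occurrences receive the same value, which keeps the join between the two relations intact after instantiation.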

Experiment 1 – Operator Performance

Study the performance (and scalability) of:

Individual operators during SQP

The data instantiation phase

Use TPC-dbgen to generate 3 TPCH-DBs: 10M, 100M, 1G

Run Q8(TPCH-DB) to collect the intermediate results R for each operator

QAGen(Q8, R) → Q8 query-aware database

Experiments – TPC-H Query 8

Experiment 1 – TPC-H Query 8

Experiment 2 – Effects of knob values

Use TPCH Q8

6 sets of knob values:

TPCH-Uniform, TPCH-Zipf

Min-Uniform, Min-Zipf

Max-Uniform, Max-Zipf

Experiment 2 – Effects of knob values

Experiment 3 – System Scalability

Related Work, Future Work, Conclusions

Reverse Query Processing (ICDE07)

Given the result R and the query Q, reversely process Q to generate D

For functional testing of database applications, view maintenance, and debugging SQL

Multiple SQL statements (to ACM TSE journal)

Current approach 2 – Stochastically generate many test queries

Based on a given test database, RAGS/QGen generates many valid SQL queries to test the system

No guarantee that T1 can be covered

As with the previous approach, T2 is never covered

QAGen overview – Query Analyzer

Each knob combination (e.g., output cardinality + join distribution) for an operator may be implemented in different ways

The output is a knob-annotated execution plan