comp 5138 relational database management systems semester 2, 2007 lecture 12 query processing and...

42
COMP 5138 COMP 5138 Relational Database Relational Database Management Systems Management Systems Semester 2, 2007 Semester 2, 2007 Lecture 12 Lecture 12 Query Processing and Optimization Query Processing and Optimization

Upload: ginger-holland

Post on 12-Jan-2016

222 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: COMP 5138 Relational Database Management Systems Semester 2, 2007 Lecture 12 Query Processing and Optimization

COMP 5138COMP 5138

Relational Database Relational Database

Management Systems Management Systems

Semester 2, 2007Semester 2, 2007

Lecture 12Lecture 12

Query Processing and OptimizationQuery Processing and Optimization

Page 2: COMP 5138 Relational Database Management Systems Semester 2, 2007 Lecture 12 Query Processing and Optimization

2222

L10L10Query ProcessingQuery Processing

& Optimization& Optimization

Parsing and Translation

EvaluationIndividual operationsEntire operations

OptimizationCost-basedHeuristic

Today’s Agenda

Page 3: COMP 5138 Relational Database Management Systems Semester 2, 2007 Lecture 12 Query Processing and Optimization

3333

L10L10Query ProcessingQuery Processing

& Optimization& Optimization

Overview Query Processing

Basic StepsParsing and TranslationOptimisationEvaluation

Source: Silberschatz/Korth/Sudarshan: Database System Concepts, 2002.

Page 4: COMP 5138 Relational Database Management Systems Semester 2, 2007 Lecture 12 Query Processing and Optimization

4444

L10L10Query ProcessingQuery Processing

& Optimization& Optimization

Basic Steps in Query Processing

Parsing and Translationtranslate the query into its internal form. This is then translated into relational algebra.Parser checks syntax, verifies relations

Query OptimizationAmongst all equivalent query-evaluation plans choose the one with lowest cost.

Query EvaluationQuery-execution engine takes a query-evaluation plan, executes that plan, and returns the answers to the query.

Page 5: COMP 5138 Relational Database Management Systems Semester 2, 2007 Lecture 12 Query Processing and Optimization

5555

L10L10Query ProcessingQuery Processing

& Optimization& Optimization

Parsing and Translation

SQL gets translated into relational algebra, which can be shown as expression tree.

Operators have one or more input sets and return one (or more) output set(s).Leafs correspond to (base) relations.

Example:

SELECT name FROM students , enrolled WHERE cid = ‘COMP5138’

students enrolled

cid=‘COMP5138’

name

Projection

Join

Selection

Page 6: COMP 5138 Relational Database Management Systems Semester 2, 2007 Lecture 12 Query Processing and Optimization

6666

L10L10Query ProcessingQuery Processing

& Optimization& Optimization

Parsing and Translation

EvaluationIndividual operationsEntire operations

OptimizationCost-basedHeuristic

Today’s Agenda

Page 7: COMP 5138 Relational Database Management Systems Semester 2, 2007 Lecture 12 Query Processing and Optimization

7777

L10L10Query ProcessingQuery Processing

& Optimization& Optimization Query Evaluation Plan

Each relational algebra operation can be evaluated using one of several different algorithms

Correspondingly, a relational-algebra expression can be evaluated in many ways.

Annotated expression specifying detailed evaluation strategy is called an evaluation-plan.

Example: We can use an index onbalance to find accounts

with balance<2500.

Or can perform completerelation scan and discard

accounts with balance 2500account

balance

balance2500

Page 8: COMP 5138 Relational Database Management Systems Semester 2, 2007 Lecture 12 Query Processing and Optimization

8888

L10L10Query ProcessingQuery Processing

& Optimization& OptimizationExpression Tree vs.

Query Evaluation Plan

Relational algebra operators == logical operators which represent the intermediate results.Physical operators show how query is evaluated.

account

balance

balance2500

Project balance

Index Range Scan(account.balance,

balance2500)

Table Scan (account)

Project balance

Filterbalance2500

Page 9: COMP 5138 Relational Database Management Systems Semester 2, 2007 Lecture 12 Query Processing and Optimization

9999

L10L10Query ProcessingQuery Processing

& Optimization& Optimization Measures of Query Costs

Query Optimization: Amongst all equivalent evaluation plans choose the one with lowest cost. Cost is generally measured as total elapsed time for answering query

Many factors contribute to time costdisk accesses, CPU, or even network communication

Typically disk access is the predominant cost, and is also relatively easy to estimate.For simplicity, we just use number of block transfers from disk as the cost measure

We ignore the difference in cost between sequential and random I/O for simplicityWe also ignore CPU costs for simplicity

Page 10: COMP 5138 Relational Database Management Systems Semester 2, 2007 Lecture 12 Query Processing and Optimization

10101010

L10L10Query ProcessingQuery Processing

& Optimization& Optimization Access Path

An access path is a method of retrieving tuples:File scan, or index that matches a selection (in the query)

A tree index matches (a conjunction of) terms that involve only attributes in a prefix of the search key. E.g., Tree index on <a, b, c> matches the selection a=5

AND b=3, and matches a=5 AND b>6, but not b=3.

A hash index matches (a conjunction of) terms that has a term attribute = value for every attribute in the search key of the index.

E.g., Hash index on <a, b, c> matches a=5 AND b=3 AND c=5; but it does not match b=3, or a=5 AND b=3, or a>5 AND b=3 AND c=5.

Page 11: COMP 5138 Relational Database Management Systems Semester 2, 2007 Lecture 12 Query Processing and Optimization

11111111

L10L10Query ProcessingQuery Processing

& Optimization& Optimization Selection

Find the most selective access path, retrieve tuples using it, and apply any remaining terms that don’t match the index:

Most selective access path: An index or file scan that we estimate will require the fewest page I/Os.

Table scan – search algorithms that locate and retrieve records that fulfill a selection condition.

Cost estimate (number of disk blocks scanned) = bR

bR denotes number of blocks containing records from relation R

e.g. Oracle: TABLE SCAN, TABLE ACCESS FULL

Index scan – search algorithms that use an indexselection condition must be on search-key of index.Cost = HTi + number of blocks containing retrieved records (PK: 1 block)

HT: number of index level

e.g. Oracle: INDEX RANGE SCAN or INDEX UNIQUE SCAN

Page 12: COMP 5138 Relational Database Management Systems Semester 2, 2007 Lecture 12 Query Processing and Optimization

12121212

L10L10Query ProcessingQuery Processing

& Optimization& Optimization Projection

The expensive part is removing duplicates.SQL systems don’t remove duplicates unless the keyword DISTINCT is specified in a query.

Sorting Approach: Sort on <sid, bid> and remove duplicates. (Can optimize this by dropping unwanted information while sorting.)

Hashing Approach: Hash on <sid, bid> to create partitions. Load partitions into memory one at a time, build in-memory hash structure, and eliminate duplicates.

Page 13: COMP 5138 Relational Database Management Systems Semester 2, 2007 Lecture 12 Query Processing and Optimization

13131313

L10L10Query ProcessingQuery Processing

& Optimization& Optimization Sorting

Importance of sortingSQL queries can specify that the output be sortedSQL and relational algebra can be implemented efficiently if the input are sorted

We may build an index on the relation, and then use the index to read the relation in sorted order. May lead to one disk block access for each tuple.

For relations that fit in memory, techniques like quick sort can be used. For relations that don’t fit in memory, external sort-merge is a good choice.

Page 14: COMP 5138 Relational Database Management Systems Semester 2, 2007 Lecture 12 Query Processing and Optimization

14141414

L10L10Query ProcessingQuery Processing

& Optimization& Optimization External Sort-Merge

1. Create sorted runs. Let i be 0 initially. Repeatedly do the following till the end of the relation: (a) Read M blocks of relation into memory (b) Sort the in-memory blocks (c) Write sorted data to run Ri; increment i.

Let the final value of i be N

2. Merge the runs (N-way merge). We assume (for now) that N < M. 1. Use N blocks of memory to buffer input runs, and 1 block to buffer

output. Read the first block of each run into its buffer page2. repeat

1. Select the first record (in sort order) among all buffer pages2. Write the record to the output buffer. If output is full, write it to disk.3. If this is the last records of the input buffer page then

read the next block (if any) of the run into the buffer.

3. until all input buffer pages are empty:

3. If i M, several merge passes are required.1. In each pass, contiguous groups of M - 1 runs are merged.

Let M denote memory size (in pages).

Page 15: COMP 5138 Relational Database Management Systems Semester 2, 2007 Lecture 12 Query Processing and Optimization

15151515

L10L10Query ProcessingQuery Processing

& Optimization& OptimizationExternal Sort-Merge -

Example

Page 16: COMP 5138 Relational Database Management Systems Semester 2, 2007 Lecture 12 Query Processing and Optimization

16161616

L10L10Query ProcessingQuery Processing

& Optimization& Optimization External Sort-Merge

Cost analysis:Total number of merge passes required: logM–1(br/M).

Disk accesses for initial run creation as well as in each pass is 2br

for final pass, we don’t count write cost we ignore final write cost for all operations since the output of

an operation may be sent to the parent operation without being written to disk

Thus total number of disk accesses for external sorting:

br ( 2 logM–1(br / M) + 1)

Page 17: COMP 5138 Relational Database Management Systems Semester 2, 2007 Lecture 12 Query Processing and Optimization

17171717

L10L10Query ProcessingQuery Processing

& Optimization& Optimization Join Operations

Several different algorithms to implement joinsNested-loop joinBlock nested-loop joinIndexed nested-loop joinMerge-joinHash-join

Choice based on cost estimateExamples use the following information

Number of records of students: 1,000 enrolled: 10,000Number of blocks of students: 100 enrolled: 400

Page 18: COMP 5138 Relational Database Management Systems Semester 2, 2007 Lecture 12 Query Processing and Optimization

18181818

L10L10Query ProcessingQuery Processing

& Optimization& Optimization Nested-Loop Join

To compute the theta join R Sfor each tuple r in R do begin

for each tuple s in S do begin if (r,s)=true then add r•s to the resultend

endR is called the outer relation,S the inner relation of the joinRequires no indices and can be used with any kind of join condition.Expensive since it examines every pair of tuples in the two relations.

Page 19: COMP 5138 Relational Database Management Systems Semester 2, 2007 Lecture 12 Query Processing and Optimization

19191919

L10L10Query ProcessingQuery Processing

& Optimization& OptimizationNested-Loop Join Cost

AnalysisIn the worst case, if there is memory only to hold one block of each relation, the estimated cost is |R| bS + bR If the smaller relation fits entirely in memory, use that as the inner relation. Reduces cost to bR + bS disk accesses.

Example: In the worst case scenario:students as outer relation: 1000 400 + 100 = 400,100 disk accessesenrolled as outer relation: 10000 100 + 400= 1,000,400 disk accessesIf smaller relation (students) fits entirely in memory, the cost estimate will be 500 disk accesses.

Block nested-loops algorithm (next slide) is preferable.

Page 20: COMP 5138 Relational Database Management Systems Semester 2, 2007 Lecture 12 Query Processing and Optimization

20202020

L10L10Query ProcessingQuery Processing

& Optimization& Optimization Block Nested-Loop Join

Variant of nested-loop join in which every block of inner relation is paired with every block of outer relation.

for each block BR of R do begin

for each block Bs of S do begin for each tuple r in BR do begin

for each tuple s in Bs do begin if (r,s)=true then

add r•s to the result endend

endend

Page 21: COMP 5138 Relational Database Management Systems Semester 2, 2007 Lecture 12 Query Processing and Optimization

21212121

L10L10Query ProcessingQuery Processing

& Optimization& OptimizationBlock Nested-Loop Join

Cost AnalysisWorst case estimate: bR bS + bR block accesses.

Each block in the inner relation S read once for each block in the outer relation (instead of once for each tuple in the outer relation)

Best case: bR + bS block accesses.

Improvements to nested loop & block nested loop algorithms:In block nested-loop, use M — 2 disk blocks as blocking unit for outer relations, where M = memory size in blocks; use remaining two blocks to buffer inner relation and output

Cost = bR / (M-2) bS + bR

If equi-join attribute forms a key on inner relation, stop inner loop on first matchScan inner loop forward and backward alternately, to make use of the blocks remaining in buffer (with LRU replacement)Use index on inner relation if available (next slide)

Page 22: COMP 5138 Relational Database Management Systems Semester 2, 2007 Lecture 12 Query Processing and Optimization

22222222

L10L10Query ProcessingQuery Processing

& Optimization& OptimizationIndexed Nested-Loop Join

Index lookups can replace file scans ifjoin is an equi-join or natural join andan index is available on the inner relation’s join attribute

For each tuple r in the outer relation R, use the index to look up tuples in S that satisfy the join condition with tuple R.Worst case: buffer has space for only one page of R, and, for each tuple in R, we perform an index lookup on S.

Cost of the join: bR + |R| c

Where c is the cost of traversing index and fetching all matching S tuples for one tuple of Rc can be estimated as cost of a single selection on S using the join condition.

If an index exists on the appropriate column of each relation, you can choose which relation is the outer one

Compare the costs each way

Page 23: COMP 5138 Relational Database Management Systems Semester 2, 2007 Lecture 12 Query Processing and Optimization

23232323

L10L10Query ProcessingQuery Processing

& Optimization& Optimization Merge-Join

1. Sort both relations on their join attribute (if not already sorted on the join attributes).

2. Merge the sorted relations to join themJoin step is similar to the merge stage of the sort-merge algorithm. Main difference is handling of duplicate values in join attribute

Can construct an index just to compute a join.Detailed algorithm in textbook

Page 24: COMP 5138 Relational Database Management Systems Semester 2, 2007 Lecture 12 Query Processing and Optimization

24242424

L10L10Query ProcessingQuery Processing

& Optimization& Optimization Merge-Join Cost Analysis

Can be used only for equi-joins and natural joinsEach block needs to be read only once (assuming all tuples for any given value of the join attributes fit in memoryThus number of block accesses for merge-join is bR + bS + the cost of sorting if relations are unsorted.

hybrid merge-join: If one relation is sorted, and the other has a secondary B+-tree index on the join attribute

Merge the sorted relation with the leaf entries of the B+-tree . Sort the result on the addresses of the unsorted relation’s tuplesScan the unsorted relation in physical address order and merge with previous result, to replace addresses by the actual tuples

Sequential scan more efficient than random lookup

Page 25: COMP 5138 Relational Database Management Systems Semester 2, 2007 Lecture 12 Query Processing and Optimization

25252525

L10L10Query ProcessingQuery Processing

& Optimization& Optimization Hash-Join

Applicable for equi-joins and natural joins.A hash function h is used to partition tuples of both relations h maps JoinAttrs values to {0, 1, ..., n}, where JoinAttrs denotes the common attributes of R and S used in the natural join.

R0, R1, . . ., Rn denote partitions of R tuples

Each tuple r R is put in partition Ri where i = h(r [JoinAttrs]).

S0, S1. . ., Sn denotes partitions of S tuples

Each tuple s S is put in partition Si, where i = h(s[JoinAttrs]).

Page 26: COMP 5138 Relational Database Management Systems Semester 2, 2007 Lecture 12 Query Processing and Optimization

26262626

L10L10Query ProcessingQuery Processing

& Optimization& Optimization Hash-Join (cont’d)

Page 27: COMP 5138 Relational Database Management Systems Semester 2, 2007 Lecture 12 Query Processing and Optimization

27272727

L10L10Query ProcessingQuery Processing

& Optimization& Optimization Other Operations

Duplicate elimination can be implemented via hashing or sorting.

On sorting duplicates will come adjacent to each other, and all but one set of duplicates can be deleted. Hashing is similar – duplicates will come into the same bucket.

Aggregation can be implemented in a manner similar to duplicate elimination.

Sorting or hashing can be used to bring tuples in the same group together, then the aggregate functions can be applied on each group.

Set Operations (UNION, INTERSECT, SET MINUS)

Page 28: COMP 5138 Relational Database Management Systems Semester 2, 2007 Lecture 12 Query Processing and Optimization

28282828

L10L10Query ProcessingQuery Processing

& Optimization& OptimizationEvaluation of Expressions

So far: we have seen algorithms for individual operations

Alternatives for evaluating an entire expression tree:Materialization (also: set-at-a-time): simply evaluate one operation at a time. The result of each evaluation is materialized (stored) in a temporary relation for subsequent use.Pipelining (also: tuple-at-a-time): evaluate several operations simultaneously in a pipeline

Page 29: COMP 5138 Relational Database Management Systems Semester 2, 2007 Lecture 12 Query Processing and Optimization

29292929

L10L10Query ProcessingQuery Processing

& Optimization& Optimization Materialization

Materialized evaluation: evaluate one operation at a time, starting at the lowest-level. Use intermediate results materialized into temporary relations to evaluate next-level operations.E.g., in figure below, compute and store

then compute the store its join with customer, and finally compute the projections on customer-name.

)(2500 accountbalance<

Materialized evaluation is always applicable

Costs can be quite high

Page 30: COMP 5138 Relational Database Management Systems Semester 2, 2007 Lecture 12 Query Processing and Optimization

30303030

L10L10Query ProcessingQuery Processing

& Optimization& Optimization Pipelining

Pipelined evaluation : evaluate several operations simultaneously, passing the results of one operation on to the next.

E.g., in previous expression tree, don’t store result of

instead, pass tuples directly to the join. Similarly, don’t store result of join, pass tuples directly to projection.

Much cheaper than materialization: no need to store a temporary relation to disk.

Pipelining may not always be possible – e.g., sort, hash-join.

)(2500 accountbalance<

Page 31: COMP 5138 Relational Database Management Systems Semester 2, 2007 Lecture 12 Query Processing and Optimization

31313131

L10L10Query ProcessingQuery Processing

& Optimization& Optimization

Parsing and Translation

EvaluationIndividual operationsEntire operations

OptimizationCost-basedHeuristic

Today’s Agenda

Page 32: COMP 5138 Relational Database Management Systems Semester 2, 2007 Lecture 12 Query Processing and Optimization

32323232

L10L10Query ProcessingQuery Processing

& Optimization& OptimizationIntro to Query Optimization

A relational algebra expression may have many equivalent expressionsExample: SELECT BALANCE FROM account WHERE balance < 2500Can be translated into balance2500(balance(account)) which is equivalent to balance(balance2500(account))

Page 33: COMP 5138 Relational Database Management Systems Semester 2, 2007 Lecture 12 Query Processing and Optimization

33333333

L10L10Query ProcessingQuery Processing

& Optimization& Optimization Query Optimization

Central Problems:Query is (by definition) declarative, e.g. it does not specify the execution order.

But we need an executable plan.

The goal of query optimization is less to find the optimal plan, but more to avoid the worst.

Time for query optimization adds to total query execution time.

Three evaluation planInteraction of evaluation techniquesCost-based Heuristic

Page 34: COMP 5138 Relational Database Management Systems Semester 2, 2007 Lecture 12 Query Processing and Optimization

34343434

L10L10Query ProcessingQuery Processing

& Optimization& OptimizationCost-based Query

Optimization

Generation of query-evaluation plans for an expression involves several steps: Generating logically equivalent expressions

Use equivalence rules to transform an expression into an equivalent one.

Annotating resultant expressions to get alternative query plans

Choosing the cheapest plan based on estimated cost

The overall process is called cost-based optimization.

Some systems only have rule-based optimization, i.e. they don‘t take statistical information into account.

Page 35: COMP 5138 Relational Database Management Systems Semester 2, 2007 Lecture 12 Query Processing and Optimization

35353535

L10L10Query ProcessingQuery Processing

& Optimization& OptimizationStatistic Information

in the DB Catalog

Need information about the relations and indexes involved. Catalogs typically contain at least:

# tuples (NTuples) and # pages (NPages) for each relation.# distinct key values (NKeys) and NPages for each index.Index height, low/high key values (Low/High) for each tree index.

Catalogs are updated periodically.Updating whenever data changes is too expensive; lots of approximation anyway, so slight inconsistency ok.

More detailed information (e.g., histograms of the values in some field) are sometimes stored.

Page 36: COMP 5138 Relational Database Management Systems Semester 2, 2007 Lecture 12 Query Processing and Optimization

36363636

L10L10Query ProcessingQuery Processing

& Optimization& Optimization Heuristic Optimization

Cost-based optimization is expensive, even with dynamic programming.Systems may use heuristics to reduce the number of choices that must be made in a cost-based fashion.Heuristic optimization transforms the query-tree by using a set of rules that typically (but not in all cases) improve execution performance:

Perform selection early (reduces the number of tuples)Perform projection early (reduces the number of attributes)Perform most restrictive selection and join operations before other similar operations.Some systems use only heuristics, others combine heuristics with partial cost-based optimization.

Page 37: COMP 5138 Relational Database Management Systems Semester 2, 2007 Lecture 12 Query Processing and Optimization

37373737

L10L10Query ProcessingQuery Processing

& Optimization& Optimization Transformation Example

Query: Find the names of all students who have enrolled in some course with 6 credit points.

name(credit_points=6 (course (enrolled student)))

Transformation using distribution rule 3:

name( (credit_points=6 (course)) (enrolled student)))

Performing the selection as early as possible reduces the size of the relation to be joined.

Page 38: COMP 5138 Relational Database Management Systems Semester 2, 2007 Lecture 12 Query Processing and Optimization

38383838

L10L10Query ProcessingQuery Processing

& Optimization& OptimizationExample with

Multiple Transformations

Query: Find the names of all students which enrolled in a 6 credit_point course, whose grade is a distinction (‘D’).name(grade=‘D’ credit_points=6

(course (enrolled student))) Transformation using join association rule:name(grade=‘D’ credit_points=6

((course enrolled) student))

Second form provides an opportunity to apply the “perform selections early” rule, resulting in the subexpression

credit_points=6 (course) grade=‘D’ (enrolled)

Thus a sequence of transformations can be useful

Page 39: COMP 5138 Relational Database Management Systems Semester 2, 2007 Lecture 12 Query Processing and Optimization

39393939

L10L10Query ProcessingQuery Processing

& Optimization& Optimization Optimization Strategies

Ideally: find the optimal plan.This approach is very expensive in time and space.

Typically: Find a good plan, avoiding the worst plan.do not generate all expressions

Central Problem: Join Ordering (Join Enumeration)Typical Approach:

use dynamic programming for generating join plansrestrict to left-deep join trees

Page 40: COMP 5138 Relational Database Management Systems Semester 2, 2007 Lecture 12 Query Processing and Optimization

40404040

L10L10Query ProcessingQuery Processing

& Optimization& Optimization Left Deep Join Trees

In left-deep join trees, the right-hand-side input for each join is a relation, not the result of an intermediate join.

Page 41: COMP 5138 Relational Database Management Systems Semester 2, 2007 Lecture 12 Query Processing and Optimization

41414141

L10L10Query ProcessingQuery Processing

& Optimization& OptimizationDynamic Programming

in Optimization

To find best join tree for a set of n relations:To find best plan for a set S of n relations, consider all possible plans of the form: S1 (S – S1) where S1 is any non-empty subset of S.Recursively compute costs for joining subsets of S to find the cost of each plan. Choose the cheapest of the 2n – 1 alternatives.When plan for any subset is computed, store it and reuse it when it is required again, instead of recomputing it

Dynamic programming

E.g. r1, r2, r3, r4, r5

(r1, r2, r3), (r4, r5) =12*12=144

(r1, r2, r3) and (r4, r5) = 12+12 =24

Page 42: COMP 5138 Relational Database Management Systems Semester 2, 2007 Lecture 12 Query Processing and Optimization

42424242

L10L10Query ProcessingQuery Processing

& Optimization& Optimization Wrap-Up

Parsing and Translation

EvaluationIndividual operationsEntire operations

OptimizationCost-basedHeuristic