firebird: cost-based optimization and statistics, by dmitry yemanov (in english)

Cost-based Optimization and Statistics in Firebird

Dmitry Yemanov

The Firebird Project

http://www.firebirdsql.org/

Introduction

The optimizer decides how to retrieve all the required information in the most efficient way it can.

Different queries and/or fetch strategies may benefit from different data access paths.

Some information must be available to help the optimizer guess the best access path.

Optimization strategies:
  Rule-based (heuristics)
  Cost-based (statistics)

Rule-based Optimization

Heuristic definitions:
  Indexed retrieval is better than a full table scan (and an indexed loop join is better than a merge join)
  A B-tree has three levels of depth
  Compound indices are better than simple ones

Drawbacks:
  Indices can be bad for some operations
  User intentions are not taken into account
  Not ready for “ad hoc” queries

Cost-based Optimization

Key points:
  Every operation has an associated cost value
  The cost value is calculated using statistical data
  Cost is aggregated from the bottom up in the access path

Drawbacks:
  Complex implementation
  Slow optimization process
  Requires up-to-date statistics

Basic Terms

Selectivity:
  Represents a fraction of rows from a row set
  Lies in the value range 0.0 to 1.0

Cardinality:
  Represents the number of rows in a row set
  Base cardinality is the number of rows in a base table

Understanding of Cost

Cost:
  Is a function of the estimated cardinalities
  Represents the computational complexity of the retrieval

Measurement:
  The cost value depends linearly on the number of logical reads required to perform an operation
  A logical read is equal to a single page fetch
  The cost value may also take into account auxiliary steps, such as an external sort

Cost Measurement (example)

Full table scan cost = base cardinality

Unique index scan cost = b-tree level + 1

Range index scan cost = b-tree level + N + selectivity * base cardinality

(N is the number of leaf page fetches required, and thus depends on the average key length)
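The three formulas above can be written out directly. The following is a minimal sketch, not Firebird's actual code: the function names and the sample figures (100,000 rows, a 3-level index, 20 leaf pages) are assumptions made for illustration.

```python
def full_scan_cost(base_cardinality: float) -> float:
    """Full table scan: cost = base cardinality."""
    return base_cardinality


def unique_index_scan_cost(btree_level: int) -> float:
    """Unique index scan: cost = b-tree level + 1."""
    return btree_level + 1


def range_index_scan_cost(btree_level: int, leaf_pages: int,
                          selectivity: float, base_cardinality: float) -> float:
    """Range index scan: cost = b-tree level + N + selectivity * base cardinality."""
    if not 0.0 <= selectivity <= 1.0:
        raise ValueError("selectivity must lie in the range 0.0 to 1.0")
    return btree_level + leaf_pages + selectivity * base_cardinality


# Hypothetical table and index, chosen only to make the comparison visible.
rows = 100_000
full = full_scan_cost(rows)
unique = unique_index_scan_cost(3)
narrow = range_index_scan_cost(3, 20, 0.01, rows)  # selective range
wide = range_index_scan_cost(3, 20, 1.0, rows)     # range covering every row

print(full, unique, narrow, wide)
```

With these numbers the selective range scan is far cheaper than the full scan, while the range that matches every row costs slightly more than the full scan, because the index pages are read in addition to the rows.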

Cost Aggregation (example)

Query:

SELECT *
FROM T1 JOIN T2 ON T1.PK = T2.FK
WHERE T1.VAL + T2.VAL < 100
ORDER BY T1.NUM

Access path, with cost aggregated from the bottom up:

Final Row Set (cost = 9000)
  Sort (cost = 9000)
    Filter (cost = 7000)
      Loop Join (cost = 6000)
        Full Scan (cost = 1000)
        Index Scan (cost = 5)
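One way to read the numbers in this example is sketched below. The slide gives only the totals, so the join formula, the outer row count and the per-step increments for the filter and the sort are all assumptions chosen to reproduce those totals; they are not taken from Firebird.

```python
def loop_join_cost(outer_cost: float, outer_rows: float, inner_cost: float) -> float:
    """Read the outer stream once, then probe the inner stream once per outer row."""
    return outer_cost + outer_rows * inner_cost


def add_step(input_cost: float, step_cost: float) -> float:
    """A unary step (filter, sort) adds its own work to the cost of its input."""
    return input_cost + step_cost


full_scan = 1000.0    # from the slide
index_scan = 5.0      # from the slide
outer_rows = 1000.0   # assumption: full scan cost = base cardinality

FILTER_STEP = 1000.0  # assumption, chosen to reproduce the slide's 7000
SORT_STEP = 2000.0    # assumption, chosen to reproduce the slide's 9000

join = loop_join_cost(full_scan, outer_rows, index_scan)
filtered = add_step(join, FILTER_STEP)
sorted_rows = add_step(filtered, SORT_STEP)
final_row_set = sorted_rows  # delivering the rows adds nothing further

print(join, filtered, sorted_rows, final_row_set)  # 6000.0 7000.0 9000.0 9000.0
```

The point of the sketch is the direction of the calculation: each node's cost is derived from the costs of the nodes beneath it, so the figure at the top describes the whole access path.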

Statistics

Information describing data amounts and distribution of values on different levels (table, index, column)

Stored in a database or estimated at runtime

Collected by request or automatically

Core Statistics

Number of Rows in a Table (Base Cardinality):
  Small tables: number of used record slots on the data pages
  Large tables: number of used data pages / average record length
  Estimated at runtime by scanning pointer pages or data pages
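The two estimates can be sketched as follows. This is an illustration under stated assumptions, not Firebird's implementation: the function names are hypothetical, and the large-table formula assumes that the slide's shorthand implies multiplying the page count by the page size, ignoring page headers, record compression and fill factor.

```python
from typing import Iterable


def estimate_small_table(used_slots_per_page: Iterable[int]) -> int:
    """Small table: count the used record slots on every data page."""
    return sum(used_slots_per_page)


def estimate_large_table(data_pages: int, page_size: int,
                         avg_record_length: int) -> int:
    """Large table: bytes available on the data pages / average record length.

    Assumption: every page is completely filled with records.
    """
    if avg_record_length <= 0:
        raise ValueError("average record length must be positive")
    return data_pages * page_size // avg_record_length


# Hypothetical figures.
small = estimate_small_table([10, 12, 0, 7])
large = estimate_large_table(data_pages=1000, page_size=8192, avg_record_length=64)

print(small, large)  # 29 128000
```

The second estimate is only as good as its inputs, which is why an average page fill factor and an average row length are useful additions to the statistics.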

Core Statistics (continued)

Index Selectivity:
  1 / number of distinct keys in the index
  Maintained per segment: (A), (A, B), (A, B, C)
  Assumes uniform distribution of values
  Calculated during index creation or upon request (SET STATISTICS statement)
  Stored on the index root page
  Visible in RDB$INDICES and RDB$INDEX_SEGMENTS
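The per-segment selectivities can be illustrated with a small sketch. It computes 1 / number of distinct keys for each leading prefix of a compound key; the function name and the sample keys are hypothetical, and the sketch works on an in-memory list rather than on a real index.

```python
from typing import Dict, Sequence, Tuple


def segment_selectivities(keys: Sequence[Tuple],
                          names: Sequence[str]) -> Dict[Tuple[str, ...], float]:
    """Return 1 / distinct-count for each prefix (A), (A, B), (A, B, C), ..."""
    if not keys:
        raise ValueError("at least one key is required")
    result = {}
    for width in range(1, len(names) + 1):
        distinct_prefixes = {key[:width] for key in keys}
        result[tuple(names[:width])] = 1.0 / len(distinct_prefixes)
    return result


# Hypothetical keys of a compound index on (A, B, C).
keys = [
    (1, "x", 10),
    (1, "x", 20),
    (1, "y", 10),
    (2, "x", 10),
]
sel = segment_selectivities(keys, ("A", "B", "C"))

for prefix, value in sel.items():
    print(prefix, round(value, 4))
```

Each additional segment can only keep or increase the number of distinct keys, so the selectivity value never grows as the prefix gets longer.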

Decisions Based on Core Statistics

Full Table Scan over Indexed Retrieval:
  Selectivity close to 1.0 suggests a full scan

What Indices to Use:
  Compare index selectivities and index scan costs
  Consider segment operations for compound indices
  Calculate selectivities for AND and OR operations

Order of Streams in Loop Joins:
  Calculate costs for different join orders and choose the best one
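Two of these decisions lend themselves to a short sketch. The AND and OR formulas below assume that the predicates are statistically independent, which the slides do not state, and the join cost model (scan the first stream, probe each later stream once per row produced so far) is a simplification chosen for illustration. All names and figures are hypothetical.

```python
from itertools import permutations
from typing import Dict, Sequence


def and_selectivity(s1: float, s2: float) -> float:
    """Both predicates must hold (independence assumed)."""
    return s1 * s2


def or_selectivity(s1: float, s2: float) -> float:
    """Either predicate may hold (independence assumed)."""
    return s1 + s2 - s1 * s2


def loop_join_cost(order: Sequence[str], cardinality: Dict[str, float],
                   lookup_cost: Dict[str, float], fanout: Dict[str, float]) -> float:
    """Cost of a nested loop join for one particular order of streams."""
    first = order[0]
    cost = float(cardinality[first])   # full scan of the first stream
    rows = float(cardinality[first])   # rows flowing into the next stream
    for stream in order[1:]:
        cost += rows * lookup_cost[stream]  # one indexed probe per incoming row
        rows *= fanout[stream]              # average matches per probe
    return cost


# Hypothetical statistics for two tables.
cardinality = {"T1": 1000, "T2": 50000}
lookup_cost = {"T1": 4, "T2": 5}
fanout = {"T1": 1.0, "T2": 50.0}

best_order = min(
    permutations(cardinality),
    key=lambda order: loop_join_cost(order, cardinality, lookup_cost, fanout),
)
best_cost = loop_join_cost(best_order, cardinality, lookup_cost, fanout)

print(and_selectivity(0.1, 0.2), or_selectivity(0.1, 0.2))
print(best_order, best_cost)
```

Trying every permutation is only practical for a handful of streams, since the number of orders grows factorially. The sketch shows the comparison itself, not a search strategy.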

Advanced Statistics

Table level:
  Average page fill factor
  Average row length
    (both help with a better base cardinality estimation)
  Number of rows
    (makes the runtime page scan unnecessary)

Advanced Statistics (continued)

Index level:
  B-tree depth
  Average key length
    (both help with a better cost estimation for index scans)
  Clustering factor
    (allows preferring index navigation over an external sort under some conditions; could also be used to avoid filling the sparse bitmap)

Clustering Factor

[Figure: index keys 1 to 5 pointing to data pages. Bad clustering factor: the keys reference scattered data pages 12, 25, 28, 57 and 44. Good clustering factor: the keys reference adjacent data pages 12, 13 and 14.]
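A common way to quantify this idea is to walk the index in key order and count how often the referenced data page changes. The sketch below uses that definition as an assumption; the slides do not say how Firebird would compute the value. The "bad" page list uses the page numbers from the figure, and the assignment of the five keys to pages 12, 13 and 14 in the "good" list is hypothetical.

```python
from typing import Iterable


def clustering_factor(pages_in_key_order: Iterable[int]) -> int:
    """Count data page switches while reading records in index key order."""
    switches = 0
    previous = None
    for page in pages_in_key_order:
        if page != previous:
            switches += 1
            previous = page
    return switches


bad = [12, 25, 28, 57, 44]   # every key lives on a different page
good = [12, 12, 13, 13, 14]  # neighbouring keys share pages (assumed mapping)

print(clustering_factor(bad), clustering_factor(good))  # 5 3
```

A count close to the number of keys means that index navigation touches a new page for almost every row, whereas a count close to the number of data pages means that the rows are stored roughly in key order.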

Advanced Statistics (continued)

Column level:
  Selectivity
    (core feature, required to estimate costs)
  Number of NULLs
    (useful for selectivity estimations for IS [NOT] NULL)
  Value distribution histogram
    (allows selectivity estimations for non-uniform value distributions)

Sample Histograms

[Figure: two sample histograms. 1. Non-Selective Column: values 'A', 'B', 'C' and 'D' plotted against a scale from 0 to 500. 2. Selective Column: labels 1, 5, 5, 5, 10, 20, 50, 50, 80, 100.]
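The benefit of a histogram and a NULL count over a single selectivity value can be shown with a small sketch. It uses an exact frequency count per value as the simplest possible histogram, which is an assumption; the sample column and the function names are hypothetical.

```python
from collections import Counter
from typing import Hashable, Optional, Sequence


def uniform_selectivity(values: Sequence[Optional[Hashable]]) -> float:
    """Equality selectivity assuming a uniform distribution: 1 / distinct values."""
    distinct = {v for v in values if v is not None}
    return 1.0 / len(distinct)


def histogram_selectivity(values: Sequence[Optional[Hashable]],
                          target: Hashable) -> float:
    """Equality selectivity taken from the actual frequency of the value."""
    counts = Counter(v for v in values if v is not None)
    return counts[target] / len(values)


def is_null_selectivity(values: Sequence[Optional[Hashable]]) -> float:
    """Selectivity of IS NULL, derived from the number of NULLs."""
    return sum(1 for v in values if v is None) / len(values)


# Hypothetical non-selective column with a skewed distribution and a few NULLs.
column = ["A"] * 500 + ["B"] * 300 + ["C"] * 150 + ["D"] * 40 + [None] * 10

print(uniform_selectivity(column))         # 0.25
print(histogram_selectivity(column, "A"))  # 0.5
print(histogram_selectivity(column, "D"))  # 0.04
print(is_null_selectivity(column))         # 0.01
```

Under the uniform assumption both 'A' and 'D' would be estimated at 0.25, although one matches half of the rows and the other only 4 percent of them.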

Decisions Based on Advanced Statistics

Sort Aggregation vs Hash Aggregation:
  Selectivity of the columns being grouped by

Loop Join vs Merge Join vs Hash Join:
  Cardinality of tables and filtering predicates

Index Usage:
  Number of NULLs or histogram

Index Navigation vs External Sorting:
  Clustering factor

