Firebird: Cost-based Optimization and Statistics, by Dmitry Yemanov (in English)
DESCRIPTION
A basic introduction to the internal mechanisms of the Firebird optimizer: how it works, how it decides which index to use, why it sometimes fails, and what you can do to improve performance. This presentation will not answer all of these questions, but it gives you basic knowledge of the Firebird optimizer internals. It is not aimed at all developers and requires some prior expertise.
TRANSCRIPT
Cost-based Optimization and Statistics in Firebird
Dmitry Yemanov
The Firebird Project
http://www.firebirdsql.org
Introduction
Optimizer decides how to find all the required information in the most efficient way it can
Different queries and/or fetch strategies may benefit from different data access paths
Some information should exist in order to help the optimizer guess the best access path
Optimization strategies
  Rule-based (heuristics)
  Cost-based (statistics)
Rule-based Optimization
Heuristic definitions
  Indexed retrieval is better than a full table scan (and indexed loop join is better than a merge join)
  B-tree has three levels of depth
  Compound indices are better than simple ones
Drawbacks
  Indices could be bad for some operations
  User intentions are not taken into account
  Not ready for “ad hoc” queries
Cost-based Optimization
Key points
  Every operation has an associated cost value
  Cost value is calculated using statistical data
  Cost is aggregated from bottom up in the access path
Drawbacks
  Complex implementation
  Slow optimization process
  Requires up-to-date statistics
Basic Terms
Selectivity
  Represents a fraction of rows from a row set
  Lies in the value range 0.0 to 1.0
Cardinality
  Represents the number of rows in a row set
  Base cardinality is the number of rows in a base table
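To make the two terms concrete, here is a minimal sketch of how they relate: the estimated cardinality of a filtered row set is the selectivity of the predicate multiplied by the cardinality of the input row set. The function name and the sample numbers are illustrative assumptions, not taken from the slides.

```python
def estimated_cardinality(input_cardinality: float, selectivity: float) -> float:
    """Estimate how many rows survive a predicate with the given selectivity."""
    if not 0.0 <= selectivity <= 1.0:
        raise ValueError("selectivity must lie in the range 0.0 to 1.0")
    return input_cardinality * selectivity


# Hypothetical base table with 10,000 rows and a predicate matching 5% of them.
rows = estimated_cardinality(10_000, 0.05)
print(rows)
```

A selectivity near 0.0 means the predicate keeps very few rows; a selectivity of 1.0 means it keeps all of them.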
Understanding of Cost
Cost
  Is a function of the estimated cardinalities
  Represents computational complexity of the retrieval
Measurement
  Cost value linearly depends on the number of logical reads required to perform an operation
  Logical read is equal to a single page fetch
  Cost value may also take into account auxiliary steps such as an external sorting
Cost Measurement (example)
Full table scan cost = base cardinality
Unique index scan cost = b-tree level + 1
Range index scan cost = b-tree level + N + selectivity * base cardinality
(N represents the number of required leaf page fetches and thus depends on the average key length)
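The three formulas above can be written down directly. This is a minimal sketch: the function and parameter names are assumptions made for illustration, and the sample inputs (a 1,000-row table, a three-level b-tree, two leaf pages) are hypothetical.

```python
def full_scan_cost(base_cardinality: float) -> float:
    """Full table scan: cost equals the base cardinality."""
    return base_cardinality


def unique_index_scan_cost(btree_level: int) -> float:
    """Unique index scan: walk the b-tree, then fetch one record."""
    return btree_level + 1


def range_index_scan_cost(btree_level: int, leaf_pages: int,
                          selectivity: float, base_cardinality: float) -> float:
    """Range index scan: b-tree walk + N leaf pages + matching records."""
    return btree_level + leaf_pages + selectivity * base_cardinality


# Hypothetical inputs, chosen only to exercise the formulas.
print(full_scan_cost(1000))
print(unique_index_scan_cost(3))
print(range_index_scan_cost(3, 2, 0.01, 1000))
```

With these inputs the range scan is far cheaper than the full scan; as the selectivity approaches 1.0, the last term approaches the base cardinality and the index scan loses its advantage.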
Cost Aggregation (example)
Final Row Set (cost = 9000)
  Sort (cost = 9000)
    Filter (cost = 7000)
      Loop Join (cost = 6000)
        Full Scan (cost = 1000)
        Index Scan (cost = 5)
SELECT *
FROM T1 JOIN T2 ON T1.PK = T2.FK
WHERE T1.VAL + T2.VAL < 100
ORDER BY T1.NUM
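The bottom-up aggregation in this example can be sketched as follows. The loop join formula (outer cost plus one inner index probe per outer row) is an assumption that happens to reproduce the slide's figure of 6000 from a full scan of 1000 and an index scan of 5. The slide does not say how the filter and sort increments are derived, so they are simply read off as the differences between the stated costs. All names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Estimate:
    cost: float          # accumulated cost of this node and everything below it
    cardinality: float   # estimated number of rows the node produces


def full_scan(base_cardinality: float) -> Estimate:
    return Estimate(cost=base_cardinality, cardinality=base_cardinality)


def loop_join(outer: Estimate, inner_probe_cost: float) -> Estimate:
    # Assumption: the inner stream is probed once per outer row.
    cost = outer.cost + outer.cardinality * inner_probe_cost
    return Estimate(cost=cost, cardinality=outer.cardinality)


def add_step(child: Estimate, extra_cost: float) -> Estimate:
    # A node that adds its own cost on top of its child's accumulated cost.
    return Estimate(cost=child.cost + extra_cost, cardinality=child.cardinality)


outer = full_scan(1000)                         # Full Scan, cost = 1000
joined = loop_join(outer, inner_probe_cost=5)   # Loop Join, cost = 6000
filtered = add_step(joined, extra_cost=1000)    # Filter, cost = 7000
ordered = add_step(filtered, extra_cost=2000)   # Sort, cost = 9000
print(joined.cost, filtered.cost, ordered.cost)
```

For simplicity the sketch leaves the cardinality unchanged by the filter; a real estimator would also scale it by the predicate's selectivity.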
Statistics
Information describing data amounts and distribution of values on different levels (table, index, column)
Stored in a database or estimated at runtime
Collected by request or automatically
Core Statistics
Number of Rows in a Table (Base Cardinality)
  Small tables: number of used record slots on the data pages
  Large tables: number of used data pages / average record length
  Estimated at runtime via scanning pointer or data pages
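As a rough illustration of the large-table estimate, the sketch below reads the slide's ratio as "bytes available on the used data pages divided by the average record length". The page size factor is an assumption added here so that the result is a row count; the slide does not state it, and the sketch ignores page headers, fill factor and record compression. The function name and sample numbers are hypothetical.

```python
def estimate_base_cardinality(used_data_pages: int, page_size: int,
                              avg_record_length: float) -> float:
    """Approximate the row count from page usage (assumed reading of the slide)."""
    if avg_record_length <= 0:
        raise ValueError("average record length must be positive")
    return used_data_pages * page_size / avg_record_length


# Hypothetical table: 100 used data pages of 8192 bytes, 64-byte records.
print(estimate_base_cardinality(100, 8192, 64))
```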
Core Statistics (continued)
Index Selectivity
  1 / number of distinct keys in the index
  Maintained per segment: (A), (A, B), (A, B, C)
  Assumes uniform distribution of values
  Calculated during index creation or upon request (SET STATISTICS statement)
  Stored on the index root page
  Visible in RDB$INDICES and RDB$INDEX_SEGMENTS
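The per-segment definition can be illustrated with a small sketch that computes 1 / (number of distinct keys) for each leading prefix of a compound key. This models only the arithmetic, not how the engine scans the index; the function name and the sample keys are hypothetical, and NULL handling is ignored.

```python
def segment_selectivities(keys: list[tuple]) -> list[float]:
    """Return the selectivity of each leading prefix: (A), (A, B), (A, B, C), ..."""
    if not keys:
        return []
    width = len(keys[0])
    result = []
    for n in range(1, width + 1):
        # Count distinct values of the first n segments.
        distinct = len({key[:n] for key in keys})
        result.append(1.0 / distinct)
    return result


# Hypothetical compound index on (A, B, C) with four entries.
keys = [(1, "x", 10), (1, "x", 20), (1, "y", 10), (2, "x", 10)]
print(segment_selectivities(keys))
```

Each additional segment can only keep or increase the number of distinct prefixes, so the selectivity value never grows as segments are added.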
Decisions Based on Core Statistics
Full Table Scan over Indexed Retrieval
  Selectivity close to 1.0 suggests a full scan
What Indices to Use
  Compare index selectivities and index scan costs
  Consider segment operations for compound indices
  Calculate selectivities for AND and OR operations
Order of Streams in Loop Joins
  Calculate costs for different join orders and choose the best one
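Two of these decisions lend themselves to a short sketch. The AND/OR combination below uses the textbook independence assumption (multiply for AND, inclusion-exclusion for OR); the slides do not say which formulas Firebird applies, so treat them as an assumption. The access path choice reuses the range index scan formula from the cost measurement slide. All names and inputs are hypothetical.

```python
def and_selectivity(s1: float, s2: float) -> float:
    """Both predicates must hold (assumes the predicates are independent)."""
    return s1 * s2


def or_selectivity(s1: float, s2: float) -> float:
    """Either predicate may hold (assumes the predicates are independent)."""
    return s1 + s2 - s1 * s2


def choose_access_path(base_cardinality: float, btree_level: int,
                       leaf_pages: int, selectivity: float) -> str:
    """Compare the full scan cost with the range index scan cost."""
    full_cost = base_cardinality
    index_cost = btree_level + leaf_pages + selectivity * base_cardinality
    return "index scan" if index_cost < full_cost else "full scan"


print(and_selectivity(0.1, 0.2), or_selectivity(0.1, 0.2))
print(choose_access_path(1000, 3, 2, 0.01))  # selective predicate
print(choose_access_path(1000, 3, 2, 1.0))   # selectivity of 1.0
```

With a selectivity of 1.0 the index scan costs more than the full scan because it pays for the b-tree walk on top of reading every row, which is the effect the first bullet describes.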
Advanced Statistics
Table level
  Average page fill factor
  Average row length
  (both help with a better base cardinality estimation)
  Number of rows
  (allows avoiding the runtime page scan)
Advanced Statistics (continued)
Index level
  B-tree depth
  Average key length
  (both help with a better cost estimation for index scans)
  Clustering factor
  (allows preferring index navigation over an external sort under some conditions; could also be used to avoid filling the sparse bitmap)
Clustering Factor
[Figure: index keys 1 to 5 pointing to data pages. Bad clustering factor: consecutive keys point to scattered data pages (12, 25, 28, 57, 44). Good clustering factor: consecutive keys point to adjacent data pages (12, 13, 14).]
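One common way to quantify this idea is to walk the index in key order and count how often the referenced data page changes. The sketch below uses that definition as an assumption; the slides do not specify how Firebird would compute the value. The "bad" page list follows the figure, while the assignment of five keys to pages 12, 13 and 14 in the "good" case is hypothetical.

```python
def page_switches(pages_in_key_order: list[int]) -> int:
    """Count data page changes while reading records in index key order."""
    switches = 0
    previous = None
    for page in pages_in_key_order:
        if page != previous:
            switches += 1
            previous = page
    return switches


bad = [12, 25, 28, 57, 44]    # every key lands on a different page
good = [12, 12, 13, 13, 14]   # neighbouring keys share pages (assumed layout)
print(page_switches(bad))     # prints 5
print(page_switches(good))    # prints 3
```

A count close to the number of data pages indicates that index navigation reads each page about once; a count close to the number of rows indicates that almost every row costs a separate page fetch.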
Advanced Statistics (continued)
Column level
  Selectivity
  (core feature, required to estimate costs)
  Number of NULLs
  (useful for selectivity estimations for IS [NOT] NULL)
  Value distribution histogram
  (allows selectivity estimations for non-uniform value distributions)
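The difference between the uniform assumption and a histogram-based estimate can be sketched as follows. The frequency histogram is modelled as a plain dictionary of value counts, which is a simplification chosen for illustration; the function names and the sample counts are hypothetical and do not describe an actual Firebird structure.

```python
def uniform_equality_selectivity(distinct_values: int) -> float:
    """Selectivity of 'col = value' when all values are assumed equally frequent."""
    return 1.0 / distinct_values


def histogram_equality_selectivity(histogram: dict, value) -> float:
    """Selectivity of 'col = value' taken from per-value row counts."""
    total = sum(histogram.values())
    return histogram.get(value, 0) / total if total else 0.0


def is_null_selectivity(null_count: int, row_count: int) -> float:
    """Selectivity of 'col IS NULL' from the stored number of NULLs."""
    return null_count / row_count if row_count else 0.0


# Hypothetical skewed column with four distinct values.
histogram = {"A": 700, "B": 200, "C": 90, "D": 10}
print(uniform_equality_selectivity(len(histogram)))
print(histogram_equality_selectivity(histogram, "A"))
print(histogram_equality_selectivity(histogram, "D"))
print(is_null_selectivity(50, 1000))
```

Under the uniform assumption every value gets the same estimate, so the optimizer cannot tell the frequent value 'A' from the rare value 'D'; the histogram separates them.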
Sample Histograms
[Figure: sample histograms for (1) a non-selective column and (2) a selective column. Recoverable labels: the values 'A', 'B', 'C', 'D'; an axis from 0 to 500 in steps of 25; the series 1, 5, 5, 5, 10, 20, 50, 50, 80, 100.]
Decisions Based on Advanced Statistics
Sort Aggregation vs Hash Aggregation
  Selectivity of columns being grouped by
Loop Join vs Merge Join vs Hash Join
  Cardinality of tables and filtering predicates
Index Usage
  Number of NULLs or histogram
Index Navigation vs External Sorting
  Clustering factor
The Firebird Project
www.firebirdsql.org