h.lu/hkust l03: physical database design (i) introduction index selection partitioning &...
TRANSCRIPT
H.Lu/HKUST
L03: Physical Database Design (I)
Introduction Index Selection Partitioning & denormalization
H.Lu/HKUST L03: Physical Database Design -- 2
Overview
After logical design, we have a “good” logical schema that describes the data to be stored and related constraints;
The next step is physical database design, a process of producing a physical schema, i.e., a description of the implementation of the database on secondary storage using the target DBMS
The main tasks including: choosing indexes, make clustering decisions, and to refine the conceptual and external schemas (if necessary) to meet performance goals.
We must begin by understanding the workload: The most important queries and how often they arise. The most important updates and how often they arise. The desired performance for these queries and updates.
H.Lu/HKUST L03: Physical Database Design -- 3
Understanding the Workload For each query in the workload:
Which relations does it access? Which attributes are retrieved? Which attributes are involved in selection/join conditions?
How selective are these conditions likely to be? For each update in the workload:
Which attributes are involved in selection/join conditions? How selective are these conditions likely to be?
The type of update (INSERT/DELETE/UPDATE), and the attributes that are affected.
H.Lu/HKUST L03: Physical Database Design -- 4
T-R Cross-Reference
A. Enter the details of a new project and the employees assigned to it.
B. Enter the details of a new employee and his/her assignment
C. Change the salary for a particular job title.
D. List the project number, name and the average salary of employees working on the project
E. List the name of employees and the name of projects they have been assigned to.
Emplyee(eno, enam, title)
Pay (title, salary)
Project(pno, pname, budget, loc)
Assignment (eno, pno, duration)
I R U D I R U D I R U D I R U D I R U DEMP X X XPAY X X XPROJ X X XASSI GN X X X X
QEQA QB QC QD
H.Lu/HKUST L03: Physical Database Design -- 5
Types of Queries (I)
Point query: returns at most one record (or part of a record) based on an equality selection.SELECT name FROM Employee WHERE eid=8788
Multipoint query: may return several records based on an equality selectionSELECT name FROM Employee WHERE dept=Sales
Range query: returns a set of records whose values lie in an interval of half interval for the selected attribute SELECT name FROM Employee WHERE salary > 15000
Prefix match query on an attribute of sequence of attributes X specifies only a prefix of XLname = ‘Lu’ Fname LIKE ‘Ho’
H.Lu/HKUST L03: Physical Database Design -- 6
Types of Queries
Extremal query: returns a set of records whose value on some attribute is minimum or maximum.
Ordering query: Display a set of records in the order of the value of some attribute (s)
Grouping query: partition the results of a query into groups
Join query: multi-table queries Equality join Non-equality join
H.Lu/HKUST L03: Physical Database Design -- 7
Cost Estimation
Each query is associated a cost for its execution. The main concern of physical database design is
performance, which involves cost estimation. Cost estimation:
Must estimate cost of each operation in a query execution plan tree.
• Depends on input cardinalities and processing method. Must estimate size of result for each operation in tree!
• Use information about the input relations.• For selections and joins, assume independence of
predicates.
H.Lu/HKUST L03: Physical Database Design -- 8
Statistics and Catalogs
Need information about the relations and indexes involved. Catalogs typically contain at least: # tuples (NTuples) and # pages (NPages) for each relation. # distinct key values (NKeys) and NPages for each index. Index height, low/high key values (Low/High) for each tree
index. Catalogs updated periodically.
Updating whenever data changes is too expensive; lots of approximation anyway, so slight inconsistency ok.
More detailed information (e.g., histograms of the values in some field) are sometimes stored.
H.Lu/HKUST L03: Physical Database Design -- 9
Selectivity of Predicates
Consider a query block: Maximum # tuples in result is the product of the
cardinalities of relations in the FROM clause. Selectivity, s, associated with each term reflects the
impact of the term in reducing result size. Term col=value s = 1 / number of distinct values Term col > value s = (High -value)/(High - Low) Mutiple terms: product of each term Implicit assumption that terms are independent!
SELECT attribute listFROM relation listWHERE term1 AND ... AND termk
H.Lu/HKUST L03: Physical Database Design -- 10
Joins
Theta-join :
Result schema same as that of cross-product. Fewer tuples than cross-product, might be able to compute more
efficiently Equi-Join: A special case of condition join where the condition c
contains only equalities.
Result schema similar to cross-product, but only one copy of fields for which equality is specified.
Natural Join: Equijoin on all common fields.
R c S c R S ( )
. .S id RidS R
sidS R
H.Lu/HKUST L03: Physical Database Design -- 11
Join Selectivity
Join selectivity of
Estimate join selectivity R.A is a key
R.A is a key, and S.B is a foreign key reference to R
Neither the above two is true
. .R A S BR S
. .R A S BR SJS
R S
H.Lu/HKUST L03: Physical Database Design -- 12
Decisions to Make Index selection
What indexes should we create?• Which relations should have indexes? What field(s) should be
the search key? Should we build several indexes? For each index, what kind of an index should it be?
• Clustered? Hash/tree? Dynamic/static? Dense/sparse? Should we make changes to the conceptual schema?
Consider alternative normalized schemas? (Remember, there are many choices in decomposing into BCNF, etc.)
Should we ``undo’’ some decomposition steps and settle for a lower normal form? (Denormalization.)
Vertical/Horizontal partitioning, replication, views ...
H.Lu/HKUST
L03: Physical Database Design (I)
Introduction Index Selection Partitioning & denormalization
H.Lu/HKUST L03: Physical Database Design -- 14
Indexes
A Heap file allows us to retrieve records: by specifying the rid, or by scanning all records sequentially
Sometimes, we want to retrieve records by specifying the values in one or more fields, e.g., Find all students in the “CS” department Find all students with a gpa > 3
Indexes are file structures that enable us to answer such value-based queries efficiently.
H.Lu/HKUST L03: Physical Database Design -- 15
Indexes
An index on a file speeds up selections on the search key fields for the index. Any subset of the fields of a relation can be the search
key for an index on the relation. Search key is not the same as key (minimal set of
fields that uniquely identify a record in a relation). An index contains a collection of data entries, and
supports efficient retrieval of all data entries k* with a given key value k.
H.Lu/HKUST L03: Physical Database Design -- 16
Alternatives for Data Entry k* in Index
Three alternatives: Data record with key value k <k, rid of data record with search key value k> <k, list of rids of data records with search key k>
Choice of alternative for data entries is orthogonal to the indexing technique used to locate data entries with a given key value k. Examples of indexing techniques: B+ trees, hash-based
structures Typically, index contains auxiliary information that directs
searches to the desired data entries
H.Lu/HKUST L03: Physical Database Design -- 17
Alternatives for Data Entries (Contd.)
Alternative 1: If this is used, index structure is a file organization for
data records (like Heap files or sorted files). At most one index on a given collection of data
records can use Alternative 1. (Otherwise, data records duplicated, leading to redundant storage and potential inconsistency.)
If data records very large, # of pages containing data entries is high. Implies size of auxiliary information in the index is also large, typically.
H.Lu/HKUST L03: Physical Database Design -- 18
Alternatives for Data Entries (Contd.)
Alternatives 2 and 3: Data entries typically much smaller than data records. So,
better than Alternative 1 with large data records, especially if search keys are small. (Portion of index structure used to direct search is much smaller than with Alternative 1.)
If more than one index is required on a given file, at most one index can use Alternative 1; rest must use Alternatives 2 or 3.
Alternative 3 more compact than Alternative 2, but leads to variable sized data entries even if search keys are of fixed length.
H.Lu/HKUST L03: Physical Database Design -- 19
Example B+-Tree Index
Balanced tree Leaf nodes contain keys and pointers pointing to data pages Internal nodes contain keys and pointers: Each pointer Pi,
associated to a key Ki, points to a subtree in which all key values k lie between Ki and Ki+1, P0(Pm) points to a subtree in which all key values are less then K0 (large than Km),
Each node must at least be half full
Root
14* 16* 33* 34* 38* 39*22* 27* 29*17* 18* 20* 21*7*5* 8*2* 3*
135 3020 22
17
H.Lu/HKUST L03: Physical Database Design -- 20
B+ Trees in Practice
Typical order: 100. Typical fill-factor: 67%. average fanout = 133
Typical capacities: Height 4: 1334 = 312,900,700 records Height 3: 1333 = 2,352,637 records
Can often hold top levels in buffer pool: Level 1 = 1 page = 8 Kbytes Level 2 = 133 pages = 1 Mbyte Level 3 = 17,689 pages = 133 MBytes
H.Lu/HKUST L03: Physical Database Design -- 21
Key Length and Fanout
The leaf pages contain 40 million key-pointer pairsPage size = 4K,
length of pointer = 6 bytes,
length of key = 4 bytes
Tree level: 3 levels
How about the key length is 94 bytes?
H.Lu/HKUST L03: Physical Database Design -- 22
Static Hashing Index # primary pages fixed, allocated sequentially, never
de-allocated; overflow pages if needed. h(k) mod M = bucket to which data entry with key k
belongs. (M = # of buckets)
h(key) mod N
hkey
Primary bucket pages Overflow pages
20
N-1
H.Lu/HKUST L03: Physical Database Design -- 23
Extensible Hashing
20*
00
01
10
11
2 2
2
2
LOCAL DEPTH 2
2
DIRECTORY
GLOBAL DEPTHBucket A
Bucket B
Bucket C
Bucket D
Bucket A2(`split image'of Bucket A)
1* 5* 21*13*
32*16*
10*
15* 7* 19*
4* 12*
19*
2
2
2
000
001
010
011
100
101
110
111
3
3
3DIRECTORY
Bucket A
Bucket B
Bucket C
Bucket D
Bucket A2(`split image'of Bucket A)
32*
1* 5* 21*13*
16*
10*
15* 7*
4* 20*12*
LOCAL DEPTH
GLOBAL DEPTH
H.Lu/HKUST L03: Physical Database Design -- 24
Index Classification
Primary vs. secondary: If search key contains primary key, then called primary index. Unique index: Search key contains a candidate key.
Clustered vs. unclustered: If order of data records is the same as, or `close to’, order of data entries, then called clustered index. Alternative 1 implies clustered, but not vice-versa. A file can be clustered on at most one search key. Cost of retrieving data records through index varies greatly
based on whether index is clustered or not!
H.Lu/HKUST L03: Physical Database Design -- 25
Clustered vs. Unclustered Index Suppose that Alternative (2) is used for data entries, and that
the data records are stored in a Heap file. To build clustered index, first sort the Heap file (with some
free space on each page for future inserts). Overflow pages may be needed for inserts. (Thus, order of data
recs is `close to’, but not identical to, the sort order.)
direct search for
(Index File)
(Data file)
Data Records
Data entries
Index entries
Data entries
data entriesCLUSTERED
Data Records
UNCLUSTERED
H.Lu/HKUST L03: Physical Database Design -- 26
Index Classification (Contd.)
Dense vs. Sparse: If there is at least one data entry per search key value (in some data record), then dense. Alternative 1 always leads
to dense index. Every sparse index is
clustered! Sparse indexes are smaller;
however, some useful optimizations are based on dense indexes.
Ashby, 25, 3000
Smith, 44, 3000
Ashby
Cass
Smith
22
25
30
40
44
44
50
Sparse Indexon
Name Data File
Dense Indexon
Age
33
Bristow, 30, 2007
Basu, 33, 4003
Cass, 50, 5004
Tracy, 44, 5004
Daniels, 22, 6003
Jones, 40, 6003
H.Lu/HKUST L03: Physical Database Design -- 27
Index Classification (Contd.) Composite Search Keys: Search on a
combination of fields. Equality query: Every field value
is equal to a constant value. E.g. wrt <sal,age> index:
• age=20 and sal =75 Range query: Some field value is
not a constant. E.g.:• age =20; or age=20 and sal > 10
Data entries in index sorted by search key to support range queries. Lexicographic order, or Spatial order.
sue 13 75
bob
cal
joe 12
10
20
8011
12
name age sal
<sal, age>
<age, sal> <age>
<sal>
12,20
12,10
11,80
13,75
20,12
10,12
75,13
80,11
11
12
12
13
10
20
75
80
Data recordssorted by name
Data entries in indexsorted by <sal,age>
Data entriessorted by <sal>
Examples of composite keyindexes using lexicographic order.
H.Lu/HKUST L03: Physical Database Design -- 28
Clustered Index and Queries
Clustered index can be sparse: save space and number of disk access
Good for multipoint queries, i.e., equality accesses to nonunique fields – the results will be in consecutive pages.
Good for equality join on an attribute with few distinct values;
If both relations have clustering index on join attributes, can use merge join
H.Lu/HKUST L03: Physical Database Design -- 29
Non-clustered index
Non-clustered index are best if they cover the query;
• Query can be answered using the index only good if each query retrieves significantly fewer
records than there are pages in the file• Significant: sequential scan is much faster than random
scan
• Useful for point queries
• For multipoint queries, may or may not help
H.Lu/HKUST L03: Physical Database Design -- 30
Choice of Indexes
One approach: consider the most important queries in turn. Consider the best plan using the current indexes, and see if a better plan is possible with an additional index. If so, create it.
Before creating an index, must also consider the impact on updates in the workload! Trade-off: indexes can make queries go faster,
updates slower. Require disk space, too.
H.Lu/HKUST L03: Physical Database Design -- 31
Issues to Consider in Index Selection
Attributes mentioned in a WHERE clause are candidates for index search keys. Exact match condition suggests hash index. Range query suggests tree index.
• Clustering is especially useful for range queries, although it can help on equality queries as well in the presence of duplicates.
Try to choose indexes that benefit as many queries as possible. Since only one index can be clustered per relation, choose it based on important queries that would benefit the most from clustering.
H.Lu/HKUST L03: Physical Database Design -- 32
Issues in Index Selection (Contd.) Multi-attribute search keys should be considered when a
WHERE clause contains several conditions. If range selections are involved, order of attributes should be
carefully chosen to match the range ordering. Such indexes can sometimes enable index-only strategies for
important queries.• For index-only strategies, clustering is not important!
When considering a join condition: Hash index on inner is very good for Index Nested Loops.
• Should be clustered if join column is not key for inner, and inner tuples need to be retrieved.
Clustered B+ tree on join column(s) good for Sort-Merge.
H.Lu/HKUST L03: Physical Database Design -- 33
Example 1
Hash index on D.dname supports ‘Toy’ selection. Given this, index on D.dno is not needed.
Hash index on E.dno allows us to get matching (inner) Emp tuples for each selected (outer) Dept tuple.
What if WHERE included: `` ... AND E.age=25’’ ? Could retrieve Emp tuples using index on E.age, then join with
Dept tuples satisfying dname selection. Comparable to strategy that used E.dno index.
So, if E.age index is already created, this query provides much less motivation for adding an E.dno index.
SELECT E.ename, D.mgrFROM Emp E, Dept DWHERE D.dname=‘Toy’ AND E.dno=D.dno
H.Lu/HKUST L03: Physical Database Design -- 34
Example 2
Clearly, Emp should be the outer relation. Suggests that we build a hash index on D.dno.
What index should we build on Emp? B+ tree on E.sal could be used, OR an index on E.hobby could
be used. Only one of these is needed, and which is better depends upon the selectivity of the conditions.
• As a rule of thumb, equality selections more selective than range selections.
As both examples indicate, our choice of indexes is guided by the plan(s) that we expect an optimizer to consider for a query. Have to understand optimizers!
SELECT E.ename, D.mgrFROM Emp E, Dept DWHERE E.sal BETWEEN 10000 AND 20000 AND E.hobby=‘Stamps’ AND E.dno=D.dno
H.Lu/HKUST L03: Physical Database Design -- 35
Examples of Clustering B+ tree index on E.age can be
used to get qualifying tuples. How selective is the condition? Is the index clustered?
Consider the GROUP BY query. If many tuples have E.age > 10,
using E.age index and sorting the retrieved tuples may be costly.
Clustered E.dno index may be better!
Equality queries and duplicates: Clustering on E.hobby helps!
SELECT E.dnoFROM Emp EWHERE E.age>40
SELECT E.dno, COUNT (*)FROM Emp EWHERE E.age>10GROUP BY E.dno
SELECT E.dnoFROM Emp EWHERE E.hobby=Stamps
H.Lu/HKUST L03: Physical Database Design -- 36
Clustering and Joins
Clustering is especially important when accessing inner tuples in INL. Should make index on E.dno clustered.
Suppose that the WHERE clause is instead:WHERE E.hobby=‘Stamps AND E.dno=D.dno If many employees collect stamps, Sort-Merge join may be worth
considering. A clustered index on D.dno would help. Summary: Clustering is useful whenever many tuples are to be
retrieved.
SELECT E.ename, D.mgrFROM Emp E, Dept DWHERE D.dname=‘Toy’ AND E.dno=D.dno
H.Lu/HKUST L03: Physical Database Design -- 37
Multi-Attribute Index Keys
To retrieve Emp records with age=30 AND sal=4000, an index on <age,sal> would be better than an index on age or an index on sal. Such indexes also called composite or concatenated indexes. Choice of index key orthogonal to clustering etc.
If condition is: 20<age<30 AND 3000<sal<5000: Clustered tree index on <age,sal> or <sal,age> is best.
If condition is: age=30 AND 3000<sal<5000: Clustered <age,sal> index much better than <sal,age> index!
Composite indexes are larger, updated more often.
H.Lu/HKUST L03: Physical Database Design -- 38
Covering Index A number of
queries can be answered without retrieving any tuples from one or more of the relations involved if a suitable index is available.
SELECT D.mgrFROM Dept D, Emp EWHERE D.dno=E.dno
SELECT D.mgr, E.eidFROM Dept D, Emp EWHERE D.dno=E.dno
SELECT E.dno, COUNT(*)FROM Emp EGROUP BY E.dno
SELECT E.dno, MIN(E.sal)FROM Emp EGROUP BY E.dno
SELECT AVG(E.sal)FROM Emp EWHERE E.age=25 AND E.sal BETWEEN 3000 AND 5000
<E.dno>
<E.dno,E.eid>
Tree index!
<E.dno>
<E.dno,E.sal>
Tree index!
<E. age,E.sal> or<E.sal, E.age>
Tree!
H.Lu/HKUST L03: Physical Database Design -- 39
Cost Based Index Selection
SQL Server 7 and above Input: schema (with existing index), workload Output: recommendations adding/removing indexes
Basic approach: First step: pick the best index for each SQL statement in the
workload by enumerating relevant indexes on attributes, and computing the cost of execution plans that would use the index
Second step: computes the cost of combining the indexes picked in the first step. The subset of indexes with the lowest combined cost is used for the final recommendation.
Details: VLDB 02 paper
H.Lu/HKUST L03: Physical Database Design -- 40
Summary Understanding the nature of the workload for the application,
and the performance goals, is essential to developing a good design. What are the important queries and updates? What
attributes/relations are involved? Indexes support efficient retrieval of records based on the values in some fields.
Index is a collection of data entries plus a way to quickly find entries with given key values. Data entries can be actual data records, <key, rid> pairs, or <key, rid-list> pairs. Choice orthogonal to indexing technique used to locate data
entries with a given key value. Indexes can be classified as clustered vs. unclustered, primary
vs. secondary, and dense vs. sparse. Differences have important consequences for utility/performance.
H.Lu/HKUST L03: Physical Database Design -- 41
Summary (Contd.) Several indexes can be built on a given file of data records,
each with a different search key. Indexes must be chosen to speed up important queries (and
perhaps some updates!). Index maintenance overhead on updates to key fields. Choose indexes that can help many queries, if possible. Build indexes to support index-only strategies. Clustering is an important decision; only one index on a given
relation can be clustered! Order of fields in composite index key can be important.
Static indexes may have to be periodically re-built. Statistics have to be periodically updated.