Ch.12 Indexing and Hashing - Tarleton State University
![Page 1: Ch.12 Indexing and Hashing - Tarleton State University](https://reader031.vdocuments.net/reader031/viewer/2022012423/617855503895815f901616db/html5/thumbnails/1.jpg)
Ch.12 Indexing and Hashing
Common DB operations we want to support: random lookup + sequential scan
READ p.482 → Five factors for evaluating indexing/hashing algorithms
Insertion
Deletion
Concepts:
Classifications:
Clustered (a.k.a. primary) vs. non-clustered (a.k.a. secondary)
Dense vs. sparse
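To make the dense/sparse distinction concrete, here is a minimal Python sketch. The file is modelled as a list of blocks sorted on the search key; the key values, block layout and function names are all made up for illustration.

```python
from bisect import bisect_right

# A file sorted on the search key, stored as blocks of records (hypothetical data).
blocks = [
    [10, 20, 30],
    [40, 50, 60],
    [70, 80, 90],
]

# Dense index: one entry per search-key value -> (key, block number).
dense = [(key, b) for b, block in enumerate(blocks) for key in block]

# Sparse index: one entry per block -> (first key in the block, block number).
sparse = [(block[0], b) for b, block in enumerate(blocks)]


def dense_lookup(key):
    """The index alone tells us whether the key exists and in which block."""
    keys = [k for k, _ in dense]
    pos = bisect_right(keys, key) - 1
    if pos >= 0 and keys[pos] == key:
        return dense[pos][1]
    return None


def sparse_lookup(key):
    """Find the last entry whose first key is <= key, then scan that block."""
    first_keys = [k for k, _ in sparse]
    pos = bisect_right(first_keys, key) - 1
    if pos < 0:
        return None  # key is smaller than every key in the file
    block_no = sparse[pos][1]
    return block_no if key in blocks[block_no] else None


print(len(dense), len(sparse))
print(dense_lookup(50), sparse_lookup(50))
```

The sparse index is smaller, but it only works because the file itself is sorted on the search key (i.e. the index is clustered).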
Examples:
Dense:
Sparse:
Clustered or non-clustered?
Other minor practical issues: Overflow blocks
Long records that extend over multiple blocks
Duplicates that extend over multiple blocks
Major practical issue: For a large table, the index itself will be large!
Solutions: Store index in RAM
Store index on disk → how many blocks?
o Since the index is sorted → binary (logarithmic) search → about ⌈log2(b)⌉ disk accesses for an index of b blocks
o Logarithmic search vs. linear search, worst-case
Multi-level index → example on next page
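A quick way to compare the three options numerically. This is only a sketch: it assumes one disk access per index block read, and the block counts and entries-per-block figures are made-up numbers.

```python
import math


def linear_search_accesses(b):
    """Worst case: every one of the b index blocks is read."""
    return b


def binary_search_accesses(b):
    """Worst case for binary search over b sorted index blocks."""
    return math.ceil(math.log2(b)) if b > 1 else 1


def multilevel_accesses(b, entries_per_block):
    """One block read per level of a multi-level (sparse-on-sparse) index.

    Each outer level needs one entry per block of the level below it,
    so the number of blocks shrinks by a factor of entries_per_block.
    """
    levels = 1
    while b > 1:
        b = -(-b // entries_per_block)  # ceiling division
        levels += 1
    return levels


b = 10_000  # hypothetical number of index blocks
print(linear_search_accesses(b))
print(binary_search_accesses(b))
print(multilevel_accesses(b, entries_per_block=100))
```

With these assumed numbers the multi-level index needs 3 block reads, versus 14 for binary search and 10,000 for a linear scan.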
Index updates:
Single-level
o Insertion
dense
sparse
o Deletion
Dense
Sparse
Multi-level ……..
READ and take notes: Section 12.2.3 → Detailed algorithms for the above
What if the file is not ordered on the desired search key?
Secondary index
All secondary indices must be dense!
Problem with all index-sequential files:
Both random lookups and sequential scans get slower after many
insertions and deletions, due to overflow blocks
o Solution: reorganize file periodically → O(K) linear time
o Solution: leave room to grow → wasted memory
o Use a different type of index!
TREES!
Introduction to Section 12.3 – Trees
Fundamental benefit of trees: LOGARITHMIC HEIGHT
N = 15 = 2^4 – 1
H = 4 = log2(N + 1)
Fundamental problem of trees: BALANCING
---------------------------------------------------------------------------------------------------------
Quiz:
1] List the 3 classification criteria we covered for indices.
2] A further classification criterion for indices is whether their search key (SK) is
a candidate key (CK) of the table or not.
If SK ≠ CK, then we have to solve this problem: how does a unique index entry
point to multiple tuples?
With clustered indices, we can simply point to the tuple containing the first
occurrence of SK:
Explain why this works!
Does this solution work for unclustered (secondary) indices? Explain why or
why not.
12.3 B+ Trees for index files
B is for balanced … but there are many definitions of balanced!
Properties:
Each key stored in the node is the minimal key in the right sub-tree
Example:
The non-leaf levels form a hierarchy of sparse indices!
Logarithmic height property:
If there are K search-key values in the file, H ≤ ⌈log_⌈n/2⌉(K)⌉. Explain this for a BT
Why is it important?
Random searches can be performed in logarithmic time b/c the
height of the tree needs to be traversed only once!
(algorithm below)
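A minimal Python sketch of the root-to-leaf search, under the convention stated earlier (each key in a non-leaf node is the minimal key of the sub-tree to its right). The `Node` class, the tiny two-level tree and the record labels are made up for illustration; this is not the textbook's pseudocode.

```python
from bisect import bisect_right
from dataclasses import dataclass, field


@dataclass
class Node:
    keys: list
    children: list = field(default_factory=list)  # empty for leaf nodes
    values: list = field(default_factory=list)    # record pointers, leaves only


def search(root, key):
    """Return (value or None, number of nodes visited)."""
    node, visited = root, 1
    while node.children:
        # keys <= key lie to the left, so bisect_right picks the child to follow
        node = node.children[bisect_right(node.keys, key)]
        visited += 1
    if key in node.keys:
        return node.values[node.keys.index(key)], visited
    return None, visited


# Hypothetical tree: one root above three leaves.
leaves = [
    Node(["Brighton", "Downtown"], values=["rec1", "rec2"]),
    Node(["Mianus", "Perryridge"], values=["rec3", "rec4"]),
    Node(["Redwood", "Round Hill"], values=["rec5", "rec6"]),
]
root = Node(["Mianus", "Redwood"], children=leaves)

print(search(root, "Perryridge"))
print(search(root, "Clearview"))
```

Each loop iteration reads exactly one node, so the number of node reads equals the height of the tree, whether or not the key is found.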
“Back-of-the-envelope” estimate:
------------------------------------------------------------------------
Week 14, Lect 3/3
Quiz: A DB file has a B+ tree index.
The node size in the B+ tree is 4 KB, the search keys are 24-byte strings, and each
pointer takes 8 bytes. What is the maximum # of pointers n that can be
stored in a node?
What is the minimum?
If the file has 5 million search keys, what is the number of disk accesses when we
search for a random key?
What is the number of disk accesses when we access all keys sequentially?
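One possible way to work through the arithmetic, as a hedged sketch rather than an official answer key. Assumptions: 4 KB = 4096 bytes, a node holds n pointers and n − 1 keys with no header overhead, every node read costs one disk access (nothing is cached), and accesses to the data file itself are not counted.

```python
NODE_BYTES = 4 * 1024
KEY_BYTES = 24
PTR_BYTES = 8
K = 5_000_000  # search keys in the file

# n pointers and n - 1 keys must fit in a node: 8n + 24(n - 1) <= 4096
n_max = (NODE_BYTES + KEY_BYTES) // (KEY_BYTES + PTR_BYTES)
n_min = -(-n_max // 2)  # ceil(n / 2): minimum pointers in a non-root internal node


def height_bound(num_keys, fanout):
    """Smallest h with fanout**h >= num_keys, i.e. ceil(log_fanout(num_keys))."""
    h, reach = 0, 1
    while reach < num_keys:
        reach *= fanout
        h += 1
    return h


# Random lookup: one node read per level, worst case when nodes are half full.
lookup_accesses = height_bound(K, n_min)

# Sequential access: walk the chained leaves; a leaf holds between
# ceil((n - 1) / 2) and n - 1 keys.
leaves_fewest = -(-K // (n_max - 1))
leaves_most = -(-K // (-(-(n_max - 1) // 2)))

print(n_max, n_min, lookup_accesses, leaves_fewest, leaves_most)
```

Under these assumptions: n = 128 at most and 64 at least, a random lookup reads at most 4 nodes, and a full sequential pass reads between 39,371 and 78,125 leaf nodes (plus the initial descent to the leftmost leaf).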
Insertions and deletions to the main file can be handled efficiently,
as the index can be reorganized in logarithmic time.
Important exceptions:
o When inserting, a node becomes too big → split nodes
o When deleting, a node becomes too small → merge nodes
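The two exceptions can be sketched for a single leaf as follows. This is a simplified illustration, not the full algorithm: it ignores the parent update and the leaf chain, the function names are invented, and n = 3 is an assumed node size. It follows the usual rule that a leaf holds between ⌈(n−1)/2⌉ and n − 1 keys, and that on a split the first ⌈n/2⌉ keys stay in the original leaf.

```python
from bisect import insort
from math import ceil


def insert_into_leaf(keys, new_key, n):
    """Insert new_key into a sorted leaf with room for n - 1 keys.

    Returns (left, right, separator). If no split is needed, right and
    separator are None. The separator is the key to insert into the parent.
    """
    keys = list(keys)
    insort(keys, new_key)
    if len(keys) <= n - 1:
        return keys, None, None
    cut = ceil(n / 2)
    left, right = keys[:cut], keys[cut:]
    return left, right, right[0]


def leaf_underfull(keys, n):
    """True if the leaf has fewer than ceil((n - 1) / 2) keys after a deletion."""
    return len(keys) < ceil((n - 1) / 2)


# Assumed n = 3: a leaf holds at most 2 keys.
print(insert_into_leaf(["Brighton", "Downtown"], "Clearview", n=3))
print(leaf_underfull([], n=3))
```

In the first call the leaf overflows and splits, and "Downtown" is the key that would go up to the parent; in the second, an emptied leaf is underfull and has to be merged with a sibling or refilled from one.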
Insert “Clearview”
Delete “Downtown”
It’s not always possible to merge nodes
Delete “Perryridge” → Node a is left with too few pointers (remember ⌈n/2⌉)
Solution: merge it w/its sibling node → root now has too few pointers → simply
eliminate root and merged node becomes new root!
It’s not always possible to merge nodes!
What if the sibling is (almost) full?
Solution: redistribute the pointers between siblings.
Delete “Perryridge” → As before, the node has too few pointers, but its sibling now has too
many for a merge!
The underfull node borrows the rightmost pointer of its sibling.
The rightmost key of the sibling can always overwrite the leftmost one of its own parent (root here)!
READ
12.3.4 – B+ Tree File Organization
12.3.5 – Indexing strings
SKIP
12.4 – B-Trees
12.5 – Multiple-Key Access
12.6 Static Hashing
Hash = implicit index
Notation: set of all search keys K
set of all “bucket” addresses B (buckets are disk blocks)
hash function h is a function from K to B → h(Ki)
A bucket may contain tuples with different search keys → after being read from the disk,
the entire bucket must be searched.
Worst hash function ever: all search keys are mapped into the same bucket!
Properties of a good hash function:
Uniform distribution
Random distribution (Why?)
o Typically, h() operates on the low-level binary representation of the search key
READ example p.508 (31 is prime!)
-----------------------------------------------------------------------------------------------
Week 15, Lect.1/3 (last!)
Quiz
Practice exercise 12.3 (a): Construct a B+ tree from empty, by inserting the
following values in order:
(2, 3, 5, 7, 11, 17, 19, 23, 29, 31)
The max. # of pointers is n = 4.
Practice exercise 12.4 (d): From the previous tree, delete 23.
Back to hashing …
p.508 “The function can be implemented efficiently …” → Horner’s algorithm!
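A small Python sketch of a base-31 string hash evaluated with Horner's rule, compared against the direct polynomial. The function names and the bucket count are assumptions for illustration. Python integers do not overflow, so the fixed-width integer arithmetic a real implementation would rely on is not reproduced here.

```python
def horner_hash(s, num_buckets, base=31):
    """h = s[0]*base^(n-1) + ... + s[n-1], evaluated left to right, mod num_buckets."""
    h = 0
    for byte in s.encode("utf-8"):  # operate on the binary representation of the key
        h = h * base + byte         # one multiply and one add per character
    return h % num_buckets


def direct_hash(s, num_buckets, base=31):
    """Same polynomial, but with an explicit power computed for every term."""
    data = s.encode("utf-8")
    n = len(data)
    total = sum(byte * base ** (n - 1 - i) for i, byte in enumerate(data))
    return total % num_buckets


for name in ["Brighton", "Downtown", "Mianus", "Perryridge"]:
    print(name, horner_hash(name, num_buckets=10))
```

Both functions return the same bucket; Horner's rule simply avoids recomputing the powers of 31.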
12.6.2 Bucket overflow
Even if the hash function is perfect (i.e. uniform/random), overflow can still occur
due to:
the growth of the DB!
multiple records w/same search key K
Delay overflow by using a fudge factor d → nB = (nr/fr) × (1 + d), where nr = # of records and fr = # of records that fit in a bucket
When overflow happens, use overflow buckets.
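The bucket-count formula as a tiny Python helper. The record count, bucket capacity and the default d = 0.2 are assumed values for illustration only.

```python
import math


def num_buckets(n_r, f_r, d=0.2):
    """n_B = (n_r / f_r) * (1 + d), rounded up to a whole number of buckets.

    n_r: number of records, f_r: records that fit in one bucket,
    d:   fudge factor, i.e. the fraction of extra space left free.
    """
    return math.ceil((n_r / f_r) * (1 + d))


# Hypothetical file: 1000 records, 8 records per bucket, 25% slack.
print(num_buckets(1000, 8, d=0.25))
```

A larger d delays overflow but wastes more space, since about d / (1 + d) of the bucket space starts out empty.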
Hash index w/overflow buckets
Do you see why overflow buckets lead to degraded performance?
Solution …
12.7 Dynamic Hashing
Extendable hashing idea: The hashing function generates a “large” number of
bits b (e.g. 32), but not all of them are being used as bucket addresses. Only i (i <
b) are.
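A compact Python sketch of the idea, assuming the i high-order bits of the hash value index the bucket address table. The class names are invented, and the 8-bit hash values are made up (they are not the values from the handout). Overflow buckets are modelled by simply letting a bucket's list grow when all of its records share one hash value.

```python
class Bucket:
    def __init__(self, depth):
        self.depth = depth  # local depth: how many hash bits this bucket uses
        self.items = []     # (key, value) pairs


class ExtendableHash:
    def __init__(self, hash_fn, bits, capacity=2):
        self.hash_fn, self.bits, self.capacity = hash_fn, bits, capacity
        self.i = 0                 # bits currently used by the address table
        self.table = [Bucket(0)]   # 2**i entries, several may share a bucket

    def _slot(self, h):
        return h >> (self.bits - self.i)  # the i high-order bits of h

    def lookup(self, key):
        bucket = self.table[self._slot(self.hash_fn(key))]
        return [v for k, v in bucket.items if k == key]

    def insert(self, key, value):
        h = self.hash_fn(key)
        while True:
            bucket = self.table[self._slot(h)]
            has_room = len(bucket.items) < self.capacity
            # Splitting cannot separate records with identical hash values,
            # so in that case the bucket overflows instead.
            same_hash = all(self.hash_fn(k) == h for k, _ in bucket.items)
            if has_room or same_hash:
                bucket.items.append((key, value))
                return
            self._split(bucket)

    def _split(self, bucket):
        if bucket.depth == self.i:  # no spare bit left: double the table
            self.table = [b for b in self.table for _ in (0, 1)]
            self.i += 1
        bucket.depth += 1
        new = Bucket(bucket.depth)
        shift = self.i - bucket.depth
        for slot, b in enumerate(self.table):
            if b is bucket and (slot >> shift) & 1:  # newly used bit is 1
                self.table[slot] = new
        old_items, bucket.items = bucket.items, []
        for k, v in old_items:  # rehash the records of the split bucket
            self.table[self._slot(self.hash_fn(k))].items.append((k, v))


# Made-up 8-bit hash values, for illustration only.
HASHES = {"Brighton": 0b00101101, "Downtown": 0b10100011,
          "Mianus": 0b11000111, "Perryridge": 0b11110001}
eh = ExtendableHash(HASHES.__getitem__, bits=8, capacity=2)
names = ["Brighton", "Downtown", "Downtown", "Mianus",
         "Perryridge", "Perryridge", "Perryridge"]
for rec_no, name in enumerate(names):
    eh.insert(name, rec_no)
print(eh.i, len(eh.table), eh.lookup("Perryridge"))
```

With these assumed hash values the table grows from i = 0 to i = 3, and the third Perryridge record overflows its bucket because splitting cannot separate identical hash values.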
Nice example in text pp.515-517
We have the following branch names and the associated hash values (handout):
Buckets can hold only 2 records.
We start w/empty hash table, i = 0 bits → 2^0 = 1 bucket
Insert Brighton and two Downtown:
Insert Mianus:
Insert three Perryridge:
12.8 Comparison
Ordered indexing (sequential or B+ tree) vs. hashing:
Performance depends on what type of queries we perform most often:
Lookup of individual values vs. range queries
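A toy Python illustration of the trade-off, using in-memory stand-ins: a `dict` plays the role of a hash index and a sorted list plays the role of the leaf level of an ordered index. The branch balances are made-up data.

```python
from bisect import bisect_left, bisect_right

# Made-up data: branch name -> balance.
balances = {"Brighton": 750, "Downtown": 500, "Mianus": 700,
            "Perryridge": 400, "Redwood": 350}

hash_index = dict(balances)       # point lookups in expected constant time
ordered_index = sorted(balances)  # keys kept in search-key order


def range_via_ordered(lo, hi):
    """Two binary searches, then read the contiguous run in between."""
    return ordered_index[bisect_left(ordered_index, lo):bisect_right(ordered_index, hi)]


def range_via_hash(lo, hi):
    """Hashing scatters neighbouring keys, so every key has to be examined."""
    return sorted(k for k in hash_index if lo <= k <= hi)


print(hash_index["Mianus"])
print(range_via_ordered("D", "N"))
print(range_via_hash("D", "N"))
```

Both range functions return the same keys, but only the ordered version avoids touching the whole index.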
End of material required for final.