fall 2004ceng 3511 hashing reference: chapters: 11,12
Post on 20-Dec-2015
227 views
TRANSCRIPT
FALL 2004 CENG 351 1
Hashing
Reference: Chapters: 11,12
FALL 2004 CENG 351 2
Motivation• The primary goal is to locate the desired record in
a single access of disk.– Sequential search: O(N)
– B+ trees: O(logk N)
– Hashing: O(1)
• In hashing, the key of a record is transformed into an address and the record is stored at that address.
• Hash-based indexes are best for equality selections. Cannot support range searches.
• Static and dynamic hashing techniques exist.
FALL 2004 CENG 351 3
Hash-based Index
• Data entries are kept in buckets (an abstract term)• Each bucket is a collection of one primary page and
zero or more overflow pages.• Given a search key value, k, we can find the bucket
where the data entry k* is stored as follows:– Use a hash function, denoted by h
– The value of h(k) is the address for the desired bucket. h(k) should distribute the search key values uniformly over the collection of buckets
FALL 2004 CENG 351 4
Design Factors
• Bucket size: the number of records that can be held at the same address.
• Loading factor: the ratio of the number of records put in the file to the total capacity of the buckets.
• Hash function: should evenly distribute the keys among the addresses.
• Overflow resolution technique.
FALL 2004 CENG 351 5
Hash Functions
• Key mod N:– N is the size of the table, better if it is prime.
• Folding:– e.g. 123|456|789: add them and take mod.
• Truncation: – e.g. 123456789 map to a table of 1000 addresses by
picking 3 digits of the key.
• Squaring:– Square the key and then truncate
• Radix conversion: – e.g. 1 2 3 4 treat it to be base 11, truncate if necessary.
FALL 2004 CENG 351 6
Static Hashing• Primary Area: # primary pages fixed, allocated
sequentially, never de-allocated; (say M buckets).– A simple hash function: h(k) = f(k) mod M
• Overflow area: disjoint from the primary area. It keeps buckets which hold records whose key maps to a full bucket.– Adding the address of an overflow bucket to a primary
area bucket is called chaining.
• Collision does not cause a problem as long as there is still room in the mapped bucket. Overflow occurs during insertion when a record is hashed to the bucket that is already full.
FALL 2004 CENG 351 7
Example• Assume f(k) = k. Let M = 5. So, h(k) = k mod 5• Bucket factor = 3 records.
0 35 60
1 6 46
2 12 57 62
3 33
4 44
17
Primary area overflow
FALL 2004 CENG 351 8
Load Factor (Packing density)
• To limit the amount of overflow we allocate more space to the primary area than we need (i.e. the primary area will be, say, 70% full)
• Load Factor =
=> Lf =
# of records in the file
# of spaces in primary area
n
M * Bkfr
FALL 2004 CENG 351 9
Effects of Lf and Bkfr
• Performance can be enhanced by the choice of bucket size and load factor.
• In general, a smaller load factor means – less overflow and a faster fetch time; – but more wasted space.
• A larger Bkfr means – less overflow in general, – but slower fetch.
FALL 2004 CENG 351 10
Insertion and Deletion
• Insertion: New records are inserted at the end of the chain.
• Deletion: Two ways are possible:1. Mark the record to be deleted2. Consolidate sparse buckets when deleting
records.– In the 2nd approach:
• When a record is deleted, fill its place with the last record in the chain of the current bucket.
• Deallocate the last bucket when it becomes empty.
FALL 2004 CENG 351 11
Problem of Static Hashing
• The main problem with static hashing: the number of buckets is fixed:– Long overflow chains can develop and degrade performance.
– On the other hand, if a file shrinks greatly, a lot of bucket space will be wasted.
• There are some other hashing techniques that allow dynamically growing and shrinking hash index. These include:
– linear hashing
– extendible hashing
FALL 2004 CENG 351 12
Linear Hashing
• It maintains a constant load factor.
• Thus avoids reorganization.
• It does so, by incrementally adding new buckets to the primary area.
• In linear hashing the last bits in the hash number are used for placing the records.
FALL 2004 CENG 351 13
Example
000 8 16 32
001 17 25
010 34 50
011 11 27
100 28 12
101 5
110 14
111 55 15
Last 3 bits
e.g.34: 10001028: 01110013: 001101 21: 010101
Insert: 13, 21, 37
Lf = 15/24
= 63%
FALL 2004 CENG 351 14
Insertion of records
• To expand the table: split an existing bucket denoted by k digits into two buckets using the last k+1 digits.
• e.g.
000
1000
0000
FALL 2004 CENG 351 15
Expanding the table
0000 16 32
001 17 25
010 34 50
011 11 27
100 28 12
101 5 13 21
110 14
111 55 15
1000 8
37
Boundary value
FALL 2004 CENG 351 16
0000 16 32
0001 17
0010 34 50
011 11 27
100 28 12
101 5 13 21
110 14
111 55 15
1000 8
1001 25
1010 26
Boundary value
37
k = 3
Hash # 1000: uses last 4 digits
Hash # 1101: uses last 3 digits
FALL 2004 CENG 351 17
Fetching a record
• Calculate the hash function.
• Look at the last k digits.– If it’s less than the boundary value, the location
is in the bucket labeled with the last k+1 digits.– Otherwise it is in the bucket labeled with the
last k digits.
• Follow overflow chains as with static hashing.
FALL 2004 CENG 351 18
Insertion
• Search for the correct bucket into which to place the new record.
• If the bucket is full, allocate a new overflow bucket.
• If there are now Lf*Bkfr records more than needed for the given Lf, – Add one more bucket to the primary area.– Distribute the records from the bucket chain at the
boundary value between the original area and the new primary area buckets
– Add 1 to the boundary value.
FALL 2004 CENG 351 19
Deletion
• Read in a chain of records.
• Replace the deleted record with the last record in the chain.– If the last overflow bucket becomes empty, deallocate it.
• When the number of records is Lf * Bkfr less than the number needed for Lf, contract the primary area by one bucket.
Compressing the table is exact opposite of expanding it:
• Keep the total # of records in the file and buckets in primary area.
• When we have Lf * Bkfr fewer records than needed, consolidate the last bucket with the bucket which shares the same last k digits.
FALL 2004 CENG 351 20
Extendible Hashing
• Hash function returns b bits
• Only the prefix i bits are used to hash the item
• There are 2i entries in the bucket address table
• Let ij be the length of the common hash prefix for data bucket j, there is 2(i-ij) entries in bucket address table points to j
Extendable Hashingi
i2
bucket2
i3
bucket3
i1
bucket1
Data bucket
Bucket address table
Length of common hash prefixHash prefix
FALL 2004 CENG 351 21
Splitting a bucket: Case 1
• Splitting (Case 1: ij=i)– Only one entry in bucket address table points to data
bucket j
– i++; split data bucket j to j, z; ij=iz=i; rehash all items previously in j;
Extendable Hashing
3
3
2
1
3
000001010011100101110111
2
00011011
2
2
1
FALL 2004 CENG 351 22
Splitting: Case 2
• Splitting (Case 2: ij< i)– More than one entry in bucket address table point to
data bucket j
– split data bucket j to j, z; ij = iz = ij +1; Adjust the pointers previously point to j to j and z; rehash all items previously in j;
Extendable Hashing
2
2
1
3
000001010011100101110111
2
2
2
3
000001010011100101110111
2
FALL 2004 CENG 351 23
Example• Suppose the hash function is h(x) = x mod 8 and each
bucket can hold at most two records. Show the extendable hash structure after inserting 1, 4, 5, 7, 8, 2, 20.
Extendable Hashing
1 4 5 7 8 2 20001 100 101 111 000 010 100
14
00
1
11
45
1
0
1
00
01
10
11
18
1
2
45
2
7
2
FALL 2004 CENG 351 24
Example
3
000
001
010
011
100
101
110
111
18
2
420
2
7
2
2
3
5
3
18
2
2
2
2
45
2
7
2
00
01
10
11
inserting 1, 4, 5, 7, 8, 2, 20
1 4 5 7 8 2 20001 100 101 111 000 010 100
Extendable Hashing
FALL 2004 CENG 351 25
Comments on Extendible Hashing• If directory fits in memory, equality search answered
with one disk access.
– A typical example: a100MB file with 100 bytes/entry and a page size of 4K contains 1,000,000 records (as data entries) but only about 25,000 directory elements chances are high that directory will fit in memory.
• If the distribution of hash values is skewed (e.g., a large number of search key values all are hashed to the same bucket ), directory can grow large.
– But this kind of skew can be avoided with a well-tuned hashing function
FALL 2004 CENG 351 26
Comments on Extendible Hashing
• Delete: If removal of data entry makes bucket empty, can be merged with a “buddy” bucket. If each directory element points to same bucket as its split image, can halve directory.
FALL 2004 CENG 351 27
Summary• Hash-based indexes: best for equality searches,
cannot support range searches.
• Static Hashing can lead to long overflow chains.
• Extendible Hashing avoids overflow pages by splitting a full bucket when a new data entry is to be added to it. – Directory to keep track of buckets, doubles periodically.– Can get large with skewed data; additional I/O if this
does not fit in main memory.