fall 2004ceng 3511 hashing reference: chapters: 11,12

27
FALL 2004 CENG 351 1 Hashing Reference: Chapters: 11,12

Post on 20-Dec-2015

227 views

Category:

Documents


1 download

TRANSCRIPT

Page 1: FALL 2004CENG 3511 Hashing Reference: Chapters: 11,12

FALL 2004 CENG 351 1

Hashing

Reference: Chapters: 11,12

Page 2: FALL 2004CENG 3511 Hashing Reference: Chapters: 11,12

FALL 2004 CENG 351 2

Motivation• The primary goal is to locate the desired record in

a single access of disk.– Sequential search: O(N)

– B+ trees: O(logk N)

– Hashing: O(1)

• In hashing, the key of a record is transformed into an address and the record is stored at that address.

• Hash-based indexes are best for equality selections. Cannot support range searches.

• Static and dynamic hashing techniques exist.

Page 3: FALL 2004CENG 3511 Hashing Reference: Chapters: 11,12

FALL 2004 CENG 351 3

Hash-based Index

• Data entries are kept in buckets (an abstract term)• Each bucket is a collection of one primary page and

zero or more overflow pages.• Given a search key value, k, we can find the bucket

where the data entry k* is stored as follows:– Use a hash function, denoted by h

– The value of h(k) is the address for the desired bucket. h(k) should distribute the search key values uniformly over the collection of buckets

Page 4: FALL 2004CENG 3511 Hashing Reference: Chapters: 11,12

FALL 2004 CENG 351 4

Design Factors

• Bucket size: the number of records that can be held at the same address.

• Loading factor: the ratio of the number of records put in the file to the total capacity of the buckets.

• Hash function: should evenly distribute the keys among the addresses.

• Overflow resolution technique.

Page 5: FALL 2004CENG 3511 Hashing Reference: Chapters: 11,12

FALL 2004 CENG 351 5

Hash Functions

• Key mod N:– N is the size of the table, better if it is prime.

• Folding:– e.g. 123|456|789: add them and take mod.

• Truncation: – e.g. 123456789 map to a table of 1000 addresses by

picking 3 digits of the key.

• Squaring:– Square the key and then truncate

• Radix conversion: – e.g. 1 2 3 4 treat it to be base 11, truncate if necessary.

Page 6: FALL 2004CENG 3511 Hashing Reference: Chapters: 11,12

FALL 2004 CENG 351 6

Static Hashing• Primary Area: # primary pages fixed, allocated

sequentially, never de-allocated; (say M buckets).– A simple hash function: h(k) = f(k) mod M

• Overflow area: disjoint from the primary area. It keeps buckets which hold records whose key maps to a full bucket.– Adding the address of an overflow bucket to a primary

area bucket is called chaining.

• Collision does not cause a problem as long as there is still room in the mapped bucket. Overflow occurs during insertion when a record is hashed to the bucket that is already full.

Page 7: FALL 2004CENG 3511 Hashing Reference: Chapters: 11,12

FALL 2004 CENG 351 7

Example• Assume f(k) = k. Let M = 5. So, h(k) = k mod 5• Bucket factor = 3 records.

0 35 60

1 6 46

2 12 57 62

3 33

4 44

17

Primary area overflow

Page 8: FALL 2004CENG 3511 Hashing Reference: Chapters: 11,12

FALL 2004 CENG 351 8

Load Factor (Packing density)

• To limit the amount of overflow we allocate more space to the primary area than we need (i.e. the primary area will be, say, 70% full)

• Load Factor =

=> Lf =

# of records in the file

# of spaces in primary area

n

M * Bkfr

Page 9: FALL 2004CENG 3511 Hashing Reference: Chapters: 11,12

FALL 2004 CENG 351 9

Effects of Lf and Bkfr

• Performance can be enhanced by the choice of bucket size and load factor.

• In general, a smaller load factor means – less overflow and a faster fetch time; – but more wasted space.

• A larger Bkfr means – less overflow in general, – but slower fetch.

Page 10: FALL 2004CENG 3511 Hashing Reference: Chapters: 11,12

FALL 2004 CENG 351 10

Insertion and Deletion

• Insertion: New records are inserted at the end of the chain.

• Deletion: Two ways are possible:1. Mark the record to be deleted2. Consolidate sparse buckets when deleting

records.– In the 2nd approach:

• When a record is deleted, fill its place with the last record in the chain of the current bucket.

• Deallocate the last bucket when it becomes empty.

Page 11: FALL 2004CENG 3511 Hashing Reference: Chapters: 11,12

FALL 2004 CENG 351 11

Problem of Static Hashing

• The main problem with static hashing: the number of buckets is fixed:– Long overflow chains can develop and degrade performance.

– On the other hand, if a file shrinks greatly, a lot of bucket space will be wasted.

• There are some other hashing techniques that allow dynamically growing and shrinking hash index. These include:

– linear hashing

– extendible hashing

Page 12: FALL 2004CENG 3511 Hashing Reference: Chapters: 11,12

FALL 2004 CENG 351 12

Linear Hashing

• It maintains a constant load factor.

• Thus avoids reorganization.

• It does so, by incrementally adding new buckets to the primary area.

• In linear hashing the last bits in the hash number are used for placing the records.

Page 13: FALL 2004CENG 3511 Hashing Reference: Chapters: 11,12

FALL 2004 CENG 351 13

Example

000 8 16 32

001 17 25

010 34 50

011 11 27

100 28 12

101 5

110 14

111 55 15

Last 3 bits

e.g.34: 10001028: 01110013: 001101 21: 010101

Insert: 13, 21, 37

Lf = 15/24

= 63%

Page 14: FALL 2004CENG 3511 Hashing Reference: Chapters: 11,12

FALL 2004 CENG 351 14

Insertion of records

• To expand the table: split an existing bucket denoted by k digits into two buckets using the last k+1 digits.

• e.g.

000

1000

0000

Page 15: FALL 2004CENG 3511 Hashing Reference: Chapters: 11,12

FALL 2004 CENG 351 15

Expanding the table

0000 16 32

001 17 25

010 34 50

011 11 27

100 28 12

101 5 13 21

110 14

111 55 15

1000 8

37

Boundary value

Page 16: FALL 2004CENG 3511 Hashing Reference: Chapters: 11,12

FALL 2004 CENG 351 16

0000 16 32

0001 17

0010 34 50

011 11 27

100 28 12

101 5 13 21

110 14

111 55 15

1000 8

1001 25

1010 26

Boundary value

37

k = 3

Hash # 1000: uses last 4 digits

Hash # 1101: uses last 3 digits

Page 17: FALL 2004CENG 3511 Hashing Reference: Chapters: 11,12

FALL 2004 CENG 351 17

Fetching a record

• Calculate the hash function.

• Look at the last k digits.– If it’s less than the boundary value, the location

is in the bucket labeled with the last k+1 digits.– Otherwise it is in the bucket labeled with the

last k digits.

• Follow overflow chains as with static hashing.

Page 18: FALL 2004CENG 3511 Hashing Reference: Chapters: 11,12

FALL 2004 CENG 351 18

Insertion

• Search for the correct bucket into which to place the new record.

• If the bucket is full, allocate a new overflow bucket.

• If there are now Lf*Bkfr records more than needed for the given Lf, – Add one more bucket to the primary area.– Distribute the records from the bucket chain at the

boundary value between the original area and the new primary area buckets

– Add 1 to the boundary value.

Page 19: FALL 2004CENG 3511 Hashing Reference: Chapters: 11,12

FALL 2004 CENG 351 19

Deletion

• Read in a chain of records.

• Replace the deleted record with the last record in the chain.– If the last overflow bucket becomes empty, deallocate it.

• When the number of records is Lf * Bkfr less than the number needed for Lf, contract the primary area by one bucket.

Compressing the table is exact opposite of expanding it:

• Keep the total # of records in the file and buckets in primary area.

• When we have Lf * Bkfr fewer records than needed, consolidate the last bucket with the bucket which shares the same last k digits.

Page 20: FALL 2004CENG 3511 Hashing Reference: Chapters: 11,12

FALL 2004 CENG 351 20

Extendible Hashing

• Hash function returns b bits

• Only the prefix i bits are used to hash the item

• There are 2i entries in the bucket address table

• Let ij be the length of the common hash prefix for data bucket j, there is 2(i-ij) entries in bucket address table points to j

Extendable Hashingi

i2

bucket2

i3

bucket3

i1

bucket1

Data bucket

Bucket address table

Length of common hash prefixHash prefix

Page 21: FALL 2004CENG 3511 Hashing Reference: Chapters: 11,12

FALL 2004 CENG 351 21

Splitting a bucket: Case 1

• Splitting (Case 1: ij=i)– Only one entry in bucket address table points to data

bucket j

– i++; split data bucket j to j, z; ij=iz=i; rehash all items previously in j;

Extendable Hashing

3

3

2

1

3

000001010011100101110111

2

00011011

2

2

1

Page 22: FALL 2004CENG 3511 Hashing Reference: Chapters: 11,12

FALL 2004 CENG 351 22

Splitting: Case 2

• Splitting (Case 2: ij< i)– More than one entry in bucket address table point to

data bucket j

– split data bucket j to j, z; ij = iz = ij +1; Adjust the pointers previously point to j to j and z; rehash all items previously in j;

Extendable Hashing

2

2

1

3

000001010011100101110111

2

2

2

3

000001010011100101110111

2

Page 23: FALL 2004CENG 3511 Hashing Reference: Chapters: 11,12

FALL 2004 CENG 351 23

Example• Suppose the hash function is h(x) = x mod 8 and each

bucket can hold at most two records. Show the extendable hash structure after inserting 1, 4, 5, 7, 8, 2, 20.

Extendable Hashing

1 4 5 7 8 2 20001 100 101 111 000 010 100

14

00

1

11

45

1

0

1

00

01

10

11

18

1

2

45

2

7

2

Page 24: FALL 2004CENG 3511 Hashing Reference: Chapters: 11,12

FALL 2004 CENG 351 24

Example

3

000

001

010

011

100

101

110

111

18

2

420

2

7

2

2

3

5

3

18

2

2

2

2

45

2

7

2

00

01

10

11

inserting 1, 4, 5, 7, 8, 2, 20

1 4 5 7 8 2 20001 100 101 111 000 010 100

Extendable Hashing

Page 25: FALL 2004CENG 3511 Hashing Reference: Chapters: 11,12

FALL 2004 CENG 351 25

Comments on Extendible Hashing• If directory fits in memory, equality search answered

with one disk access.

– A typical example: a100MB file with 100 bytes/entry and a page size of 4K contains 1,000,000 records (as data entries) but only about 25,000 directory elements chances are high that directory will fit in memory.

• If the distribution of hash values is skewed (e.g., a large number of search key values all are hashed to the same bucket ), directory can grow large.

– But this kind of skew can be avoided with a well-tuned hashing function

Page 26: FALL 2004CENG 3511 Hashing Reference: Chapters: 11,12

FALL 2004 CENG 351 26

Comments on Extendible Hashing

• Delete: If removal of data entry makes bucket empty, can be merged with a “buddy” bucket. If each directory element points to same bucket as its split image, can halve directory.

Page 27: FALL 2004CENG 3511 Hashing Reference: Chapters: 11,12

FALL 2004 CENG 351 27

Summary• Hash-based indexes: best for equality searches,

cannot support range searches.

• Static Hashing can lead to long overflow chains.

• Extendible Hashing avoids overflow pages by splitting a full bucket when a new data entry is to be added to it. – Directory to keep track of buckets, doubles periodically.– Can get large with skewed data; additional I/O if this

does not fit in main memory.