DESIGN & ANALYSIS OF ALGORITHM
02 – HASHING (CONTD.)
Informatics Department
Parahyangan Catholic University
ANALOGY
Suppose you have a drawer full of socks, 20 red (all identical) and 12 blue, and the room is dark. How many socks must you grab to be sure of getting at least one matching pair?
How about 20 red socks, 12 blue socks, and 8 green socks?
How about an unlimited number of red, blue, green, yellow, and purple socks?
ANALOGY
In a city of 2 million people, no one has more than 1.5 million hairs on his/her head. Can you show that at least two people in the city have exactly the same number of hairs on their heads?
PIGEONHOLE PRINCIPLE
In mathematics, the pigeonhole principle states that if n pigeons are put into m pigeonholes with n > m, then at least one pigeonhole must contain more than one pigeon.
-- Wikipedia
n = the range of possible keys
m = the size of the hash table
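COLLISION
When the range of possible keys is larger than the table size, two distinct keys k1 and k2 may be mapped to the same index: h(k1) = h(k2).
This condition is known as a collision, and a resolution strategy is required.
[Figure: the keys yellow, orange, red, green, blue, black, and white are hashed into a table; where does a colliding key go?]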
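The pigeonhole argument can be checked with a short Python sketch. The hash function h(k) = k mod m, the table size 7, and the helper name find_collisions are assumptions chosen only for illustration; they do not come from the slides.

```python
from collections import defaultdict

def find_collisions(keys, m):
    """Group keys by slot index (assumed hash: k mod m); keep slots with 2+ keys."""
    slots = defaultdict(list)
    for k in keys:
        slots[k % m].append(k)
    return {idx: ks for idx, ks in slots.items() if len(ks) > 1}

# 8 distinct keys into 7 slots: at least one slot must receive two keys.
print(find_collisions(range(8), 7))  # {0: [0, 7]}
# 7 distinct keys into 7 slots: a collision is possible but not forced.
print(find_collisions(range(7), 7))  # {}
```

Whatever hash function is used, m + 1 distinct keys cannot all land in different slots of a table of size m.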
COLLISION HANDLING :: 3 STRATEGIES
Open addressing
  Linear probing
  Quadratic probing
  Double hashing
Separate chaining
Coalesced hashing
COLLISION HANDLING :: OPEN ADDRESSING
In open addressing, a colliding entry is placed in another slot of the same table.
[Figure: the keys John Smith, Lisa Smith, Kenny Baker, and Jane Smith are hashed to slots J (74) through N (78) of a single table, which stores John Smith / 521-8976, Lisa Smith / 521-5030, Kenny Baker / 418-4165, and Jane Smith / 521-1234. Where should Kayla Newman go?]
COLLISION HANDLING :: SEPARATE CHAINING
In separate chaining, colliding entries are stored in a linked list in a separate area outside the table.
[Figure: the keys John Smith, Lisa Smith, Kenny Baker, Jane Smith, and Kayla Newman are hashed to slots J (74) through M (77); each slot points to a linked list. List entries: John Smith 521-8976, Jane Smith 521-1234, Kenny Baker 418-4165, Lisa Smith 521-5030, Kayla Newman 418-4222.]
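A minimal Python sketch of separate chaining, under stated assumptions: the hash is the character code of the first letter modulo the table size (my reading of the J (74), K (75), ... labels in the figure), Python lists stand in for linked lists, and the class name ChainedHashTable is hypothetical.

```python
class ChainedHashTable:
    def __init__(self, m):
        self.m = m
        self.buckets = [[] for _ in range(m)]  # one chain per slot

    def _index(self, key):
        # Assumed hash function: code of the first character, modulo m.
        return ord(key[0]) % self.m

    def insert(self, key, value):
        chain = self.buckets[self._index(key)]
        for pos, (k, _) in enumerate(chain):
            if k == key:                  # key already present: overwrite
                chain[pos] = (key, value)
                return
        chain.append((key, value))        # new key goes to the end of the chain

    def search(self, key):
        for k, v in self.buckets[self._index(key)]:
            if k == key:
                return v
        return None                       # key is not in its chain


book = ChainedHashTable(5)
for name, phone in [("John Smith", "521-8976"), ("Lisa Smith", "521-5030"),
                    ("Kenny Baker", "418-4165"), ("Jane Smith", "521-1234"),
                    ("Kayla Newman", "418-4222")]:
    book.insert(name, phone)

print(book.search("Jane Smith"))          # 521-1234
```

Here John Smith and Jane Smith end up in the same chain, as do Kenny Baker and Kayla Newman, so a search only walks the chain of one slot.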
COLLISION HANDLING :: COALESCED HASHING
Coalesced hashing combines open addressing and separate chaining. It uses linked lists like separate chaining, but the list nodes are stored in empty slots of the same table.
[Figure: the keys John Smith, Lisa Smith, Kenny Baker, Jane Smith, and Kayla Newman are hashed to slots J (74) through N (78) of a single table, which stores John Smith / 521-8976, Lisa Smith / 521-5030, Kenny Baker / 418-4165, Jane Smith / 521-1234, and Kayla Newman / 418-4222; colliding entries are linked to one another.]
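The idea can be sketched in Python as follows. This is one possible variant, not necessarily the one drawn in the figure: it assumes a first-letter hash, looks for empty slots by scanning from the bottom of the table upward, and appends a colliding entry to the tail of the chain that starts at its home slot. The class name CoalescedHashTable is hypothetical.

```python
class CoalescedHashTable:
    def __init__(self, m):
        self.m = m
        self.slots = [None] * m   # each used slot holds [key, value, next_index]
        self.free = m - 1         # cursor that scans for empty slots from the bottom

    def _index(self, key):
        return ord(key[0]) % self.m   # assumed hash: first character code mod m

    def insert(self, key, value):
        idx = self._index(key)
        if self.slots[idx] is None:           # home slot is free: no collision
            self.slots[idx] = [key, value, None]
            return True
        while True:                           # walk the chain to its tail
            if self.slots[idx][0] == key:     # key already present: overwrite
                self.slots[idx][1] = value
                return True
            if self.slots[idx][2] is None:
                break
            idx = self.slots[idx][2]
        while self.free >= 0 and self.slots[self.free] is not None:
            self.free -= 1                    # find an empty slot in the same table
        if self.free < 0:
            return False                      # table is full
        self.slots[self.free] = [key, value, None]
        self.slots[idx][2] = self.free        # link the old tail to the new entry
        return True

    def search(self, key):
        idx = self._index(key)
        while idx is not None and self.slots[idx] is not None:
            if self.slots[idx][0] == key:
                return self.slots[idx][1]
            idx = self.slots[idx][2]          # follow the link, not a probe sequence
        return None


table = CoalescedHashTable(5)
for name, phone in [("John Smith", "521-8976"), ("Lisa Smith", "521-5030"),
                    ("Kenny Baker", "418-4165"), ("Jane Smith", "521-1234"),
                    ("Kayla Newman", "418-4222")]:
    table.insert(name, phone)

print(table.search("Kayla Newman"))           # 418-4222
```

A search follows stored links from the home slot instead of recomputing a probe sequence, and no memory outside the table is needed.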
PERFORMANCE ANALYSIS
What are the advantages and disadvantages of the three collision handling methods? How can we compare them? What measurements can we use?
Load factor: what is the average number of elements stored in a slot?
Probe number: how many slots do we need to examine before finding an empty slot?
EXAMPLE :: OPEN ADDRESSING
Load factor = 1 (because every slot holds only 1 element)
Probe number for “Lisa Smith” = ?
Probe number for “Kayla Newman” = ?
[Figure: the keys John Smith, Lisa Smith, Kenny Baker, Jane Smith, and Kayla Newman are hashed to slots J (74) through N (78) of a single table, which stores John Smith / 521-8976, Lisa Smith / 521-5030, Kenny Baker / 418-4165, Jane Smith / 521-1234, and Kayla Newman / 418-4222.]
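One way to explore these questions is to count the slots examined during each insertion. The sketch below rests on several assumptions that the slide does not state: the hash is the first letter (J maps to slot 0, N to slot 4), collisions are resolved by moving to the next slot, the keys arrive in the order listed, and "probe number" is counted as the number of slots examined. A different insertion order or probing rule gives different counts.

```python
def insert_linear(table, key, home):
    """Insert key starting at slot `home`; return how many slots were examined."""
    m = len(table)
    for i in range(m):
        idx = (home + i) % m
        if table[idx] is None:
            table[idx] = key
            return i + 1
    raise RuntimeError("table is full")


table = [None] * 5                      # slots J (74) .. N (78)
probes = {}
for name in ["John Smith", "Lisa Smith", "Kenny Baker",
             "Jane Smith", "Kayla Newman"]:
    home = ord(name[0]) - ord("J")      # assumed hash: first letter, J -> slot 0
    probes[name] = insert_linear(table, name, home)

load_factor = sum(slot is not None for slot in table) / len(table)
print(probes["Lisa Smith"])             # 1
print(probes["Kayla Newman"])           # 4
print(load_factor)                      # 1.0
```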
EXAMPLE :: SEPARATE CHAINING
Load factor = number of probes (because colliding elements are stored in a linked list)
Probe number for “Jane Smith” = ?
Probe number for “Kayla Newman” = ?
[Figure: the keys John Smith, Lisa Smith, Kenny Baker, Jane Smith, and Kayla Newman are hashed to slots J (74) through M (77); each slot points to a linked list. List entries: John Smith 521-8976, Jane Smith 521-1234, Kenny Baker 418-4165, Lisa Smith 521-5030, Kayla Newman 418-4222.]
What if we insert a new element at the beginning of the list?
EXAMPLE :: COALESCED HASHING
[Figure: the keys John Smith, Lisa Smith, Kenny Baker, and Jane Smith are hashed to slots J (74) through N (78) of a single table, which stores John Smith / 521-8976, Lisa Smith / 521-5030, Kenny Baker / 418-4165, and Jane Smith / 521-1234.]
Load factor = 1 (because every slot holds only 1 element)
What is the advantage of this method?
How many slots must be checked to insert Kenny Baker?
How many slots must be checked to search for Kenny Baker?
OPEN ADDRESSING
In open addressing, a colliding entry is placed in another slot of the same table, chosen by a hash function h(k,i), where i is the probe number.
There are generally 3 techniques to decide the next slot to be filled:
  linear probing
  quadratic probing
  double hashing
The sequence h(k,0), h(k,1), h(k,2), … is called the probe sequence.
OPEN ADDRESSING :: LINEAR PROBING
Define
  $h(k, i) = (h'(k) + i) \bmod m$
where h'(k) is the initial hash function, and i is the probe number for key k
Example: m = 13
  k = 5: h'(k) = 5
  k = 18: h'(k) = 5 (collision)
    h(k,1) = (5+1) mod 13 = 6
  k = 19: h'(k) = 6 (collision)
    h(k,1) = (6+1) mod 13 = 7
  k = 31: h'(k) = 5 (collision)
    h(k,1) = (5+1) mod 13 = 6 (collision)
    h(k,2) = (5+2) mod 13 = 7 (collision)
    h(k,3) = (5+3) mod 13 = 8
Suffers from primary clustering
Clusters arise since an empty slot preceded by i non-empty slots gets filled next with probability (i+1)/m
There are only m distinct probe sequences
OPEN ADDRESSING :: LINEAR PROBING
idx Data
1 A
2 B
3 C
4 D
5 E
6 F
7
8
9
… …
Every key k with h'(k) between 1 and 6 will be placed in slot 7, the first empty slot after the cluster
OPEN ADDRESSING :: QUADRATIC PROBING
Define
  $h(k, i) = (h'(k) + c_1 i + c_2 i^2) \bmod m$
where h'(k) is the initial hash function, i is the probe number for key k, and c1 and c2 are constants
OPEN ADDRESSING :: QUADRATIC PROBING
Example: m = 13, c1 = 2, c2 = 3
  k = 5: h'(k) = 5
  k = 18: h'(k) = 5 (collision)
    h(k,1) = (5 + 2*1 + 3*1^2) mod 13 = 10
  k = 19: h'(k) = 6
  k = 31: h'(k) = 5 (collision)
    h(k,1) = (5 + 2*1 + 3*1^2) mod 13 = 10 (collision)
    h(k,2) = (5 + 2*2 + 3*2^2) mod 13 = 8
  k = 32: h'(k) = 6 (collision)
    h(k,1) = (6 + 2*1 + 3*1^2) mod 13 = 11
h(k,1) is not necessarily adjacent to h'(k), which avoids the primary clustering problem.
However, keys with the same h'(k) follow the same probe sequence. This leads to a milder form of clustering, called secondary clustering (again, there are only m distinct probe sequences).
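The same example can be replayed in Python. As before, h'(k) = k mod m is an assumption consistent with the values shown, and the function names are hypothetical.

```python
def quadratic_probe(k, i, m=13, c1=2, c2=3):
    """h(k, i) = (h'(k) + c1*i + c2*i^2) mod m, with the assumed h'(k) = k mod m."""
    return (k % m + c1 * i + c2 * i * i) % m


def insert(table, k):
    """Place k in the first empty slot of its probe sequence; return the slot index."""
    m = len(table)
    for i in range(m):
        idx = quadratic_probe(k, i, m)
        if table[idx] is None:
            table[idx] = k
            return idx
    return None                        # no empty slot found along the probe sequence


table = [None] * 13
placed = [insert(table, k) for k in (5, 18, 19, 31, 32)]
print(placed)                          # [5, 10, 6, 8, 11]
```

Keys 18 and 31 share h'(k) = 5, so both try slot 10 first; this shared sequence is the secondary clustering mentioned above.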
OPEN ADDRESSING :: QUADRATIC PROBING
Observe these 2 cases where h'(k) = 5 and h'(k) = 6 (m = 13, c1 = 2, and c2 = 3)
Note that only slots 0, 5, 6, 8, 9, 10, and 12 can be filled by keys with h'(k) = 5
Only slots 0, 1, 6, 7, 9, 10, and 11 can be filled by keys with h'(k) = 6
probe#   h(k,i) for h'(k) = 5   h(k,i) for h'(k) = 6
 1       10                     11
 2        8                      9
 3       12                      0
 4        9                     10
 5       12                      0
 6        8                      9
 7       10                     11
 8        5                      6
 9        6                      7
10        0                      1
11        0                      1
12        6                      7
13        5                      6
This suggests that some slots might get filled with higher probability than others.
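The set of reachable slots can be computed directly from the probe formula. The helper name reachable_slots is hypothetical; the parameters are the ones used on this slide.

```python
def reachable_slots(home, m=13, c1=2, c2=3):
    """Slots visited by h(k, i) = (home + c1*i + c2*i^2) mod m for i = 0 .. m-1."""
    return sorted({(home + c1 * i + c2 * i * i) % m for i in range(m)})


print(reachable_slots(5))   # [0, 5, 6, 8, 9, 10, 12]
print(reachable_slots(6))   # [0, 1, 6, 7, 9, 10, 11]
```

With these parameters each starting position reaches only 7 of the 13 slots, so an insertion can fail even when the table still has empty slots.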
OPEN ADDRESSING :: QUADRATIC PROBING
The choice of m, c1, and c2 is important:
  For m = 2^n, a good choice is c1 = c2 = 0.5
  For prime m > 2, most choices of c1 and c2 will make h(k,i) distinct for i in [0, (m-1)/2]
Example: m = 2^4 = 16, c1 = c2 = 0.5, h'(k) = 0
probe#   h(k,i)   probe#   h(k,i)
0         0        8        4
1         1        9       13
2         3       10        7
3         6       11        2
4        10       12       14
5        15       13       11
6         5       14        9
7        12       15        8
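With c1 = c2 = 0.5 the offset is 0.5*i + 0.5*i^2 = i(i+1)/2, the i-th triangular number, so the sequence can be computed in integer arithmetic. The sketch below reproduces the table and checks that every slot is visited exactly once; the function name is hypothetical.

```python
def triangular_probe_sequence(home, m):
    """h(k, i) = (home + i*(i+1)/2) mod m for i = 0 .. m-1 (c1 = c2 = 0.5)."""
    return [(home + i * (i + 1) // 2) % m for i in range(m)]


seq = triangular_probe_sequence(0, 16)
print(seq)  # [0, 1, 3, 6, 10, 15, 5, 12, 4, 13, 7, 2, 14, 11, 9, 8]
print(sorted(seq) == list(range(16)))  # True: all 16 slots are reached
```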
OPEN ADDRESSING :: DOUBLE HASHING
Define
  $h(k, i) = (h_1(k) + i \cdot h_2(k)) \bmod m$
where h1(k) is the initial hash function, i is the probe number for key k, and h2(k) is a hash function different from h1(k)
Two different keys a and b that initially hash to the same location (h1(a) = h1(b)) will have different probe sequences whenever h2(a) ≠ h2(b)
OPEN ADDRESSING :: DOUBLE HASHING
h2(k) must be relatively prime to m
Examples:
  Let m be a power of 2 and h2(k) always return an odd number
  Let m be prime and h2(k) always return a positive integer less than m
There are Θ(m^2) distinct probe sequences
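A small Python sketch of the prime-m case. The concrete choices h1(k) = k mod m and h2(k) = 1 + (k mod (m - 1)) are assumptions for illustration (a common textbook pairing), not functions given on the slides; h2 always returns a value between 1 and m - 1, so it is relatively prime to a prime m.

```python
from math import gcd


def double_hash_sequence(k, m):
    """Probe sequence h(k, i) = (h1(k) + i*h2(k)) mod m for i = 0 .. m-1."""
    h1 = k % m                     # assumed first hash function
    h2 = 1 + (k % (m - 1))         # assumed second hash function, in 1 .. m-1
    if gcd(h2, m) != 1:
        raise ValueError("h2(k) must be relatively prime to m")
    return [(h1 + i * h2) % m for i in range(m)]


a = double_hash_sequence(5, 13)
b = double_hash_sequence(18, 13)
print(a[:3])                       # [5, 11, 4]
print(b[:3])                       # [5, 12, 6]
print(sorted(a) == list(range(13)))  # True: the sequence visits every slot
```

Keys 5 and 18 collide on their first probe but separate immediately afterwards, because their step sizes h2 differ (6 and 7).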
BASIC HASH TABLE OPERATION
INSERT(key, value)
  we have discussed this a lot
value SEARCH(key)
  similar to INSERT
DELETE(key)
  do not delete the value, mark it “deleted” instead (why?)
In separate chaining, all three operations are merely inserting, searching, and deleting in the appropriate linked list
When to stop searching?
INSERTION IN OPEN ADDRESSING
INSERT(key, value)
// returns true if key is successfully inserted
// returns false otherwise
i = 0
while(i < m)
    idx = HASHFUNCTION(key, i)
    if(table[idx] is empty or marked as deleted)
        table[idx] = (key, value)
        return true
    else
        i = i+1
return false
SEARCHING IN OPEN ADDRESSING
SEARCH(key)
// returns associated value if key is found
// returns null otherwise
i = 0
while(i < m)
    idx = HASHFUNCTION(key, i)
    if(table[idx] is empty)
        return null    //reached an empty slot, so key cannot be in the hash table
    else if(table[idx] not marked as deleted AND table[idx].key == key)
        return table[idx].value    //key found
    else
        i = i+1    //try the next slot
return null    //tried all m possible slots and key not found
DELETION IN OPEN ADDRESSING
DELETE(key)
// returns associated value if key is found and successfully deleted
// returns null otherwise
i = 0
while(i < m)
    idx = HASHFUNCTION(key, i)
    if(table[idx] is empty)
        return null    //reached an empty slot, so key cannot be in the hash table
    else if(table[idx] not marked as deleted AND table[idx].key == key)
        temp = table[idx].value
        mark table[idx] as deleted
        return temp
    else
        i = i+1    //try the next slot
return null    //tried all m possible slots and key not found
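The three routines translate almost line by line into Python. The sketch below makes some assumptions of its own: linear probing with h'(k) = k mod m stands in for HASHFUNCTION, integer keys are used, and two sentinel objects (EMPTY and DELETED, hypothetical names) represent empty and deleted slots. Like the pseudocode, insert does not check whether the key is already present.

```python
EMPTY = object()     # slot has never been used
DELETED = object()   # slot held an entry that was deleted


class OpenAddressingTable:
    def __init__(self, m):
        self.m = m
        self.table = [EMPTY] * m

    def _probe(self, key, i):
        return (key % self.m + i) % self.m    # assumed: linear probing

    def insert(self, key, value):
        for i in range(self.m):
            idx = self._probe(key, i)
            if self.table[idx] is EMPTY or self.table[idx] is DELETED:
                self.table[idx] = (key, value)
                return True
        return False                          # tried all m slots

    def search(self, key):
        for i in range(self.m):
            entry = self.table[self._probe(key, i)]
            if entry is EMPTY:
                return None                   # empty slot: key cannot be further on
            if entry is not DELETED and entry[0] == key:
                return entry[1]
        return None

    def delete(self, key):
        for i in range(self.m):
            idx = self._probe(key, i)
            entry = self.table[idx]
            if entry is EMPTY:
                return None
            if entry is not DELETED and entry[0] == key:
                self.table[idx] = DELETED     # leave a marker, do not empty the slot
                return entry[1]
        return None


t = OpenAddressingTable(13)
for key, value in [(5, "a"), (18, "b"), (31, "c")]:
    t.insert(key, value)                      # keys land in slots 5, 6, 7
print(t.delete(18))                           # b
print(t.search(31))                           # c
```

Keys 18 and 31 were placed after key 5 in the same probe sequence. If deleting 18 had emptied slot 6, the search for 31 would stop there and wrongly report that the key is missing; the "deleted" marker lets the search continue to slot 7.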