chapter 5 hashing - computer science | william & marytadavis/cs303/ch05sm.pdfchapter 5 hashing...
TRANSCRIPT
Chapter 5Hashing
Introduction
2
hashing performs basic operations, such as insertion,deletion, and finds in average time
Hashing
3
a hash table is merely an of some fixed sizehashing converts into locations in a hashtable
searching on the key becomes something like arraylookup
hashing is typically a many-to-one map: multiple keys aremapped to the same array index
mapping multiple keys to the same position results in a____________ that must be resolved
two parts to hashing:a hash function, which transforms keys into array indicesa collision resolution procedure
Hashing Functions
4
let be the set of search keyshash functions map into the set of in thehash table
ideally, distributes over the slots ofthe hash table, to minimize collisionsif we are hashing items, we want the number of itemshashed to each location to be close to
example: Library of Congress Classification Systemhash function if we look at the first part of the call numbers(e.g., E470, PN1995)
collision resolution involves going to the stacks andlooking through the booksalmost all of CS is hashed to QA75 and QA76 (BAD)
Hashing Functions
5
suppose we are storing a set of nonnegative integersgiven , we can obtain hash values between 0 and 1 with thehash function
when is divided byfast operation, but we need to be careful when choosing
example: if , is just the lowest-order bits ofare all the hash values equally likely?
choosing to be a not too close to a power of 2works well in practice
Hashing Functions
6
we can also use the hash function below for floating pointnumbers if we interpret the bits as an ______________
two ways to do this in C, assuming long int and doubletypes have the same lengthfirst method uses C to accomplish this task
Hashing Functions
7
we can also use the hash function below for floating pointnumbers if we interpret the bits as an integer (cont.)second uses a , which is a variable that can holdobjects of different types and sizes
Hashing Functions
8
we can hash strings by combining a hash of each _________
is an additional parameter we get to chooseif is larger than any character value, then this approach iswhat you would obtain if you treated the string as a base-______________
Hashing Functions
9
K&R suggest a slightly simpler hash function, correspondingto
Weiss suggests
Hashing Functions
10
we can use the idea for strings if our search key has_____________ parts, say, street, city, state:
hash = ((street * R + city) % M) * R + state) % M;
same ideas apply to hashing vectors
Hash Functions
11
the choice of parameters can have a ____________ effecton the results of hashingcompare the text's string hashing algorithm for differentpairs of and
plot _______________ of the number of words hashed toeach hash table location; we use the American dictionaryfrom the aspell program as data (305,089 words)
Hash Functions
12
example:good: words are __________________________
Hash Functions
13
example:very bad
Hash Functions
14
example:better
Hash Functions
15
example:bad
Collision Resolution
16
hash table collisionoccurs when elements hash to the __________________in the tablevarious for dealing with collision
separate chainingopen addressinglinear probingother methods
Separate Chaining
17
separate chainingkeep a list of all elements that to the same locationeach location in the hash table is a _________________example: first 10 squares
Separate Chaining
18
insert, search, delete in listsall proportional to of linked listinsert
new elements can be inserted at of listduplicates can increment ______________
other structures could be used instead of listsbinary search treeanother hash table
linked lists good if table is and hash function isgood
Separate Chaining
19
how long are the linked lists in a hash table?value: where is the number of keys
and is the size of the tableis it reasonable to assume the hash table would exhibitthis behavior?
load factoraverage length of a listtime to search: time to evaluate the hashfunction + time to the list
unsuccessful search:successful search:
Separate Chaining
20
observationsmore important than table size
general rule: make the table as large as the number ofelements to be stored,keep table size prime to ensure good ________________
Separate Chaining
21
declaration of hash structure
Separate Chaining
22
hash member function
Separate Chaining
23
routines for separate chaining
Separate Chaining
24
routines for separate chaining
Open Addressing
25
linked lists incur extra coststime to for new cellseffort and complexity of defining second data structure
a different collision strategy involves placing colliding keysin nearby slots
if a collision occurs, try cells until an emptyone is foundbigger table size needed withload factor should be below
three common strategieslinear probingquadratic probingdouble hashing
Linear Probing
26
linear probing insert operationwhen is hashed, if slot is open, place thereif there is a collision, then start looking for an empty slotstarting with location in the hash table, andproceed through , , 0, 1, 2,
wrapping around the hash table, looking foran empty slot
search operation is similarchecking whether a table entry is vacant (or is one weseek) is called a ___________
Linear Probing
27
example: add 89, 18, 49, 58, 69 with and
Linear Probing
28
as long as the table is , a vacant cell can be foundbut time to locate an empty cell can become largeblocks of occupied cells results in primary _____________
deleting entries leaves __________some entries may no longer be foundmay require moving many other entries
expected number of probes
for search hits:
for insertion and search misses:
for , these values are and , respectively
Linear Probing
29
performance of linear probing (dashed) vs. more randomcollision resolution
adequate up toSuccessful, Unsuccessful, Insertion
Quadratic Probing
30
quadratic probingeliminates _______________________collision function is quadraticexample: add 89, 18, 49, 58, 69 with and
Quadratic Probing
31
in linear probing, letting table get nearly greatlyhurts performancequadratic probing
no of finding an empty cell once thetable gets larger than half fullat most, of the table can be used to resolvecollisionsif table is half empty and the table size is prime, then weare always guaranteed to accommodate a new elementcould end up with situation where all keys map to thesame table location
Quadratic Probing
32
quadratic probingcollisions will probe the same alternative cells
clusteringcauses less than half an extra probe per search
Double Hashing
33
double hashing
apply a second hash function to and probe across___________________________function must never evaluate to ____make sure all cells can be probed
Double Hashing
34
double hashing examplewith
is a prime smaller than table sizeinsert 89, 18, 49, 58, 69
Double Hashing
35
double hashing example (cont.)note here that the size of the table (10) is not primeif 23 inserted in the table, it would collide with 58
since and the table size is 10,only one alternative location, which is taken
Rehashing
36
table may get _______________run time of operations may take too longinsertions may for quadratic resolution
too many removals may be intermixed with insertionssolution: build a new table (with a newhash function)
go through original hash table to compute a hash valuefor each (non-deleted) elementinsert it into the new table
Rehashing
37
example: insert 13, 15, 24, 6 into a hash table of size 7with
Rehashing
38
example (cont.)insert 23table will be over 70% full; therefore, a new table iscreated
Rehashing
39
example (cont.)new table is size 17new hash functionall old elements are inserted into newtable
Rehashing
40
rehashing run time since elements and to rehashthe entire table of size roughly
must have been insertions since last rehashrehashing may run OK if in ________________
if interactive session, rehashing operation could producea slowdown
rehashing can be implemented with _________________could rehash as soon as the table is half fullcould rehash only when an insertion failscould rehash only when a certain ______ ___ is reached
may be best, as performance degrades as load factorincreases
Hash Tables with Worst-Case Access
41
hash tables so faraverage case for insertions, searches, and
deletionsseparate chaining: worst case
some queries will take nearly logarithmic timeworst-case time would be better
important for applications such as lookup tables forrouters and memory cachesif is known in advance, and elements can be_______________, worst-case time isachievable
Hash Tables with Worst-Case Access
42
perfect hashingassume all items known _________________separate chaining
if the number of lists continually increases, the lists willbecome shorter and shorterwith enough lists, high probability of ______________two problems
number of lists might be unreasonably __________the hashing might still be unfortunate
can be made large enough to have probabilityof no collisionsif collision detected, clear table and try again with adifferent hash function (at most done 2 times)
Hash Tables with Worst-Case Access
43
perfect hashing (cont.)how large must be?
theoretically, should be , which is ____________solution: use lists
resolve collisions by using hash tables instead oflinked lists
each of these lists can have elementseach secondary hash table will use a different hashfunction until it is ______________________
can also perform similar operation for primary hashtable
total size of secondary hash tables is at most
Hash Tables with Worst-Case Access
44
perfect hashing (cont.)example: slots 1, 3, 5, 7 empty; slots 0, 4, 8 have 1element each; slots 2, 6 have 2 elements each; slot 9has 3 elements
Hash Tables with Worst-Case Access
45
cuckoo hashingbound known for a long time
researchers surprised in 1990s to learn that if one of twotables were chosen as items were inserted,the size of the largest list would be , which issignificantly smallermain idea: use 2 tables
neither more than fulluse a separate hash function for eachitem will be stored in one of these two locationscollisions resolved by elements
Hash Tables with Worst-Case Access
46
cuckoo hashing (cont.)example: 6 items; 2 tables of size 5; each table hasrandomly chosen hash function
A can be placed at position 0 in Table 1 or position 2 inTable 2a search therefore requires at most 2 table accesses inthis exampleitem deletion is trivial
Hash Tables with Worst-Case Access
47
cuckoo hashing (cont.)insertion
ensure item is not already in one of the tablesuse first hash function and if first table location is___________, insert thereif location in first table is occupied
____________ element there and place current itemin correct position in first tabledisplaced element goes to its alternate hash positionin the second table
Hash Tables with Worst-Case Access
48
cuckoo hashing (cont.)example: insert A
insert B (displace A)
Hash Tables with Worst-Case Access
49
cuckoo hashing (cont.)insert C
insert D (displace C) and E
Hash Tables with Worst-Case Access
50
cuckoo hashing (cont.)insert F (displace E) (E displaces A)
(A displaces B) (B relocated)
Hash Tables with Worst-Case Access
51
cuckoo hashing (cont.)insert G
displacements are _______________G D B A E F C G
also results in a displacement cycle
Hash Tables with Worst-Case Access
52
cuckoo hashing (cont.)cycles
_______insertions should require < displacements
if a certain number of displacements is reached on aninsertion, tables can be with new hashfunctions
Hash Tables with Worst-Case Access
53
cuckoo hashing (cont.)variations
higher number of tables (3 or 4)place item in second hash slot immediately instead of________________ other itemsallow each cell to store keys
space utilization increased
Hash Tables with Worst-Case Access
54
cuckoo hashing (cont.)benefits
worst-case lookup and deletion timesavoidance of _____________________potential for __________________
potential issuesextremely sensitive to choice of hash functionstime for insertion increases rapidly as load factorapproaches 0.5
Hash Tables with Worst-Case Access
55
hopscotch hashingimproves on linear probing algorithm
linear probing tries cells in sequential order, startingfrom hash location, which can be long due to primaryand secondary clusteringinstead, hopscotch hashing places a bound on__________________ of the probe sequence
results in worst-case constant-time lookupcan be parallelized
Hash Tables with Worst-Case Access
56
hopscotch hashing (cont.)if insertion would place an element too far from itshash location, go backward and otherelements
evicted elements cannot be placed farther than themaximal length
each position in the table contains information aboutthe current element inhabiting it, plus others that________ to it
Hash Tables with Worst-Case Access
57
hopscotch hashing (cont.)example: MAX_DIST = 4
each bit string provides 1 bit of information about thecurrent position and the next 3 that follow
1: item hashes to current location; 0: no
Hash Tables with Worst-Case Access
58
hopscotch hashing (cont.)example: insert H in 9
try in position 13, but too far, so try candidates foreviction (10, 11, 12)evict G in 11
Hash Tables with Worst-Case Access
59
hopscotch hashing (cont.)example: insert I in 6
position 14 too far, so try positions 11, 12, 13G can move down oneposition 13 still too far; F can move down one
Hash Tables with Worst-Case Access
60
hopscotch hashing (cont.)example: insert I in 6
position 12 still too far, so try positions 9, 10, 11B can move down threenow slot is open for I, fourth from 6
Hash Tables with Worst-Case Access
61
universal hashingin principle, we can end up with a situation where all ofour keys are hashed to the in thehash table (bad)more realistically, we could choose a hash functionthat does not distribute the keysto avoid this, we can choose the hash function______________ so that it is independent of the keysbeing storedyields provably good performance on average
Hash Tables with Worst-Case Access
62
universal hashing (cont.)let be a finite collection of functions mappingour set of keys to the range }
is a collection if for each pair ofdistinct keys , the number of hash functions
for which is at mostthat is, with a randomly selected hash function ,the chance of a between distinct andis not more than the probability of a collision if
and were chosen randomly andindependently from }
Hash Tables with Worst-Case Access
63
universal hashing (cont.)example: choose a prime sufficiently large that everykey is in the range 0 to (inclusive)let andthen the family
is a universal class of hash functions
Hash Tables with Worst-Case Access
64
extendible hashingamount of data too large to fit in _____________
main consideration is then the number of diskaccesses
assume we need to store records andrecords fit in one disk blockcurrent problems
if probing or separate chaining is used, collisionscould cause to be examinedduring a searchrehashing would be expensive in this case
Hash Tables with Worst-Case Access
65
extendible hashing (cont.)allows search to be performed in disk accesses
insertions require a bit moreuse B-tree
as increases, height of B-tree ______________could make height = 1, but multi-way branchingwould be extremely high
Hash Tables with Worst-Case Access
66
extendible hashing (cont.)example: 6-bit integers
root contains 4 pointers determined by first 2 bitseach leaf has up to 4 elements
Hash Tables with Worst-Case Access
67
extendible hashing (cont.)example: insert 100100
place in third leaf, but fullsplit leaf into 2 leaves, determined by 3 bits
Hash Tables with Worst-Case Access
68
extendible hashing (cont.)example: insert 000000
first leaf split
Hash Tables with Worst-Case Access
69
extendible hashing (cont.)considerations
several directory may be required if theelements in a leaf agree in more than D+1 leadingbits
number of bits to distinguish bit stringsdoes not work well with (duplicates: does not work at all)
Hash Tables with Worst-Case Access
70
final pointschoose hash function carefullywatch _________________
separate chaining: close to 1probing hashing: 0.5
hash tables have some _______________not possible to find min/maxnot possible to for a string unless theexact string is known
binary search trees can do this, and is onlyslightly worse than
Hash Tables with Worst-Case Access
71
final points (cont.)hash tables good for
symbol tablegaming
remembering locations to avoid recomputingthrough transposition table
spell checkers