ECE 250 Algorithms and Data Structures
Douglas Wilhelm Harder, M.Math. LEL
Department of Electrical and Computer Engineering
University of Waterloo
Waterloo, Ontario, Canada
Copyright © 2006-11 by Douglas Wilhelm Harder. All rights reserved.
Introduction to Hash Tables
Hash Functions
Outline
• Discuss storing unordered data
• Discuss IP addresses and domain names
• Consider conversions between these two forms
• Introduce the idea of hashing:
  – Reducing O(ln(n)) operations to O(1)
• Consider some of the weaknesses
Background
• Given a set of data, suppose each data item is associated with a unique integer key in a particular range:
  – UW Student ID Numbers (decimal)
  – Social Insurance Numbers (decimal)
  – IP Addresses (binary)
• We will use IP addresses as a model
• Databases often assign unique primary keys by using an automatically incremented counter
IP Addresses
• Each computer communicating on a network using the Internet Protocol Version 4 (IPv4) has a unique IP address
  – 32 bits/4 bytes, allowing 2^32 (about 4.3 billion) addresses
• Represented in human-readable format as four bytes written in decimal and separated by dots
  – The ECE web server has the IP address 129.97.56.100
  – The URL http://129.97.56.100/ is valid
  – IP addresses are not easy to remember
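A dotted address is the same 32-bit integer written one byte at a time. The following is a minimal Python sketch of the conversion in both directions; the function names ipv4_to_int and int_to_ipv4 are illustrative and do not come from the slides.

```python
def ipv4_to_int(address: str) -> int:
    """Pack a dotted-decimal IPv4 address into a single 32-bit integer."""
    parts = [int(p) for p in address.split(".")]
    if len(parts) != 4 or any(not 0 <= p <= 255 for p in parts):
        raise ValueError(f"not a valid IPv4 address: {address!r}")
    value = 0
    for byte in parts:
        value = (value << 8) | byte  # shift in one byte at a time
    return value


def int_to_ipv4(value: int) -> str:
    """Unpack a 32-bit integer back into dotted-decimal form."""
    return ".".join(str((value >> shift) & 0xFF) for shift in (24, 16, 8, 0))


# The ECE web server address from the slide, as an integer and back again.
as_int = ipv4_to_int("129.97.56.100")
print(as_int)
print(int_to_ipv4(as_int))
```

Because the key is an ordinary integer, everything said later about integer keys applies to IP addresses directly.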
IP Addresses
• Domain names were introduced for humans
• The Domain Name System (DNS) is hierarchical
  – There are a limited number of top-level domains
  – Countries use ISO 3166 country codes
  – Responsibility for 2nd-, 3rd-, and lower-level domains lies with the parent domain
IP Addresses
• The Unix command host translates between IP addresses and domain names
$ host uwaterloo.ca
uwaterloo.ca has address 129.97.128.40
$ host ece.uwaterloo.ca
ece.uwaterloo.ca has address 129.97.56.100
$ host www.uwaterloo.ca
www.uwaterloo.ca is an alias for info.uwaterloo.ca.
info.uwaterloo.ca has address 129.97.128.40
$ host www.google.ca
www.google.ca is an alias for www.google.com.
www.google.com is an alias for www.l.google.com.
www.l.google.com has address 72.14.205.99
www.l.google.com has address 72.14.205.103
www.l.google.com has address 72.14.205.104
www.l.google.com has address 72.14.205.147
IP Addresses
• The mapping is not one-to-one
  – Some IP addresses have multiple domain names
  – Some domain names have multiple IP addresses
• This allows flexibility:
IP Addresses
DNS allows a division of effort in name translation
  – A DNS server in Korea (kr) does not need to know the IP address of uwaterloo.ca
Similarly, the University of Waterloo has control of names within its domain
  – Any IP address starting with 129.97 belongs to UW
  – This gives UW 256^2 = 65536 IP addresses
IP Addresses
The translation of IP addresses to domain names is straightforward:
  – Use an array of size 65536
  – Use the last two bytes of the address as the index, e.g., 129.97.90.209 maps to 90 × 256 + 209 = 23249

Index   Address         Domain Name
23240   129.97.90.200   sidicsem.uwaterloo.ca
23241   129.97.90.201   watdist8.uwaterloo.ca
23242   129.97.90.202   NO DOMAIN NAME
23243   129.97.90.203   secure0.uwaterloo.ca
23244   129.97.90.204   msma.uwaterloo.ca
23245   129.97.90.205   ehab0.uwaterloo.ca
23246   129.97.90.206   calliope1.uwaterloo.ca
23247   129.97.90.207   calliope2.uwaterloo.ca
23248   129.97.90.208   dsip-lpt.uwaterloo.ca
23249   129.97.90.209   churchill.uwaterloo.ca
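The direct-indexing scheme in the table can be sketched in a few lines of Python. This is only an illustration under the assumption that every address starts with 129.97; the names index_of and lookup are hypothetical, and only two of the table rows are loaded.

```python
UW_PREFIX = (129, 97)      # first two bytes shared by every UW address
TABLE_SIZE = 256 * 256     # one slot for each (third byte, fourth byte) pair

# None marks an address with no domain name.
names = [None] * TABLE_SIZE


def index_of(address: str) -> int:
    """Map a 129.97.c.d address to the array index c * 256 + d."""
    a, b, c, d = (int(p) for p in address.split("."))
    if (a, b) != UW_PREFIX:
        raise ValueError(f"{address} is not a UW address")
    return c * 256 + d


def lookup(address: str):
    """Constant-time translation from IP address to domain name."""
    return names[index_of(address)]


# Two rows taken from the table above.
names[index_of("129.97.90.209")] = "churchill.uwaterloo.ca"
names[index_of("129.97.90.203")] = "secure0.uwaterloo.ca"

print(index_of("129.97.90.209"), lookup("129.97.90.209"))
```

The lookup does one multiplication, one addition and one array access, no matter how many names are stored.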
IP Addresses
In this example, the solution is clear:
  – The array is fixed in size
  – The array is almost filled (dense)
    • UW currently uses 65% of possible IP addresses
  – The translation from IP address to domain name is Θ(1)
There are two problems we will examine:
  1. What if the array is very sparse?
  2. How do we go the other way?
IP Addresses
Problem #1:
  – UW uses 65% of its 2^16 IP addresses
  – What if the relative number of used addresses is small?
The new standard, IPv6, uses 128-bit addresses
  – Allows 2^128 (about 340 undecillion) IP addresses
    • ~7 IP addresses per cubic micrometre of atmosphere
  – Removes the need for Network Address Translation (NAT)
Suppose UW is assigned 2^32 addresses
  – We cannot have an array with four billion entries
IP Addresses
An array storing (domain name, IP address) pairs sorted on the IP address would be slow to maintain
  – The IP address is the key and the domain name is the string associated with it
  – Inserting or deleting a domain name would require O(n) work
  – Accessing an entry would require a binary search: O(ln(n))
Using an AVL tree instead, all operations would still require O(ln(n)) time
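To make the costs concrete, here is a small Python sketch of the sorted-array approach using the standard bisect module. The two parallel lists and the function names insert and find are assumptions made for the illustration; the keys are the integer indices from the earlier table.

```python
import bisect

keys = []    # integer keys, always kept in sorted order
values = []  # values[i] is the string stored with keys[i]


def insert(key: int, value: str) -> None:
    """Insert or update a pair, keeping the keys sorted."""
    i = bisect.bisect_left(keys, key)   # O(ln(n)) to find the position
    if i < len(keys) and keys[i] == key:
        values[i] = value
    else:
        keys.insert(i, key)             # O(n): every later entry shifts by one
        values.insert(i, value)


def find(key: int):
    """Binary search for a key; returns None if it is absent."""
    i = bisect.bisect_left(keys, key)   # O(ln(n)) comparisons
    if i < len(keys) and keys[i] == key:
        return values[i]
    return None


insert(23249, "churchill.uwaterloo.ca")
insert(23240, "sidicsem.uwaterloo.ca")
insert(23243, "secure0.uwaterloo.ca")
print(keys)
print(find(23243))
```

The search is fast, but the list insertion has to move entries, which is where the O(n) maintenance cost comes from.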
IP Addresses
Can we do better than O(ln(n))?
  – Can we get it down to Θ(1)?
Problem:
  – So long as we require that the entries are sorted, we cannot do better than O(ln(n))
Do we care about the order?
  – Do we need to know the IP address of the domain name which comes alphabetically after churchill.uwaterloo.ca?
UW Student ID Numbers
Let’s start with an easier example:
  – Each UW student is assigned an 8-digit UW Student ID Number
  – Allocating an array of size 10^8 is wasteful
  – UW has only had ~10^5 students
  – There are only ~10^2 students in this class
Suppose I want to store the grade associated with each student in this class
UW Student ID Numbers
Solution:
  – Allocate an array of 1000 bins
  – The bins are labeled 000, 001, ..., 999
  – Store the mark of the student with number 20123456 in bin 456
Benefits:
  – Taking the modulo 1000 is Θ(1)
    • Modulo n is the remainder after dividing by n
  – Accessing an array entry is also Θ(1)
  – Only 100 students: about 1 in 10 bins is filled
[Figure: bins 454 to 465 of the array; bin 456 holds the mark 84, bin 463 holds the mark 79, and the other bins shown are empty]
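The bin scheme can be sketched in Python as follows. The helper names bin_of, store_mark and get_mark are made up for this illustration, and the sketch deliberately ignores what happens when two students land in the same bin.

```python
NUM_BINS = 1000
marks = [None] * NUM_BINS   # None marks an empty bin


def bin_of(student_id: int) -> int:
    """The last three decimal digits of the ID select the bin."""
    return student_id % NUM_BINS


def store_mark(student_id: int, mark: int) -> None:
    # No collision handling: a second ID with the same last three
    # digits would silently overwrite this bin.
    marks[bin_of(student_id)] = mark


def get_mark(student_id: int):
    return marks[bin_of(student_id)]


store_mark(20123456, 84)    # the example from the slide
print(bin_of(20123456), get_mark(20123456))
```

Both the modulo and the array access take constant time, which is the whole appeal. Any other ID ending in 456 would read and overwrite the same bin.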
UW Student ID Numbers
Problem:
  – Multiple students may have the same last three digits
  – Assuming the last three digits are random:
    • What is the probability that all 100 students will have a different set of last three digits?
    • Answer: about 0.6%
Similar question:
  – What is the probability that, in a group of 23 students, no two students share the same birthday?
  – Answer: 49%
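Both questions are the same calculation: place items one at a time into equally likely bins and multiply the chances that each new item misses every bin already occupied. A short Python sketch follows, assuming uniformly random last digits and birthdays and ignoring leap years; the function name prob_all_distinct is illustrative.

```python
def prob_all_distinct(items: int, bins: int) -> float:
    """Probability that `items` uniformly random choices out of `bins`
    equally likely values are all different."""
    p = 1.0
    for already_used in range(items):
        # The next item must avoid the bins taken so far.
        p *= (bins - already_used) / bins
    return p


# 100 students hashed by their last three digits into 1000 bins.
print(f"{prob_all_distinct(100, 1000):.4f}")

# Birthday-sized groups with 365 possible birthdays.
for group in (22, 23):
    print(group, f"{prob_all_distinct(group, 365):.4f}")
```

Even with ten times as many bins as students, a collision-free assignment is very unlikely, so collisions have to be handled rather than hoped away.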
UW Student ID Numbers
• The process of mapping a number onto a smaller range is called hashing
• The situation where multiple objects hash to the same value is called a collision
• Hash tables use a hash function together with a mechanism for dealing with collisions
IP Addresses
Going back to our issue with UW being assigned 2^32 128-bit IP addresses
  – Assume we will use at most 2^20 IP addresses
  – Allocate an array of size 2^20
  – Define a hash function which takes 128-bit inputs and maps them down to a number from 0, ..., 2^20 – 1
  – Deal with collisions
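As an illustration of the third step, here is a Python sketch that hashes a 128-bit IPv6 address down to a 20-bit index. Taking the address modulo the table size is an assumption chosen for simplicity, not necessarily the function a real system would use, and the addresses come from the 2001:db8:: documentation range rather than from UW.

```python
import ipaddress

TABLE_BITS = 20
TABLE_SIZE = 1 << TABLE_BITS    # 2^20 array entries


def hash_ipv6(address: str) -> int:
    """Map a 128-bit IPv6 address to an index in 0, ..., 2^20 - 1."""
    value = int(ipaddress.IPv6Address(address))  # the address as a 128-bit integer
    return value % TABLE_SIZE                    # keeps only the low 20 bits


print(hash_ipv6("2001:db8::1"))
print(hash_ipv6("2001:db8::10:1"))   # differs only above bit 20
```

The two example addresses differ only in bits above the low 20, so they land on the same index. This is exactly the kind of collision the last bullet refers to.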
IP Addresses
Problem #2:
  – How do we go the other way?
For example, given churchill.uwaterloo.ca, how do we find the corresponding IP address 129.97.90.209?
Even with the 32-bit IP addresses of today, this is still a significant problem
Same idea:
  – Take a hash of the string which maps it to a value in the range 0, ..., 2^16 – 1
  – Deal with collisions and look it up in an array of size 2^16
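One possible string hash is sketched below in Python. The polynomial scheme and the multiplier 31 are assumptions picked for the illustration (the slides do not specify a function), and the name hash_name is hypothetical.

```python
TABLE_SIZE = 1 << 16    # indices 0, ..., 2^16 - 1


def hash_name(name: str) -> int:
    """Polynomial hash of a domain name, reduced to a 16-bit index."""
    h = 0
    for ch in name.lower():   # domain names are case-insensitive
        # Multiply the running value by 31, add the next character code,
        # and keep the result inside the table at every step.
        h = (h * 31 + ord(ch)) % TABLE_SIZE
    return h


index = hash_name("churchill.uwaterloo.ca")
print(index)
```

The work is proportional to the length of the name, not to the number of names stored, so the lookup cost does not grow with the table.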
IP Addresses
• We will break the process into three independent steps:
  1. Object → 32-bit integer
     – Techniques vary...
  2. Map to an index 0, ..., M – 1
     – Modulo, mid-square, multiplicative, Fibonacci
  3. Deal with collisions
     – Chained hash tables
     – Open addressing: linear probing, double hashing
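The three steps can be seen together in a small chained hash table. This Python sketch makes several assumptions: it uses the built-in hash() masked to 32 bits for the first step, plain modulo for the second, and a list per bin (chaining) for the third. The class and method names are illustrative only.

```python
class ChainedHashTable:
    """A minimal chained hash table with M bins."""

    def __init__(self, num_bins: int = 16):
        self.bins = [[] for _ in range(num_bins)]   # one chain per bin

    def _index(self, key) -> int:
        h32 = hash(key) & 0xFFFFFFFF    # step 1: object -> 32-bit integer
        return h32 % len(self.bins)     # step 2: map to an index 0, ..., M - 1

    def put(self, key, value) -> None:
        chain = self.bins[self._index(key)]
        for i, (existing_key, _) in enumerate(chain):
            if existing_key == key:
                chain[i] = (key, value)     # update an existing entry
                return
        chain.append((key, value))          # step 3: colliding keys share a chain

    def get(self, key, default=None):
        for existing_key, value in self.bins[self._index(key)]:
            if existing_key == key:
                return value
        return default


table = ChainedHashTable(num_bins=4)
table.put("churchill.uwaterloo.ca", "129.97.90.209")
table.put("ece.uwaterloo.ca", "129.97.56.100")
print(table.get("churchill.uwaterloo.ca"))
```

Because the steps are independent, any one of them can be swapped out, for example replacing chaining with open addressing, without touching the other two.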
Summary
• Discussed storing unordered data
• Discussed IP addresses and domain names
• Considered conversions between these two forms
• Introduced the idea of using a smaller array
  – Converted “large” numbers into valid array indices
  – Reduces O(ln(n)) in arrays and AVL trees to O(1)
• Discussed the issues with collisions
Usage Notes
• These slides are made publicly available on the web for anyone to use
• If you choose to use them, or a part thereof, for a course at another institution, I ask only three things:
  – that you inform me that you are using the slides,
  – that you acknowledge my work, and
  – that you alert me of any mistakes which I made or changes which you make, and allow me the option of incorporating such changes (with an acknowledgment) in my set of slides
Sincerely,
Douglas Wilhelm Harder, MMath