
ECE 250 Algorithms and Data Structures

Douglas Wilhelm Harder, M.Math. LEL

Department of Electrical and Computer Engineering

University of Waterloo

Waterloo, Ontario, Canada

Copyright © 2006-11 by Douglas Wilhelm Harder. All rights reserved.

Introduction to Hash Tables

Hash Functions


Outline

• Discuss storing unordered data
• Discuss IP addresses and domain names
• Consider conversions between these two forms
• Introduce the idea of hashing:

– Reducing O(ln(n)) operations to O(1)

• Consider some of the weaknesses

Background

• Given a set of data, suppose each data item is associated with a unique integer key in a particular range:
– UW Student ID Numbers (decimal)
– Social Insurance Numbers (decimal)
– IP addresses (binary)

• We will use IP addresses as a model
• Databases often assign unique primary keys by using an automatically incremented counter

IP Addresses

• Each computer communicating on a network using the Internet Protocol Version 4 (IPv4) has a unique IP address
– 32 bits (4 bytes), allowing about four billion (2^32) addresses

• Represented in human-readable format as four bytes written in decimal
– The ECE web server has the IP address 129.97.56.100
– The URL http://129.97.56.100/ is valid
– IP addresses are not easy to remember
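A dotted-decimal address is a readable spelling of a single 32-bit integer. A minimal Python sketch of the conversion follows; the function names `ip_to_int` and `int_to_ip` are illustrative and do not come from the slides.

```python
def ip_to_int(address: str) -> int:
    """Pack a dotted-decimal IPv4 address into a single 32-bit integer."""
    parts = [int(p) for p in address.split(".")]
    if len(parts) != 4 or any(not 0 <= p <= 255 for p in parts):
        raise ValueError(f"not a valid IPv4 address: {address!r}")
    value = 0
    for byte in parts:
        value = (value << 8) | byte  # shift in one byte at a time
    return value


def int_to_ip(value: int) -> str:
    """Unpack a 32-bit integer back into dotted-decimal form."""
    return ".".join(str((value >> shift) & 0xFF) for shift in (24, 16, 8, 0))


print(ip_to_int("129.97.56.100"))
print(int_to_ip(ip_to_int("129.97.56.100")))
```

The integer form is what makes an address usable as a key or an array index.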

IP Addresses

• Domain names were introduced for humans
• The Domain Name System (DNS) is hierarchical
– There are a limited number of top-level domains
– Countries use ISO 3166 country codes
– Responsibility for 2nd-, 3rd-, and lower-level domains lies with the parent domain

IP Addresses

• The Unix command host translates between IP addresses and domain names

$ host uwaterloo.ca

uwaterloo.ca has address 129.97.128.40

$ host ece.uwaterloo.ca

ece.uwaterloo.ca has address 129.97.56.100

$ host www.uwaterloo.ca

www.uwaterloo.ca is an alias for info.uwaterloo.ca.

info.uwaterloo.ca has address 129.97.128.40

$ host www.google.ca

www.google.ca is an alias for www.google.com.

www.google.com is an alias for www.l.google.com.

www.l.google.com has address 72.14.205.99

www.l.google.com has address 72.14.205.103

www.l.google.com has address 72.14.205.104

www.l.google.com has address 72.14.205.147

IP Addresses

• The mapping is not one-to-one
– Some IP addresses have multiple domain names
– Some domain names have multiple IP addresses

• This allows flexibility:

IP Addresses

DNS allows a division of effort in name translation
– A DNS server in Korea (kr) does not need to know the IP address of uwaterloo.ca

Similarly, the University of Waterloo has control of names within its domain
– Any IP address starting with 129.97 belongs to UW
– This gives UW 256^2 = 65536 IP addresses

IP Addresses

The translation of IP addresses to domain names is straightforward:
– Use an array of size 65536
– Index each address 129.97.x.y by its last two bytes: x × 256 + y
– E.g., 90 × 256 + 209 = 23249

Index   Address         Domain Name
23240   129.97.90.200   sidicsem.uwaterloo.ca
23241   129.97.90.201   watdist8.uwaterloo.ca
23242   129.97.90.202   NO DOMAIN NAME
23243   129.97.90.203   secure0.uwaterloo.ca
23244   129.97.90.204   msma.uwaterloo.ca
23245   129.97.90.205   ehab0.uwaterloo.ca
23246   129.97.90.206   calliope1.uwaterloo.ca
23247   129.97.90.207   calliope2.uwaterloo.ca
23248   129.97.90.208   dsip-lpt.uwaterloo.ca
23249   129.97.90.209   churchill.uwaterloo.ca
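The array lookup above can be sketched in Python as a direct-address table. The names `UW_PREFIX`, `TABLE_SIZE`, `index_of`, and `names` are illustrative assumptions, and only two rows of the table are filled in.

```python
UW_PREFIX = (129, 97)      # first two bytes shared by every UW address
TABLE_SIZE = 256 * 256     # one slot per possible (third, fourth) byte pair


def index_of(address: str) -> int:
    """Return the array index for a 129.97.x.y address: x * 256 + y."""
    b1, b2, b3, b4 = (int(p) for p in address.split("."))
    if (b1, b2) != UW_PREFIX:
        raise ValueError(f"{address} does not start with 129.97")
    return b3 * 256 + b4


# None marks an address with no domain name.
names = [None] * TABLE_SIZE
names[index_of("129.97.90.200")] = "sidicsem.uwaterloo.ca"
names[index_of("129.97.90.209")] = "churchill.uwaterloo.ca"

print(index_of("129.97.90.209"))         # 23249
print(names[index_of("129.97.90.209")])  # churchill.uwaterloo.ca
```

Both computing the index and reading the array slot take constant time, which is why this direction of the translation is Θ(1).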

IP Addresses

In this example, the solution is clear:
– The array is fixed in size
– The array is almost filled (dense)
  • UW currently uses 65% of possible IP addresses
– The translation from IP address to domain name is Θ(1)

There are two problems we will examine:
1. What if the array is very sparse?

2. How do we go the other way?

IP Addresses

Problem #1:
– UW uses 65% of its 2^16 IP addresses
– What if the relative number of used addresses is small?

The new standard, IPv6, uses 128-bit addresses
– Allows 340 undecillion (about 3.4 × 10^38) IP addresses
– ~7 IP addresses per cubic micrometre of atmosphere
– Removes the need for Network Address Translation (NAT)

Suppose UW is assigned 2^32 addresses
– We cannot have an array with four billion entries

IP Addresses

An array storing (domain name, IP address) pairs sorted on the IP address would be slow to maintain
– The IP address is the key, and the domain name string is the value associated with it
– Adding or deleting a domain name would require O(n) work
– Accessing an entry would require a binary search, which is O(ln(n))

Using an AVL tree, all operations would still require O(ln(n)) time
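The costs of the sorted-array approach can be seen in a short Python sketch built on the standard `bisect` module. The two parallel lists and the function names `insert` and `lookup` are assumptions made for illustration.

```python
import bisect

keys = []    # sorted integer keys (IP addresses as integers)
values = []  # values[i] is the domain name stored for keys[i]


def insert(key: int, name: str) -> None:
    """Insert or overwrite an entry, keeping the keys sorted."""
    i = bisect.bisect_left(keys, key)   # binary search for the position
    if i < len(keys) and keys[i] == key:
        values[i] = name
    else:
        keys.insert(i, key)             # O(n): shifts every later entry
        values.insert(i, name)


def lookup(key: int):
    """Return the stored name, or None if the key is absent."""
    i = bisect.bisect_left(keys, key)   # binary search, O(ln(n))
    if i < len(keys) and keys[i] == key:
        return values[i]
    return None
```

Finding the position is fast, but `list.insert` must shift all later entries, so every insertion is linear in the worst case.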

IP Addresses

Can we do better than O(ln(n))?
– Can we get it down to Θ(1)?

Problem:
– So long as we require that the entries are sorted, we cannot do better than O(ln(n))

Do we care about the order?
– Do we need to know the IP address of the domain name which comes alphabetically after churchill.uwaterloo.ca?

UW Student ID Numbers

Let’s start with an easier example:
– Each UW student is assigned an 8-digit UW Student ID Number
– Allocating an array of size 10^8 is wasteful
– UW has only had ~10^5 students
– There are only ~10^2 students in this class

Suppose I want to store the grade associated with each student in this class

UW Student ID Numbers

Solution:
– Allocate an array of 1000 bins
– The bins are labeled 000, 001, ..., 999
– Store the mark of the student with number 20123456 in bin 456

Benefits:
– Taking the modulo 1000 is Θ(1)
  • Modulo n is the remainder after dividing by n
– Accessing an array entry is also Θ(1)
– Only 100 students: at most 1 in 10 bins is filled

[Figure: bins 454 to 465 of the array. Bin 456 holds the mark 84 and bin 463 holds the mark 79; the other bins shown are empty.]
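The binning scheme above can be sketched in a few lines of Python. The names `NUM_BINS`, `marks`, `bin_of`, and `store_mark` are illustrative; the student number 20123456 and the mark 84 come from the slide.

```python
NUM_BINS = 1000
marks = [None] * NUM_BINS   # None marks an empty bin


def bin_of(student_id: int) -> int:
    """Hash an 8-digit ID to a bin by keeping its last three digits."""
    return student_id % NUM_BINS


def store_mark(student_id: int, mark: int) -> None:
    # No collision handling here: a second student whose ID ends in the
    # same three digits would silently overwrite the first.
    marks[bin_of(student_id)] = mark


store_mark(20123456, 84)
print(bin_of(20123456))   # 456
print(marks[456])         # 84
```

The sketch deliberately ignores the case where two IDs share their last three digits.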

UW Student ID Numbers

Problem:
– Multiple students may have the same last three digits
– Assuming the last three digits are random:
  • What is the probability that all 100 students will have a different set of last three digits?
  • Answer: about 0.6%

Similar question:
– What is the probability that, in a group of 23 students, no two students share the same birthday?
– Answer: about 49%
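Both probabilities are the same calculation: the chance that a number of independent, uniformly random choices are all different. A small Python check follows, assuming uniform randomness and 365 equally likely birthdays; the function name `prob_all_distinct` is illustrative.

```python
def prob_all_distinct(items: int, bins: int) -> float:
    """Probability that `items` independent uniform choices among `bins`
    possibilities are all different."""
    p = 1.0
    for i in range(items):
        p *= (bins - i) / bins   # the next item must avoid i occupied bins
    return p


print(f"{prob_all_distinct(100, 1000):.4f}")  # 100 students, 1000 bins
print(f"{prob_all_distinct(22, 365):.4f}")    # 22 people, 365 birthdays
print(f"{prob_all_distinct(23, 365):.4f}")    # 23 people, 365 birthdays
```

The product comes to roughly 0.006 for 100 IDs in 1000 bins, and to roughly 0.52 and 0.49 for birthday groups of 22 and 23. Collisions are therefore almost certain even when only a tenth of the bins are in use.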

UW Student ID Numbers

• The process of mapping a number onto a smaller range is called hashing

• When multiple objects hash to the same value, this is called a collision

• Hash tables use a hash function together with a mechanism for dealing with collisions

IP Addresses

Going back to our issue with UW being assigned 2^32 128-bit IP addresses
– Assume we will use at most 2^20 IP addresses

– Allocate an array of size 2^20
– Define a hash function which takes 128-bit inputs and maps them down to a number from 0, ..., 2^20 – 1

– Deal with collisions
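One possible hash function for this step is sketched below in Python. The slides do not specify one, so the XOR-folding scheme is an assumption chosen only for illustration, and the address `2001:db8::1` is a documentation placeholder rather than a real UW address. Collision handling is left out.

```python
import ipaddress

M = 2 ** 20   # table size from the slide


def hash_ipv6(address: str) -> int:
    """Map a 128-bit IPv6 address to an index in 0, ..., M - 1.

    Illustrative choice only: XOR-fold the address in 20-bit chunks
    so that every bit of the input influences the index.
    """
    value = int(ipaddress.IPv6Address(address))   # 128-bit integer
    index = 0
    while value:
        index ^= value & (M - 1)   # take the low 20 bits
        value >>= 20               # move on to the next chunk
    return index


print(hash_ipv6("2001:db8::1"))
```

Any function that spreads the addresses in use evenly over the 2^20 bins would serve equally well here.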

IP Addresses

Problem #2:
– How do we go the other way?

For example, given churchill.uwaterloo.ca, how do we find the corresponding IP address 129.97.90.209?

Even with today's 32-bit IP addresses, this is still a significant problem

Same idea:
– Take a hash of the string which maps it to a value in the range 0, ..., 2^16 – 1
– Deal with collisions and look it up in an array of size 2^16
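A string can be hashed in the same spirit. The sketch below uses a simple polynomial hash with multiplier 31; both the scheme and the function name `hash_name` are assumptions rather than the method the slides go on to define, and the sketch assumes ASCII-only names.

```python
M = 2 ** 16   # table size from the slide


def hash_name(name: str) -> int:
    """Polynomial hash of a domain name, reduced to 0, ..., M - 1."""
    h = 0
    # Domain names are case-insensitive, so normalise before hashing.
    for byte in name.lower().encode("ascii"):
        h = (h * 31 + byte) % M   # 31 is an arbitrary odd multiplier
    return h


print(hash_name("churchill.uwaterloo.ca"))
```

Two different names may still produce the same index, so the array lookup must be paired with a collision strategy.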

IP Addresses

• We will break the process into three independent steps:
1. Object → 32-bit integer
– Techniques vary...
2. Map to an index 0, ..., M – 1
– Modulo, mid-square, multiplicative, Fibonacci
3. Deal with collisions
– Chained hash tables
– Open addressing: linear probing, double hashing
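The three steps can be seen together in a minimal chained hash table, sketched here in Python. The class name `ChainedHashTable` is illustrative. Python's built-in `hash` stands in for the object-to-integer step (it is not limited to 32 bits), modulo is used for the index step, and separate chaining handles collisions.

```python
class ChainedHashTable:
    """Minimal hash table using separate chaining (illustrative sketch)."""

    def __init__(self, num_bins: int = 16):
        self.bins = [[] for _ in range(num_bins)]   # one chain per bin

    def _index(self, key) -> int:
        # Step 1: object -> integer, using Python's built-in hash.
        # Step 2: integer -> index in 0, ..., M - 1, using modulo.
        return hash(key) % len(self.bins)

    def put(self, key, value) -> None:
        chain = self.bins[self._index(key)]
        for i, (k, _) in enumerate(chain):
            if k == key:
                chain[i] = (key, value)   # overwrite an existing key
                return
        chain.append((key, value))        # Step 3: colliding keys share a chain

    def get(self, key, default=None):
        for k, v in self.bins[self._index(key)]:
            if k == key:
                return v
        return default


table = ChainedHashTable()
table.put("churchill.uwaterloo.ca", "129.97.90.209")
print(table.get("churchill.uwaterloo.ca"))   # 129.97.90.209
```

Because the steps are independent, either the index mapping or the collision strategy could be replaced without touching the other.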

Summary

• Discuss storing unordered data
• Discuss IP addresses and domain names
• Consider conversions between these two forms
• Introduce the idea of using a smaller array
– Converted “large” numbers into valid array indices
– Reduces O(ln(n)) in arrays and AVL trees to O(1)

• Discussed the issues with collisions

Usage Notes

• These slides are made publicly available on the web for anyone to use
• If you choose to use them, or a part thereof, for a course at another institution, I ask only three things:
– that you inform me that you are using the slides,
– that you acknowledge my work, and
– that you alert me of any mistakes which I made or changes which you make, and allow me the option of incorporating such changes (with an acknowledgment) in my set of slides

Sincerely,

Douglas Wilhelm Harder, MMath

[email protected]