
Page 1: Intro to Hashing

Hashing Part One: Reaching for the Perfect Search

Most of this material stolen from "File Structures" by Folk, Zoellick and Riccardi

Page 2: Searching for Data in Large Files

Text File v. Binary File

Unordered Binary File
◦ average search takes N/2 file operations

Ordered Binary File
◦ average search takes log2(N) file operations
◦ but keeping the data file sorted is costly

Indexed File
◦ average search takes 3 or 4 file operations

Perfect Search
◦ search time = 1 file read

Page 3: Hash Function

Definition:
◦ a magic black box that converts a key to the file address of that record

[Diagram: the key "Dannelly" is fed to the hash function, which returns the address of Dannelly's record (Name, Field1, Field2, ...).]

Page 4: Example Hashing Function

◦ Key = Customer's Name
◦ Function = multiply the ASCII codes of the 1st and 2nd letters, then use the rightmost 3 digits of the product

Name      ASCII product     RRN
BALL      66 x 65 = 4290    290
LOWELL    76 x 79 = 6004    004
TREE      84 x 82 = 6888    888
OLIVIER   79 x 76 = 6004    004
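
As a concrete illustration, here is a minimal Python sketch of this toy function (the name toy_hash is an assumption; the rule itself is taken from the table above):

    # Toy hash from the slide: multiply the ASCII codes of the first two
    # letters and keep the rightmost three digits of the product as the RRN.
    def toy_hash(name):
        product = ord(name[0]) * ord(name[1])
        return product % 1000            # rightmost 3 digits

    # Reproduces the table above (note that LOWELL and OLIVIER collide):
    for key in ("BALL", "LOWELL", "TREE", "OLIVIER"):
        print(key, "%03d" % toy_hash(key))   # 290, 004, 888, 004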

Page 5: Collision

Definition:
◦ When two or more keys hash to the same address.

Minimizing the Number of Collisions:
1) pick a hash function that avoids collisions, i.e. one with a seemingly random distribution
   ◦ e.g. our previous function is terrible because letters like "E" and "L" occur frequently, while no one's name starts with "XZ"
2) spread out the records
   ◦ 300 records in a file with space for 1,000 records will have many fewer collisions than 300 records in a file with a capacity of 400
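
The claim about spreading out the records is easy to check empirically. Below is a small Python sketch (count_collisions is a hypothetical helper, and the "hash" is just a uniformly random address, which is the idealized case):

    import random

    def count_collisions(n_records, file_size, trials=200):
        # Average number of records whose home address is already occupied
        # when n_records keys hash uniformly at random into file_size slots.
        total = 0
        for _ in range(trials):
            used = set()
            for _ in range(n_records):
                addr = random.randrange(file_size)
                if addr in used:
                    total += 1
                used.add(addr)
        return total / trials

    # 300 records, two different file capacities (as in the slide above):
    print(count_collisions(300, 400))    # roughly 89 collisions on average
    print(count_collisions(300, 1000))   # roughly 41 collisions on average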

Page 6: Hash Function Selection

Our objective is to muddle the relationship between the keys and the addresses.

Good Ideas:
◦ use both addition and multiplication
◦ avoid integer overflow, so mix in some subtraction and division too
◦ divide by prime numbers

Page 7: Improved Hash Function

1. pad the name with spaces
2. fold and add pairs of letters
3. mod by a prime after each add
4. divide the sum by the file size and keep the remainder

Example: Key = "LOWELL" and file size = 1,000

L  O  W  E  L  L  (padded to 12 characters with spaces)
76 79 | 87 69 | 76 76 | 32 32 | 32 32 | 32 32

 7679 + 8769 = 16,448    16,448 % 19,937 = 16,448
16448 + 7676 = 24,124    24,124 % 19,937 =  4,187
 4187 + 3232 =  7,419     7,419 % 19,937 =  7,419
 7419 + 3232 = 10,651    10,651 % 19,937 = 10,651
10651 + 3232 = 13,883    13,883 % 19,937 = 13,883

13883 % 1000 = 883

Why 19,937? It is the largest prime that ensures the next add will not cause integer overflow.
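
A minimal Python sketch of this fold-and-add scheme, assuming a 12-character pad and the hypothetical name fold_and_add_hash (Python integers never overflow, so the mod by 19,937 is kept only to mirror the slide's arithmetic):

    def fold_and_add_hash(key, file_size, pad_to=12, prime=19937):
        key = key.ljust(pad_to)                      # 1. pad the name with spaces
        total = 0
        for i in range(0, pad_to, 2):
            # 2. fold: concatenate the pair of ASCII codes, e.g. 'L','O' -> 7679
            pair = int(str(ord(key[i])) + str(ord(key[i + 1])))
            total = (total + pair) % prime           # 3. mod by a prime after each add
        return total % file_size                     # 4. remainder of sum / file size

    print(fold_and_add_hash("LOWELL", 1000))         # 883, matching the example above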

Page 8: Class Exercise

The simplest hash function for a string is "add up all the characters, then divide by the file size" (keeping the remainder).

For example,
◦ file size = 100 records
◦ key = "pen"
◦ address = (16 + 5 + 14) % 100 = 35

1. Find another word with the same mapping.

2. Give an improvement to this hash function.

a  b  c  d  e  f  g  h  i  j  k  l  m  n  o  p  q  r  s  t  u  v  w  x  y  z
1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
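
For reference, the simple hash function from this slide looks like this in Python (letter_sum_hash is a hypothetical name; it reproduces the "pen" example above but is not an answer to the exercise):

    def letter_sum_hash(word, file_size=100):
        # Add the letter values (a=1 .. z=26), then keep the remainder
        # after dividing by the file size.
        total = sum(ord(c) - ord('a') + 1 for c in word.lower())
        return total % file_size

    print(letter_sum_hash("pen"))    # (16 + 5 + 14) % 100 = 35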

Page 9: Optimal Hash Function

The optimal hash function for a set of keys:
1. will evenly distribute the keys across the address space, and
2. will give every address an equal chance of being used.

Uniform distribution is nearly impossible.

[Diagram: keys A through E mapped into an address space of ten slots, contrasting a good mapping, where the keys are spread across the available addresses, with a poor mapping, where several keys compete for the same few addresses.]

Page 10: Selecting a File Size

Suppose we have a file of 10,000 records. Finding a hash function that will take our 10,000 keys and yield 10,000 different addresses is essentially impossible.

So, our 10,000 records are stored in a larger file. How much larger than 10,000?
◦ 10,500?
◦ 12,000?
◦ 50,000?

It depends. A larger data file means:
◦ more empty (wasted) space
◦ fewer collisions

Page 11: When Collisions Occur

Even with a very good hash function, collisions will occur.

We must have an algorithm to locate alternative addresses.

Example:
◦ Suppose "dog" and "cat" both hash to location 25.
◦ If we add "dog" first, then dog goes in location 25.
◦ If we later add "cat", where does it go?
◦ The same idea applies to searching. If cat is supposed to be at 25 but dog is there, where do we look next?

Page 12: Simple Collision Resolution

"Linear Probing" or "Progressive Overflow"

When a key maps to an address already in use, just try the next one. If that one is in use, try the next one. yadda yadda

Easy to implement.

Usually works well, especially with a non-dense file and a good hash function.

Can lead to clumps of records.
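
A minimal Python sketch of linear probing (an in-memory list stands in for the data file, records are just keys, and the probe counter anticipates the exercise on the next slide; all of that is assumed, not from the slides):

    def probe_insert(table, key, home):
        # Start at the home address and step forward (wrapping around)
        # until an empty slot is found.  Assumes the table is not full.
        addr = home % len(table)
        while table[addr] is not None:
            addr = (addr + 1) % len(table)
        table[addr] = key
        return addr

    def probe_search(table, key, home):
        # Follow the same probe sequence; an empty slot means "not found".
        addr = home % len(table)
        probes = 1
        while table[addr] is not None and probes <= len(table):
            if table[addr] == key:
                return addr, probes
            addr = (addr + 1) % len(table)
            probes += 1
        return None, probes

    # Page 11's example: "dog" and "cat" both hash to location 25.
    table = [None] * 100
    print(probe_insert(table, "dog", 25))    # 25
    print(probe_insert(table, "cat", 25))    # 26, the next free slot
    print(probe_search(table, "cat", 25))    # (26, 2): found after 2 probes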

Page 13: Clumping

Assume these keys map to these addresses:
1. adams = 20
2. bates = 22
3. cole = 20
4. dean = 21
5. evans = 23

Where will each record be placed if inserted in that order?

Using linear probing, how many file accesses are needed for each?

Page 14: Next Class

How many collisions are acceptable?
◦ Analysis: packing density vs. probing length

Is there a collision resolution algorithm better than linear probing?
◦ buckets