CS305/503, Spring 2009 Hash Tables Michael Barnathan

Post on 21-Dec-2015


Page 1: CS305/503, Spring 2009 Hash Tables Michael Barnathan

CS305/503, Spring 2009
Hash Tables

Michael Barnathan

Page 2: CS305/503, Spring 2009 Hash Tables Michael Barnathan

Here’s what we’ll be learning:
• Review of Assignment 3.
• Theory:
  – Keys and values.
  – What constitutes a good hash function?
• Data Structures:
  – Hash Tables.
• Collision Resolution:
  – Chaining
  – Open Addressing / Linear Probing
• Perfect Hashing
• Cuckoo Hashing

Page 3: CS305/503, Spring 2009 Hash Tables Michael Barnathan

Assignment 3 Grades

• Probably the most difficult assignment of the course.
• I always implement the assignments myself prior to handing them out to ensure that I’m not assigning something overwhelming.
  – This time, one person somehow accessed my solution and attempted to submit it back to me.
    • But I can recognize my own code.
  – That person received a 0 on the assignment (not shown on graph).

Page 4: CS305/503, Spring 2009 Hash Tables Michael Barnathan

Assignment 3
• This assignment tested several real-world development skills.
• It required learning a new class: Map.
  – But you had everything you needed to do so already, and have since you learned how to use vectors.
  – Today we’re just going over the theory behind hashing. This is how maps work, not how to use them.
  – From your perspective, Maps are just arrays that can use things other than numbers as indices.
    • .get() and .put() are otherwise the same.
    • Since the keys aren’t contiguous anymore, you need a means of getting a complete list of them: .keySet()
  – Learning how to use new libraries and classes is a vital development skill. (“You are being prepared to solve problems that do not exist yet”)
    • Or how will you come in and work on code that others have been developing for years?
    • Could you work on, say, the next version of Windows if you can’t learn how to use new libraries? Do you think they’re only using the standard classes?
• It required basic encapsulation to complete easily, but not a whole architecture.
  – If you tried to stuff each entry into a string, you needed to do extra work parsing the string so you could display it in the proper format.
  – But if you created a class for a Word, you could keep the part of speech and definition separate, then print them out as you wished.
• Aside from that, it required making tough design decisions.
  – To sort or search, and how to do each?
  – TreeMap or HashMap?
  – Fast range queries or fast individual word lookup?
• Finally, it required following a detailed spec, such as your clients will give you (but with more hints).
  – Ignoring parts of the spec, such as “next” or “-hash”, cost most of the lost points.
  – Incorrect implementations of these didn’t cost nearly as much as lack of implementation altogether.

Page 5: CS305/503, Spring 2009 Hash Tables Michael Barnathan

Assignment 3 – The Bright Side
• If this wasn’t a challenge, you’re ready to go out there and write production code.
• If it was, then you just may need more experience.
  – It does get easier over time.
  – If you haven’t been developing for a long time, you’re not behind the curve.
    • It took me a year to learn how to use functions!
      – (But I didn’t have any formal instruction then.)
    • The important thing is that you stay at it.
• This is only 6% of your final grade. Don’t stress out too much over it.

Page 6: CS305/503, Spring 2009 Hash Tables Michael Barnathan

Experience

• Ability arises mostly from years of borderline-obsessive work. You can’t acquire it overnight.

• Who here has been coding for at least:
  – 1 year?
  – 2 years?
  – 5 years?
  – 10 years?
  – 15 years?

• Who has coded anything on the side?

Page 7: CS305/503, Spring 2009 Hash Tables Michael Barnathan

Assignment 3 Code

• Let’s review the solution.

• Questions?

Page 8: CS305/503, Spring 2009 Hash Tables Michael Barnathan

Review: Arrays and Random Access
• Let’s review arrays for a moment:
• A size n array is indexed by a contiguous set of integers from 0 to n-1.
• Because the array is contiguous in memory, accessing any element of it can be performed in constant-time. This is random access.
• If the index actually represents something about the dataset, we can use this to access desired elements in constant-time.
• For example, asking “who is the 4th person up to bat?” in a baseball roster.
  – Answer: roster[3] (remember, they start at 0).
  – This is an O(1) operation.
[Figure: a roster array holding Bill Gates, Andrew Jackson, Henry Purcell, and Reggie White (“Worst team ever.”), with a speech bubble: “I’m fourth!”]

Page 9: CS305/503, Spring 2009 Hash Tables Michael Barnathan

Keys and Values
• An index is an example of a numeric key into the array.
• A key is an attribute or combination of attributes by which each record is identified.
  – Arr[3] identifies the fourth element in the array. In this case, the key is simply an element’s position in the array.
• But we can also identify records by attributes such as employee names and salaries.
  – These don’t map too well to array indices.
• The value of an element is the data accessed by the key.
  – For example, if Arr[3] was an Employee, “3” is the key and the resulting Employee object is the value.
• A container that maps directly between keys and values is called a Map (surprise!) or an associative array.

Page 10: CS305/503, Spring 2009 Hash Tables Michael Barnathan

Arrays’ Shortcomings
• Arrays work well if keys are contiguous integers.
  – Years in a calendar, for example.
• However, what if we have a non-numeric key?
  – In every data structure we’ve discussed so far, we have no choice but to search for it, which is an Ω(log n) operation.
[Figure: a crowd made up of Eve, Mallory, Bob, Alice, Trudy, John, and Charlie, with speech bubbles: “John? John?”, “I’m over here!”, “Who?”, “Don’t look at me!”, “No one ever listens to me…”]

Page 11: CS305/503, Spring 2009 Hash Tables Michael Barnathan

Mapping Data

• Idea: What if we could map the word “John” to an array index somehow?
  – “John” -> 5. Arr[5] = …
• Then finding “John” becomes equivalent to mapping “John” to 5 and accessing Arr[5].
• Arrays are random-access, so this is O(1).
• Obvious question: How do we turn “John” into 5? Why 5 and not 6?
• Less obvious question: What if “Bob” also maps to 5? What happens then?

Page 12: CS305/503, Spring 2009 Hash Tables Michael Barnathan

Maps and Mathematical Functions

• Go waaay back and think about the first time you heard the word “function”.

• It was something that took input and transformed it into output.

[Figure: the function f(x) = 2x takes the input 234 and produces the output 468.]

Page 13: CS305/503, Spring 2009 Hash Tables Michael Barnathan

Maps and Mathematical Functions

• So if we can do that, why not this?

[Figure: a black box h(x) takes the inputs Alice, Bob, and John and produces the outputs 0, 1, and 2.]

Page 14: CS305/503, Spring 2009 Hash Tables Michael Barnathan

Hashing: The Idea

• We call the process of transforming input with a function and using the result as an index hashing.

• This allows us to use strings or other objects as keys.

[Figure: h(x) maps Alice, Bob, and John to indices 0, 1, and 2 of double[] Salaries, which holds 50,000, 25,000, and 75,000.]
Salaries[“Alice”] = 50000
Salaries[“Bob”] = 25000
Salaries[“John”] = 75000
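Java arrays cannot actually be indexed by strings, so the notation above is conceptual; the nearest real equivalent is a Map. The following is a minimal sketch of the same salary example, assuming a hypothetical class name `SalaryDemo`:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical demo class; the names and salaries mirror the slide's example.
final class SalaryDemo {
    static Map<String, Double> buildSalaries() {
        Map<String, Double> salaries = new HashMap<>();
        salaries.put("Alice", 50000.0); // conceptually: Salaries["Alice"] = 50000
        salaries.put("Bob", 25000.0);
        salaries.put("John", 75000.0);
        return salaries;
    }

    public static void main(String[] args) {
        Map<String, Double> salaries = buildSalaries();
        System.out.println(salaries.get("John")); // look up a value by its string key
        // The keys are not contiguous integers, so we ask the map for them.
        for (String name : salaries.keySet()) {
            System.out.println(name + " -> " + salaries.get(name));
        }
    }
}
```

The iteration order of `keySet()` on a HashMap is unspecified, which is why no particular output order is promised here.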

Page 15: CS305/503, Spring 2009 Hash Tables Michael Barnathan

The Hash Function

• We call h(x) a hash function.
• Any function that maps the input type to something suitable for indexing may be used.
  – In Java, this means we are mapping from Object to int.
  – In fact, every Java class has a built-in function called: int hashCode()
  – This function is defined in the Object class, which means every object has a default one.
  – It also means you can override it in your own objects.
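Since hashCode() can return any int, including negative ones, it still has to be reduced to a valid array index. One common way to do that is sketched below; the class and method names (`HashIndexDemo`, `indexFor`) are assumptions made for illustration, not part of the Java library.

```java
// Hypothetical helper; shows one common way to turn hashCode() into an array index.
final class HashIndexDemo {
    // Math.floorMod keeps the result in [0, tableSize) even when hashCode() is negative.
    static int indexFor(Object key, int tableSize) {
        return Math.floorMod(key.hashCode(), tableSize);
    }

    public static void main(String[] args) {
        for (String name : new String[] {"Alice", "Bob", "John"}) {
            System.out.println(name + " -> " + indexFor(name, 8));
        }
    }
}
```

Using the plain `%` operator instead would produce negative indices for keys with negative hash codes.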


Page 16: CS305/503, Spring 2009 Hash Tables Michael Barnathan

Good Hash Functions
1. A hash function must be deterministic: it must always return the same value for the same input.
2. Good hash functions distribute their output as uniformly as possible to minimize the number of “collisions”: two different input values that hash to the same output.
   1. If every distinct input value is mapped to a distinct output value, the function is called injective, or one-to-one. This is the ideal.
   2. If the space of possible inputs is greater in size than the space of possible outputs, an injective function is impossible (due to the pigeonhole principle: if you put n+1 objects in n holes, at least one hole must have more than one object in it).
3. Because the hash function is computed on every access of the hash table, good hash functions execute very quickly.
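As an illustration of the determinism requirement, here is a sketch of overriding hashCode() in a user-defined class. The `Word` class and its fields are hypothetical, loosely modeled on the Assignment 3 discussion; `Objects.hash` is a standard library helper (Java 7 and later).

```java
import java.util.Objects;

// Hypothetical Word class; the field names are assumptions made for this sketch.
final class Word {
    private final String text;
    private final String partOfSpeech;

    Word(String text, String partOfSpeech) {
        this.text = text;
        this.partOfSpeech = partOfSpeech;
    }

    @Override
    public boolean equals(Object other) {
        if (this == other) return true;
        if (!(other instanceof Word)) return false;
        Word w = (Word) other;
        return text.equals(w.text) && partOfSpeech.equals(w.partOfSpeech);
    }

    // Deterministic: depends only on the fields that equals() compares,
    // so two equal words always produce the same hash value.
    @Override
    public int hashCode() {
        return Objects.hash(text, partOfSpeech);
    }
}
```

Overriding equals() and hashCode() together matters because hash-based containers use the hash to find the slot and equals() to confirm the match.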

Page 17: CS305/503, Spring 2009 Hash Tables Michael Barnathan

The Birthday Paradox
• If the range of possible inputs is larger than the range of possible outputs, it is impossible to obtain an ideal hash function due to the pigeonhole principle.
• However, even if this is not the case, it is still unlikely that a uniform hash function will avoid collisions.
• This is due to the birthday paradox:
  – This just refers to the counterintuitive notion that it is highly likely that two people in a relatively small group share the same birthday.
  – Assuming a uniform distribution over 365 days:
  – In a group of 23 people, the probability that 2 share a birthday is about 50%.
  – In a group of 50 people, the probability is 97%.
  – The probability does not reach 100% until 366 people are in the room.
• “Having the same birthday” -> “Hashing to the same value”.

Page 18: CS305/503, Spring 2009 Hash Tables Michael Barnathan

The Birthday Paradox

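With 365 equally likely birthdays, the probability that all n people have distinct birthdays is

$$\bar{p}(n) = \frac{365!}{365^{n}\,(365 - n)!}$$

so the probability that at least two of them share a birthday is $p(n) = 1 - \bar{p}(n)$.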

(Wikipedia)
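The birthday probabilities can be checked numerically. The sketch below multiplies the per-person "no match so far" factors rather than evaluating factorials, which would overflow; the class and method names are assumptions for illustration.

```java
// Hypothetical calculator for the birthday paradox under a uniform distribution.
final class BirthdayParadox {
    // Probability that at least two of n people share a birthday,
    // assuming `days` equally likely birthdays.
    static double sharedProbability(int n, int days) {
        if (n > days) return 1.0; // pigeonhole principle: a match is certain
        double allDistinct = 1.0;
        for (int i = 0; i < n; i++) {
            // The (i+1)-th person must avoid the i birthdays already taken.
            allDistinct *= (double) (days - i) / days;
        }
        return 1.0 - allDistinct;
    }

    public static void main(String[] args) {
        for (int n : new int[] {23, 50, 366}) {
            System.out.println(n + " people: " + sharedProbability(n, 365));
        }
    }
}
```

Replacing 365 with a table size and n with the number of keys gives the chance that a uniform hash function produces at least one collision.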

Page 19: CS305/503, Spring 2009 Hash Tables Michael Barnathan

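Popular Hash Functions
• MD5
  – MD4
• SHA1
  – SHA2
  – SHA3
• CRC32
• 3DES (strictly a block cipher rather than a hash function, though block ciphers can be used to build hash functions)
• Tiger
• (Aside: Many hash functions are used for cryptography as well. Should you use them to store passwords, make sure you combine each password with an extra random string, called a salt, before hashing, to defeat “rainbow table” attacks.)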

Page 20: CS305/503, Spring 2009 Hash Tables Michael Barnathan

Hash Tables
• The hash table is the array that the hash function provides an index into.
• Like other arrays, it begins with a fixed capacity and strategies must be employed to maintain it as the hash table grows.
• Because performance degrades as the hash table begins to fill, the size of a hash table is usually increased when the fraction of occupied slots passes a threshold called the load factor.
  – For example, a table with a load factor of 0.75 would increase in size when it is 75% full.
  – 0.75 is the default in Java’s Hashtable, HashSet, and HashMap classes.
• Collisions, mappings of distinct objects to the same position in the array, must also be handled.
  – They become more of a problem as the hash table fills.
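To make the load factor concrete, the sketch below computes the entry count at which a table would grow and shows the real `HashMap(int initialCapacity, float loadFactor)` constructor. The helper name `resizeThreshold` is an assumption for illustration and is not a library method.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical demo of how capacity and load factor interact.
final class LoadFactorDemo {
    // Number of entries a table of the given capacity holds before it should grow.
    static int resizeThreshold(int capacity, double loadFactor) {
        return (int) (capacity * loadFactor);
    }

    public static void main(String[] args) {
        // Explicitly request 16 buckets and a 0.75 load factor.
        Map<String, Integer> counts = new HashMap<>(16, 0.75f);
        counts.put("example", 1);
        System.out.println("Grow after " + resizeThreshold(16, 0.75) + " entries");
    }
}
```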

Page 21: CS305/503, Spring 2009 Hash Tables Michael Barnathan

Collision Resolution

• What if element B hashes to a location already filled by element A?
• We have a collision.
• There are two strategies for handling this scenario:
  – Linear Probing.
  – Chaining.
• Or, to put it in intuitive terms:
  – This spot’s taken. Store the new element somewhere else, or:
  – Cram both elements into the same spot.
[Figure: h(x) maps both Alice and Bob to index 1.]

Page 22: CS305/503, Spring 2009 Hash Tables Michael Barnathan

Linear Probing
• Let element B hash to the location h(B).
• Suppose h(B) is already filled by element A.
• A linear probing strategy simply stores B in the next available space.
  – If h(B) + 1 is available, this is where it is stored.
  – If not, we move to h(B) + 2 and check whether it is available.
  – And so on.
  – If we hit the end of the table, we wrap around to the beginning (modular arithmetic).
• It is also possible to use an arbitrary offset k.
  – Then we check h(B) + k, h(B) + 2k, etc.
  – Again, everything is (mod n), the size of the table, so we wrap.
• The same strategy is used for access:
  – If the hashed element is not the same as the one we’re looking up, move down the hash table and check the next element. Repeat until the elements match or an empty space is reached.
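The insertion and lookup rules for linear probing can be sketched as follows. This is a simplified illustration with assumed names (`LinearProbingTable`, `home`): it uses an offset of 1, a fixed capacity, and no deletion or resizing, all of which a production table would need to handle.

```java
// Hypothetical fixed-size set of strings using linear probing with offset 1.
final class LinearProbingTable {
    private final String[] slots;

    LinearProbingTable(int capacity) {
        slots = new String[capacity];
    }

    // h(key): the slot the key would occupy if there were no collisions.
    private int home(String key) {
        return Math.floorMod(key.hashCode(), slots.length);
    }

    // Stores the key in the first free slot at or after h(key) and returns that index.
    int insert(String key) {
        int i = home(key);
        for (int probes = 0; probes < slots.length; probes++) {
            if (slots[i] == null) {
                slots[i] = key;
                return i;
            }
            if (slots[i].equals(key)) return i; // already present
            i = (i + 1) % slots.length;         // wrap around at the end of the table
        }
        throw new IllegalStateException("table is full");
    }

    // Probes from h(key) until the key matches or an empty slot is reached.
    boolean contains(String key) {
        int i = home(key);
        for (int probes = 0; probes < slots.length; probes++) {
            if (slots[i] == null) return false;
            if (slots[i].equals(key)) return true;
            i = (i + 1) % slots.length;
        }
        return false;
    }
}
```

For example, in a table of capacity 7 the strings "a" and "h" both hash to slot 6 (their hash codes are 97 and 104), so inserting "a" and then "h" places "h" in slot 0 after wrapping around.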

Page 23: CS305/503, Spring 2009 Hash Tables Michael Barnathan

Linear Probing Example

[Figure: a hash table holding Alice, Bob, John, Eve, and Trudy.]
Insert “Mallory”: suppose Mallory hashes to John’s spot.

Page 24: CS305/503, Spring 2009 Hash Tables Michael Barnathan

Linear Probing Example

[Figure: the same table holding Alice, Bob, John, Eve, and Trudy.]
Insert “Mallory”: we check the next spot. It’s filled.

Page 25: CS305/503, Spring 2009 Hash Tables Michael Barnathan

Linear Probing Example

[Figure: the same table holding Alice, Bob, John, Eve, and Trudy.]
Insert “Mallory”: we check the next spot. It’s filled.

Page 26: CS305/503, Spring 2009 Hash Tables Michael Barnathan

Linear Probing Example

[Figure: the table now holds Alice, Bob, John, Eve, Trudy, and Mallory.]
Insert “Mallory”: when we find an empty spot, it is filled.

Page 27: CS305/503, Spring 2009 Hash Tables Michael Barnathan

Advantages and Disadvantages
• Advantages:
  – Very space-efficient; values are stored in the hash table itself.
  – Simple; no extra structures needed.
  – Works fairly well when load factor is low.
    • However, a low load factor wastes space.
  – Because colliding elements remain adjacent in memory, caching behavior is exceptional.
• Disadvantages:
  – Performance swiftly degrades when load factor exceeds 0.8.
  – Collisions may cluster, and this requires traversing the hash table one element at a time to find the next available space. This may slow insertion.

Page 28: CS305/503, Spring 2009 Hash Tables Michael Barnathan

Chaining

• Let element B hash to the location h(B).
• Suppose h(B) is already filled by element A.
• A chaining strategy stores a linked list at each slot of the table and appends the new element to the list.
• When we wish to access the element again, we perform a linear search on the list.
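Chaining can be sketched in a few lines. The class below is an illustration with assumed names (`ChainedTable`, `chainLength`) and a fixed number of slots; it is not how java.util.HashMap is implemented internally.

```java
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;

// Hypothetical set of strings using separate chaining with a fixed number of slots.
final class ChainedTable {
    private final List<List<String>> buckets = new ArrayList<>();

    ChainedTable(int capacity) {
        for (int i = 0; i < capacity; i++) {
            buckets.add(new LinkedList<String>()); // one (initially empty) chain per slot
        }
    }

    private int home(String key) {
        return Math.floorMod(key.hashCode(), buckets.size());
    }

    // Appends the key to the chain at h(key) unless it is already there.
    void insert(String key) {
        List<String> chain = buckets.get(home(key));
        if (!chain.contains(key)) chain.add(key);
    }

    // Linear search, but only through the one chain the key could be in.
    boolean contains(String key) {
        return buckets.get(home(key)).contains(key);
    }

    // Number of keys that landed in the given slot.
    int chainLength(int index) {
        return buckets.get(index).size();
    }
}
```

Because the chains grow independently of the array, a table with two slots can still hold ten keys; lookups simply get slower as the chains lengthen.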

Page 29: CS305/503, Spring 2009 Hash Tables Michael Barnathan

Chaining Example

[Figure: a hash table holding Alice, Bob, John, Eve, and Trudy, with Mallory linked from John’s spot.]
Insert “Mallory”: suppose Mallory hashes to John’s spot. We then append Mallory to a linked list in that same spot.

Page 30: CS305/503, Spring 2009 Hash Tables Michael Barnathan

Advantages and Disadvantages
• Advantages:
  – Intuitive; the location we hash at is always the one returned by the hash function.
  – New elements can be added to the list in constant-time; linear probing requires a linear scan.
  – Performance degrades only linearly as the table fills.
  – More elements may be stored in the table than there are available slots using this method.
  – You can quickly discover the number of keys that collide with one another.
• Disadvantages:
  – Storing the data in adjacent memory locations, as in linear probing, has very good caching behavior. Linked lists in general do not.

Page 31: CS305/503, Spring 2009 Hash Tables Michael Barnathan

Performance

[Figure: performance graph (source: Wikipedia).]

Page 32: CS305/503, Spring 2009 Hash Tables Michael Barnathan

Perfect Hashing

• If all n keys are known prior to hashing, it is possible to construct a function that maps these keys to a hash table of size n without collisions.
• This function is known as a perfect hash function.
• There is a generalized procedure for discovering perfect hash functions described at http://cmph.sourceforge.net/papers/chm92.pdf.
• But since this is a difficult paper to understand, just be aware that it is possible.
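For a small, fixed key set, a perfect hash function can be found by brute force. The sketch below is not the procedure from the linked paper; it simply tries members of an assumed family of hash functions (a polynomial string hash with a varying odd multiplier) until one maps every key to a distinct slot. All names here are hypothetical.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical brute-force search for a collision-free hash over a known key set.
final class PerfectHashSearch {
    // One member of the assumed hash family, selected by `multiplier`.
    static int slot(String key, int multiplier, int tableSize) {
        int h = 0;
        for (char c : key.toCharArray()) h = h * multiplier + c;
        h *= 0x9E3779B1;              // scramble so the high bits depend on every character
        return (h >>> 16) % tableSize;
    }

    // Returns a multiplier that sends all (distinct) keys to distinct slots of a
    // table of size keys.length, or -1 if none is found within maxTries attempts.
    static int findMultiplier(String[] keys, int maxTries) {
        for (int t = 0; t < maxTries; t++) {
            int multiplier = 2 * t + 1;
            Set<Integer> used = new HashSet<>();
            boolean collisionFree = true;
            for (String k : keys) {
                if (!used.add(slot(k, multiplier, keys.length))) {
                    collisionFree = false;
                    break;
                }
            }
            if (collisionFree) return multiplier;
        }
        return -1;
    }
}
```

This approach only scales to small key sets, since the chance that a random function is collision-free shrinks rapidly as n grows; that is the problem the generalized procedures address.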

Page 33: CS305/503, Spring 2009 Hash Tables Michael Barnathan

Cuckoo Hashing
• This is a strategy that uses two hashing functions to insert.
• If a collision occurs using the first hash function, the existing element is pushed out of its space (replaced by the new element) and hashed using the second function.
• This can potentially push another element out. If a loop occurs, the hash table is rebuilt using a different set of hash functions.
• However, a collision on both hash functions is unlikely until the table begins to fill.
  – This begins earlier than in the other two strategies:
  – Using two hash functions, an appropriate load factor is .5.
  – However, using three, the appropriate load factor jumps to .91.
• This strategy was generally found superior to both chaining and probing. However, it is still not widely known.
  – Fortunately for you, I have some very esoteric areas of interest.
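The eviction process can be sketched as follows. This is one possible arrangement, assuming two separate tables (one per hash function), a cap on evictions as a stand-in for loop detection, and a rebuild that both changes the hash functions and doubles the capacity so that it is sure to make progress. The names and constants are illustrative assumptions.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical two-table cuckoo set of strings; a teaching sketch, not a tuned implementation.
final class CuckooSet {
    private static final int MAX_KICKS = 32;    // evictions allowed before we assume a loop
    private final int[] multipliers = {31, 37}; // selects one (assumed) hash function per table
    private String[][] tables;
    private int capacity;                       // slots per table

    CuckooSet(int slotsPerTable) {
        capacity = slotsPerTable;
        tables = new String[2][capacity];
    }

    private int hash(int which, String key) {
        int h = 0;
        for (char c : key.toCharArray()) h = h * multipliers[which] + c;
        h *= 0x9E3779B1;                        // scramble the bits before reducing
        return (h >>> 16) % capacity;
    }

    // A key can only ever live in one of its two candidate slots.
    boolean contains(String key) {
        return key.equals(tables[0][hash(0, key)]) || key.equals(tables[1][hash(1, key)]);
    }

    void insert(String key) {
        if (contains(key)) return;
        String current = key;
        int which = 0;
        for (int kicks = 0; kicks < MAX_KICKS; kicks++) {
            int i = hash(which, current);
            String evicted = tables[which][i];
            tables[which][i] = current;         // take the slot, pushing out any occupant
            if (evicted == null) return;
            current = evicted;                  // the occupant now tries its other table
            which = 1 - which;
        }
        rebuild(current);                       // too many evictions: assume a loop
    }

    // Re-inserts every key (plus the one left homeless) using new hash functions.
    private void rebuild(String homeless) {
        List<String> keys = new ArrayList<>();
        keys.add(homeless);
        for (String[] table : tables) {
            for (String k : table) {
                if (k != null) keys.add(k);
            }
        }
        multipliers[0] += 10;                   // stays odd, giving a different function
        multipliers[1] += 10;
        capacity *= 2;                          // growth is this sketch's choice, not required
        tables = new String[2][capacity];
        for (String k : keys) insert(k);
    }
}
```

The payoff is in `contains`: a lookup inspects at most two slots, regardless of how full the table is.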

Page 34: CS305/503, Spring 2009 Hash Tables Michael Barnathan

Unsorted Associative Containers
• Java has excellent built-in support for hashing.
• In particular, the unsorted associative containers utilize hash tables:
  – HashMap, which you have used:
    • Similar functions to TreeMap.
    • Usually faster for random-access queries.
    • As you saw in Assignment 3, performing range queries or sequential access is a pain (you had to sort).
  – HashSet.
  – Hashtable (which is very much like HashMap).
• Why are they unsorted?
  – The point of a hash function is to turn keys into integers. In general, sorted order cannot be maintained through this conversion.

Page 35: CS305/503, Spring 2009 Hash Tables Michael Barnathan

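Keeping a Hash Table Ordered
• It’s possible to give hash tables a predictable traversal order with some extra structure.
• Specifically, what if we threaded a linked list through the table’s entries, with each entry pointing to the next one in order?
• If new entries are simply appended to the list, this incurs no extra asymptotic cost on insertion, access, or deletion.
• Java has a class that implements this idea, LinkedHashMap. Note that it preserves insertion order (or, optionally, access order), not sorted key order: keeping the list sorted would mean finding each new key’s position on insertion, which is no longer O(1). For sorted keys and range queries, TreeMap remains the standard choice.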
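To see which order each map actually maintains, the short sketch below inserts the same keys into a LinkedHashMap and a TreeMap and lists them back. The class and method names are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical demo comparing the iteration order of two Map implementations.
final class OrderDemo {
    // Inserts John, Alice, Bob (in that order) and returns the keys as the map iterates them.
    static List<String> keysInOrder(Map<String, Integer> map) {
        map.put("John", 3);
        map.put("Alice", 1);
        map.put("Bob", 2);
        return new ArrayList<>(map.keySet());
    }

    public static void main(String[] args) {
        System.out.println(keysInOrder(new LinkedHashMap<>())); // [John, Alice, Bob]
        System.out.println(keysInOrder(new TreeMap<>()));       // [Alice, Bob, John]
    }
}
```

LinkedHashMap returns the keys in insertion order, while TreeMap returns them in sorted order; a plain HashMap makes no promise either way.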

Page 36: CS305/503, Spring 2009 Hash Tables Michael Barnathan

Hashing in Other Languages

• Java: HashMap
• C++: hash_map (a pre-standard extension; std::unordered_map since C++11)
• C#: Hashtable
• Perl: $var{'key'} = "value"
• PHP: $var['key'] = "value"
• Ruby: v = { 'key' => 'value' }

Page 37: CS305/503, Spring 2009 Hash Tables Michael Barnathan

Performance

• What is the complexity of insertion in a hash table if there are no collisions?

• What if there are collisions?
  – If you choose your table size appropriately, collisions are rather rare. The average size of your chains usually ends up around 2 or 3.

• Do hash tables need to use any extra space?

Page 38: CS305/503, Spring 2009 Hash Tables Michael Barnathan

CRUD: Hash Tables
• Insertion (average): O(1).
• Access (average): O(1).
• Deletion (average): O(1).
• Insertion (worst): O(n).
• Access (worst): O(n).
• Deletion (worst): O(n).
• Since collisions are not very common with a good hash function and an appropriate load factor, hash tables very often yield constant-time insertion, access, and deletion.
• The amount of space used depends on the load factor, but remains O(n).
• They are incredibly useful structures!
• They allow you to index data by a generalized key rather than a numeric ID, and are therefore used extensively in databases and distributed queries. MapReduce, the distributed programming model used heavily at Google, partitions its intermediate data by hashing keys.

Page 39: CS305/503, Spring 2009 Hash Tables Michael Barnathan

Access on Demand

• This was our discussion of hashing.
• Next time, we will discuss amortized analysis and Java’s “Set” classes.
• The lesson:
  – An individually unlikely event becomes very likely given enough repetitions (birthday paradox).