
Data Structures - CSCI 102

Copyright © William C. Cheng

Data Structure Limitations

Vectors, Linked Lists, Stacks, Queues, Deques: can’t provide fast insertion/removal and fast lookup at the same time

Binary Search Trees, Heaps: provide consistently fast operations, but must maintain an internal ordering

What if we didn’t care about the ordering of the elements at all?

How can we further improve the performance of lookup, add & removal?

Lookup Tables

For operations where we only care about fast add/remove/search, not fast traversal, we create a table structure to optimize for fast lookup

Each value in the table has a unique key

The key is used as a short identifier to look up an entire value in the table

Example: your student ID is used to look up your student record (e.g. name, GPA, etc.)

Lookup Tables

What kind of operations do we need to perform on a lookup table?

Insert(key,value): insert a new value identified by key into the table

Remove(key): remove the value identified by key from the table

Search(key): see if a particular value identified by key is in the table

We don’t care as much about traversal (visiting all elements) in this scenario

Sample Object

We want to keep a directory of all the students at USC and be able to look them up by their student ID

Let’s assume ID is a unique integer

struct Student {
    string name;
    double gpa;
    int id;
};

Direct Address Table

If we can guarantee that student IDs will always range from 0 to N (e.g. 0 to 4999), we could just store them in an array with one slot per possible ID (N + 1 slots):

Student data[5000];

Then when we want to grab a particular student, we know Student N is at index N:

int id = 3285;
Student s = data[id];

Direct Address Table

[Figure: Student IDs 0, 2 and 4 index directly into the Data array (slots 0 to 4999). Each occupied slot holds a Student object (Name, GPA, ID): John Doe, 3.2, 0; Jane Doe, 2.6, 2; Some Guy, 3.7, 4.]

Direct Address Table

Direct Addressing

Maps keys directly to the indexes in an array

Unused array indexes need to be marked

Generally use NULL

Operations are fast

O(1) worst case
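The direct addressing idea above can be sketched in C++ roughly as follows. This is only an illustration under stated assumptions, not code from the slides: the class name DirectAddressTable, its method names and the MAX_ID constant are hypothetical, and an empty std::unique_ptr (nullptr) plays the role of the NULL marker for unused slots.

```cpp
#include <array>
#include <memory>
#include <string>

struct Student {
    std::string name;
    double gpa;
    int id;
};

// Hypothetical direct address table: the student ID is the array index.
class DirectAddressTable {
public:
    static constexpr int MAX_ID = 4999;  // assumed ID range: 0..MAX_ID

    // Store a copy of s at index s.id; returns false if the ID is out of range.
    bool insert(const Student& s) {
        if (!inRange(s.id)) return false;
        slots_[s.id] = std::make_unique<Student>(s);
        return true;
    }

    // Returns the student with this ID, or nullptr if the slot is unused.
    const Student* search(int id) const {
        return inRange(id) ? slots_[id].get() : nullptr;
    }

    // Mark the slot as unused again.
    void remove(int id) {
        if (inRange(id)) slots_[id].reset();
    }

private:
    static bool inRange(int id) { return id >= 0 && id <= MAX_ID; }

    // One slot per possible key; an empty pointer (nullptr) marks "unused".
    std::array<std::unique_ptr<Student>, MAX_ID + 1> slots_;
};
```

Each operation touches exactly one array slot, which is where the O(1) worst case comes from. The cost is the array itself: it has MAX_ID + 1 slots no matter how few students are stored.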

Direct Address Table

Direct Addressing Issues

Key Restrictions

Keys must fall into a nice, uniform range

Keys must be numeric

Array Size

If there are N possible keys, then data[] must be of size N

Our array could get HUGE

What if we’re only using a small number of keys?

Tons of space is wasted

How can we get around these limitations?

Hash Functions

A function that maps key values to array indexes

Input records all have a unique key

The hash function maps key to an array index

Records are stored at data[hash(key)]

Ideally every unique key also has unique hash(key)

Direct Addressing essentially uses a hash function that does nothing

int directAddressHash(int studentId) {
    return studentId;
}

Hash Tables

[Figure: Student IDs (keys) 0, 2 and 4 pass through the Hash Function; hash(0), hash(2) and hash(4) each select the Data slot that holds the matching Student object (Name, GPA, ID): John Doe, 3.2, 0; Jane Doe, 2.6, 2; Some Guy, 3.7, 4.]

Hash Tables

Hash Functions

How can we avoid having to make our array gigantic to hold all possible keys?

Simple solution: use modular arithmetic

Size of the backing array is no longer dependent on the number of possible keys

int modularHash(int studentId) {
    return studentId % ARRAY_SIZE;
}

Recall direct addressing:

int directAddressHash(int studentId) {
    return studentId;
}
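To make the effect of the modular hash concrete, here is a small sketch. The slides do not give a value for ARRAY_SIZE, so the value 7 below is purely an assumption for illustration, and IDs are assumed to be non-negative.

```cpp
// Assumed table size for illustration; the slides leave ARRAY_SIZE unspecified.
constexpr int ARRAY_SIZE = 7;

// Maps any non-negative student ID to an index in [0, ARRAY_SIZE).
int modularHash(int studentId) {
    return studentId % ARRAY_SIZE;
}
```

With this table size, modularHash(3285) is 2 and modularHash(9) is also 2. The array can now be much smaller than the range of IDs, but two different keys can land on the same index.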

Hash Functions

What makes a good hash function?

Fast: hashing is supposed to be faster than a binary search tree, so hash(key) needs to be O(1)

Deterministic: if we have a key K, then hash(K) must always give the same result

Uniform distribution: the hash function should uniformly distribute keys across all of the available indexes in the storage array

Making a good hash function is hard

Hash Functions

Making a hash function

Map your data into the set of natural numbers: N = {0, 1, 2, ...}

For strings, use things like ASCII letter codes

Prime numbers are your friend

Prime table sizes tend to yield better results

Try to be independent of any patterns that may exist in the data

Handle variants of the same pattern

E.g. make sure "get" and "gets" hash differently

You won’t usually have to write your own, but you should know what the default hash function does
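One common way to follow this advice for strings is a polynomial hash over the character codes, reduced modulo a prime table size. The sketch below is an illustration, not the slides' own function: the name stringHash, the multiplier 31 and the table size 101 are all assumptions.

```cpp
#include <cstddef>
#include <string>

// Assumed prime table size for illustration.
constexpr std::size_t TABLE_SIZE = 101;

// Polynomial string hash: treats the character codes as digits in base 31.
std::size_t stringHash(const std::string& key) {
    std::size_t h = 0;
    for (unsigned char c : key) {
        h = (h * 31 + c) % TABLE_SIZE;  // reduce at each step to stay small
    }
    return h;
}
```

Because every character changes the running value, a key and its one-letter extension such as "get" and "gets" end up at different indexes here. Distinct keys can still collide in general, since there are only TABLE_SIZE possible results.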

Hash Tables

Hashing Issues

Hash Tables do not maintain any ordering of their internal elements

Creating a perfect hash function is almost impossible

Collisions

When two distinct keys generate the same hash value it’s called a collision

hash(K1) == hash(K2)

Collision Handling

Open Addressing

If we try to insert a new element and there’s a collision, keep probing the hash table until we find a vacant space

All data is stored directly in the hash table. No extra data structures are needed.

Probing

If a collision occurs, use a deterministic algorithm to calculate the next array index to check (based on the initial hash result)
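A minimal sketch of open addressing with linear probing might look like the following. It is an illustration under assumptions rather than the course's implementation: the class name LinearProbingTable, the table size of 5 and the plain modulo hash are hypothetical choices, and IDs are assumed to be non-negative.

```cpp
#include <array>
#include <cstddef>
#include <optional>
#include <string>

struct Student {
    std::string name;
    double gpa;
    int id;
};

// Hypothetical fixed-size open-addressing table with linear probing.
class LinearProbingTable {
public:
    static constexpr std::size_t SIZE = 5;

    // Returns the slot used, or std::nullopt if the table is full.
    std::optional<std::size_t> insert(const Student& s) {
        std::size_t start = hash(s.id);
        for (std::size_t step = 0; step < SIZE; ++step) {
            std::size_t i = (start + step) % SIZE;  // next index to probe
            if (!slots_[i] || slots_[i]->id == s.id) {
                slots_[i] = s;  // vacant slot, or same key: store here
                return i;
            }
        }
        return std::nullopt;
    }

    // Follows the same probe sequence; an empty slot ends the search.
    const Student* search(int id) const {
        std::size_t start = hash(id);
        for (std::size_t step = 0; step < SIZE; ++step) {
            std::size_t i = (start + step) % SIZE;
            if (!slots_[i]) return nullptr;
            if (slots_[i]->id == id) return &*slots_[i];
        }
        return nullptr;
    }

private:
    // Assumed hash: plain modulo on a non-negative ID.
    static std::size_t hash(int id) { return static_cast<std::size_t>(id) % SIZE; }

    // All data lives directly in the table; an empty optional is a vacant slot.
    std::array<std::optional<Student>, SIZE> slots_;
};
```

With this assumed hash, inserting hypothetical IDs 121, 203 and 401 uses slots 1, 3 and 2: the third key first hashes to the occupied slot 1 and is moved one step along. Removal is left out on purpose, since simply emptying a slot would cut the probe sequence for keys stored after it; implementations usually handle that with a special "deleted" marker.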

Open Addressing (Linear Probing)

Start with an empty Hash Table (Data slots 0 to 4)

Insert "John Doe" with ID = 123 (Name: John Doe, GPA: 2.8, ID: 123)

hash(123) = 1

data[1] is empty, no collision

Store it there

Hash Table contains one item: data[1] = John Doe, 2.8, 123

Insert "Jane Doe" with ID = 202 (Name: Jane Doe, GPA: 3.4, ID: 202)

hash(202) = 3

data[3] is empty, no collision

Store it there

Hash Table contains two items: data[1] = John Doe, 2.8, 123; data[3] = Jane Doe, 3.4, 202

Insert "Some Guy" with ID = 401 (Name: Some Guy, GPA: 3.5, ID: 401)

hash(401) = 1

data[1] is non-empty, collision!

hash(401)+1 = 2

data[2] is empty, no collision

Store it there

Hash Table contains three items: data[1] = John Doe, 2.8, 123; data[2] = Some Guy, 3.5, 401; data[3] = Jane Doe, 3.4, 202

Open Addressing (Linear Probing)

What is the Big O of each of these operations?

Search(key): Average: O(1), Worst Case: O(N)

Insert(key,value): Average: O(1), Worst Case: O(N)

Remove(key): Average: O(1), Worst Case: O(N)

Operations depend on the table’s load factor ("Utilization")

How big is the table?

How many slots are taken already?

load factor = (# of elements) / (size of array)
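As a quick worked example of the load factor formula, the sketch below computes it for a table. The function name loadFactor is hypothetical, and the array size is assumed to be greater than zero.

```cpp
#include <cstddef>

// Fraction of slots in use: (# of elements) / (size of array).
// Assumes arraySize > 0.
double loadFactor(std::size_t numElements, std::size_t arraySize) {
    return static_cast<double>(numElements) / static_cast<double>(arraySize);
}
```

For the five-slot table holding three students, loadFactor(3, 5) is 0.6. The closer the value gets to 1, the longer the runs of occupied slots a linear probe has to walk through.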

Collision Handling

Chaining

Each slot in the Hash Table can now contain a list of elements instead of a single element

When multiple items hash to the same slot, they are placed in the list at that slot

This requires the overhead of an extra list for each slot that contains one or more elements
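Chaining can be sketched with one linked list per slot, as below. This is an illustration under assumptions, not the course's code: the class name ChainedTable, the table size of 5 and the modulo hash are hypothetical, IDs are assumed to be non-negative, and insert assumes the ID is not already in the table.

```cpp
#include <array>
#include <cstddef>
#include <forward_list>
#include <string>

struct Student {
    std::string name;
    double gpa;
    int id;
};

// Hypothetical chained hash table: each slot holds a singly linked list.
class ChainedTable {
public:
    static constexpr std::size_t SIZE = 5;

    // O(1): push onto the front of the chain (no duplicate check).
    void insert(const Student& s) {
        buckets_[hash(s.id)].push_front(s);
    }

    // Walks only the chain stored at this key's slot.
    const Student* search(int id) const {
        for (const Student& s : buckets_[hash(id)]) {
            if (s.id == id) return &s;
        }
        return nullptr;
    }

    // Removes the entry with this ID; returns true if something was removed.
    bool remove(int id) {
        auto& chain = buckets_[hash(id)];
        auto prev = chain.before_begin();
        for (auto it = chain.begin(); it != chain.end(); ++it, ++prev) {
            if (it->id == id) {
                chain.erase_after(prev);
                return true;
            }
        }
        return false;
    }

private:
    // Assumed hash: plain modulo on a non-negative ID.
    static std::size_t hash(int id) { return static_cast<std::size_t>(id) % SIZE; }

    std::array<std::forward_list<Student>, SIZE> buckets_;
};
```

Two keys that hash to the same slot simply end up in the same list, so no probing is needed. Search and remove cost as much as the length of that one chain, while insert stays constant time because it never walks the chain.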

Chaining

Hash Table contains two items: the list at data[1] holds John Doe, 2.8, 123; the list at data[3] holds Jane Doe, 3.4, 202

Insert "Some Guy" with ID = 401 (Name: Some Guy, GPA: 3.5, ID: 401)

hash(401) = 1

data[1] is non-empty, collision!

Chaining says to add the new entry to the list at data[1]

Insert Some Guy in the list at data[1]

Hash Table contains three items: the list at data[1] holds John Doe, 2.8, 123 and Some Guy, 3.5, 401; the list at data[3] holds Jane Doe, 3.4, 202

Chaining

What is the Big O of each of these operations?

Search(key): Average: O(1), Worst Case: O(N)

Insert(key,value): Average: O(1), Worst Case: O(1)

Remove(key): Average: O(1), Worst Case: O(N)

Operations depend on the average length of a chain (except for insert)

Collision Handling

The Problem

If a malicious user knows what hash function you’re using, they can intentionally cause your worst-case behavior

Universal Hashing

When the Hash Table is created, randomly choose a hash function independent of the keys that are going to be stored

No single input gives worst-case behavior (just like randomized Quicksort)
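One well-known family to draw from is h(k) = ((a*k + b) mod p) mod m, where p is a prime larger than any key, m is the table size, and a and b are picked at random when the table is created. The sketch below is an assumption-laden illustration, not something given in the slides: the class name UniversalHash, the choice p = 2^31 - 1 and the seed parameter are all hypothetical, and keys are assumed to be smaller than p.

```cpp
#include <cstdint>
#include <random>

// Hypothetical member of the family h(k) = ((a*k + b) mod p) mod m.
class UniversalHash {
public:
    // m is the table size; a and b are drawn at random at construction time.
    explicit UniversalHash(std::uint64_t m,
                           std::uint64_t seed = std::random_device{}())
        : m_(m) {
        std::mt19937_64 rng(seed);
        a_ = std::uniform_int_distribution<std::uint64_t>(1, P - 1)(rng);
        b_ = std::uniform_int_distribution<std::uint64_t>(0, P - 1)(rng);
    }

    // Keys are assumed to be smaller than P, so a*key + b fits in 64 bits.
    std::uint64_t operator()(std::uint32_t key) const {
        return ((a_ * key + b_) % P) % m_;
    }

private:
    static constexpr std::uint64_t P = 2147483647ULL;  // prime 2^31 - 1
    std::uint64_t m_, a_, b_;
};
```

Each table gets its own a and b, so someone who only knows the family cannot precompute a set of keys that is guaranteed to collide. Passing a fixed seed makes the choice repeatable, which is useful for testing but defeats the purpose in production.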

Collision Handling

Multi-Level Hashing

Like chaining, but each element in the hash table holds another hash table with a different hash function

By hashing multiple times, we can greatly decrease the odds of a collision

Perfect Hashing

If the set of possible keys is static (never changes), we can develop a perfect multi-level hash to give O(1) worst case performance

e.g. The reserved keywords in a programming language are a static set of keys

Hash Tables

Other Notes

Hash Tables generally do provide a way for you to retrieve a list of the known keys

Just keep in mind there is no guaranteed ordering of the keys

When these slides were written, C++ had no built-in hash table and unordered_map was still a proposal; it has since been standardized in C++11 as std::unordered_map

Google Sparse Hash provides C++ hash tables

Boost C++ Libraries provides hash tables: http://www.boost.org/
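For readers on C++11 or later, where std::unordered_map is part of the standard library, the three lookup-table operations look roughly like this. The helper names makeDirectory and contains are hypothetical and exist only for this sketch; the student records reuse the names and IDs from the walkthrough slides.

```cpp
#include <string>
#include <unordered_map>

struct Student {
    std::string name;
    double gpa;
    int id;
};

// Builds a small directory keyed by student ID (hypothetical helper).
std::unordered_map<int, Student> makeDirectory() {
    std::unordered_map<int, Student> directory;
    directory[123] = Student{"John Doe", 2.8, 123};      // Insert(key, value)
    directory[202] = Student{"Jane Doe", 3.4, 202};
    directory.emplace(401, Student{"Some Guy", 3.5, 401});
    directory.erase(202);                                 // Remove(key)
    return directory;
}

// Search(key): true if a student with this ID is in the directory.
bool contains(const std::unordered_map<int, Student>& directory, int id) {
    return directory.find(id) != directory.end();
}
```

Iterating over the map visits every stored key, but in no guaranteed order, which matches the note above about key ordering.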