
Upload: davidknight

Post on 20-Nov-2015


  • CS 240 Tutorial 9 Notes

    Dictionary/Associative array: An abstract data type which holds a collection of (key, value) pairs, where each key appears at most once.

    This allows one to store a value in the associative array using a key as an "index". The value can later be extracted if one knows the key it was stored under.

    Typical operations are

    insert(k, v): add value v, associated to key k

    delete(k): remove item associated to key k

    search(k): return value associated to key k (if one exists)

    For simplicity, keys are usually assumed to be integers. (If not, we can always map the keys to integers first.) Also, suppose the maximum key is M.

    Question: How would one go about implementing an associative array?

    Implementation                    Insert                         Search                     Delete
    Unsorted array/linked list        O(1) (add to end)              O(n) (brute force)         O(n) (brute force)
    Sorted array                      O(n) (shifting)                O(log n) (binary search)   O(n) (shifting)
    Balanced search tree (e.g., AVL)  O(log n) (search + rotation)   O(log n) (max height)      O(log n) (search + swap + rotations)
    Direct addressing                 O(1)                           O(1)                       O(1)

    Direct addressing: use an array A of size M and store v at A[k] (the key tells us exactly where the item will be stored).

    Question: What is the downside to direct addressing?

    Space required is O(M), even if n is very small, which is wasteful. If the keys contained 99 decimal digits, A would have to be of size 10^100, which is more than the estimated number of atoms in the universe!
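
    To make the space trade-off concrete, here is a minimal sketch of direct addressing. The class name `DirectAddressTable` and the parameter `max_key` are illustrative assumptions, not part of the notes, and the sketch allocates `max_key + 1` slots so that key M itself is a valid index.

```python
class DirectAddressTable:
    """Direct addressing: one array slot per possible key in 0..max_key."""

    def __init__(self, max_key):
        # Space is proportional to max_key, no matter how few items are stored.
        self.slots = [None] * (max_key + 1)

    def insert(self, k, v):
        self.slots[k] = v          # O(1): the key is the index

    def search(self, k):
        return self.slots[k]       # None means nothing is stored under k

    def delete(self, k):
        self.slots[k] = None       # O(1)


table = DirectAddressTable(max_key=1000)
table.insert(42, "answer")
print(table.search(42))    # answer
print(len(table.slots))    # 1001 slots allocated for a single stored item
```

    All three operations are a single array access, but the array is sized by the largest possible key rather than by the number of items stored.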

    A hash table is a way of maintaining the good behaviour of this approach, while also addressing the downside. The idea is to use an array of smaller size (e.g., O(n)) and then map the original keys into a smaller range, so that they are indices for this smaller array.

    The process of mapping a key into a small keyspace is known as hashing, and is done by applying a hash function.

    Main problem: Since we are mapping a large keyspace onto a small keyspace, some large keys will hash to the same small key (by the pigeonhole principle), so we must somehow deal with collisions.

    Three ideas:

    Chaining: Each array location can contain multiple (k, v) by storing them in a linked list.

    Linear probing: If the location where you want to insert is already filled, insert in the next available location.

    Double hashing: Instead of looking sequentially for the next available location, jump ahead by a certain amount until a space is free. The jump amount is controlled by a second hash function.

    Example: Using chaining, linear probing, and double hashing, insert aardvark, aback, abacus, and abaft into an array of size 5, where the key is the word itself and the hash function is

    \[ h(w) = h(w_1 w_2 \ldots w_k) = \left( \sum_{i=1}^{k} \mathrm{ascii}(w_i) \right) \bmod 5. \]


  • For double hashing, use the secondary hash function

    \[ h_2(w) = h_2(w_1 w_2 \ldots w_k) = 1 + \left( \left( \sum_{i=1}^{k} \mathrm{ascii}(w_i) \cdot 3^{k-i} \right) \bmod 4 \right). \]

    Answer: Note that

    h(aardvark) = (97 + 97 + 114 + 100 + 118 + 97 + 114 + 107) mod 5 = 844 mod 5 = 4

    h(aback) = (97 + 98 + 97 + 99 + 107) mod 5 = 498 mod 5 = 3

    h(abacus) = (97 + 98 + 97 + 99 + 117 + 115) mod 5 = 623 mod 5 = 3

    h(abaft) = (97 + 98 + 97 + 102 + 116) mod 5 = 510 mod 5 = 0
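
    The four hash values above can be checked with a short sketch. The function name `h` follows the notes; the `table_size` parameter is an added assumption so that the array size is not hard-coded.

```python
def h(word, table_size=5):
    """Sum of the ASCII codes of the characters, reduced mod table_size."""
    return sum(ord(c) for c in word) % table_size


for w in ["aardvark", "aback", "abacus", "abaft"]:
    print(w, sum(ord(c) for c in w), h(w))
# aardvark 844 4
# aback 498 3
# abacus 623 3
# abaft 510 0
```

    Since aback and abacus both hash to 3, every collision strategy below has to resolve at least that one collision.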

    Using chaining:

    insert aardvark:

    A[0]      A[1]      A[2]      A[3]               A[4]
    -         -         -         -                  aardvark

    insert aback:

    A[0]      A[1]      A[2]      A[3]               A[4]
    -         -         -         aback              aardvark

    insert abacus:

    A[0]      A[1]      A[2]      A[3]               A[4]
    -         -         -         abacus -> aback    aardvark

    insert abaft:

    A[0]      A[1]      A[2]      A[3]               A[4]
    abaft     -         -         abacus -> aback    aardvark

    Note: Values are inserted at the start of the linked list. This keeps insertion at O(1) cost. However, in the worst case all items hash to the same location, and search/delete cost O(n).

    But if the hash function is chosen properly this behaviour is unlikely. Assuming each hash value is equally likely to occur, search/delete cost O(1 + n/|A|) in the average case. If we take |A| ≈ n then this is O(1). This makes sense intuitively: if you want to store n items in A, you probably want to take |A| ≈ n to avoid excessive chaining.
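
    The chaining example can be reproduced with the sketch below. It is an illustration under assumptions rather than a reference implementation: the class name `ChainedHashTable` is hypothetical, `collections.deque` stands in for the linked list (its `appendleft` gives O(1) insertion at the front), and `insert` assumes the key is not already present, as in the worked example.

```python
from collections import deque


def h(word, table_size):
    """Primary hash from the notes: sum of ASCII codes mod table_size."""
    return sum(ord(c) for c in word) % table_size


class ChainedHashTable:
    def __init__(self, size=5):
        self.buckets = [deque() for _ in range(size)]

    def _bucket(self, key):
        return self.buckets[h(key, len(self.buckets))]

    def insert(self, key, value):
        # O(1): new items go at the front of the chain.
        self._bucket(key).appendleft((key, value))

    def search(self, key):
        # Cost is proportional to the length of one chain.
        for k, v in self._bucket(key):
            if k == key:
                return v
        return None

    def delete(self, key):
        bucket = self._bucket(key)
        for i, (k, _) in enumerate(bucket):
            if k == key:
                del bucket[i]
                return True
        return False


table = ChainedHashTable(size=5)
for word in ["aardvark", "aback", "abacus", "abaft"]:
    table.insert(word, len(word))   # the stored value is arbitrary here

for i, bucket in enumerate(table.buckets):
    print(f"A[{i}]:", " -> ".join(k for k, _ in bucket))
```

    After the four insertions, bucket 3 holds abacus followed by aback, matching the final diagram above.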


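  • Using linear probing:

    insert aardvark (h = 4; A[4] is free):

    A[0]      A[1]      A[2]      A[3]      A[4]
    -         -         -         -         aardvark

    insert aback (h = 3; A[3] is free):

    A[0]      A[1]      A[2]      A[3]      A[4]
    -         -         -         aback     aardvark

    insert abacus (h = 3; A[3] and A[4] are filled, so wrap around to A[0]):

    A[0]      A[1]      A[2]      A[3]      A[4]
    abacus    -         -         aback     aardvark

    insert abaft (h = 0; A[0] is filled, so move on to A[1]):

    A[0]      A[1]      A[2]      A[3]      A[4]
    abacus    abaft     -         aback     aardvark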

    Note: Now insert/delete/search are all O(n) in the worst case. When the hash table is mostly empty this behaviour is unlikely, but as the table fills up (its load factor increases) it becomes more and more likely.
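
    A minimal sketch of linear probing on the same four words follows. The function names `linear_probe_insert` and `linear_probe_search` are illustrative assumptions, and deletion is left out because the notes do not describe how deleted slots are handled.

```python
def h(word, table_size):
    """Primary hash from the notes: sum of ASCII codes mod table_size."""
    return sum(ord(c) for c in word) % table_size


def linear_probe_insert(table, key):
    """Place key in the first free slot at or after h(key), wrapping around."""
    size = len(table)
    start = h(key, size)
    for step in range(size):
        slot = (start + step) % size
        if table[slot] is None:
            table[slot] = key
            return slot
    raise RuntimeError("hash table is full")


def linear_probe_search(table, key):
    """Follow the same probe sequence; an empty slot means key is absent."""
    size = len(table)
    start = h(key, size)
    for step in range(size):
        slot = (start + step) % size
        if table[slot] is None:
            return None
        if table[slot] == key:
            return slot
    return None


table = [None] * 5
slots = [linear_probe_insert(table, w)
         for w in ["aardvark", "aback", "abacus", "abaft"]]
print(slots)   # [4, 3, 0, 1]
print(table)   # ['abacus', 'abaft', None, 'aback', 'aardvark']
```

    Both loops can visit every slot before finishing, which is where the O(n) worst case comes from.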

    Using double hashing:

    Note that

    h2(aardvark) = 1 + ((97·3^7 + 97·3^6 + 114·3^5 + 100·3^4 + 118·3^3 + 97·3^2 + 114·3 + 107) mod 4) = 1 + (323162 mod 4) = 1 + 2 = 3

    h2(aback) = 1 + ((97·3^4 + 98·3^3 + 97·3^2 + 99·3 + 107) mod 4) = 1 + (11780 mod 4) = 1 + 0 = 1

    h2(abacus) = 1 + ((97·3^5 + 98·3^4 + 97·3^3 + 99·3^2 + 117·3 + 115) mod 4) = 1 + (35485 mod 4) = 1 + 1 = 2

    h2(abaft) = 1 + ((97·3^4 + 98·3^3 + 97·3^2 + 102·3 + 116) mod 4) = 1 + (11798 mod 4) = 1 + 2 = 3
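
    These jump amounts can be checked with the sketch below. The helper name `weighted_sum` is an assumption introduced for illustration; it evaluates the sum of ascii(w_i)·3^(k-i) by Horner's rule, which avoids computing the powers of 3 explicitly.

```python
def weighted_sum(word):
    """Sum of ascii(w_i) * 3^(k-i) for i = 1..k, computed by Horner's rule."""
    total = 0
    for c in word:
        total = total * 3 + ord(c)
    return total


def h2(word):
    """Secondary hash from the notes: always in the range 1..4, never 0."""
    return 1 + weighted_sum(word) % 4


for w in ["aardvark", "aback", "abacus", "abaft"]:
    print(w, weighted_sum(w), h2(w))
# aardvark 323162 3
# aback 11780 1
# abacus 35485 2
# abaft 11798 3
```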


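  • insert aardvark (h = 4; A[4] is free):

    A[0]      A[1]      A[2]      A[3]      A[4]
    -         -         -         -         aardvark

    jump amount: 3

    insert aback (h = 3; A[3] is free):

    A[0]      A[1]      A[2]      A[3]      A[4]
    -         -         -         aback     aardvark

    jump amount: 1

    insert abacus (h = 3; A[3] is filled, so jump by 2 to A[0]):

    A[0]      A[1]      A[2]      A[3]      A[4]
    abacus    -         -         aback     aardvark

    jump amount: 2

    insert abaft (h = 0; A[0] is filled, jump by 3 to A[3], which is also filled, then jump by 3 again to A[1]):

    A[0]      A[1]      A[2]      A[3]      A[4]
    abacus    abaft     -         aback     aardvark

    jump amount: 3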

    Note: When the jump amount is 1, double hashing is identical to linear probing. Also, the jump amount should never be 0, or no alternate positions will ever be checked. In general, if the jump amount and the array size share a common factor greater than 1 (for example, when a jump amount larger than 1 evenly divides the array size), not all alternate positions will be checked. (Making the array size prime and the jump amount smaller than |A| guards against this.)
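
    The double hashing example and the remark about jump amounts can be sketched as follows. The names `double_hash_insert` and `probe_positions` are illustrative assumptions; `h` and `h2` are the two hash functions defined in the notes.

```python
def h(word, table_size):
    """Primary hash: sum of ASCII codes mod table_size."""
    return sum(ord(c) for c in word) % table_size


def h2(word):
    """Secondary hash: 1 + (sum of ascii(w_i) * 3^(k-i) mod 4)."""
    total = 0
    for c in word:
        total = total * 3 + ord(c)
    return 1 + total % 4


def double_hash_insert(table, key):
    """Probe h(key), h(key) + jump, h(key) + 2*jump, ... (mod table size)."""
    size = len(table)
    start, jump = h(key, size), h2(key)
    for step in range(size):
        slot = (start + step * jump) % size
        if table[slot] is None:
            table[slot] = key
            return slot
    raise RuntimeError("no free slot found along the probe sequence")


def probe_positions(start, jump, size):
    """All distinct slots reachable from start with a fixed jump amount."""
    return sorted({(start + step * jump) % size for step in range(size)})


table = [None] * 5
slots = [double_hash_insert(table, w)
         for w in ["aardvark", "aback", "abacus", "abaft"]]
print(slots)                      # [4, 3, 0, 1]

print(probe_positions(0, 2, 6))   # [0, 2, 4]: jump 2 shares a factor with size 6
print(probe_positions(0, 2, 5))   # [0, 1, 2, 3, 4]: size 5 is prime
```

    With a table of size 6 and a jump of 2, half of the slots are never probed, whereas the prime size 5 lets every jump amount from 1 to 4 reach all slots.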
