ICS 220 – Data Structures and Algorithms, Lecture 11 – Dr. Ken Cosh

Upload: ashlyn-bradford, posted 03-Jan-2016

TRANSCRIPT

Page 1: ICS 220 – Data Structures and Algorithms Lecture 11 Dr. Ken Cosh

ICS 220 – Data Structures and Algorithms

Lecture 11

Dr. Ken Cosh

Page 2: Review

• Hash Functions – spreading data evenly through a table, avoiding collisions while maintaining a simple, efficient function.

• Linear Probing, Quadratic Probing, General Increment Probing.

• Tombstone deletion.

Page 3: This week

• Data Compression Techniques – We can improve the efficiency of many algorithms by compressing data.
– Obviously storing ‘M’ or ‘F’, rather than ‘Male’ or ‘Female’, takes less space AND less time to transfer between parts of a program.
– Let’s investigate how we can efficiently encode data.

Page 4: Encoding 128

• 128 (decimal)

• 80 (hexadecimal)

• 10000000 (binary)

• CXXVIII (Roman numerals)

• ρκη (Greek numerals)

• ||||| ||||| … (128 tally marks)

Page 5: Assumptions

• We have ‘n’ different symbols we can use to encode a message.
– In Morse code, n = 3 (dot, dash, pause).
– In binary, n = 2.

• All symbols mi, forming the set M, have probabilities of occurrence P(mi) such that P(m1) + … + P(mn) = 1.
– Infrequently occurring symbols can be assigned a long code word, while short code words are reserved for frequent symbols.

Page 6: Coding Objectives

• Each codeword corresponds to exactly one symbol.

• Decoding should not require any look-ahead.
– This is known as the ‘prefix’ property.

Page 7: Prefix Property

• Symbols: A, B, C

• Codes: 1, 2, 12

• Message: 12

• Is it ‘AB’? Is it ‘C’?

Page 8: Prefix Property

• Symbols: A, B, C

• Codes: 1, 22, 12

• Message: 1222

• Read in 1 – is it an A?
• Read in 2 – was it a C?
• Read in 2 – should it be AB?
• Read in 2 – ah, finally we can assume it was CB.

Page 9: Prefix Property

• Symbols: A, B, C

• Codes: 11, 12, 21

• No Ambiguity here.
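The prefix property can be tested mechanically: after sorting the codewords, any codeword that is a prefix of another ends up immediately before it, so only adjacent pairs need checking. A small Python sketch (the function name is ours, not from the lecture):

```python
def is_prefix_free(codes):
    """Return True if no codeword is a prefix of another codeword."""
    words = sorted(codes)  # lexicographic sort places a prefix right before its extension
    return all(not words[i + 1].startswith(words[i])
               for i in range(len(words) - 1))

print(is_prefix_free(["1", "2", "12"]))    # False – the ambiguous set from page 7
print(is_prefix_free(["11", "12", "21"]))  # True – the unambiguous set above
```

The sets with codes 1/2/12 and 1/22/12 both fail this test, which is exactly why they needed look-ahead (or were outright ambiguous) to decode.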

Page 10: Code Optimisation

• The length of a code for one symbol should not exceed the length of a less likely symbol;
– if P(mi) ≤ P(mj) then L(mi) ≥ L(mj)

• There should be no unused short codes, either as stand-alone encodings or as prefixes for longer codes.
– 01, 000, 001, 100, 101 is not ideal, as 11 is not used.
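The "no unused short codes" criterion can be checked numerically with the Kraft sum: for a binary prefix code, the sum of 2^(−L(mi)) over all codewords equals exactly 1 when every prefix is used, and falls below 1 when code space is wasted. A small sketch (function name ours):

```python
def kraft_sum(codes):
    """Kraft sum of a binary code; 1.0 means no codeword space is wasted."""
    return sum(2.0 ** -len(c) for c in codes)

print(kraft_sum(["01", "000", "001", "100", "101"]))  # 0.75 – slack: '11' goes unused
print(kraft_sum(["0", "10", "11"]))                   # 1.0 – no unused prefixes
```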

Page 11: Huffman Coding

• Huffman coding is a method for choosing a representation for each symbol, resulting in a prefix-free code.
– The bit string representing one particular symbol is never a prefix of the bit string representing any other symbol.

• The most common symbols are expressed using shorter strings of bits than are used for less common symbols.

Page 12: Huffman Coding

• Huffman coding starts by creating a heap, based on the frequencies of the symbols to be encoded.
– The root of the heap holds the total – the sum of both child nodes (here 120, or 1.0 as a probability).
– Each symbol appears in a leaf node.

• Take the following set (with occurrence frequencies out of 120):

Symbol:    A  B  C  D  E  F  G  H  I
Frequency: 10 15  5 15 20  5 15 30  5
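The merging process shown on the next slide can be carried out mechanically with a min-heap. The sketch below is our own helper, not code from the lecture: it repeatedly merges the two lightest subtrees using Python's heapq. Tie-breaking between equal frequencies means the exact codewords may differ from the slide's tree, but the code lengths – and hence the total cost – remain optimal.

```python
import heapq

def huffman_codes(freqs):
    """Build a Huffman code from {symbol: frequency} with a min-heap."""
    # Heap entries: (weight, tie-break id, {symbol: code-so-far}).
    heap = [(f, i, {s: ""}) for i, (s, f) in enumerate(sorted(freqs.items()))]
    heapq.heapify(heap)
    tick = len(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)    # two lightest subtrees...
        f2, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}   # left branch gets 0
        merged.update({s: "1" + c for s, c in right.items()})  # right gets 1
        heapq.heappush(heap, (f1 + f2, tick, merged))    # ...merge and re-insert
        tick += 1
    return heap[0][2]

freqs = {'A': 10, 'B': 15, 'C': 5, 'D': 15, 'E': 20, 'F': 5,
         'G': 15, 'H': 30, 'I': 5}
codes = huffman_codes(freqs)
avg = sum(freqs[s] * len(codes[s]) for s in freqs) / sum(freqs.values())
print(avg)  # ≈ 2.96 bits per symbol on average
```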

Page 13: Creating a Heap

• There are several ways a heap can be created from the given data, here is one option;

[Diagram – Huffman tree: root 120 with subtrees of weight 55 and 65; each symbol sits in a leaf (C = 5, F = 5, I = 5, A = 10, B = 15, D = 15, G = 15, E = 20, H = 30), and each internal node holds the sum of its two children]

Page 14: Huffman Coding

• Each node is then given a code based on its position in the tree – 0 for a left branch, 1 for a right branch;

[Diagram – the same Huffman tree, with each left branch labelled 0 and each right branch labelled 1; reading the labels from root to leaf gives each symbol’s code]

A = 000

B = 010

C = 0010

D = 011

E = 111

F = 00110

G = 110

H = 10

I = 00111

Page 15: Huffman Decoding

A = 000

B = 010

C = 0010

D = 011

E = 111

F = 00110

G = 110

H = 10

I = 00111

0010000010010000110111

000011011111011

01111100000110

0100011111010111000011111011

10111000011000001010111

Does this code fit our prefix requirements?
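Because the code is prefix-free, a decoder can emit a symbol the moment its buffer matches a codeword, with no look-ahead at all. A sketch (our own code, not from the lecture) that decodes the messages above:

```python
# The code table from the slide.
CODES = {'A': "000", 'B': "010", 'C': "0010", 'D': "011", 'E': "111",
         'F': "00110", 'G': "110", 'H': "10", 'I': "00111"}

def decode(bits, codes=CODES):
    """Greedy prefix-free decode: emit as soon as the buffer is a codeword."""
    rev = {c: s for s, c in codes.items()}
    out, buf = [], ""
    for b in bits:
        buf += b
        if buf in rev:          # complete codeword – safe to emit immediately,
            out.append(rev[buf])  # since no longer codeword starts with it
            buf = ""
    if buf:
        raise ValueError("trailing bits do not form a codeword")
    return "".join(out)

print(decode("0010000010010000110111"))  # CABBAGE
print(decode("000011011111011"))         # ADDED
```

Decoding the remaining three messages the same way is a good check that the table really has the prefix property.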

Page 16: Adaptive Huffman

• Huffman coding was invented back in the 1950s, and has a key assumption – that we know the expected frequency of each symbol.

• We could get these frequencies from a large corpus, such as the BNC, but the resulting Huffman tree might not be ideal for all circumstances.
– A technical computing paper is likely to contain more non-alphabetic symbols than ‘War and Peace’.

• Adaptive Huffman coding is usable in this situation, as it reconstructs the tree as data is read in.

Page 17: Adaptive Huffman

• Initially all symbols are unused and so stored in a leaf with value 0. As symbols are used they are moved up the tree.

• As symbols are used more frequently, they are compared and if necessary swapped with their siblings.

• Sibling comparison is easy if the heap is stored as a doubly linked list.

Page 18: Doubly Linked List

[Diagram – the Huffman tree with its nodes additionally threaded into a doubly linked list in order of weight, so that a node’s siblings can be located and compared quickly]

Page 19: Cabbages are bad

[Diagram – adaptive Huffman trees for the start of the message: “add c”, “add a”, “add b”, then a transpose step; the unused symbols from {a, b, c, d, e, g, r, s} share a single 0-weight leaf, and each symbol moves out of it when first used]

Page 20: Adaptive Huffman

• Note that adaptive Huffman requires just one pass through the message to encode it;
– The first time the b is transmitted it has the code 101.
– The second time the b is transmitted it has the code 0.

• As symbols rise up the tree, their codes become shorter.

• As long as the decoder uses the same algorithm, it can decode the message.
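The real adaptive algorithm (FGK-style) updates the tree in place using the sibling swaps described above. As a simplified illustration only – not the sibling-swap algorithm itself – the sketch below rebuilds a static Huffman code after every symbol. Because the decoder performs exactly the same deterministic rebuild, encoder and decoder stay in sync and no code table is ever transmitted. The alphabet and the start-at-1 counts are our own assumptions for the sketch.

```python
import heapq
from collections import Counter

def huffman_codes(weights):
    """Deterministic static Huffman code from {symbol: weight}."""
    heap = [(w, i, {s: ""}) for i, (s, w) in enumerate(sorted(weights.items()))]
    heapq.heapify(heap)
    tick = len(heap)
    while len(heap) > 1:
        w1, _, a = heapq.heappop(heap)
        w2, _, b = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in a.items()}
        merged.update({s: "1" + c for s, c in b.items()})
        heapq.heappush(heap, (w1 + w2, tick, merged))
        tick += 1
    return heap[0][2]

def adaptive_encode(message, alphabet):
    """Re-derive the code after each symbol; counts start at 1 (smoothing)."""
    counts = Counter({s: 1 for s in alphabet})
    out = []
    for ch in message:
        out.append(huffman_codes(counts)[ch])  # code from counts seen so far
        counts[ch] += 1                        # update AFTER emitting
    return "".join(out)

def adaptive_decode(bits, alphabet):
    """Mirror the encoder step for step, so the tables always match."""
    counts = Counter({s: 1 for s in alphabet})
    rev = {c: s for s, c in huffman_codes(counts).items()}
    out, buf = [], ""
    for b in bits:
        buf += b
        if buf in rev:
            sym = rev[buf]
            out.append(sym)
            counts[sym] += 1
            rev = {c: s for s, c in huffman_codes(counts).items()}
            buf = ""
    return "".join(out)

alphabet = "abcdegrs"  # the symbol set from the slide's example
bits = adaptive_encode("cabbage", alphabet)
print(adaptive_decode(bits, alphabet))  # round-trips to 'cabbage'
```

Rebuilding the whole tree per symbol is far slower than the in-place sibling swaps, which is precisely why the real algorithm maintains the doubly linked sibling list instead.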

Page 21: Run-Length Encoding

• The word cabbage has a repeated b, so run-length encoding might be able to shorten the code by indicating that there are 2 b’s.
– This is similar to the ๆ (repetition) symbol in Thai.

• Naive run-length encoding doesn’t work if there are digits in the data, as the run counts become ambiguous with the data itself!
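A minimal sketch of run-length encoding (our own code, not from the lecture) that sidesteps the digit-ambiguity problem by keeping (count, symbol) pairs rather than splicing counts into the character stream:

```python
from itertools import groupby

def rle_encode(text):
    """Collapse each run of identical characters into a (count, symbol) pair."""
    return [(len(list(run)), ch) for ch, run in groupby(text)]

def rle_decode(pairs):
    """Expand (count, symbol) pairs back into the original string."""
    return "".join(ch * n for n, ch in pairs)

print(rle_encode("cabbage"))  # [(1, 'c'), (1, 'a'), (2, 'b'), (1, 'a'), (1, 'g'), (1, 'e')]
print(rle_decode(rle_encode("aaabbbccd")))  # aaabbbccd
```

Because counts live in their own field, a message like "112233" encodes unambiguously as [(2, '1'), (2, '2'), (2, '3')].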

Page 22: Homework

• Taking the following phrase;– “Data Structures And Algorithms Analysis”

• Use the Adaptive Huffman Coding method to generate a code for it.