Data Structures and Algorithms Huffman compression: An Application of Binary Trees and Priority Queues


CS 102

Encoding and Compression
- Fax machines
- ASCII
- Variations on ASCII: minimum number of bits needed, cost of savings, patterns, modifications

Purpose of Huffman Coding
- Proposed by Dr. David A. Huffman in 1952: "A Method for the Construction of Minimum-Redundancy Codes"
- Applicable to many forms of data transmission
- Our example: text files

An Introduction to Huffman Coding
Mike Scott, March 21, 2000

The Basic Algorithm
- Huffman coding is a form of statistical coding
- Not all characters occur with the same frequency!
- Yet all characters are allocated the same amount of space: 1 char = 1 byte, be it 'e' or 'x'

The Basic Algorithm
- Any savings in tailoring codes to the frequency of characters?
- Code word lengths are no longer fixed, as in ASCII
- Code word lengths vary and are shorter for the more frequently used characters

The (Real) Basic Algorithm
1. Scan the text to be compressed and tally the occurrences of all characters.
2. Sort or prioritize the characters based on their number of occurrences in the text.
3. Build the Huffman code tree based on the prioritized list.
4. Perform a traversal of the tree to determine all code words.
5. Scan the text again and create the new file using the Huffman codes.

Building a Tree
Consider the following short text:

Eerie eyes seen near lake.

Count up the occurrences of all characters in the text.

What characters are present?
E e r i space y s n a l k .

Building a Tree
- Prioritize the characters: create a binary tree node for each character and its frequency
- Place the nodes in a priority queue: the lower the occurrence count, the higher the priority in the queue
- Uses binary tree nodes:

public class HuffNode {
    public char myChar;
    public int myFrequency;
    public HuffNode myLeft, myRight;
}

PriorityQueue<HuffNode> myQueue;

Building a Tree
The queue after inserting all nodes:

(Null pointers are not shown.)

E:1  i:1  y:1  l:1  k:1  .:1  r:2  s:2  n:2  a:2  sp:4  e:8

Building a Tree
While the priority queue contains two or more nodes:
- Create a new node
- Dequeue a node and make it the left subtree
- Dequeue the next node and make it the right subtree
- The frequency of the new node equals the sum of the frequencies of its left and right children
- Enqueue the new node back into the queue
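Putting the pieces together, steps 1 through 3 (tally, prioritize, merge) can be sketched as a runnable Java program. The class and method names other than HuffNode are illustrative, not from the slides, and Java's built-in PriorityQueue stands in for the queue the slides describe:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.PriorityQueue;

public class HuffmanBuild {
    // Node class mirroring the slides' HuffNode.
    static class HuffNode {
        char myChar;
        int myFrequency;
        HuffNode myLeft, myRight;
        HuffNode(char c, int f) { myChar = c; myFrequency = f; }
    }

    // Steps 1-2: tally occurrences and load a min-priority queue
    // (the lower the occurrence count, the higher the priority).
    static PriorityQueue<HuffNode> buildQueue(String text) {
        Map<Character, Integer> counts = new HashMap<>();
        for (char c : text.toCharArray()) counts.merge(c, 1, Integer::sum);
        PriorityQueue<HuffNode> queue =
            new PriorityQueue<>((a, b) -> a.myFrequency - b.myFrequency);
        for (Map.Entry<Character, Integer> e : counts.entrySet())
            queue.add(new HuffNode(e.getKey(), e.getValue()));
        return queue;
    }

    // Step 3: repeatedly merge the two lowest-frequency nodes
    // until a single tree remains.
    static HuffNode buildTree(PriorityQueue<HuffNode> queue) {
        while (queue.size() >= 2) {
            HuffNode parent = new HuffNode('\0', 0); // internal node, no character
            parent.myLeft = queue.poll();            // dequeue -> left subtree
            parent.myRight = queue.poll();           // dequeue -> right subtree
            parent.myFrequency =
                parent.myLeft.myFrequency + parent.myRight.myFrequency;
            queue.add(parent);                       // enqueue the merged node
        }
        return queue.poll();                         // root of the Huffman tree
    }

    public static void main(String[] args) {
        PriorityQueue<HuffNode> queue = buildQueue("Eerie eyes seen near lake.");
        System.out.println(queue.size()); // 12 distinct characters
        HuffNode root = buildTree(queue);
        // The root's frequency equals the number of characters in the text.
        System.out.println(root.myFrequency); // 26
    }
}
```

Note that ties in the comparator are broken arbitrarily, so the exact tree shape can differ from the slides' figures while remaining equally optimal.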

Building a Tree
[Figure sequence: the tree is built step by step. At each step, the two lowest-frequency nodes are dequeued, merged under a new internal node, and the merged node is enqueued again. The internal nodes created along the way have frequencies 2, 2, 2, 4, 4, 4, 6, 8, 10, and 16, and the final merge produces the root with frequency 26.]

What is happening to the characters with a low number of occurrences?

After enqueueing the final merged node, there is only one node left in the priority queue.

Building a Tree
Dequeue the single node left in the queue.

This tree contains the new code words for each character.

The frequency of the root node should equal the number of characters in the text.

Eerie eyes seen near lake. (26 characters)

Encoding the File: Traverse Tree for Codes
- Perform a traversal of the tree to obtain the new code words
- Going left is a 0; going right is a 1
- A code word is complete only when a leaf node is reached

Char    Code
E       0000
i       0001
y       0010
l       0011
k       0100
.       0101
space   011
e       10
r       1100
s       1101
n       1110
a       1111

Encoding the File
Rescan the text and encode the file using the new code words:

Eerie eyes seen near lake.

0000101100000110011100010101101101001111101011111100011001111110100100101

Why is there no need for a separator character?
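The traversal that produces the code words, and the separator-free encoding, can be sketched in Java. This is an illustrative sketch using a tiny hand-built tree (giving codes e = 0, r = 10, s = 11), not the slides' full 12-character example:

```java
import java.util.HashMap;
import java.util.Map;

public class HuffmanCodes {
    static class HuffNode {
        char myChar;
        HuffNode myLeft, myRight;
        HuffNode(char c) { myChar = c; }
        HuffNode(HuffNode l, HuffNode r) { myLeft = l; myRight = r; }
        boolean isLeaf() { return myLeft == null && myRight == null; }
    }

    // Walk the tree: append '0' going left, '1' going right;
    // a code word is complete only at a leaf.
    static void collectCodes(HuffNode node, String path, Map<Character, String> codes) {
        if (node.isLeaf()) {
            codes.put(node.myChar, path);
            return;
        }
        collectCodes(node.myLeft, path + "0", codes);
        collectCodes(node.myRight, path + "1", codes);
    }

    public static void main(String[] args) {
        // Tiny hand-built tree: 'e' on the left of the root,
        // an internal node with 'r' and 's' on the right.
        HuffNode root = new HuffNode(new HuffNode('e'),
                                     new HuffNode(new HuffNode('r'), new HuffNode('s')));
        Map<Character, String> codes = new HashMap<>();
        collectCodes(root, "", codes);
        System.out.println(codes.get('e')); // 0
        System.out.println(codes.get('r')); // 10

        // Encoding: simply concatenate code words. No separator is needed
        // because no code word is a prefix of another (the prefix property:
        // every code word ends at a leaf).
        StringBuilder bits = new StringBuilder();
        for (char c : "errs".toCharArray()) bits.append(codes.get(c));
        System.out.println(bits); // 0101011
    }
}
```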

Encoding the File: Results
- Have we made things any better?
- 73 bits to encode the text
- ASCII would take 8 * 26 = 208 bits
- If a modified code of 4 bits per character were used instead, the total would be 4 * 26 = 104 bits, and the savings would not be as great

Decoding the File
- How does the receiver know what the codes are?
- One option: a tree is constructed for each text file
  - Considers the exact frequencies in each file
  - Big hit on compression, especially for smaller files
- Another option: the tree is predetermined
  - Based on statistical analysis of text files or file types
- Either way, data transmission is bit-based rather than byte-based

Decoding the File
Once the receiver has the tree, it scans the incoming bit stream:
- 0: go left
- 1: go right

Example bit stream: 10100011011110111101111110000110101

Summary
- Huffman coding is a technique used to compress files for transmission
- It uses statistical coding: more frequently used symbols have shorter code words
- It works well for text and fax transmissions
- It is an application that uses several data structures

Huffman Compression
Is the code correct?
- Based on the way the tree is formed, the code words are valid
- The prefix property is assured, since each code word ends at a leaf
- All of the original nodes corresponding to characters end up as leaves

Does it give good compression?
- For a block code of N different characters, log2(N) bits are needed per character
- Thus for a file containing M ASCII characters, 8M bits are needed

Huffman Compression
- Given Huffman codes {C0, C1, ..., CN-1} for the N characters in the alphabet, each of length |Ci|
- Given frequencies {F0, F1, ..., FN-1} in the file, where the sum of all the frequencies equals M
- The total bits required for the file is the sum from i = 0 to N-1 of |Ci| * Fi
- The overall total depends on the differences in frequencies: the more extreme the differences, the better the compression; if the frequencies are all the same, there is no compression
- See example from board

Huffman Shortcomings
What is Huffman missing?
- Although optimal for single-character compression, Huffman does not take patterns or repeated sequences in a file into account
- Example: a file with 1000 As followed by 1000 Bs, and so on for every ASCII character, will not compress at all with Huffman, yet it seems like such a file should be compressible
- Run-length encoding can be used in this case (see text), but it is very specific and not generally effective for most files, since they do not typically have long runs of each character
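The bit-by-bit decoding walk described in the decoding slides (0 = go left, 1 = go right, emit a character whenever a leaf is reached, then restart at the root) can be sketched as follows. The tiny hand-built tree (codes e = 0, r = 10, s = 11) is illustrative, not the slides' full example:

```java
public class HuffmanDecode {
    static class HuffNode {
        char myChar;
        HuffNode myLeft, myRight;
        HuffNode(char c) { myChar = c; }
        HuffNode(HuffNode l, HuffNode r) { myLeft = l; myRight = r; }
        boolean isLeaf() { return myLeft == null && myRight == null; }
    }

    // Scan the incoming bit stream: 0 = go left, 1 = go right.
    // On reaching a leaf, output its character and restart at the root.
    static String decode(HuffNode root, String bits) {
        StringBuilder out = new StringBuilder();
        HuffNode node = root;
        for (char bit : bits.toCharArray()) {
            node = (bit == '0') ? node.myLeft : node.myRight;
            if (node.isLeaf()) {
                out.append(node.myChar);
                node = root; // restart for the next code word
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Tiny example tree: e = 0, r = 10, s = 11.
        HuffNode root = new HuffNode(new HuffNode('e'),
                                     new HuffNode(new HuffNode('r'), new HuffNode('s')));
        System.out.println(decode(root, "0101011")); // errs
    }
}
```

The prefix property is what makes this single forward scan unambiguous: the walk can never pass through one valid code word on its way to another.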