compression for sending and storing information text, audio, images, videos
Post on 21-Dec-2015
Compression
For sending and storing information
Text, audio, images, videos
Common Applications
• Text compression
– loss-less, gzip uses Lempel-Ziv coding, 3:1 compression
– better than Huffman
• Audio compression
– lossy, MPEG, 3:1 to 24:1 compression
– MPEG = Moving Picture Experts Group
• Image compression
– lossy, JPEG, 3:1 compression
– JPEG = Joint Photographic Experts Group
• Video compression
– lossy, MPEG, 27:1 compression
Text Compression
• Prefix code: one of many approaches
– no codeword is a prefix of any other codeword
– constraint: lossless
– tasks
• encode: text (string) -> code
• decode: code -> text
– main goal: maximally reduce storage, measured by compression ratio
– minor goals:
• simplicity
• efficiency: time and space
– some methods require a code dictionary or 2 passes over the data
Simplest Text Encoding
• Run-length encoding
• Requires special character, say @
• Example Source:
– ACCCTGGGGGAAAACCCCCC
• Encoding:
– A@C3T@G5@A4@C6
• Method
– any run of 3 or more identical characters is replaced by @char#, where # is the run length
• +: simple
• -: special characters, non-optimal
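The run-length scheme above can be sketched in a few lines. This is a minimal illustration, not a production codec: the function names are made up for the example, and it assumes the source text contains neither the escape character @ nor any digits (otherwise decoding would be ambiguous).

```python
from itertools import groupby

ESCAPE = "@"  # assumption: never occurs in the source text


def rle_encode(text, min_run=3):
    """Replace every run of min_run or more identical characters by @char#."""
    out = []
    for ch, group in groupby(text):
        n = len(list(group))
        if n >= min_run:
            out.append(f"{ESCAPE}{ch}{n}")
        else:
            out.append(ch * n)  # short runs are copied unchanged
    return "".join(out)


def rle_decode(code):
    """Invert rle_encode (assumes the original text had no digits)."""
    out = []
    i = 0
    while i < len(code):
        if code[i] == ESCAPE:
            ch = code[i + 1]
            j = i + 2
            while j < len(code) and code[j].isdigit():
                j += 1  # the run length may have several digits
            out.append(ch * int(code[i + 2:j]))
            i = j
        else:
            out.append(code[i])
            i += 1
    return "".join(out)


print(rle_encode("ACCCTGGGGGAAAACCCCCC"))  # A@C3T@G5@A4@C6
```

On the slide's source string this shrinks 20 characters to 14; on text without long runs it saves nothing, which is the "non-optimal" point above.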
Shannon’s Information Theory (1948): How well can we encode?
• Shannon’s goal: reduce size of messages for improved communication
• What messages would be easiest/hardest to send?
– Random bits hardest - no redundancy or pattern
• Formal definition: S, a set of symbols si, where symbol si occurs with probability pi
• Information content of S = -sum pi*log(pi), with log taken base 2 so the unit is bits
– measure of randomness
– more random, less predictable, higher information content!
• Theorem: this is the only measure that satisfies several natural properties
• Information is not knowledge
• Compression relies on finding regularities or redundancies.
Example
• Send ACTG each occurring 1/4 of the time
• Code: A--00, C--01, T--10, G--11
• 2 bits per letter: no surprise
• Average message length:
– prob(A)*codelength(A)+prob(C)*codelength(C) +…
– 1/4*2+…. = 2 bits.
• Now suppose:
– prob(A) = 13/16 and the others 1/16 each
– Codes: A - 1; C-00, G-010, T-011 (prefix)
– 13/16*1+ 1/16*2+ 1/16*3+1/16*3=21/16 = 1.31 bits (approx.)
• What is best result? Part of the answer:
• The information content! But how to get it?
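The arithmetic in this example can be checked mechanically. The sketch below uses exact fractions; the helper names are invented for the illustration, and the probabilities and codes are the ones from the example.

```python
from fractions import Fraction


def average_code_length(probs, codes):
    """Expected bits per symbol: sum of prob(s) * codelength(s)."""
    return sum(probs[s] * len(codes[s]) for s in probs)


def is_prefix_free(codes):
    """True if no codeword is a prefix of a different codeword."""
    words = list(codes.values())
    return not any(
        i != j and words[j].startswith(words[i])
        for i in range(len(words))
        for j in range(len(words))
    )


uniform_probs = {s: Fraction(1, 4) for s in "ACTG"}
uniform_codes = {"A": "00", "C": "01", "T": "10", "G": "11"}

skewed_probs = {"A": Fraction(13, 16), "C": Fraction(1, 16),
                "G": Fraction(1, 16), "T": Fraction(1, 16)}
skewed_codes = {"A": "1", "C": "00", "G": "010", "T": "011"}

print(average_code_length(uniform_probs, uniform_codes))  # 2
print(average_code_length(skewed_probs, skewed_codes))    # 21/16
```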
Understanding Entropy/Information
• Suppose a set S is divided into k classes
• Let ni be the number of elements in class i
• Let N be the sum of all ni.
• Let pi be ni/N (the frequency of class i)
• Entropy(S) = -p1*log(p1) - p2*log(p2) -….-pk*log(pk).
• Note if k = 2, same as before.
• If all classes equally likely (pi = 1/k) then
– Entropy(S) = - 1/k*log(1/k) - … = -log(1/k) = log(k)
– If k = power of 2, then this is number of bits to distinguish all classes
• If one class has probability 1, then
– Entropy(S) = - 0*log(..) -… -1*log(1) … = 0
– Set isn’t mixed up at all.
• Intuitively entropy gives right answers.
• Learning Hint: To understand equations, try special cases.
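Following the learning hint, the special cases can be tried numerically. This is a small sketch assuming base-2 logarithms and the usual convention that a class with zero count contributes nothing; the function name is chosen for the example.

```python
from math import log2


def entropy(counts):
    """Entropy in bits of a set split into classes with the given counts ni."""
    total = sum(counts)
    # Classes with ni = 0 are skipped: the limit of p*log(p) as p -> 0 is 0.
    return sum(-(n / total) * log2(n / total) for n in counts if n > 0)


print(entropy([1, 1, 1, 1]))   # k = 4 equally likely classes: log(4) = 2 bits
print(entropy([5, 0, 0]))      # one class has probability 1: 0 bits
print(entropy([13, 1, 1, 1]))  # skewed case: a little under 1 bit
```

For the skewed ACTG distribution (13/16, 1/16, 1/16, 1/16) the entropy is about 0.99 bits, which is the lower bound that the 21/16-bit prefix code is being compared against.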
The Shannon-Fano Algorithm
• Earliest algorithm: Heuristic divide and conquer
• Illustration: source text with only letters ABCDE
Symbol   A    B    C    D    E
------------------------------
Count    15   7    6    6    5
• Intuition: frequent letters get short codes
• 1. Sort symbols according to their frequencies/probabilities, i.e. ABCDE.
• 2. Recursively divide into two parts, each with approximately the same total count.
• This is an instance of “balancing”, which is NP-complete.
• Note: variable length codes.
Shannon-Fano Tree
Note Prefix property
[Figure: Shannon-Fano code tree; left edges are labeled 0 and right edges 1. Leaves: a = 00, b = 01, c = 10, d = 110, e = 111.]
Result for this distribution
Symbol   Count   log(1/p)   Code   # of bits
------   -----   --------   ----   ---------
A        15      1.38       00     30
B        7       2.48       01     14
C        6       2.70       10     12
D        6       2.70       110    18
E        5       2.96       111    15
TOTAL (# of bits): 89
average code length = 89/39 = 2.28 bits per symbol (approx.)
Note: Prefix property for decoding
Can you do better?
Theoretical optimum = -sum pi*log(pi) = entropy
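The two Shannon-Fano steps can be sketched as a short recursive function. This is one possible reading of "approximately the same number of counts": the split point is chosen to minimise the difference between the two halves' totals, and ties keep the earlier split. Function and variable names are illustrative only.

```python
def shannon_fano(counts):
    """Return {symbol: code} built by recursive, roughly balanced splitting."""
    # Step 1: sort by decreasing count (sorted() is stable, so ties keep input order).
    symbols = sorted(counts, key=lambda s: -counts[s])
    codes = {}

    def split(group, prefix):
        if len(group) == 1:
            codes[group[0]] = prefix or "0"  # a lone symbol still needs one bit
            return
        total = sum(counts[s] for s in group)
        running = 0
        best_i, best_diff = 1, None
        # Step 2: try every cut position and keep the most balanced one.
        for i in range(1, len(group)):
            running += counts[group[i - 1]]
            diff = abs(total - 2 * running)
            if best_diff is None or diff < best_diff:
                best_i, best_diff = i, diff
        split(group[:best_i], prefix + "0")
        split(group[best_i:], prefix + "1")

    split(symbols, "")
    return codes


counts = {"A": 15, "B": 7, "C": 6, "D": 6, "E": 5}
codes = shannon_fano(counts)
print(codes)
print(sum(counts[s] * len(codes[s]) for s in counts))  # 89
```

With the counts from the table this reproduces the codes 00, 01, 10, 110, 111 and the 89-bit total.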
Code Tree Method/Analysis
• Binary tree method
• Internal nodes have left/right references:
– 0 means go to the left
– 1 means go to the right
• Leaf nodes store the value
• Decode time-cost is O(log N) per character for a balanced tree (in general, proportional to the code length)
• Decode space-cost is O(N)
– quick argument: number of leaves > number of internal nodes.
– Proof: induction on …..
• number of internal nodes.
• Prefix Property: no code is a prefix of another, so each complete code uniquely defines a char.
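A minimal sketch of the binary tree method for decoding follows. The Node class and function names are hypothetical; the code table used in the usage lines is the Shannon-Fano result for ABCDE.

```python
class Node:
    """Tree node: leaves store a character, internal nodes store left/right references."""

    def __init__(self, char=None):
        self.char = char
        self.left = None   # followed on bit 0
        self.right = None  # followed on bit 1


def build_tree(codes):
    """Build the code tree from a {char: bitstring} prefix code."""
    root = Node()
    for char, bits in codes.items():
        node = root
        for b in bits:
            if b == "0":
                if node.left is None:
                    node.left = Node()
                node = node.left
            else:
                if node.right is None:
                    node.right = Node()
                node = node.right
        node.char = char
    return root


def decode(root, bits):
    """Walk the tree bit by bit; emit a character at each leaf and restart at the root."""
    out = []
    node = root
    for b in bits:
        node = node.left if b == "0" else node.right
        if node.char is not None:
            out.append(node.char)
            node = root
    return "".join(out)


codes = {"A": "00", "B": "01", "C": "10", "D": "110", "E": "111"}
root = build_tree(codes)
print(decode(root, "00" + "110" + "01"))  # ADB
```

Because of the prefix property, no separators are needed between codes: reaching a leaf is the only signal that a character is complete.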
Code Encode(character)
• Again can use binary prefix tree
• For encode and decode could use hashing
– yields O(1) encode/decode time
– O(N) space cost ( N is size of alphabet)
• For compression, main goal is reducing storage size
– in example it’s the total number of bits
– code size for single character = depth of its leaf in the tree
– code size for document = sum of (frequency of char * depth of character)
– different trees yield different storage efficiency
– What’s the best tree?
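As a sketch of the hashing approach and of the cost formula, the snippet below encodes with a dictionary lookup and compares two prefix codes on the same text. The second code table is an alternative chosen for the comparison, not one given on the slides, and the function names are illustrative.

```python
from collections import Counter


def encode(text, codes):
    """O(1) lookup per character using a hash table (dict)."""
    return "".join(codes[ch] for ch in text)


def document_cost(text, codes):
    """Total bits = sum over chars of (frequency of char * depth of char)."""
    freq = Counter(text)
    return sum(n * len(codes[ch]) for ch, n in freq.items())


text = "A" * 15 + "B" * 7 + "C" * 6 + "D" * 6 + "E" * 5

tree_1 = {"A": "00", "B": "01", "C": "10", "D": "110", "E": "111"}
tree_2 = {"A": "0", "B": "100", "C": "101", "D": "110", "E": "111"}  # assumed alternative

print(document_cost(text, tree_1))  # 89
print(document_cost(text, tree_2))  # 87
```

Both tables are valid prefix codes for the same text, yet the totals differ, which is why the shape of the tree matters.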
Huffman Code
• Provably optimal: i.e. yields minimum storage cost
• Algorithm: CodeTree huff(document)
1. Compute the frequency and a leaf node for each char
• leaf node has count field and character
2. Remove the 2 nodes with the least counts and create a new node whose count is the sum of their counts and whose sons are the two removed nodes; put the new node back with the remaining nodes.
• internal node has 2 node ptrs and count field
3. Repeat 2 until only 1 node left.
4. That’s it!
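The four steps can be sketched with a priority queue. This is an illustrative implementation, not the slides' CodeTree type: internal nodes are represented as plain (left, right) tuples, and a running counter breaks ties so that nodes themselves are never compared.

```python
import heapq
from itertools import count


def huffman_codes(freq):
    """Return {char: code} for a {char: count} table using Huffman's algorithm."""
    tiebreak = count()
    # Step 1: one leaf per character, keyed by its count.
    heap = [(n, next(tiebreak), ch) for ch, n in freq.items()]
    heapq.heapify(heap)
    if len(heap) == 1:
        return {heap[0][2]: "0"}  # a single character still needs one bit
    # Steps 2 and 3: merge the two least-count nodes until one node is left.
    while len(heap) > 1:
        n1, _, left = heapq.heappop(heap)
        n2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (n1 + n2, next(tiebreak), (left, right)))
    codes = {}

    def walk(node, prefix):
        if isinstance(node, tuple):  # internal node: (left son, right son)
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:                        # leaf: a character
            codes[node] = prefix

    walk(heap[0][2], "")
    return codes


freq = {"a": 10, "e": 15, "i": 12, "s": 3, "t": 4}
codes = huffman_codes(freq)
print(sum(freq[c] * len(codes[c]) for c in freq))  # 95
```

The exact bit patterns depend on which son is called 0 and which 1, so only the code lengths and the total cost are meaningful to compare.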
Bad code example
char   code   frequency   bits
a      000    10          30
e      001    15          45
i      010    12          36
s      011    3           9
t      10     4           8
Total                     128
Tree, a la Huffman
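Repeat: Merge lowest frequency nodes
[Figure: Huffman tree built bottom-up. Leaves 3 and 4 merge into 7; 7 and leaf 10 merge into 17; leaves 15 and 12 merge into 27; 17 and 27 merge into the root, 44.]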
Tree with codes: note Prefix property
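freq / code / char
[Figure: the same tree with leaves labeled freq / code / char: 10 / 00 / a, 3 / 010 / s, 4 / 011 / t, 15 / 10 / e, 12 / 11 / i. Internal nodes: 7, 17, 27; root: 44.]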
Tree Cost
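bits/node; total bits: 95 (before 128)
[Figure: the same tree with leaves labeled freq / code length / bits: 10 / 2 / 20, 3 / 3 / 9, 4 / 3 / 12, 15 / 2 / 30, 12 / 2 / 24. Internal nodes: 7, 17, 27; root: 44.]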
Analysis
• Intuition: least frequent chars get longest codes or most frequent chars get shortest codes.
• Let T be a minimal code tree. (Induction)
– All internal nodes have 2 sons. (by construction)
– Lemma: if c1 and c2 are the two least frequently used chars, then they are at the deepest level of T
• Proof:
– if they are not the deepest nodes, exchange them with the deepest nodes; the total cost (number of bits) does not go up, and goes down if the frequencies differ
Analysis (continued)
• Sk : Huffman algorithm on k chars produces optimal code.
– S2: obvious
– Sk => Sk+1
• Let T be optimal code tree on k+1 chars
• By lemma, two least freq chars are deepest (and can be taken to be siblings)
• Replace two least freq chars by new char with freq equal to their sum
• Now have tree with k leaves
• By induction, Huffman yields optimal tree on these k chars; splitting the new char back into its two sons gives an optimal tree on k+1 chars.
Lempel-Ziv
• Input: string of characters
• Internal: dictionary of (codewords, words)
• Output: string of codewords and characters.
• Codewords are distinct from characters.
• In algorithm, w is a string, c is character and w+c means concatenation.
• When adding a new word to the dictionary, a new code word needs to be assigned.
Lempel-Ziv Algorithm
w = NIL;
while ( read a character c )
{
    if w+c exists in the dictionary
        w = w+c;
    else
    {
        add w+c to the dictionary;
        output the code for w;
        w = c;
    }
}
output the code for w;    // flush the last pending word
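A runnable sketch of this loop in the LZW style follows. It assumes the dictionary is pre-loaded with the 256 single-byte characters (so every character of the input must have a code point below 256) and that new codewords are the integers 256, 257, and so on; the slides do not fix these details. A matching decoder is included to show that the dictionary never has to be transmitted.

```python
def lzw_compress(text):
    """Return a list of integer codes for text (characters assumed below 256)."""
    dictionary = {chr(i): i for i in range(256)}
    next_code = 256
    w = ""
    out = []
    for c in text:
        if w + c in dictionary:
            w = w + c
        else:
            out.append(dictionary[w])
            dictionary[w + c] = next_code  # assign a new codeword
            next_code += 1
            w = c
    if w:
        out.append(dictionary[w])          # flush the last pending word
    return out


def lzw_decompress(codes):
    """Rebuild the text, re-creating the dictionary on the fly."""
    if not codes:
        return ""
    dictionary = {i: chr(i) for i in range(256)}
    next_code = 256
    w = dictionary[codes[0]]
    out = [w]
    for k in codes[1:]:
        if k in dictionary:
            entry = dictionary[k]
        elif k == next_code:
            entry = w + w[0]               # code defined by the step being decoded
        else:
            raise ValueError("invalid LZW code")
        out.append(entry)
        dictionary[next_code] = w + entry[0]
        next_code += 1
        w = entry
    return "".join(out)


message = "TOBEORNOTTOBEORTOBEORNOT"
packed = lzw_compress(message)
print(len(message), len(packed))  # 24 16
```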
Adaptive Encoding
• Webster has 157,000 entries: could encode in X bits
– but only works for this document
– Don’t want to do two passes
• Adaptive Huffman
– modify model on the fly
• Lempel-Ziv 1977
• LZW: Lempel-Ziv-Welch
– 1984, used in compress (UNIX)
– uses dictionary method
– variable number of symbols to fixed length code
– better with large documents- finds repetitive patterns
Audio Compression
• Sounds can be represented as a vector valued function
• At any point in time, a sound is a combination of different frequencies of different strengths
• For example, each note on a piano yields a specific frequency.
• Also, our ears, like pianos, have cilia that respond to specific frequencies.
• Just like sin(x) can be approximated by a small number of terms, e.g. x - x^3/6 + x^5/120…, so can sound.
• Transforming a sound into its “spectrum” is done mathematically by a Fourier transform.
• The spectrum can be played back, as on computer with sound cards.
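To make the "spectrum" idea concrete, here is a small sketch using a naive discrete Fourier transform from the standard library only. The test signal (two sine waves at 3 and 7 cycles per window, 64 samples) is made up for the illustration, and a real codec would use a fast transform instead of this O(N^2) loop.

```python
import cmath
import math


def dft(signal):
    """Naive discrete Fourier transform: one complex coefficient per frequency bin."""
    n = len(signal)
    return [
        sum(signal[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def strongest_frequencies(signal, how_many):
    """Return the how_many frequency bins (below half the window) with the largest strength."""
    spectrum = dft(signal)
    bins = range(1, len(signal) // 2)
    ranked = sorted(bins, key=lambda k: -abs(spectrum[k]))
    return sorted(ranked[:how_many])


N = 64
signal = [
    math.sin(2 * math.pi * 3 * t / N) + 0.5 * math.sin(2 * math.pi * 7 * t / N)
    for t in range(N)
]
print(strongest_frequencies(signal, 2))  # [3, 7]
```

The 64 samples are fully described by two frequencies and their strengths, which is the sense in which a few terms can stand in for the whole sound.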
Audio
• Using many frequencies, as in CDs, yields a good approximation
• Using few frequencies, as in telephones, yields a poor approximation
• Sampling frequencies yields compression ratios between 6:1 and 24:1, depending on sound and quality
• High-priced electronic pianos store and reuse “samples” of concert pianos
• High filter: removes/reduces high frequencies (loss of high frequencies is a common problem with aging)
• Low filter: removes/reduces low frequencies
• Can use differential methods:
– only report change in sounds
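A differential method can be sketched as plain delta coding: store the first sample, then only the change from each sample to the next. The sample values below are invented for the example.

```python
def delta_encode(samples):
    """Replace each sample by its difference from the previous one (first is kept as-is)."""
    prev = 0
    out = []
    for s in samples:
        out.append(s - prev)
        prev = s
    return out


def delta_decode(deltas):
    """Undo delta_encode by accumulating the differences."""
    total = 0
    out = []
    for d in deltas:
        total += d
        out.append(total)
    return out


samples = [100, 102, 103, 103, 101, 98]
print(delta_encode(samples))  # [100, 2, 1, 0, -2, -3]
```

Delta coding does not shrink anything by itself; the gain comes from the differences being small numbers that a later stage (for example a Huffman code) can store in few bits.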
Image Compression
• with or without loss, mostly with
– who cares about what the eye can’t see
• Black and white images can be regarded as functions from the plane (R^2) into the reals (R), as in old TVs
– positions vary continuously, but our eyes can’t see the discreteness at around 100 pixels per inch.
• Color images can be regarded as functions from the plane into R^3, the RGB space.
– Colors vary continuously, but our eyes sample colors with only 3 different receptors (RGB)
• Mathematical theories yield close approximations
– there are spatial analogues to Fourier transforms
Image Compression
• faces can be done with eigenfaces
– images can be regarded as points in R^(big)
– choose good bases and use most important vectors
– i.e. approximate with fewer dimensions
– JPEG, MPEG, GIF are compressed image and video formats
Video Compression
• Uses DCT (discrete cosine transform)
– Note: Nice functions can be approximated by
• sum of x, x^2,… with appropriate coefficients
• sum of sin(x), sin(2x),… with right coefficients
• almost any infinite sum of functions
– DCT is good because few terms give good results on images.
– Differential methods used:
• only report changes in video
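The claim that few DCT terms give good results can be tried on a toy example. The sketch below uses an unnormalised one-dimensional DCT-II and its inverse on a single row of eight made-up pixel values; real codecs work on two-dimensional blocks and quantise the coefficients rather than simply dropping them.

```python
import math


def dct(x):
    """Unnormalised DCT-II of a list of samples."""
    n = len(x)
    return [
        sum(x[t] * math.cos(math.pi / n * (t + 0.5) * k) for t in range(n))
        for k in range(n)
    ]


def idct(coeffs):
    """Inverse of dct() above (a scaled DCT-III)."""
    n = len(coeffs)
    return [
        coeffs[0] / n
        + (2 / n) * sum(
            coeffs[k] * math.cos(math.pi / n * (t + 0.5) * k) for k in range(1, n)
        )
        for t in range(n)
    ]


def keep_first(coeffs, keep):
    """Zero all but the first `keep` (lowest-frequency) coefficients."""
    return coeffs[:keep] + [0.0] * (len(coeffs) - keep)


row = [0, 1, 2, 3, 4, 5, 6, 7]  # a smooth brightness ramp (made-up data)
approx = idct(keep_first(dct(row), 3))
print([round(v, 2) for v in approx])
print(max(abs(a - b) for a, b in zip(approx, row)))
```

With only 3 of the 8 coefficients kept, every reconstructed value in this smooth row stays within half a brightness level of the original.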
Summary
• Issues:
– Context: what problem are you solving and what is an acceptable solution.
– evaluation: compression ratios
– fidelity, if lossy
• approximation, quantization, transforms, differential
– adaptive, if on-the-fly, e.g. movies, tv
– Different sources yield different best approaches
• cartoons versus cities versus outdoors
– code book separate or not
– fixed or variable length codes