CS336: Intelligent Information Retrieval, Lecture 6: Compression
Posted on 20-Dec-2015
Compression
• Change the representation of a file so that it
  – takes less space to store
  – takes less time to transmit
• Text compression: the original file can be reconstructed exactly (lossless)
• Data compression: can tolerate small changes or noise (lossy)
  – typically sound or images
    • already a digital approximation of an analog waveform
Text Compression: Two classes
• Symbol-wise methods
  – estimate the probability of occurrence of symbols
  – more accurate estimates => greater compression
• Dictionary methods
  – represent several symbols as one codeword
  – compression performance based on the length of codewords
  – replace words and other fragments of text with an index to an entry in a "dictionary"
Compression: Based on text model
• Determine the output code based on the probability distribution of the model
  – the number of bits used to encode symbol s should equal its information content: I(s) = −log2 Pr(s)
  – recall information theory and entropy (H)
    • average amount of information per symbol over an alphabet
    • H = Σs Pr(s) · I(s)
  – the more probable a symbol is, the less information it provides (easier to predict)
  – the less probable, the more information it provides
[Diagram: text → encoder → compressed text → decoder → text; an identical model drives both the encoder and the decoder]
Compression: Based on text model
• Symbol-wise methods (statistical)
  – can assume independence between symbols
  – considering context achieves the best compression rates
    • e.g. the probability of seeing 'u' next is higher if we've just seen 'q'
  – must yield a probability distribution such that the probability of the symbol actually occurring is very high
• Dictionary methods
  – string representations are typically based on the text seen so far
    • most characters are coded as part of a string that has already occurred
    • space is saved if the reference or pointer takes fewer bits than the string it replaces
Statistical Models
• Static: same model regardless of the text processed
  – but won't be good for all inputs
• Recall entropy: measures information content over a collection of symbols
  – high Pr(s) => low entropy; low Pr(s) => high entropy
  – Pr(s) = 1 => only s is possible, no encoding needed
  – Pr(s) = 0 => s won't occur, so s cannot be encoded
  – what is the implication for models in which some Pr(s) = 0, but s does in fact occur?
H = Σi pi log2(1/pi)  (= −Σi pi log2 pi)
s cannot be encoded! Therefore, in practice, all symbols must be assigned Pr(s) > 0
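As a quick numerical check on these formulas, a short sketch that computes H for a skewed distribution (the probabilities are the ones used in the Huffman example later in this lecture); this is illustrative code, not from the lecture:

```python
import math

def entropy(probs):
    """H = sum over s of Pr(s) * I(s), with I(s) = -log2 Pr(s).

    Symbols with Pr(s) = 0 are skipped: they never occur, contribute
    nothing to the average, and could not be encoded anyway.
    """
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Skewed 7-symbol alphabet (same probabilities as the Huffman example):
print(round(entropy([0.05, 0.05, 0.1, 0.2, 0.3, 0.2, 0.1]), 3))  # -> 2.546
```

Entropy is a lower bound on the average code length, so the Huffman code for this alphabet (average 2.6 bits/symbol) comes close to the 2.546-bit optimum.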
Semi-static Statistical Models
• Generate the model on the fly for the file being processed
  – first pass to estimate probabilities
  – probabilities are transmitted to the decoder before the encoded transmission
  – disadvantage is having to transmit the model first
Adaptive Statistical Models
• Model constructed from the text just encoded
  – begin with a general model, modify it gradually as more text is seen
    • the best models take context into account
  – the decoder can generate the same model at each time step since it has seen all characters up to that point
  – advantages: effective without fine-tuning to a particular text, and only 1 pass needed
Adaptive Statistical Models
• What about the Pr(s) = 0 problem?
  – allow 1 extra count, divided evenly amongst the unseen symbols
    • recall probabilities are estimated via occurrence counts
    • assume 72,300 total occurrences and 5 unseen characters; each unseen character u gets
      Pr(u) = Pr(char is one of the unseen) × Pr(each unseen char) = 1/5 × 1/72,301
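The arithmetic above, sketched out (72,300 and 5 are the slide's example values):

```python
total_count = 72_300   # total symbol occurrences seen so far
unseen = 5             # characters not yet seen

# One extra count is shared evenly among the unseen characters: the
# denominator grows to 72,301 and each unseen character gets a 1/5
# share of that extra count's probability mass.
pr_unseen_each = (1 / unseen) * (1 / (total_count + 1))
print(pr_unseen_each)
```

Each unseen character ends up with a tiny but nonzero probability (about 2.8 × 10⁻⁶ here), which is all the coder needs to be able to encode it if it ever appears.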
• Not good for full-text retrieval
  – must be decoded from the beginning
    • therefore not good for random access to files
Symbol-wise Methods
• Morse code
  – common symbols are assigned short codes (few bits)
  – rare symbols have longer codes
• Huffman coding (prefix-free code)
  – no codeword is a prefix of another symbol's codeword

  Symbol  Codeword  Probability
  a       0000      0.05
  b       0001      0.05
  c       001       0.1
  d       01        0.2
  e       10        0.3
  f       110       0.2
  g       111       0.1
Huffman coding (using the code table above)
• What is the string for 1010110111111110?
  – the decoder identifies 10 as the first codeword => e
  – decoding proceeds left to right on the remainder of the string
  – answer: eefggf
• Good when the probability distribution is static
• Best when the model is word-based
• Typically used for compressing text in IR
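The left-to-right decoding walk can be sketched directly from the code table; a minimal illustrative implementation (not code from the lecture):

```python
code = {"a": "0000", "b": "0001", "c": "001", "d": "01",
        "e": "10", "f": "110", "g": "111"}

def decode(bits, code):
    """Scan left to right, emitting a symbol whenever the buffered bits
    match a codeword. This works because the code is prefix-free: no
    codeword can be mistaken for the start of another."""
    inverse = {cw: sym for sym, cw in code.items()}
    out, buf = [], ""
    for b in bits:
        buf += b
        if buf in inverse:
            out.append(inverse[buf])
            buf = ""
    return "".join(out)

print(decode("1010110111111110", code))  # -> eefggf
```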
Create the decoding tree bottom-up:
1. Each symbol and its probability is a leaf.
2. The 2 nodes with the smallest p are joined under the same parent (p = p(s1) + p(s2)).
3. Repeat, ignoring children, until all nodes are connected.
4. Label the branches:
   • each left branch gets 0
   • each right branch gets 1
[Huffman tree diagram: leaves a (0.05), b (0.05), c (0.1), d (0.2), e (0.3), f (0.2), g (0.1); internal node probabilities 0.1, 0.2, 0.3, 0.4, 0.6, 1.0; left branches labeled 0, right branches labeled 1]
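The bottom-up construction sketches naturally with a priority queue. Ties between equal probabilities can be broken in any order, so the codewords produced may differ from the table above while achieving the same optimal average code length (2.6 bits/symbol for this alphabet); this is an illustrative implementation:

```python
import heapq
import itertools

def huffman_codes(freqs):
    """Bottom-up Huffman construction: repeatedly join the two nodes of
    smallest probability under one parent, then read the codes off the
    tree, labeling left branches 0 and right branches 1."""
    counter = itertools.count()   # tie-breaker so heap tuples always compare
    heap = [(p, next(counter), sym) for sym, p in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p1, _, left = heapq.heappop(heap)
        p2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (p1 + p2, next(counter), (left, right)))
    codes = {}

    def walk(node, prefix):
        if isinstance(node, tuple):        # internal node: (left, right)
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:                              # leaf: a symbol
            codes[node] = prefix or "0"
    walk(heap[0][2], "")
    return codes

freqs = {"a": 0.05, "b": 0.05, "c": 0.1, "d": 0.2,
         "e": 0.3, "f": 0.2, "g": 0.1}
codes = huffman_codes(freqs)
print(codes)
```

The `counter` tuple element only exists to break ties in the heap; without it, Python would try to compare a symbol with a tuple node and raise a `TypeError`.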
Huffman coding
• Requires 2 passes over the document:
  1. gather statistics and build the coding table
  2. encode the document
• Must explicitly store the coding table with the document
  – eats into your space savings on short documents
• Exploits only non-uniformity in the symbol distribution
  – adaptive algorithms can recognize higher-order redundancy in strings, e.g. 0101010101...
Dictionary Methods
• Braille
  – special codes are used to represent whole words
• Dictionary: a list of sub-strings with corresponding codewords
  – we do this naturally, e.g. replace "december" with "12"
  – a codebook for ASCII might contain
    • 128 characters and 128 common letter pairs
    • at best, each character is encoded in 4 bits rather than 7
  – can be static, semi-static, or adaptive (best)
Dictionary Methods: Ziv-Lempel
• Built on the fly
• Replace strings with a reference to a previous occurrence
• Codebook
  – all words seen prior to the current position
  – codewords represented by "pointers"
• Compression: the pointer is stored in fewer bits than the string it replaces
• Especially useful when encoding resources must be minimized (e.g. 1 machine distributes data to many)
  – relatively easy to implement
  – decoding is fast
  – requires only a small amount of memory
Ziv-Lempel
• Coded output: a series of triples <x, y, z>
  – x: how far back to look in the previously decoded text to find the next string
  – y: how long the string is
  – z: the next character from the input
    • only necessary if the character to be coded did not occur previously
• Example (pointers into the decoded text):
  <0,0,a> <0,0,b> <2,1,a> <3,2,b> <5,3,b> <1,10,a>
  – the triples decode, in turn, to: a, b, aa, bab, aabb, then 10 b's followed by a
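Assuming the standard LZ77 convention that x counts back from the current end of the decoded text and that copying proceeds one character at a time (so a match may overlap its own output, as in the final <1,10,a> triple), the decoder can be sketched as follows; this is illustrative code, not the lecture's:

```python
def lz_decode(triples):
    """Decode LZ77-style triples <back, length, next_char>.

    `back` is how far back in the already-decoded text the match starts,
    `length` is how many characters to copy (character by character, so
    the match may overlap the text it is still producing), and
    `next_char` is the literal that follows the match.
    """
    out = []
    for back, length, ch in triples:
        start = len(out) - back
        for i in range(length):
            out.append(out[start + i])   # may read chars appended this triple
        out.append(ch)
    return "".join(out)

triples = [(0, 0, "a"), (0, 0, "b"), (2, 1, "a"),
           (3, 2, "b"), (5, 3, "b"), (1, 10, "a")]
print(lz_decode(triples))
```

The last triple shows why the overlapping copy matters: looking back only 1 character with length 10 turns a single trailing b into a run of 10 b's.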
Accessing Compressed Text
• Store information identifying a document's location relative to the beginning of the file
  – store a bit offset
    • variable-length codes mean documents may not begin/end on byte boundaries
  – or insist that documents begin on byte boundaries
    • some documents will have wasted bits at the end
    • need a method to explicitly identify the end of the coded text
Text Compression
• Key difference between symbol-wise & dictionary methods
  – symbol-wise methods base the coding of a symbol on its context
  – dictionary methods group symbols together, creating an implicit context
• Present techniques give compression of ~2 bits/character for general English text
• In order to do better, techniques must leverage
  – semantic content
  – external world knowledge
• Rule of thumb: the greater the compression,
  – the slower the program runs, or
  – the more memory it uses
• Inverted lists contain skewed data, so text compression methods are not appropriate
Inverted File Compression
• Note: each inverted list can be stored as an ascending sequence of integers
  – <8; 2, 4, 9, 45, 57, 76, 78, 80>
• Replace each entry after the first with its d-gap (difference from the previous entry)
  – <8; 2, 2, 5, 36, 12, 19, 2, 2>
  – on average, gaps are much smaller than the largest document number
• Compression models describe the probability distribution of d-gap sizes
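Converting between doc numbers and d-gaps is a one-liner each way; a sketch using the slide's list (illustrative code):

```python
def to_dgaps(postings):
    """Keep the first doc number; replace every later one with its gap
    from the previous entry."""
    return [postings[0]] + [b - a for a, b in zip(postings, postings[1:])]

def from_dgaps(gaps):
    """Prefix sums restore the original ascending doc numbers."""
    out, total = [], 0
    for g in gaps:
        total += g
        out.append(total)
    return out

postings = [2, 4, 9, 45, 57, 76, 78, 80]
print(to_dgaps(postings))  # -> [2, 2, 5, 36, 12, 19, 2, 2]
```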
Fixed-Length Compression
• Typically, an integer is stored in 4 bytes
• To compress the d-gaps for <5; 1, 3, 7, 70, 250>, i.e. <5; 1, 2, 4, 63, 180>
• Use fewer bytes:
  – the two leftmost bits store the number of bytes (1-4)
  – the d-gap is stored in the next 6, 14, 22, or 30 bits

  Range of x                       Bytes
  0 ≤ x < 64                       1
  64 ≤ x < 16,384                  2
  16,384 ≤ x < 4,194,304           3
  4,194,304 ≤ x < 1,073,741,824    4

  Value  Compressed
  1      00 000001
  2      00 000010
  4      00 000100
  63     00 111111
  180    01 000000 10110100
Example
• The inverted list is: 1, 3, 7, 70, 250
• After computing gaps: 1, 2, 4, 63, 180
• Number of bytes reduced from 5 × 4 = 20 to 6
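The byte-aligned scheme above can be sketched as follows (an assumed implementation of the rule, with the length table expressed as powers of two):

```python
def vb_encode(x):
    """Encode a d-gap in 1-4 bytes: the two leftmost bits of the first
    byte give the byte count minus one; x fills the remaining
    6 / 14 / 22 / 30 bits, big-endian."""
    for nbytes, limit in enumerate([1 << 6, 1 << 14, 1 << 22, 1 << 30],
                                   start=1):
        if x < limit:
            packed = ((nbytes - 1) << (8 * nbytes - 2)) | x
            return packed.to_bytes(nbytes, "big")
    raise ValueError("x does not fit in 30 bits")

gaps = [1, 2, 4, 63, 180]
encoded = [vb_encode(g) for g in gaps]
print([e.hex() for e in encoded])    # 180 -> '40b4' = 01 000000 10110100
print(sum(len(e) for e in encoded))  # -> 6 bytes instead of 5*4 = 20
```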
γ (gamma) Code
• This method is superior to a fixed-length binary encoding
• Represent the number x in two parts:
  1. unary code: specifies the number of bits needed to code x
     – 1 + ⌊log2 x⌋, followed by
  2. binary code: codes x in that many bits
     – the ⌊log2 x⌋ bits that represent x − 2^⌊log2 x⌋
• Let x = 9: ⌊log2 x⌋ = 3 (8 = 2^3 ≤ 9 < 2^4 = 16)
  – part 1: 1 + 3 = 4 => 1110 (unary)
  – part 2: 9 − 8 = 1 => 001 (in 3 bits)
  – code: 1110 001
• Unary code: (n−1) 1-bits followed by a 0; therefore 1 is represented by 0, 2 by 10, 3 by 110, etc.
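Putting the two parts together, a sketch of the γ encoder (illustrative, using the unary convention just defined):

```python
def unary(n):
    """(n-1) 1-bits followed by a 0: 1 -> '0', 2 -> '10', 3 -> '110'."""
    return "1" * (n - 1) + "0"

def gamma_encode(x):
    """Elias gamma code for x >= 1: unary(1 + floor(log2 x)) gives the
    bit length of x, then x - 2**floor(log2 x) follows in floor(log2 x)
    binary bits."""
    b = x.bit_length() - 1                           # floor(log2 x)
    tail = format(x - (1 << b), "b").zfill(b) if b else ""
    return unary(1 + b) + tail

print(gamma_encode(9))  # -> 1110001, i.e. 1110 then 001
```

Note that for x = 1 the binary part is empty, so the whole code is just the single unary bit 0.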