CS336: Intelligent Information Retrieval, Lecture 6: Compression
Posted on 20-Dec-2015
Compression
• Change the representation of a file so that it
  – takes less space to store
  – takes less time to transmit
• Text compression: the original file can be reconstructed exactly (lossless)
• Data compression: can tolerate small changes or noise (lossy)
  – typically sound or images
    • already a digital approximation of an analog waveform
Text Compression: Two classes
• Symbol-wise methods
  – estimate the probability of occurrence of symbols
  – more accurate estimates => greater compression
• Dictionary methods
  – represent several symbols as one codeword
  – compression performance based on the length of codewords
  – replace words and other fragments of text with an index to an entry in a "dictionary"
Compression: Based on text model
• Determine the output code based on the probability distribution of the model
  – the number of bits used to encode symbol s should equal its information content: I(s) = −log2 Pr(s)
  – recall information theory and entropy (H)
    • average amount of information per symbol over an alphabet
    • H = Σs Pr(s) · I(s)
  – the more probable a symbol is, the less information it provides (easier to predict)
  – the less probable, the more information it provides
[Diagram: text → encoder → compressed text → decoder → text; an identical model drives both the encoder and the decoder]
Compression: Based on text model
• Symbol-wise methods (statistical)
  – can assume independence between symbols
  – considering context achieves the best compression rates
    • e.g. the probability of seeing 'u' next is higher if we've just seen 'q'
  – must yield a probability distribution such that the probability of the symbol actually occurring is very high
• Dictionary methods
  – string representations are typically based on the text seen so far
    • most characters are coded as part of a string that has already occurred
    • space is saved if the reference or pointer takes fewer bits than the string it replaces
Statistical Models
• Static: same model regardless of the text processed
  – but won't be good for all inputs
• Recall entropy: measures information content over a collection of symbols
  – high Pr(s) => low entropy; low Pr(s) => high entropy
  – Pr(s) = 1 => only s is possible, no encoding needed
  – Pr(s) = 0 => s won't occur, so s cannot be encoded
  – what is the implication for models in which some Pr(s) = 0, but s does in fact occur?
H = Σi pi log2(1/pi)  (= −Σi pi log2 pi)
s cannot be encoded! Therefore, in practice, all symbols must be assigned Pr(s) > 0
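As a quick numerical check on these formulas, a short sketch that computes H for a skewed distribution (the probabilities are the ones used in the Huffman example later in this lecture); this is illustrative code, not from the lecture:

```python
import math

def entropy(probs):
    """H = sum over s of Pr(s) * I(s), with I(s) = -log2 Pr(s).

    Symbols with Pr(s) = 0 are skipped: they never occur, contribute
    nothing to the average, and could not be encoded anyway.
    """
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Skewed 7-symbol alphabet (same probabilities as the Huffman example):
print(round(entropy([0.05, 0.05, 0.1, 0.2, 0.3, 0.2, 0.1]), 3))  # -> 2.546
```

Entropy is a lower bound on the average code length, so the Huffman code for this alphabet (average 2.6 bits/symbol) comes close to the 2.546-bit optimum.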
Semi-static Statistical Models
• Generate the model on the fly for the file being processed
  – first pass to estimate probabilities
  – probabilities are transmitted to the decoder before the encoded transmission
  – disadvantage is having to transmit the model first
Adaptive Statistical Models
• Model constructed from the text just encoded
  – begin with a general model, modify it gradually as more text is seen
    • the best models take context into account
  – the decoder can generate the same model at each time step since it has seen all characters up to that point
  – advantages: effective without fine-tuning to a particular text, and only 1 pass needed
Adaptive Statistical Models
• What about the Pr(s) = 0 problem?
  – allow 1 extra count, divided evenly amongst the unseen symbols
    • recall probabilities are estimated via occurrence counts
    • assume 72,300 total occurrences and 5 unseen characters; each unseen character u gets
      Pr(u) = Pr(char is one of the unseen) × Pr(each unseen char) = 1/5 × 1/72,301
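The arithmetic above, sketched out (72,300 and 5 are the slide's example values):

```python
total_count = 72_300   # total symbol occurrences seen so far
unseen = 5             # characters not yet seen

# One extra count is shared evenly among the unseen characters: the
# denominator grows to 72,301 and each unseen character gets a 1/5
# share of that extra count's probability mass.
pr_unseen_each = (1 / unseen) * (1 / (total_count + 1))
print(pr_unseen_each)
```

Each unseen character ends up with a tiny but nonzero probability (about 2.8 × 10⁻⁶ here), which is all the coder needs to be able to encode it if it ever appears.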
• Not good for full-text retrieval
  – must be decoded from the beginning
    • therefore not good for random access to files
Symbol-wise Methods
• Morse code
  – common symbols are assigned short codes (few bits)
  – rare symbols have longer codes
• Huffman coding (prefix-free code)
  – no codeword is a prefix of another symbol's codeword

  Symbol  Codeword  Probability
  a       0000      0.05
  b       0001      0.05
  c       001       0.1
  d       01        0.2
  e       10        0.3
  f       110       0.2
  g       111       0.1
Huffman coding (using the code table above)
• What is the string for 1010110111111110?
  – the decoder identifies 10 as the first codeword => e
  – decoding proceeds left to right on the remainder of the string
  – answer: eefggf
• Good when the probability distribution is static
• Best when the model is word-based
• Typically used for compressing text in IR
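The left-to-right decoding walk can be sketched directly from the code table; a minimal illustrative implementation (not code from the lecture):

```python
code = {"a": "0000", "b": "0001", "c": "001", "d": "01",
        "e": "10", "f": "110", "g": "111"}

def decode(bits, code):
    """Scan left to right, emitting a symbol whenever the buffered bits
    match a codeword. This works because the code is prefix-free: no
    codeword can be mistaken for the start of another."""
    inverse = {cw: sym for sym, cw in code.items()}
    out, buf = [], ""
    for b in bits:
        buf += b
        if buf in inverse:
            out.append(inverse[buf])
            buf = ""
    return "".join(out)

print(decode("1010110111111110", code))  # -> eefggf
```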
Create the decoding tree bottom-up:
1. Each symbol and its probability is a leaf.
2. The 2 nodes with the smallest p are joined under the same parent (p = p(s1) + p(s2)).
3. Repeat, ignoring children, until all nodes are connected.
4. Label the branches:
   • each left branch gets 0
   • each right branch gets 1
[Huffman tree diagram: leaves a (0.05), b (0.05), c (0.1), d (0.2), e (0.3), f (0.2), g (0.1); internal node probabilities 0.1, 0.2, 0.3, 0.4, 0.6, 1.0; left branches labeled 0, right branches labeled 1]
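The bottom-up construction sketches naturally with a priority queue. Ties between equal probabilities can be broken in any order, so the codewords produced may differ from the table above while achieving the same optimal average code length (2.6 bits/symbol for this alphabet); this is an illustrative implementation:

```python
import heapq
import itertools

def huffman_codes(freqs):
    """Bottom-up Huffman construction: repeatedly join the two nodes of
    smallest probability under one parent, then read the codes off the
    tree, labeling left branches 0 and right branches 1."""
    counter = itertools.count()   # tie-breaker so heap tuples always compare
    heap = [(p, next(counter), sym) for sym, p in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p1, _, left = heapq.heappop(heap)
        p2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (p1 + p2, next(counter), (left, right)))
    codes = {}

    def walk(node, prefix):
        if isinstance(node, tuple):        # internal node: (left, right)
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:                              # leaf: a symbol
            codes[node] = prefix or "0"
    walk(heap[0][2], "")
    return codes

freqs = {"a": 0.05, "b": 0.05, "c": 0.1, "d": 0.2,
         "e": 0.3, "f": 0.2, "g": 0.1}
codes = huffman_codes(freqs)
print(codes)
```

The `counter` tuple element only exists to break ties in the heap; without it, Python would try to compare a symbol with a tuple node and raise a `TypeError`.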
Huffman coding
• Requires 2 passes over the document:
  1. gather statistics and build the coding table
  2. encode the document
• Must explicitly store the coding table with the document
  – eats into your space savings on short documents
• Exploits only non-uniformity in the symbol distribution
  – adaptive algorithms can recognize higher-order redundancy in strings, e.g. 0101010101...
Dictionary Methods
• Braille
  – special codes are used to represent whole words
• Dictionary: a list of sub-strings with corresponding codewords
  – we do this naturally, e.g. replace "december" with "12"
  – a codebook for ASCII might contain
    • 128 characters and 128 common letter pairs
    • at best, each character is encoded in 4 bits rather than 7
  – can be static, semi-static, or adaptive (best)
Dictionary Methods: Ziv-Lempel
• Built on the fly
• Replace strings with a reference to a previous occurrence
• Codebook
  – all words seen prior to the current position
  – codewords represented by "pointers"
• Compression: the pointer is stored in fewer bits than the string it replaces
• Especially useful when encoding resources must be minimized (e.g. 1 machine distributes data to many)
  – relatively easy to implement
  – decoding is fast
  – requires only a small amount of memory
Ziv-Lempel
• Coded output: a series of triples <x, y, z>
  – x: how far back to look in the previously decoded text to find the next string
  – y: how long the string is
  – z: the next character from the input
    • only necessary if the character to be coded did not occur previously
• Example (pointers into the decoded text):
  <0,0,a> <0,0,b> <2,1,a> <3,2,b> <5,3,b> <1,10,a>
  – the triples decode, in turn, to: a, b, aa, bab, aabb, then 10 b's followed by a
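Assuming the standard LZ77 convention that x counts back from the current end of the decoded text and that copying proceeds one character at a time (so a match may overlap its own output, as in the final <1,10,a> triple), the decoder can be sketched as follows; this is illustrative code, not the lecture's:

```python
def lz_decode(triples):
    """Decode LZ77-style triples <back, length, next_char>.

    `back` is how far back in the already-decoded text the match starts,
    `length` is how many characters to copy (character by character, so
    the match may overlap the text it is still producing), and
    `next_char` is the literal that follows the match.
    """
    out = []
    for back, length, ch in triples:
        start = len(out) - back
        for i in range(length):
            out.append(out[start + i])   # may read chars appended this triple
        out.append(ch)
    return "".join(out)

triples = [(0, 0, "a"), (0, 0, "b"), (2, 1, "a"),
           (3, 2, "b"), (5, 3, "b"), (1, 10, "a")]
print(lz_decode(triples))
```

The last triple shows why the overlapping copy matters: looking back only 1 character with length 10 turns a single trailing b into a run of 10 b's.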
Accessing Compressed Text
• Store information identifying a document's location relative to the beginning of the file
  – store a bit offset
    • variable-length codes mean documents may not begin/end on byte boundaries
  – or insist that documents begin on byte boundaries
    • some documents will have wasted bits at the end
    • need a method to explicitly identify the end of the coded text
Text Compression
• Key difference between symbol-wise & dictionary methods
  – symbol-wise methods base the coding of a symbol on its context
  – dictionary methods group symbols together, creating an implicit context
• Present techniques give compression of ~2 bits/character for general English text
• In order to do better, techniques must leverage
  – semantic content
  – external world knowledge
• Rule of thumb: the greater the compression,
  – the slower the program runs, or
  – the more memory it uses
• Inverted lists contain skewed data, so text compression methods are not appropriate
Inverted File Compression
• Note: each inverted list can be stored as an ascending sequence of integers
  – <8; 2, 4, 9, 45, 57, 76, 78, 80>
• Replace each entry after the first with its d-gap (difference from the previous entry)
  – <8; 2, 2, 5, 36, 12, 19, 2, 2>
  – on average, gaps are much smaller than the largest document number
• Compression models describe the probability distribution of d-gap sizes
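Converting between doc numbers and d-gaps is a one-liner each way; a sketch using the slide's list (illustrative code):

```python
def to_dgaps(postings):
    """Keep the first doc number; replace every later one with its gap
    from the previous entry."""
    return [postings[0]] + [b - a for a, b in zip(postings, postings[1:])]

def from_dgaps(gaps):
    """Prefix sums restore the original ascending doc numbers."""
    out, total = [], 0
    for g in gaps:
        total += g
        out.append(total)
    return out

postings = [2, 4, 9, 45, 57, 76, 78, 80]
print(to_dgaps(postings))  # -> [2, 2, 5, 36, 12, 19, 2, 2]
```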
Fixed-Length Compression
• Typically, an integer is stored in 4 bytes
• To compress the d-gaps for <5; 1, 3, 7, 70, 250>, i.e. <5; 1, 2, 4, 63, 180>
• Use fewer bytes:
  – the two leftmost bits store the number of bytes (1-4)
  – the d-gap is stored in the next 6, 14, 22, or 30 bits

  Range of x                       Bytes
  0 ≤ x < 64                       1
  64 ≤ x < 16,384                  2
  16,384 ≤ x < 4,194,304           3
  4,194,304 ≤ x < 1,073,741,824    4

  Value  Compressed
  1      00 000001
  2      00 000010
  4      00 000100
  63     00 111111
  180    01 000000 10110100
Example
• The inverted list is: 1, 3, 7, 70, 250
• After computing gaps: 1, 2, 4, 63, 180
• Number of bytes reduced from 5 × 4 = 20 to 6
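The byte-aligned scheme above can be sketched as follows (an assumed implementation of the rule, with the length table expressed as powers of two):

```python
def vb_encode(x):
    """Encode a d-gap in 1-4 bytes: the two leftmost bits of the first
    byte give the byte count minus one; x fills the remaining
    6 / 14 / 22 / 30 bits, big-endian."""
    for nbytes, limit in enumerate([1 << 6, 1 << 14, 1 << 22, 1 << 30],
                                   start=1):
        if x < limit:
            packed = ((nbytes - 1) << (8 * nbytes - 2)) | x
            return packed.to_bytes(nbytes, "big")
    raise ValueError("x does not fit in 30 bits")

gaps = [1, 2, 4, 63, 180]
encoded = [vb_encode(g) for g in gaps]
print([e.hex() for e in encoded])    # 180 -> '40b4' = 01 000000 10110100
print(sum(len(e) for e in encoded))  # -> 6 bytes instead of 5*4 = 20
```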
γ (gamma) Code
• This method is superior to a fixed-length binary encoding
• Represent the number x in two parts:
  1. unary code: specifies the number of bits needed to code x
     – 1 + ⌊log2 x⌋, followed by
  2. binary code: codes x in that many bits
     – the ⌊log2 x⌋ bits that represent x − 2^⌊log2 x⌋
• Let x = 9: ⌊log2 x⌋ = 3 (8 = 2^3 ≤ 9 < 2^4 = 16)
  – part 1: 1 + 3 = 4 => 1110 (unary)
  – part 2: 9 − 8 = 1 => 001 (in 3 bits)
  – code: 1110 001
• Unary code: (n−1) 1-bits followed by a 0; therefore 1 is represented by 0, 2 by 10, 3 by 110, etc.
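Putting the two parts together, a sketch of the γ encoder (illustrative, using the unary convention just defined):

```python
def unary(n):
    """(n-1) 1-bits followed by a 0: 1 -> '0', 2 -> '10', 3 -> '110'."""
    return "1" * (n - 1) + "0"

def gamma_encode(x):
    """Elias gamma code for x >= 1: unary(1 + floor(log2 x)) gives the
    bit length of x, then x - 2**floor(log2 x) follows in floor(log2 x)
    binary bits."""
    b = x.bit_length() - 1                           # floor(log2 x)
    tail = format(x - (1 << b), "b").zfill(b) if b else ""
    return unary(1 + b) + tail

print(gamma_encode(9))  # -> 1110001, i.e. 1110 then 001
```

Note that for x = 1 the binary part is empty, so the whole code is just the single unary bit 0.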