Zone indexes
Paolo Ferragina, Dipartimento di Informatica
Università di Pisa
Reading 6.1
Parametric and zone indexes
Thus far, a doc has been a term sequence
But documents have multiple parts: Author, Title, Date of publication, Language, Format, etc.
These are the metadata about a document
Sec. 6.1
Zone
A zone is a region of the doc that can contain an arbitrary amount of text, e.g., Title, Abstract, References, …
Build inverted indexes on fields AND zones to permit querying
E.g., “find docs with merchant in the title zone and matching the query gentle rain”
Example zone indexes
Encode zones in dictionary vs. postings.
Tiered indexes
Break postings up into a hierarchy of lists: most important, …, least important
Inverted index thus broken up into tiers of decreasing importance
At query time, use the top tier unless it fails to yield K docs
If so, drop to lower tiers
Sec. 7.2.1
Example tiered index
Index construction: Compression of postings
Reading 5.3 and a paper
γ code for integer encoding
x > 0 and Length = ⌊log2 x⌋ + 1
γ(x) = Length − 1 zeros, followed by x in binary
e.g., 9 is represented as <000,1001>.
γ code for x takes 2⌊log2 x⌋ + 1 bits (i.e., a factor of 2 from optimal)
Optimal for Pr(x) = 1/(2x²), and i.i.d. integers
It is a prefix-free encoding…
Given the following sequence of coded integers, reconstruct the original sequence:
0001000001100110000011101100111
8 6 3 59 7
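As a sanity check, the γ encoder/decoder can be sketched in a few lines (an illustrative sketch, function names are mine); it reproduces the decoding exercise above:

```python
def gamma_encode(x: int) -> str:
    """Gamma code: (Length - 1) zeros, then x in binary (Length = floor(log2 x) + 1)."""
    assert x > 0
    b = bin(x)[2:]                     # binary representation, starts with '1'
    return "0" * (len(b) - 1) + b

def gamma_decode(bits: str) -> list[int]:
    """Decode a concatenation of gamma codes (prefix-free, so no separators needed)."""
    out, i = [], 0
    while i < len(bits):
        zeros = 0
        while bits[i] == "0":          # unary part: Length - 1 zeros
            zeros += 1
            i += 1
        out.append(int(bits[i:i + zeros + 1], 2))
        i += zeros + 1
    return out

print(gamma_encode(9))                                     # -> 0001001
print(gamma_decode("0001000001100110000011101100111"))     # -> [8, 6, 3, 59, 7]
```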
δ code for integer encoding
Use γ-coding to reduce the length of the first field: δ(x) = <γ(Length), x in binary>
Useful for medium-sized integers
e.g., 19 is represented as <00,101,10011>.
δ coding of x takes about ⌊log2 x⌋ + 2⌊log2 log2 x⌋ + 2 bits.
Optimal for Pr(x) = 1/(2x (log2 x)²), and i.i.d. integers
Variable-byte codes [10.2 bits per TREC12]
We wish very fast (de)compression ⇒ byte-aligned codes
Given the binary representation of an integer:
Pad with 0s at the front, to get a multiple-of-7 number of bits
Form groups of 7 bits each
Tag the last group with the bit 0, and the other groups with the bit 1
e.g., v = 2^14 + 1: binary(v) = 100000000000001 ⇒ 10000001 10000000 00000001
Note: we waste 1 bit per byte, and on average 4 bits in the first byte.
But it is a prefix code, and it also encodes the value 0!
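A small sketch of the grouping-and-tagging scheme above (my function names; the tag is the high bit of each byte, 1 on all but the last group, matching the example):

```python
def vbyte_encode(x: int) -> bytes:
    """Split x into 7-bit groups, most significant first; tag all but the last with 1."""
    groups = []
    while True:
        groups.append(x & 0x7F)
        x >>= 7
        if x == 0:
            break
    groups.reverse()
    return bytes(0x80 | g for g in groups[:-1]) + bytes([groups[-1]])

def vbyte_decode(data: bytes) -> list[int]:
    """Accumulate 7 bits per byte; a 0 tag bit ends the current integer."""
    out, x = [], 0
    for byte in data:
        x = (x << 7) | (byte & 0x7F)
        if byte & 0x80 == 0:
            out.append(x)
            x = 0
    return out

print(" ".join(f"{b:08b}" for b in vbyte_encode(2**14 + 1)))
# -> 10000001 10000000 00000001
```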
PForDelta coding
e.g., a block of 128 numbers 2 3 3 … 1 1 3 3 23 13 42 2, encoded with b = 2 bits each (10 11 11 … 01 01 11 11 01 …) = 256 bits = 32 bytes
Use b (e.g., 2) bits to encode each of the 128 numbers, or create exceptions
Encode the exceptions separately: ESC symbols or pointers
Choose b to encode 90% of the values; trade-off: a larger b wastes more bits, a smaller b creates more exceptions
Translate the data: [base, base + 2^b − 1] → [0, 2^b − 1]
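A deliberately simplified sketch of the idea above: values that fit in b bits stay in the main stream, larger ones become exceptions (kept here as position/value pairs, skipping the real bit packing):

```python
def pfor_encode(nums, b):
    """Values < 2**b go in the b-bit stream; larger ones become exceptions."""
    limit = 1 << b
    packed, exceptions = [], []
    for i, x in enumerate(nums):
        if x < limit:
            packed.append(x)
        else:
            packed.append(0)           # placeholder slot in the b-bit stream
            exceptions.append((i, x))
    return packed, exceptions

def pfor_decode(packed, exceptions):
    out = list(packed)
    for i, x in exceptions:            # patch the exception positions
        out[i] = x
    return out

block = [2, 3, 3, 1, 1, 3, 42, 2, 23, 1]
packed, exc = pfor_encode(block, b=2)
print(exc)                             # -> [(6, 42), (8, 23)]
assert pfor_decode(packed, exc) == block
```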
Index construction: Compression of documents
Reading Managing-Gigabytes: pg 21-36, 52-56, 74-79
Uniquely Decodable Codes
A variable length code assigns a bit string (codeword) of variable length to every symbol
e.g. a = 1, b = 01, c = 101, d = 011
What if you get the sequence 1011? It is ambiguous: it could be c a (101·1) or a d (1·011).
A uniquely decodable code can always be uniquely decomposed into its codewords.
Prefix Codes
A prefix code is a variable length code in which no codeword is a prefix of another one
e.g., a = 0, b = 100, c = 101, d = 11
The code can be viewed as a binary trie with the symbols at the leaves (a under 0; b and c under 10; d under 11).
Average Length
For a code C with codeword lengths L[s], the average length is defined as
L_a(C) = Σ_{s∈S} p(s) · L[s]
e.g., p(A) = .7 [codeword 0], p(B) = p(C) = p(D) = .1 [codewords 1--]
L_a = .7 · 1 + .3 · 3 = 1.6 bits (Huffman achieves 1.5 bits)
We say that a prefix code C is optimal if for all prefix codes C’, L_a(C) ≤ L_a(C’)
Entropy (Shannon, 1948)
For a source S emitting symbols with probability p(s), the self-information of s is:
i(s) = log2 (1/p(s)) bits
Lower probability ⇒ higher information
Entropy is the weighted average of i(s):
H(S) = Σ_{s∈S} p(s) · log2 (1/p(s))
0-th order empirical entropy of a string T:
H_0(T) = Σ_s (occ(s)/|T|) · log2 (|T|/occ(s))
Performance: Compression ratio
Compression ratio = #bits in output / #bits in input
Compression performance: we relate entropy to the compression ratio.
e.g., p(A) = .7, p(B) = p(C) = p(D) = .1
H ≈ 1.36 bits; Huffman ≈ 1.5 bits per symbol
Shannon: H(S) vs. the average codeword length Σ_{s∈S} p(s) · |C(s)| achieved in practice
Empirical H vs. compression ratio: |T| · H_0(T) vs. |C(T)|
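The empirical-entropy formula can be checked numerically; a small sketch for the running distribution p(A) = .7, p(B) = p(C) = p(D) = .1:

```python
from collections import Counter
from math import log2

def empirical_entropy(T: str) -> float:
    """H0(T) = sum over symbols of (occ/|T|) * log2(|T|/occ)."""
    n = len(T)
    return sum(occ / n * log2(n / occ) for occ in Counter(T).values())

T = "A" * 7 + "B" + "C" + "D"          # frequencies .7, .1, .1, .1
print(round(empirical_entropy(T), 2))  # -> 1.36
```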
Statistical Coding
How do we use probability p(s) to encode s?
Huffman codes
Arithmetic codes
Document Compression
Huffman coding
Huffman Codes
Invented by Huffman as a class assignment in the ’50s.
Used in most compression algorithms: gzip, bzip, jpeg (as an option), fax compression, …
Properties:
Generates optimal prefix codes
Cheap to encode and decode
L_a(Huff) = H if probabilities are powers of 2
Otherwise, L_a(Huff) < H + 1: less than 1 extra bit per symbol on average!
Running Example
p(a) = .1, p(b) = .2, p(c) = .2, p(d) = .5
Merge a(.1) and b(.2) into (.3); merge (.3) and c(.2) into (.5); merge (.5) and d(.5) into (1)
Resulting code: a = 000, b = 001, c = 01, d = 1
There are 2^{n−1} “equivalent” Huffman trees
What about ties (and thus, tree depth)?
Encoding and Decoding
Encoding: Emit the root-to-leaf path leading to the symbol to be encoded.
Decoding: Start at root and take branch for each bit received. When at leaf, output its symbol and return to root.
e.g., abc → 000 001 01 = 00000101
101001 → d c b
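The merge-the-two-least-probable-trees construction can be sketched with a heap (a sketch, not the lecture's code); since there are 2^{n−1} equivalent trees, tie-breaking decides which one appears, but the codeword lengths are fixed:

```python
import heapq

def huffman_codes(probs):
    """Repeatedly merge the two least-probable trees; prepend 0/1 to their codes."""
    heap = [(p, i, {s: ""}) for i, (s, p) in enumerate(sorted(probs.items()))]
    heapq.heapify(heap)
    counter = len(heap)                # unique tie-breaker, so dicts never compare
    while len(heap) > 1:
        p1, _, c1 = heapq.heappop(heap)
        p2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + w for s, w in c1.items()}
        merged.update({s: "1" + w for s, w in c2.items()})
        heapq.heappush(heap, (p1 + p2, counter, merged))
        counter += 1
    return heap[0][2]

codes = huffman_codes({"a": 0.1, "b": 0.2, "c": 0.2, "d": 0.5})
print(codes)                           # codeword lengths: d:1, c:2, a:3, b:3
```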
Problem with Huffman Coding
Take a two-symbol alphabet Σ = {a, b}. Whatever their probabilities, Huffman uses 1 bit for each symbol and thus takes n bits to encode a message of n symbols.
This is fine when the probabilities are almost equal, but what about p(a) = .999?
The optimal code length for a is −log2(.999) ≈ .00144 bits, so optimal coding should use about n · .00144 bits, which is much less than the n bits taken by Huffman.
Document Compression
Arithmetic coding
Introduction
It uses “fractional” parts of bits!
Gets < nH(T) + 2 bits vs. < nH(T) + n of Huffman
Used in JPEG/MPEG (as an option) and bzip
More time-costly than Huffman, but the integer implementation is not too bad.
Performance is nearly ideal: in practice, the overhead is about 0.02 · n bits.
Symbol interval
Assign each symbol an interval in the range from 0 (inclusive) to 1 (exclusive), e.g., with p(a) = .2, p(b) = .5, p(c) = .3:
cum[a] = 0, cum[b] = p(a) = .2, cum[c] = p(a) + p(b) = .7
The interval for a particular symbol is called the symbol interval (e.g., for b it is [.2, .7))
Sequence interval
Coding the message sequence: bac (with p(a) = .2, p(b) = .5, p(c) = .3)
Start from [0, 1):
b → [0.2, 0.7), size (1 − 0) · 0.5 = 0.5
a → [0.2, 0.3), size (0.7 − 0.2) · 0.2 = 0.1
c → [0.27, 0.3), size (0.3 − 0.2) · 0.3 = 0.03
The final sequence interval is [0.27, 0.3)
The algorithm
To code a sequence of symbols T_1 … T_n with probabilities p(T_i), use the following algorithm:
l_0 = 0, s_0 = 1
l_i = l_{i−1} + s_{i−1} · cum[T_i]
s_i = s_{i−1} · p(T_i)
e.g., with p(a) = .2, p(b) = .5, p(c) = .3, the last step of coding bac gives:
s_3 = 0.1 · 0.3 = 0.03
l_3 = 0.2 + 0.1 · (0.2 + 0.5) = 0.27
The algorithm
Each symbol narrows the interval by a factor p(T_i):
l_i = l_{i−1} + s_{i−1} · cum[T_i]
s_i = s_{i−1} · p(T_i)
The final interval size is s_n = ∏_{i=1..n} p(T_i)
Sequence interval: [l_n, l_n + s_n)
Take any number inside it.
Decoding Example
Decoding the number .49, knowing the message is of length 3:
The message is bbc.
With p(a) = .2, p(b) = .5, p(c) = .3: 0.49 falls in b’s interval [0.2, 0.7); subdividing it, 0.49 falls in b’s sub-interval [0.3, 0.55); subdividing again, 0.49 falls in c’s sub-interval [0.475, 0.55). Hence bbc.
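The recurrences and the decoding procedure above, sketched with plain floating point (a real arithmetic coder uses the integer implementation mentioned earlier; function names are mine):

```python
def sequence_interval(msg, p, cum):
    """l_i = l_{i-1} + s_{i-1} * cum[T_i];  s_i = s_{i-1} * p[T_i]."""
    l, s = 0.0, 1.0
    for sym in msg:
        l, s = l + s * cum[sym], s * p[sym]
    return l, s

def decode(x, n, p, cum):
    """Recover n symbols from a number x inside the sequence interval."""
    out = []
    for _ in range(n):
        sym = max((t for t in cum if cum[t] <= x), key=cum.get)
        out.append(sym)
        x = (x - cum[sym]) / p[sym]    # rescale x into [0, 1) and repeat
    return "".join(out)

p = {"a": 0.2, "b": 0.5, "c": 0.3}
cum = {"a": 0.0, "b": 0.2, "c": 0.7}
l, s = sequence_interval("bac", p, cum)
print(round(l, 2), round(l + s, 2))    # -> 0.27 0.3
print(decode(0.49, 3, p, cum))         # -> bbc
```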
How do we encode that number?
If x = v/2^k (a dyadic fraction), then the encoding equals bin(v) over k digits (possibly padded with 0s in front)
7/16 → .0111
3/4 → .11
1/3 → .0101… (not dyadic: its expansion is infinite)
How do we encode that number?
Binary fractional representation:
FractionalEncode(x):
1. x = 2 · x
2. if x < 1, output 0
3. else { output 1; x = x − 1 }
The emitted bits b1 b2 b3 b4 … satisfy x = b1·2^{−1} + b2·2^{−2} + b3·2^{−3} + b4·2^{−4} + …
e.g., 1/3 → .0101…:
2 · (1/3) = 2/3 < 1, output 0
2 · (2/3) = 4/3 > 1, output 1; 4/3 − 1 = 1/3 → incremental generation (the state repeats)
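FractionalEncode runs as-is once we cap the number of emitted bits; the sketch below also shows the repeating expansion of 1/3:

```python
def fractional_encode(x: float, nbits: int) -> str:
    """Binary fractional expansion of x in [0, 1): double, emit the integer part."""
    bits = []
    for _ in range(nbits):
        x *= 2
        if x < 1:
            bits.append("0")
        else:
            bits.append("1")
            x -= 1
    return "".join(bits)

print(fractional_encode(7 / 16, 4))    # -> 0111
print(fractional_encode(3 / 4, 2))     # -> 11
print(fractional_encode(1 / 3, 6))     # -> 010101 (the state 1/3 repeats forever)
```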
Which number do we encode?
Truncate the encoding to the first d = ⌈log2 (2/s_n)⌉ bits
Truncation gets a smaller number… how much smaller?
Truncation ⇒ Compression
Truncating x = .b1 b2 b3 … to its first d bits decreases it by at most
Σ_{i>d} b_i·2^{−i} ≤ Σ_{i>d} 2^{−i} = 2^{−d} = 2^{−⌈log2 (2/s_n)⌉} ≤ s_n/2
So we encode x = l_n + s_n/2: even after truncation, the number stays inside [l_n, l_n + s_n).
Bound on code length
Theorem: For a text T of length n, the Arithmetic
encoder generates at most
⌈log2 (2/s_n)⌉ < 1 + log2 (2/s_n) = 1 + (1 − log2 s_n)
= 2 − log2 (∏_{i=1..n} p(T_i))
= 2 − log2 (∏_σ p(σ)^{occ(σ)})
= 2 − Σ_σ occ(σ) · log2 p(σ)
≈ 2 + Σ_σ (n · p(σ)) · log2 (1/p(σ))
= 2 + n H(T) bits
e.g., T = acabc:
s_n = p(a) · p(c) · p(a) · p(b) · p(c) = p(a)² · p(b) · p(c)²
Document Compression
Dictionary-based compressors
LZ77
Algorithm’s step: output <dist, len, next-char>, then advance by len + 1
A buffer “window” has fixed length and moves over the text; the dictionary consists of all substrings starting in the window.
e.g., on the text a a c a a c a b c a a a a a a, the encoder emits triples such as <6,3,a> (copy 3 characters from 6 positions back, then emit a) and <3,4,c>.
LZ77 Decoding
The decoder keeps the same dictionary window as the encoder: it finds the substring and inserts a copy of it.
What if len > dist? (overlap with the text still being written) E.g., seen = abcd, next codeword is <2,9,e>. Simply copy starting at the cursor:
for (i = 0; i < len; i++) out[cursor+i] = out[cursor-d+i];
The output is correct: abcdcdcdcdcdce
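The decoder's copy loop, including the overlapping case, can be sketched as follows (representing literals as <0,0,char> triples is my convention for the sketch, not the lecture's format):

```python
def lz77_decode(triples):
    """Decode <dist, len, next-char> triples; copying one character at a time
    makes the overlapping case (len > dist) work automatically."""
    out = []
    for dist, length, ch in triples:
        cursor = len(out)
        for i in range(length):
            out.append(out[cursor - dist + i])
        out.append(ch)
    return "".join(out)

# Overlap example from above: seen = abcd, next codeword <2,9,e>
print(lz77_decode([(0, 0, "a"), (0, 0, "b"), (0, 0, "c"), (0, 0, "d"), (2, 9, "e")]))
# -> abcdcdcdcdcdce
```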
LZ77 Optimizations used by gzip
LZSS: output one of the two formats (0, position, length) or (1, char)
Typically the second format is used when length < 3.
Special greedy parsing: possibly use a shorter match so that the next match is better
A hash table speeds up the search for matching triplets
Triples are coded with Huffman’s code
You find this at: www.gzip.org/zlib/
Dictionary search
Exact string search
Paper on Cuckoo Hashing
Exact String Search
Given a dictionary D of K strings, of total length N, store them in a way that we can efficiently support searches for a pattern P over them.
Hashing ?
Hashing with chaining
Key issue: a good hash function
Basic assumption: Uniform hashing
Avg #keys per slot = n · (1/m) = n/m = α (the load factor)
Search cost is O(1 + α), hence O(1) if m = Θ(n).
In practice
A trivial hash function is: h(k) = k mod p, with p prime.
A “provably good” hash: view the key k as pieces k_0 k_1 k_2 … k_r of ≈ log2 m bits each, where r ≈ L / log2 m (L = max string length, m = table size); each a_i is selected at random in [0, m), and
h(k) = (a_0·k_0 + a_1·k_1 + … + a_r·k_r) mod p, with p prime
(not necessarily: (… mod p) mod m)
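A sketch of the “provably good” scheme above, splitting the key into byte-sized pieces for simplicity (the slide uses pieces of ≈ log2 m bits, and a prime modulus; here the sum is reduced directly modulo the table size, and the names are mine):

```python
import random

def make_hash(m: int, r: int):
    """h(k) = (a0*k0 + a1*k1 + ... + ar*kr) mod m, with random a_i in [0, m)."""
    a = [random.randrange(m) for _ in range(r + 1)]
    def h(key: str) -> int:
        pieces = key.encode()[:r + 1]  # the pieces k_i: here, the key's bytes
        return sum(ai * ki for ai, ki in zip(a, pieces)) % m
    return h

random.seed(42)                        # fixed seed, so the sketch is repeatable
h = make_hash(m=1024, r=15)
print(h("systile"), h("szaibelyite"))  # two slots in [0, 1024)
```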
Cuckoo Hashing
2 hash tables, and 2 random choices where an item can be stored.
A running example: keys A, B, C, D, E are already placed; a new key F finds one of its two candidate cells free and is inserted there; a later key G finds both of its cells occupied, so it evicts an occupant, which moves to its alternative cell, possibly evicting another key in turn, until a free cell is found.
Cuckoo Hashing Examples
Random (bipartite) graph: node=cell, edge=key
Natural Extensions
More than 2 hashes (choices) per key:
Very different analysis: hypergraphs instead of graphs
Higher memory utilization: 3 choices give 90+% in experiments, 4 choices about 97%
…but more insert time (and random accesses)
2 hashes + bins of size B:
Balanced allocation and tightly O(1)-size bins
Insertion sees a tree of possible evict+insert paths
More memory… but more local
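A minimal two-table cuckoo sketch (toy deterministic hash functions and no rehashing, both my simplifications; real implementations pick new random hashes when an insertion cycles):

```python
class CuckooHash:
    """Two tables, two hash functions; inserts evict occupants to their other cell."""
    def __init__(self, size=11):
        self.size = size
        self.tables = [[None] * size, [None] * size]

    def _pos(self, key, t):
        h = sum(ord(ch) for ch in key)          # toy hash on strings
        return h % self.size if t == 0 else (h * 7 + 3) % self.size

    def lookup(self, key):
        return any(self.tables[t][self._pos(key, t)] == key for t in (0, 1))

    def insert(self, key, max_kicks=50):
        if self.lookup(key):
            return True
        t = 0
        for _ in range(max_kicks):
            pos = self._pos(key, t)
            key, self.tables[t][pos] = self.tables[t][pos], key   # evict occupant
            if key is None:
                return True
            t = 1 - t                           # evicted key tries its other table
        return False                            # cycle: a rehash would be needed

c = CuckooHash()
for k in "ABCDEFG":
    c.insert(k)
print(all(c.lookup(k) for k in "ABCDEFG"))      # -> True
```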
Dictionary search
Prefix-string search
Reading 3.1 and 5.2
Prefix-string Search
Given a dictionary D of K strings, of total length N, store them in a way that we can efficiently support prefix searches for a pattern P over them.
Trie: speeding-up searches
e.g., the dictionary {systile, syzygetic, syzygial, syzygy, szaibelyite, szczecin, szomo} is stored in a trie: the edges are labeled with string pieces (s; then y or z; then stile, zyg; then etic, ial, y; or aibelyite, czecin, omo), and each leaf corresponds to a dictionary string.
Pro: O(p) search time
Cons: space for the edge and node labels and the tree structure
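A nested-dictionary trie sketch over the example strings: descending takes O(p) steps, and the subtree below the last node yields all matches (the '$' end-of-word marker is my convention):

```python
def build_trie(words):
    root = {}
    for w in words:
        node = root
        for ch in w:
            node = node.setdefault(ch, {})
        node["$"] = True                        # end-of-word marker
    return root

def prefix_search(root, prefix):
    node = root
    for ch in prefix:                           # O(p) descent
        if ch not in node:
            return []
        node = node[ch]
    out = []
    def collect(n, acc):
        for ch, child in n.items():
            if ch == "$":
                out.append(prefix + acc)
            else:
                collect(child, acc + ch)
    collect(node, "")
    return sorted(out)

D = ["systile", "syzygetic", "syzygial", "syzygy", "szaibelyite", "szczecin", "szomo"]
print(prefix_search(build_trie(D), "syz"))      # -> ['syzygetic', 'syzygial', 'syzygy']
```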
Front-coding: squeezing strings
http://checkmate.com/All_Natural/
http://checkmate.com/All_Natural/Applied.html
http://checkmate.com/All_Natural/Aroma.html
http://checkmate.com/All_Natural/Aroma1.html
http://checkmate.com/All_Natural/Aromatic_Art.html
http://checkmate.com/All_Natural/Ayate.html
http://checkmate.com/All_Natural/Ayer_Soap.html
http://checkmate.com/All_Natural/Ayurvedic_Soap.html
http://checkmate.com/All_Natural/Bath_Salt_Bulk.html
http://checkmate.com/All_Natural/Bath_Salts.html
http://checkmate.com/All/Essence_Oils.html
http://checkmate.com/All/Mineral_Bath_Crystals.html
http://checkmate.com/All/Mineral_Bath_Salt.html
http://checkmate.com/All/Mineral_Cream.html
http://checkmate.com/All/Natural/Washcloth.html
...
Front-coded (shared-prefix length, then suffix):
0 http://checkmate.com/All_Natural/
33 Applied.html
34 roma.html
38 1.html
38 tic_Art.html
34 yate.html
35 er_Soap.html
35 urvedic_Soap.html
33 Bath_Salt_Bulk.html
42 s.html
25 Essence_Oils.html
25 Mineral_Bath_Crystals.html
38 Salt.html
33 Cream.html
0 http://checkmate.com/All/Natural/Washcloth.html
...
(≈ 45% space saving)
…systile syzygetic syzygial syzygy… front-coded with shared-prefix lengths 2, 5, 5
Gzip may be much better...
…(7,0)systile (9,2)zygetic (8,5)ial (6,5)y (11,0)szaibelyite (8,2)czecin (5,2)omo… (each string stored as its length, the shared-prefix length, and the remaining suffix)
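The (shared-prefix-length, suffix) representation can be sketched as follows (a sketch; the slide additionally stores each string's total length):

```python
def front_encode(sorted_words):
    """Store each word as (length of prefix shared with the previous word, suffix)."""
    out, prev = [], ""
    for w in sorted_words:
        lcp = 0
        while lcp < min(len(w), len(prev)) and w[lcp] == prev[lcp]:
            lcp += 1
        out.append((lcp, w[lcp:]))
        prev = w
    return out

def front_decode(pairs):
    out, prev = [], ""
    for lcp, suffix in pairs:
        prev = prev[:lcp] + suffix
        out.append(prev)
    return out

D = ["systile", "syzygetic", "syzygial", "syzygy", "szaibelyite", "szczecin", "szomo"]
print(front_encode(D))
# -> [(0, 'systile'), (2, 'zygetic'), (5, 'ial'), (5, 'y'), (1, 'zaibelyite'), (2, 'czecin'), (2, 'omo')]
```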
CT on a sample of the strings: e.g., systile, szaibelyite, …
2-level indexing
Buckets of (front-coded) strings are stored on disk; the internal memory keeps a routing structure over a sample of them.
2 advantages:
• Search ≈ typically 1 I/O
• Space ≈ front-coding over buckets
A disadvantage:
• Trade-off ≈ speed vs. space (because of bucket size)