Zone indexes
Paolo Ferragina, Dipartimento di Informatica, Università di Pisa
Reading 6.1
Parametric and zone indexes
Thus far, a doc has been a term sequence
But documents have multiple parts: Author Title Date of publication Language Format etc.
These are the metadata about a document
Sec. 6.1
Zone
A zone is a region of the doc that can contain an arbitrary amount of text, e.g.,
Title, Abstract, References, …
Build inverted indexes on fields AND zones to permit querying
E.g., “find docs with merchant in the title zone and matching the query gentle rain”
Sec. 6.1
Example zone indexes
Encode zones in dictionary vs. postings.
Sec. 6.1
Tiered indexes
Break postings up into a hierarchy of lists: most important, …, least important
Inverted index thus broken up into tiers of decreasing importance
At query time use top tier unless it fails to yield K docs
If so drop to lower tiers
Sec. 7.2.1
Example tiered index
Sec. 7.2.1
Index construction: Compression of postings
Paolo Ferragina, Dipartimento di Informatica, Università di Pisa
Reading 5.3 and a paper
γ code for integer encoding
x > 0 and Length = ⌊log2 x⌋ + 1
Codeword: Length − 1 zeros, followed by the Length bits of x in binary
e.g., 9 represented as <000,1001>.
γ code for x takes 2⌊log2 x⌋ + 1 bits (i.e. factor of 2 from optimal)
Optimal for Pr(x) = 1/(2x²), and i.i.d. integers
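A sketch of the γ encoder/decoder described above (bit strings are kept as Python strings for clarity; a real index would pack them into machine words):

```python
def gamma_encode(x):
    """Elias gamma: (Length - 1) zeros, then the Length bits of x in binary."""
    assert x > 0
    b = bin(x)[2:]                  # binary representation of x; Length = len(b)
    return "0" * (len(b) - 1) + b

def gamma_decode(bits):
    """Split a concatenation of gamma codewords back into integers."""
    out, i = [], 0
    while i < len(bits):
        z = 0
        while bits[i] == "0":       # count the leading zeros: Length - 1
            z += 1
            i += 1
        out.append(int(bits[i:i + z + 1], 2))   # next z+1 bits are x in binary
        i += z + 1
    return out
```

`gamma_decode` also reconstructs the decoding exercise on the next slide, since the code is prefix-free.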
It is a prefix-free encoding…
Given the following sequence of coded integers, reconstruct the original sequence:
0001000001100110000011101100111
8 6 3 59 7
δ code for integer encoding
Use γ-coding to reduce the length of the first field
Useful for medium-sized integers
Codeword: γ(Length) followed by the binary representation of x
e.g., 19 represented as <00,101,10011>.
δ coding of x takes about log2 x + 2 log2(log2 x) + 2 bits.
Optimal for Pr(x) = 1/(2x (log2 x)²), and i.i.d. integers
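The δ code in the slide's variant (γ of the length, then the full binary of x), as a sketch:

```python
def delta_encode(x):
    """Elias delta, slide variant: gamma(Length) followed by bin(x) in full."""
    assert x > 0
    b = bin(x)[2:]                        # binary of x; Length = len(b)
    lb = bin(len(b))[2:]
    return "0" * (len(lb) - 1) + lb + b   # gamma(Length) + x

def delta_decode(bits):
    out, i = [], 0
    while i < len(bits):
        z = 0
        while bits[i] == "0":             # gamma-decode the Length field
            z += 1
            i += 1
        L = int(bits[i:i + z + 1], 2)
        i += z + 1
        out.append(int(bits[i:i + L], 2)) # then read Length bits of x
        i += L
    return out
```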
Variable-byte codes [10.2 bits per TREC12]
Wish to get very fast (de)compression ⇒ byte-aligned
Given a binary representation of an integer:
- Pad with 0s in front, to get a multiple-of-7 number of bits
- Form groups of 7 bits each
- Tag the last group with the bit 0, and the other groups with the bit 1
e.g., v = 2^14 + 1 ⇒ binary(v) = 100000000000001 ⇒ tagged bytes 10000001 10000000 00000001
Note: we waste 1 bit per byte, and on avg 4 bits in the first byte.
But it is a prefix code, and it also encodes the value 0!!
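The byte-aligned scheme above, as a sketch following the slide's tagging convention (tag 1 on all bytes except the last):

```python
def vb_encode(v):
    """Variable-byte: 7 data bits per byte; tag bit 1 on every byte
    except the last, which gets tag 0 (slide convention)."""
    assert v >= 0                      # note: 0 is encodable too
    groups = []
    while True:
        groups.append(v & 0x7F)        # take the low 7 bits
        v >>= 7
        if v == 0:
            break
    groups.reverse()                   # most-significant group first
    return bytes(g | 0x80 for g in groups[:-1]) + bytes([groups[-1]])

def vb_decode(data):
    out, cur = [], 0
    for byte in data:
        cur = (cur << 7) | (byte & 0x7F)
        if byte & 0x80 == 0:           # tag 0 marks the last byte of an integer
            out.append(cur)
            cur = 0
    return out
```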
PForDelta coding
Example: a block of 128 numbers (2 3 3 … 1 1 3 3 23 13 42 2) packed in b = 2 bits each = 256 bits = 32 bytes
Use b (e.g. 2) bits to encode each of the 128 numbers, or create exceptions
Encode exceptions separately: ESC values or pointers
Choose b to encode 90% of the values, or trade off: larger b wastes more bits, smaller b creates more exceptions
Translate data: [base, base + 2^b − 1] → [0, 2^b − 1]
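A simplified sketch of the idea: values that fit in b bits go into the packed stream, the rest become exceptions. Storing exceptions as plain (position, value) pairs is an assumption for clarity; the original scheme bit-packs the stream and chains the exception slots:

```python
def pfor_encode(block, b):
    """PForDelta sketch: values < 2**b go in the b-bit stream;
    larger values leave a placeholder and go to an exception area."""
    packed, exceptions = [], []
    for i, v in enumerate(block):
        if v < (1 << b):
            packed.append(v)
        else:
            packed.append(0)           # placeholder in the b-bit stream
            exceptions.append((i, v))  # simplification: explicit (pos, value)
    return packed, exceptions

def pfor_decode(packed, exceptions):
    out = list(packed)
    for i, v in exceptions:            # patch the exceptions back in
        out[i] = v
    return out
```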
Index construction: Compression of documents
Paolo Ferragina, Dipartimento di Informatica, Università di Pisa
Reading Managing-Gigabytes: pg 21-36, 52-56, 74-79
Uniquely Decodable Codes
A variable length code assigns a bit string (codeword) of variable length to every symbol
e.g. a = 1, b = 01, c = 101, d = 011
What if you get the sequence 1011 ?
With a uniquely decodable code, every coded bit sequence can be decomposed into codewords in exactly one way.
Prefix Codes
A prefix code is a variable length code in which no codeword is a prefix of another one
e.g., a = 0, b = 100, c = 101, d = 11
Can be viewed as a binary trie: [a at 0; b at 100; c at 101; d at 11]
Average Length
For a code C with codeword lengths L[s], the average length is defined as
La(C) = ∑ s∈S p(s) * L[s]
p(A) = .7 [codeword 0], p(B) = p(C) = p(D) = .1 [3-bit codewords 1··]
La = .7 * 1 + .3 * 3 = 1.6 bits (Huffman achieves 1.5 bits)
We say that a prefix code C is optimal if for all prefix codes C’, La(C) ≤ La(C’)
Entropy (Shannon, 1948)
For a source S emitting symbols with probability p(s), the self information of s is:
i(s) = log2 (1/p(s)) bits
Lower probability ⇒ higher information
Entropy is the weighted average of i(s):
H(S) = ∑ s∈S p(s) * log2 (1/p(s))
0-th order empirical entropy of string T:
H0(T) = ∑ s (occ(s)/|T|) * log2 (|T|/occ(s))
Performance: Compression ratio
Compression ratio = #bits in output / #bits in input
Compression performance: we relate entropy to the compression ratio.
p(A) = .7, p(B) = p(C) = p(D) = .1
H ≈ 1.36 bits
Huffman ≈ 1.5 bits per symb
Shannon: H(S) vs the avg codeword length in practice, ∑ s p(s) * |c(s)|
Empirical H vs compression ratio: H0(T) vs |C(T)|/|T|, i.e. |T| * H0(T) vs |C(T)|
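The empirical entropy H0(T) can be computed directly; a small sketch, checked against the slide's distribution (.7/.1/.1/.1 gives ≈ 1.36 bits):

```python
import math
from collections import Counter

def empirical_entropy(text):
    """0-th order empirical entropy H0(T), in bits per symbol."""
    occ = Counter(text)                 # occ(s) = occurrences of s in T
    n = len(text)
    return sum(c / n * math.log2(n / c) for c in occ.values())
```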
Statistical Coding
How do we use probability p(s) to encode s?
Huffman codes
Arithmetic codes
Document Compression
Huffman coding
Huffman Codes
Invented by Huffman as a class assignment in the ’50s.
Used in most compression algorithms: gzip, bzip, jpeg (as option), fax compression, …
Properties: generates optimal prefix codes; cheap to encode and decode; La(Huff) = H if probabilities are powers of 2
Otherwise, La(Huff) < H + 1: less than 1 extra bit per symb on avg!!
Running Example
p(a) = .1, p(b) = .2, p(c) = .2, p(d) = .5
a = 000, b = 001, c = 01, d = 1
[tree: a(.1)+b(.2) → (.3); (.3)+c(.2) → (.5); (.5)+d(.5) → (1)]
There are 2^(n−1) “equivalent” Huffman trees
What about ties (and thus, tree depth)?
Encoding and Decoding
Encoding: Emit the root-to-leaf path leading to the symbol to be encoded.
Decoding: Start at root and take branch for each bit received. When at leaf, output its symbol and return to root.
[same tree: a(.1)+b(.2) → (.3); (.3)+c(.2) → (.5); (.5)+d(.5) → (1)]
Encoding abc… gives 00000101…
Decoding 101001… gives dcb…
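The Huffman construction can be sketched with a min-heap; the tie-breaking counter below is just one way to resolve the ties the slides ask about (different tie-breaking gives different, equally optimal trees):

```python
import heapq
from itertools import count

def huffman_codes(probs):
    """Build a Huffman code from {symbol: probability}; returns {symbol: codeword}."""
    tiebreak = count()                    # heap tie-breaker: trees can't be compared
    heap = [(p, next(tiebreak), s) for s, p in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p1, _, t1 = heapq.heappop(heap)   # the two least-probable trees...
        p2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (p1 + p2, next(tiebreak), (t1, t2)))  # ...are merged
    codes = {}
    def walk(tree, prefix):
        if isinstance(tree, tuple):       # internal node: recurse on the 0/1 branches
            walk(tree[0], prefix + "0")
            walk(tree[1], prefix + "1")
        else:                             # leaf: record the root-to-leaf path
            codes[tree] = prefix or "0"
    walk(heap[0][2], "")
    return codes
```

On the running example the codeword lengths come out 1, 2, 3, 3, i.e. La = 1.8 bits for these probabilities.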
Problem with Huffman Coding
Take a two-symbol alphabet {a, b}. Whatever their probabilities, Huffman uses 1 bit for each symbol and thus takes n bits to encode a message of n symbols
This is ok when the probabilities are almost the same, but what about p(a) = .999?
The optimal code for a is −log2(.999) ≈ .00144 bits, so optimal coding of the a’s should use about n * .0014 bits,
which is much less than the n bits taken by Huffman
Document Compression
Arithmetic coding
Introduction
It uses “fractional” parts of bits!!
Gets nH(T) + 2 bits vs. nH(T) + n of Huffman
Used in JPEG/MPEG (as option), bzip
More time-costly than Huffman, but integer implementations are not too bad.
Near-ideal performance: in practice the overhead is about 0.02 * n
Symbol interval
Assign each symbol to an interval range from 0 (inclusive) to 1 (exclusive).
e.g.
a = .2
c = .3
b = .5
cum[c] = p(a)+p(b) = .7
cum[b] = p(a) = .2
cum[a] = .0
The interval for a particular symbol will be called the symbol interval (e.g., for b it is [.2,.7))
Sequence interval
Coding the message sequence: bac
Start from [0,1), split in proportion to the probabilities: a → [0,.2), b → [.2,.7), c → [.7,1)
b: interval [0.2, 0.7), of size (1−0) * 0.5 = 0.5
a: interval [0.2, 0.3), of size (0.7−0.2) * 0.2 = 0.1
c: interval [0.27, 0.3), of size (0.3−0.2) * 0.3 = 0.03
The final sequence interval is [.27, .3)
The algorithm
To code a sequence of symbols T1 … Tn with probabilities p(·), use the following recurrences:
s0 = 1, l0 = 0
li = li−1 + si−1 * cum[Ti]
si = si−1 * p[Ti]
e.g., with p(a) = .2, p(b) = .5, p(c) = .3: from li−1 = 0.2 and si−1 = 0.1, coding c gives
si = 0.1 * 0.3 = 0.03
li = 0.2 + 0.1 * (0.2 + 0.5) = 0.27
The algorithm
Each symbol Ti narrows the interval by a factor p[Ti]:
li = li−1 + si−1 * cum[Ti]
si = si−1 * p[Ti]
Starting from s0 = 1, l0 = 0, the final interval size is
sn = ∏ i=1..n p(Ti)
Sequence interval: [ ln , ln + sn )
Take a number inside
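The recurrences written out as code; floating point stands in for the exact arithmetic a real coder uses, and the symbol order of the dict fixes cum[·] as in the slides:

```python
def sequence_interval(msg, probs):
    """Apply l_i = l_{i-1} + s_{i-1}*cum[T_i] and s_i = s_{i-1}*p[T_i]."""
    cum, acc = {}, 0.0
    for sym, p in probs.items():     # cum[s] = total probability of preceding symbols
        cum[sym] = acc
        acc += p
    l, s = 0.0, 1.0                  # l_0 = 0, s_0 = 1
    for c in msg:
        l = l + s * cum[c]
        s = s * probs[c]
    return l, s                      # the sequence interval is [l, l + s)
```

On the slides' example, `sequence_interval("bac", {"a": 0.2, "b": 0.5, "c": 0.3})` yields the interval [.27, .3).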
Decoding Example
Decoding the number .49, knowing the message is of length 3:
The message is bbc.
[decoding steps: .49 ∈ [.2,.7) → b; within [.2,.7), .49 ∈ [.3,.55) → b; within [.3,.55), .49 ∈ [.475,.55) → c]
How do we encode that number?
If x = v/2^k (dyadic fraction) then the encoding is equal to bin(v) over k digits (possibly padded with 0s in front)
e.g., 7/16 → .0111, 3/4 → .11; 1/3 is not dyadic → .0101… (infinite expansion)
How do we encode that number?
Binary fractional representation:
x = .b1 b2 b3 b4 b5 … = b1 * 2^−1 + b2 * 2^−2 + b3 * 2^−3 + b4 * 2^−4 + …
FractionalEncode(x):
1. x = 2 * x
2. If x < 1 output 0
3. else { output 1; x = x − 1 }
e.g., 1/3 → .0101…:
2 * (1/3) = 2/3 < 1, output 0
2 * (2/3) = 4/3 > 1, output 1; 4/3 − 1 = 1/3 (incremental generation: the state repeats)
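FractionalEncode can be run exactly on rationals; a sketch using Fraction to avoid floating-point drift:

```python
from fractions import Fraction

def fractional_encode(x, nbits):
    """First nbits of the binary fractional expansion of x in [0,1),
    generated incrementally as in FractionalEncode."""
    bits = []
    for _ in range(nbits):
        x = 2 * x                # step 1: x = 2 * x
        if x < 1:
            bits.append("0")     # step 2: emit 0
        else:
            bits.append("1")     # step 3: emit 1 and subtract 1
            x -= 1
    return "".join(bits)
```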
Which number do we encode?
Encode x = ln + sn/2 (the midpoint of the sequence interval), truncated to its first d = ⌈log2 (2/sn)⌉ bits
Truncation gets a smaller number… how much smaller?
The dropped tail contributes at most ∑ i>d bi * 2^−i ≤ ∑ i>d 2^−i = 2^−d = 2^−⌈log2 (2/sn)⌉ ≤ sn/2
so the truncated number is ≥ ln and still lies inside [ ln , ln + sn )
Truncation ⇒ Compression
Bound on code length
Theorem: For a text T of length n, the Arithmetic encoder generates at most 2 + n H0(T) bits:
⌈log2 (2/sn)⌉ ≤ 1 + log2 (2/sn) = 1 + (1 − log2 sn)
= 2 − log2 (∏ i=1..n p(Ti))
= 2 − log2 (∏σ p(σ)^occ(σ))
= 2 − ∑σ occ(σ) * log2 p(σ)
= 2 + ∑σ occ(σ) * log2 (1/p(σ))
= 2 + n H0(T) bits
e.g., T = acabc ⇒ sn = p(a) * p(c) * p(a) * p(b) * p(c) = p(a)² * p(b) * p(c)²
Document Compression
Dictionary-based compressors
LZ77
Algorithm’s step: output <dist, len, next-char>, then advance by len + 1
A buffer “window” has fixed length and moves over the text; the dictionary consists of all substrings starting inside it
e.g., on a a c a a c a b c a a a a a a …, the steps shown output <6,3,a> and then <3,4,c>
LZ77 Decoding
Decoder keeps same dictionary window as encoder. Finds substring and inserts a copy of it
What if l > d? (overlap with text to be compressed) E.g. seen = abcd, next codeword is (2,9,e) Simply copy starting at the cursor
for (i = 0; i < len; i++) out[cursor+i] = out[cursor-d+i]
Output is correct: abcdcdcdcdcdce
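The decoder's copy loop as a sketch; literals are assumed here to be emitted as <0,0,char> triples (an assumption for the sketch: real variants such as LZSS use a separate literal format):

```python
def lz77_decode(triples):
    """Decode <dist, len, next-char> triples. Copying one character at a
    time makes the overlap case (len > dist) work, as described above."""
    out = []
    for dist, length, nxt in triples:
        start = len(out) - dist
        for i in range(length):
            out.append(out[start + i])   # may re-read characters appended just now
        out.append(nxt)
    return "".join(out)
```

Running it on the slide's overlap example reproduces abcdcdcdcdcdce.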
LZ77 optimizations used by gzip
LZSS: output one of the two formats (0, position, length) or (1, char)
Typically uses the second format if length < 3.
Special greedy parsing: possibly use a shorter match so that the next match is better
Hash table to speed up searches on triplets
Triples are compressed with Huffman coding
You find this at: www.gzip.org/zlib/
Dictionary search
Exact string search
Paper on Cuckoo Hashing
Exact String Search
Given a dictionary D of K strings, of total length N, store them in a way that we can efficiently support searches for a pattern P over them.
Hashing?
Hashing with chaining
Key issue: a good hash function
Basic assumption: uniform hashing
Avg #keys per slot = n * (1/m) = n/m = α (the load factor)
Search cost
Θ(1 + α) on average under uniform hashing, hence O(1) when m = Θ(n)
In practice
A trivial hash function is: h(k) = k mod m, with m prime
A “provably good” hash is
h_a(K) = ( ∑ i=0..r a_i * k_i ) mod p
where the key K is split into chunks k0 k1 k2 … kr of ≈ log2 m bits each, so r ≈ L / log2 m (L = max string len, m = table size), p is a prime, and each a_i is selected at random in [0, m)
not necessarily: (… mod p) mod m
Cuckoo Hashing
2 hash tables, and 2 random choices where an item can be stored
[figure: table 1 holds A, B, C; table 2 holds E, D]
A running example: inserting F evicts an occupant, which moves to its alternate cell; inserting G then triggers a longer chain of evictions until every item finds a cell
[figure: table states evolve from A B C | E D to E G B F C | A D]
Cuckoo Hashing Examples
Random (bipartite) graph: node = cell, edge = key
[figure: the cells of the two tables as nodes; each key A…G is an edge joining its two candidate cells]
Natural Extensions
More than 2 hashes (choices) per key: very different analysis (hypergraphs instead of graphs); higher memory utilization (3 choices: 90+% in experiments; 4 choices: about 97%), but more insert time (and random accesses)
2 hashes + bins of size B: balanced allocation and tightly O(1)-size bins; insertion sees a tree of possible evict+insert paths; more memory… but more local
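The basic 2-table scheme, as a minimal sketch; the hash functions (Python's hash salted by a table id) and the fixed kick limit are assumptions, and a full implementation would rehash on failure:

```python
class CuckooHash:
    """Cuckoo hashing sketch: 2 tables, 2 candidate cells per key."""
    def __init__(self, size=17):
        self.size = size
        self.tables = [[None] * size, [None] * size]

    def _slot(self, key, t):
        return hash((t, key)) % self.size    # assumption: stand-in hash functions

    def lookup(self, key):
        # a key can only live in one of its two candidate cells
        return any(self.tables[t][self._slot(key, t)] == key for t in (0, 1))

    def insert(self, key, max_kicks=50):
        for _ in range(max_kicks):
            for t in (0, 1):
                i = self._slot(key, t)
                if self.tables[t][i] is None:
                    self.tables[t][i] = key
                    return True
                # cell occupied: place key here, carry the evicted item onward
                key, self.tables[t][i] = self.tables[t][i], key
        return False   # give up; a full implementation would rehash
```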
Dictionary search
Prefix-string search
Reading 3.1 and 5.2
Prefix-string Search
Given a dictionary D of K strings, of total length N, store them in a way that we can efficiently support prefix searches for a pattern P over them.
Trie: speeding-up searches
[figure: compacted trie over the dictionary, with branching characters (s, y, z, …) and compacted edge labels such as stile, zyg, etic, ial, y, gy, aibelyite, czecin, omo]
Pro: O(p) search time
Cons: edge + node labels and tree structure
Front-coding: squeezing strings
http://checkmate.com/All_Natural/
http://checkmate.com/All_Natural/Applied.html
http://checkmate.com/All_Natural/Aroma.html
http://checkmate.com/All_Natural/Aroma1.html
http://checkmate.com/All_Natural/Aromatic_Art.html
http://checkmate.com/All_Natural/Ayate.html
http://checkmate.com/All_Natural/Ayer_Soap.html
http://checkmate.com/All_Natural/Ayurvedic_Soap.html
http://checkmate.com/All_Natural/Bath_Salt_Bulk.html
http://checkmate.com/All_Natural/Bath_Salts.html
http://checkmate.com/All/Essence_Oils.html
http://checkmate.com/All/Mineral_Bath_Crystals.html
http://checkmate.com/All/Mineral_Bath_Salt.html
http://checkmate.com/All/Mineral_Cream.html
http://checkmate.com/All/Natural/Washcloth.html
...
is front-coded (shared-prefix length, then the remaining suffix) as:
0 http://checkmate.com/All_Natural/
33 Applied.html
34 roma.html
38 1.html
38 tic_Art.html
34 yate.html
35 er_Soap.html
35 urvedic_Soap.html
33 Bath_Salt_Bulk.html
42 s.html
25 Essence_Oils.html
25 Mineral_Bath_Crystals.html
38 Salt.html
33 Cream.html
0 http://checkmate.com/All/Natural/Washcloth.html
...
(≈ 45%)
….systile syzygetic syzygial syzygy…. → shared prefixes of length 2, 5, 5
Gzip may be much better...
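The front-coding scheme above, as a small sketch (each string of a sorted bucket is stored as shared-prefix length plus suffix):

```python
def front_encode(strings):
    """Front coding: (length of prefix shared with the previous string, suffix)."""
    out, prev = [], ""
    for s in strings:
        k = 0
        while k < min(len(s), len(prev)) and s[k] == prev[k]:
            k += 1                   # measure the shared prefix
        out.append((k, s[k:]))
        prev = s
    return out

def front_decode(pairs):
    out, prev = [], ""
    for k, suffix in pairs:
        prev = prev[:k] + suffix     # rebuild from the previous string
        out.append(prev)
    return out
```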
…. (7,0)systile (9,2)zygetic (8,5)ial (6,5)y (11,0)szaibelyite (8,2)czecin (9,2)omo ….
2-level indexing: buckets of front-coded strings on Disk; in Internal Memory, a compacted trie built on a sample (e.g., the first string of each bucket: systile, szaibelyite, …)
A disadvantage:
• Trade-off ≈ speed vs space (because of bucket size)
2 advantages:
• Search ≈ typically 1 I/O
• Space ≈ Front-coding over buckets