Dictionary search
Exact string search
Paper on Cuckoo Hashing
Exact String Search
Given a dictionary D of K strings, of total
length N, store them in a way that we can
efficiently support searches for a pattern P
over them.
Hashing
Hashing with chaining
Key issue: a good hash function
Basic assumption: Uniform hashing
Avg #keys per slot = n * (1/m) = n/m = α (load factor)
Search cost
m = Ω(n), so the load factor n/m is O(1)
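Hashing with chaining can be sketched as follows. This is a minimal illustration, not the slides' own code: the class name `ChainedHashTable` is hypothetical, and Python's built-in `hash` stands in for a good hash function.

```python
class ChainedHashTable:
    """Hash table with chaining: each slot holds a list of (key, value) pairs."""

    def __init__(self, m=8):
        self.m = m                              # number of slots
        self.n = 0                              # number of stored keys
        self.slots = [[] for _ in range(m)]

    def _slot(self, key):
        # Assumption: Python's built-in hash plays the role of the hash function.
        return hash(key) % self.m

    def insert(self, key, value):
        chain = self.slots[self._slot(key)]
        for i, (k, _) in enumerate(chain):
            if k == key:                        # key already present: overwrite
                chain[i] = (key, value)
                return
        chain.append((key, value))
        self.n += 1

    def search(self, key):
        # Scans one chain only; under uniform hashing its average length is n/m.
        for k, v in self.slots[self._slot(key)]:
            if k == key:
                return v
        return None

    def load_factor(self):
        return self.n / self.m                  # n/m


table = ChainedHashTable(m=4)
for i, word in enumerate(["systile", "syzygetic", "syzygial", "syzygy"]):
    table.insert(word, i)
print(table.search("syzygial"), table.load_factor())
```

Keeping m proportional to n keeps the chains short on average, which is what makes the expected search cost constant.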
In practice
A trivial hash function is:
h(k) = k mod m, with m prime
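A “provably good” hash is
h_a(k) = ( Σ_{i=0..r} a_i * k_i ) mod m, with m prime
The key k is split into chunks k_0, k_1, k_2, ..., k_r of ≈ log2 m bits each, so r ≈ L / log2 m
a = (a_0, a_1, a_2, ..., a_r), where each a_i is selected at random in [0,m)
L = max string len, m = table size
m is not necessarily prime: use (... mod p) mod m, with p prime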
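The chunk-and-multiply construction above can be sketched in a few lines. This is an illustration under stated assumptions: the function name `make_universal_hash` is hypothetical, keys are byte strings interpreted as integers (so keys are treated as fixed-length, zero-padded bit strings), and each chunk takes floor(log2 m) bits so that every chunk value is smaller than m.

```python
import random


def make_universal_hash(m, max_bits, rng=random):
    """Return h(key) = (sum_i a_i * k_i) mod m for a prime table size m >= 2."""
    chunk_bits = m.bit_length() - 1           # floor(log2 m): every chunk is < m
    n_chunks = -(-max_bits // chunk_bits)     # ceil(L / log2 m) chunks
    a = [rng.randrange(m) for _ in range(n_chunks)]   # each a_i random in [0, m)
    mask = (1 << chunk_bits) - 1

    def h(key: bytes) -> int:
        if len(key) * 8 > max_bits:
            raise ValueError("key longer than the declared maximum length")
        x = int.from_bytes(key, "big")
        total = 0
        for a_i in a:
            total += a_i * (x & mask)         # a_i * k_i for the current chunk
            x >>= chunk_bits
        return total % m

    return h


h = make_universal_hash(101, 64, random.Random(1))
print(h(b"syzygy"), h(b"systile"))
```

Drawing a fresh vector a gives a different function from the same family, which is what a rehash would do when the current one behaves badly.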
Cuckoo Hashing
[Figure: table cells holding A B C / E D]
2 hash tables, and 2 random choices where an item can be stored
A running example
[Figure sequence, cell contents at each step:]
F to be inserted: A B C / E D
After inserting F: A B F C / E D
G to be inserted: A B F C / E D
After inserting G: E G B F C / A D (A and E have been moved to make room)
Cuckoo Hashing Examples
[Figure: cells holding A B C / E D F, with G to be inserted]
Random (bipartite) graph: node=cell, edge=key
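The evict-and-reinsert behaviour of cuckoo hashing can be sketched as below. This is a minimal illustration, not the construction from the cited paper: the class name, the salted use of Python's `hash`, the `max_kicks` bound, and the grow-and-rehash policy are all assumptions made for the example, and `None` is reserved as the empty-cell marker.

```python
import random


class CuckooHashTable:
    """Two tables, two candidate cells per key (one per table)."""

    def __init__(self, size=11, max_kicks=32, seed=0):
        self.size = size
        self.max_kicks = max_kicks
        self.rng = random.Random(seed)
        self.tables = [[None] * size, [None] * size]
        self._new_salts()

    def _new_salts(self):
        # Assumption: salting Python's hash simulates two random hash functions.
        self.salts = [self.rng.getrandbits(32) for _ in range(2)]

    def _pos(self, which, key):
        return hash((self.salts[which], key)) % self.size

    def lookup(self, key):
        # At most two cells are probed.
        return any(self.tables[t][self._pos(t, key)] == key for t in (0, 1))

    def insert(self, key):
        if self.lookup(key):
            return
        which = 0
        for _ in range(self.max_kicks):
            pos = self._pos(which, key)
            evicted = self.tables[which][pos]
            self.tables[which][pos] = key
            if evicted is None:
                return
            key = evicted                  # the evicted key moves to its other table
            which = 1 - which
        self._rehash(key)                  # too many evictions: rebuild

    def _rehash(self, pending):
        keys = [k for t in self.tables for k in t if k is not None] + [pending]
        self.size = 2 * self.size + 1
        self.tables = [[None] * self.size, [None] * self.size]
        self._new_salts()
        for k in keys:
            self.insert(k)


t = CuckooHashTable()
for key in "ABCDEFG":
    t.insert(key)
print(all(t.lookup(k) for k in "ABCDEFG"))
```

A long eviction chain corresponds to a long path (or a cycle) in the graph whose nodes are cells and whose edges are keys, which is why the sketch gives up after a bounded number of kicks and rebuilds.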
Natural Extensions
More than 2 hashes (choices) per key.
• Very different: hypergraphs instead of graphs
• Higher memory utilization: 3 choices: 90+% in experiments; 4 choices: about 97%
• ...but more insert time (and random access)
2 hashes + bins of B-size.
• Balanced allocation and tightly O(1)-size bins
• Insertion sees a tree of possible evict+ins paths
• More memory...but more local
Dictionary search
Making one-sided errors
Paper on Bloom Filter
Crawling
How to keep track of the URLs visited by a crawler?
• URLs are long
• Check should be very fast
• Small errors are tolerable (≈ page not crawled)
⇒ Bloom Filter over crawled URLs
Searching with errors...
Problem: false positives
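A Bloom filter over crawled URLs can be sketched as follows. This is an illustration only: the class name `BloomFilter`, the use of SHA-256, and the double-hashing trick for deriving k positions are assumptions of this example, and the `example.com` URLs are placeholders.

```python
import hashlib


class BloomFilter:
    """Bit array of m bits with k hash positions per item."""

    def __init__(self, m, k):
        self.m = m
        self.k = k
        self.bits = bytearray((m + 7) // 8)

    def _positions(self, item: str):
        # Assumption: k positions derived from two base hashes (double hashing).
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item: str):
        for p in self._positions(item):
            self.bits[p >> 3] |= 1 << (p & 7)

    def __contains__(self, item: str):
        # True for every added item; may also be True for an item never added.
        return all(self.bits[p >> 3] & (1 << (p & 7)) for p in self._positions(item))


visited = BloomFilter(m=800, k=5)
for i in range(100):
    visited.add(f"http://example.com/page{i}")
print("http://example.com/page7" in visited)
```

A "no" answer is always correct, so the crawler never re-fetches a page it was told is new; a wrong "yes" only means a page is skipped, which is the small error the slides accept.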
Not perfectly true but...
[Plot: false positive rate (y-axis, 0 to 0.1) vs. number of hash functions k (x-axis, 0 to 10), for m/n = 8]
Opt k = 8 ln 2 = 5.54...
We do have an explicit formula for the optimal k: k = (m/n) ln 2
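The curve can be reproduced numerically with the standard approximation of the false positive rate, (1 - e^(-kn/m))^k, which is minimized at k = (m/n) ln 2. The function names below are hypothetical; the script is a sketch for m/n = 8.

```python
import math


def false_positive_rate(bits_per_key, k):
    """Approximate rate (1 - e^(-k n/m))^k, where bits_per_key = m/n."""
    return (1.0 - math.exp(-k / bits_per_key)) ** k


def optimal_k(bits_per_key):
    """Real-valued minimizer of the approximation: (m/n) * ln 2."""
    return bits_per_key * math.log(2)


# Tabulate the rate for k = 1..10 hash functions at m/n = 8.
for k in range(1, 11):
    print(k, round(false_positive_rate(8, k), 4))
print("optimal k:", round(optimal_k(8), 3))
```

Since k must be an integer, one picks a whole number close to the real-valued optimum in practice; near the minimum the curve is flat, so the rate changes little.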
Dictionary search
Prefix-string search
Reading 3.1 and 5.2
Prefix-string Search
Given a dictionary D of K strings, of total
length N, store them in a way that we can
efficiently support prefix searches for a
pattern P over them.
Trie: speeding-up searches
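[Figure: compacted trie over dictionary strings such as systile, syzygetic, syzygial, syzygy, szaibelyite, szczecin; nodes carry numeric labels and edges carry substring labels such as "stile", "zyg", "etic", "ial", "y", "aibelyite", "czecin", "omo"]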
Pro: O(p) search time
Cons: edge + node labels and tree structure
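Prefix search on a trie can be sketched as below. Note the simplification: this is a plain (uncompacted) trie with one character per edge, not the compacted trie drawn on the slide, and the class names are hypothetical.

```python
class TrieNode:
    def __init__(self):
        self.children = {}        # character -> child node
        self.is_word = False      # True if a dictionary string ends here


class Trie:
    def __init__(self, words=()):
        self.root = TrieNode()
        for w in words:
            self.insert(w)

    def insert(self, word):
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True

    def with_prefix(self, prefix):
        # Walk down along the pattern: O(p) steps for a pattern of length p.
        node = self.root
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return []
        # Every string in the reached subtree has the prefix; collect them.
        found, stack = [], [(node, prefix)]
        while stack:
            cur, text = stack.pop()
            if cur.is_word:
                found.append(text)
            for ch, child in cur.children.items():
                stack.append((child, text + ch))
        return sorted(found)


trie = Trie(["systile", "syzygetic", "syzygial", "syzygy", "szaibelyite", "szczecin"])
print(trie.with_prefix("syz"))
```

The per-character nodes make the space overhead visible: a compacted trie merges unary paths into single labeled edges, but still pays for labels and pointers.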
Front-coding: squeezing strings
http://checkmate.com/All_Natural/
http://checkmate.com/All_Natural/Applied.html
http://checkmate.com/All_Natural/Aroma.html
http://checkmate.com/All_Natural/Aroma1.html
http://checkmate.com/All_Natural/Aromatic_Art.html
http://checkmate.com/All_Natural/Ayate.html
http://checkmate.com/All_Natural/Ayer_Soap.html
http://checkmate.com/All_Natural/Ayurvedic_Soap.html
http://checkmate.com/All_Natural/Bath_Salt_Bulk.html
http://checkmate.com/All_Natural/Bath_Salts.html
http://checkmate.com/All/Essence_Oils.html
http://checkmate.com/All/Mineral_Bath_Crystals.html
http://checkmate.com/All/Mineral_Bath_Salt.html
http://checkmate.com/All/Mineral_Cream.html
http://checkmate.com/All/Natural/Washcloth.html...
0 http://checkmate.com/All_Natural/
33 Applied.html
34 roma.html
38 1.html
38 tic_Art.html
34 yate.html
35 er_Soap.html
35 urvedic_Soap.html
33 Bath_Salt_Bulk.html
42 s.html
25 Essence_Oils.html
25 Mineral_Bath_Crystals.html
38 Salt.html
33 Cream.html
3345%
0 http://checkmate.com/All/Natural/Washcloth.html...
…. systile syzygetic syzygial syzygy …. (shared-prefix lengths: 2 5 5)
Gzip may be much better...
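Front-coding stores, for each string in sorted order, the length of the prefix it shares with the previous string plus the remaining suffix. A minimal sketch (function names are hypothetical):

```python
def lcp(a, b):
    """Length of the longest common prefix of a and b."""
    n = min(len(a), len(b))
    i = 0
    while i < n and a[i] == b[i]:
        i += 1
    return i


def front_encode(sorted_strings):
    """Encode each string as (shared prefix length with predecessor, suffix)."""
    pairs, prev = [], ""
    for s in sorted_strings:
        shared = lcp(prev, s)
        pairs.append((shared, s[shared:]))
        prev = s
    return pairs


def front_decode(pairs):
    """Rebuild the strings by a left-to-right scan of the pairs."""
    strings, prev = [], ""
    for shared, suffix in pairs:
        prev = prev[:shared] + suffix
        strings.append(prev)
    return strings


words = ["systile", "syzygetic", "syzygial", "syzygy"]
coded = front_encode(words)
print(coded)   # [(0, 'systile'), (2, 'zygetic'), (5, 'ial'), (5, 'y')]
```

Decoding a string requires scanning from the last fully stored string, which is why a long front-coded run is usually cut into buckets that each restart with an uncompressed string.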
2-level indexing
Disk: front-coded buckets
…. 70 systile 92 zygetic 85 ial 65 y 110 szaibelyite 82 czecin 92 omo ….
Internal Memory: CT on a sample (systile, szaibelyite)
2 advantages:
• Search ≈ typically 1 I/O
• Space ≈ Front-coding over buckets
A disadvantage:
• Trade-off ≈ speed vs space (because of bucket size)
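The two-level scheme can be sketched as follows. Several simplifications are assumed here: the in-memory sample is a sorted list searched with `bisect` rather than a compacted trie, buckets are plain Python lists standing in for front-coded disk blocks, strings are distinct, and the function names are hypothetical.

```python
import bisect


def build_two_level(sorted_strings, bucket_size):
    """Split the sorted dictionary into buckets; sample the first string of each."""
    buckets = [sorted_strings[i:i + bucket_size]
               for i in range(0, len(sorted_strings), bucket_size)]
    sample = [bucket[0] for bucket in buckets]     # kept in internal memory
    return sample, buckets


def prefix_search(sample, buckets, prefix):
    """Return all strings starting with prefix, scanning as few buckets as possible."""
    # Last bucket whose first string is <= prefix; matches cannot start earlier.
    b = max(bisect.bisect_right(sample, prefix) - 1, 0)
    matches = []
    while b < len(buckets):                        # each iteration models one I/O
        for s in buckets[b]:
            if s.startswith(prefix):
                matches.append(s)
            elif s > prefix:
                return matches                     # past the range of matches
        b += 1
    return matches


words = ["systile", "syzygetic", "syzygial", "syzygy", "szaibelyite", "szczecin"]
sample, buckets = build_two_level(words, bucket_size=4)
print(sample)
print(prefix_search(sample, buckets, "syz"))
```

Larger buckets shrink the in-memory sample but lengthen the scan inside each bucket, which is the speed versus space trade-off noted above.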