dictionary search
Dictionary search
Exact string search
Paper on Cuckoo Hashing
Exact String Search
Given a dictionary D of K strings, of total
length N, store them in a way that we can
efficiently support searches for a pattern P
over them.
Hashing
Hashing with chaining
Key issue: a good hash function
Basic assumption: Uniform hashing
Avg #keys per slot = n * (1/m) = n/m = α (load factor)
Search cost = O(1 + α) on average, i.e. O(1) when m = Θ(n)
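A minimal sketch of hashing with chaining, to make the load-factor argument concrete. The class name, the doubling policy and the use of Python's built-in hash() in place of a uniform hash function are assumptions of this sketch, not something the slides prescribe.

```python
class ChainedHashTable:
    """Hash table with chaining: each slot holds a list of (key, value) pairs."""

    def __init__(self, m=8):
        self.m = m                                  # number of slots
        self.n = 0                                  # number of stored keys
        self.slots = [[] for _ in range(m)]

    def _slot(self, key):
        # Assumption: Python's built-in hash() stands in for a uniform hash.
        return hash(key) % self.m

    def insert(self, key, value):
        chain = self.slots[self._slot(key)]
        for i, (k, _) in enumerate(chain):
            if k == key:                            # key already present: overwrite
                chain[i] = (key, value)
                return
        chain.append((key, value))
        self.n += 1
        if self.n > self.m:                         # keep n/m <= 1 by doubling m
            self._resize(2 * self.m)

    def search(self, key):
        # Cost is proportional to the chain length, about n/m on average.
        for k, v in self.slots[self._slot(key)]:
            if k == key:
                return v
        return None

    def _resize(self, new_m):
        items = [kv for chain in self.slots for kv in chain]
        self.m, self.n = new_m, 0
        self.slots = [[] for _ in range(new_m)]
        for k, v in items:
            self.insert(k, v)

    def load_factor(self):
        return self.n / self.m
```

Doubling the table whenever n exceeds m is one common way to keep the load factor bounded by a constant.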
In practice
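A trivial hash function is:
h(k) = k mod m, with m prime
A “provably good” hash is:
h_a(k) = (a_0·k_0 + a_1·k_1 + ... + a_r·k_r) mod m, with m prime
Each a_i is selected at random in [0,m)
The key k is split into chunks k_0, k_1, ..., k_r of ≈ log2 m bits each, so r ≈ L / log2 m
L = max string length; m = table size
m need not be prime: use ((...) mod p) mod m, with p prime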
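The chunk-and-multiply hash can be sketched as below. The function name, the seed parameter, the UTF-8 encoding and the leading sentinel byte (which keeps keys of different lengths distinct) are assumptions of this sketch. The table size m is assumed to be a prime, at least 2.

```python
import random

def make_string_hash(m, max_len, seed=None):
    """Return h(s) = (sum_i a_i * k_i) mod m, with each a_i random in [0, m).

    The key is cut into chunks k_i of floor(log2 m) bits, so every k_i < m.
    """
    rng = random.Random(seed)
    chunk_bits = max(1, m.bit_length() - 1)            # floor(log2 m)
    total_bits = 8 * (max_len + 1)                     # +1 byte for the sentinel
    n_chunks = (total_bits + chunk_bits - 1) // chunk_bits
    a = [rng.randrange(m) for _ in range(n_chunks)]    # the random coefficients a_i
    mask = (1 << chunk_bits) - 1

    def h(s):
        data = s.encode("utf-8")
        if len(data) > max_len:
            raise ValueError("key longer than max_len bytes")
        # Sentinel byte 0x01 in front, so "a" and "\x00a" give different integers.
        x = int.from_bytes(b"\x01" + data, "big")
        total = 0
        for ai in a:
            total += ai * (x & mask)                   # a_i * k_i
            x >>= chunk_bits
        return total % m

    return h
```

For example, `h = make_string_hash(101, 16)` gives a function mapping strings of at most 16 bytes to slots in [0, 101).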
Cuckoo Hashing
2 hash tables, and 2 random choices where an item can be stored
A running example
[Figure: the two tables hold "A B C" and "E D"]
[Figure: F is inserted; the tables now hold "A B F C" and "E D"]
[Figure: G is inserted; the tables now hold "E G B F C" and "A D"]
Cuckoo Hashing Examples
[Figure: the two tables hold "A B C" and "E D F"; G is to be inserted]
Random (bipartite) graph: node=cell, edge=key
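A minimal sketch of cuckoo insertion with two tables, where a newly placed key evicts the occupant, which then moves to its other table. The class name, the bound on evictions (max_kicks), the salted use of Python's hash() and the rebuild-on-failure policy are assumptions of this sketch. None is reserved as the empty marker, so it cannot be stored as a key.

```python
import random

class CuckooHashTable:
    def __init__(self, size=11, max_kicks=32, seed=0):
        self.size = size
        self.max_kicks = max_kicks
        self.rng = random.Random(seed)
        self.tables = [[None] * size, [None] * size]
        self._new_salts()

    def _new_salts(self):
        # Two random salts give two (assumed independent) hash functions.
        self.salts = [self.rng.getrandbits(64), self.rng.getrandbits(64)]

    def _pos(self, i, key):
        return hash((self.salts[i], key)) % self.size

    def contains(self, key):
        # A lookup probes exactly two cells, one per table.
        return any(self.tables[i][self._pos(i, key)] == key for i in (0, 1))

    def insert(self, key):
        if self.contains(key):
            return
        i = 0
        for _ in range(self.max_kicks):
            p = self._pos(i, key)
            # Place key in its cell and pick up the previous occupant (or None).
            key, self.tables[i][p] = self.tables[i][p], key
            if key is None:
                return
            i = 1 - i                     # the evicted key goes to its other table
        self._rehash(key)                 # too many evictions: rebuild everything

    def _rehash(self, pending):
        keys = [k for t in self.tables for k in t if k is not None] + [pending]
        self.size = 2 * self.size + 1
        self.tables = [[None] * self.size, [None] * self.size]
        self._new_salts()
        for k in keys:
            self.insert(k)
```

In the graph view, each insertion follows a path of evictions through cells. The eviction bound is a simple stand-in for detecting that no free cell can be reached.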
Natural Extensions
• More than 2 hashes (choices) per key.
  Very different: hypergraphs instead of graphs.
  Higher memory utilization (3 choices: 90+% in experiments; 4 choices: about 97%), but more insert time (and random access).
• 2 hashes + bins of size B.
  Balanced allocation and tightly O(1)-size bins.
  Insertion sees a tree of possible evict+ins paths.
  More memory... but more local.
Dictionary search
Making one-side errors
Paper on Bloom Filter
Crawling
How to keep track of the URLs visited by a
crawler?
URLs are long
Check should be very fast
Small errors are tolerable (≈ a page not crawled)
Bloom Filter
over crawled URLs
Searching with errors...
Problem: false positives
Not perfectly true but...
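The usual estimate of the false-positive rate can be written out as follows. The symbols are assumptions consistent with the m/n ratio used in these slides: n keys are inserted into a table of m bits using k hash functions. The derivation treats the bits as independent, which is presumably the sense in which it is "not perfectly true".

```latex
\Pr[\text{a given bit is still } 0] = \left(1 - \frac{1}{m}\right)^{kn} \approx e^{-kn/m}

\Pr[\text{false positive}] \approx \left(1 - e^{-kn/m}\right)^{k}
```

A query on a key that was never inserted is a false positive exactly when all k of its bits happen to be set.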
[Plot: false positive rate (y axis, 0 to 0.1) against number of hash functions k (x axis, 0 to 10), for m/n = 8. Optimal k = 8 ln 2 = 5.54...]
We do have an
explicit formula
for the optimal k
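The explicit formula is k = (m/n) ln 2, which minimises (1 - e^{-kn/m})^k. The sketch below evaluates it and builds a small Bloom filter around it. The class and function names are assumptions, and so is the choice to derive the k positions from a single SHA-256 digest by double hashing, a shortcut the slides do not mention.

```python
import hashlib
import math

def optimal_k(m, n):
    """k = (m/n) ln 2, rounded to a usable integer (at least 1)."""
    return max(1, round((m / n) * math.log(2)))

def false_positive_rate(m, n, k):
    """Estimate (1 - e^{-kn/m})^k, assuming independent bits."""
    return (1.0 - math.exp(-k * n / m)) ** k

class BloomFilter:
    def __init__(self, m, k):
        self.m, self.k = m, k
        self.bits = bytearray((m + 7) // 8)       # m bits, all initially 0

    def _positions(self, item):
        # Assumption: double hashing, position_i = (h1 + i * h2) mod m.
        d = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(d[:8], "big")
        h2 = int.from_bytes(d[8:16], "big") | 1   # forced odd, so never zero
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        # True for every added item; may also be True for others (false positive).
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(item))
```

With m/n = 8 the formula gives about 5.5, so either 5 or 6 hash functions would be used in practice. Both give an estimated false-positive rate of roughly 2%.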
Dictionary search
Prefix-string search
Reading 3.1 and 5.2
Prefix-string Search
Given a dictionary D of K strings, of total
length N, store them in a way that we can
efficiently support prefix searches for a
pattern P over them.
Trie: speeding-up searches
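[Figure: compacted trie over the dictionary strings (systile, syzygetic, syzygial, syzygy, szaibelyite, szczecin, ...); recoverable edge labels: s, y, z, stile, zyg, etic, ial, aibelyite, czecin, omo]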
Pro: O(p) search time
Cons: space for edge + node labels and the tree structure
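A minimal sketch of prefix search on a trie. For brevity it uses an uncompacted trie (one character per edge), not the compacted trie of the figure. The class and method names are assumptions of this sketch.

```python
class TrieNode:
    def __init__(self):
        self.children = {}       # char -> TrieNode
        self.is_word = False     # True if a dictionary string ends here

class Trie:
    def __init__(self, words=()):
        self.root = TrieNode()
        for w in words:
            self.insert(w)

    def insert(self, word):
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True

    def with_prefix(self, p):
        """Return all stored strings that start with p, in sorted order."""
        node = self.root
        for ch in p:                         # O(p) descent along the pattern
            node = node.children.get(ch)
            if node is None:
                return []
        out, stack = [], [(node, p)]
        while stack:                         # collect the subtree below p
            node, s = stack.pop()
            if node.is_word:
                out.append(s)
            for ch, child in node.children.items():
                stack.append((child, s + ch))
        return sorted(out)
```

The descent costs O(p) steps. Reporting the matches adds time proportional to the size of the subtree below the node that was reached.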
Front-coding: squeezing strings
http://checkmate.com/All_Natural/
http://checkmate.com/All_Natural/Applied.html
http://checkmate.com/All_Natural/Aroma.html
http://checkmate.com/All_Natural/Aroma1.html
http://checkmate.com/All_Natural/Aromatic_Art.html
http://checkmate.com/All_Natural/Ayate.html
http://checkmate.com/All_Natural/Ayer_Soap.html
http://checkmate.com/All_Natural/Ayurvedic_Soap.html
http://checkmate.com/All_Natural/Bath_Salt_Bulk.html
http://checkmate.com/All_Natural/Bath_Salts.html
http://checkmate.com/All/Essence_Oils.html
http://checkmate.com/All/Mineral_Bath_Crystals.html
http://checkmate.com/All/Mineral_Bath_Salt.html
http://checkmate.com/All/Mineral_Cream.html
http://checkmate.com/All/Natural/Washcloth.html...
0 http://checkmate.com/All_Natural/
33 Applied.html
34 roma.html
38 1.html
38 tic_Art.html
34 yate.html
35 er_Soap.html
35 urvedic_Soap.html
33 Bath_Salt_Bulk.html
42 s.html
25 Essence_Oils.html
25 Mineral_Bath_Crystals.html
38 Salt.html
33 Cream.html
3345%
0 http://checkmate.com/All/Natural/Washcloth.html...
…. systile syzygetic syzygial syzygy …. (prefix shared with the previous string: 2 5 5)
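Front-coding over a sorted list can be sketched as below: each string is stored as the length of the prefix it shares with its predecessor, plus the remaining suffix. The function names are assumptions of this sketch.

```python
def lcp(a, b):
    """Length of the longest common prefix of a and b."""
    n = min(len(a), len(b))
    i = 0
    while i < n and a[i] == b[i]:
        i += 1
    return i

def front_encode(sorted_strings):
    """Encode each string as (shared prefix length, remaining suffix)."""
    out, prev = [], ""
    for s in sorted_strings:
        shared = lcp(prev, s)
        out.append((shared, s[shared:]))
        prev = s
    return out

def front_decode(pairs):
    """Rebuild the strings by copying the shared prefix from the previous one."""
    out, prev = [], ""
    for shared, suffix in pairs:
        prev = prev[:shared] + suffix
        out.append(prev)
    return out

words = ["systile", "syzygetic", "syzygial", "syzygy"]
print(front_encode(words))
# [(0, 'systile'), (2, 'zygetic'), (5, 'ial'), (5, 'y')]
```

Decoding is sequential: reaching the i-th string means decoding all strings before it, which is why long lists are usually cut into buckets.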
Gzip may be much better...
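Each entry is ⟨string length, shared prefix length, suffix⟩:
…. ⟨7, 0, systile⟩ ⟨9, 2, zygetic⟩ ⟨8, 5, ial⟩ ⟨6, 5, y⟩ ⟨11, 0, szaibelyite⟩ ⟨8, 2, czecin⟩ ⟨9, 2, omo….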
2-level indexing
Internal Memory: CT on a sample (systile, szaibelyite)
Disk: front-coded buckets
2 advantages:
• Search ≈ typically 1 I/O
• Space ≈ Front-coding over buckets
A disadvantage:
• Trade-off ≈ speed vs space (because of bucket size)
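The two-level scheme can be sketched as below. For simplicity the in-memory level is a sorted list of bucket heads searched with bisect, not a trie, and the buckets are plain Python lists, not front-coded disk pages. Both simplifications, and all names, are assumptions of this sketch.

```python
import bisect

def build_index(sorted_strings, bucket_size):
    """Split the sorted dictionary into buckets and sample the first string of each."""
    buckets = [sorted_strings[i:i + bucket_size]
               for i in range(0, len(sorted_strings), bucket_size)]
    sample = [b[0] for b in buckets]      # small: kept in internal memory
    return sample, buckets                # buckets stand in for disk pages

def prefix_search(sample, buckets, p):
    """Return all strings with prefix p, reading as few buckets as possible."""
    # The last bucket whose head is <= p is the first that can hold a match.
    i = max(0, bisect.bisect_right(sample, p) - 1)
    out = []
    while i < len(buckets):
        for s in buckets[i]:              # one bucket read ≈ one I/O
            if s.startswith(p):
                out.append(s)
            elif s > p:                   # past the block of matches: stop
                return out
        i += 1
    return out

words = ["systile", "syzygetic", "syzygial", "syzygy", "szaibelyite", "szczecin"]
sample, buckets = build_index(words, bucket_size=4)
print(sample)                                  # ['systile', 'szaibelyite']
print(prefix_search(sample, buckets, "syzyg")) # ['syzygetic', 'syzygial', 'syzygy']
```

A larger bucket_size shrinks the in-memory sample but makes each bucket scan longer, which is the speed versus space trade-off noted above.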