![Page 1: Using Fingerprints in n-Gram Indices](https://reader035.vdocuments.net/reader035/viewer/2022062308/56812cad550346895d916067/html5/thumbnails/1.jpg)
Using Fingerprints in n-Gram Indices
Digital Libraries: Advanced Methods and Technologies, Digital Collections
Stefan [email protected]
17.09.2009
![Page 2: Using Fingerprints in n-Gram Indices](https://reader035.vdocuments.net/reader035/viewer/2022062308/56812cad550346895d916067/html5/thumbnails/2.jpg)
Thursday, September 17, 2009
Overview
• Introduction
  – Inverted Index
  – n-Gram Index
  – Bitmaps
  – Signature Files
• n-Gram Fingerprints
• n-Gram Fingerprints in Combination with Posting Lists
• Fingerprint Compression
• Conclusion and Future Work
![Page 3: Using Fingerprints in n-Gram Indices](https://reader035.vdocuments.net/reader035/viewer/2022062308/56812cad550346895d916067/html5/thumbnails/3.jpg)
INTRODUCTION
![Page 4: Using Fingerprints in n-Gram Indices](https://reader035.vdocuments.net/reader035/viewer/2022062308/56812cad550346895d916067/html5/thumbnails/4.jpg)
Inverted Index
• Very common index structure
• Term-oriented
• Every term is linked to its postings
![Page 5: Using Fingerprints in n-Gram Indices](https://reader035.vdocuments.net/reader035/viewer/2022062308/56812cad550346895d916067/html5/thumbnails/5.jpg)
n-Gram Index
• Uses n-grams as indexing terms
• Any kind of subsequence can be searched
• An n-gram is a subsequence $w$ of a text with $|w| = n$
• Postings for longer subsequences can be calculated (for $|w_1| = 1$):

$$pos(w_1 w_2) = pos(w_1) \cap dec(pos(w_2))$$

$$dec(P) = \{\, x - 1 \mid x \in P \,\}$$
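The posting calculation above can be sketched in a few lines. This is a minimal illustration, not the author's implementation: the names `build_ngram_index`, `dec`, and `pos_concat` are hypothetical, postings are assumed to be `(file_id, offset)` pairs, and the shift by `len(w1)` generalizes the decrement to a first part of any length.

```python
from collections import defaultdict

def build_ngram_index(docs, n=1):
    """Map each n-gram to the set of (file_id, offset) postings where it starts."""
    index = defaultdict(set)
    for file_id, text in enumerate(docs):
        for offset in range(len(text) - n + 1):
            index[text[offset:offset + n]].add((file_id, offset))
    return index

def dec(postings, shift=1):
    """Decrement the offset of every posting by `shift` (assumed semantics of dec)."""
    return {(file_id, offset - shift) for file_id, offset in postings}

def pos_concat(index, w1, w2):
    """Postings of the sequence w1 w2, derived from the postings of w1 and w2."""
    return index.get(w1, set()) & dec(index.get(w2, set()), shift=len(w1))

docs = ["abcab", "bab"]
index = build_ngram_index(docs, n=1)
ab_postings = pos_concat(index, "a", "b")
```

Here `ab_postings` holds every position at which "a" is immediately followed by "b".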
![Page 6: Using Fingerprints in n-Gram Indices](https://reader035.vdocuments.net/reader035/viewer/2022062308/56812cad550346895d916067/html5/thumbnails/6.jpg)
n-Gram Index
• Index structure is very similar to an inverted index
• Searching is more complex
![Page 7: Using Fingerprints in n-Gram Indices](https://reader035.vdocuments.net/reader035/viewer/2022062308/56812cad550346895d916067/html5/thumbnails/7.jpg)
Bitmaps
• Bitmaps are occurrence maps
• Each bit signals an occurrence of a specific term in a specific document
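As a small illustration of an occurrence map, the sketch below stores one integer per term and sets bit d when the term occurs in document d. The function name `build_bitmaps` and the sample documents are assumptions made for this example only.

```python
def build_bitmaps(docs, terms):
    """Return one integer bitmap per term; bit d is set if the term occurs in document d."""
    bitmaps = {}
    for term in terms:
        bits = 0
        for doc_id, text in enumerate(docs):
            if term in text:
                bits |= 1 << doc_id
        bitmaps[term] = bits
    return bitmaps

docs = ["rhinology", "oxyuria", "rhino oxy"]
bitmaps = build_bitmaps(docs, ["rhino", "oxy"])

# A bitwise AND yields the documents that contain both terms.
both = bitmaps["rhino"] & bitmaps["oxy"]
```

Combining terms is a single bitwise operation, which is what makes bitmaps attractive as a filter.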
![Page 8: Using Fingerprints in n-Gram Indices](https://reader035.vdocuments.net/reader035/viewer/2022062308/56812cad550346895d916067/html5/thumbnails/8.jpg)
Signature Files
![Page 9: Using Fingerprints in n-Gram Indices](https://reader035.vdocuments.net/reader035/viewer/2022062308/56812cad550346895d916067/html5/thumbnails/9.jpg)
N-GRAM FINGERPRINT
![Page 10: Using Fingerprints in n-Gram Indices](https://reader035.vdocuments.net/reader035/viewer/2022062308/56812cad550346895d916067/html5/thumbnails/10.jpg)
N-Gram Fingerprint
The idea:
Create fingerprints that:
• Have a fixed size
• Contain information about the postings
![Page 11: Using Fingerprints in n-Gram Indices](https://reader035.vdocuments.net/reader035/viewer/2022062308/56812cad550346895d916067/html5/thumbnails/11.jpg)
N-Gram Fingerprint
A 2D-Fingerprint is a bit-matrix
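$$B_w = \begin{pmatrix} b_{0,0} & \cdots & b_{0,o-1} \\ \vdots & \ddots & \vdots \\ b_{f-1,0} & \cdots & b_{f-1,o-1} \end{pmatrix}$$

$$b_{i,j} = \begin{cases} 1 & \text{if } \exists\, p \in pos(w):\ fileid(p) \bmod f = i \ \wedge\ offset(p) \bmod o = j \\ 0 & \text{otherwise} \end{cases}$$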
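A fingerprint of this kind could be built as follows. This is a sketch under assumptions: postings are taken to be `(file_id, offset)` pairs, `f` and `o` are the numbers of residue classes for the file ID and the offset, and the function name `fingerprint` is hypothetical.

```python
def fingerprint(postings, f, o):
    """Build an f x o bit matrix: bit (i, j) is 1 if some posting has
    file_id mod f == i and offset mod o == j."""
    B = [[0] * o for _ in range(f)]
    for file_id, offset in postings:
        B[file_id % f][offset % o] = 1
    return B

# Two postings mapped into a 4 x 8 fingerprint.
B = fingerprint({(0, 0), (5, 9)}, f=4, o=8)
```

The matrix size depends only on `f` and `o`, not on the number of postings, which is what gives the fingerprint its fixed size.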
![Page 12: Using Fingerprints in n-Gram Indices](https://reader035.vdocuments.net/reader035/viewer/2022062308/56812cad550346895d916067/html5/thumbnails/12.jpg)
N-Gram Fingerprint
• Given two 1-grams $w_1$ and $w_2$ and their fingerprints $B_{w_1}$ and $B_{w_2}$, the fingerprint $B_{w_1 w_2}$ can be approximated:

$$B_{w_1 w_2} \approx B'_{w_1 w_2} = B_{w_1} \wedge B'_{w_2}$$

• $B'_{w_2}$ is constructed by cyclically shifting each column of $B_{w_2}$ by one position to the left.
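The approximation can be tried out on a toy example. The sketch below assumes that rows index the file-ID residue class and columns the offset residue class, that the combination is a bitwise AND, and that the helper names (`fingerprint`, `shift_left`, `approx_concat`) are hypothetical. The result may contain extra bits compared with the exact fingerprint, so it acts as a filter rather than an exact answer.

```python
def fingerprint(postings, f, o):
    """f x o bit matrix over (file_id mod f, offset mod o)."""
    B = [[0] * o for _ in range(f)]
    for file_id, offset in postings:
        B[file_id % f][offset % o] = 1
    return B

def shift_left(B):
    """Cyclically move every column one position to the left."""
    return [row[1:] + row[:1] for row in B]

def approx_concat(B1, B2):
    """Approximate the fingerprint of w1 w2 as B1 AND shifted B2."""
    return [[a & b for a, b in zip(r1, r2)]
            for r1, r2 in zip(B1, shift_left(B2))]

# Postings of "a", "b", and "ab" in the documents "abcab" and "bab".
pos_a = {(0, 0), (0, 3), (1, 1)}
pos_b = {(0, 1), (0, 4), (1, 0), (1, 2)}
pos_ab = {(0, 0), (0, 3), (1, 1)}

approx = approx_concat(fingerprint(pos_a, 2, 4), fingerprint(pos_b, 2, 4))
exact = fingerprint(pos_ab, 2, 4)
```

Every bit set in `exact` is also set in `approx`; candidates reported by the approximation still have to be verified against the text.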
![Page 13: Using Fingerprints in n-Gram Indices](https://reader035.vdocuments.net/reader035/viewer/2022062308/56812cad550346895d916067/html5/thumbnails/13.jpg)
N-Gram Fingerprint
![Page 14: Using Fingerprints in n-Gram Indices](https://reader035.vdocuments.net/reader035/viewer/2022062308/56812cad550346895d916067/html5/thumbnails/14.jpg)
N-Gram Fingerprint
Search Speed

| Query | Bit-matrix | Time for verification | Hits |
|---|---|---|---|
| rhinolo | 219 ms | 94 ms | 18 |
| sanfilipo | 290 ms | 0 ms | 0 |
| itracon | 266 ms | 336 ms | 64 |
| oxyuria | 197 ms | 48 ms | 6 |

Results from the “Online Encyclopedia of Dermatology from P. Altmeyer”
![Page 15: Using Fingerprints in n-Gram Indices](https://reader035.vdocuments.net/reader035/viewer/2022062308/56812cad550346895d916067/html5/thumbnails/15.jpg)
N-GRAM FINGERPRINTS IN COMBINATION WITH POSTING LISTS
![Page 16: Using Fingerprints in n-Gram Indices](https://reader035.vdocuments.net/reader035/viewer/2022062308/56812cad550346895d916067/html5/thumbnails/16.jpg)
Combining Fingerprints and Posting Lists
By combining fingerprints and posting lists:
• No verification step is needed
• Posting lists are partitioned into smaller subsets: each bit of the fingerprint corresponds to a separate posting list
• The cost of intersecting posting lists is reduced
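One way to picture the combination is an index in which every fingerprint bit owns its own small posting list. The sketch below is an assumption-laden illustration for 1-grams only; `build_partitioned_index` and `search_bigram` are hypothetical names, and the layout is not taken from the slides.

```python
from collections import defaultdict

def build_partitioned_index(docs, f, o):
    """index[gram][(i, j)] holds the postings (file_id, offset) with
    file_id mod f == i and offset mod o == j."""
    index = defaultdict(lambda: defaultdict(set))
    for file_id, text in enumerate(docs):
        for offset, gram in enumerate(text):
            index[gram][(file_id % f, offset % o)].add((file_id, offset))
    return index

def search_bigram(index, w1, w2, o):
    """Find positions of w1 w2 by intersecting only matching sub-lists."""
    hits = set()
    buckets1 = index.get(w1, {})
    buckets2 = index.get(w2, {})
    for (i, j), p1 in buckets1.items():
        # w2 must start one offset later, so it lives in column (j + 1) mod o.
        p2 = buckets2.get((i, (j + 1) % o))
        if p2:
            hits |= {(fid, off) for fid, off in p1 if (fid, off + 1) in p2}
    return hits

index = build_partitioned_index(["abcab", "bab"], f=2, o=4)
hits = search_bigram(index, "a", "b", o=4)
```

Because real postings are intersected, the hits are exact and no separate verification pass over the text is needed; the fingerprint bits only decide which small sub-lists are touched.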
![Page 17: Using Fingerprints in n-Gram Indices](https://reader035.vdocuments.net/reader035/viewer/2022062308/56812cad550346895d916067/html5/thumbnails/17.jpg)
Combining Fingerprints and Posting Lists
![Page 18: Using Fingerprints in n-Gram Indices](https://reader035.vdocuments.net/reader035/viewer/2022062308/56812cad550346895d916067/html5/thumbnails/18.jpg)
Managing n-Gram Posting Lists
• A very large number of posting subsets has to be managed. For example:
  – 1024 residue classes for the fileID
  – 128 residue classes for the offset
  – 14,000 different n-grams
• Subsets are stored in a hash table
• The hash value is a function of the residue classes
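To get a feeling for the numbers, the sketch below derives one key per posting subset from the n-gram and the two residue classes. The key layout, the integer `gram_id`, and the names `subset_key` and `add_posting` are assumptions for illustration; the slides only state that the hash value is a function of the residue classes.

```python
F_CLASSES = 1024   # residue classes for the file ID (from the example above)
O_CLASSES = 128    # residue classes for the offset
N_GRAMS = 14_000   # number of different n-grams

def subset_key(gram_id, file_id, offset):
    """Hypothetical key: a unique integer per (n-gram, file class, offset class)."""
    return (gram_id * F_CLASSES + file_id % F_CLASSES) * O_CLASSES + offset % O_CLASSES

subsets = {}  # only non-empty subsets are stored

def add_posting(gram_id, file_id, offset):
    subsets.setdefault(subset_key(gram_id, file_id, offset), []).append((file_id, offset))

add_posting(gram_id=7, file_id=2050, offset=300)
max_subsets = N_GRAMS * F_CLASSES * O_CLASSES
```

With the figures from the example, `max_subsets` is 1,835,008,000 possible subsets, which is why a sparse hash structure is used instead of a dense array.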
![Page 19: Using Fingerprints in n-Gram Indices](https://reader035.vdocuments.net/reader035/viewer/2022062308/56812cad550346895d916067/html5/thumbnails/19.jpg)
Managing n-Gram Posting Lists
![Page 20: Using Fingerprints in n-Gram Indices](https://reader035.vdocuments.net/reader035/viewer/2022062308/56812cad550346895d916067/html5/thumbnails/20.jpg)
Managing n-Gram Posting Lists
[Figure: hash collisions and collision resolving. x-axis: number of collisions, comparisons, and comparisons after sorting (0 to 140); y-axis: frequency (0 to 40000)]
![Page 21: Using Fingerprints in n-Gram Indices](https://reader035.vdocuments.net/reader035/viewer/2022062308/56812cad550346895d916067/html5/thumbnails/21.jpg)
Results
• Performance improved by 40% compared to the setup without posting lists
| Query | Bit-matrix | Time for verification | Hits |
|---|---|---|---|
| rhinolo | 230 ms | 10 ms | 18 |
| sanfilipo | 271 ms | 0 ms | 0 |
| itracon | 245 ms | 15 ms | 64 |
| oxyuria | 210 ms | 12 ms | 6 |
![Page 22: Using Fingerprints in n-Gram Indices](https://reader035.vdocuments.net/reader035/viewer/2022062308/56812cad550346895d916067/html5/thumbnails/22.jpg)
FINGERPRINT COMPRESSION
![Page 23: Using Fingerprints in n-Gram Indices](https://reader035.vdocuments.net/reader035/viewer/2022062308/56812cad550346895d916067/html5/thumbnails/23.jpg)
Fingerprint Compression
• Fingerprints with high or low densities do not contain much information
• Fingerprints can be compressed by reducing the resolution
• Dictionary based compression
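The slides do not spell out how the resolution is reduced. One plausible reading, shown here purely as an assumption, is to fold the matrix by OR-ing column j with column j + o/2, which is equivalent to using half as many offset residue classes. The names `density`, `fold_columns`, and `maybe_compress`, and the default thresholds, are hypothetical.

```python
def density(B):
    """Fraction of bits that are set in the fingerprint."""
    total = sum(len(row) for row in B)
    return sum(map(sum, B)) / total

def fold_columns(B):
    """Halve the number of columns by OR-ing column j with column j + o/2
    (assumes an even number of columns)."""
    half = len(B[0]) // 2
    return [[row[j] | row[j + half] for j in range(half)] for row in B]

def maybe_compress(B, low=0.05, high=0.95):
    """Fold only fingerprints whose density is very low or very high."""
    d = density(B)
    return fold_columns(B) if d <= low or d >= high else B

sparse = [[1, 0, 0, 0], [0, 0, 0, 0]]
folded = maybe_compress(sparse, low=0.2, high=0.8)
```

Folding never clears a bit that should be set, so the folded fingerprint remains a valid filter; it only becomes less selective.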
![Page 24: Using Fingerprints in n-Gram Indices](https://reader035.vdocuments.net/reader035/viewer/2022062308/56812cad550346895d916067/html5/thumbnails/24.jpg)
Fingerprint Compression
• Results: Fingerprint convolution

| Density threshold for convolution | Performance loss | Fingerprint index reduction |
|---|---|---|
| no convolution | 0 % | 0 % |
| 0-0.025 and 0.975-1 | 3.1 % | 23 % |
| 0-0.05 and 0.95-1 | 3.2 % | 27 % |
| 0-0.1 and 0.9-1 | 10 % | 29 % |
| 0-0.2 and 0.8-1 | 25 % | 31 % |

• In combination with dictionary-based compression, the index size is reduced by an additional 30%
![Page 25: Using Fingerprints in n-Gram Indices](https://reader035.vdocuments.net/reader035/viewer/2022062308/56812cad550346895d916067/html5/thumbnails/25.jpg)
CONCLUSION AND FUTURE WORK
![Page 26: Using Fingerprints in n-Gram Indices](https://reader035.vdocuments.net/reader035/viewer/2022062308/56812cad550346895d916067/html5/thumbnails/26.jpg)
Conclusion
• Fingerprints improve the scalability of n-gram indices
• Fingerprints improve the performance of n-gram indices
• The index structure can be adjusted to user behavior, so that common queries can be processed more efficiently
• The fingerprints can be stored in a compressed index with only a minimal loss of performance
![Page 27: Using Fingerprints in n-Gram Indices](https://reader035.vdocuments.net/reader035/viewer/2022062308/56812cad550346895d916067/html5/thumbnails/27.jpg)
Future Work
• Combination of a term-based inverted index and an n-gram fingerprint index
• Profit from the advantages of using both terms and n-grams as indexing terms
  – Substring search
  – Ranking
  – Thesaurus information
![Page 28: Using Fingerprints in n-Gram Indices](https://reader035.vdocuments.net/reader035/viewer/2022062308/56812cad550346895d916067/html5/thumbnails/28.jpg)
Thank You!