Using Fingerprints in n-Gram Indices

Digital Libraries: Advanced Methods and Technologies, Digital Collections

Stefan Selbach
selbach@informatik.uni-wuerzburg.de

17.09.2009


Overview

• Introduction
  – Inverted Index
  – n-Gram Index
  – Bitmaps
  – Signature Files

• n-Gram Fingerprints

• n-Gram Fingerprints in Combination with Posting Lists

• Fingerprint Compression

• Conclusion and Future Work


INTRODUCTION


Inverted Index

• Very common index structure
• Term-oriented
• Every term is linked to its postings


n-Gram Index

• Uses n-grams as indexing terms
• Any kind of subsequence can be searched
• An n-gram is a subsequence $w$ of a text with $|w| = n$
• Postings for longer subsequences can be calculated:

$$pos(w_1 w_2) = pos(w_1) \cap dec(pos(w_2))$$

$$dec(P) = \{\, x \mid x + |w_1| \in P \,\}$$
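The posting calculation for longer subsequences can be illustrated with a minimal Python sketch. It assumes that postings are `(file_id, offset)` pairs and that the index is built over 1-grams; the helper names `build_index`, `dec`, and `pos` are chosen for illustration and do not come from the talk.

```python
from collections import defaultdict

def build_index(texts):
    """Map each 1-gram (character) to its set of (file_id, offset) postings."""
    index = defaultdict(set)
    for file_id, text in enumerate(texts):
        for offset, ch in enumerate(text):
            index[ch].add((file_id, offset))
    return index

def dec(postings, shift=1):
    """Decrement every offset by `shift`, dropping offsets that become negative."""
    return {(f, o - shift) for f, o in postings if o - shift >= 0}

def pos(index, query):
    """Postings of a non-empty query, obtained by intersecting shifted 1-gram postings."""
    result = set(index.get(query[0], set()))
    for k, ch in enumerate(query[1:], start=1):
        # The k-th character must occur k positions after the start of the match.
        result &= dec(index.get(ch, set()), k)
    return result

texts = ["rhinology", "no rhino"]
index = build_index(texts)
hits = pos(index, "rhino")
```

Each additional character costs one shifted intersection, which is why searching is more expensive than a single lookup in a term-based inverted index.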


• Index structure is very similar to an inverted index

• Searching is more complex


Bitmaps

• Bitmaps are occurrence maps
• Each bit signals an occurrence of a specific term in a specific document
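As a small illustration of an occurrence map, the following Python sketch keeps one integer per term and uses bit `d` of that integer to signal that the term occurs in document `d`. Whitespace tokenization and the function names are assumptions made for this example only.

```python
def build_bitmaps(docs):
    """Return {term: bitmap}; bit d of the bitmap is set if the term occurs in document d."""
    bitmaps = {}
    for doc_id, text in enumerate(docs):
        for term in set(text.split()):
            bitmaps[term] = bitmaps.get(term, 0) | (1 << doc_id)
    return bitmaps

def docs_with_all(bitmaps, terms, num_docs):
    """AND the bitmaps of all terms and list the documents whose bit survives."""
    mask = (1 << num_docs) - 1
    for term in terms:
        mask &= bitmaps.get(term, 0)
    return [d for d in range(num_docs) if (mask >> d) & 1]

docs = ["skin rash therapy", "rash on skin", "therapy options"]
bitmaps = build_bitmaps(docs)
both = docs_with_all(bitmaps, ["skin", "rash"], len(docs))
```

A conjunctive query reduces to a bitwise AND, but the bitmap only says that a term occurs in a document, not where.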


Signature Files


N-GRAM FINGERPRINT


N-Gram Fingerprint

The idea:

Create fingerprints that:

• Have a fixed size
• Contain information about the postings


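A 2D fingerprint is a bit matrix with $f$ rows (residue classes of the file ID) and $o$ columns (residue classes of the offset):

$$B_w = \begin{pmatrix} b_{0,0} & \cdots & b_{0,o-1} \\ \vdots & \ddots & \vdots \\ b_{f-1,0} & \cdots & b_{f-1,o-1} \end{pmatrix}$$

$$b_{i,j} = \begin{cases} 1 & \text{if } \exists\, p \in pos(w):\ fileid(p) \bmod f = i \ \wedge\ offset(p) \bmod o = j \\ 0 & \text{otherwise} \end{cases}$$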
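The definition of the 2D fingerprint can be sketched in a few lines of Python. The sketch assumes postings are `(file_id, offset)` pairs and stores the matrix as a list of rows; `f` and `o` are the numbers of residue classes for the file ID and the offset, and the function name is illustrative.

```python
def fingerprint(postings, f, o):
    """Build an f x o bit matrix: bit (i, j) is set if some posting has
    file_id % f == i and offset % o == j."""
    B = [[0] * o for _ in range(f)]
    for file_id, offset in postings:
        B[file_id % f][offset % o] = 1
    return B

postings = {(0, 0), (5, 9), (2, 3)}
B = fingerprint(postings, f=4, o=8)
```

The matrix has the same size for every n-gram regardless of how many postings it has, which is the fixed-size property the fingerprints are meant to provide.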


• Given two 1-grams and their fingerprints $B_{w_1}$ and $B_{w_2}$, the fingerprint $B_{w_1 w_2}$ can be approximated:

$$B_{w_1 w_2} \approx B'_{w_1 w_2} = B_{w_1} \wedge B'_{w_2}$$

• $B'_{w_2}$ is constructed by cyclically shifting each column of $B_{w_2}$ by one position to the left.
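A minimal Python sketch of this approximation follows. It assumes fingerprints are lists of rows with one column per offset residue class, so that shifting the columns left by one corresponds to decrementing the offset modulo `o`; all function names are illustrative.

```python
def fingerprint(postings, f, o):
    """f x o bit matrix over (file_id % f, offset % o)."""
    B = [[0] * o for _ in range(f)]
    for file_id, offset in postings:
        B[file_id % f][offset % o] = 1
    return B

def shift_left(B, k=1):
    """Cyclically move every column k positions to the left."""
    o = len(B[0])
    return [[row[(j + k) % o] for j in range(o)] for row in B]

def combine(B1, B2):
    """Approximate the fingerprint of w1w2 as B1 AND shifted B2."""
    B2_shifted = shift_left(B2)
    return [[a & b for a, b in zip(r1, r2)] for r1, r2 in zip(B1, B2_shifted)]

# 'a' occurs at offset 4 and 'b' at offset 1 of file 0, so "ab" never occurs.
B_a = fingerprint({(0, 4)}, f=2, o=4)
B_b = fingerprint({(0, 1)}, f=2, o=4)
candidates = combine(B_a, B_b)
```

In this example a bit survives in `candidates` even though "ab" does not occur, because offsets 4 and 0 share a residue class. The combined matrix is therefore a superset of the true fingerprint, and surviving bits have to be verified.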


Search Speed

Query       Bit-matrix   Time for verification   Hits
rhinolo     219 ms       94 ms                   18
sanfilipo   290 ms       0 ms                    0
itracon     266 ms       336 ms                  64
oxyuria     197 ms       48 ms                   6

Results from the “Online Encyclopedia of Dermatology” by P. Altmeyer


N-GRAM FINGERPRINTS IN COMBINATION WITH POSTING LISTS


Combining Fingerprints and Posting Lists

By combining fingerprints and posting lists:

• No verification step is needed
• Posting lists are partitioned into smaller subsets. Each bit of the fingerprint corresponds to a separate posting list
• Costs for the intersection of posting lists are reduced
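One way to picture the partitioning is the following Python sketch. It assumes each fingerprint bit `(i, j)` owns the subset of postings whose file ID and offset fall into those residue classes, and that a two-character query only needs to intersect matching subsets; the names `partition` and `pos_pair` are hypothetical.

```python
from collections import defaultdict

def partition(postings, f, o):
    """Split a posting list into one subset per fingerprint bit (i, j)."""
    buckets = defaultdict(set)
    for file_id, offset in postings:
        buckets[(file_id % f, offset % o)].add((file_id, offset))
    return buckets

def pos_pair(buckets1, buckets2, o):
    """Exact postings of w1w2 for 1-grams w1, w2, intersecting only matching subsets."""
    hits = set()
    for (i, j), subset1 in buckets1.items():
        # A successor of a posting in class (i, j) can only lie in class (i, j + 1 mod o).
        subset2 = buckets2.get((i, (j + 1) % o))
        if subset2:
            hits |= {(fid, off) for fid, off in subset1 if (fid, off + 1) in subset2}
    return hits

buckets_a = partition({(0, 0), (0, 4), (1, 2)}, f=2, o=4)
buckets_b = partition({(0, 1), (1, 7)}, f=2, o=4)
hits = pos_pair(buckets_a, buckets_b, o=4)
```

Because the subsets hold the actual postings, the result is exact and needs no separate verification, and each intersection touches only a small part of the full posting lists.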


Managing n-Gram Posting Lists

• A very large number of posting subsets has to be managed

For example:

1024 residue classes for the file ID
128 residue classes for the offset
14,000 different n-grams

• Subsets are stored in a hash
• The hash value is a function of the residue classes
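With the example figures, up to 1024 × 128 × 14,000 = 1,835,008,000 subsets are possible, which is why they are addressed through a hash rather than stored individually. The sketch below shows one possible hash value computed from the residue classes; the concrete formula, the numeric `ngram_id`, and the `table_size` parameter are assumptions for illustration, not the scheme used in the talk.

```python
F = 1024   # residue classes for the file ID (example value from the slide)
O = 128    # residue classes for the offset (example value from the slide)

def subset_key(ngram_id, file_id, offset):
    """Unique integer per (n-gram, file-ID class, offset class) triple."""
    return (ngram_id * F + file_id % F) * O + offset % O

def slot(ngram_id, file_id, offset, table_size):
    """Hypothetical hash slot: the key reduced modulo the table size."""
    return subset_key(ngram_id, file_id, offset) % table_size

largest_key = subset_key(13_999, 1023, 127)
```

Since the number of slots is much smaller than the number of possible keys, different subsets can land in the same slot, so some form of collision resolution is needed.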


[Figure: hash collisions and collision resolving. y-axis: frequency (0 to 40,000); x-axis: number of collisions, comparisons, and comparisons after sorting (0 to 140).]


Results

• Performance improved by 40% compared to the setup without posting lists

Query       Bit-matrix   Time for verification   Hits
rhinolo     230 ms       10 ms                   18
sanfilipo   271 ms       0 ms                    0
itracon     245 ms       15 ms                   64
oxyuria     210 ms       12 ms                   6


FINGERPRINT COMPRESSION


Fingerprint Compression

• Fingerprints with high or low densities do not contain much information

• Fingerprints can be compressed by reducing the resolution

• Dictionary based compression
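One plausible reading of "reducing the resolution" is to fold the bit matrix by OR-ing columns whose offset residues coincide under a smaller modulus. The Python sketch below assumes this interpretation, an even number of columns, and illustrative function names; it is not taken from the talk.

```python
def fingerprint(postings, f, o):
    """f x o bit matrix over (file_id % f, offset % o)."""
    B = [[0] * o for _ in range(f)]
    for file_id, offset in postings:
        B[file_id % f][offset % o] = 1
    return B

def density(B):
    """Fraction of bits that are set."""
    return sum(map(sum, B)) / (len(B) * len(B[0]))

def fold_offsets(B):
    """Halve the number of columns by OR-ing column j with column j + o/2."""
    half = len(B[0]) // 2
    return [[row[j] | row[j + half] for j in range(half)] for row in B]

postings = {(0, 0), (5, 9), (2, 3), (7, 14)}
full = fingerprint(postings, f=4, o=8)
folded = fold_offsets(full)
```

Folding from `o` to `o / 2` columns yields exactly the fingerprint that would have been built with `o / 2` offset classes, so the folded matrix can still be used for filtering, only with more false positives.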


• Results: Fingerprint convolution

Density threshold for convolution   Performance loss   Fingerprint index reduction
no convolution                      0 %                0 %
0-0.025 and 0.975-1                 3.1 %              23 %
0-0.05 and 0.95-1                   3.2 %              27 %
0-0.1 and 0.9-1                     10 %               29 %
0-0.2 and 0.8-1                     25 %               31 %

• In combination with the dictionary-based compression, the index size is reduced by an additional 30%


CONCLUSION AND FUTURE WORK


Conclusion

• Fingerprints improve the scalability of n-gram indices
• Fingerprints improve the performance of n-gram indices
• The index structure can be adjusted to user behavior, so that common queries can be processed more efficiently
• The fingerprints can be stored in a compressed index while losing only a minimum of performance


Future Work

• Combination of term based inverted index and n-Gram fingerprint index

• Profit from the advantages of using both terms and n-grams as indexing terms
  – Substring search
  – Ranking
  – Thesaurus information


Thank You!