IR presentation

Post on 25-Dec-2015


DESCRIPTION

chapter 9

TRANSCRIPT

By Bushra Al-Za’areer

introducing

Signature File – Suffix Tree & Suffix Array

Chapter 9 Indexing & Searching


1 Signature File

2 Suffix Tree

3 Suffix Array

1 Signature File

Signature File chapter 9

• Consider:
• H(information) = 010001
• H(text) = 010010
• H(data) = 110000
• H(retrieval) = 100010

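• The block signatures of a document D containing the text “textual retrieval and information retrieval” (after removing stop words and stemming), for a block size of two terms, are the bitwise OR of the hashes of the terms in each block:
o B1D = H(text) OR H(retrieval) = 110010
o B2D = H(information) OR H(retrieval) = 110011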
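The block signatures above can be reproduced with a minimal sketch. The hash values are the ones given on the slide; the names `H` and `block_signatures` are illustrative assumptions, not part of the original material.

```python
# Term hashes copied from the example; H is a lookup table, not a real hash function.
H = {
    "information": 0b010001,
    "text": 0b010010,
    "data": 0b110000,
    "retrieval": 0b100010,
}

def block_signatures(terms, block_size=2):
    """OR together the hashes of each consecutive run of `block_size` terms."""
    signatures = []
    for start in range(0, len(terms), block_size):
        signature = 0
        for term in terms[start:start + block_size]:
            signature |= H[term]
        signatures.append(signature)
    return signatures

# "textual retrieval and information retrieval" after stop-word removal and stemming
terms = ["text", "retrieval", "information", "retrieval"]
print([format(s, "06b") for s in block_signatures(terms)])  # ['110010', '110011']
```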


To search for a given term, we check whether the term’s bit string could be “inside” the block signatures:
• Consider we are searching for “text” in document D
o H(text) = 010010 and B1D = 110010
o H(text) bit-wise-AND B1D = 010010 = H(text)
o Therefore “text” could be in B1D (it is in this particular case)

• Consider we are now searching for “data”
o H(data) bit-wise-AND B1D = 110000 = H(data)
o H(data) bit-wise-AND B2D = 110000 = H(data)
o Yet “data” is not in either block!

• Signature files may yield false hits …
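The membership test and the false hit can be checked with a short sketch. The bit strings are those from the example; the function name `may_contain` is an assumption introduced here for illustration.

```python
H_TEXT = 0b010010
H_DATA = 0b110000
H_INFORMATION = 0b010001
B1D = 0b110010  # block 1 holds "text" and "retrieval"
B2D = 0b110011  # block 2 holds "information" and "retrieval"

def may_contain(block_signature, term_hash):
    """True when every bit set in term_hash is also set in block_signature."""
    return (block_signature & term_hash) == term_hash

print(may_contain(B1D, H_TEXT))         # True: a real match
print(may_contain(B1D, H_DATA))         # True: a false hit, "data" is not in block 1
print(may_contain(B2D, H_DATA))         # True: a false hit, "data" is not in block 2
print(may_contain(B1D, H_INFORMATION))  # False: a definite miss
```

A `False` answer is always reliable, while a `True` answer only means the block is a candidate that still has to be verified against the text.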


How can we keep the probability of a false alarm low? How can we predict how good a signature is?

o A false drop occurs when a document signature matches a query’s signature but the query word does not match any word in the document.

• The rate of false drops depends on:
o The size of the signature.
o The number of words per block.
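To see how these two factors interact, here is a rough sketch of a commonly used estimate. It assumes each word sets `bits_per_word` bits chosen uniformly and independently in a signature of `signature_bits` bits; that assumption, the function name, and the parameter names are not from the slides.

```python
def false_drop_estimate(signature_bits, words_per_block, bits_per_word):
    """Approximate probability that a word absent from a block still matches it."""
    # Probability that one particular bit of the block signature ends up set.
    p_bit_set = 1.0 - (1.0 - 1.0 / signature_bits) ** (words_per_block * bits_per_word)
    # A false drop needs all bits of the query word to be set by other words.
    return p_bit_set ** bits_per_word

# Larger signatures lower the estimate; more words per block raise it.
for bits in (6, 64, 512):
    for words in (2, 8):
        print(bits, words, false_drop_estimate(bits, words, bits_per_word=2))
```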


• Inverted or Signature? Inverted Files:

1. Slower retrieval
2. More accurate
3. Easier to maintain

• In fact, inverted files are still the most popular storage for information retrieval.

2 Suffix Tree summary

Chapter 9


• Example:

3 Suffix Array summary

Chapter 9


• Suffix tree and suffix array indexes treat the text as one long string. Each position in the text is considered a text suffix, that is, the string running from that position to the end of the text. Each suffix is thus uniquely identified by its position.

• Index points are selected from the text; they mark the beginnings of the text positions that will be retrievable.
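A small sketch of index points and the suffixes they identify. The sample text and the choice of placing an index point at the start of every word are assumptions made for illustration.

```python
import re

text = "a text has many words"

# Assumption: one index point at the beginning of each word.
index_points = [m.start() for m in re.finditer(r"\w+", text)]

# Each index point identifies the suffix that runs from it to the end of the text.
suffixes = {pos: text[pos:] for pos in index_points}

print(index_points)  # [0, 2, 7, 11, 16]
print(suffixes[7])   # has many words
```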

• This structure can be used to index words or characters.


• Suffix arrays provide essentially the same functionality as suffix trees with much lower space requirements.

• A suffix array is simply an array containing all the pointers to the text suffixes listed in lexicographical order.

• Suffix arrays are designed to allow binary search, performed by comparing the query against the text suffix that each pointer refers to.
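The construction and the binary search can be sketched as follows. This is a naive illustration (sorting whole suffixes is simple but not efficient for large texts), and the function names `build_suffix_array` and `find_occurrences` are assumptions, not taken from the chapter.

```python
def build_suffix_array(text, index_points=None):
    """Return the index points sorted by the suffix that starts at each one."""
    if index_points is None:
        index_points = range(len(text))  # character-level index
    return sorted(index_points, key=lambda pos: text[pos:])

def find_occurrences(text, suffix_array, pattern):
    """Return the sorted text positions whose suffix starts with `pattern`."""
    def prefix(i):
        start = suffix_array[i]
        return text[start:start + len(pattern)]

    # First entry whose prefix is >= pattern.
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo

    # First entry whose prefix is > pattern.
    hi = len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(suffix_array[first:lo])

text = "banana"
suffix_array = build_suffix_array(text)
print(suffix_array)                                 # [5, 3, 1, 0, 4, 2]
print(find_occurrences(text, suffix_array, "ana"))  # [1, 3]
```

Because every match of a pattern is a prefix of some suffix, all occurrences sit in one contiguous range of the array, which is why two binary searches are enough.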


• With suffix trees and suffix arrays we can search for:
o Words
o Prefixes & suffixes
o Phrases

Any questions? Ask me!

Chapter 9

The most popular storage for information retrieval

inverted files…

Conclusion

Thank You
