Indexing and Searching

Post on 22-Jan-2016

DESCRIPTION

Indexing and Searching. Modern Information Retrieval by R. Baeza-Yates and B. Ribeiro-Neto, Chapter 8. Outline: Inverted Files, Other Indices for Text, Sequential Searching, Pattern Matching, Compression.

TRANSCRIPT

1

Indexing and Searching

Modern Information Retrieval

by R. Baeza-Yates and B. Ribeiro-Neto

Chapter 8

2

Outline

Inverted Files
Other Indices for Text
Sequential Searching
Pattern Matching
Compression

3

Inverted Files

An inverted file (or inverted index) is a word-oriented mechanism for indexing a text collection in order to speed up the searching task.

Structure: vocabulary and occurrences

Block addressing: the text is divided into blocks, and the occurrences point to the blocks

Full inverted indices: the occurrences are exact positions in the text
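
As a rough illustration of the two addressing schemes, the sketch below builds both kinds of occurrence lists for a toy text. The function names, the regular-expression tokenizer, and the block size of 16 characters are assumptions made for this example only.

```python
import re
from collections import defaultdict

WORD = re.compile(r"\w+")


def build_full_index(text):
    """Full inverted index: word -> exact character offsets (0-based)."""
    index = defaultdict(list)
    for match in WORD.finditer(text):
        index[match.group().lower()].append(match.start())
    return dict(index)


def build_block_index(text, block_size):
    """Block addressing: word -> numbers of the blocks it starts in."""
    index = defaultdict(list)
    for match in WORD.finditer(text):
        block = match.start() // block_size
        blocks = index[match.group().lower()]
        if not blocks or blocks[-1] != block:  # store each block once
            blocks.append(block)
    return dict(index)


text = "This is a text. A text has many words. Words are made from letters."
full = build_full_index(text)
blocks = build_block_index(text, block_size=16)

print(full["text"], full["words"])      # [10, 18] [32, 39]
print(blocks["text"], blocks["words"])  # [0, 1] [2]
```

The block index is smaller because two occurrences in the same block collapse into one entry, at the price of scanning the block to find the exact position.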

4

5

6

Inverted Files

The search algorithm on an inverted index:
  Vocabulary search
  Retrieval of occurrences
  Manipulation of occurrences

Construction (split the index into two files):
  Posting file: the lists of occurrences are stored contiguously
  Vocabulary file: the vocabulary is stored in lexicographical order, and each word points to its list in the posting file
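
A minimal sketch of the two-file layout and the three search steps follows. In-memory lists stand in for the files, and the names `build_files`, `occurrences`, and `both` are hypothetical; the intersection is just one possible way to manipulate the occurrence lists.

```python
from bisect import bisect_left


def build_files(index):
    """Lay out a word -> occurrences mapping as a sorted vocabulary
    plus one contiguous posting list."""
    vocabulary = []  # entries: (word, offset into postings, list length)
    postings = []    # all occurrence lists, stored back to back
    for word in sorted(index):
        vocabulary.append((word, len(postings), len(index[word])))
        postings.extend(index[word])
    return vocabulary, postings


def occurrences(vocabulary, postings, word):
    """Steps 1 and 2: binary-search the vocabulary, then read the list."""
    i = bisect_left(vocabulary, (word,))
    if i == len(vocabulary) or vocabulary[i][0] != word:
        return []
    _, start, length = vocabulary[i]
    return postings[start:start + length]


def both(vocabulary, postings, word_a, word_b):
    """Step 3: manipulate the lists, here by intersecting them."""
    seen = set(occurrences(vocabulary, postings, word_a))
    return [b for b in occurrences(vocabulary, postings, word_b) if b in seen]


# Block numbers per word, as a block-addressing index would store them.
index = {"text": [0, 1], "many": [1], "words": [2], "letters": [3]}
vocabulary, postings = build_files(index)

print(postings)                                       # [3, 1, 0, 1, 2]
print(occurrences(vocabulary, postings, "text"))      # [0, 1]
print(both(vocabulary, postings, "text", "many"))     # [1]
```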

7

8

Inverted Files

For large texts:
  Partial indices are built and then merged
  Merging two indices consists of merging the sorted vocabularies.
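
The merge step can be sketched as an ordinary merge of two sorted lists. This example assumes each partial index is a list of (word, occurrences) pairs sorted by word, and that the first index covers an earlier part of the text than the second, so the lists of a shared word can simply be concatenated. The name `merge_indices` is hypothetical.

```python
def merge_indices(first, second):
    """Merge two partial indices given as sorted (word, occurrences) lists."""
    merged = []
    i = j = 0
    while i < len(first) and j < len(second):
        word_a, occ_a = first[i]
        word_b, occ_b = second[j]
        if word_a == word_b:
            # Same word in both: occurrences of `first` come earlier.
            merged.append((word_a, occ_a + occ_b))
            i += 1
            j += 1
        elif word_a < word_b:
            merged.append(first[i])
            i += 1
        else:
            merged.append(second[j])
            j += 1
    merged.extend(first[i:])   # at most one of these two is non-empty
    merged.extend(second[j:])
    return merged


part1 = [("a", [8]), ("is", [5]), ("text", [10]), ("this", [0])]
part2 = [("a", [16]), ("has", [23]), ("text", [18])]
merged = merge_indices(part1, part2)
print(merged)
```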

9

10

Other Indices for Text

Suffix Trees
Suffix Arrays
Signature Files

11

Suffix Trees and Suffix Arrays

Each position in the text is considered as a text suffix

Index points are selected from the text; they point to the beginning of the text positions which will be retrievable

12

13

Suffix arrays

The main drawback of suffix arrays is their costly construction process.

They allow binary searches, done by comparing the contents of each pointer.

Supra-indices (for large suffix arrays)
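
To make the binary search concrete, here is a small sketch that takes word beginnings as index points, sorts them by the suffix each one starts, and then finds all suffixes that begin with a pattern. The construction shown is the naive one (sorting with full suffix comparisons), chosen for brevity rather than efficiency; the function names are assumptions, and supra-indices are not shown.

```python
import re


def build_suffix_array(text):
    """Sort the index points (word beginnings) by their text suffix."""
    points = [m.start() for m in re.finditer(r"\w+", text)]
    return sorted(points, key=lambda p: text[p:])


def search(text, suffix_array, pattern):
    """Return the index points whose suffix starts with `pattern`."""
    def prefix(k):
        p = suffix_array[k]
        return text[p:p + len(pattern)]

    # First binary search: leftmost suffix whose prefix is >= pattern.
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo
    # Second binary search: leftmost suffix whose prefix is > pattern.
    hi = len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(suffix_array[start:lo])


text = "This is a text. A text has many words. Words are made from letters."
suffix_array = build_suffix_array(text)
print(search(text, suffix_array, "text"))  # [10, 18]
print(search(text, suffix_array, "ma"))    # [27, 49]
```

Each comparison reads the text through a pointer, which is why the search costs more than a binary search over stored keys.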

14

15

16

Construction of Suffix Arrays for Large Texts

17

Signature Files

Word-oriented index structures based on hashing
Maps words to bit masks of B bits
Divides the text into blocks of b words each
The mask of a block is obtained by bitwise ORing the signatures of all the words in the text block.
Hash the query to a bit mask W
If W & Bi = W (where Bi is the mask of block i), the text block may contain the word
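
The scheme above can be sketched as follows. The values of B and b, the number of bits set per word, and the use of SHA-256 as the hash function are all assumptions picked for this toy example. Because different words can set the same bits, a matching mask only marks a candidate block, so the sketch verifies each candidate by scanning it.

```python
import hashlib
import re

B = 16               # bits per mask (toy value, assumed)
BITS_PER_WORD = 2    # bits set by each word signature (assumed)
WORDS_PER_BLOCK = 4  # b, the number of words per block (assumed)


def signature(word):
    """Hash a word to a B-bit mask with at most BITS_PER_WORD bits set."""
    digest = hashlib.sha256(word.lower().encode("utf-8")).digest()
    mask = 0
    for k in range(BITS_PER_WORD):
        mask |= 1 << (digest[k] % B)
    return mask


def build_signature_file(text):
    """Split the text into blocks of words and OR the word signatures."""
    words = re.findall(r"\w+", text)
    blocks = [words[i:i + WORDS_PER_BLOCK]
              for i in range(0, len(words), WORDS_PER_BLOCK)]
    masks = []
    for block in blocks:
        mask = 0
        for word in block:
            mask |= signature(word)
        masks.append(mask)
    return blocks, masks


def candidate_blocks(word, masks):
    """Blocks whose mask contains every bit of the query mask W."""
    w = signature(word)
    return [i for i, block_mask in enumerate(masks) if (w & block_mask) == w]


def search(word, blocks, masks):
    """Check the candidates sequentially to discard false matches."""
    target = word.lower()
    return [i for i in candidate_blocks(word, masks)
            if any(w.lower() == target for w in blocks[i])]


text = "This is a text. A text has many words. Words are made from letters."
blocks, masks = build_signature_file(text)
print(search("text", blocks, masks))  # [0, 1]
```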

18

19

Sequential Searching

Brute Force
Knuth-Morris-Pratt
Boyer-Moore Family
Shift-Or
Suffix Automaton
  Backward DAWG matching (BDM)
  BNDM

20

Knuth-Morris-Pratt
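
The slide content for this algorithm did not survive in the transcript, so the following is only a standard textbook sketch of Knuth-Morris-Pratt, not a reconstruction of the slide. The function names are assumptions.

```python
def failure_table(pattern):
    """fail[i] = length of the longest proper prefix of pattern[:i + 1]
    that is also a suffix of it."""
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    return fail


def kmp_search(text, pattern):
    """Return the start positions of all occurrences of pattern in text."""
    if not pattern:
        return []
    fail = failure_table(pattern)
    matches = []
    k = 0  # number of pattern characters currently matched
    for i, ch in enumerate(text):
        while k > 0 and ch != pattern[k]:
            k = fail[k - 1]  # fall back without re-reading the text
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            matches.append(i - k + 1)
            k = fail[k - 1]
    return matches


print(kmp_search("abracadabra abracadabra", "abra"))  # [0, 7, 12, 19]
```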

21

Boyer-Moore Family
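
The transcript keeps only the heading here. As one simple member of the family, this sketch shows the Horspool variant, which compares the window right to left and then shifts according to the text character aligned with the last pattern position. It is an illustrative choice, not necessarily the variant the slide presented.

```python
def horspool_search(text, pattern):
    """Return the start positions of all occurrences of pattern in text."""
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return []
    # Shift for a character: distance from its last occurrence in
    # pattern[:-1] to the end of the pattern. Default shift is m.
    shift = {ch: m - 1 - i for i, ch in enumerate(pattern[:-1])}
    matches = []
    pos = 0
    while pos <= n - m:
        j = m - 1
        while j >= 0 and text[pos + j] == pattern[j]:
            j -= 1  # compare the window from right to left
        if j < 0:
            matches.append(pos)
        pos += shift.get(text[pos + m - 1], m)
    return matches


print(horspool_search("abracadabra abracadabra", "abra"))  # [0, 7, 12, 19]
```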

22

Shift-Or
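
Only the heading remains for this slide. A common formulation of Shift-Or keeps one bit per pattern position, where a 0 in bit i means that pattern[0..i] matches the text ending at the current character. The sketch below follows that formulation using Python integers as bit vectors; the function name is an assumption.

```python
def shift_or_search(text, pattern):
    """Return the start positions of all occurrences of pattern in text."""
    m = len(pattern)
    if m == 0:
        return []
    all_ones = (1 << m) - 1
    # masks[c] has bit i cleared exactly when pattern[i] == c.
    masks = {}
    for i, ch in enumerate(pattern):
        masks[ch] = masks.get(ch, all_ones) & ~(1 << i)
    state = all_ones  # no prefix matched yet
    matches = []
    for pos, ch in enumerate(text):
        # Shifting brings in a 0 (the empty prefix always matches);
        # the OR keeps a 0 only where the character also agrees.
        state = ((state << 1) | masks.get(ch, all_ones)) & all_ones
        if (state & (1 << (m - 1))) == 0:
            matches.append(pos - m + 1)
    return matches


print(shift_or_search("abracadabra abracadabra", "abra"))  # [0, 7, 12, 19]
```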

23

Suffix Automaton

24

25

Pattern Matching

Searching allowing errors
  Dynamic Programming
  Automaton
Regular Expressions and Extended Patterns
Pattern Matching Using Indices
  Inverted files
  Suffix Trees and Suffix Arrays

26

Dynamic Programming
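
The slide body is missing from the transcript. A standard dynamic-programming formulation of searching with at most k errors keeps one column of edit distances per text character, with the top cell fixed at 0 so that a match may start anywhere. The sketch below assumes that formulation with unit-cost insertions, deletions, and substitutions; the function name is hypothetical.

```python
def approximate_search(text, pattern, k):
    """Return the end positions (inclusive, 0-based) in text where some
    substring matches pattern with at most k edit errors."""
    m = len(pattern)
    # column[i] = fewest errors needed to match pattern[:i] against a
    # substring of the text ending at the current position.
    column = list(range(m + 1))
    ends = []
    for pos, ch in enumerate(text):
        diagonal = column[0]  # value of the previous column, row i - 1
        for i in range(1, m + 1):
            old = column[i]
            if pattern[i - 1] == ch:
                column[i] = diagonal
            else:
                column[i] = 1 + min(diagonal, old, column[i - 1])
            diagonal = old
        if column[m] <= k:
            ends.append(pos)
    return ends


print(approximate_search("surgery", "survey", 2))  # [4, 5, 6]
```

The cost is proportional to the pattern length times the text length, and only one column has to be kept in memory.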

27

Automaton

28

Regular Expressions

29

Pattern Matching Using Indices

Inverted Files
  Queries such as suffix or substring queries, searching allowing errors, and regular expressions are solved by a sequential search over the vocabulary
  The restriction is that approximate matches or regular expressions that span many words cannot be found this way.

30

Pattern Matching Using Indices

Suffix Trees
  Suffix trees are able to perform complex searches
    Word, prefix, suffix, substring, and range queries
    Regular expressions
    Unrestricted approximate string matching
  Useful in specific areas
    Find the longest repeated substring
    Find the most common substring of a fixed size

31

Pattern Matching Using Indices

Suffix Arrays
  Some patterns can be searched directly in the suffix array without simulating the suffix tree
    Word, prefix, suffix, subword search and range search

32

Compression

Compressed text: Huffman coding
  Taking words as symbols
  Use an alphabet of bytes instead of bits
Compressed indices
  Inverted Files
  Suffix Trees and Suffix Arrays
  Signature Files
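
For the compressed-text item, the sketch below builds a Huffman code whose symbols are whole words. To stay short it uses an ordinary binary code (an alphabet of bits); the byte-oriented variant mentioned above would instead merge up to 256 nodes at a time and is not shown. The function names and the regular-expression tokenizer are assumptions, and separators between words are ignored.

```python
import heapq
import re
from collections import Counter


def word_huffman_codes(words):
    """Build a binary Huffman code whose symbols are whole words."""
    freq = Counter(words)
    # Heap entries: (weight, unique tie breaker, {word: code so far}).
    heap = [(weight, i, {word: ""})
            for i, (word, weight) in enumerate(sorted(freq.items()))]
    heapq.heapify(heap)
    if not heap:
        return {}
    if len(heap) == 1:
        return {word: "0" for word in heap[0][2]}
    counter = len(heap)
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)   # two lightest subtrees
        w2, _, right = heapq.heappop(heap)
        merged = {word: "0" + code for word, code in left.items()}
        merged.update({word: "1" + code for word, code in right.items()})
        heapq.heappush(heap, (w1 + w2, counter, merged))
        counter += 1
    return heap[0][2]


def encode(words, codes):
    return "".join(codes[word] for word in words)


def decode(bits, codes):
    reverse = {code: word for word, code in codes.items()}
    words, current = [], ""
    for bit in bits:
        current += bit
        if current in reverse:  # codes are prefix-free, so this is safe
            words.append(reverse[current])
            current = ""
    return words


words = re.findall(r"\w+", "for each rose, a rose is a rose")
codes = word_huffman_codes(words)
bits = encode(words, codes)
assert decode(bits, codes) == words
```

Frequent words such as "rose" receive codes no longer than those of rare words such as "each", which is where the compression comes from.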
