File Structures. Information Retrieval: Data Structures and Algorithms by W.B. Frakes and R. Baeza-Yates (Eds.) Englewood Cliffs, NJ: Prentice Hall, 1992. (Chapters 3-5)

Post on 19-Dec-2015

File Structures

Information Retrieval: Data Structures and Algorithms

by W.B. Frakes and R. Baeza-Yates (Eds.) Englewood Cliffs, NJ: Prentice Hall, 1992.

(Chapters 3-5)

File Structures for IR

Lexicographical indices
  » indices that are sorted
  » e.g. inverted files
  » e.g. Patricia (PAT) trees

Cluster file structures

Indices based on hashing
  » signature files

Inverted Files

Information Retrieval: Data Structures and Algorithms

by W.B. Frakes and R. Baeza-Yates (Eds.) Englewood Cliffs, NJ: Prentice Hall, 1992.

(Chapter 3)

Inverted Files

Each document is assigned a list of keywords or attributes.

Each keyword (attribute) is associated with operational relevance weights.

An inverted file is the sorted list of keywords (attributes), with each keyword having links to the documents containing that keyword.

Penalty
  » the size of inverted files ranges from 10% to 100% or more of the size of the text itself
  » need to update the index as the data set changes
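
The definition above can be illustrated with a minimal sketch. It assumes the input is a mapping from document number to its assigned keywords; the function name `build_inverted_file` and the sample data are hypothetical, and relevance weights are left out for brevity.

```python
from collections import defaultdict


def build_inverted_file(doc_keywords):
    """Return a keyword-sorted list of (keyword, [document numbers]).

    doc_keywords: dict mapping document number -> iterable of keywords
    (an assumed input shape, chosen only for this illustration).
    """
    links = defaultdict(set)
    for doc_id, keywords in doc_keywords.items():
        for keyword in keywords:
            links[keyword].add(doc_id)
    # The inverted file is the sorted keyword list, each with its document links.
    return [(keyword, sorted(links[keyword])) for keyword in sorted(links)]


docs = {
    1: ["retrieval", "index"],
    2: ["index", "hashing"],
    3: ["retrieval", "hashing", "trie"],
}
inverted = build_inverted_file(docs)
for keyword, doc_ids in inverted:
    print(keyword, doc_ids)
```

Adding a fourth document would require rebuilding or patching the sorted list, which is the update penalty noted on the slide.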

Indexing Restrictions

A controlled vocabulary which is the collection of keywords that will be indexed. Words in the text that are not in the vocabulary will not be indexed

A list of stopwords that for reasons of volume will not be included in the index

A set of rules that decide the beginning of a word or a piece of text that is indexable

A list of character sequences to be indexed (or not indexed)
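
A rough sketch of how such restrictions might be applied while scanning text. The stoplist, the controlled vocabulary, and the word rule (a word is a run of letters) are all assumptions made up for this example, not taken from the source.

```python
import re

STOPWORDS = {"the", "of", "a", "is", "in"}               # hypothetical stoplist
VOCABULARY = {"inverted", "file", "index", "keyword"}    # hypothetical controlled vocabulary
WORD_RULE = re.compile(r"[a-z]+")                        # assumed rule: a word is a run of letters


def indexable_terms(text, vocabulary=None):
    """Return (term, character offset) pairs that survive the restrictions."""
    terms = []
    for match in WORD_RULE.finditer(text.lower()):
        word = match.group()
        if word in STOPWORDS:
            continue  # dropped for reasons of volume
        if vocabulary is not None and word not in vocabulary:
            continue  # not in the controlled vocabulary
        terms.append((word, match.start()))
    return terms


sample = "The index of a file is an inverted file"
print(indexable_terms(sample))
print(indexable_terms(sample, VOCABULARY))
```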

Sorted array implementation of an inverted file

Structures used in Inverted Files

Sorted Arrays
  » store the list of keywords in a sorted array
  » search using a standard binary search
  » advantage: easy to implement
  » disadvantage: updating the index is expensive

Hashing Structures

Tries (digital search trees)

Combinations of these structures
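
The sorted-array option can be sketched with Python's standard `bisect` module. The parallel arrays `keywords` and `postings` and the helper names are assumptions for illustration only.

```python
from bisect import bisect_left, insort

# Parallel arrays: keywords[i] is a term, postings[i] its sorted document numbers.
keywords = ["hashing", "index", "retrieval", "trie"]
postings = [[2, 3], [1, 2], [1, 3], [3]]


def lookup(term):
    """Binary search the sorted keyword array; return the postings or []."""
    i = bisect_left(keywords, term)
    if i < len(keywords) and keywords[i] == term:
        return postings[i]
    return []


def add_posting(term, doc_id):
    """Insert a posting; a new keyword forces an O(n) shift of both arrays."""
    i = bisect_left(keywords, term)
    if i < len(keywords) and keywords[i] == term:
        if doc_id not in postings[i]:
            insort(postings[i], doc_id)
    else:
        keywords.insert(i, term)
        postings.insert(i, [doc_id])


print(lookup("index"))
add_posting("patricia", 4)
print(keywords)
```

Lookup takes a logarithmic number of comparisons, while every new keyword shifts the tail of both arrays, which is where the update cost comes from.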

Sorted Arrays

1. The input text is parsed into a list of words along with their location in the text. (time and storage consuming operation)

2. This list is inverted from a list of terms in location order to a list of terms in alphabetical order.

3. Add term weights, or reorganize or compress the files.
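
The three steps above might look as follows in a small sketch. It assumes a word's location is its word index in the text and uses the raw occurrence count as a stand-in for a term weight; both choices are illustrative assumptions.

```python
import re
from itertools import groupby


def parse(text):
    """Step 1: list of (word, location) pairs in location order."""
    return [(word, i) for i, word in enumerate(re.findall(r"[a-z]+", text.lower()))]


def invert(word_list):
    """Step 2: reorder from location order to alphabetical order."""
    return sorted(word_list)


def postprocess(sorted_list):
    """Step 3: merge duplicate terms and attach a simple weight (the count)."""
    result = []
    for term, group in groupby(sorted_list, key=lambda pair: pair[0]):
        locations = [location for _, location in group]
        result.append((term, len(locations), locations))
    return result


index = postprocess(invert(parse("to be or not to be")))
for entry in index:
    print(entry)
```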

Inversion of Word List

Dictionary and postings file

Idea: the file to be searched should be as short as possible

split a single file into two pieces

e.g. data set: 38,304 records, 250,000 unique terms

(document #, frequency)
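
One possible way to picture the split: the dictionary keeps one short row per term (term, number of postings, offset), and the postings file is a flat run of (document #, frequency) pairs. The row layout and the function names below are assumptions for this sketch.

```python
from bisect import bisect_left


def split_index(inverted):
    """Split (term, [(doc#, freq), ...]) entries, sorted by term, into two pieces.

    Returns (dictionary, postings): dictionary rows are (term, count, offset)
    and postings is one flat list of (doc#, freq) pairs.
    """
    dictionary, postings = [], []
    for term, plist in inverted:
        dictionary.append((term, len(plist), len(postings)))
        postings.extend(plist)
    return dictionary, postings


def fetch(dictionary, postings, term):
    """Binary search the short dictionary, then read one slice of postings."""
    terms = [row[0] for row in dictionary]
    i = bisect_left(terms, term)
    if i == len(terms) or terms[i] != term:
        return []
    _, count, offset = dictionary[i]
    return postings[offset:offset + count]


inverted = [
    ("hashing", [(2, 1), (3, 4)]),
    ("index", [(1, 2)]),
    ("trie", [(1, 1), (2, 2), (3, 1)]),
]
dictionary, postings = split_index(inverted)
print(dictionary)
print(fetch(dictionary, postings, "trie"))
```

Only the dictionary is searched; the longer postings file is touched once per matching term.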

Producing an Inverted File for Large Data Sets without Sorting

Idea: avoid the use of an explicit sort by using a right-threaded binary tree

current number of term postings & the storage location of postings list

traverse the binary tree and the linked postings list
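
The idea can be sketched with a plain (unthreaded) binary search tree standing in for the right-threaded tree named on the slide; that substitution is deliberate, to keep the example short. Each node holds a term and its postings list, and an in-order traversal yields the terms alphabetically without an explicit sort. The sketch assumes documents are processed in increasing document-number order.

```python
class Node:
    def __init__(self, term, doc_id):
        self.term = term
        self.postings = [doc_id]  # postings list linked from the tree node
        self.left = None
        self.right = None


def insert(root, term, doc_id):
    """Add one (term, doc#) occurrence to the tree and return the root."""
    if root is None:
        return Node(term, doc_id)
    node = root
    while True:
        if term == node.term:
            if node.postings[-1] != doc_id:  # skip repeats within one document
                node.postings.append(doc_id)
            return root
        side = "left" if term < node.term else "right"
        child = getattr(node, side)
        if child is None:
            setattr(node, side, Node(term, doc_id))
            return root
        node = child


def in_order(node):
    """Yield (term, number of postings, postings) in alphabetical order."""
    if node is not None:
        yield from in_order(node.left)
        yield node.term, len(node.postings), node.postings
        yield from in_order(node.right)


root = None
for doc_id, text in [(1, "index trie"), (2, "hashing index"), (3, "trie hashing index")]:
    for term in text.split():
        root = insert(root, term, doc_id)
inverted = list(in_order(root))
print(inverted)
```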

A Fast Inversion Algorithm

Principle 1: large primary memories are available
  » If databases can be split into memory loads that can be rapidly processed and then combined, the overall cost will be minimized.

Principle 2: the inherent order of the input data
  » It is very expensive to use polynomial or even n log n sorting algorithms for large files.

FAST-INV algorithm

See p. 13.

concept postings/pointers

document number

concept number (one concept number for each unique word)

Sample document vector

Similar to the document-word list shown on p. 7.

The concept numbers are sorted within document numbers, and document numbers are sorted within the collection.
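
A small sketch of how such a document vector file could be produced. Assigning concept numbers in alphabetical order of the words is an assumption made for this example; the source only states that each unique word gets one concept number.

```python
def document_vectors(docs):
    """Return (concepts, pairs) for docs given as {doc#: text}.

    concepts maps each unique word to a concept number (assumed alphabetical).
    pairs is the list of (doc#, con#), sorted by document and then by concept.
    """
    vocabulary = sorted({word for text in docs.values() for word in text.lower().split()})
    concepts = {word: number for number, word in enumerate(vocabulary, start=1)}
    pairs = sorted(
        {(doc, concepts[word]) for doc, text in docs.items() for word in text.lower().split()}
    )
    return concepts, pairs


docs = {1: "index trie", 2: "hashing index", 3: "trie hashing index"}
concepts, pairs = document_vectors(docs)
print(concepts)
print(pairs)
```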

Preparation

Terminology
  » HCN = highest concept number in dictionary, or the number of words to be indexed
  » L = number of document/concept pairs in the collection
  » M = available primary memory size

Assumption
  » M >> HCN
  » M < L

: the range of concepts for each primary load

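Read in each (Doc, Con) pair and look up Con in the Load table to determine which load the pair falls into.

Invert each LoadFile in turn. The Offset in the CONPTR table shows the position where each entry should be written.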

Preparation

1. Allocate an array, con_entries_cnt, of size HCN.

2. For each <doc#, con#> entry in the document vector file: increment con_entries_cnt[con#].

Offset   (con#, doc#) entries
0        (1,1), (1,4)
2        (2,3)
3        (3,1), (3,2), (3,5)
6        (4,2), (4,3)
8        …
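
Steps 1 and 2 can be sketched as below. The document vector is the set of (con#, doc#) pairs from the figure, re-ordered by document number as the input file would be; reading the figure's 0, 2, 3, 6, 8 as running totals of the counts is an interpretation, and the variable `offsets` is a name introduced here.

```python
from itertools import accumulate

# (doc#, con#) pairs in document order, taken from the pairs in the figure.
doc_vectors = [(1, 1), (1, 3), (2, 3), (2, 4), (3, 2), (3, 4), (4, 1), (5, 3)]
HCN = 4  # highest concept number

# Step 1: one counter per concept (index 0 is unused so con# indexes directly).
con_entries_cnt = [0] * (HCN + 1)

# Step 2: count the entries for each concept.
for _doc, con in doc_vectors:
    con_entries_cnt[con] += 1

# Running totals give the position where each concept's postings start.
offsets = [0] + list(accumulate(con_entries_cnt[1:]))

print(con_entries_cnt[1:])
print(offsets)
```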

Preparation (continued)

5. For each <con#, count> pair obtained from con_entries_cnt: if there is no room for documents with this concept to fit in the current load, then create an entry in the load table and initialize the next load entry; otherwise update information for the current load table entry.

Building Load Table

Terminology
  » LL = length of current load
  » S = spread of concept numbers in the current load
  » 8 bytes = space needed for each concept/weight pair
  » 4 bytes = space needed for each concept to store count of postings for it

Constraints
  » 8*LL + 4*S < M
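
A minimal sketch of a greedy load-table builder that respects the constraint 8*LL + 4*S < M. The tuple layout (first concept, last concept, LL) is an assumption, and so is the policy of giving an oversized concept a load of its own; the source does not say how that case is handled.

```python
def build_load_table(con_entries_cnt, M):
    """Group consecutive concepts into loads with 8*LL + 4*S < M.

    con_entries_cnt[c] is the number of postings for concept c (index 0 unused).
    Returns a list of (first_con, last_con, LL) tuples, one per load.
    """
    loads = []
    first, LL = 1, 0
    for con in range(1, len(con_entries_cnt)):
        count = con_entries_cnt[con]
        S = con - first + 1  # spread if this concept joins the current load
        if LL > 0 and 8 * (LL + count) + 4 * S >= M:
            # No room: close the current load and start the next one here.
            loads.append((first, con - 1, LL))
            first, LL = con, 0
        LL += count
    if LL > 0:
        loads.append((first, len(con_entries_cnt) - 1, LL))
    return loads


counts = [0, 2, 1, 3, 2]  # postings per concept for concepts 1..4
print(build_load_table(counts, 40))
print(build_load_table(counts, 100))
```

With M = 40 the four concepts split into three loads; with M = 100 they fit in one.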

: the range of concepts for each primary load

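Read in each (Doc, Con) pair and look up Con in the Load table to determine which load the pair falls into.

Invert each LoadFile in turn. The Offset in the CONPTR table shows the position where each entry should be written.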
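
Putting the two passes together, a rough in-memory sketch might look like this. It is a simplified illustration and not the published FAST-INV code: load files are plain lists rather than disk files, weights are omitted, and the function name and the (first_con, last_con) shape of the load table are assumptions.

```python
def fast_inv_sketch(doc_vectors, loads):
    """Invert (doc#, con#) pairs load by load without a comparison sort.

    doc_vectors: (doc#, con#) pairs in document order.
    loads: list of (first_con, last_con) concept ranges, one per load.
    Returns a dict mapping con# -> list of doc# in document order.
    """
    # Pass 1: route each pair to the load whose concept range contains it.
    load_files = [[] for _ in loads]
    for doc, con in doc_vectors:
        for k, (first, last) in enumerate(loads):
            if first <= con <= last:
                load_files[k].append((doc, con))
                break

    inverted = {}
    # Pass 2: invert one load at a time.
    for (first, last), pairs in zip(loads, load_files):
        counts = [0] * (last - first + 1)
        for _doc, con in pairs:
            counts[con - first] += 1
        # CONPTR-style offsets: where each concept's postings start in the load.
        conptr = [0] * len(counts)
        for i in range(1, len(counts)):
            conptr[i] = conptr[i - 1] + counts[i - 1]
        slots = [None] * len(pairs)
        next_free = conptr[:]
        for doc, con in pairs:
            slots[next_free[con - first]] = doc  # write at the offset
            next_free[con - first] += 1
        for i, count in enumerate(counts):
            inverted[first + i] = slots[conptr[i]:conptr[i] + count]
    return inverted


doc_vectors = [(1, 1), (1, 3), (2, 3), (2, 4), (3, 2), (3, 4), (4, 1), (5, 3)]
result = fast_inv_sketch(doc_vectors, [(1, 2), (3, 4)])
print(result)
```

Because the input is already ordered by document number, each concept's postings come out in document order with only counting and direct placement.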