1 cs 430: information discovery sample midterm examination notes on the solutions

1

CS 430: Information Discovery

Sample Midterm Examination

Notes on the Solutions

2

Midterm Examination -- Question 1

1(a) Define the terms inverted file, inverted list, posting.

Inverted file: A list of the words in a set of documents and the documents in which they appear.

Inverted list: All the entries in an inverted file that apply to a specific word.

Posting: An entry in an inverted list.

See Lecture 3
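
The three terms can be illustrated with a minimal sketch. It assumes a small in-memory collection and represents each posting as a (document id, term frequency) pair; that posting layout, the two sample documents, and the function name build_inverted_file are illustrative choices, not taken from the lecture.

```python
from collections import Counter, defaultdict

def build_inverted_file(documents):
    """Map each word to its inverted list of (doc_id, frequency) postings."""
    inverted_file = defaultdict(list)
    for doc_id in sorted(documents):
        counts = Counter(documents[doc_id].split())
        for word, frequency in counts.items():
            # One posting per (word, document) pair.
            inverted_file[word].append((doc_id, frequency))
    return dict(inverted_file)

# Hypothetical two-document collection.
documents = {
    "D1": "ant bee cat",
    "D2": "bee bee dog",
}
inverted_file = build_inverted_file(documents)
print(inverted_file["bee"])  # the inverted list for "bee"
```

Here the whole dictionary plays the role of the inverted file, inverted_file["bee"] is one inverted list, and each tuple in it is a posting.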

1(b) When implementing an inverted file system, what are the criteria that you would use to judge whether the system is suitable for very large-scale information retrieval?

3

Q1 (continued)

Storage

Inverted files are big, typically 10% to 100% the size of the collection of documents.

Update performance

It must be possible, with a reasonable amount of computation, to:

(a) Add a large batch of documents

(b) Add a single document

Retrieval performance

Retrieval must be fast enough to satisfy users and must not use excessive resources.

from Lecture 3

4

Q1 (continued)

1(c) You are designing an inverted file system to be used with Boolean queries on a very large collection of textual documents. New documents are being continually added to the collection.

(i) What file structure(s) would you use?

(ii) How well does your design satisfy the criteria listed in Part (b)?

5

Q1 (continued)

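Separate inverted index from lists of postings

[Figure, from Lecture 3: an inverted index with two columns, "Term" and "Pointer to list of postings", holding the terms ant, bee, cat, dog, elk, fox, gnu, hog. Each pointer leads to that term's list of postings, which is stored separately in the postings file.]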

6

Question 1 (continued)

(a) Postings file may be stored sequentially as a linked list.

(b) Index file is best stored as a tree. Binary trees provide fast searching but have problems with updating. B-trees are better, with B+-trees as the best.

Note: Other answers are possible to this part of the question.

7


1(c)(ii) How well does your design satisfy the criteria listed in Part (b)?

• Sequential list for each term is efficient for storage and for processing Boolean queries. The disadvantage is a slow update time for long inverted lists.

• B-trees combine fast retrieval with moderately efficient updating.

• Bottom-up updating is usually fast, but may require recursive tree climbing to the root.

• The main weakness is poor storage utilization; typically buckets are only about 69% full.
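
The trade-off for sequential postings lists can be sketched as follows. This is an illustration under assumptions, not the lecture's implementation: postings are reduced to sorted lists of integer document ids, and the terms and ids in index are hypothetical.

```python
import bisect

def intersect(postings_a, postings_b):
    """Boolean AND: merge two sorted lists of document ids in one linear pass."""
    result = []
    i = j = 0
    while i < len(postings_a) and j < len(postings_b):
        if postings_a[i] == postings_b[j]:
            result.append(postings_a[i])
            i += 1
            j += 1
        elif postings_a[i] < postings_b[j]:
            i += 1
        else:
            j += 1
    return result

# Hypothetical postings: sorted document ids for two terms.
index = {"cat": [1, 4, 7, 9], "dog": [2, 4, 9, 12]}
print(intersect(index["cat"], index["dog"]))  # documents containing both terms

# Adding a document keeps the list sorted, but every later entry must be
# shifted, which is the slow-update cost for long inverted lists.
bisect.insort(index["cat"], 5)
```

The query side only ever reads each list front to back, which is why sequential storage suits Boolean processing; the insertion is where the cost appears.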

8


2(b) You have a collection of documents that contain the following index terms:

D1: alpha bravo charlie delta echo foxtrot golf

D2: golf golf golf delta alpha

D3: bravo charlie bravo echo foxtrot bravo

D4: foxtrot alpha alpha golf golf delta

(i) Use an incidence matrix of terms to calculate a similarity matrix for these four documents, with no term weighting.

9

Incidence array

      alpha  bravo  charlie  delta  echo  foxtrot  golf   terms
D1      1      1       1       1     1       1      1       7
D2      1      0       0       1     0       0      1       3
D3      0      1       1       0     1       1      0       4
D4      1      0       0       1     0       1      1       4

The last column is the number of distinct terms in each document.

10

Document similarity matrix

       D1     D2     D3     D4
D1      -    0.65   0.76   0.76
D2    0.65    -     0.00   0.87
D3    0.76   0.00    -     0.25
D4    0.76   0.87   0.25    -
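
The unweighted similarities can be checked with a short sketch. It assumes the similarity measure is the cosine of the angle between incidence vectors, which is consistent with the values in this matrix; the function name cosine_incidence is illustrative.

```python
import math

documents = {
    "D1": "alpha bravo charlie delta echo foxtrot golf",
    "D2": "golf golf golf delta alpha",
    "D3": "bravo charlie bravo echo foxtrot bravo",
    "D4": "foxtrot alpha alpha golf golf delta",
}

# Incidence ignores repeats, so each document reduces to a set of terms.
term_sets = {name: set(text.split()) for name, text in documents.items()}

def cosine_incidence(a, b):
    """Cosine of two 0/1 vectors: shared terms / sqrt(|a| * |b|)."""
    return len(a & b) / math.sqrt(len(a) * len(b))

names = sorted(term_sets)
for x in names:
    row = [f"{cosine_incidence(term_sets[x], term_sets[y]):.2f}" for y in names]
    print(x, " ".join(row))
```

For 0/1 vectors the dot product is simply the number of shared terms and each vector length is the square root of the number of terms, so no explicit matrix is needed.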

11


2b(ii) Use a frequency matrix of terms to calculate a similarity matrix for these documents, with weights proportional to the term frequency and inversely proportional to the document frequency.

12

Frequency Array

      alpha  bravo  charlie  delta  echo  foxtrot  golf
D1      1      1       1       1     1       1      1
D2      1      0       0       1     0       0      3
D3      0      3       1       0     1       1      0
D4      2      0       0       1     0       1      2

13

Inverse Document Frequency Weighting

Principle:

(a) Weight is proportional to the number of times that the term appears in the document

(b) Weight is inversely proportional to the number of documents that contain the term:

w_{ik} = f_{ik} / d_k

Where:

w_{ik} is the weight given to term k in document i

f_{ik} is the frequency with which term k appears in document i

d_k is the number of documents that contain term k
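
As a rough sketch of this weighting applied to the four documents of Question 2(b), the following computes the document frequency of each term and then divides each term frequency by it. The variable names are illustrative.

```python
from collections import Counter

documents = {
    "D1": "alpha bravo charlie delta echo foxtrot golf",
    "D2": "golf golf golf delta alpha",
    "D3": "bravo charlie bravo echo foxtrot bravo",
    "D4": "foxtrot alpha alpha golf golf delta",
}

# Term frequency: how often term k appears in document i.
frequencies = {name: Counter(text.split()) for name, text in documents.items()}

# Document frequency: number of documents that contain term k.
document_frequency = Counter()
for counts in frequencies.values():
    document_frequency.update(counts.keys())

# Weight = term frequency / document frequency.
weights = {
    name: {term: f / document_frequency[term] for term, f in counts.items()}
    for name, counts in frequencies.items()
}
print(weights["D3"]["bravo"])  # bravo appears 3 times in D3 and in 2 documents
```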

14

Frequency Array with Weights

      alpha  bravo  charlie  delta  echo  foxtrot  golf   length
D1    0.33   0.50    0.50    0.33   0.50   0.33    0.33    1.09
D2    0.33   0       0       0.33   0      0       1.00    1.11
D3    0      1.50    0.50    0      0.50   0.33    0       1.69
D4    0.67   0       0       0.33   0      0.33    0.67    1.05

d_k     3      2       2       3     2       3      3

15

Document similarity matrix

       D1     D2     D3     D4
D1      -    0.46   0.74   0.58
D2    0.46    -     0.00   0.86
D3    0.74   0.00    -     0.06
D4    0.58   0.86   0.06    -
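
The weighted similarities can be reproduced with a sketch along these lines. It assumes cosine similarity over the weighted term vectors, with weight equal to term frequency divided by document frequency; the helper name cosine is illustrative.

```python
import math
from collections import Counter

documents = {
    "D1": "alpha bravo charlie delta echo foxtrot golf",
    "D2": "golf golf golf delta alpha",
    "D3": "bravo charlie bravo echo foxtrot bravo",
    "D4": "foxtrot alpha alpha golf golf delta",
}
frequencies = {name: Counter(text.split()) for name, text in documents.items()}
# Each document contributes 1 to the count of every term it contains.
document_frequency = Counter(
    term for counts in frequencies.values() for term in counts
)
weights = {
    name: {term: f / document_frequency[term] for term, f in counts.items()}
    for name, counts in frequencies.items()
}

def cosine(u, v):
    """Dot product of two sparse weight vectors divided by their lengths."""
    dot = sum(w * v[term] for term, w in u.items() if term in v)
    length_u = math.sqrt(sum(w * w for w in u.values()))
    length_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (length_u * length_v)

names = sorted(weights)
for x in names:
    print(x, " ".join(f"{cosine(weights[x], weights[y]):.2f}" for y in names))
```

Because the cosine is symmetric, the printed matrix is symmetric about its diagonal, which is a quick consistency check on any hand-computed table.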

16

Question 3

3(a) Define the terms recall and precision.

3(b) Q is a query. D is a collection of 1,000,000 documents. When the query Q is run, a set of 200 documents is returned.

(i) How in a practical experiment would you calculate the precision?

Have an expert examine each of the 200 documents and decide whether it is relevant. Precision is the number judged relevant divided by 200.

(ii) How in a practical experiment would you calculate the recall?

It is not feasible to examine 1,000,000 records. Therefore sampling must be used ...

17


3(c) Suppose that, by some means, it is known that 100 of the documents in D are relevant to Q. Of the 200 documents returned by the search, 50 are relevant.

(i) What is the precision?

50/200 = 0.25

(ii) What is the recall?

50/100 = 0.5
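
The same arithmetic, written as a small sketch (the function and parameter names are illustrative):

```python
def precision_recall(retrieved, relevant_retrieved, relevant_in_collection):
    """Return (precision, recall) from three document counts."""
    precision = relevant_retrieved / retrieved            # fraction of hits that are relevant
    recall = relevant_retrieved / relevant_in_collection  # fraction of relevant documents found
    return precision, recall

# 200 documents returned, 50 of them relevant, 100 relevant in the collection.
print(precision_recall(200, 50, 100))
```

Note that the size of the collection (1,000,000) does not enter either figure.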

3(d) Explain in general terms the method used by TREC to estimate the recall.

18


For each query, a pool of potentially relevant documents is assembled, using the top 100 ranked documents from each participant.

The human expert who set the query looks at every document in the pool and determines whether it is relevant.

Documents outside the pool are not examined.

[In TREC-8: 7,100 documents in the pool; 1,736 unique documents (eliminating duplicates); 94 judged relevant]
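
The pooling method can be sketched in a few lines. This is a simplified illustration, not TREC's actual software: the run names, document ids, and judgments are hypothetical, and the sketch assumes that documents never judged are counted as not relevant when recall is estimated.

```python
def build_pool(runs, depth=100):
    """Union of the top `depth` documents from each participant's ranked run."""
    pool = set()
    for ranked_docs in runs.values():
        pool.update(ranked_docs[:depth])  # duplicates across runs collapse
    return pool

def estimated_recall(run, judged_relevant):
    """Relevant documents retrieved / relevant documents found in the pool."""
    found = sum(1 for doc in run if doc in judged_relevant)
    return found / len(judged_relevant)

# Hypothetical ranked runs from two participants.
runs = {
    "system_a": ["d1", "d2", "d3"],
    "system_b": ["d2", "d4", "d5"],
}
pool = build_pool(runs, depth=2)

# Hypothetical assessor judgments, made only on documents in the pool.
judged_relevant = {"d2", "d4"}
print(sorted(pool))
print(estimated_recall(runs["system_a"], judged_relevant))
```

The denominator is the number of relevant documents found in the pool, not in the whole collection, so the result is an estimate whose quality depends on how many relevant documents the pool misses.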

19


4(a) What is the Dublin Core principle of dumbing-down? Are there any fields in this record that do not satisfy the principle?

"The theory behind this principle is that consumers of metadata should be able to strip off qualifiers and return to the base form of a property. ... this principle makes it possible for client applications to ignore qualifiers in the context of more coarse-grained, cross-domain searches."

Lagoze 2001

20


Dumbing-down failures:

Description.note: Title from home page as viewed on Nov. 1, 2000.

dumbs down to

Description: Title from home page as viewed on Nov. 1, 2000.

which is not a description of the object.

Publisher.place: Nashville, Tenn. :

dumbs down to

Publisher: Nashville, Tenn. :

which is not the publisher of the object.

Correct dumbing-down:

Subject.class.LCC: E840.8.G65

dumbs down to

Subject: E840.8.G65

which is a subject code.
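
Mechanically, dumbing-down amounts to stripping everything after the base element name. A minimal sketch, assuming qualifiers are separated from the element by dots as in the field names above (the function name dumb_down is illustrative):

```python
def dumb_down(qualified_field):
    """Strip qualifiers: keep only the base element before the first dot."""
    return qualified_field.split(".", 1)[0]

record = [
    ("Description.note", "Title from home page as viewed on Nov. 1, 2000."),
    ("Publisher.place", "Nashville, Tenn. :"),
    ("Subject.class.LCC", "E840.8.G65"),
]
for field, value in record:
    # The value is carried over unchanged; only the field name is simplified.
    print(dumb_down(field), "|", value)
```

The stripping itself always succeeds; the principle is violated when the unchanged value no longer makes sense under the base element, as with the first two fields.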

21


4(b) The metadata in the fields Publisher and Publisher.place end in punctuation marks. Can you suggest any reasons for doing so?

This is a historical curiosity. It comes from the concept that the metadata will be printed, so the metadata is stored in a printable format.

Publisher Gore/Lieberman,

Publisher.place Nashville, Tenn. :

is intended to be combined with a date as follows:

Nashville, Tenn. : Gore/Lieberman, 2001
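
A small sketch shows why the stored punctuation makes printing trivial: with the separators already in the field values, the printed form is a plain concatenation. The variable names are illustrative.

```python
# Field values exactly as stored, including their trailing punctuation.
publisher_place = "Nashville, Tenn. :"
publisher = "Gore/Lieberman,"
date = "2001"

# No formatting rules are needed at print time; the values carry them.
imprint = " ".join([publisher_place, publisher, date])
print(imprint)
```

The cost of this convenience is that any other use of the fields, such as searching or dumbing-down, has to cope with punctuation that is not part of the value.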

22


4(c) This record has no Creator field. It has a Contributor.nameCorporate field with value "Gore/Lieberman, Inc." Do you consider that this is correct use of Dublin Core? What would you put in the Creator and Contributor fields? Why?

23


Specification of Dublin Core:

A. All fields are optional. It is not necessary to have a Creator.

B. Definitions of fields

Creator: The person or organization primarily responsible for the intellectual content of the resource.

Contributor: A person or organization not specified in a creator element who has made significant intellectual contributions to the resource but whose contribution is secondary to any person or organization specified in a creator element.

Gore/Lieberman, Inc. is the corporate author of this web site and is therefore the Creator.
