Federated text retrieval from uncooperative overlapped collections

Milad Shokouhi, RMIT University, Melbourne, Australia

Justin Zobel, RMIT University, Melbourne, Australia

SIGIR 2007 (Collection representation in distributed IR)

2009-03-13

Presented by JongHeum Yeon, IDS Lab., Seoul National University

Copyright 2008 by CEBT

Abstract

Federated information retrieval (FIR)

Send query to multiple collections

Central broker merges the results and ranks them

Duplicated documents in collections

The final results may contain a large number of duplicates

The authors propose a method for estimating the rate of overlap among collections, based on sampling

Using the estimated overlap statistics, they propose two collection selection methods that aim to maximize the number of unique relevant documents in the final results


[Figure: a User, a Broker, and three Collections]


Federated Information Retrieval (FIR)

Query is sent simultaneously to several collections

Each collection evaluates the query and returns the results to the broker

Advantage

No need to access the index of the collections

Search over the latest version of documents without crawling and indexing

Broker selects collections that are most likely to return relevant documents

Collection selection problem

Collection representation problem

Result merging problem
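The broker's role described above can be sketched as follows. This is only an illustration, not the authors' system: collections are modelled as plain Python callables returning (document id, score) pairs, and the names `broker_search`, `selected`, and the toy collections "A" and "B" are hypothetical. Scores are assumed to be comparable across collections, which real result merging cannot take for granted.

```python
from typing import Callable, Dict, List, Tuple

Result = Tuple[str, float]                 # (document id, score)
Collection = Callable[[str], List[Result]]

def broker_search(query: str,
                  collections: Dict[str, Collection],
                  selected: List[str],
                  k: int = 10) -> List[Result]:
    """Send the query to the selected collections and merge their results."""
    merged: List[Result] = []
    for name in selected:
        merged.extend(collections[name](query))
    # Naive merge: sort by score, highest first (assumes comparable scores).
    merged.sort(key=lambda r: r[1], reverse=True)
    return merged[:k]

# Two toy collections that both hold document "d2".
collections = {
    "A": lambda q: [("d1", 0.9), ("d2", 0.8)],
    "B": lambda q: [("d2", 0.8), ("d3", 0.5)],
}
results = broker_search("query", collections, ["A", "B"])
```

In this toy run `d2` appears twice in the merged list, which is the duplication problem that overlapping collections introduce.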



Collection Selection Problem

FIR techniques assume that the degree of overlap among collections is either none or negligible

However, there are many collections that have a significant degree of overlap

Bibliographic databases

News resources

Selecting collections that are likely to return the same results introduces duplicate documents into the final results

Wastes costly resources

Degrades search effectiveness

Authors propose …

A method that estimates the degree of overlap among collections by sampling from each collection using random queries

Two collection selection techniques that use the estimated overlap statistics to maximize the number of unique relevant documents in the final results



Related Work

Cooperative collection selection techniques

Collections provide the broker with their index statistics and other useful information

CORI, GlOSS, CVV

Uncooperative collection selection techniques

Collections do not provide their index statistics to the broker

The broker samples documents from each collection

ReDDE uses sampled documents for …

– Estimates the number of relevant documents in collections

– Ranks collections according to the number of highly ranked sampled documents
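A simplified reading of the two ReDDE points above might look like the sketch below. It is not the published algorithm in full: it assumes the sampled documents have already been ranked for the query, that collection sizes are known or estimated, and that each sampled document stands for (collection size / sample size) documents in its collection. The function name, parameter names, and the `ratio` cut-off value used in the example are illustrative assumptions.

```python
from collections import defaultdict
from typing import Dict, List

def redde_scores(ranked_sample_docs: List[str],
                 doc_collection: Dict[str, str],
                 collection_size: Dict[str, float],
                 sample_size: Dict[str, int],
                 ratio: float) -> Dict[str, float]:
    """Estimate relevant documents per collection from a ranked central sample."""
    threshold = ratio * sum(collection_size.values())  # top slice assumed relevant
    scores: Dict[str, float] = defaultdict(float)
    estimated_rank = 0.0        # estimated rank in a hypothetical full index
    for doc in ranked_sample_docs:
        if estimated_rank >= threshold:
            break
        c = doc_collection[doc]
        weight = collection_size[c] / sample_size[c]   # documents one sample represents
        scores[c] += weight
        estimated_rank += weight
    return dict(scores)

scores = redde_scores(
    ranked_sample_docs=["a1", "b1", "a2", "b2"],
    doc_collection={"a1": "A", "a2": "A", "b1": "B", "b2": "B"},
    collection_size={"A": 1000, "B": 2000},
    sample_size={"A": 10, "B": 10},
    ratio=0.1,
)
ranking = sorted(scores, key=scores.get, reverse=True)
```

Collections are then ranked by their estimated number of relevant documents, here B ahead of A.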



Overlap Estimation

The rate of overlap is estimated from the documents downloaded by query-based sampling; no additional information is required

Subset of sample documents

Size of m

The probability that any given document from m1 is also in m2


[Figure: collections C1 and C2, their samples S1 and S2, and the region labelled K]

Expected number of documents


Overlap Estimation (cont’d)

P(i) follows a binomial distribution
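As a small numerical illustration of the binomial model, the sketch below evaluates P(i) directly. It assumes that each of the documents in m1 appears in m2 independently with the same probability; the symbol `p` for that probability and the function names are introduced here for illustration and are not taken from the slides.

```python
from math import comb

def overlap_pmf(i: int, m1: int, p: float) -> float:
    """P(i): probability that exactly i of the m1 documents also occur in m2."""
    return comb(m1, i) * p**i * (1 - p)**(m1 - i)

def expected_overlap(m1: int, p: float) -> float:
    """Expected number of shared documents, summed from the distribution."""
    return sum(i * overlap_pmf(i, m1, p) for i in range(m1 + 1))

m1, p = 20, 0.25
total_probability = sum(overlap_pmf(i, m1, p) for i in range(m1 + 1))
mean = expected_overlap(m1, p)
```

The probabilities sum to 1, and the mean agrees with the closed form m1 × p for a binomial distribution.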



Overlap Estimation (cont’d)

Binomial theorem

Expected number of documents in m1 ∩ m2

The number of overlapping documents is independent of the collection size
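The equations on this slide did not survive extraction. One way to carry out the step named above, offered as a reconstruction under stated assumptions rather than the authors' exact notation: write n for the size of m1 and p for the probability that a given document of m1 is also in m2 (both symbols are introduced here), and assume independent draws. The binomial theorem then collapses the expectation:

```latex
\begin{align*}
E\bigl[\,|m_1 \cap m_2|\,\bigr]
  &= \sum_{i=0}^{n} i \binom{n}{i} p^{i} (1-p)^{n-i} \\
  &= n p \sum_{j=0}^{n-1} \binom{n-1}{j} p^{j} (1-p)^{n-1-j} \\
  &= n p \,\bigl(p + (1-p)\bigr)^{n-1} \\
  &= n p .
\end{align*}
```

If, in addition, m2 is assumed to be a uniform sample of the K shared documents, then p = |m2| / K and the expectation becomes |m1| |m2| / K.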



The ‘RELAX’ Selection Method

Graph G = {(u, v) | vertices u and v are collections; edges indicate overlapping documents between vertices}

Output: a final merged document list with minimal duplicates
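The graph above can be represented with a simple edge map. The sketch below only builds such a graph; it is not the Relax selection algorithm itself. It assumes each collection's sample is available as a set of document identifiers and uses the raw number of shared sampled documents as the edge weight, whereas the slides work with estimated overlap. `build_overlap_graph` and the toy data are hypothetical names.

```python
from itertools import combinations
from typing import Dict, FrozenSet, Set

def build_overlap_graph(samples: Dict[str, Set[str]]) -> Dict[FrozenSet[str], int]:
    """Return edge weights: number of sampled documents shared by each pair."""
    graph: Dict[FrozenSet[str], int] = {}
    for u, v in combinations(sorted(samples), 2):
        shared = len(samples[u] & samples[v])
        if shared > 0:              # only overlapping collections get an edge
            graph[frozenset((u, v))] = shared
    return graph

samples = {
    "C1": {"d1", "d2", "d3"},
    "C2": {"d2", "d3", "d4"},
    "C3": {"d9"},
}
graph = build_overlap_graph(samples)
```

Using `frozenset` keys keeps the edges undirected, so (C1, C2) and (C2, C1) refer to the same entry.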



The ‘RELAX’ Selection Method (cont’d)



Overlap Filtering for ReDDE

F-ReDDE

1. The overlaps among collections are estimated as described for the Relax selection

2. Collections are ranked using a resource selection algorithm such as ReDDE

3. Each collection is compared with the previously selected collections. It is removed from the list if it has a high overlap (greater than γ) with any of the previously selected collections. We empirically choose γ = 30% and leave methods for finding the optimum value as future work
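Step 3 can be sketched as a single pass over the ranked list. The sketch assumes steps 1 and 2 have already produced a ranked list of collection names and a pairwise overlap estimate expressed as a fraction between 0 and 1; how that fraction is normalised is not stated on the slide, so it is treated as given. The names `filter_overlapping`, `ranked`, and `overlap` are illustrative.

```python
from typing import Dict, FrozenSet, List

def filter_overlapping(ranked: List[str],
                       overlap: Dict[FrozenSet[str], float],
                       gamma: float = 0.30) -> List[str]:
    """Keep a collection only if its overlap with every kept one is <= gamma."""
    selected: List[str] = []
    for candidate in ranked:
        too_similar = any(
            overlap.get(frozenset((candidate, chosen)), 0.0) > gamma
            for chosen in selected
        )
        if not too_similar:
            selected.append(candidate)
    return selected

ranked = ["C1", "C2", "C3"]                      # best first, e.g. from ReDDE
overlap = {
    frozenset(("C1", "C2")): 0.45,               # above the 30% threshold
    frozenset(("C2", "C3")): 0.10,
}
kept = filter_overlapping(ranked, overlap)
```

C2 is dropped because it overlaps heavily with the higher-ranked C1. C3 is compared only against collections that were actually kept, so its overlap with the removed C2 plays no part.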



Testbeds

Authors create three new testbeds with overlapping collections based on the documents available in the TREC GOV dataset

Qprobed-280

The 360 most frequent queries from a search engine in the .gov domain

A random number of documents (between 5000 and 20000) are downloaded as a collection

Generate 280 collections with average size of 12194 documents

Qprobed-300

Every twentieth collection is merged into a single large collection

Sliding-115

Using a sliding window of 30 000 documents

Generate 112 collections
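The sliding-window construction can be illustrated with a few lines of code. The slide gives the window size but not how far the window moves each time, so the `step` parameter below is an assumption, as are the function name and the toy document list.

```python
from typing import List, Sequence

def sliding_collections(doc_ids: Sequence[str],
                        window: int,
                        step: int) -> List[Sequence[str]]:
    """Cut overlapping collections out of an ordered document list."""
    return [doc_ids[start:start + window]
            for start in range(0, len(doc_ids) - window + 1, step)]

docs = [f"d{i}" for i in range(10)]
cols = sliding_collections(docs, window=4, step=2)
shared = set(cols[0]) & set(cols[1])
```

Any step smaller than the window makes neighbouring collections share documents; here each adjacent pair shares two.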



Testbeds (cont’d)

Qprobed-280

74492 collection pairs < 10% overlap

79 pairs < 90%

1.1% of collection pairs > 50% overlap

Qprobed-300


Sliding-115




Results

The initial estimated values for D(i, j) suggested that the degree of overlap among collections is usually overestimated

Document retrieval models are biased towards returning some popular documents for many queries

Samples produced by query-based sampling are not random



Results (cont’d)



Results (cont’d)



Conclusion & Discussion

Pros

Proposes an efficient algorithm for handling duplicates

Cons

Experiments show improved performance

In a practical environment?

