comparing offline and online statistics estimation for text retrieval from overlapped collections ms...

Comparing Offline and Online Statistics Estimation for Text Retrieval from Overlapped Collections

MS Thesis DefenseBhaumik Chokshi

Committee Members:Prof. Subbarao Kambhampati (Chair)Prof. Yi ChenProf. Hasan Davulcu

My MS Work

Collection Selection : ROSCO

Query Processing over Incomplete Autonomous Databases: QPIAD

Handling Query Imprecision and Data Incompleteness: QUIC

Multi Source Information Retrieval In multi source information retrieval problem, searching every

information source is not efficient. The retrieval system must choose one collection or subset of collections to call to answer a given query.

Overlapping Collections

Many real world collections have significant overlap. For example, multiple bibliography collections (e.g., ACMDL, IEEE, DBLP

etc.) may store some of the same papers and multiple news archives (e.g., New York Times, Washington Post etc.) may store very similar news stories.

CSB

IEEE

ACM• How likely it is that a given collection has documents relevant to the query.• Whether a collection will provide novel results given the collections already selected.

DBLPScience

Related Work Most collection selection approaches do not consider

overlap Existing systems like CORI, ReDDE try to create a representative for each

collection based on term and document frequency information. ReDDE uses collection samples to estimate relevance of each collection.

Same samples can be used to estimate overlap among collections.

16.6% of the documents in runs submitted to the TREC 2004 terabyte track were redundant. [Bernstein and Zobel, 2005]

Using coverage and overlap statistics in context of relational data sources. [Nie and Kambhampati, 2004] Overlap among tuples can be identified in a much straightforward way

compared to text documents.

Challenges Involved

Need for query specific overlap Two collections may have low overlap as a whole but can have high overlap

for a particular set of queries.

Overlap assessment offline vs. online Offline approach can store statistics for general keywords and map incoming

query to these keywords to obtain relevance and overlap statistics. Online approach can use the samples to estimate relevance and overlap

statistics.

Efficiently determine true overlap between collections True overlap between collections can be estimated using result to result

comparison for different collections.

COSCO

Context of this work COSCO takes overlap into account while determining collection

order.

But it does it offline.

Samples built for the collections can be used to estimate overlap statistics which can be a better estimate as it is for a particular query.

COSCO estimates overlap using bag similarity over result-set document.

True overlap between collections can be obtained using result to result comparison.

COSCO does not do experiments on TREC data.

Contributions

ROSCO, an online approach which estimates overlap statistics from the samples of the collections.

Comparison of offline (COSCO) and online (ROSCO) approaches for statistics estimation for text retrieval from overlapping collections.

Outline

COSCO and ROSCO Architecture ROSCO Approach Empirical Evaluation Other Contributions Conclusion

COSCO Architecture

ROSCO Architecture

Outline

COSCO and ROSCO Architecture ROSCO Approach Empirical Evaluations Other Contributions Conclusion

ROSCO (Offline Component)Collection representation through query based sampling

C2 C1TrainingQueries

Samples

TrainingQueries

S2 S1

Union of Samples

ROSCO (Offline Component)Collection Size Estimation

C2 C1

RandomQueries

RandomQueries

Samples

S2 S1

sizeSd

daverageizeEstimatedSC i

S

Ci

i

i .*.

Number of documentsreturned from collection Ci

Number of documentsreturned from sample Si

SizeEstimates

ROSCO (Offline Component)Grainy Hash Vector

Sample

Hash

GHV

w bitsn bits

ROSCO (Online Component)Assessing Relevance

Union of Samples

Query

S2

S1

Samples

Query

Determinetop –k

relevantdocuments

for eachcollections

SizeEstimates

Top-kdocuments

for eachcollection

ROSCO (Online Component)Assessing Overlap and Combining with Relevance

Estimate no. ofrelevant new

documents for eachcollection

SizeEstimates

GHVs ofthe top-k

documents of each collection

GHVs ofdocuments of the collections

selected till now

Collection with maximumno. of new relevant documents

Comparison of ROSCO and COSCO COSCO:

Offline method for estimating coverage and overlap statistics.

Gets estimate for a query by using statistics for corresponding frequent item sets. Statistics for “data mining integration” can be obtained by using statistics from “data mining” and “data integration”.

This way of computing statistics can lead to a much different estimate from actual statistics.

ROSCO: Online method for estimating

coverage and overlap statistics.

Gets estimate by sending query to sample which can give better estimate for a particular query at hand.

Success of this approach depends on the quality of sample. Sometimes it can be hard to obtain a good sample of the collection.

Outline

ROSCO and COSCO Architecture ROSCO Approach Empirical Evaluation Other Contributions Conclusion

Empirical Evaluation

Whether ROSCO can perform better in an environment of overlapping text collections compared to the approaches which do not consider overlap.

Compare ROSCO and COSCO in presence of overlap among collections.

Testbed Creation Test Data

TREC Genomics data. 50 queries with their relevance judgment.

Testbed Creation 100 disjoint clusters from 200,000 documents to create topic specific

collections. uniform-50cols:

50 collections. Each of the 200,000 documents is randomly assigned to 10 different

collections. Total of 2 million documents.

skewed-100cols: 100 collections. Each of the 100 clusters is randomly assigned to 10 different collections. Total of 2 million documents. As each cluster is assigned to multiple collections, topic specific overlap

among collections is more prominent in this testbed compared to uniform-50cols.

Collection Size and Relevance Statistics Testbed 1

Testbed 2

Mean Relevant Documents

0

5

10

15

20

25

30

1 11 21 31 41

Collection

Mean

Rele

van

t D

ocu

men

ts

uniform-50cols skewed-50cols

Collection Overlap Statisticsuniform-50cols skewed-100cols

Tested Methods COSCO, ReDDE and ROSCO.

Greedy Ideal for establishing performance bound

Setting up COSCO 40 training queries to each of the collection

Setting up ROSCO and ReDDE Training Queries: 25 queries for each collection. Sample size: 10% of the actual collections. 10 size estimates Duplicate detection: GHV containing 32 vectors of 2 bits each (total of 64 bits). Mismatches allowed: 0 mismatch allowed for exact duplicates

Evaluation Recall after each collection called. (Central evaluation and TREC evaluation) Processing time.

Greedy Ideal This method attempts to greedily maximize the

percentage recall assuming oracular information.

It is used for establishing performance bound and as a baseline ranking method in evaluation.

Experimental Results (Central Evaluation) 10 queries different from training queries for evaluation. 5-fold cross validation Evaluation metric:

For both the testbeds ROSCO performs better than ReDDE and COSCO by 7-8% in terms of recall metric R.

Ranking by a particular method

Ranking by the baseline method

Experimental Results (TREC Evaluation)

For both testbeds ROSCO is performing better than ReDDE and ROSCO in terms of recall metric R.

As skewed-100col testbed is created by topic specific clusters, ROSCO shows more improvement compared to uniform-50col testbed over other approaches.

Experimental Results (Processing Cost)

Processing time for ReDDE and ROSCO is more compared to COSCO. But no. of collections called by ReDDE and ROSCO are less for same amount of recall.

Summary of Experimental Results Evaluated ROSCO, ReDDE and COSCO on two different testbeds

with overlapping collections.

ROSCO shows improvement over ReDDE and COSCO by 7-8% for central evaluations on both testbeds. TREC evaluation: 3-5% on uniform-50cols and 8-10% on clustered-100cols.

Processing time for ReDDE and ROSCO is more compared to COSCO. But no. of collections called by ReDDE and ROSCO are less for same amount of recall.

Outline


Other Contributions (QPIAD Project)Id Make Model Year Body

1 Audi A4 2001 Convt

2 BMW Z4 2002 Convt

3 Porsche Boxster 2005 Convt

4 BMW Z4 2003 NULL

5 Honda Civic 2004 NULL

6 Toyota Camry 2002 Sedan

7 Audi A4 2006 NULL

F Measure based query rewriting for incomplete autonomous web databases

Given a query Q:(Body Style=Convt) retrieve all relevant tuples

Id Make Model Year Body Confidence

4 BMW Z4 2003 NULL 0.7

7 Audi A4 2006 NULL 0.3

Ranked Relevant Uncertain Answers

Select Top K Rewritten Queries

Q1’: Model=A4

Q2’: Model=Z4

Q3’: Model=BoxsterRe-order queries based on Estimated Precision

Id Make Model Year Body

1 Audi A4 2001 Convt

2 BMW Z4 2002 Convt

3 Porsche

Boxster

2005 Convt

AFD: Model~> Body style

Other Contributions (QPIAD Project) Sources may impose resource limitations

on the # of queries we can issue

Therefore, we should select only the top-K

queries while ensuring the proper balance

between precision and recall

SOLUTION: Use F-Measure based

selection with configurable alpha parameter α=1 P = R α<1 P > R

α>1 P < R

JOINS

RP

RPF

1

P – Estimated Precision

R – Estimated Recall (based on P & Est. Sel.)

F Measure based query rewriting for incomplete autonomous web databases

Co-author on VLDB 2007 research paper

Other Contributions (QUIC Project)

R Metric

0

0.05

0.1

0.15

0.2

0.25

0.3

0.35

Make/Year/Price Make/Year/Mileage Model/Year/Price Body/Year/Price Model/Year

w/o unconstrained attributes

with unconstrained attributes

Given a query Q: model = Civic, an Accord with sedan body style may bemore relevant than Civic with coupe body style.

Handling unconstrained attributes in presence of query imprecision and data incompleteness

Tuples matching user query can be ranked based on unconstrained attributes.[Surajit Chaudhuri, Gautam Das, Vagelis Hristidis and Gerhard Weikum, 2004]

In absence of query log, relevance for unconstrained attributes can be approximated from database.

92

1r

R10 queries, 13 users

Approach considering unconstrained attributes performs better thanthe one ignoring unconstrained attributes.

Co-author on CIDR 2007 demo paper

Outline


Conclusion

An online method ROSCO for overlap estimation.

Comparison of offline and online approaches for text retrieval in an environment composed of overlapping collections.

Results of empirical evaluation show that online method for overlap estimation performs better than offline method for overlap estimation as well as method which does not consider overlap among collections.

Co-author on two other works appearing in

CIDR – 2007 and VLDB - 2007