Post on 18-Dec-2015
Comparing Offline and Online Statistics Estimation for Text Retrieval from Overlapped Collections
MS Thesis Defense
Bhaumik Chokshi
Committee Members:
Prof. Subbarao Kambhampati (Chair)
Prof. Yi Chen
Prof. Hasan Davulcu
My MS Work
Collection Selection : ROSCO
Query Processing over Incomplete Autonomous Databases: QPIAD
Handling Query Imprecision and Data Incompleteness: QUIC
Multi Source Information Retrieval
In the multi-source information retrieval problem, searching every information source is not efficient. The retrieval system must choose one collection, or a subset of collections, to call to answer a given query.
Overlapping Collections
Many real-world collections have significant overlap. For example, multiple bibliography collections (e.g., ACMDL, IEEE, DBLP) may store some of the same papers, and multiple news archives (e.g., New York Times, Washington Post) may store very similar news stories.
[Figure: overlapping collections CSB, IEEE, ACM, DBLP, Science]
• How likely it is that a given collection has documents relevant to the query.
• Whether a collection will provide novel results given the collections already selected.
Related Work
• Most collection selection approaches do not consider overlap.
  • Existing systems like CORI and ReDDE try to create a representative for each collection based on term and document frequency information.
  • ReDDE uses collection samples to estimate the relevance of each collection. The same samples can be used to estimate overlap among collections.
• 16.6% of the documents in runs submitted to the TREC 2004 terabyte track were redundant. [Bernstein and Zobel, 2005]
• Coverage and overlap statistics have been used in the context of relational data sources. [Nie and Kambhampati, 2004]
  • Overlap among tuples can be identified in a much more straightforward way than overlap among text documents.
Challenges Involved
• Need for query-specific overlap
  • Two collections may have low overlap as a whole but high overlap for a particular set of queries.
• Overlap assessment: offline vs. online
  • An offline approach can store statistics for general keywords and map an incoming query to these keywords to obtain relevance and overlap statistics.
  • An online approach can use the samples to estimate relevance and overlap statistics.
• Efficiently determining true overlap between collections
  • True overlap between collections can be estimated using result-to-result comparison for different collections.
COSCO
Context of this work
• COSCO takes overlap into account while determining the collection order, but it does so offline.
• Samples built for the collections can be used to estimate overlap statistics, which can give a better estimate because it is specific to the query at hand.
• COSCO estimates overlap using bag similarity over result-set documents.
• True overlap between collections can be obtained using result-to-result comparison.
• COSCO does not report experiments on TREC data.
Contributions
ROSCO, an online approach which estimates overlap statistics from the samples of the collections.
Comparison of offline (COSCO) and online (ROSCO) approaches for statistics estimation for text retrieval from overlapping collections.
Outline
• COSCO and ROSCO Architecture
• ROSCO Approach
• Empirical Evaluation
• Other Contributions
• Conclusion
COSCO Architecture
ROSCO Architecture
ROSCO (Offline Component)
Collection representation through query-based sampling
[Figure: training queries are sent to collections C1 and C2; the returned documents form samples S1 and S2, which are combined into the union of samples.]
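ROSCO (Offline Component)
Collection Size Estimation
[Figure: random queries are sent to collections C1 and C2 and to their samples S1 and S2 to produce the size estimates.]
$$C_i.\mathit{EstimatedSize} = \operatorname{average}\left(\frac{d_{C_i}}{d_{S_i}}\right) \cdot S_i.\mathit{size}$$
where $d_{C_i}$ is the number of documents returned from collection $C_i$ and $d_{S_i}$ is the number of documents returned from sample $S_i$.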
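The size estimate can be illustrated with a minimal sketch. It assumes the estimate is the average, over the random queries, of the ratio between the documents returned by the collection and by its sample, scaled by the sample size. The function name and the example counts are hypothetical and are not taken from the thesis.

```python
from statistics import mean


def estimate_collection_size(returned_from_collection, returned_from_sample, sample_size):
    """Scale the sample size by the average ratio of result counts.

    Each position in the two lists corresponds to one random probe query.
    """
    ratios = [
        d_c / d_s
        for d_c, d_s in zip(returned_from_collection, returned_from_sample)
        if d_s > 0  # skip probes that match nothing in the sample
    ]
    if not ratios:
        raise ValueError("no probe query matched the sample")
    return mean(ratios) * sample_size


# Hypothetical result counts for three random probe queries.
d_collection = [120, 300, 90]
d_sample = [12, 30, 9]
estimate = estimate_collection_size(d_collection, d_sample, sample_size=1000)
```

Probes that return nothing from the sample are skipped here to avoid division by zero; how the original system treats such probes is not stated in the slides.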
ROSCO (Offline Component)
Grainy Hash Vector
[Figure: each sample document is hashed into a grainy hash vector (GHV); labels: w bits, n bits.]
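The slides do not show how a GHV is built, so the following is only a toy sketch of the general idea: a document is reduced to a fixed number of small hash slots, and two documents are treated as duplicates when few enough slots differ. The construction (MD5 of each term, additive folding into slots) is an assumption chosen for illustration and is not the thesis's method. Only the default shape, 32 vectors of 2 bits with 0 mismatches allowed, comes from the experimental setup described later in the talk.

```python
import hashlib


def grainy_hash_vector(text, n_vectors=32, bits_per_vector=2):
    """Build a toy GHV: n_vectors slots, each reduced to bits_per_vector bits."""
    modulus = 1 << bits_per_vector
    slots = [0] * n_vectors
    for term in text.lower().split():
        digest = hashlib.md5(term.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        slot = h % n_vectors
        # Addition is commutative, so word order does not change the vector.
        slots[slot] = (slots[slot] + h // n_vectors) % modulus
    return tuple(slots)


def mismatches(ghv_a, ghv_b):
    """Count the slots in which two GHVs differ."""
    return sum(1 for a, b in zip(ghv_a, ghv_b) if a != b)


def is_duplicate(ghv_a, ghv_b, allowed_mismatches=0):
    """Treat two documents as duplicates if at most allowed_mismatches slots differ."""
    return mismatches(ghv_a, ghv_b) <= allowed_mismatches


ghv_1 = grainy_hash_vector("overlap statistics for text collections")
ghv_2 = grainy_hash_vector("text collections for overlap statistics")
```

Raising `allowed_mismatches` above 0 would relax the comparison from exact duplicates toward near duplicates.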
ROSCO (Online Component)
Assessing Relevance
[Figure: the query is sent to the union of samples (S1, S2); together with the size estimates, the results determine the top-k relevant documents for each collection.]
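One way the size estimates could enter the relevance step is sketched below in the style of ReDDE: each sample document stands in for (estimated collection size / sample size) documents of its collection, and the ranked list from the union of samples is walked until k estimated documents are covered. This is an assumption about the mechanics, and all names and numbers are hypothetical.

```python
def estimate_relevant_per_collection(ranked_sample_docs, size_estimates, sample_sizes, k):
    """Return {collection: estimated number of its documents among the top k}.

    ranked_sample_docs: (doc_id, collection) pairs, best match first, as
    returned by running the query against the union of samples.
    """
    estimates = {coll: 0.0 for coll in size_estimates}
    covered = 0.0
    for _doc_id, coll in ranked_sample_docs:
        if covered >= k:
            break
        # One sample document represents this many collection documents.
        weight = size_estimates[coll] / sample_sizes[coll]
        take = min(weight, k - covered)
        estimates[coll] += take
        covered += take
    return estimates


# Hypothetical ranking over the union of two samples.
ranking = [("d1", "C1"), ("d2", "C2"), ("d3", "C1"), ("d4", "C2")]
relevant = estimate_relevant_per_collection(
    ranking,
    size_estimates={"C1": 1000, "C2": 500},
    sample_sizes={"C1": 100, "C2": 100},
    k=20,
)
```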
ROSCO (Online Component)
Assessing Overlap and Combining with Relevance
[Figure: the size estimates, the GHVs of the top-k documents of each collection, and the GHVs of documents of the collections selected so far are used to estimate the number of relevant new documents for each collection; the collection with the maximum number of new relevant documents is selected.]
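The selection step described above can be sketched as a greedy loop. The sketch assumes exact-match fingerprints (0 mismatches), so the documents already covered can be kept in a set, and it assumes a per-collection scale factor derived from the size estimates. The function name, the tie-breaking rule, and the data are hypothetical.

```python
def order_collections(top_docs, scale):
    """Greedily order collections by estimated number of new relevant documents.

    top_docs: {collection: list of fingerprints of its top-k sample documents}
    scale:    {collection: estimated collection size / sample size}
    """
    seen = set()
    remaining = set(top_docs)
    order = []
    while remaining:
        def new_relevant(coll):
            # Fingerprints not yet covered by the collections chosen so far.
            return len(set(top_docs[coll]) - seen) * scale[coll]

        # sorted() makes tie-breaking deterministic (alphabetical order).
        best = max(sorted(remaining), key=new_relevant)
        order.append(best)
        seen.update(top_docs[best])
        remaining.remove(best)
    return order


# Hypothetical fingerprints: A is fully contained in B, C is disjoint.
top_docs = {
    "A": ["f1", "f2", "f3"],
    "B": ["f1", "f2", "f3", "f4"],
    "C": ["f5", "f6"],
}
selection_order = order_collections(top_docs, scale={"A": 1.0, "B": 1.0, "C": 1.0})
```

In this example A has more top-k documents than C, yet it is called last because everything it offers is already covered once B has been selected.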
Comparison of ROSCO and COSCO
COSCO:
• Offline method for estimating coverage and overlap statistics.
• Gets the estimate for a query by using statistics for the corresponding frequent item sets. Statistics for “data mining integration” can be obtained by using statistics from “data mining” and “data integration”.
• This way of computing statistics can lead to an estimate that differs substantially from the actual statistics.
ROSCO:
• Online method for estimating coverage and overlap statistics.
• Gets the estimate by sending the query to the samples, which can give a better estimate for the particular query at hand.
• Success of this approach depends on the quality of the sample. Sometimes it can be hard to obtain a good sample of a collection.
Empirical Evaluation
Whether ROSCO can perform better in an environment of overlapping text collections compared to the approaches which do not consider overlap.
Compare ROSCO and COSCO in presence of overlap among collections.
Testbed Creation
Test Data
• TREC Genomics data.
• 50 queries with their relevance judgments.
Testbed Creation
• 100 disjoint clusters from 200,000 documents to create topic-specific collections.
• uniform-50cols:
  • 50 collections.
  • Each of the 200,000 documents is randomly assigned to 10 different collections.
  • Total of 2 million documents.
• skewed-100cols:
  • 100 collections.
  • Each of the 100 clusters is randomly assigned to 10 different collections.
  • Total of 2 million documents.
  • As each cluster is assigned to multiple collections, topic-specific overlap among collections is more prominent in this testbed than in uniform-50cols.
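The two assignment schemes can be sketched with one helper that copies each unit into a fixed number of distinct, randomly chosen collections. For uniform-50cols a unit is a single document; for skewed-100cols a unit is a whole cluster. The sketch uses a small stand-in corpus of 1,000 document ids, and the clustering shown is a placeholder rather than the one used in the thesis.

```python
import random


def assign_units(units, n_collections, copies=10, seed=0):
    """Copy every unit (a list of doc ids) into `copies` distinct collections."""
    rng = random.Random(seed)
    collections = [[] for _ in range(n_collections)]
    for unit in units:
        # sample() picks distinct collection indices, so a unit is never
        # placed twice in the same collection.
        for idx in rng.sample(range(n_collections), copies):
            collections[idx].extend(unit)
    return collections


docs = list(range(1000))  # stand-in corpus

# uniform-50cols style: every document is assigned on its own.
uniform = assign_units([[d] for d in docs], n_collections=50)

# skewed-100cols style: whole clusters are assigned together.
clusters = [docs[i::100] for i in range(100)]  # placeholder for 100 disjoint clusters
skewed = assign_units(clusters, n_collections=100)
```

Because documents of one cluster always travel together in the second scheme, any two collections that share a cluster overlap on that entire topic.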
Collection Size and Relevance Statistics
[Figures: collection statistics for Testbed 1 and Testbed 2. Plot of mean relevant documents (y-axis, 0 to 30) per collection (x-axis); series: uniform-50cols, skewed-50cols.]
Collection Overlap Statistics
[Figures: uniform-50cols, skewed-100cols]
Tested Methods
• COSCO, ReDDE and ROSCO.
• Greedy Ideal for establishing a performance bound.
Setting up COSCO
• 40 training queries to each of the collections.
Setting up ROSCO and ReDDE
• Training queries: 25 queries for each collection.
• Sample size: 10% of the actual collections.
• 10 size estimates.
• Duplicate detection: GHV containing 32 vectors of 2 bits each (total of 64 bits).
• Mismatches allowed: 0 (exact duplicates only).
Evaluation
• Recall after each collection called (central evaluation and TREC evaluation).
• Processing time.
Greedy Ideal
• This method attempts to greedily maximize the percentage recall, assuming oracular information.
• It is used for establishing a performance bound and as a baseline ranking method in evaluation.
Experimental Results (Central Evaluation)
• 10 queries different from the training queries for evaluation.
• 5-fold cross validation.
• Evaluation metric: recall metric R [equation not preserved; its labeled terms are "Ranking by a particular method" and "Ranking by the baseline method"].
• For both testbeds, ROSCO performs better than ReDDE and COSCO by 7-8% in terms of the recall metric R.
Experimental Results (TREC Evaluation)
For both testbeds, ROSCO performs better than ReDDE and COSCO in terms of the recall metric R.
As the skewed-100cols testbed is created from topic-specific clusters, ROSCO shows more improvement over the other approaches on it than on the uniform-50cols testbed.
Experimental Results (Processing Cost)
Processing time for ReDDE and ROSCO is higher than for COSCO, but ReDDE and ROSCO call fewer collections for the same amount of recall.
Summary of Experimental Results
• Evaluated ROSCO, ReDDE and COSCO on two different testbeds with overlapping collections.
• ROSCO shows an improvement over ReDDE and COSCO of 7-8% for central evaluation on both testbeds.
• TREC evaluation: 3-5% on uniform-50cols and 8-10% on skewed-100cols.
• Processing time for ReDDE and ROSCO is higher than for COSCO, but ReDDE and ROSCO call fewer collections for the same amount of recall.
Other Contributions (QPIAD Project)

Id  Make     Model    Year  Body
1   Audi     A4       2001  Convt
2   BMW      Z4       2002  Convt
3   Porsche  Boxster  2005  Convt
4   BMW      Z4       2003  NULL
5   Honda    Civic    2004  NULL
6   Toyota   Camry    2002  Sedan
7   Audi     A4       2006  NULL

F Measure based query rewriting for incomplete autonomous web databases
Given a query Q: (Body Style = Convt), retrieve all relevant tuples.

Ranked Relevant Uncertain Answers
Id  Make  Model  Year  Body  Confidence
4   BMW   Z4     2003  NULL  0.7
7   Audi  A4     2006  NULL  0.3

Select Top K Rewritten Queries
Q1’: Model=A4
Q2’: Model=Z4
Q3’: Model=Boxster
Re-order queries based on Estimated Precision

Id  Make     Model    Year  Body
1   Audi     A4       2001  Convt
2   BMW      Z4       2002  Convt
3   Porsche  Boxster  2005  Convt

AFD: Model ~> Body style
Other Contributions (QPIAD Project)
• Sources may impose resource limitations on the number of queries we can issue.
• Therefore, we should select only the top-K queries while ensuring the proper balance between precision and recall.
• SOLUTION: Use F-Measure based selection with a configurable alpha parameter:
  • α = 1: P = R
  • α < 1: P > R
  • α > 1: P < R
JOINS
$$F = \frac{(1+\alpha) \cdot P \cdot R}{\alpha \cdot P + R}$$
P – Estimated Precision
R – Estimated Recall (based on P & Est. Sel.)
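A minimal sketch of F-measure based selection follows. It assumes the weighted form F = (1 + α)·P·R / (α·P + R), which agrees with the α cases listed on the slide: α below 1 favors precision and α above 1 favors recall. The precision and recall estimates attached to the three rewritten queries are hypothetical numbers chosen for illustration.

```python
def f_measure(precision, recall, alpha=1.0):
    """Weighted F-measure; alpha < 1 favors precision, alpha > 1 favors recall."""
    denominator = alpha * precision + recall
    if denominator == 0:
        return 0.0
    return (1 + alpha) * precision * recall / denominator


def select_top_k(queries, k, alpha=1.0):
    """Pick the k queries with the highest F-measure, then order them by precision.

    queries: list of (query, estimated_precision, estimated_recall) tuples.
    """
    by_f = sorted(queries, key=lambda q: f_measure(q[1], q[2], alpha), reverse=True)
    chosen = by_f[:k]
    chosen.sort(key=lambda q: q[1], reverse=True)  # re-order by estimated precision
    return [q[0] for q in chosen]


# Hypothetical estimates for the rewritten queries.
rewritten = [
    ("Model=A4", 0.3, 0.6),
    ("Model=Z4", 0.7, 0.2),
    ("Model=Boxster", 0.9, 0.05),
]
balanced = select_top_k(rewritten, k=2, alpha=1.0)
precision_heavy = select_top_k(rewritten, k=2, alpha=0.1)
```

With α = 1 the high-recall query Model=A4 makes the cut; with α = 0.1 it is replaced by the high-precision query Model=Boxster.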
F Measure based query rewriting for incomplete autonomous web databases
Co-author on VLDB 2007 research paper
Other Contributions (QUIC Project)
[Figure: R metric (y-axis, 0 to 0.35) for the attribute sets Make/Year/Price, Make/Year/Mileage, Model/Year/Price, Body/Year/Price and Model/Year; series: w/o unconstrained attributes, with unconstrained attributes.]
Given a query Q: model = Civic, an Accord with sedan body style may be more relevant than a Civic with coupe body style.
Handling unconstrained attributes in presence of query imprecision and data incompleteness
Tuples matching the user query can be ranked based on unconstrained attributes. [Surajit Chaudhuri, Gautam Das, Vagelis Hristidis and Gerhard Weikum, 2004]
In absence of query log, relevance for unconstrained attributes can be approximated from database.
$$R = \sum_i \frac{r_i}{2^{(i-1)/9}}$$
where $r_i$ is the relevance score of the tuple at rank $i$.
10 queries, 13 users
The approach considering unconstrained attributes performs better than the one ignoring unconstrained attributes.
Co-author on CIDR 2007 demo paper
Conclusion
An online method ROSCO for overlap estimation.
Comparison of offline and online approaches for text retrieval in an environment composed of overlapping collections.
Results of the empirical evaluation show that the online method for overlap estimation performs better than the offline method, and better than a method that does not consider overlap among collections.
Co-author on two other works, appearing in CIDR 2007 and VLDB 2007.