Presentation Topic: Searching Distributed Collections With Inference Networks. Keywords: Searching Distributed Collections, Inference Network. COSC 6341 - Information Retrieval.

TRANSCRIPT

Page 1:

Presentation Topic: Searching Distributed Collections With Inference Networks

Keywords: Searching Distributed Collections, Inference Network

COSC 6341 - Information Retrieval

Page 2:

What is Distributed IR?

Homogeneous Collections: a large single collection is partitioned and distributed over a network to improve efficiency. E.g., Google.

Heterogeneous Collections: the Internet offers thousands of diverse collections available for searching on demand. E.g., P2P.

Architectural Aspects

Page 3:

P2P (peer-to-peer) Architecture

Page 4:

How to search such a collection? Consider the entire collection as a single LARGE VIRTUAL collection.

[Figure: collections C1-C6 combined into one virtual collection]

Page 5:

Search each collection individually.

[Figure: queries Search1-Search6 sent to collections C1-C6 of the virtual collection, returning Results1-Results6]

Get the results. How to merge the results? What are the communication costs? How much time is required?
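As a rough illustration of the fan-out step itself (not from the source), the query can be broadcast to every collection in parallel and one ranking gathered per collection; the collection objects and their .name and .search members are hypothetical:

```python
from concurrent.futures import ThreadPoolExecutor

def search_all(query, collections):
    """Broadcast the query to every collection and gather one ranked
    result list per collection.

    Each collection is assumed to be an object with hypothetical
    .name and .search(query) members, the latter returning
    (doc_id, score) pairs.
    """
    with ThreadPoolExecutor(max_workers=len(collections)) as pool:
        futures = {c.name: pool.submit(c.search, query) for c in collections}
        # Waiting inside the with-block keeps the pool alive until done.
        return {name: f.result() for name, f in futures.items()}
```

Merging the per-collection rankings is exactly the problem the following slides address.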

Page 6:

Solution: An IR system must automatically:

1. Rank collections.
2. Collection Selection: select specific collections to be searched.
3. Merge the ranked results effectively.

Page 7:

Ranking Collections

Ranking of collections can be addressed by an inference network. A CORI net (Collection Retrieval Inference Network) is used to rank collections, as opposed to the more common Document Retrieval Inference Network.

Page 8:

Inference Network

[Figure: inference network with document node dj linked to index term nodes k1, k2, ..., ki, ..., kt; query nodes q, q1, q2 combine the term nodes with AND/OR operators and feed the information-need node I]

Document dj has index terms k1, k2, ki. Query q is composed of index terms k1, k2, ki. Boolean formulations: q1 = [(k1 and k2) or ki], q2 = (k1 and k2). Information need: I = q or q1.

Page 9:

Ranking Documents in an Inference Network

Using tf-idf strategies:

Term frequency: tf = f(i, j) = P(ki | dj)
Inverse document frequency: idf_i = P(q | ¯k)

In an inference network, the ranking of a document is computed as:

P(q ∧ dj) = Cj × (1 / |dj|) × Σ_i [ f(i, j) × idf_i × 1 / (1 − f(i, j)) ]

where P(ki | dj) is the influence of keyword ki on document dj, and P(q | ¯k) is the influence of the index terms ki on the query node.
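A minimal sketch of the ranking formula exactly as written on this slide, assuming f(i, j) is a normalized term frequency in [0, 1) and taking |dj| as the number of indexed terms in dj; the Cj default and the dictionary layout are illustrative assumptions:

```python
def document_belief(query_terms, doc_tf, idf, c_j=1.0):
    """P(q AND dj) as written on this slide.

    doc_tf: term -> normalized frequency f(i, j), assumed in [0, 1)
    idf:    term -> idf_i
    c_j:    per-document constant Cj (1.0 is an illustrative default)
    |dj| is taken here as the number of indexed terms in dj.
    """
    total = sum(doc_tf[t] * idf[t] / (1.0 - doc_tf[t])
                for t in query_terms if t in doc_tf)
    return c_j * total / len(doc_tf)
```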

Page 10:

Analogy between a Document Retrieval and a Collection Retrieval Network

Documents:
- tf: term frequency (number of occurrences of a term in a document)
- idf: inverse document frequency (a function of the number of documents containing the term)
- dl: document length
- D: number of documents

Collections:
- df: document frequency (number of documents containing the term in a collection)
- icf: inverse collection frequency (a function of the number of collections containing the term)
- cl: collection length
- C: number of collections

The tf-idf scheme for documents corresponds to the df-icf scheme for collections.

Page 11:

Comparison between Document Retrieval and Collection Retrieval Inference Networks

Document Retrieval Inference Network: retrieves documents based on a query.

Collection Retrieval Inference Network (CORI net): retrieves collections based on a query.

Page 12:

Why use an Inference Network?

One system is used for ranking both documents and collections. Document retrieval becomes a simple process:

1. Use the query to retrieve a ranked list of collections.
2. Select the top group of collections.
3. Search the top group of collections, in parallel or sequentially.
4. Merge the results from the various collections into a single ranking.

To the retrieval algorithm, a CORI net looks like a document retrieval inference network with very big documents: each "document" is a surrogate for a complete collection.

Page 13:

Interesting facts:

The CORI net for a 1.2 GB collection is 5 MB (0.4% of the original collection).

A CORI net built to search the well-known CACM collection (roughly 3,000 documents) shows high values of df and icf, but this does not affect the computational complexity of retrieval.

Page 14:

Experiments on the TREC Collection

T = d_t + (1 − d_t) × log(df + 0.5) / log(max_df + 1.0)

I = log((|C| + 0.5) / cf) / log(|C| + 1.0)

The belief P(rk | ci) in collection ci due to observing term rk is given by:

P(rk | ci) = d_b + (1 − d_b) × T × I

where:

df = number of documents in ci containing term rk
max_df = number of documents containing the most frequent term in ci
|C| = number of collections
cf = number of collections containing term rk
d_t = minimum term frequency component when term rk occurs in collection ci
d_b = minimum belief component when term rk occurs in collection ci
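The belief computation above translates directly into code. This is a minimal sketch, not the authors' implementation; the default values for d_t and d_b are illustrative assumptions, not taken from the slide:

```python
import math

def cori_belief(df, max_df, cf, num_collections, d_t=0.4, d_b=0.4):
    """Belief P(rk | ci) that collection ci satisfies term rk.

    df:  documents in ci containing the term
    max_df: documents containing ci's most frequent term
    cf:  collections containing the term
    num_collections: |C|
    d_t, d_b: minimum term-frequency / belief components
              (defaults here are illustrative assumptions)
    """
    T = d_t + (1 - d_t) * math.log(df + 0.5) / math.log(max_df + 1.0)
    I = math.log((num_collections + 0.5) / cf) / math.log(num_collections + 1.0)
    return d_b + (1 - d_b) * T * I
```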

Page 15:

Effectiveness

This approach to ranking collections was evaluated using the INQUERY retrieval system and the TREC collection. Experiments were conducted with 100 queries (TREC Volume 1, Topics 51-150). The mean squared error of the collection ranking for a single query is calculated as:

(1 / |C|) × Σ_{i ∈ C} (Oi − Ri)²

where:

Oi = the optimal rank for collection i, based on the number of relevant documents it contains
Ri = the rank for collection i determined by the retrieval algorithm
C = the set of collections being ranked
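For concreteness, a small sketch of this error measure (the helper is hypothetical, not the authors' code):

```python
def ranking_mse(optimal_ranks, system_ranks):
    """Mean squared error between optimal and system-assigned collection
    ranks; both arguments map collection id -> rank (1 = best)."""
    cs = list(optimal_ranks)
    return sum((optimal_ranks[i] - system_ranks[i]) ** 2 for i in cs) / len(cs)
```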

Page 16:

Results

Mean squared error averaged over the first 50 queries = 2.3471. The ranking for almost 75% of the queries was perfect; for the remaining 25%, the ranks were disorganized.

Reason: ranking collections is not exactly the same as ranking documents. Scaling df by max_df penalizes collections with small sets of interesting documents. So the modification is to scale df by df + K instead, where:

K = k × ((1 − b) + b × cw / ¯cw)

cw = number of words in the collection, ¯cw = the mean cw over the collections being ranked, and k, b are constants (b ∈ [0, 1]).

Thus, T = d_t + (1 − d_t) × df / (df + K)
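A minimal sketch of the modified term component; the defaults for k and b follow the best combination reported on the next slide, while the d_t default is an illustrative assumption:

```python
def scaled_term_component(df, cw, mean_cw, k=200, b=0.75, d_t=0.4):
    """Modified T with df scaled by df + K instead of by max_df.

    cw:      number of words in this collection
    mean_cw: mean collection word count
    k, b:    constants (defaults are the best combination on the
             next slide); d_t's default is an assumption.
    """
    K = k * ((1 - b) + b * cw / mean_cw)
    return d_t + (1 - d_t) * df / (df + K)
```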

Page 17:

Modified results

Best combination of b and k: b = 0.75 and k = 200.

Mean squared error averaged over 50 queries = 1.4586 (38% better than the previous results).

Rankings for 30 queries improved; rankings for 8 queries changed slightly; rankings for 12 queries did not change.

Page 18:

Merging the Results

Four approaches, in order of increasing effectiveness:

1. Interleaving
2. Raw scores
3. Normalized scores
4. Weighted scores

Page 19:

[Figure: a single large homogeneous collection D1-D60 partitioned into C1 (D1-D10), C2 (D11-D20), C3 (D21-D30), C4 (D31-D40), C5 (D41-D50), C6 (D51-D60)]

Step 1: Rank the collections. Suppose the searched collections return D1, D5, D7 (from C1); D44, D37 (from C4); and D60, D52, D57, D59 (from C6).

Step 2: Merge the results.

Interleaving takes one document from each collection's list in turn (D1, D60, D44, ...). This scheme is not satisfying, as we have only document rankings and collection rankings.

Raw scores: assign each document its score, e.g., D1 = 10, D5 = 12, D7 = 37; D44 = 29, D37 = 69; D60 = 90, D52 = 32, D57 = 25, D59 = 1. Sorting on the raw scores gives D60 (90), D37 (69), D7 (37), D52 (32), D44 (29), D57 (25), D5 (12), D1 (10), D59 (1).

But again, these scores from different collections may not be directly comparable...

Page 20:

Normalized Scores

In an inference network, normalizing scores requires a preprocessing step prior to query evaluation.

Preprocessing: the system obtains from each collection statistics about how many documents each query term matches. The statistics are merged to obtain a normalized idf (a global weighting scheme).

Problem: high communication and computational costs (in a widely distributed network).
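A minimal sketch of this preprocessing step, assuming each collection can report its document count and per-term document frequencies (the doc_count and df accessors are hypothetical, not from the source):

```python
import math

def global_idf(query_terms, collections):
    """Merge per-collection df statistics into one global idf table.

    Each collection is assumed to expose hypothetical .doc_count
    and .df(term) accessors. Every term requires one statistics
    exchange per collection, which is where the communication
    cost comes from.
    """
    total_docs = sum(c.doc_count for c in collections)
    idf = {}
    for term in query_terms:
        df = sum(c.df(term) for c in collections)
        idf[term] = math.log(total_docs / (df + 1))
    return idf
```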

Page 21:

Weighted Scores

Weights can be based on the document's score and/or the collection ranking information. This offers computational simplicity. The weight W scales results from different collections:

W = 1 + |C| × (s − ¯s) / ¯s

where:

|C| = the number of collections searched
s = the collection's score (not its rank)
¯s = the mean of the collection scores

Assumption: similar collections have similar weights.

Page 22:

Weighted Scores

rank(document) = score(document) × weight(collection)

This method favors documents from collections with high scores, but also allows a good document from a poor collection to rank well [which is what we are looking for].
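A minimal sketch of the weighted merge, combining the collection weight from the previous slide with the per-document scores (the dictionary layout is an assumption for illustration):

```python
def merge_weighted(result_lists, collection_scores):
    """Merge per-collection results by weighted document score.

    result_lists:      {collection_id: [(doc_id, score), ...]}
    collection_scores: {collection_id: collection score s}
    """
    n = len(collection_scores)
    mean_s = sum(collection_scores.values()) / n
    merged = []
    for cid, docs in result_lists.items():
        # W = 1 + |C| * (s - mean(s)) / mean(s), from the previous slide
        w = 1 + n * (collection_scores[cid] - mean_s) / mean_s
        merged.extend((doc_id, score * w) for doc_id, score in docs)
    return sorted(merged, key=lambda pair: pair[1], reverse=True)
```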

Page 23:

Comparing results of the merging methods (Source: TREC Collection Volume 1, Topics 51-100):

Merge technique   Average precision (11-point recall)   R-precision
Normalized        37.76                                 41.96
Interleaved       17.54                                 25.42
Raw Score         33.64                                 38.67
Weighted          38.18                                 42.46

Page 24:

Merging Results - Pros and Cons

Interleaving: extremely ineffective, with losses in average precision. (The reason is that documents ranked high from a non-relevant collection may end up near high-ranked documents from more relevant collections.)

Raw Scores: scores from different collections may not be directly comparable (e.g., idf weights): heavy use of a term within one collection can penalize documents from that collection.

Normalized Scores: comes closest to searching a single collection, but normalizing has significant communication and computational costs when collections are distributed across a wide-area network.

Weighted Scores: as effective as normalized scores, but less robust (introduces deviations in recall and precision).

Page 25:

Collection Selection

Approaches:

- Top n collections
- Any collection with a score greater than some threshold
- Top group (based on clustering)
- Cost-based selection

Page 26:

Results: difference between searching the 2 best clusters and searching all 7 collections.

Topics               51-100     101-150
R-precision          -0.9%      -5.6%
11-point precision   -2.2%      -9.1%
Document cut-off     500-1000   200

(For Topics 51-100 the differences stay under 5%; for Topics 101-150 they exceed 5%.) Eliminating collections reduces recall.

Page 27:

Related work on collection selection: Group Manually and Select Manually

Collections are organized into groups with a common theme, e.g., financial, technology, appellate court decisions. Users select which group to search. Found in commercial service providers, and used by experienced users such as librarians.

Groupings determined manually:
(-) time consuming, inconsistent groupings, coarse groupings, not good for unusual information needs

Groupings determined automatically: broker agents maintain a centralized cluster index by periodically querying collections on each subject.
(+) automatic creation, better consistency
(-) coarse groupings, not good for unusual information needs

Page 28:

Rule-Based Selection

The contents of each collection are described in a knowledge base. A rule-based system selects the collections for a query.

EXPERT CONIT, a research system, was tested on static and homogeneous collections.

(-) time consuming to create
(-) inconsistent selection if rules change
(-) coarse groupings, so not good for unusual information needs

Page 29:

Optimization: Represent a Collection With a Subset of Terms

Build the inference network from only the most frequent terms; at least the top 20% most frequent words must be included.

Page 30:

Proximity Information

Proximity of terms can be handled by a CORI net, but the CORI net for one collection would then be about 30% of the size of the original collection (versus the 0.4% seen earlier without proximity information).

Page 31:

Retrieving Few Documents

Usually a user is interested in the first 10, or at most 20, results; the rest are discarded (a waste of resources and time). Example:

[Figure: collections C1 (10 docs), C2 (20 docs), C3 (30 docs) each return their top 10; the results are merged and the top 10 merged results go to the user]

Collections: C = 3. Rankings of interest: n = 10. Documents retrieved = C × n = 3 × 10 = 30. Documents discarded = (C − 1) × n = (3 − 1) × 10 = 20.

20 documents are thrown out without the user ever looking at them.

Page 32:

Experiments and results

The number of documents R retrieved from the ith ranked collection is:

R(i) = M × n × 2(1 + C − i) / (C × (C + 1))

where M ∈ [1, (C + 1)/2] and M × n = the number of documents to be retrieved from all collections.
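A quick sketch of this allocation, written from the formula above (not from the authors' code); it also confirms that the per-collection amounts sum to M × n:

```python
def docs_to_retrieve(i, C, n, M):
    """Documents to request from the i-th ranked of C collections."""
    return M * n * 2 * (1 + C - i) / (C * (C + 1))

# With C = 5, M = 2, as in the next slide's summary:
allocation = [docs_to_retrieve(i, C=5, n=1, M=2) for i in range(1, 6)]
print(allocation)       # approx [0.67, 0.53, 0.40, 0.27, 0.13] (times n)
print(sum(allocation))  # 2.0 = M * n
```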

Page 33:

Result summary. Collection: TREC Volume 1, with C = 5 and M = 2.

Documents retrieved per collection (n = number of results reported):

Collection       C1      C2      C3      C4      C5
Docs retrieved   0.67n   0.53n   0.40n   0.27n   0.13n

For n = 1000: 4050 documents retrieved before, 2000 after, a savings of 51%.

M = 1: same results for ranks 1-200. M = 2: identical results for ranks 1-500. M = 3: better results for ranks 500-1000.

Page 34:

Conclusions

- Representing collections by terms and frequencies is effective; controlled vocabularies and schemas are not necessary.
- Collections and documents can be ranked with one algorithm (using different statistics), e.g., GlOSS, inference networks.
- Rankings from different collections can be merged efficiently: with precisely normalized scores (Infoseek's method), or without precisely normalized document scores, with only minimal effort and minimal communication between client and server.
- Large-scale distributed retrieval can be accomplished now.

Page 35:

Open Problems

- Multiple representations: stemming, stopwords, query processing, indexing
- Cheating / spamming
- How to integrate relevance feedback, query expansion, browsing

Page 36:

References:

Primary source: Searching Distributed Collections With Inference Networks. James P. Callan, Zhihong Lu, W. Bruce Croft.

Secondary sources:
1. Distributed Information Retrieval. James Allan, University of Massachusetts Amherst.
2. Methodologies for Distributed Information Retrieval. Owen de Kretser, Alistair Moffat, Tim Shimmin, Justin Zobel. (Proceedings of the 18th International Conference on Distributed Computing Systems.)

Page 37:

Questions?