Presentation Topic: Searching Distributed Collections With Inference Networks. Keywords: Searching Distributed Collections, Inference Network. COSC 6341 - Information Retrieval.

TRANSCRIPT

Page 1:

Presentation Topic: Searching Distributed Collections With Inference Networks

Keywords: Searching Distributed Collections, Inference Network

COSC 6341 - Information Retrieval

Page 2:

What is Distributed IR?

Homogeneous Collections: a large single collection is partitioned and distributed over a network to improve efficiency. E.g., Google.

Heterogeneous Collections: the Internet offers thousands of diverse collections available for searching on demand. E.g., P2P.

Architectural Aspects

Page 3:

P2P (peer-to-peer) Architecture

Page 4:

How to search such a collection? Consider the entire collection as a single LARGE VIRTUAL collection.

[Figure: collections C1-C6 combined into one virtual collection]

Page 5:

Search each collection individually.

[Figure: queries Search1-Search6 sent to collections C1-C6 of the virtual collection, returning Results1-Results6]

Get the results. How to merge the results? What are the communication costs? How much time is required?
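As a rough illustration of the fan-out step itself (not from the source), the query can be broadcast to every collection in parallel and one ranking gathered per collection; the collection objects and their .name and .search members are hypothetical:

```python
from concurrent.futures import ThreadPoolExecutor

def search_all(query, collections):
    """Broadcast the query to every collection and gather one ranked
    result list per collection.

    Each collection is assumed to be an object with hypothetical
    .name and .search(query) members, the latter returning
    (doc_id, score) pairs.
    """
    with ThreadPoolExecutor(max_workers=len(collections)) as pool:
        futures = {c.name: pool.submit(c.search, query) for c in collections}
        # Waiting inside the with-block keeps the pool alive until done.
        return {name: f.result() for name, f in futures.items()}
```

Merging the per-collection rankings is exactly the problem the following slides address.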

Page 6:

Solution: An IR system must automatically:

1. Rank collections.
2. Collection Selection: select specific collections to be searched.
3. Merge the ranked results effectively.

Page 7:

Ranking Collections

Ranking of collections can be addressed by an inference network. A CORI net (Collection Retrieval Inference Network) is used to rank collections, as opposed to the more common Document Retrieval Inference Network.

Page 8:

Inference Network

[Figure: inference network with document node dj linked to index term nodes k1, k2, ..., ki, ..., kt; query nodes q, q1, q2 combine the term nodes with AND/OR operators and feed the information-need node I]

Document dj has index terms k1, k2, ki. Query q is composed of index terms k1, k2, ki. Boolean formulations: q1 = [(k1 and k2) or ki], q2 = (k1 and k2). Information need: I = q or q1.

Page 9:

Ranking Documents in an Inference Network

Using tf-idf strategies:

Term frequency: tf = f(i, j) = P(ki | dj)
Inverse document frequency: idf_i = P(q | ¯k)

In an inference network, the ranking of a document is computed as:

P(q ∧ dj) = Cj × (1 / |dj|) × Σ_i [ f(i, j) × idf_i × 1 / (1 − f(i, j)) ]

where P(ki | dj) is the influence of keyword ki on document dj, and P(q | ¯k) is the influence of the index terms ki on the query node.
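A minimal sketch of the ranking formula exactly as written on this slide, assuming f(i, j) is a normalized term frequency in [0, 1) and taking |dj| as the number of indexed terms in dj; the Cj default and the dictionary layout are illustrative assumptions:

```python
def document_belief(query_terms, doc_tf, idf, c_j=1.0):
    """P(q AND dj) as written on this slide.

    doc_tf: term -> normalized frequency f(i, j), assumed in [0, 1)
    idf:    term -> idf_i
    c_j:    per-document constant Cj (1.0 is an illustrative default)
    |dj| is taken here as the number of indexed terms in dj.
    """
    total = sum(doc_tf[t] * idf[t] / (1.0 - doc_tf[t])
                for t in query_terms if t in doc_tf)
    return c_j * total / len(doc_tf)
```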

Page 10:

Analogy between a Document Retrieval and a Collection Retrieval Network

Documents:
- tf: term frequency (number of occurrences of a term in a document)
- idf: inverse document frequency (a function of the number of documents containing the term)
- dl: document length
- D: number of documents

Collections:
- df: document frequency (number of documents containing the term in a collection)
- icf: inverse collection frequency (a function of the number of collections containing the term)
- cl: collection length
- C: number of collections

The tf-idf scheme for documents corresponds to the df-icf scheme for collections.

Page 11:

Comparison between Document Retrieval and Collection Retrieval Inference Networks

Document Retrieval Inference Network: retrieves documents based on a query.

Collection Retrieval Inference Network (CORI net): retrieves collections based on a query.

Page 12:

Why use an Inference Network?

One system is used for ranking both documents and collections. Document retrieval becomes a simple process:

1. Use the query to retrieve a ranked list of collections.
2. Select the top group of collections.
3. Search the top group of collections, in parallel or sequentially.
4. Merge the results from the various collections into a single ranking.

To the retrieval algorithm, a CORI net looks like a document retrieval inference network with very big documents: each "document" is a surrogate for a complete collection.

Page 13:

Interesting facts:

The CORI net for a 1.2 GB collection is 5 MB (0.4% of the original collection).

A CORI net built to search the well-known CACM collection (roughly 3,000 documents) shows high values of df and icf, but this does not affect the computational complexity of retrieval.

Page 14:

Experiments on the TREC Collection

T = d_t + (1 − d_t) × log(df + 0.5) / log(max_df + 1.0)

I = log((|C| + 0.5) / cf) / log(|C| + 1.0)

The belief P(rk | ci) in collection ci due to observing term rk is given by:

P(rk | ci) = d_b + (1 − d_b) × T × I

where:

df = number of documents in ci containing term rk
max_df = number of documents containing the most frequent term in ci
|C| = number of collections
cf = number of collections containing term rk
d_t = minimum term frequency component when term rk occurs in collection ci
d_b = minimum belief component when term rk occurs in collection ci
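The belief computation above translates directly into code. This is a minimal sketch, not the authors' implementation; the default values for d_t and d_b are illustrative assumptions, not taken from the slide:

```python
import math

def cori_belief(df, max_df, cf, num_collections, d_t=0.4, d_b=0.4):
    """Belief P(rk | ci) that collection ci satisfies term rk.

    df:  documents in ci containing the term
    max_df: documents containing ci's most frequent term
    cf:  collections containing the term
    num_collections: |C|
    d_t, d_b: minimum term-frequency / belief components
              (defaults here are illustrative assumptions)
    """
    T = d_t + (1 - d_t) * math.log(df + 0.5) / math.log(max_df + 1.0)
    I = math.log((num_collections + 0.5) / cf) / math.log(num_collections + 1.0)
    return d_b + (1 - d_b) * T * I
```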

Page 15:

Effectiveness

This approach to ranking collections was evaluated using the INQUERY retrieval system and the TREC collection. Experiments were conducted with 100 queries (TREC Volume 1, Topics 51-150). The mean squared error of the collection ranking for a single query is calculated as:

(1 / |C|) × Σ_{i ∈ C} (Oi − Ri)²

where:

Oi = the optimal rank for collection i, based on the number of relevant documents it contains
Ri = the rank for collection i determined by the retrieval algorithm
C = the set of collections being ranked
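For concreteness, a small sketch of this error measure (the helper is hypothetical, not the authors' code):

```python
def ranking_mse(optimal_ranks, system_ranks):
    """Mean squared error between optimal and system-assigned collection
    ranks; both arguments map collection id -> rank (1 = best)."""
    cs = list(optimal_ranks)
    return sum((optimal_ranks[i] - system_ranks[i]) ** 2 for i in cs) / len(cs)
```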

Page 16:

Results

Mean squared error averaged over the first 50 queries = 2.3471. The ranking for almost 75% of the queries was perfect; for the remaining 25%, the ranks were disorganized.

Reason: ranking collections is not exactly the same as ranking documents. Scaling df by max_df penalizes collections with small sets of interesting documents. So the modification is to scale df by df + K instead, where:

K = k × ((1 − b) + b × cw / ¯cw)

cw = number of words in the collection, ¯cw = the mean cw over the collections being ranked, and k, b are constants (b ∈ [0, 1]).

Thus, T = d_t + (1 − d_t) × df / (df + K)
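A minimal sketch of the modified term component; the defaults for k and b follow the best combination reported on the next slide, while the d_t default is an illustrative assumption:

```python
def scaled_term_component(df, cw, mean_cw, k=200, b=0.75, d_t=0.4):
    """Modified T with df scaled by df + K instead of by max_df.

    cw:      number of words in this collection
    mean_cw: mean collection word count
    k, b:    constants (defaults are the best combination on the
             next slide); d_t's default is an assumption.
    """
    K = k * ((1 - b) + b * cw / mean_cw)
    return d_t + (1 - d_t) * df / (df + K)
```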

Page 17:

Modified results

Best combination of b and k: b = 0.75 and k = 200.

Mean squared error averaged over 50 queries = 1.4586 (38% better than the previous results).

Rankings for 30 queries improved; rankings for 8 queries changed slightly; rankings for 12 queries did not change.

Page 18:

Merging the Results

Four approaches, in order of increasing effectiveness:

1. Interleaving
2. Raw scores
3. Normalized scores
4. Weighted scores

Page 19:

[Figure: a single large homogeneous collection D1-D60 partitioned into C1 (D1-D10), C2 (D11-D20), C3 (D21-D30), C4 (D31-D40), C5 (D41-D50), C6 (D51-D60)]

Step 1: Rank the collections. Suppose the searched collections return D1, D5, D7 (from C1); D44, D37 (from C4); and D60, D52, D57, D59 (from C6).

Step 2: Merge the results.

Interleaving takes one document from each collection's list in turn (D1, D60, D44, ...). This scheme is not satisfying, as we have only document rankings and collection rankings.

Raw scores: assign each document its score, e.g., D1 = 10, D5 = 12, D7 = 37; D44 = 29, D37 = 69; D60 = 90, D52 = 32, D57 = 25, D59 = 1. Sorting on the raw scores gives D60 (90), D37 (69), D7 (37), D52 (32), D44 (29), D57 (25), D5 (12), D1 (10), D59 (1).

But again, these scores from different collections may not be directly comparable...

Page 20:

Normalized Scores

In an inference network, normalizing scores requires a preprocessing step prior to query evaluation.

Preprocessing: the system obtains from each collection statistics about how many documents each query term matches. The statistics are merged to obtain a normalized idf (a global weighting scheme).

Problem: high communication and computational costs (in a widely distributed network).
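A minimal sketch of this preprocessing step, assuming each collection can report its document count and per-term document frequencies (the doc_count and df accessors are hypothetical, not from the source):

```python
import math

def global_idf(query_terms, collections):
    """Merge per-collection df statistics into one global idf table.

    Each collection is assumed to expose hypothetical .doc_count
    and .df(term) accessors. Every term requires one statistics
    exchange per collection, which is where the communication
    cost comes from.
    """
    total_docs = sum(c.doc_count for c in collections)
    idf = {}
    for term in query_terms:
        df = sum(c.df(term) for c in collections)
        idf[term] = math.log(total_docs / (df + 1))
    return idf
```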

Page 21:

Weighted Scores

Weights can be based on the document's score and/or the collection ranking information. This offers computational simplicity. The weight W scales results from different collections:

W = 1 + |C| × (s − ¯s) / ¯s

where:

|C| = the number of collections searched
s = the collection's score (not its rank)
¯s = the mean of the collection scores

Assumption: similar collections have similar weights.

Page 22:

Weighted Scores

rank(document) = score(document) × weight(collection)

This method favors documents from collections with high scores, but also allows a good document from a poor collection to rank well [which is what we are looking for].
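A minimal sketch of the weighted merge, combining the collection weight from the previous slide with the per-document scores (the dictionary layout is an assumption for illustration):

```python
def merge_weighted(result_lists, collection_scores):
    """Merge per-collection results by weighted document score.

    result_lists:      {collection_id: [(doc_id, score), ...]}
    collection_scores: {collection_id: collection score s}
    """
    n = len(collection_scores)
    mean_s = sum(collection_scores.values()) / n
    merged = []
    for cid, docs in result_lists.items():
        # W = 1 + |C| * (s - mean(s)) / mean(s), from the previous slide
        w = 1 + n * (collection_scores[cid] - mean_s) / mean_s
        merged.extend((doc_id, score * w) for doc_id, score in docs)
    return sorted(merged, key=lambda pair: pair[1], reverse=True)
```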

Page 23:

Comparing results of the merging methods (Source: TREC Collection Volume 1, Topics 51-100):

Merge technique   Average precision (11-point recall)   R-precision
Normalized        37.76                                 41.96
Interleaved       17.54                                 25.42
Raw Score         33.64                                 38.67
Weighted          38.18                                 42.46

Page 24:

Merging Results - Pros and Cons

Interleaving: extremely ineffective, with losses in average precision. (The reason is that documents ranked high from a non-relevant collection may end up near high-ranked documents from more relevant collections.)

Raw Scores: scores from different collections may not be directly comparable (e.g., idf weights): heavy use of a term within one collection can penalize documents from that collection.

Normalized Scores: comes closest to searching a single collection, but normalizing has significant communication and computational costs when collections are distributed across a wide-area network.

Weighted Scores: as effective as normalized scores, but less robust (introduces deviations in recall and precision).

Page 25:

Collection Selection

Approaches:

- Top n collections
- Any collection with a score greater than some threshold
- Top group (based on clustering)
- Cost-based selection

Page 26:

Results: difference between searching the 2 best clusters and searching all 7 collections.

Topics               51-100     101-150
R-precision          -0.9%      -5.6%
11-point precision   -2.2%      -9.1%
Document cut-off     500-1000   200

(For Topics 51-100 the differences stay under 5%; for Topics 101-150 they exceed 5%.) Eliminating collections reduces recall.

Page 27:

Related work on collection selection: Group Manually and Select Manually

Collections are organized into groups with a common theme, e.g., financial, technology, appellate court decisions. Users select which group to search. Found in commercial service providers, and used by experienced users such as librarians.

Groupings determined manually:
(-) time consuming, inconsistent groupings, coarse groupings, not good for unusual information needs

Groupings determined automatically: broker agents maintain a centralized cluster index by periodically querying collections on each subject.
(+) automatic creation, better consistency
(-) coarse groupings, not good for unusual information needs

Page 28:

Rule-Based Selection

The contents of each collection are described in a knowledge base. A rule-based system selects the collections for a query.

EXPERT CONIT, a research system, was tested on static and homogeneous collections.

(-) time consuming to create
(-) inconsistent selection if rules change
(-) coarse groupings, so not good for unusual information needs

Page 29:

Optimization: Represent a Collection With a Subset of Terms

Build the inference network from only the most frequent terms; at least the top 20% most frequent words must be included.

Page 30:

Proximity Information

Proximity of terms can be handled by a CORI net, but the CORI net for one collection would then be about 30% of the size of the original collection (versus the 0.4% seen earlier without proximity information).

Page 31:

Retrieving Few Documents

Usually a user is interested in the first 10, or at most 20, results; the rest are discarded (a waste of resources and time). Example:

[Figure: collections C1 (10 docs), C2 (20 docs), C3 (30 docs) each return their top 10; the results are merged and the top 10 merged results go to the user]

Collections: C = 3. Rankings of interest: n = 10. Documents retrieved = C × n = 3 × 10 = 30. Documents discarded = (C − 1) × n = (3 − 1) × 10 = 20.

20 documents are thrown out without the user ever looking at them.

Page 32:

Experiments and results

The number of documents R retrieved from the ith ranked collection is:

R(i) = M × n × 2(1 + C − i) / (C × (C + 1))

where M ∈ [1, (C + 1)/2] and M × n = the number of documents to be retrieved from all collections.
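A quick sketch of this allocation, written from the formula above (not from the authors' code); it also confirms that the per-collection amounts sum to M × n:

```python
def docs_to_retrieve(i, C, n, M):
    """Documents to request from the i-th ranked of C collections."""
    return M * n * 2 * (1 + C - i) / (C * (C + 1))

# With C = 5, M = 2, as in the next slide's summary:
allocation = [docs_to_retrieve(i, C=5, n=1, M=2) for i in range(1, 6)]
print(allocation)       # approx [0.67, 0.53, 0.40, 0.27, 0.13] (times n)
print(sum(allocation))  # 2.0 = M * n
```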

Page 33:

Result summary. Collection: TREC Volume 1, with C = 5 and M = 2.

Documents retrieved per collection (n = number of results reported):

Collection       C1      C2      C3      C4      C5
Docs retrieved   0.67n   0.53n   0.40n   0.27n   0.13n

For n = 1000: 4050 documents retrieved before, 2000 after, a savings of 51%.

M = 1: same results for ranks 1-200. M = 2: identical results for ranks 1-500. M = 3: better results for ranks 500-1000.

Page 34:

Conclusions

- Representing collections by terms and frequencies is effective; controlled vocabularies and schemas are not necessary.
- Collections and documents can be ranked with one algorithm (using different statistics), e.g., GlOSS, inference networks.
- Rankings from different collections can be merged efficiently: with precisely normalized scores (Infoseek's method), or without precisely normalized document scores, with only minimal effort and minimal communication between client and server.
- Large-scale distributed retrieval can be accomplished now.

Page 35:

Open Problems

- Multiple representations: stemming, stopwords, query processing, indexing
- Cheating / spamming
- How to integrate relevance feedback, query expansion, browsing

Page 36:

References:

Primary source: Searching Distributed Collections With Inference Networks. James P. Callan, Zhihong Lu, W. Bruce Croft.

Secondary sources:
1. Distributed Information Retrieval. James Allan, University of Massachusetts Amherst.
2. Methodologies for Distributed Information Retrieval. Owen de Kretser, Alistair Moffat, Tim Shimmin, Justin Zobel. (Proceedings of the 18th International Conference on Distributed Computing Systems.)

Page 37:

Questions?