Web Text Retrieval with a P2P Query-Driven Index

Gleb Skobeltsyn, EPFL, Lausanne, Switzerland. July 26, 2007. Joint work with: Toan Luu, Ivana Podnar Žarko, Martin Rajman, Karl Aberer. Alvis.

Page 1: Web Text Retrieval with a P2P Query-Driven Index

G.Skobeltsyn | Web Text Retrieval with a P2P Query-Driven Index

Web Text Retrieval with a P2P Query-Driven Index

Gleb Skobeltsyn
EPFL, Lausanne, Switzerland
July 26, 2007

Joint work with:
• Toan Luu
• Ivana Podnar Žarko
• Martin Rajman
• Karl Aberer

Alvis

Page 2: Web Text Retrieval with a P2P Query-Driven Index

Goal

• Our goal is to achieve scalable full-text retrieval with structured P2P networks

Page 3: Web Text Retrieval with a P2P Query-Driven Index

Distributed P2P IR architecture

• Each peer provides a local document collection for search
• Each peer is responsible for a fraction of the global index

Page 4: Web Text Retrieval with a P2P Query-Driven Index

P2P IR basics

• P2P network (Distributed Hash Table) with N peers
• Each peer maintains connections with log N neighbors
• The posting list associated with a given indexing key is stored at the peer responsible for that key
• This peer can be located in log N overlay hops

Storing: k = hash(indexing_key); put(k, posting_list)
Retrieving: k = hash(indexing_key); get(k) returns posting_list
The responsible peer keeps a table of k -> p_list entries.
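The put/get interface above can be sketched with a toy in-memory ring. This is only an illustration under stated assumptions: the class name ToyDHT, the SHA-1 hash, the 32-bit identifier space and the "peer-i" naming are all hypothetical, and the sketch finds the responsible peer directly instead of simulating the log N overlay hops.

```python
import hashlib
from bisect import bisect_left


class ToyDHT:
    """In-memory stand-in for a DHT; all names here are hypothetical."""

    def __init__(self, num_peers, id_bits=32):
        self.space = 2 ** id_bits
        # Peer identifiers are hashed onto the same ring as the keys.
        self.peer_ids = sorted({self._hash(f"peer-{i}") for i in range(num_peers)})
        self.storage = {peer_id: {} for peer_id in self.peer_ids}

    def _hash(self, text):
        digest = hashlib.sha1(text.encode("utf-8")).digest()
        return int.from_bytes(digest, "big") % self.space

    def responsible_peer(self, indexing_key):
        k = self._hash(indexing_key)
        # First peer clockwise from k; wrap around at the end of the ring.
        position = bisect_left(self.peer_ids, k)
        return self.peer_ids[position % len(self.peer_ids)]

    def put(self, indexing_key, posting_list):
        peer = self.responsible_peer(indexing_key)
        self.storage[peer][indexing_key] = list(posting_list)

    def get(self, indexing_key):
        peer = self.responsible_peer(indexing_key)
        return self.storage[peer].get(indexing_key, [])


dht = ToyDHT(num_peers=8)
dht.put("epfl", ["d1", "d2"])
print(dht.get("epfl"))
```

Any peer can compute the same hash, so both the indexing peer and the querying peer reach the same storage location without a central directory.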

Page 5: Web Text Retrieval with a P2P Query-Driven Index

(Naïve) Single-term indexing approach

Single-term partitioning leads to unscalable bandwidth consumption at retrieval (frequent intersections of large posting lists)

Query: “epfl & gleb”

h(“epfl”) -> {d1,d2}
h(“gleb”) -> {d2,d3}
h(t’) -> {d4,d5}

Transferred posting list: {d1,d2}
Intersection result: {d2}
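A rough sketch of where the bandwidth goes, using the posting lists from the example. The function name and the rule of shipping the shortest list first are assumptions made for illustration; the transfer cost is simply counted in document references sent between peers.

```python
def intersect_single_term(index, terms):
    """Intersect posting lists that conceptually live on different peers.

    index maps a term to its set of document ids. Returns the result set
    and the number of document references shipped between peers.
    """
    # Assumed heuristic: start from the shortest posting list.
    ordered = sorted(terms, key=lambda t: len(index.get(t, set())))
    result = set(index.get(ordered[0], set()))
    transferred = 0
    for term in ordered[1:]:
        # The current candidate set travels to the peer holding `term`.
        transferred += len(result)
        result &= index.get(term, set())
    return result, transferred


index = {"epfl": {"d1", "d2"}, "gleb": {"d2", "d3"}, "t'": {"d4", "d5"}}
docs, shipped = intersect_single_term(index, ["epfl", "gleb"])
print(docs, shipped)
```

With web-scale posting lists the shipped count grows with the list length, which is the scalability problem the slide points at.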

Page 6: Web Text Retrieval with a P2P Query-Driven Index

Single-term vs. multi-term P2P indexing

Figure, single-term indexing: terms 1..M, each with its posting list, are spread over peers 1..N (long posting lists, small vocabulary).
Figure, multi-term indexing: multi-term keys 11..1i on peer 1 through N1..Nj on peer N, each with its posting list (short posting lists, large vocabulary).

Page 7: Web Text Retrieval with a P2P Query-Driven Index

Multi-term indexing: framework

• Each peer is responsible for a set of indexing keys
• Each indexing key = {term1, term2, .., termk}, k>0
• Keys are assigned to peers by the underlying DHT using the standard hashing mechanism
• Each key is associated with a truncated posting list (TPL) that stores at most DFmax top-ranked document references

Distributed index contains {key, TPL} pairs
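A minimal sketch of a {key, TPL} pair. The helper names, the canonical key string (sorted terms joined by spaces) and the (doc_id, score) tuple layout are assumptions for illustration, not the system's actual data format.

```python
import heapq


def make_key(terms):
    """Canonical form of a multi-term key: order and duplicates do not matter."""
    return " ".join(sorted(set(terms)))


def truncate_posting_list(scored_docs, df_max):
    """Keep the df_max best-scored (doc_id, score) pairs.

    Returns the truncated posting list and a flag telling whether
    anything was actually cut off.
    """
    scored_docs = list(scored_docs)
    top = heapq.nlargest(df_max, scored_docs, key=lambda pair: pair[1])
    return top, len(scored_docs) > df_max


key = make_key(["gleb", "epfl"])
tpl, truncated = truncate_posting_list([("d1", 0.2), ("d2", 0.9), ("d3", 0.5)], df_max=2)
print(key, tpl, truncated)
```

The truncation flag matters later: a key whose list was never cut off already holds all its documents, so longer keys built on top of it add nothing.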

Page 8: Web Text Retrieval with a P2P Query-Driven Index

Single-term vs. multi-term P2P indexing

Figure: single-term indexing (long posting lists, small vocabulary) next to multi-term indexing with multi-term keys (short posting lists, large vocabulary).

How to select keys to keep a satisfactory retrieval quality?

The vocabulary size could grow exponentially!
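A quick count makes the growth concrete. The sketch assumes that every combination of up to s_max distinct terms is a candidate key, which is an upper bound chosen for illustration; the function name is hypothetical.

```python
import math


def count_possible_keys(vocabulary_size, s_max):
    """Number of distinct term sets with 1..s_max terms."""
    return sum(math.comb(vocabulary_size, size) for size in range(1, s_max + 1))


print(count_possible_keys(10, 3))
print(count_possible_keys(1000, 3))
```

Already with 1000 terms and keys of at most three terms there are more than 166 million candidate keys, so some selection mechanism is unavoidable.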

Page 9: Web Text Retrieval with a P2P Query-Driven Index

Indexing with HDK (Podnar et al., ICDE’07)

• Document-driven key generation:
• Each time a new document is indexed, the posting list for some indexing key k can reach the maximum size DFmax
  This triggers the generation of new keys (k + additional frequent keys)
• Use a number of filters to reduce the number of keys (proximity, redundancy and size filters)

Page 10: Web Text Retrieval with a P2P Query-Driven Index

Indexing with HDK

• Pros:
– The ICDE’07 paper proves that the approach is scalable
– Elegant key generation mechanism
– Low bandwidth during retrieval (posting lists of limited size)
• Cons:
– In practice the number of keys is still LARGE:
  • 113 keys/doc (68M for 0.6M docs)
– High bandwidth consumption at indexing
• Problem:
– Too many keys are superfluous (almost never used)

Let’s index what is queried!

Page 11: Web Text Retrieval with a P2P Query-Driven Index

Contents

• Introduction
• Single-term vs. multi-term indexing
• HDK approach for indexing
• Query-driven approach for indexing/retrieval
– Indexing structure
– Example
– Scalability
– Evaluation
• Conclusion

Page 12: Web Text Retrieval with a P2P Query-Driven Index

QDI: Query-Driven Index

• The Query-Driven Indexing strategy solves the “Too-Many-Keys” problem:
– Avoids maintenance of superfluous keys
– Generates only those keys that users actually request, on the fly
– Uses the query log to discover such keys (monitors query frequency)

Page 13: Web Text Retrieval with a P2P Query-Driven Index

QDI: Challenges

• Challenges
– Indexing a new key requires a bandwidth-efficient mechanism to obtain the top-k posting list associated with the key
  Conventional intersection, like the threshold algorithm, but less often
– An incomplete index degrades the quality of query results
  Show that the degradation is low

Page 14: Web Text Retrieval with a P2P Query-Driven Index

QDI: Retrieval

Query lattice for abc: abc; ab, bc, ac; a, b, c

• Single term index is generated
• Process abc
1) Probe Pabc
2) Probe Pab, Pbc and Pac
3) Probe Pa, Pb and Pc
4) Obtain top-DFmax results for a, b and c (ranked w.r.t. a, b and c respectively)
5) Contact peers in the list, re-rank the obtained results w.r.t. abc
6) Output top-10
• Increment the QF for ab, bc and ac
• Activate (index) ac

Figure annotations: the probes (?abc) for the multi-term keys return nothing; the QF counters of ab, bc and ac each get +1; the single-term posting lists are cut at DFmax; a key is marked popular.
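The probing order of the lattice can be sketched as follows. This is a simplified illustration, not the system's code: keys are modelled as frozensets of terms, the index is a plain dictionary instead of remote peers, and the function name is hypothetical. Keys are probed from the largest to the smallest, and a key is skipped when a larger indexed key already covers it.

```python
from itertools import combinations


def probe_lattice(query_terms, index):
    """Probe all term combinations of a query, largest first.

    index maps frozenset-of-terms -> posting list. Returns the posting
    lists that were found and the list of keys that were probed in vain.
    """
    terms = sorted(set(query_terms))
    hits, misses = {}, []
    for size in range(len(terms), 0, -1):
        for combo in combinations(terms, size):
            key = frozenset(combo)
            if any(key < found for found in hits):
                continue  # already covered by a larger indexed key
            if key in index:
                hits[key] = index[key]
            else:
                misses.append(key)
    return hits, misses


# Only the single-term index exists: a, b and c are all fetched.
single = {frozenset("a"): ["d1"], frozenset("b"): ["d2"], frozenset("c"): ["d3"]}
hits, misses = probe_lattice(["a", "b", "c"], single)
print(sorted("".join(sorted(k)) for k in hits))

# Once ac is indexed, only ac and b are fetched.
with_ac = dict(single)
with_ac[frozenset("ac")] = ["d1", "d3"]
hits, misses = probe_lattice(["a", "b", "c"], with_ac)
print(sorted("".join(sorted(k)) for k in hits))
```

The retrieved lists would then be merged and re-ranked with respect to the full query, and the query-frequency counters updated; both steps are left out of this sketch.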

Page 15: Web Text Retrieval with a P2P Query-Driven Index

QDI: Retrieval 2

Query lattice for abc: abc; ab, bc, ac; a, b, c

• Assume the frequency of b is below DFmax
• Note how the redundancy filter simplifies the lattice in such a case (grayed nodes, here abc, ab and bc, do not have to be activated)

Page 16: Web Text Retrieval with a P2P Query-Driven Index

QDI: Retrieval 3

Query lattice for abc: abc; ab, bc, ac; a, b, c

• Single term index is generated and ac is indexed
• Process abc
1) Probe Pabc
2) Probe Pab, Pbc and Pac: obtain the result for ac
3) Probe Pb and obtain the result for b
4) Contact all peers in the list to re-rank the obtained results w.r.t. abc
5) Output top-10
• Increment the QF for ab, bc and ac

Figure annotations: the probes (?abc) for the non-indexed keys return nothing; the QF counters of ab, bc and ac each get +1.

Page 17: Web Text Retrieval with a P2P Query-Driven Index

QDI: Summary

• Each single term found in the document collection has to be indexed.
– We call the set of all single-term keys the basic single-term index.
– The posting lists are truncated at DFmax.
• A multi-term key k is activated (indexed) iff:
– k is popular: QF(k) ≥ QFmin, where QF(k) is the popularity of the key k derived from the available query log and QFmin is a parameter of our model (popularity filter).
– k contains from 2 to smax terms: 2 ≤ |k| ≤ smax, where smax is a parameter of our model (size filter).
– all immediate sub-keys of k (of size |k|-1) are indexed and their associated posting lists are truncated (redundancy filter).
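The three filters combine into a single activation test. A minimal sketch, assuming keys are frozensets of terms, qf is a dictionary of query-frequency counters, and each index entry is a (posting_list, was_truncated) pair; these names and layouts are illustrative assumptions.

```python
from itertools import combinations


def should_activate(key, qf, index, qf_min, s_max):
    """Return True iff a multi-term key passes all three filters."""
    key = frozenset(key)
    # Popularity filter: QF(k) >= QFmin.
    if qf.get(key, 0) < qf_min:
        return False
    # Size filter: 2 <= |k| <= smax.
    if not 2 <= len(key) <= s_max:
        return False
    # Redundancy filter: every sub-key of size |k|-1 is indexed and truncated.
    for sub in combinations(key, len(key) - 1):
        entry = index.get(frozenset(sub))
        if entry is None or not entry[1]:
            return False
    return True


index = {
    frozenset("a"): (["d1", "d2"], True),
    frozenset("b"): (["d3"], False),  # b is rare: its list was never cut off
    frozenset("c"): (["d4", "d5"], True),
}
qf = {frozenset("ac"): 3, frozenset("ab"): 5}
print(should_activate("ac", qf, index, qf_min=2, s_max=3))
print(should_activate("ab", qf, index, qf_min=2, s_max=3))
```

In the example ab is popular but still rejected: the complete posting list of b is available, so intersecting with it is cheap and a separate ab key would be redundant.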

Page 18: Web Text Retrieval with a P2P Query-Driven Index

Scalability (see Skobeltsyn et al., Infoscale’07)

• The retrieval traffic is still scalable (it depends on DFmax and the query size)
• The indexing traffic now depends on the number of activated keys

The number of keys does not depend on the document collection size but only on the size of the query log

We can use the QFmin and DFmax parameters to adjust the tradeoff:

indexing traffic <-> retrieval quality
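A small sketch of how the number of candidate keys follows from the query log alone. It assumes that QF(k) counts the queries containing every term of k and that queries are split on whitespace; the slides do not spell out either detail, and the redundancy filter is ignored here. Function names are hypothetical.

```python
from collections import Counter
from itertools import combinations


def count_key_frequencies(queries, s_max):
    """Count how often each term combination of size 2..s_max is queried."""
    qf = Counter()
    for query in queries:
        terms = sorted(set(query.lower().split()))
        for size in range(2, min(s_max, len(terms)) + 1):
            for combo in combinations(terms, size):
                qf[frozenset(combo)] += 1
    return qf


def popular_keys(qf, qf_min):
    """Keys that pass the popularity filter."""
    return {key for key, count in qf.items() if count >= qf_min}


log = ["babe ruth", "babe ruth 1920", "ruth chris"]
qf = count_key_frequencies(log, s_max=3)
print(len(popular_keys(qf, qf_min=1)), len(popular_keys(qf, qf_min=2)))
```

Raising qf_min shrinks the set of activated keys and with it the indexing traffic, at the price of answering more queries from shorter keys.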

Page 19: Web Text Retrieval with a P2P Query-Driven Index

Contents

• Introduction
• Single-term vs. multi-term indexing
• HDK approach for indexing
• Query-driven approach for indexing/retrieval
– Indexing structure
– Example
– Scalability
– Evaluation
• Conclusion

Page 20: Web Text Retrieval with a P2P Query-Driven Index

AOL logs

• 17M queries from March, April, May 2006 (92 days)
• 650K anonymous user sessions
• Extracted all unique queries from each user session:

…
2006-05-31 23:50:30 wearthbow.com native.cheyenne origin.
2006-05-31 23:50:30 l6 screensaver
2006-05-31 23:50:30 horses for sale in tn ky
2006-05-31 23:50:30 bank of america.com
2006-05-31 23:50:30 ask
2006-05-31 23:50:29 del rosa lanes
2006-05-31 23:50:28 www.spirit airlines.com
2006-05-31 23:50:28 find holy women of the bible
2006-05-31 23:50:27 trains
2006-05-31 23:50:27 todaysmiricles
2006-05-31 23:50:27 constition
2006-05-31 23:50:26 german grocceries in las vegas nv
2006-05-31 23:50:25 porn
2006-05-31 23:50:25 northwest indiana
2006-05-31 23:50:24 united.eprize.net
2006-05-31 23:50:24 jessica laguna
…

<- 0.7Gb
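Lines in this shape can be split into a timestamp and a query string. The sketch assumes the layout visible in the sample (date, time, then the query, separated by single spaces); the function name is hypothetical, and the raw AOL files may be laid out differently.

```python
from datetime import datetime


def parse_log_line(line):
    """Split 'YYYY-MM-DD HH:MM:SS query text' into (timestamp, query)."""
    date_part, time_part, query = line.strip().split(" ", 2)
    timestamp = datetime.strptime(f"{date_part} {time_part}", "%Y-%m-%d %H:%M:%S")
    return timestamp, query


timestamp, query = parse_log_line("2006-05-31 23:50:28 find holy women of the bible")
print(timestamp.isoformat(), query)
```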

Page 21: Web Text Retrieval with a P2P Query-Driven Index

Overlap experiment

• Use the query log to build the index (days 1..91)
• Randomly choose 2K test queries from day 92
• Answer each test query with Google and compare to the union of top-DFmax Google results for each of its combinations that are indexed according to the logs.
• Mimics our P2P IR system if Google’s ranking is used.
• Example:

Figure: the top-5 results of the original query are compared with the results of its non-superfluous (indexed) combinations; two of the five are missing (X), so overlap@5 = 3/5 = 60%
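The overlap measure itself is simple to state in code. A minimal sketch, assuming the reference ranking is a list of result identifiers and the candidate set is the union of results obtained from the indexed combinations; the function name and the identifiers in the example are made up.

```python
def overlap_at_k(reference_ranking, candidate_docs, k):
    """Fraction of the reference top-k that also appears in candidate_docs."""
    top_k = reference_ranking[:k]
    if not top_k:
        return 0.0
    found = sum(1 for doc in top_k if doc in candidate_docs)
    return found / len(top_k)


reference = ["u1", "u2", "u3", "u4", "u5"]   # top-5 for the original query
candidates = {"u1", "u3", "u5", "u9"}        # union over indexed combinations
print(overlap_at_k(reference, candidates, 5))
```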

Page 22: Web Text Retrieval with a P2P Query-Driven Index

Overlap example

• Cut-n-paste from the simulation log:

>id=481, q=“what did babe ruth do in the 1920”
“1920 babe ruth”, qf=0 ----> Ov@100= 100%
“1920 babe”, qf=0 ---------> Ov@100= 9%
+++“1920 ruth”, qf=1 ---------> Ov@100= 33%
+++“babe ruth”, qf=495 -------> Ov@100= 69%
---“1920”, qf=716 ------------> Ov@100= 1%
---“babe”, qf=3196 -----------> Ov@100= 2%
---“ruth”, qf=1653 -----------> Ov@100= 7%
Size: 192, Keys used: 2, Overlap@100: 94%

Page 23: Web Text Retrieval with a P2P Query-Driven Index

Overlap experiment: impact of QFmin, DFmax

Plot: Top20 overlap: impact of DFmax with QFmin=1
Plot: Top20 overlap: impact of QFmin with DFmax=600

Page 24: Web Text Retrieval with a P2P Query-Driven Index

Overlap experiment: impact of the log size

Plot: Top20 overlap: impact of the log size (QFmin=1, DFmax=600)

Page 25: Web Text Retrieval with a P2P Query-Driven Index

TREC Experiment

• WT10G collection (~1.69 M docs)
• 100 TREC queries (from TREC Web Track 9 & 10)
• Query statistics generated from 17M AOL queries
• Using the Okapi BM25 weighting scheme to compute the ranking score
• QFmin = 1, 3, 5, ∞
• DFmax = 100, 500
• smax = 3

TREC: Precision at Top Ranked Pages (table)

       DFmax=100                           DFmax=500                           ST-BM25
       QFmin=∞  QFmin=5  QFmin=3  QFmin=1  QFmin=∞  QFmin=5  QFmin=3  QFmin=1
P@1    0.408    0.449    0.449    0.449    0.429    0.439    0.439    0.439    0.439
P@2    0.388    0.439    0.434    0.434    0.418    0.429    0.429    0.429    0.429
P@3    0.347    0.412    0.412    0.408    0.391    0.395    0.395    0.395    0.395
P@4    0.324    0.370    0.372    0.370    0.367    0.362    0.362    0.362    0.360
P@5    0.306    0.345    0.347    0.341    0.345    0.343    0.343    0.343    0.337
P@10   0.266    0.299    0.295    0.294    0.307    0.302    0.303    0.302    0.298
P@15   0.237    0.267    0.267    0.267    0.276    0.279    0.280    0.278    0.278
P@20   0.212    0.243    0.243    0.246    0.254    0.259    0.259    0.259    0.257
P@30   0.174    0.206    0.209    0.212    0.214    0.221    0.221    0.224    0.226
P@50   0.139    0.169    0.171    0.174    0.175    0.181    0.181    0.183    0.186
P@100  0.097    0.126    0.127    0.130    0.128    0.135    0.135    0.136    0.140

Precision is similar to centralized indexing
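For readers unfamiliar with BM25, a per-term score can be sketched as below. The slides do not give the exact variant or parameters used in the experiment, so the non-negative IDF form and the defaults k1=1.2 and b=0.75 are assumptions (common textbook choices), and the function name is hypothetical.

```python
import math


def bm25_term_score(tf, df, doc_len, avg_doc_len, num_docs, k1=1.2, b=0.75):
    """BM25 contribution of one query term to one document's score.

    tf: term frequency in the document; df: number of documents
    containing the term; k1 and b are assumed default parameters.
    """
    idf = math.log((num_docs - df + 0.5) / (df + 0.5) + 1.0)
    length_norm = k1 * (1.0 - b + b * doc_len / avg_doc_len)
    return idf * tf * (k1 + 1.0) / (tf + length_norm)


# A document's score for a query is the sum over the query terms.
score = sum(
    bm25_term_score(tf, df, doc_len=120, avg_doc_len=100, num_docs=1_690_000)
    for tf, df in [(3, 5000), (1, 200)]
)
print(round(score, 3))
```

The score saturates as tf grows and rare terms (small df) weigh more, which is what makes a truncated list of top-ranked documents per key meaningful.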

Page 26: Web Text Retrieval with a P2P Query-Driven Index

Conclusions

• We presented the query-driven indexing strategy for scalable web text retrieval with structured P2P networks:
– Keeps only popular and non-redundant multi-term keys in the index
– Associates keys with truncated posting lists
• And we showed that:
– With real query logs our approach achieves good retrieval quality, comparable to a centralized engine
– The QFmin parameter allows adjusting the traffic/quality tradeoff

Page 27: Web Text Retrieval with a P2P Query-Driven Index

Last slide

Thank you for your attention!
Questions?

AlvisP2P web site: http://globalcomputing.epfl.ch/alvis/