web text retrieval with a p2p query-driven index

G.Skobeltsyn | Web Text Retrieval with a P2P Query-Driven Index

Web Text Retrieval with a P2P Query-Driven Index

Gleb Skobeltsyn

EPFL, Lausanne, Switzerland

July 26, 2007

Joint work with:

• Toan Luu

• Ivana Podnar Žarko

• Martin Rajman

• Karl Aberer

AlvisP2P

Goal

• Our goal is to achieve scalable full-text retrieval with structured P2P networks


Distributed P2P IR architecture


• Each peer provides a local document collection for search

• Each peer is responsible for a fraction of the global index


P2P IR basics

• P2P network (Distributed Hash Table) with N peers

• Each peer maintains connections with log N neighbors

• The posting list associated with a given indexing key is stored at the peer responsible for that key

• This peer can be located in log N overlay hops

Store: k = hash(indexing_key); put(k, posting_list)

Lookup: k = hash(indexing_key); get(k) -> posting_list

Table at the responsible peer: k -> p_list, … -> …
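The put/get pattern above can be sketched as a toy, single-process stand-in for a DHT. Everything here is an assumption made for illustration: the class name `ToyDHT`, the use of SHA-1 modulo the number of peers to pick the responsible peer, and base-2 logarithms for the hop bound. Overlay routing itself is not simulated.

```python
import hashlib
import math


class ToyDHT:
    """Toy stand-in for a structured P2P overlay (hypothetical helper)."""

    def __init__(self, num_peers):
        self.num_peers = num_peers
        # Each peer holds its own fraction of the global index.
        self.peers = [dict() for _ in range(num_peers)]

    def responsible_peer(self, indexing_key):
        # k = hash(indexing_key); the peer that owns k stores the posting list.
        digest = hashlib.sha1(indexing_key.encode("utf-8")).hexdigest()
        return int(digest, 16) % self.num_peers

    def put(self, indexing_key, posting_list):
        peer = self.peers[self.responsible_peer(indexing_key)]
        peer[indexing_key] = list(posting_list)

    def get(self, indexing_key):
        peer = self.peers[self.responsible_peer(indexing_key)]
        return peer.get(indexing_key, [])

    def max_lookup_hops(self):
        # Routing is not simulated; this only reports the log N bound
        # (base 2 is an assumption).
        return max(1, math.ceil(math.log2(self.num_peers)))


dht = ToyDHT(num_peers=16)
dht.put("epfl", ["d1", "d2"])
dht.put("gleb", ["d2", "d3"])
print(dht.get("epfl"), dht.max_lookup_hops())
```

A real overlay would forward each put/get through neighbor links; the sketch only shows that the same hash decides where a key is stored and where it is looked up.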



(Naïve) Single-term indexing approach

Single-term based partitioning strategy leads to unscalable bandwidth consumption at retrieval (frequent intersections of large posting lists)

Query: “epfl & gleb”

h(“epfl”) -> {d1,d2}

h(“gleb”) -> {d2,d3}

h(t’) -> {d4,d5}

Posting list sent between peers: {d1,d2}

Result: {d2}
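A minimal sketch of why this costs bandwidth, using the example query above. The function name `pipelined_intersection` and the cost model (count every posting forwarded from one peer to the next, plus the final answer) are assumptions for illustration, not the measurement used in the talk. The list of terms must be non-empty.

```python
def pipelined_intersection(index, terms):
    """Intersect posting lists peer by peer.

    Returns (result, shipped), where shipped counts the postings sent
    over the network under the simplified cost model described above.
    """
    current = list(index.get(terms[0], []))
    shipped = 0
    for term in terms[1:]:
        shipped += len(current)  # forward the running result to the next peer
        local = set(index.get(term, []))
        current = [doc for doc in current if doc in local]
    shipped += len(current)  # return the final answer to the querying peer
    return current, shipped


single_term_index = {
    "epfl": ["d1", "d2"],
    "gleb": ["d2", "d3"],
    "t'": ["d4", "d5"],
}
docs, cost = pipelined_intersection(single_term_index, ["epfl", "gleb"])
print(docs, cost)
```

With real web collections the forwarded lists are not two entries long but can cover a large share of the collection, which is the scalability problem the slide points at.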


Single-term vs. multi-term P2P indexing

[Figure: two index layouts distributed over PEER 1 ... PEER N. Single-term indexing: terms 1..M, each with its posting list; long posting lists, small vocabulary. Multi-term indexing: multi-term keys (key 11 ... key 1i on PEER 1, key N1 ... key Nj on PEER N), each with its posting list; short posting lists, large vocabulary.]


Multi-term indexing: framework


• Each peer is responsible for a set of indexing keys

• Each indexing key = {term1, term2, .., termk}, k>0

• Keys are assigned to peers by the underlying DHT using the standard hashing mechanism

• Each key is associated with a truncated posting list (TPL) that stores at most DFmax top-ranked document references

Distributed index contains {key,TPL} pairs
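The {key,TPL} pairs can be sketched as follows. The names `make_key` and `truncate_posting_list`, the (doc_id, score) tuple layout and the value of `DF_MAX` are hypothetical; the slides do not fix a concrete representation.

```python
import heapq

DF_MAX = 3  # illustrative value only


def make_key(*terms):
    """An indexing key is a set of k > 0 terms (order does not matter)."""
    if not terms:
        raise ValueError("an indexing key needs at least one term")
    return frozenset(terms)


def truncate_posting_list(scored_postings, df_max=DF_MAX):
    """Keep the df_max highest-scoring (doc_id, score) pairs, best first."""
    return heapq.nlargest(df_max, scored_postings, key=lambda p: p[1])


postings = [("d1", 0.2), ("d2", 0.9), ("d3", 0.5), ("d4", 0.7)]
tpl_index = {make_key("babe", "ruth"): truncate_posting_list(postings)}
print(tpl_index[make_key("ruth", "babe")])
```

Using a frozenset as the key means "babe ruth" and "ruth babe" map to the same entry, which matches the set notation {term1, term2, .., termk} above.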


Single-term vs. multi-term P2P indexing

• How to select keys to keep a satisfactory retrieval quality?

• The vocabulary size could grow exponentially!


Indexing with HDK (Podnar et al., ICDE’07)

• Document-driven key generation:

– Each time a new document is indexed, the posting list of some indexing key k can reach the maximum size DFmax

– This triggers the generation of new keys (k + additional frequent keys)

• Use a number of filters to reduce the number of keys (proximity, redundancy and size filters)


Indexing with HDK

• Pros:

– ICDE’07 paper proves that the approach is scalable

– Elegant key generation mechanism

– Low bandwidth during retrieval (PLs of limited size)

• Cons:

– In practice the number of keys is still LARGE: 113 keys/doc (68M for 0.6M docs)

– High bandwidth consumption at indexing

• Problem:

– Too many keys are superfluous (almost never used)


Let’s index what is queried!


Contents

• Introduction

• Single-term vs. multi-term indexing

• HDK approach for indexing

• Query-driven approach for indexing/retrieval

– Indexing structure

– Example

– Scalability

– Evaluation

• Conclusion


QDI: Query-Driven Index

• Query-Driven Indexing strategy solves the “Too-Many-Keys” problem:

– Avoids maintenance of superfluous keys

– Generates only those keys that are requested by users, on the fly

– Utilizes the query log to discover such keys (monitors query frequency)


QDI: Challenges

• Challenges

– Indexing of a new key requires a bandwidth-efficient mechanism to obtain the top-k posting list associated with the key

-> Conventional intersection, like the threshold algorithm, but less often

– An incomplete index causes degradation of query result quality

-> Show that the degradation is low


QDI: Retrieval

[Figure: lattice of term combinations for the query abc (abc at the top; ab, bc, ac in the middle; a, b, c at the bottom). Probes for the multi-term keys return nothing, the QF counters of ab, bc and ac each get +1, the single-term posting lists are truncated at DFmax, and ac is marked as popular.]

• Single-term index is generated

• Process abc:

1) Probe Pabc

2) Probe Pab, Pbc and Pac

3) Probe Pa, Pb and Pc

4) Obtain top-DFmax results for a, b and c (ranked w.r.t. a, b and c respectively)

5) Contact peers in the list, re-rank the obtained results w.r.t. abc

6) Output top-10

• Increment the QF for ab, bc and ac

• Activate (index) ac


QDI: Retrieval 2

[Figure: the same lattice over a, b, c with the DFmax threshold; the nodes abc, ab and bc are grayed.]

• Assume the frequency of b is below DFmax

• Note how the redundancy filter simplifies the lattice in such a case (grayed nodes do not have to be activated)


QDI: Retrieval 3

[Figure: the same lattice with ac indexed. Probes for the other multi-term keys return nothing, and the QF counters of ab, bc and ac each get +1.]

• Single-term index is generated and ac is indexed

• Process abc:

1) Probe Pabc

2) Probe Pab, Pbc and Pac: obtain the result for ac

3) Probe Pb and obtain the result for b

4) Contact all peers in the list to re-rank the obtained results w.r.t. abc

5) Output top-10

• Increment the QF for ab, bc and ac
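The probing order in the retrieval walkthroughs can be sketched as a top-down pass over the lattice of term combinations. This is a reading of the slides, not the AlvisP2P implementation: the function name `probe_lattice` is hypothetical, the index is a plain dictionary, and the rule "skip a key once a larger indexed key containing it has been found" is inferred from the step where only Pb is probed after ac is found. QF updates and the final re-ranking are left out.

```python
from itertools import combinations


def probe_lattice(index, query_terms):
    """Probe term combinations from the largest to the smallest.

    Returns (found, probed): the indexed keys that were hit and every key
    that was probed, both in probing order.
    """
    terms = sorted(set(query_terms))
    found, probed = [], []
    for size in range(len(terms), 0, -1):
        for combo in combinations(terms, size):
            key = frozenset(combo)
            if any(key < hit for hit in found):
                continue  # already covered by a larger indexed key
            probed.append(key)
            if key in index:
                found.append(key)
    return found, probed


# State of the index in the third walkthrough: single terms plus ac.
qdi_index = {
    frozenset("a"): ["..."],
    frozenset("b"): ["..."],
    frozenset("c"): ["..."],
    frozenset("ac"): ["..."],
}
found, probed = probe_lattice(qdi_index, ["a", "b", "c"])
print([sorted(k) for k in found])
print([sorted(k) for k in probed])
```

With only the single-term index present, the same function probes all seven combinations and finds a, b and c, which corresponds to the first walkthrough.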


QDI: Summary

• Each single term found in the document collection has to be indexed.

– We call the set of all single-term keys the basic single-term index.

– The posting lists are truncated at DFmax.

• A multi-term key k is activated (indexed) iff:

– k is popular: QF(k) ≥ QFmin, where QF(k) is the popularity of the key k derived from the available query log and QFmin is a parameter of our model (popularity filter).

– k contains from 2 to smax terms: 2 ≤ |k| ≤ smax, where smax is a parameter of our model (size filter).

– all immediate sub-keys of k (of size |k|-1) are indexed and their associated posting lists are truncated (redundancy filter).
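The three filters can be written as one predicate. This is a sketch under stated assumptions: the function name `should_activate` and the dictionary layouts are hypothetical, and a stored posting list is treated as truncated when it holds exactly DFmax entries, which is a simplification (a list with exactly DFmax postings and nothing cut off would be misclassified).

```python
from itertools import combinations


def should_activate(key, index, qf, qf_min, s_max, df_max):
    """Return True if a multi-term key passes all three filters.

    key    -- frozenset of terms
    index  -- maps already indexed keys to their stored posting lists
    qf     -- maps keys to their query frequency from the query log
    """
    if qf.get(key, 0) < qf_min:  # popularity filter
        return False
    if not 2 <= len(key) <= s_max:  # size filter
        return False
    for sub in combinations(key, len(key) - 1):  # redundancy filter
        postings = index.get(frozenset(sub))
        if postings is None or len(postings) < df_max:
            return False  # sub-key missing or not truncated
    return True


basic_index = {
    frozenset("a"): ["d1", "d2"],  # truncated at df_max = 2
    frozenset("b"): ["d3"],        # below df_max, so not truncated
    frozenset("c"): ["d4", "d5"],  # truncated at df_max = 2
}
query_freq = {frozenset("ac"): 5, frozenset("ab"): 5}

print(should_activate(frozenset("ac"), basic_index, query_freq, 3, 3, 2))
print(should_activate(frozenset("ab"), basic_index, query_freq, 3, 3, 2))
```

In this example ab is rejected by the redundancy filter because the posting list of b is complete, so intersecting with b is already cheap.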


Scalability (see Skobeltsyn et al., Infoscale’07)

• The retrieval traffic is still scalable (depends on DFmax and the query size)

• The indexing traffic now depends on the number of activated keys

The number of keys does not depend on the document collection size but only on the size of the query log

We can use the QFmin and DFmax parameters to adjust the tradeoff:

indexing traffic <-> retrieval quality


Contents

• Introduction

• Single-term vs. multi-term indexing

• HDK approach for indexing

• Query-driven approach for indexing/retrieval

– Indexing structure

– Example

– Scalability

– Evaluation

• Conclusion


AOL logs

• 17M queries from March, April, May 2006 (92 days)

• 650K anonymous user sessions

• Extracted all unique queries from each user session (0.7 Gb):

…
2006-05-31 23:50:30 wearthbow.com native.cheyenne origin.
2006-05-31 23:50:30 l6 screensaver
2006-05-31 23:50:30 horses for sale in tn ky
2006-05-31 23:50:30 bank of america.com
2006-05-31 23:50:30 ask
2006-05-31 23:50:29 del rosa lanes
2006-05-31 23:50:28 www.spirit airlines.com
2006-05-31 23:50:28 find holy women of the bible
2006-05-31 23:50:27 trains
2006-05-31 23:50:27 todaysmiricles
2006-05-31 23:50:27 constition
2006-05-31 23:50:26 german grocceries in las vegas nv
2006-05-31 23:50:25 porn
2006-05-31 23:50:25 northwest indiana
2006-05-31 23:50:24 united.eprize.net
2006-05-31 23:50:24 jessica laguna
…


Overlap experiment

• Use the query log to build the index (days 1..91)

• Randomly choose 2K test queries from day 92

• Answer each test query with Google and compare to the union of top-DFmax Google results for each of its combinations that are indexed according to the logs

• Mimics our P2P IR system if Google’s ranking is used

• Example:

[Figure: result list of the original query next to the result lists of its non-superfluous (indexed) combinations; two of the top-5 results are marked X.]

overlap@5 = 3/5 = 60%
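A minimal sketch of the overlap measure as described above: the share of the reference top-k that also appears in the union of the result lists of the indexed combinations. The function name `overlap_at_k` and the document identifiers are made up for illustration.

```python
def overlap_at_k(reference_ranking, candidate_lists, k):
    """Fraction of the reference top-k found in the union of candidate lists."""
    top_k = reference_ranking[:k]
    if not top_k:
        return 0.0
    pool = set()
    for results in candidate_lists:
        pool.update(results)
    return sum(1 for doc in top_k if doc in pool) / len(top_k)


# Results for the original query (hypothetical identifiers).
reference = ["r1", "r2", "r3", "r4", "r5"]
# Results for two indexed combinations of the query terms.
combination_results = [["r1", "r3", "x1"], ["r4", "x2"]]

print(overlap_at_k(reference, combination_results, k=5))
```

Here r2 and r5 are missing from the union, so the overlap is 3 out of 5, the same ratio as in the slide's example.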


Overlap example

• Cut-n-paste from the simulation log:

>id=481, q=“what did babe ruth do in the 1920”
“1920 babe ruth”, qf=0 ----> Ov@100= 100%
“1920 babe”, qf=0 ---------> Ov@100= 9%
+++“1920 ruth”, qf=1 ---------> Ov@100= 33%
+++“babe ruth”, qf=495 -------> Ov@100= 69%
---“1920”, qf=716 ------------> Ov@100= 1%
---“babe”, qf=3196 -----------> Ov@100= 2%
---“ruth”, qf=1653 -----------> Ov@100= 7%
Size: 192, Keys used: 2, Overlap@100: 94%


Overlap experiment: impact of QFmin, DFmax

[Figure, two panels: "Top20 overlap: impact of DFmax with QFmin=1" and "Top20 overlap: impact of QFmin with DFmax=600"]


Overlap experiment: impact of the log size

[Figure: "Top20 overlap: impact of the log size (QFmin=1, DFmax=600)"]


TREC Experiment

• WT10G collection (~1.69M docs)

• 100 TREC queries (from TREC Web Track 9 & 10)

• Query statistics generated from 17M AOL queries

• Using the Okapi BM25 weighting scheme to compute the ranking score

• QFmin = 1, 3, 5, ∞

• DFmax = 100, 500

• smax = 3


TREC: Precision at Top Ranked Pages (table)

        DFmax=100                            DFmax=500                            ST-BM25
        QFmin=∞  QFmin=5  QFmin=3  QFmin=1   QFmin=∞  QFmin=5  QFmin=3  QFmin=1
P@1     0.408    0.449    0.449    0.449     0.429    0.439    0.439    0.439     0.439
P@2     0.388    0.439    0.434    0.434     0.418    0.429    0.429    0.429     0.429
P@3     0.347    0.412    0.412    0.408     0.391    0.395    0.395    0.395     0.395
P@4     0.324    0.370    0.372    0.370     0.367    0.362    0.362    0.362     0.360
P@5     0.306    0.345    0.347    0.341     0.345    0.343    0.343    0.343     0.337
P@10    0.266    0.299    0.295    0.294     0.307    0.302    0.303    0.302     0.298
P@15    0.237    0.267    0.267    0.267     0.276    0.279    0.280    0.278     0.278
P@20    0.212    0.243    0.243    0.246     0.254    0.259    0.259    0.259     0.257
P@30    0.174    0.206    0.209    0.212     0.214    0.221    0.221    0.224     0.226
P@50    0.139    0.169    0.171    0.174     0.175    0.181    0.181    0.183     0.186
P@100   0.097    0.126    0.127    0.130     0.128    0.135    0.135    0.136     0.140

Precision is similar to centralized indexing
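For reference, the P@k values reported for the TREC experiment follow the standard precision-at-k definition: the share of the first k returned documents that are judged relevant. The sketch below assumes that standard definition; the function name and the sample data are hypothetical and are not taken from the experiment.

```python
def precision_at_k(ranked_docs, relevant_docs, k):
    """Share of the top-k ranked documents that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    relevant = set(relevant_docs)
    hits = sum(1 for doc in ranked_docs[:k] if doc in relevant)
    return hits / k


ranking = ["d1", "d2", "d3", "d4", "d5"]
judged_relevant = {"d1", "d4", "d9"}

print(precision_at_k(ranking, judged_relevant, 1))
print(precision_at_k(ranking, judged_relevant, 5))
```

A reported figure such as P@10 would then be the mean of this value over all test queries.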


Conclusions

• We presented the query-driven indexing strategy for scalable web text retrieval with structured P2P networks:

– Keeps only popular and non-redundant multi-term keys in the index

– Associates keys with truncated posting lists

• And we showed that:

– With real query logs our approach achieves good retrieval quality, comparable to a centralized engine

– The QFmin parameter allows adjusting the traffic/quality tradeoff



Last slide

Thank you for your attention!

Questions?

AlvisP2P web site: http://globalcomputing.epfl.ch/alvis/

