Random Walks on the Click Graph

Nick Craswell and Martin Szummer, Microsoft Research Cambridge, SIGIR 2007

DESCRIPTION

Random Walks on the Click Graph. Nick Craswell and Martin Szummer, Microsoft Research Cambridge, SIGIR 2007. Introduction (1/2): A search engine can track which of its search results were clicked for which query.

TRANSCRIPT

Page 1: Random Walks on the Click Graph

Random Walks on the Click Graph

Nick Craswell and Martin Szummer

Microsoft Research Cambridge

SIGIR 2007

Page 2: Random Walks on the Click Graph

Introduction (1/2)

A search engine can track which of its search results were clicked for which query

Click records of query-document pairs can be viewed as a weak indication of relevance
– The user decided to at least view the document, based on its description in the search results

We can use the clicks of past users to improve the current search results
– The clicked set of documents is likely to differ from the current user’s relevance set

Page 3: Random Walks on the Click Graph

Introduction (2/2)

From the perspective of a user conducting a search:
– Documents that are clicked but not relevant constitute noise
– Documents that are relevant but not clicked constitute sparsity in the click data
Power law distribution: most queries in the click log have a small number of clicked documents

This paper focuses on the sparsity problem by giving a Markov random walk model, although the model also has noise reduction properties

Page 4: Random Walks on the Click Graph

Algorithm on the Click Graph

The current model uses click data alone, without considering document content or query content

The click graph:
– Bipartite
– Two types of nodes: queries and documents
– An edge connects a query and a document if a click for that query-document pair is observed
– The edge may be weighted according to the total number of clicks from all users
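The click graph described on this slide can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function name `build_click_graph`, the record format (one `(query, url)` pair per click) and the sample URLs are all assumptions made for the example.

```python
from collections import defaultdict


def build_click_graph(click_records):
    """Aggregate (query, url) click records into a weighted bipartite graph.

    Returns a dict mapping each (query, url) edge to its total click count,
    summed over all users. An edge exists only if at least one click was seen.
    """
    edge_weights = defaultdict(int)
    for query, url in click_records:
        edge_weights[(query, url)] += 1
    return dict(edge_weights)


# Hypothetical click log: one (query, url) pair per observed click.
clicks = [
    ("panda", "a.example/panda.jpg"),
    ("panda", "a.example/panda.jpg"),
    ("panda", "b.example/bear.jpg"),
    ("giant panda", "a.example/panda.jpg"),
]
graph = build_click_graph(clicks)

# The two node types of the bipartite graph.
queries = sorted({q for q, _ in graph})
documents = sorted({d for _, d in graph})
```

Because edges only ever join a query to a document, the two node sets can be recovered directly from the edge keys.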

Page 5: Random Walks on the Click Graph

Click Graph Example

Page 6: Random Walks on the Click Graph

Application Areas for Algorithms on Click Graph

Query-to-document ‘search’
– Given a query, find relevant documents, as in ad hoc search

Query-to-query ‘suggestion’
– Given a query, find other queries that the user might like to run

Document-to-query ‘annotation’
– Given a document, attach related queries to it

Document-to-document ‘relevance feedback’
– Given an example document that is relevant to the user, find additional relevant documents

Page 7: Random Walks on the Click Graph

Random Walk Model

A basic query formulation model
1. Imagine a document (information need)
2. Think of a query associated with the document
3. Issue the query or imagine another document related to the query
4. Iterative thought process (noise process)
– A Markov random walk which describes a probability distribution over queries

The retrieval model is obtained by inverting the query formulation model
– Starts from an observed query, and attempts to undo the noise, inferring the underlying information need
– Backward walks

Page 8: Random Walks on the Click Graph

Random Walk Computation

$C_{jk}$: click counts associating node $j$ to $k$

Define transition probabilities $P_{t+1|t}(k|j)$ from $j$ to $k$:

$$P_{t+1|t}(k|j) = \begin{cases} (1-s)\, C_{jk} / \sum_i C_{ji} & \text{when } k \neq j \\ s & \text{when } k = j \end{cases}$$

$s$ is the self-transition probability, which corresponds to the user favoring the current query or document

Transition matrix $[A]_{jk} = P_{t+1|t}(k|j)$, so $P_{t|0}(k|j) = [A^t]_{jk}$
– A measure of the volume of paths between $j$ and $k$
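As a rough sketch of the computation on this slide, the transition matrix and its $t$-th power can be built with plain Python lists. The helper names (`transition_matrix`, `matmul`, `matrix_power`) and the four-node toy graph are assumptions for illustration; a real click graph would need sparse matrices.

```python
def transition_matrix(counts, s):
    """Build A with A[j][k] = P_{t+1|t}(k|j).

    counts is a symmetric square list of lists of click counts C[j][k]
    with a zero diagonal; s is the self-transition probability.
    Assumes every node has at least one click, so no row total is zero.
    """
    n = len(counts)
    A = [[0.0] * n for _ in range(n)]
    for j in range(n):
        row_total = sum(counts[j])  # sum_i C_ji
        for k in range(n):
            A[j][k] = s if k == j else (1 - s) * counts[j][k] / row_total
    return A


def matmul(X, Y):
    """Plain dense matrix product of two lists of lists."""
    return [
        [sum(X[i][m] * Y[m][j] for m in range(len(Y))) for j in range(len(Y[0]))]
        for i in range(len(X))
    ]


def matrix_power(A, t):
    """Return A^t by repeated multiplication, starting from the identity."""
    n = len(A)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(t):
        result = matmul(result, A)
    return result


# Toy graph (hypothetical): nodes 0 and 1 are queries, nodes 2 and 3 are documents.
C = [
    [0, 0, 2, 1],
    [0, 0, 1, 0],
    [2, 1, 0, 0],
    [1, 0, 0, 0],
]
A = transition_matrix(C, s=0.1)
A3 = matrix_power(A, 3)  # A3[j][k] = P_{3|0}(k|j)
```

Each row of `A` sums to one, since the non-self transitions share the remaining mass $1-s$ in proportion to the click counts.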

Page 9: Random Walks on the Click Graph

Random Walk Model for Retrieval

Backward random walk for retrieval:

Given that we ended a $t$-step walk at node $j$, we find the probability of starting at node $k$, $P_{0|t}(k|j)$

Bayes rule: $P_{0|t}(k|j) = P_{t|0}(j|k)\, P_0(k) / P_t(j)$,

assuming $P_0(k) = 1/N$, so that $P_t(j) = \frac{1}{N}\sum_i [A^t]_{ij}$ and the factors of $1/N$ cancel

$P_{0|t}(k|j) = [A^t Z^{-1}]_{kj}$ where $Z$ is diagonal and $Z_{jj} = \sum_i [A^t]_{ij}$

Forward random walk:

$P_{t|0}(k|j) = [v_j \cdot A^t]_k$
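The two walks differ only in which slice of $A^t$ they read, which a small sketch can make concrete. The function names and the toy matrix are assumptions for illustration, and $v_j$ is taken here to be a one-hot start vector at node $j$ (the slide does not define it), so the forward walk reduces to row $j$ of $A^t$.

```python
def matmul(X, Y):
    """Plain dense matrix product of two lists of lists."""
    return [
        [sum(X[i][m] * Y[m][j] for m in range(len(Y))) for j in range(len(Y[0]))]
        for i in range(len(X))
    ]


def t_step(A, t):
    """Return A^t by repeated multiplication, starting from the identity."""
    n = len(A)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(t):
        result = matmul(result, A)
    return result


def backward_walk(A, t, j):
    """P_{0|t}(k|j) for every k: column j of A^t divided by Z_jj (uniform prior)."""
    At = t_step(A, t)
    column = [At[k][j] for k in range(len(A))]
    z_jj = sum(column)
    return [value / z_jj for value in column]


def forward_walk(A, t, j):
    """P_{t|0}(k|j) for every k: row j of A^t (one-hot start at j, an assumption)."""
    return t_step(A, t)[j]


# Hypothetical transition matrix with s = 0.1: nodes 0, 1 are queries, 2, 3 documents.
A = [
    [0.1, 0.0, 0.6, 0.3],
    [0.0, 0.1, 0.9, 0.0],
    [0.6, 0.3, 0.1, 0.0],
    [0.9, 0.0, 0.0, 0.1],
]
backward = backward_walk(A, 1, 0)
forward = forward_walk(A, 1, 0)
```

In this toy case the one-step forward walk from query 0 prefers document 2, which has more clicks, while the backward walk prefers document 3, which is connected to query 0 alone and is therefore the more likely starting point.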

Page 10: Random Walks on the Click Graph

Forward vs. Backward Walks

PageRank: a query-independent forward random walk on the link graph, which proceeds to its stationary distribution

In statistics, the backward walk model is referred to as diagnostic, and in contrast, the forward walk model is predictive

When t → ∞:
– The forward random walk approaches the stationary distribution
Gives high probability to nodes with a large number of clicks
– The backward random walk approaches the prior starting distribution, which we have taken to be uniform
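The two limits can be checked numerically on a small example. The sketch below is only an illustration under stated assumptions: the toy click counts, the choice of $s = 0.1$ and $t = 200$, and all helper names are invented for the example. A non-zero $s$ matters here, because with $s = 0$ a walk on a bipartite graph alternates between the two sides and does not settle.

```python
def transition_matrix(counts, s):
    """A[j][k] = s if k == j, else (1 - s) * C[j][k] / sum_i C[j][i]."""
    n = len(counts)
    return [
        [s if k == j else (1 - s) * counts[j][k] / sum(counts[j]) for k in range(n)]
        for j in range(n)
    ]


def matmul(X, Y):
    """Plain dense matrix product of two lists of lists."""
    return [
        [sum(X[i][m] * Y[m][j] for m in range(len(Y))) for j in range(len(Y[0]))]
        for i in range(len(X))
    ]


def t_step(A, t):
    """Return A^t by repeated multiplication, starting from the identity."""
    n = len(A)
    result = [[float(i == j) for j in range(n)] for i in range(n)]
    for _ in range(t):
        result = matmul(result, A)
    return result


# Toy click counts (hypothetical): nodes 0, 1 are queries, nodes 2, 3 documents.
C = [
    [0, 0, 2, 1],
    [0, 0, 1, 0],
    [2, 1, 0, 0],
    [1, 0, 0, 0],
]
A = transition_matrix(C, s=0.1)
At = t_step(A, 200)  # a long walk, standing in for t -> infinity

# For symmetric counts the stationary distribution is proportional to
# each node's total click count.
total_clicks = [sum(row) for row in C]
stationary = [c / sum(total_clicks) for c in total_clicks]

forward = At[0]                                   # row 0 of A^t
column = [At[k][0] for k in range(len(A))]        # column 0 of A^t
backward = [value / sum(column) for value in column]
```

After 200 steps `forward` matches `stationary` (heavily clicked nodes dominate), while `backward` is flat at $1/N$, so a very long backward walk no longer distinguishes between candidate starting nodes.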

Page 11: Random Walks on the Click Graph

Clustering Effect

Given an end node that is part of a cluster, we have similar probabilities of having started the walk from any node in the cluster

Page 12: Random Walks on the Click Graph

Walk Parameters

Figure: Probability distribution of non-self transitions under different combinations of t and s

Page 13: Random Walks on the Click Graph

Experiment Data

A 14-day click log of web image search engines
– Judged images with distance 1 from the query had precision of 75%
– Pruning: remove URLs that are connected to only one query and remove queries that are connected to only one URL
– After pruning: 505,000 URLs, 202,000 queries and 1.1 million edges
– Uniformly sampling 45 queries for evaluation
– TREC-style pooling relevance judgments of depth 20
2278 relevance judgments identify 818 relevant images
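The pruning step listed above might look like the following sketch. The slide does not say whether pruning was repeated until nothing changed, so this version makes a single pass with degrees measured on the unpruned graph; that choice, the function name `prune` and the sample edges are all assumptions.

```python
from collections import Counter


def prune(edges):
    """Drop edges whose query has only one URL or whose URL has only one query.

    edges maps (query, url) to a click count. Single pass, with degrees
    measured on the unpruned graph (an assumption, not stated in the source).
    """
    query_degree = Counter(q for q, _ in edges)
    url_degree = Counter(u for _, u in edges)
    return {
        (q, u): weight
        for (q, u), weight in edges.items()
        if query_degree[q] > 1 and url_degree[u] > 1
    }


# Hypothetical edge list before pruning.
edges = {
    ("panda", "u1"): 2,
    ("panda", "u2"): 1,
    ("panda", "u3"): 1,        # u3 is clicked for one query only
    ("giant panda", "u1"): 1,
    ("giant panda", "u2"): 3,
    ("bear", "u1"): 1,         # "bear" has one clicked URL only
}
pruned = prune(edges)
```

A single pass can leave behind nodes whose degree dropped to one during pruning; repeating `prune` until the edge set stops changing would remove those as well.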

Page 14: Random Walks on the Click Graph

Experiment Result-1

Table 1. The furthest node from any of our test queries is at distance 41 (‘101-0.9-backward’). ‘dist’ and ‘1-0-forward’ are the baselines.

Page 15: Random Walks on the Click Graph

Experiment Result-2

Figure: The number of images retrieved at different distances from the query for each method. The 101-step walk with zero-self-transition possibly goes too far, returning too few distance-1 images.

Page 16: Random Walks on the Click Graph

Experiment Result-3

Figure: The precision at different distances from the query for each method.

Page 17: Random Walks on the Click Graph

Experiment Result-4

Figure: Precision-recall curves of forward and backward walks, with zero self-transition probability (1000 URLs retrieved)

Page 18: Random Walks on the Click Graph

Experiment Result-5

Figure: Parameter sensitivity for a backwards walk. Each contour shows a 0.01 variation in MAP@20. Grid intersections indicate the parameter combinations tried. The large plateau has the highest MAP@20 (0.56-0.57)

Page 19: Random Walks on the Click Graph

Conclusion

We have applied a Markov random walk model to the click graph, giving us a high-quality ranking of documents for a given query, including those as-yet unclicked for that query

A backward walk was more effective than a forward walk, which supports the notion underlying our backward walk

We got the best results from a walk of 11 steps, or 101 steps with high self-transition probability

We have studied ad hoc retrieval in this paper and the model could be effective and easily applied in the applications listed

Given our model, another possible step would be to incorporate document content and query content, by incorporating a language model, aiming to find documents that are not yet part of the click graph