UNIVERSITAS SARAVIENSIS

Universität des Saarlandes, FR Informatik
Max-Planck-Institut für Informatik, AG 5

Query-log based Authority Analysis for Web Information Search

Diplomarbeit im Fach Informatik
Diploma Thesis in Computer Science

von / by
Julia Luxenburger

unter Anleitung von / supervised by
Prof. Dr. Gerhard Weikum

Saarbrücken, 28. Oktober 2004




Hilfsmittelerklärung (Non-plagiarism statement)

Hiermit versichere ich, die vorliegende Arbeit selbständig verfasst und keine anderen als die angegebenen Quellen und Hilfsmittel benutzt zu haben. (I hereby declare that I have written this thesis independently and have used no sources or aids other than those indicated.)

Saarbrücken, den 28. Oktober 2004,

(Julia Luxenburger)


Abstract

The ongoing explosion of web information calls for more intelligent and personalized methods towards better search result quality for advanced queries. Query logs and click streams obtained from web browsers or search engines can contribute to better quality by exploiting the collaborative recommendations that are implicitly embedded in this information. The method presented in this work incorporates the notion of query nodes into the PageRank model and integrates the implicit relevance feedback given by click streams into the automated process of authority analysis. The enhanced PageRank scores, coined QRank scores, can be computed offline; at query time they are combined with query-specific relevance measures with virtually no overhead. In our experiments significant improvements in the precision of search results were observed, demonstrating the effectiveness of our model.


Contents

1 Introduction
  1.1 Motivation
  1.2 Our Approach and Contribution
  1.3 Preliminaries
      1.3.1 Click Stream, Query Refinement and Query Session
      1.3.2 Extracting Information from Query Logs
  1.4 Overview of the Thesis
  1.5 Acknowledgements

2 Foundations
  2.1 Content-based Information Retrieval
      2.1.1 Boolean Retrieval
      2.1.2 Ranked Retrieval
  2.2 Structure-based Information Retrieval
      2.2.1 Markov chains
      2.2.2 PageRank
      2.2.3 HITS
  2.3 Evaluation of Search Results
  2.4 Relevance Feedback
      2.4.1 Query Expansion
      2.4.2 Adjusting the Ranking Function

3 Related Work
  3.1 Variations of HITS and PageRank
      3.1.1 Extensions to HITS
      3.1.2 Extensions to PageRank
      3.1.3 Unified Frameworks
  3.2 Query-dependent Link Analysis
      3.2.1 Query-dependent PageRank
      3.2.2 Unified Framework Comprising Link Structure, Content and Queries
  3.3 Exploiting Query Logs and Click Streams
      3.3.1 Query Clustering
      3.3.2 Query Expansion
      3.3.3 Exploiting Click Streams

4 Incorporating Queries into PageRank
  4.1 QRank Model
      4.1.1 Notation
      4.1.2 Formalized Description
  4.2 Properties of QRank

5 Implementation
  5.1 Extracting Query Sessions from Browser Histories
  5.2 Computation of Query Similarity
  5.3 Computation of Document Similarity

6 Experiments
  6.1 Document Collection
  6.2 Tools
      6.2.1 Search Engine
      6.2.2 Evaluation Interface
  6.3 Preliminary Experiment
      6.3.1 Data Acquisition
      6.3.2 Results
  6.4 Comprehensive Experiment
      6.4.1 Data Acquisition
      6.4.2 Evaluation Methodology
      6.4.3 Results
  6.5 Discussion

7 Conclusion and Future Work

A Top-10 Rankings for trained Queries

B Top-10 Rankings for untrained Queries


Chapter 1

Introduction

1.1 Motivation

Today's web information search is based on two orthogonal paradigms: traditional textual similarity (e.g., tf*idf-based cosine similarity between feature vectors) on one hand, and link-based authority ranking on the other hand. The most prominent approach for the latter is the PageRank method by Brin and Page [12, 11], which measures the (query-independent) importance of web pages by analyzing the link structure of the web graph. The key idea of PageRank scoring is to grant a web page higher prestige the more high-quality pages link to it. The rationale behind this is that the outgoing links of a web page exhibit an intellectual endorsement of the page creator, who judges the linked pages valuable. This kind of intellectual user input can be generalized to analyzing and exploiting entire surf trails and query logs of individual users or an entire user community. These trails, which can be gathered from browser histories, local proxies, or web servers, reflect semantic correlations between documents and implicit user judgements. For example, suppose a user clicks on a specific subset of the top-10 results returned by a search engine for a multi-keyword query, based on having seen the summaries of these documents. This implicit form of relevance feedback establishes a strong correlation between the query and the clicked-on documents. Further suppose that the user refines a query by adding or replacing keywords, e.g., to eliminate ambiguities in the previous query; the ambiguity may be recognized by the user by observing multiple unrelated topics in the query result. Again, this establishes correlations between the new keywords and the subsequently clicked-on documents, but also, albeit possibly to a lesser extent, between the original query and the eventually relevant documents. Here it is less obvious how to cast the observed correlations into relations that can be leveraged by a link-analysis algorithm.

The problem that we study is how to systematically exploit user-behavior information gathered from query logs and click streams and incorporate it into a PageRank-style link-analysis algorithm. We believe that observing user behavior is a key element in adding intelligence to a web search engine and boosting its search result quality. Of course, these considerations and the results presented can also be applied to search in intranets or digital libraries, where explicit hyperlinks are much more infrequent and can usually not be exploited that well [20].

1.2 Our Approach and Contribution

The key idea of our approach is to extend the Markov chain structure that underlies the PageRank model (see Section 2.2.2) by additional nodes and edges that reflect the observed user behavior extracted from query and click-stream logs. Just like PageRank maps visits to web pages to states of a Markov-chain random walk, we model all observed queries and query refinements as additional states, and we introduce transitions from these states to the pages that were clicked on in the user histories. The transition probabilities are chosen in a biased way to reflect the users' preferences. Then we employ standard techniques for computing stationary state probabilities, yielding the query-log-enhanced authority scores of web pages, coined QRank scores. This approach generalizes the standard random-surfer model into a random-expert model that mimics the behavior of an expert user in an extended session consisting of queries, query refinements, and result-navigation steps. Because of the close relationship to the PageRank computation, on one hand, and the exploitation of query-log information, on the other hand, we coin our technique QRank. The QRank procedure consists of three stages:

1. We extract information from the browser histories of users. We recognize all URLs that encode queries (i.e., with query keywords being part of the URL of an HTTP GET request) and all URLs that are subsequently clicked on within a specified time window. In addition to mining these pairs of queries and clicked-on documents, we also aim to recognize entire user sessions, based on heuristics about the user's timing, and identify refinements of a previously posed query.

2. From the observed queries, query refinements, and clicked-on documents we construct an extended web graph by adding extra nodes and edges to the hyperlink-induced graph. We also introduce ontology-based correlations between similar queries, and compute textual similarities between documents to construct additional associative links for pages with high content similarities. These are offline computations that take place on the small web-graph fragment previously visited by our web crawler and indexed in a database.

3. We now solve the extended Markov chain for its stationary state probabilities, i.e., the QRank scores of the indexed web pages. At query time we merely combine the offline-precomputed QRank score of a page with the page's content relevance for a given set of query keywords. We can use either a weighted linear combination or the product of re-adjusted QRank and tf*idf-based relevance measures to compute the overall scoring and final result ranking.
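The query-time combination in stage 3 is cheap because only two precomputed numbers per page are merged. A minimal sketch of both combination variants; the function names, the weight alpha, and the default values are our own illustrations, not the thesis implementation:

```python
def combined_score(qrank: float, relevance: float,
                   alpha: float = 0.5, mode: str = "linear") -> float:
    """Merge an offline-precomputed QRank score with a query-specific
    tf*idf relevance value, either by weighted linear combination or
    by taking the product."""
    if mode == "linear":
        return alpha * qrank + (1 - alpha) * relevance
    if mode == "product":
        return qrank * relevance
    raise ValueError(f"unknown mode: {mode}")


def rank(candidates: dict) -> list:
    """Order candidate documents (doc_id -> (qrank, relevance)) by
    descending combined score."""
    return sorted(candidates,
                  key=lambda d: combined_score(*candidates[d]),
                  reverse=True)
```

With alpha = 0.5 a page with scores (0.2, 0.9) outranks one with (0.9, 0.1), since the merged scores are 0.55 and 0.5, respectively.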

Preliminary results of this work have been published in [50].

1.3 Preliminaries

1.3.1 Click Stream, Query Refinement and Query Session

By gathering the queries users posed and the documents they subsequently visited, we can deduce frequently asked questions as well as the answers that were regarded (at least upon first glance) as relevant. Interpreting every user click as an endorsement would, of course, be an overstatement, as a user might make error-clicks or misunderstand the excerpt of the web site that goes with the link presented in the search engine's result set. But nevertheless clicks are meaningful with high probability. Certainly users do not follow links at random, but have a certain intention they act upon. In this way query sessions offer a kind of implicit relevance feedback that is stronger than pseudo-relevance feedback: in addition to the clicked documents being part of the search engine's result set, users also regarded the clicked documents as relevant, or at least had some reason to click them. In contrast to users being explicitly asked to give feedback, this implicit relevance judgement does not suffer from the likely reluctance and laziness of users (at the cost of some small inaccuracy in the extraction process, depending on the extent to which heuristics are used in the browser history mining).

By analyzing the timestamps of HTTP requests, equivalent variations of queries representing the same search intention can be retrieved. This situation most probably occurs whenever the user is not satisfied by the search results and then tries to achieve an improvement by reformulating and refining her original query.

In the following, a more formal definition of query sessions and query refinements is given.

Definition 1 (Query Click Stream). A query click stream is a pair consisting of a query and the set of subsequently visited documents, i.e.,

    query click stream := < query, {d1, d2, . . . , dn} >.

Definition 2 (Query Refinement). A query refinement is a pair consisting of a query and its reformulation, i.e.,

    query refinement := < query, refined query >.

Definition 3 (Query Session). A query session is an initial query together with its reformulations, each with its own set of visited documents, i.e.,

    query session := < query, {refined query click stream1, . . . , refined query click streamk} >.
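Definitions 1-3 translate directly into simple record types. A sketch; the class and field names are ours, not taken from the thesis code:

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class QueryClickStream:
    """Definition 1: a query plus the documents visited afterwards."""
    query: str
    clicked_docs: List[str]


@dataclass
class QueryRefinement:
    """Definition 2: a query and its reformulation."""
    query: str
    refined_query: str


@dataclass
class QuerySession:
    """Definition 3: an initial query together with the click streams
    of its successive reformulations."""
    query: str
    refinements: List[QueryClickStream] = field(default_factory=list)
```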


The demarcations of query sessions, i.e., when one session ends and the next one starts, are identified using heuristic rules on time intervals. When a user is inactive for more than a specified timespan (e.g., 1 minute), it is assumed that a new query session is started. This is certainly not a foolproof technique, but it is empirically found to work with high probability.
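The inactivity heuristic above amounts to a single pass over the timestamped events. A sketch under the stated assumptions (the 1-minute gap and the event representation are illustrative):

```python
from datetime import datetime, timedelta


def split_sessions(events, gap=timedelta(minutes=1)):
    """Split a chronologically sorted list of (timestamp, url) events
    into sessions: a new session starts whenever the user was inactive
    for longer than `gap`."""
    sessions = []
    current = []
    last_ts = None
    for ts, url in events:
        if last_ts is not None and ts - last_ts > gap:
            sessions.append(current)
            current = []
        current.append((ts, url))
        last_ts = ts
    if current:
        sessions.append(current)
    return sessions
```

Two requests 10 seconds apart stay in one session; a request arriving minutes later opens a new one.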

1.3.2 Extracting Information from Query Logs

Query logs are log files generated by search engines or web browsers that record all requests they process. Depending on the actual logging tool, slightly different information is captured, but most importantly all URLs together with their timestamps are remembered. One faces various problems when trying to extract query sessions from query logs. With query click streams the first difficulty is how to determine the end of a click stream. While its start is uniquely defined by the URL of the search engine together with a request parameter containing the actual query string, the end is somewhat unclear. To make things even more complicated, a user might switch interests and jump to totally unrelated pages shortly after executing a web search on a certain subject. Similarly, it is hard to decide whether subsequent queries represent isolated search intentions. Further problems arise depending on the kind of log files that are considered. In most cases the use of heuristics is the means of choice to handle this vagueness.

Web browser history

The browser history of a user can be extracted from the browser's internal files using special tools such as UrlSearch [45]. An excerpt of a browser history is shown in Table 1.1. Here the user query "france highest mountain" is refined by the query "france highest summit", and a page about the Mont Blanc is clicked. In the case of browser histories, life seems to be especially simple, as one does not need to care about separating requests by users. A slight complication is the storage-saving policy of most browsers that erases multiple occurrences of the same web page and only stores a page once, with the time it was last visited. This becomes especially crucial with the common navigation via a browser's back button. However, by either encoding server-supplied timestamps into URLs, thus making URLs unique, or by timestamp-based and other heuristics, one can still extract all relevant click-stream information from the browser history (for more details see Section 5.1).

URL                                                     Timestamp
http://.../index.jsp?query=france+highest+mountain      2004/05/28, 15:24:04
http://.../index.jsp?query=france+highest+summit        2004/05/28, 15:24:14
http://.../wiki/wiki.phtml?title=Mont Blanc             2004/05/28, 15:24:44

Table 1.1: Browser history


Search Engine log

Extracting query sessions from server log files is known to be problematic because of the difficulty of identifying the same user in a series of requests (see, e.g., [8, 39, 54, 51, 48]). The IP address cannot uniquely identify a user because of the massive usage of local caches, firewalls and proxy servers. Cookies or user IDs fail because of privacy concerns and the reluctance of users. Again, imperfect heuristics are unavoidable.

Therefore, and for the sake of generality and re-usability (we did not want to be restricted to a certain search engine), we used browser histories on the client side. We simply asked a number of voluntary trial users to provide us with their histories extracted by the UrlSearch tool.

1.4 Overview of the Thesis

Chapter 2 comprises some fundamentals of the general information retrieval process as well as some mathematical foundations needed for later models. Recent related work is summarized in the subsequent Chapter 3. The theoretical framework of our approach, including proofs of some desirable properties of the introduced model, is described in Chapter 4. Chapter 5 is devoted to some implementation issues, and Chapter 6 contains descriptions and results of the conducted experiments.

1.5 Acknowledgements

We want to thank all the volunteer test persons who sacrificed their leisure time to provide us with query sessions and ranking evaluations.


Chapter 2

Foundations

State-of-the-art web information search is, roughly speaking, based on two orthogonal paradigms: the content-based approach on the one hand and the structure-based approach on the other hand. The latter did not arise until the late nineties and caused a kind of revolution of models and methods in the IR community. Nevertheless, content-based models are still essential, as today's best results are only obtained with a combination of both methods, which tend to complement each other. Before delving into the details of certain models, the general framework inherent in any information retrieval process should be outlined.

First of all, a rich information base has to be created. In the context of web search this is achieved by crawling and indexing. This means that, starting from a set of base URLs, documents are fetched and parsed, and the contained links and terms are extracted. All outgoing links are fed back for further crawling, whereas the obtained terms are mapped onto features by normalizations (lower case, word stemming, stopword elimination) and an inverted index is built up. The characterization "inverted" here refers to the reversed mapping direction: instead of each document pointing to the set of features it contains, each feature is associated with the set of documents it occurs in. With the use of sophisticated data structures, like B+-trees, a fast lookup of documents via index features is facilitated. The index sizes of current search engines cover about 4 billion web pages. Although this number sounds incredibly huge, still the bulk of the web is uncovered. Limiting factors are, of course, storage and running-time restrictions, but also the web itself. Recent analyses of its structure show that an enormous amount of information is hidden behind portals and password-protected areas and thus remains invisible to common search engines.

Apart from that, the "small world" assumption that the web is a single connected component with only a small number of links separating most web pages, as stated by Albert, Jeong and Barabási [3], has been revised. Broder et al. [10] provided evidence that the web rather obeys a more complicated structure, as depicted in Figure 2.1. Most web pages seem to belong to a weakly connected component that consists of four parts: a strongly connected component SCC, a set IN of pages that link to the SCC, but are not reachable from there, a set OUT of pages that are reachable from the SCC,


[Figure: bow-tie diagram with IN, SCC, OUT, tubes, tendrils, and disconnected components]

Figure 2.1: Web structure observed by Broder et al.

but themselves do not reach the SCC, as well as tendrils connected to IN or OUT but not to the SCC. In addition there are several completely disconnected components. This structure suggests that crawling the web depends highly on the choice of the starting URLs.
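The crawl-and-index pipeline described at the beginning of this chapter (fetch, normalize, build the feature-to-documents mapping) can be illustrated in a few lines. A toy sketch, not the thesis implementation; the tiny stopword list is ours, and word stemming is omitted:

```python
from collections import defaultdict

STOPWORDS = {"the", "a", "of", "and", "in"}  # illustrative stopword list


def normalize(term: str) -> str:
    # Lower-casing only; a real indexer would also apply word stemming.
    return term.lower()


def build_inverted_index(corpus: dict) -> dict:
    """Map each feature to the set of documents it occurs in, i.e. the
    "inverted" direction of the document -> features mapping."""
    index = defaultdict(set)
    for doc_id, text in corpus.items():
        for term in text.split():
            term = normalize(term)
            if term not in STOPWORDS:
                index[term].add(doc_id)
    return index
```

A B+-tree (or, here, a hash map) over the index keys then makes the lookup of documents via features fast.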

Returning to the process of information retrieval, the steps following crawling and indexing are query-dependent. Retrieving relevant documents can be achieved according to several models that differ in the representation of documents and queries as well as in the definition of similarity or relevance.

2.1 Content-based Information Retrieval

2.1.1 Boolean Retrieval

This approach is based upon a binary decision strategy, i.e., a document is either relevant or not. Thus the returned result set is completely unordered, as all qualifying documents are equally relevant. In boolean retrieval, documents are modelled as binary vectors in the feature space, such that each component indicates whether the corresponding feature occurs in the represented document or not.


Composed queries are formulated using boolean algebra, which can be translated to the set operations union, intersection and complement on the sets of documents satisfying the atomic (single-keyword) queries. More formally, queries can be represented in disjunctive normal form, i.e., as a disjunction of several binary feature vectors. Result sets are then obtained as the union of all documents associated with these feature vectors.

Example 1. Consider the feature space {apple, orange, juice} and documents d1 = (1, 1, 0) about apples and oranges, d2 = (1, 0, 1) about apple juice, and d3 = (0, 1, 1) about orange juice. A user submits the query (apple ∨ orange) ∧ juice to find information on apple or orange juice. The union of the result sets for the atomic queries apple = {d1, d2} and orange = {d1, d3} yields the whole corpus, which has to be intersected with juice = {d2, d3} for the final result set S = {d2, d3}. More formally, the query can be reformulated as follows:

    (apple ∨ orange) ∧ juice = (apple ∧ juice) ∨ (orange ∧ juice)
                             = (apple ∧ orange ∧ juice) ∨ (¬apple ∧ orange ∧ juice) ∨ (apple ∧ ¬orange ∧ juice)
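Example 1 maps directly onto set operations over the inverted lists. A short sketch of the evaluation:

```python
# Inverted lists for the atomic (single-keyword) queries of Example 1.
postings = {
    "apple":  {"d1", "d2"},
    "orange": {"d1", "d3"},
    "juice":  {"d2", "d3"},
}

# (apple OR orange) AND juice: union of the atomic result sets,
# then intersection with the "juice" list.
result = (postings["apple"] | postings["orange"]) & postings["juice"]
# result is {"d2", "d3"}, matching S in Example 1.
```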

2.1.2 Ranked Retrieval

To avoid overwhelming the user with a huge unordered result set, weights are introduced to differentiate between more and less relevant documents. The common vector space model views documents as n-dimensional feature vectors, where n is the total number of features contained in the considered corpus. In contrast to boolean retrieval, the feature vector entries are term relevance scores obtained from the term frequency with respect to the document and the corpus. The idea behind this is that the relevance of a document should grow with the number of query terms it contains, but shrink with the number of documents a query term also occurs in. Thus more discriminative terms are weighted higher. Term weights are usually computed by a version of the tf*idf formula, where

    tfij denotes the term frequency, i.e., the number of occurrences of term ti in document dj,

    dfi is the document frequency, i.e., the number of documents term ti occurs in,

    N is the number of documents in the corpus, and

    idfi is the inverse document frequency with idfi = N / dfi.


The weight wij of term ti in the feature vector of document dj is then computed as

    wij = tfij · idfi

To give short and long documents equal chances, the term frequency is usually normalized, and the otherwise dominating influence of the idf values is damped:

    wij = (tfij / max_k tfkj) · log(idfi)

Finally, the term weights are normalized:

    wij = wij / sqrt(Σ_k wik²)

Queries are also modelled as n-dimensional weighted feature vectors, so that common distance metrics on vectors, e.g., the Euclidean distance or the cosine metric, can be used to provide a similarity-based ranking of the search results. Let d⃗ ∈ [0, 1]^n be a document vector and q⃗ ∈ [0, 1]^n a query vector. Then the similarity can be computed as follows:

    sim(d⃗, q⃗) = Σ_{i=1..n} di · qi / ( sqrt(Σ_{i=1..n} di²) · sqrt(Σ_{i=1..n} qi²) )
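The damped tf*idf weighting and the cosine measure above combine into a small scorer. A sketch using sparse dictionaries instead of dense vectors; the helper names are ours:

```python
import math
from collections import Counter


def tfidf_vector(doc_terms, df, n_docs):
    """Damped tf*idf weights: tf is divided by the maximum term
    frequency in the document, and idf_i = N / df_i enters via log."""
    tf = Counter(doc_terms)
    max_tf = max(tf.values())
    return {t: (tf[t] / max_tf) * math.log(n_docs / df[t]) for t in tf}


def cosine(d, q):
    """Cosine similarity between two sparse weight vectors."""
    dot = sum(w * q.get(t, 0.0) for t, w in d.items())
    norm = math.sqrt(sum(w * w for w in d.values())) * \
           math.sqrt(sum(w * w for w in q.values()))
    return dot / norm if norm else 0.0
```

For a document containing "apple" twice and "juice" once, with df(apple) = 2 in a corpus of N = 4 documents, the weight of "apple" is (2/2) · log(4/2) = log 2.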

2.2 Structure-based Information Retrieval

Structure-based approaches try to improve the precision of search results by introducing additional ranking schemes. What the various approaches have in common is that they view the web as a directed graph, with the web pages being the nodes and the hyperlinks the connecting edges. The central assumption is that a hyperlink from page A to page B can be interpreted as a recommendation or citation of B by A. Before turning to the two most prominent approaches, PageRank [11, 12] and HITS [30], some mathematical foundations are needed.

2.2.1 Markov chains

A stochastic process is a family of random variables {X(t) | t ∈ T}, where T is called the parameter space and the domain M of X(t) is called the state space. T and M can be discrete or continuous. A Markov process is a special stochastic process that additionally fulfills the following property: for every choice of t1, . . . , tn+1 from the parameter space and every choice of x1, . . . , xn+1 from the state space it holds that

    P[X(tn+1) = xn+1 | X(t1) = x1, . . . , X(tn) = xn] = P[X(tn+1) = xn+1 | X(tn) = xn]


Thus the transitions in a Markov process depend only on the current state, not on the preceding transition history.

A Markov process is called a Markov chain if its state space is discrete. In this case states are usually modelled using the set of nonnegative integers. For time-discrete Markov chains, the notation X(tn) for the state at time tn is abbreviated to Xn.

Definition 4. A Markov chain with discrete parameter space is

1. homogeneous, if the transition probabilities pij = P[Xn+1 = j | Xn = i] are independent of n.

2. irreducible, if every state is reachable from every other state with positive probability, i.e.,

       Σ_{n=1..∞} P[Xn = j | X0 = i] > 0   ∀ i, j

3. aperiodic, if every state i with P[Xn = i | X0 = i] > 0 for some n ≥ 1 has period 1. The period of state i is defined as the greatest common divisor of all recurrence values n for which P[Xn = i ∧ Xk ≠ i for k = 1, . . . , n − 1 | X0 = i] > 0. In the special case of a self-loop (P[X1 = i | X0 = i] > 0) state i has period 1 and thus is aperiodic.

4. positive recurrent, if for every state i the recurrence probability is 1 and the mean recurrence time is finite:

       Σ_{n=1..∞} P[Xn = i ∧ Xk ≠ i for k = 1, . . . , n − 1 | X0 = i] = 1

       Σ_{n=1..∞} n · P[Xn = i ∧ Xk ≠ i for k = 1, . . . , n − 1 | X0 = i] < ∞

5. ergodic, if it is homogeneous, irreducible, aperiodic and positive recurrent.

In summary, a time-discrete homogeneous Markov chain can be defined as follows:

Definition 5 (Time-discrete homogeneous Markov chain). A time-discrete homogeneous Markov chain is a pair (X, p) with a state set X and a transition probability function p : X × X → [0, 1] with the property

    Σ_j pij = 1   ∀ i,

where i, j ∈ X and pij = p(i, j).

For finite-state time-discrete homogeneous Markov chains the transition probabilities pij can be arranged in a square matrix, the transition probability matrix of the chain:

    ( p00 p01 . . . p0n )
    ( p10 p11 . . . p1n )
    (  :   :         :  )
    ( pn0 pn1 . . . pnn )

All matrix entries are ≥ 0 and each row sums up to 1.

The n-step transition probabilities can be computed using the Chapman-Kolmogorov equations. Thus

    pij^(n) = P[Xn = j | X0 = i]
            = Σ_k pik^(n−1) · pkj^(1)
            = Σ_k pik^(n−l) · pkj^(l)   for 1 ≤ l ≤ n − 1

and pij^(1) = pij.
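For a finite chain, the Chapman-Kolmogorov equations say that the matrix of n-step probabilities is the n-th power of the transition probability matrix. A quick numerical check in plain Python on an illustrative two-state chain:

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def n_step(p, n):
    """n-step transition probabilities p_ij^(n) via repeated
    matrix multiplication (Chapman-Kolmogorov with l = 1)."""
    result = p
    for _ in range(n - 1):
        result = mat_mul(result, p)
    return result


# Illustrative two-state chain; rows sum to 1 as required.
P = [[0.9, 0.1],
     [0.5, 0.5]]
P2 = n_step(P, 2)
# p_00^(2) = 0.9 * 0.9 + 0.1 * 0.5 = 0.86
```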

Besides the conditional probabilities of being in state j after n steps, given that the starting state is i, the unconditional probabilities of being in state j at time n are of interest:

    πj^(n) = P[Xn = j]

with

    Σ_j πj^(n) = 1   ∀ n

and

    P[Xn = j] = Σ_i P[Xn = j | X0 = i] · P[X0 = i]
              = Σ_i pij^(n) · πi^(0)

For every ergodic Markov chain there exists a limiting probability distribution π = (π0, π1, . . .) such that πj = lim_{n→∞} πj^(n). These limiting probabilities are independent of the initial state probabilities π^(0) and form a stationary probability distribution satisfying the so-called balance equations:

    ∀ j : πj = Σ_i πi · pij

    Σ_j πj = 1
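The stationary distribution of a small ergodic chain can be found by iterating π^(n+1) = π^(n) · P until the balance equations hold. A sketch on an illustrative two-state chain (whose stationary distribution is (5/6, 1/6)):

```python
def stationary(p, tol=1e-12, max_iter=10_000):
    """Iterate pi <- pi * P from the uniform distribution until the
    change falls below tol; for an ergodic chain this converges to the
    unique stationary distribution satisfying the balance equations."""
    n = len(p)
    pi = [1.0 / n] * n
    for _ in range(max_iter):
        new = [sum(pi[i] * p[i][j] for i in range(n)) for j in range(n)]
        if max(abs(a - b) for a, b in zip(new, pi)) < tol:
            return new
        pi = new
    return pi


P = [[0.9, 0.1],
     [0.5, 0.5]]
pi = stationary(P)
```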


Theorem 1. A finite Markov chain with discrete parameter space that is homogeneous, irreducible and aperiodic is ergodic and positive recurrent.

Proof. See [4].
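To illustrate the balance equations numerically, the following sketch computes the stationary distribution of a small hand-made transition matrix by repeated multiplication; the matrix is a hypothetical example, not taken from this thesis.

```python
import numpy as np

# Toy 3-state transition matrix P (each row sums to 1); hypothetical values.
P = np.array([[0.5, 0.3, 0.2],
              [0.1, 0.6, 0.3],
              [0.2, 0.4, 0.4]])

pi = np.full(3, 1.0 / 3.0)   # arbitrary initial distribution pi^(0)
for _ in range(200):         # pi^(n) = pi^(n-1) P
    pi = pi @ P

# pi now satisfies the balance equations: pi_j = sum_i pi_i * p_ij
assert np.allclose(pi, pi @ P)
```

Because the chain is finite, aperiodic and irreducible, the iterates converge to the same π regardless of the initial distribution, as Theorem 1 guarantees.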

2.2.2 PageRank

PageRank, developed by Sergey Brin and Lawrence Page [12, 11] in 1998, measures the importance of web pages by link analysis on the web graph. The key idea is to grant a web page higher prestige the more high-quality pages link to it. This way PageRank goes beyond simple inlink counting, as the quality of inlinks is also taken into account. The rationale behind this is that the outlinks of a web page exhibit the intellectual input of its creator, who judges the linked pages valuable. Formalizing this concept, the PageRank of a document d, denoted by \vec{p}(d), should be proportional to \sum_{c \in predecessors(d)} \vec{p}(c).
Another intuitive derivation of the PageRank scores assumes a random surfer who starts at page c with probability \vec{p}_0(c) according to the initial probability distribution \vec{p}_0 with \sum_c \vec{p}_0(c) = 1. At each step she picks an outgoing link uniformly at random. Thus the probability of the random surfer visiting page d after n steps is

\[ \vec{p}_n(d) = \sum_{c \in predecessors(d)} \frac{\vec{p}_{n-1}(c)}{outdegree(c)} \]

The importance of page d then would be the stationary visiting probability \vec{p}(d) of the Markov chain modelling the random surfer, which is defined by the set of states (nodes) V and the transition probability function

\[
t : V \times V \to [0, 1], \quad (u, v) \mapsto
\begin{cases}
\frac{1}{outdegree(u)} & \text{if } (u, v) \in E \\
0 & \text{otherwise}
\end{cases}
\]

where E is the set of edges. We know from the theory of Markov chains that the limiting probabilities exist if the considered Markov chain is ergodic. As V is finite, the property of ergodicity follows if the Markov chain (V, t) is homogeneous, irreducible and aperiodic according to Theorem 1. Clearly (V, t) is homogeneous, as the transition probabilities are independent of the time the transition takes place. Irreducibility turns out to be more problematic, as studies of the web graph show that it does not hold. To overcome this drawback random jumps are added to the random surfer model, that is, at each state the random surfer either

• performs a random jump to any other state with small probability ε (usually between 0.1 and 0.2)

or

• chooses an outgoing link uniformly at random with probability 1− ε


This yields the transition probability function

\[
t : V \times V \to [0, 1], \quad (u, v) \mapsto
\begin{cases}
\varepsilon \cdot \vec{r}(v) + (1 - \varepsilon) \cdot \frac{1}{outdegree(u)} & \text{if } (u, v) \in E \\
\varepsilon \cdot \vec{r}(v) & \text{otherwise}
\end{cases}
\]

with random jump vector \vec{r}. Now irreducibility and aperiodicity are naturally given, and the stationary state probabilities can be computed either by power iteration until convergence or by solving the linear system of balance equations.

Power iteration

\[ \vec{p}_n(d) = \varepsilon \cdot \vec{r}(d) + (1 - \varepsilon) \cdot \sum_{c \in predecessors(d)} \frac{\vec{p}_{n-1}(c)}{outdegree(c)} \]

Solving the linear system of equations

\[ \vec{p} = \varepsilon \cdot \vec{r} + (1 - \varepsilon) \cdot E'^{\,T} \cdot \vec{p} \]

where E' is the matrix (e'_{ij}) with

\[
e'_{ij} =
\begin{cases}
\frac{1}{outdegree(i)} & \text{if } (i, j) \in E \\
0 & \text{otherwise}
\end{cases}
\]

Using \sum_c \vec{p}(c) = 1, i.e., \vec{e}^{\,T} \vec{p} = 1 with \vec{e} = (1, \ldots, 1)^T, this is equivalent to

\[ \vec{p} = \underbrace{\left( \varepsilon \cdot \vec{r} \cdot \vec{e}^{\,T} + (1 - \varepsilon) \cdot E'^{\,T} \right)}_{A} \cdot \vec{p} \]

Thus \vec{p} is the principal eigenvector of the composed matrix A.

The PageRank vector \vec{p} can be computed offline. At query-time the PageRank scores are combined with relevance scores, e.g., by a weighted sum, to produce an authority-enhanced ranking.
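The power iteration described above can be sketched in a few lines of Python. This is a minimal illustration on a hypothetical toy graph, not the implementation used in this thesis; the `pagerank` function name and parameters are chosen for the example.

```python
import numpy as np

def pagerank(edges, n, eps=0.15, iters=100):
    """Power iteration p_n(d) = eps*r(d) + (1-eps)*sum_{(c,d) in E} p_{n-1}(c)/outdegree(c),
    with a uniform random-jump vector r. Mass of zero-outdegree nodes is spread
    uniformly, one of the standard remedies."""
    out = [0] * n
    for u, v in edges:
        out[u] += 1
    p = np.full(n, 1.0 / n)
    r = np.full(n, 1.0 / n)
    for _ in range(iters):
        nxt = eps * r
        dangling = sum(p[u] for u in range(n) if out[u] == 0)
        nxt = nxt + (1 - eps) * dangling / n   # dangling mass jumps uniformly
        for u, v in edges:
            nxt[v] += (1 - eps) * p[u] / out[u]
        p = nxt
    return p

# tiny example graph: 0 -> 1, 1 -> 2, 2 -> 0, 2 -> 1
p = pagerank([(0, 1), (1, 2), (2, 0), (2, 1)], 3)
assert abs(p.sum() - 1.0) < 1e-9   # p is a probability distribution
```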

Special case: Zero-outdegree nodes

Zero-outdegree nodes have to be handled as they violate the essential property of the transition probability matrix that each row sums up to 1. There are mainly two solutions: One possibility is to augment the probability mass for random jumps at zero-outdegree nodes from ε to 1. The other option is to recursively remove zero-outdegree nodes from the web graph for the PageRank computation. This is feasible as zero-outdegree nodes have no effect on the PageRank scores of the other nodes. Afterwards their PageRank scores can be computed from the scores of their predecessors.

2.2.3 HITS

Hyperlink-Induced Topic Search (HITS), developed by Jon Kleinberg [30], introduces the notions of hubs and authorities that mutually reinforce each other. Hubs are usually


characterized by a high outdegree, thus serving as good link sources, whereas authorities stand out for their high indegree and good content. The key concept underlying HITS can be summarized as follows:

• Good hubs link to good authorities.

• Good authorities have good hubs as predecessors.

This relationship is described formally for the web graph (V, E) and nodes u, v ∈ V by the authority score

\[ x_v = \sum_{(u,v) \in E} y_u \]

and the hub score

\[ y_u = \sum_{(u,v) \in E} x_v \]

With E being overloaded to also denote the adjacency matrix of the web graph, the above equations can be reformulated in matrix notation:

\[ \vec{x} = E^T \cdot \vec{y} \qquad \vec{y} = E \cdot \vec{x} \]
\[ \Leftrightarrow \quad \vec{x} = E^T \cdot E \cdot \vec{x} \qquad \vec{y} = E \cdot E^T \cdot \vec{y} \]

Thus \vec{x} and \vec{y} are the principal eigenvectors of E^T · E and E · E^T, respectively. Intuitively, E^T · E is the co-citation matrix, with a nonzero entry (E^T · E)_{ij} indicating the number of nodes pointing to both i and j, whereas E · E^T is the co-reference matrix, with a nonzero entry (E · E^T)_{ij} indicating the number of nodes to which both i and j point.

HITS algorithm

1. Formation of root pages via query-dependent relevance ranking

2. Expansion of root set to base set by addition of all successors as well as alimited number of preceding pages

3. Iterative computation of authority and hub scores with initialization \vec{x} = \vec{y} = \frac{1}{|\text{base set}|}

4. Presentation of pages in descending order of authority scores
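Step 3 of the algorithm, the mutual-reinforcement iteration, can be sketched as follows. This is an illustrative Python fragment on a made-up graph; the normalization per round (here by Euclidean norm) is an assumption of the sketch, since any fixed normalization yields the same ranking.

```python
import numpy as np

def hits(edges, n, iters=50):
    """Iterate x = E^T y (authorities) and y = E x (hubs), normalizing each round."""
    E = np.zeros((n, n))
    for u, v in edges:
        E[u, v] = 1.0
    x = np.full(n, 1.0 / n)   # authority scores
    y = np.full(n, 1.0 / n)   # hub scores
    for _ in range(iters):
        x = E.T @ y
        y = E @ x
        x /= np.linalg.norm(x)
        y /= np.linalg.norm(y)
    return x, y

# hypothetical base set: pages 0, 1, 3 all link to page 2
x, y = hits([(0, 2), (1, 2), (3, 2), (2, 0)], 4)
assert x.argmax() == 2   # the most-cited page becomes the top authority
```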


2.3 Evaluation of Search Results

To assess the quality of an information retrieval system, the two orthogonal measures precision and recall try to formally capture the attributes of valuable query results. Whereas the latter evaluates the capability of the system to retrieve the relevant documents, the former punishes systems that do not deliver a query-specific response, but return a lot of irrelevant junk alongside the relevant documents. Precision and recall are defined as follows.

Definition 6 (Precision). Let D be a set of documents, R ⊆ D the subset of relevant documents with respect to query q, and A ⊆ D the search result for query q. Then

\[ \text{Precision} = \frac{|R \cap A|}{|A|} \]

i.e., precision is the fraction of relevant documents within the set of retrieved documents.

Definition 7 (Recall). Let D be a set of documents, R ⊆ D the subset of relevant documents with respect to query q, and A ⊆ D the search result for query q. Then

\[ \text{Recall} = \frac{|R \cap A|}{|R|} \]

i.e., recall is the fraction of retrieved documents within the set of relevant documents.

Usually there is a tradeoff between precision and recall, i.e., the higher recall gets, the lower precision tends to be. Thus a retrieval system is characterized by the ratio of precision to recall. Precision-recall curves that plot precision at several recall levels visualize this relationship. In ranked retrieval the position of relevant documents yields an additional measure of the goodness of an information retrieval system. Consider the following rankings for query q with relevant documents being marked.


Left ranking: 1. d1, 2. d5 ★, 3. d19, 4. d8 ★, 5. d2, 6. d32 ★, 7. d67, 8. d3, 9. d45, 10. d74

Right ranking: 1. d9, 2. d29, 3. d73, 4. d8 ★, 5. d0, 6. d82, 7. d95, 8. d32 ★, 9. d7, 10. d5 ★

(★ marks a relevant document.)

Clearly both rankings have equal precision (0.3) and also share the same recall value, thus are assessed to be of equal quality. However, a user might prefer the ranking on the left that presents relevant documents earlier. To also reflect the position of relevant documents in the evaluation, precision at a certain cutoff, for example 5 or 10 documents, or at certain recall levels (10%, 20%, . . .), is considered. Several precision values can be aggregated into a single average precision, e.g., the uninterpolated average precision. Thereby precision is computed at each point in the ranking where a relevant document occurs. For the left ranking above this yields 1/2, 2/4, 3/6, and for the right one we get 1/4, 2/8, 3/10, which averages to 0.5 and approximately 0.27, respectively.
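The uninterpolated average precision of the two rankings above can be checked with a few lines of Python (an illustrative sketch; the helper name is chosen for the example):

```python
def average_precision(ranking, relevant):
    """Uninterpolated average precision: mean of the precision values taken
    at each rank where a relevant document occurs."""
    hits, precisions = 0, []
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions)

left  = ["d1", "d5", "d19", "d8", "d2", "d32", "d67", "d3", "d45", "d74"]
right = ["d9", "d29", "d73", "d8", "d0", "d82", "d95", "d32", "d7", "d5"]
relevant = {"d5", "d8", "d32"}

assert average_precision(left, relevant) == 0.5
assert round(average_precision(right, relevant), 2) == 0.27
```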

When comparing different retrieval algorithms one often wants to average theprecisions for several queries. Two averaging schemes are distinguished, macro-averaged precision and micro-averaged precision.

Definition 8 (Macro-averaged precision). Let R(q_i) be the set of relevant documents with respect to query q_i and A(q_i) the set of retrieved documents for query q_i. Then the macro-averaged precision of the queries q_1, . . . , q_k is defined as

\[ \text{macro-averaged precision} = \frac{1}{k} \sum_{i=1}^{k} \frac{|R(q_i) \cap A(q_i)|}{|A(q_i)|} \]

Definition 9 (Micro-averaged precision). Let R(q_i) be the set of relevant documents with respect to query q_i and A(q_i) the set of retrieved documents for query q_i. Then the micro-averaged precision of the queries q_1, . . . , q_k is defined as

\[ \text{micro-averaged precision} = \frac{\sum_{i=1}^{k} |R(q_i) \cap A(q_i)|}{\sum_{i=1}^{k} |A(q_i)|} \]
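Both averaging schemes can be computed in a short sketch, assuming the standard definitions with |R(q_i) ∩ A(q_i)| in the numerator; the example queries are hypothetical.

```python
def macro_micro(results):
    """results: list of (A, R) pairs per query: retrieved set A, relevant set R."""
    per_query = [len(A & R) / len(A) for A, R in results]
    macro = sum(per_query) / len(per_query)           # mean of per-query precisions
    micro = (sum(len(A & R) for A, R in results)
             / sum(len(A) for A, R in results))       # pooled counts
    return macro, micro

# hypothetical queries: a short and a long result list
results = [({"a", "b"}, {"a"}),                       # precision 1/2
           ({"c", "d", "e", "f"}, {"c"})]             # precision 1/4
macro, micro = macro_micro(results)
assert macro == 0.375            # (1/2 + 1/4) / 2
assert abs(micro - 2 / 6) < 1e-9 # (1 + 1) / (2 + 4)
```

The second assertion shows how the micro-average is dominated by the longer result list, in line with the discussion above.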


The main difference between both schemes lies in the weighting of the length of query results. While the micro-average is influenced to a greater extent by long result sets than by short ones, and thus implicitly takes recall into account, the macro-average completely ignores the length of search results. For further details see [32].

2.4 Relevance Feedback

Several approaches try to improve first provisional search results by further post-processing steps that take user feedback into account. The general underlying framework can be described as follows. Given a user query q and an answer set A of documents, relevance feedback is captured in an assessment function a : A → {+, −} yielding a set A⁺ ⊆ A of positively assessed documents and a set A⁻ ⊆ A of negatively assessed documents with A⁺ ∩ A⁻ = ∅.

2.4.1 Query Expansion

Query expansion techniques rewrite a user query q into q′ based on the feedback a user provides on the results of the original query q. Then the whole retrieval procedure is repeated for q′ and a new answer set is presented to the user.

Rocchio Method

In the course of the Rocchio method the original query q is rewritten into q′ according to the following equation:

\[ \vec{q}\,' = \alpha \cdot \vec{q} + \frac{\beta}{|A^+|} \cdot \sum_{d \in A^+} \vec{d} \; - \; \frac{\gamma}{|A^-|} \cdot \sum_{d \in A^-} \vec{d} \]

where \vec{d}, \vec{q}, \vec{q}\,' are vectors in the feature space and α, β, γ ∈ [0, 1].
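A direct transcription of the Rocchio update into Python, using `np.mean` for the (1/|A±|)·Σ terms. The weight values are illustrative assumptions; the thesis only requires α, β, γ ∈ [0, 1].

```python
import numpy as np

def rocchio(q, pos, neg, alpha=1.0, beta=0.75, gamma=0.15):
    """q' = alpha*q + (beta/|A+|)*sum_{d in A+} d - (gamma/|A-|)*sum_{d in A-} d."""
    q_new = alpha * q
    if len(pos):
        q_new = q_new + beta * np.mean(pos, axis=0)   # pull towards positives
    if len(neg):
        q_new = q_new - gamma * np.mean(neg, axis=0)  # push away from negatives
    return q_new

q = np.array([1.0, 0.0, 0.0])                        # original query vector
pos = np.array([[0.0, 1.0, 0.0], [0.0, 1.0, 1.0]])   # positively assessed docs
neg = np.array([[0.0, 0.0, 1.0]])                    # negatively assessed doc
q_new = rocchio(q, pos, neg)
assert q_new[1] > 0.0   # terms of positive documents enter the query
```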

Pseudo-relevance Feedback

Because users are often reluctant to provide relevance feedback, Xu and Croft [52] proposed a simple heuristic to simulate user feedback that succeeds astonishingly well in improving first search results. They simply view the top-n results for query q as positive, relevant documents and expand q with the best terms of the top-ranked documents in the answer set of q.

2.4.2 Adjusting the Ranking Function

Other methods of relevance feedback do not rely on query expansion or query reformulation, but on an adjustment of the ranking function. Consider the following


distance function for ranking documents with respect to query q.

\[ L_p(q, d) = \sqrt[p]{\sum_{i=1}^{n} w_i \cdot (q_i - d_i)^p} \]

By re-adjusting the dimension weights w_i according to the role of dimension i in positively assessed documents d ∈ A⁺, search results can be adapted to the user's search intention. For example, w_i can be chosen inversely proportional to the variance of the features of dimension i over the documents d ∈ A⁺, i.e.,

\[ w_i \propto \frac{1}{Var[d_i \mid d \in A^+]} \]

To avoid overshooting, old and refined dimension weights are linearly combined toform new dimension weights.
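The variance-based reweighting with linear blending can be sketched as follows. The blend factor `mix` and the variance floor are assumptions of this illustration, not values from the thesis.

```python
import numpy as np

def refine_weights(w_old, positives, mix=0.5, floor=1e-6):
    """New weights proportional to 1/Var[d_i | d in A+], linearly blended with
    the old weights to avoid overshooting."""
    var = positives.var(axis=0) + floor    # per-dimension variance (floored)
    w_refined = 1.0 / var
    w_refined /= w_refined.sum()           # normalize to a weight distribution
    return mix * w_old + (1 - mix) * w_refined

w0 = np.full(3, 1 / 3)
A_plus = np.array([[0.9, 0.1, 0.5],        # hypothetical positive documents
                   [0.9, 0.8, 0.4],
                   [0.9, 0.2, 0.6]])
w1 = refine_weights(w0, A_plus)
assert w1.argmax() == 0   # dimension 0 is constant across A+ -> highest weight
```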


Chapter 3

Related Work

Beyond the two most prominent approaches to link analysis, the HITS method by Kleinberg [30] and PageRank by Brin and Page [11, 12] (described in Sections 2.2.3 and 2.2.2, respectively), a variety of extensions and generalizations of both methods has been developed [6, 9, 31, 1, 18, 23, 14]. Some of these are discussed in Section 3.1.
Query-dependent link analysis has been explicitly addressed by Richardson and Domingos [36], inspired by and extending work of Cohn and Hofmann [15] on a predictive probabilistic model for document content and hyperlink connectivity. An algebraic, spectral-analysis-type approach to such predictive models has been developed by Achlioptas et al. [2]. Among these, only Richardson and Domingos provide a practical algorithm for query-dependent authority ranking. They use a "directed surfer" model where the probabilities for following links are biased towards target pages with high similarity to the given query (for a more detailed description see Section 3.2). None of these methods considers query-log or click-stream information.

However, exploiting query logs for an improvement of search results is not a completely new idea, and in fact several recent works build on this concept. These are summarized in Section 3.3.

3.1 Variations of HITS and PageRank

3.1.1 Extensions to HITS

Topic-specific HITS

Bharat and Henzinger [6] discovered and tackled three problems of the original HITS algorithm (refer to Section 2.2.3):

1. Mutually reinforcing relationships between hosts: Several documents on one host pointing to a single document on a second host, as well as the reverse case, one document on a first host pointing to a number of documents on a second host, lays too much weight on the opinion


[Figure 3.1: Edges with authority and hub weights]

of one "person" and pushes up the hub, respectively authority, scores of the considered documents to an undue extent.

2. Automatically generated links: For automatically generated links, as well as pure navigational links, the assumption underlying HITS that hyperlinks reflect an endorsement of the linked documents by the web page's creator is no longer valid.

3. Non-relevant documents: Some highly linked documents in the root set that are not relevant to the query topic often cause a topic drift: the most authoritative documents according to HITS are no longer about the original query topic.

To meet the first problem Bharat and Henzinger introduced edge weights. In the case of k documents on a first host pointing to a single document on a second host, each edge is given an authority weight of 1/k. In the reverse case of one document on a first host pointing to l documents on a second host, each edge is given a hub weight of 1/l. For example, consider Figure 3.1. Here the edges from documents 1, 2, and 4 to document 6 each have an authority weight of 1/3 and a hub weight of 1. Analogously, the edges from 5 to 7, 8, 9, and 10 each have a hub weight of 1/4 and an authority weight of 1.

To avoid topic drift Bharat and Henzinger suggested two approaches: eliminating non-relevant documents from the base set, and regulating the influence of documents according to their relevance. Thereby the relevance of a document to


the query topic is determined via a tf*idf-based cosine similarity (see Section 2.1.2) between the query topic (characterized by the documents in the root set of the HITS computation) and the first 1000 words of each document in the base set. The relevance scores obtained this way are then used to replace the original authority and hub scores with relevance-based authority and hub scores (simply the product of relevance and authority, respectively hub, scores).

For the pruning of documents from the base set Bharat and Henzinger tested three different strategies to decide on a threshold for the relevance score of documents:

Median Weight The threshold is the median of all the relevance scores.

Root Set Median Weight The threshold is the median of the relevance weightsof the documents in the root set.

Fraction of Maximum Weight The threshold is a fixed fraction of the maximumweight.

Depending on how the above approaches are combined, several modified HITS algorithms result. The generic scheme is depicted in the following, with the optional parts written in italics.

1. Formation of root pages via query-dependent relevance ranking

2. Expansion of root set to base set by addition of all successors as well as a limited number of preceding pages

3. Computation of query-specific relevance scores

4. Pruning of non-relevant documents from the base set according to some thresholding strategy

5. Iterative computation of authority and hub scores with initialization \vec{x} = \vec{y} = \frac{1}{|\text{base set}|}

6. While the authority vector \vec{x} and the hub vector \vec{y} have not converged:

\[ x_v = \sum_{(u,v) \in E} relevance(u) \cdot y_u \cdot auth\_weight(u, v) \]
\[ y_u = \sum_{(u,v) \in E} relevance(v) \cdot x_v \cdot hub\_weight(u, v) \]

where E is the set of hyperlinks
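The host-based edge weights of step 6 can be computed as in the following sketch, which reproduces the Figure 3.1 situation of three documents on host A linking to one document on host B (data structures and names are chosen for the illustration):

```python
from collections import defaultdict

def edge_weights(edges, host):
    """auth_weight(u,v) = 1/k if k documents on host(u) point to v;
    hub_weight(u,v) = 1/l if u points to l documents on host(v)."""
    by_target = defaultdict(set)   # (host(u), v) -> sources on that host
    by_source = defaultdict(set)   # (u, host(v)) -> targets on that host
    for u, v in edges:
        by_target[(host[u], v)].add(u)
        by_source[(u, host[v])].add(v)
    auth = {(u, v): 1.0 / len(by_target[(host[u], v)]) for u, v in edges}
    hub = {(u, v): 1.0 / len(by_source[(u, host[v])]) for u, v in edges}
    return auth, hub

# docs 1, 2, 4 on host A all link to doc 6 on host B (as in Figure 3.1)
edges = [(1, 6), (2, 6), (4, 6)]
host = {1: "A", 2: "A", 4: "A", 6: "B"}
auth, hub = edge_weights(edges, host)
assert abs(auth[(1, 6)] - 1 / 3) < 1e-9 and hub[(1, 6)] == 1.0
```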


[Figure 3.2: Document citation model — a document d generates a factor z with P(z|d), which generates a citation c with P(c|z)]

Evaluation. For rankings based on authority scores, Bharat and Henzinger observed an improvement in terms of precision by at least 26% over standard HITS when only the problem of mutually reinforcing relationships between hosts was tackled. Additionally applying pruning and relevance-based regulation each resulted in further enhancements in precision by about 10%. However, the combination of pruning and regulation yielded no greater benefit. In the case of the authority-based ranking, the differences between the three thresholding strategies were only minor and not clearly in favor of one particular strategy. Only in the case of a ranking based on hub scores did the median weight strategy significantly outperform the others.

PHITS - Probabilistic HITS

Whereas the authority of a document in HITS (refer to Section 2.2.3) relies only on the principal (largest) eigenvector of the co-citation matrix E^T · E, Cohn and Chang [14] advocate taking smaller eigenvectors into account as well. Ignoring all eigenvectors but the principal one might be crucial in the case of only slight differences between eigenvectors in terms of magnitude, but huge differences in terms of orientation. Cohn and Chang also criticize principal component analysis (PCA), which extracts multiple eigenvectors, for being built on faulty statistical assumptions. In contrast, their method of choice is a statistical model, called the aspect model [37], to describe document citations: The two observables, documents and citations (cited documents), are explained in terms of a small number of common, but unobserved variables (factors/aspects), and the following generative process is considered: A document d is generated with probability P(d), the factor or topic z associated with d is chosen probabilistically according to P(z|d), and citations c are generated according to P(c|z) (cf. Figure 3.2).

Given a set of document-citation pairs (d, c), the likelihood of document d citing c is described by

\[ P(d \wedge c) = P(d) \cdot P(c \mid d) \]

with

\[ P(c \mid d) = \sum_z P(c \mid z) \cdot P(z \mid d) \]

Putting both equations together and using Bayes' rule, P(z \mid d) = \frac{P(d \mid z) \cdot P(z)}{P(d)}, yields:

\[ P(d \wedge c) = \sum_z P(c \mid z) \cdot P(d \mid z) \cdot P(z) \]


The total likelihood L of all observed document-citation pairs,

\[ L = \prod_{(d,c)} P(d \wedge c) = \prod_{(d,c)} \left( \sum_z P(c \mid z) \cdot P(d \mid z) \cdot P(z) \right), \]

forms the objective function that is to be maximized by choosing appropriate values for P(d), P(z|d) and P(c|z). This is done by a "tempered" version of the EM algorithm [17].
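The EM fitting of the aspect model can be sketched as below. This is plain EM without the tempering of [17], on a tiny made-up set of citation pairs; all names and the random initialization are assumptions of the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

def aspect_model_em(pairs, n_docs, n_cits, n_z, iters=50):
    """Maximize L = prod_(d,c) sum_z P(c|z)P(d|z)P(z) by plain (untempered) EM."""
    eps = 1e-12
    Pz = np.full(n_z, 1.0 / n_z)
    Pd_z = rng.random((n_z, n_docs)); Pd_z /= Pd_z.sum(axis=1, keepdims=True)
    Pc_z = rng.random((n_z, n_cits)); Pc_z /= Pc_z.sum(axis=1, keepdims=True)
    for _ in range(iters):
        # E-step: posterior P(z | d, c) for each observed pair
        post = np.array([Pz * Pd_z[:, d] * Pc_z[:, c] for d, c in pairs])
        post /= post.sum(axis=1, keepdims=True) + eps
        # M-step: re-estimate the parameters from expected counts
        Pz = post.sum(axis=0) / len(pairs)
        Pd_z = np.zeros_like(Pd_z); Pc_z = np.zeros_like(Pc_z)
        for (d, c), p in zip(pairs, post):
            Pd_z[:, d] += p
            Pc_z[:, c] += p
        Pd_z /= Pd_z.sum(axis=1, keepdims=True) + eps
        Pc_z /= Pc_z.sum(axis=1, keepdims=True) + eps
    return Pz, Pd_z, Pc_z

# two clearly separated citation communities (hypothetical data)
pairs = [(0, 0), (0, 1), (1, 0), (1, 1), (2, 2), (2, 3), (3, 2), (3, 3)]
Pz, Pd_z, Pc_z = aspect_model_em(pairs, 4, 4, 2)
```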

Evaluation. Two corpora served as data sets: the online archive of computer science research papers, Cora [33], as well as query-specific web documents retrieved via AltaVista (www.altavista.com) that formed the root set for the PHITS computation. The Cora archive contained 30,000 papers which were categorized into a Yahoo-like topic hierarchy with about 70 leaf categories. The "Machine Learning" category with its 7 subtopics, consisting of 4240 documents in total, served as training data for PHITS. Here one drawback of PHITS became obvious: one has to decide on the number of factors a priori. A computation of PHITS with 7 factors recovered the original subtopics, and for each factor z the ordering according to P(c|z) ranked quite authoritative papers at the top. Alternatively, P(z|c) can be used to classify a document c according to communities, and the product P(z|c) · P(c|z) identifies heavily-cited, domain-specific documents, i.e., documents that are characteristic for z. Analogous considerations yield factor-specific hubs according to P(d|z), as well as classification via P(z|d), and the characteristic probability P(z|d) · P(d|z).
The results of HITS and PHITS on the same root set of web documents for the query "jaguar" revealed one characteristic difference between HITS and PHITS. Whereas with HITS the rankings of the largest eigenvectors are disjoint and thus represent distinct topics, with PHITS two topics are not necessarily segregated into one factor each, but might interleave in both. In summary these anecdotal results militate in favor of PHITS; however, statistical evidence is missing.

3.1.2 Extensions to PageRank

Topic-sensitive PageRank

Haveliwala [23] fully explored the idea of a personalized PageRank that was first suggested by Page in [11]. Instead of a completely unbiased random jump vector that follows a uniform distribution, or the extreme opposite, a single web page that is the target of all random jumps as tested by Brin and Page, Haveliwala suggested a limited set of topic-specific jump vectors to facilitate a scalable personalization of web search. For his experiments he chose 16 basic topic vectors which were each biased towards the URLs in one of the 16 top-level categories of the Open Directory Project [34]: Let T_j be the set of URLs in the ODP category c_j. Then the corresponding basic topic vector is defined as

\[
\vec{v}_j(i) =
\begin{cases}
\frac{1}{|T_j|} & \text{if } i \in T_j \\
0 & \text{otherwise}
\end{cases}
\]

For each of these 16 basic topics, biased PageRank vectors \vec{p}_j can be computed by replacing the random jump vector with the appropriate basic topic vector. By the Linearity Theorem


the PageRank \vec{p} for a topic that is formed by a linear combination of the basic topics can easily be computed by the same linear combination of the basic biased PageRank vectors.

Theorem 2 (Linearity Theorem).

\[ \vec{p} = \varepsilon \cdot \Bigl( \sum_j w_j \cdot \vec{v}_j \Bigr) + (1 - \varepsilon) \cdot E'^{\,T} \cdot \vec{p} \quad \Longleftrightarrow \quad \vec{p} = \sum_j w_j \cdot \underbrace{\bigl( \varepsilon \cdot \vec{v}_j + (1 - \varepsilon) \cdot E'^{\,T} \cdot \vec{p}_j \bigr)}_{=\, \vec{p}_j} \]

with

\[ \sum_j w_j = 1 \]

Thus the 16 PageRank vectors for the top-level ODP classes can be computed offline, and at query time only the weights w_j have to be chosen. These are determined according to the class probabilities of the ODP categories conditioned on the context of the posed query q. This context, coined q′, might be the document that contains the query term highlighted by the user, the history of previously posed queries, or a query-independent user context obtained from the user's browsing patterns or bookmarks.

\[ w_j = P(c_j \mid q') = \frac{P(c_j) \cdot P(q' \mid c_j)}{P(q')} \propto P(c_j) \cdot \prod_i P(q'_i \mid c_j) \]

P(q′_i | c_j) can be computed utilizing the term distribution in the set of documents contained in the ODP category c_j. The class probability P(c_j) offers a second possibility for personalization: a user-specific class probability P_k(c_j) that reflects the interests of user k is an alternative to a uniform distribution of class probabilities.
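The Linearity Theorem can be verified numerically: the biased PageRank for a combined jump vector coincides with the weighted combination of the individually biased PageRank vectors. A sketch on a hypothetical 3-page graph (names and values chosen for the illustration):

```python
import numpy as np

def biased_pagerank(P, v, eps=0.25, iters=200):
    """Iterate p = eps*v + (1-eps)*P^T p for row-stochastic P and jump vector v."""
    p = np.array(v, dtype=float)
    for _ in range(iters):
        p = eps * v + (1 - eps) * P.T @ p
    return p

P = np.array([[0.0, 1.0, 0.0],     # toy row-stochastic link matrix
              [0.5, 0.0, 0.5],
              [1.0, 0.0, 0.0]])
v1 = np.array([1.0, 0.0, 0.0])     # topic vector biased to page 0
v2 = np.array([0.0, 0.5, 0.5])     # topic vector biased to pages 1 and 2
w1, w2 = 0.3, 0.7                  # query-time topic weights, w1 + w2 = 1

combined = biased_pagerank(P, w1 * v1 + w2 * v2)
linear = w1 * biased_pagerank(P, v1) + w2 * biased_pagerank(P, v2)
assert np.allclose(combined, linear)   # the Linearity Theorem in action
```

This is exactly what makes the approach scalable: only the per-topic vectors are computed offline, and the query-time combination is a cheap weighted sum.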

Evaluation. A web crawl from the Stanford WebBase containing about 120 million pages and 280,000 of the 3 million URLs in the ODP served as the data source for his experiments. Haveliwala chose ε = 0.25 and did not delve into elaborate experiments on parameter tuning, as he observed only minor differences in the rankings when varying ε compared with the huge differences resulting from different topically-biased PageRank vectors. An evaluation on 10 sample queries yielded an average precision of 0.51 for topic-sensitive PageRank, whereas ordinary PageRank achieved only 0.28. Furthermore he outlined the potential of his method for a disambiguation of query terms via query context and the ranking of search results according to the appropriate sense of the search terms.

Efficiency considerations

Several recent contributions aim at improving PageRank in terms of computation and storage requirements. Kamvar, Haveliwala, Manning, and Golub [28] accelerate the


convergence rates of the iterative PageRank computation by two algorithms, the Aitken Extrapolation and the Quadratic Extrapolation. Both methods essentially exploit the fact that the first eigenvalue of a Markov matrix is known to be 1 to compute estimates of the next two, respectively three, iterates of the Power Method, assuming that the current iterate can be expressed as a linear combination of only two, respectively three, eigenvectors. By periodically estimating the next iterates and subtracting the deduced estimates of the nonprincipal eigenvectors from the current iterate of the Power Method, a recognizable speed-up of the PageRank computation was achieved.

Kamvar, Haveliwala, Manning and Golub also developed another strategy to augment the efficiency of PageRank. In [29] they proposed a nested PageRank computation to exploit the typical link structure of many intra-host links between pages on the same host and few inter-host links between different hosts. Their algorithm works in four stages:

(1) the link graph is partitioned into blocks of highly interlinked domains,

(2) for each block, local PageRanks (LPR) of its pages are computed considering only the link structure of the block's subgraph,

(3) each block is assigned a Block Rank (BR) obtained from a PageRank computation on the condensed graph of blocks,

(4) and finally a standard PageRank computation is run with initialization \vec{p}_0(j) = LPR(j) \cdot BR(J), where J is the block that contains j.

The main benefits of this method lie in the parallelizability and reusability of the local PageRank computations. However, the core intention of a faster convergence of the iterative PageRank computation might be achieved more easily by cleverly choosing the initialization vector (e.g., an initialization vector that reflects the indegrees of pages). Moreover, the observed speed-up in convergence highly depends on the link structure and has not been examined on hosts with a low intra-host link density like GeoCities, T-Online, etc.

In [24] Haveliwala explored several approaches towards an efficient encoding of ranking vectors. Instead of storing the absolute rank, only approximate values requiring fewer bits are considered. Roughly, this is achieved by partitioning the range of PageRank values into intervals (or cells) and storing for each page only the codeword of its cell.

Jeh and Widom [27] developed a sophisticated approach towards efficient personalized web search. They determined minimal building blocks, called partial vectors, that are shared across multiple personalizations and facilitate scalability in terms of computation and storage costs.

Adaptive Online Page Importance Computation

Abiteboul, Preda, and Cobena [1] developed an authority analysis algorithm that leans on PageRank, but aims at overcoming some of the drawbacks PageRank suffers from. In contrast to standard PageRank, their OPIC (Online Page Importance Computation) algorithm works online (in parallel with crawling the web) and does not require storing


the link matrix. Informally spoken, OPIC initially distributes an equal amount of cash to each web page. Every time a page v is visited, all its cash is distributed equally among its successors, the pages it links to. Moreover, each page has a credit history indicating the total amount of cash obtained by the page since the start of the algorithm until the last time it was crawled. Thus when a page is visited, its current cash is added to its credit history and then reset to zero. This behavior is captured in the following pseudo-code:

Let n be the number of web pages. The vectors C[1..n] and H[1..n] contain the cash, respectively history, values for each page. Whereas C is kept in main memory, H is stored on disk. Moreover, the variable G is introduced to count the total amount of distributed cash (G = \sum_i H[i] = |H|). Then OPIC proceeds as follows:

for each i let C[i] := 1/n;
for each i let H[i] := 0;
let G := 0;

do forever
begin
    choose some page i;
    % each node is selected infinitely often (fairness)
    H[i] += C[i];
    for each successor j of i do
        C[j] += C[i] / outdegree(i);
    G += C[i];
    C[i] := 0;
end

To ensure the strong connectivity of the web graph and to handle zero-outdegree pages, Abiteboul et al. introduced an equivalent to the random jumps of PageRank: a unique virtual page to which all web pages point and that reversely points to all web pages. The importance of a web page v is defined as \frac{H[v] + C[v]}{G + 1}, or alternatively as X[v] = \frac{H[v]}{G} = \frac{H[v]}{\sum_i H[i]}. They prove the convergence of X[v] in the course of the OPIC computation and examine three strategies of page selection and their effects on the speed of convergence:

• Random: The next page to crawl is chosen uniformly at random.

• Greedy: Always the page with highest cash is read next.

• Cycle: Pages are selected in a fixed round robin fashion.

All three strategies fulfill the fairness property, i.e., each page is read infinitely often, but experiments confirmed the intuition that the greedy strategy exhibits the best convergence behavior.
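The pseudo-code above, combined with the greedy strategy, translates into the following Python sketch. It assumes a strongly connected toy graph, so the virtual page for random jumps is omitted; all names are chosen for the illustration.

```python
def opic(succ, steps=2000):
    """OPIC with the greedy page-selection strategy on a strongly connected graph."""
    n = len(succ)
    C = [1.0 / n] * n                 # cash per page
    H = [0.0] * n                     # credit history per page
    G = 0.0                           # total distributed cash, G = sum(H)
    for _ in range(steps):
        i = max(range(n), key=lambda k: C[k])    # greedy: highest cash next
        H[i] += C[i]
        for j in succ[i]:
            C[j] += C[i] / len(succ[i])          # distribute cash to successors
        G += C[i]
        C[i] = 0.0
    return [(H[v] + C[v]) / (G + 1) for v in range(n)]

succ = {0: [1], 1: [0, 2], 2: [0]}   # a small strongly connected graph
imp = opic(succ)
assert abs(sum(imp) - 1.0) < 1e-9    # importances form a distribution
```

Note that the total cash in the system is invariant (always 1), which is what makes the importance values sum to exactly (G + 1)/(G + 1) = 1.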

To overcome OPIC's underlying simplifying assumptions of a static web graph and an a priori known number of web pages n, Abiteboul et al. studied an adapted version of OPIC that takes time into account to model the deletion and creation of web pages and


hyperlinks. Page importance in Adaptive OPIC is based on a time window. Nevertheless, the handling of new web pages and their cash values still remains somewhat ad-hoc.

Assessments of the quality of the obtained rankings were conducted, yielding quite good results; however, a comparative study between PageRank, HITS, and OPIC is missing.

3.1.3 Unified Frameworks

Unified Framework for Link Analysis

Ding, He, Husbands, Zha, and Simon [18] generalized and combined the key concepts of HITS and PageRank. Whereas HITS emphasizes the mutual reinforcement between hub and authority pages, the main characteristics of PageRank are weight normalization and the random walk model. Both PageRank and HITS were cast into a unified notational framework, and naturally arising new ranking algorithms that combine or generalize concepts of HITS and PageRank were studied.
Let V be the set of nodes representing web pages and E the set of hyperlinks with |V| = n. E is overloaded to also denote the adjacency matrix, i.e., an entry E_ij = 1 iff there is a hyperlink from page i to page j, otherwise E_ij = 0. Moreover Ding et al. introduced the diagonal matrices D_in and D_out that capture the indegrees and outdegrees, respectively, of the corresponding web pages on their diagonal. Each web page is also associated with both an authority and a hub score. These are contained in the authority vector ~x and the hub vector ~y. Using this notation HITS and PageRank are defined as follows:

HITS:

    ~x = Iop(~y),   ~y = Oop(~x)

with

    Iop(·) = E^T,   Oop(·) = E

PageRank:

    ~x = Iop(~x)

with

    Iop(·) = E^T · D_out^{-1} = P^T

The factor D_out^{-1} rescales the adjacency matrix E so that each row sums to one, thus P = (P_ij) is a stochastic matrix. Modelling also random jumps results in the full stochastic matrix definition

    Iop(·) = P^T = ε · (1/n) · ~e^T ~e + (1 − ε) · E^T · D_out^{-1}

where ~e = (1, 1, . . . , 1). Thus ~e^T ~e is a matrix of all 1's.
Ding et al. transferred this traditional PageRank computation for authorities analogously to hubs:

    ~y = Oop(~y)

with

    Oop(·) = ε · (1/n) · ~e^T ~e + (1 − ε) · E · D_in^{-1}


3.1. VARIATIONS OF HITS AND PAGERANK 31

Furthermore they generalized Iop and Oop to

    Iop(·) = D_in^{-p} · E^T · D_out^{-q},   Oop(·) = Iop(·)^T

with the mutually reinforcing iterative process

    ~x = Iop(~y),   ~y = Oop(~x)

and studied several instantiations of p and q with p, q ≥ 0. They focused their experiments on the out-link normalized rank (OnormRank), the in-link normalized rank (InormRank), and the symmetrically normalized rank (SnormRank), which are given in Table 3.1.

    Scheme       Iop                                  Oop
    HITS         E^T                                  E
    PageRank     E^T · D_out^{-1}                     E · D_in^{-1}
    OnormRank    E^T · D_out^{-1/2}                   D_out^{-1/2} · E
    InormRank    D_in^{-1/2} · E^T                    E · D_in^{-1/2}
    SnormRank    D_in^{-1/2} · E^T · D_out^{-1/2}     D_out^{-1/2} · E · D_in^{-1/2}

Table 3.1: Iop and Oop for selected ranking schemes

Besides, Ding et al. distinguished two score propagation schemes:

• Similarity-mediated score propagation
The iterative process ~x = Iop(Oop(~x)) for authority scores (analogously ~y = Oop(Iop(~y)) for hub scores) can alternatively be replaced by an eigenvector computation on M · ~x = λ · ~x with M = Iop(Oop(·)) and λ = 1. The matrix M can be interpreted to define the pairwise similarity between two web pages. Thus the eigenvector computation can be viewed as a similarity-mediated authority score propagation on the undirected similarity graph G(M) deduced from M.

• Random surfing score propagation
Applying the weight normalization of the random surfer model to the score propagation on the deduced similarity graph G(M) yields the refined equation

    (diag(~d))^{-1} · M · ~x = ~x   with   ~d = (∑_k M_1k, . . . , ∑_k M_nk).

Evaluation The theoretical results of Ding et al. comprised three main points:

1. In OnormRank, using the random surfer score propagation, the importance of a page is directly proportional to its indegree.

2. Analogously, the hub score of a page in InormRank, using the random surfer scorepropagation, is directly proportional to its outdegree.

3. Finally, in SnormRank using similarity-mediated score propagation, authority scoresare proportional to indegrees, and hub scores to outdegrees.


Figure 3.3: Construction of the bipartite SALSA graph

Moreover Ding et al. experimentally underpinned their main theme that PageRank and HITS produce quite similar rankings which are both highly correlated with indegrees. They listed 5 sample queries and the resulting rankings with respect to PageRank, HITS, and OnormRank to provide some evidence for the intermediate role of OnormRank between HITS and PageRank.

SALSA

Lempel and Moran [31] combined the Markov model concept of PageRank with the mutual reinforcement between hubs and authorities of HITS. Their approach SALSA (Stochastic Approach for Link-Structure Analysis) analyzes two different Markov chains, one for authorities and one for hubs, on a bipartite graph G′ constructed from the web graph G(V,E). For each web page v ∈ V the bipartite graph contains an authority node v_a if the indegree of v is greater than 0, and a hub node v_h if the outdegree of v is greater than 0. For each edge (v, u) in the original graph, an undirected edge (v_h, u_a) is introduced. This graph creation process is depicted in Figure 3.3.

The transition matrix A of the authority Markov chain is defined as

    a_ij = ∑_{k | (k_h, i_a), (k_h, j_a) ∈ G′} 1/deg(i_a) · 1/deg(k_h)

Thus a random walk on the authority Markov chain resides on the authority side of G′. All authorities subsequently visited by the random surfer are linked by a common hub page. The final equilibrium distribution of this random walk is proportional to node indegrees, i.e., SALSA is equivalent to ranking by indegree. Analogously, the transition matrix H of the hub Markov chain is defined as

    h_ij = ∑_{k | (i_h, k_a), (j_h, k_a) ∈ G′} 1/deg(i_h) · 1/deg(k_a)

and yields a hub ranking equivalent to ranking by outdegree.
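The claimed equivalence to indegree ranking can be checked on a toy example. The sketch below (our own illustrative code, not the authors') builds the authority chain for a three-hub graph whose authorities 3 and 4 have indegrees 2 and 1:

```python
# Sketch of the SALSA authority Markov chain on a toy web graph.
# An edge (v, u) means page v links to page u; all data is hypothetical.
edges = [("1", "3"), ("2", "3"), ("2", "4")]

hub_deg, auth_deg = {}, {}
for v, u in edges:
    hub_deg[v] = hub_deg.get(v, 0) + 1    # deg(v_h): outdegree of v
    auth_deg[u] = auth_deg.get(u, 0) + 1  # deg(u_a): indegree of u

auth_nodes = sorted(auth_deg)

def a(i, j):
    # a_ij: sum over hubs k linking to both i and j of 1/deg(i_a) * 1/deg(k_h)
    common_hubs = [k for (k, x) in edges if x == i and (k, j) in edges]
    return sum(1.0 / auth_deg[i] / hub_deg[k] for k in common_hubs)

# Power iteration for the equilibrium distribution of the authority chain.
pi = {i: 1.0 / len(auth_nodes) for i in auth_nodes}
for _ in range(200):
    pi = {j: sum(pi[i] * a(i, j) for i in auth_nodes) for j in auth_nodes}
```

The iteration converges to pi = (2/3, 1/3) for pages 3 and 4, exactly their indegree shares.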


3.2 Query-dependent Link Analysis

3.2.1 Query-dependent PageRank

Inspired by previous work of Cohn and Hofmann [15], Richardson and Domingos [36] presented a novel method to circumvent one drawback of PageRank, namely that, in contrast to HITS, it does not take any query-time information into account. They replaced the random surfer of ordinary PageRank, which follows links or performs random jumps in a completely unbiased manner, with a more intelligent surfer that chooses its way through the web depending on the content of the pages and the query terms. This behavior is more formally captured in the following definition:

    ~p_q(j) = ε · ~r_q(j) + (1 − ε) · ∑_{i ∈ predecessors(j)} ~p_q(i) · t_q(i → j)

where

~pq(j) is the query-dependent PageRank score of page j with respect to query q,

~rq is the query-dependent jump vector, and

t_q(i → j) denotes the probability of a transition from page i to page j given that the intelligent surfer has posed query q and there is a hyperlink (i, j).

The query-dependent jump vector and transition probabilities are computed accordingto the following equations:

    ~r_q(j) = ~R_q(j) / ∑_{k ∈ V} ~R_q(k)

where V is the set of nodes and ~R_q is a query-dependent relevance function.

    t_q(i → j) = ~R_q(j) / ∑_{k ∈ successors(i)} ~R_q(k)

, i.e., when choosing which outlink to follow, the intelligent surfer visits the page whose content has been deemed relevant to the query. Usually a query consists of several query keywords, e.g., consider the multiple-term query Q = {q1, . . . , qn}. Then the query-dependent PageRank (QD-PageRank) scores are obtained via aggregation over the atomic query terms:

    ~p_Q(j) = ∑_{q ∈ Q} ~p_q(j) · prob(q)

where prob(q) denotes the probability that the surfer uses q to guide its behavior.
What does this mean in terms of storage and running time with respect to standard PageRank? The QD-PageRank scores are computed offline for a set of terms contained in the lexicon L = {q1, q2, . . . , qm}. Let d_q be the number of documents which contain the term q. Then S = ∑_{q ∈ L} d_q is the number of unique document-term pairs. Under the assumption that the QD-PageRank of documents that do not contain any query term is zero, each QD-PageRank vector has dimension d_q. Thus the required space to store QD-PageRank is S. Let |V| be the total number of web pages. Then S/|V| approximates the number of different query terms per page, thus QD-PageRank requires approximately S/|V| times the space of standard PageRank. Analogously, the computation of QD-PageRank for query q is restricted to d_q documents. Thus the total running time is in O(S), a factor of S/|V| times the computation of standard PageRank alone.
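The intelligent-surfer recursion can be sketched on a toy graph as follows (the relevance values, page names, and damping constant are hypothetical; this is not the authors' code):

```python
# Sketch of query-dependent PageRank (Richardson & Domingos) on a toy graph.
# links[i] = outlinks of page i; relevance[j] stands for R_q(j), a
# query-dependent relevance score per page (assumed constants here).
links = {"a": ["b", "c"], "b": ["c"], "c": ["a", "b"]}
relevance = {"a": 0.1, "b": 0.6, "c": 0.3}
EPS = 0.15  # the random jump probability epsilon

pages = list(links)
# Query-dependent jump vector r_q(j) = R_q(j) / sum over k of R_q(k)
total_rel = sum(relevance.values())
r = {j: relevance[j] / total_rel for j in pages}

def t(i, j):
    # Transition t_q(i -> j) = R_q(j) / sum over successors k of i of R_q(k)
    if j not in links[i]:
        return 0.0
    return relevance[j] / sum(relevance[k] for k in links[i])

p = dict(r)  # start the iteration from the jump distribution
for _ in range(100):
    p = {j: EPS * r[j] + (1 - EPS) * sum(p[i] * t(i, j) for i in pages)
         for j in pages}
```

Since every page here has successors and the transition rows sum to one, p remains a probability distribution throughout the iteration.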

Evaluation Richardson and Domingos calculated the query-dependent PageRank ~p_q, using ~R_q(j) = fraction of words equal to q in page j, and prob(q) = 1/|Q|, and compared it to standard PageRank. Two different crawls served as data basis, educrawl and WebBase. The first comprised 1.76 million pages over 32,000 different .edu domains, whereas the latter contained the first 15 million pages of the Stanford WebBase repository [25]. The relevance of the top-10 results of both methods on nine queries was assessed by four volunteers each. Although query-dependent PageRank performed worse on highly dependent two-term queries, it outperformed standard PageRank on one-keyword queries significantly. Richardson and Domingos explained the bad performance of their PageRank on 2-term queries with the practical, but unsatisfying approximation they made to calculate QD-PageRank on multiple-term queries. They suggested taking phrase recognition into account to cope with the problem that two words forming a phrase, such as "financial aid", can have a different intended meaning than the words considered alone. Moreover documents that did not contain all query terms were completely discarded. This might be crucial with respect to synonyms and re-phrasing of concepts.

Empirical scalability With educrawl Richardson and Domingos considered a lexicon of 2.3 million words, thus QD-PageRank resulted in an approximate overhead factor of S/|V| = 165 in terms of storage and time requirements with respect to standard PageRank. Furthermore they noticed that the QD-PageRank computation converged faster on its smaller sub-graphs, reducing the computational requirements even to a factor of 0.75 · S/|V|.

3.2.2 Unified Framework Comprising Link Structure, Content, and Queries

Achlioptas, Fiat, Karlin, and McSherry [2] proposed a unified probabilistic model for hyperlink generation and term distribution in the web, as well as the process by which a user generates queries. Moreover they presented the SP algorithm that generalizes HITS and PageRank and is motivated by their model. The assumption that underlies their model is that there exists a set of k unknown (latent) basic concepts, by the combination of which any topic can be represented. Thus a topic is a k-dimensional vector where each dimension describes the contribution of one concept to the topic. This idea is inspired by previous work on latent semantic analysis, including latent semantic indexing [5, 35] and PLSI [26].

Furthermore each web page p is characterized by two vectors:

• A(p) is a k-dimensional vector that captures the topic on which p is an authority

• H(p) is a k-dimensional vector that captures the topic on which p is a hub


Two web pages p and q are authorities (hubs) on the same topic if A(p) (H(p)) is a scalar multiple of A(q) (H(q)), and a page p is a better authority (hub) on its topic the greater the magnitude of A(p) (H(p)), i.e., |A(p)| (|H(p)|), is.

Link generation is modelled probabilistically: The number of links from page p to page q is a random variable X_pq with expected value equal to ⟨H(p), A(q)⟩, the inner product of H(p) with A(q). Thus a link from p to q is more likely, the more closely aligned the hub topic of p is with the authority topic of q, and the greater the magnitudes of H(p) and A(q) are.

Term distributions in web pages are captured by the following concepts: each entry of the k-tuple S_A^(u) describes the expected number of occurrences of term u in a pure authoritative document for one of the k basic concepts, i.e., the ith entry represents the use of term u in a document that is an authority exactly on concept i and a hub on nothing. Analogously S_H^(u) describes the expected number of occurrences of term u in pure hub documents on the k basic concepts each. Then the expected number of occurrences of term u in document p with authority topic A(p) and hub topic H(p) is given by

    ⟨A(p), S_A^(u)⟩ + ⟨H(p), S_H^(u)⟩

The process of query generation is thought to be accomplished in three steps:

1. The searcher determines the topic of her search, i.e., the k-tuple v representing thecontributions of the k basic concepts.

2. The searcher computes q = v^T · S_H^T. Then the uth entry of q represents the number of occurrences of term u in a pure hub page on topic v.

3. The searcher chooses her search terms by sampling from a distribution with expectation q.

A web search then should return the most authoritative pages on the query topic v, where the authoritativeness of a page p is given by ⟨v, A(p)⟩.

Evaluation Achlioptas, Fiat, Karlin, and McSherry proved that the authoritativeness vector computed by their SP algorithm is very close to the correct answer according to their model. However, the benefit of these theoretical results for the quality of web search has not been shown explicitly.

3.3 Exploiting Query Logs and Click Streams

3.3.1 Query Clustering

Ji-Rong Wen and Hong-Jiang Zhang [47, 48] from Microsoft Research Asia suggested performing query clustering via query logs in order to identify Frequently Asked Questions. The result sets for these FAQs could then be prepared manually by human editors to yield better search precision for similar, often occurring queries. Besides, query clusters containing different descriptions of similar subjects could be used as a basis for query reformulation suggestions and ontology services. They distinguish between two orthogonal query clustering methods that mainly differ in the definition of similarity between two queries: the content-based and the session-based approach.

Content-based Query Clustering

Keyword-based Similarity

    similarity_keyword(q1, q2) = KN(q1, q2) / max(kn(q1), kn(q2))

where KN(q1, q2) is the number of common keywords of queries q1 and q2, and kn(.) is the number of keywords a query comprises.
With weighted query terms a slightly adapted formula can be used:

    similarity_weighted-keyword(q1, q2) = ∑_{i=1}^{k} cw_i(q1) · cw_i(q2) / ( sqrt(∑_{i=1}^{m} w_i^2(q1)) · sqrt(∑_{i=1}^{n} w_i^2(q2)) )

where cw_i(q1) and cw_i(q2) are the weights of the i-th common keyword in the queries q1 and q2 respectively, and w_i(.) is the weight of the i-th keyword in the specified query. A great drawback of keyword-based similarity on queries is that queries are usually very short, mostly consisting of only two keywords, so that semantically similar queries are difficult to retrieve via textual similarity.

Similarity based on Edit-distance

To meet the special demands of query similarity deduction, this approach takes into account stopwords as well as word order.

    similarity_edit-distance(q1, q2) = 1 − edit-distance(q1, q2) / max(wn(q1), wn(q2))

where wn(. ) denotes the number of words in a query.

Session-based Query Clustering

Naive Session-based Similarity

This similarity measure follows the maxim

If users clicked on the same web pages for different queries, then these queries are similar.

    similarity_naive-session(q1, q2) = DN(q1, q2) / max(dn(q1), dn(q2))

where dn(.) denotes the number of clicked pages for a query and DN(q1, q2) is the number of page clicks in common.


Combination of Query and Document Clustering

Unidirectional Clustering This similarity measure follows the extended maxim

If users clicked on similar web pages for different queries, then these queriesare similar.

Incorporating the similarity function s(., .) to calculate the similarity between documents directly into the query similarity calculation yields:

    similarity_session+doc(q1, q2) = 1/2 · ( ∑_{i=1}^{m} max_{j=1}^{n} s(d_i, d_j) / m + ∑_{j=1}^{n} max_{i=1}^{m} s(d_i, d_j) / n )

where d_i (1 ≤ i ≤ m) and d_j (1 ≤ j ≤ n) are the clicked documents for queries q1 and q2 respectively.

Alternatively, query clustering can be preceded by document clustering so that, instead of arguing on the clicked documents in common, similarity is measured via common clicked clusters.

Bidirectional Clustering Yet another approach favors a bidirectional clustering strategy which has been developed by Beeferman and Berger [7]. Here document similarity is not based on textual similarity, but on the hypothesis that

If two documents are clicked for the same query, they tend to be similar.

The agglomerative iterative clustering algorithm of Beeferman and Berger works on a bipartite graph constructed from query logs with query nodes on one side and document nodes on the other. Only if document d has been clicked for query q, an edge is introduced connecting d and q. The similarity σ(x, y) between two nodes x and y is defined as follows. Let N(x) be the set of nodes neighboring node x in the bipartite graph. Then

    σ(x, y) = |N(x) ∩ N(y)| / |N(x) ∪ N(y)|,  if |N(x) ∪ N(y)| > 0
    σ(x, y) = 0,                              otherwise

Starting from the bipartite query-document graph, in turn the two most similar query nodes and the two most similar document nodes, respectively, are merged until no further merges are possible.
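The node similarity σ can be sketched directly; the neighbor map below stands for a hypothetical click log (code and names are ours):

```python
# Sketch of the Beeferman-Berger node similarity on the query-document graph.
def sigma(x, y, neighbors):
    # |N(x) ∩ N(y)| / |N(x) ∪ N(y)|, and 0 if the union is empty
    nx, ny = neighbors.get(x, set()), neighbors.get(y, set())
    union = nx | ny
    if not union:
        return 0.0
    return len(nx & ny) / len(union)

# Hypothetical click log: queries and the documents clicked for them.
neighbors = {"q1": {"d1", "d2"}, "q2": {"d2", "d3"}, "q3": set()}
s = sigma("q1", "q2", neighbors)  # one common neighbor out of three in total
z = sigma("q3", "q3", neighbors)  # empty union -> similarity 0
```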

Evaluation Wen and Zhang carried out a case study on the Encarta encyclopedia with 41,942 documents and 20,000 query sessions extracted from one-month user logs. The bottom line of their experiments was that a weighted combination of content- and session-based similarity measures outperformed the single-evidence functions (solely content- or solely session-based). They obtained this assessment by comparing the precision and normalized recall of randomly selected clusters, with precision being defined as the ratio of the number of similar queries to the total number of queries in a cluster. Taking into account document similarity resulted in a gain of recall without loss of precision in comparison with naive session-based similarity.
However, the benefit of these techniques for web search has not been shown explicitly, and no method has been suggested to overcome the huge effort of manually editing the result sets of FAQs.


3.3.2 Query Expansion

A recent approach of Hang Cui, Ji-Rong Wen, Jian-Yun Nie and Wei-Ying Ma [16] exploits data from user logs for query expansion. This is achieved by reasoning on the conditional probability of a document term w_j^(d) occurring in a clicked document given that the query contains the term w_i^(q), i.e., P(w_j^(d) | w_i^(q)).

Let C be the set of clicked documents for queries containing w_i^(q). Then

    P(w_j^(d) | w_i^(q)) = P(w_j^(d), w_i^(q)) / P(w_i^(q))
                         = ∑_{∀ D_k ∈ C} P(w_j^(d), w_i^(q), D_k) / P(w_i^(q))
                         = ∑_{∀ D_k ∈ C} P(w_j^(d) | w_i^(q), D_k) · P(w_i^(q), D_k) / P(w_i^(q))

With the assumption P(w_j^(d) | w_i^(q), D_k) = P(w_j^(d) | D_k), it holds

    P(w_j^(d) | w_i^(q)) = ∑_{∀ D_k ∈ C} P(w_j^(d) | D_k) · P(D_k | w_i^(q)) · P(w_i^(q)) / P(w_i^(q))
                         = ∑_{∀ D_k ∈ C} P(w_j^(d) | D_k) · P(D_k | w_i^(q))

P(w_j^(d) | D_k) and P(D_k | w_i^(q)) can be estimated from user logs as follows:

    P(w_j^(d) | D_k) = W_jk^(d) / ∑_{∀ t ∈ D_k} W_tk^(d)

    P(D_k | w_i^(q)) = f_ik^(q)(w_i^(q), D_k) / f^(q)(w_i^(q))

where

    f_ik^(q)(w_i^(q), D_k) is the number of query sessions in which the query term w_i^(q) and the document D_k appear together,

    f^(q)(w_i^(q)) is the number of query sessions that contain the term w_i^(q), and

    W_jk^(d) is the normalized weight of term w_j^(d) in document D_k.


To decide for a given query Q which term of the document space should be added for query expansion, for all document terms w_j^(d) the combined relationships of the term to all query terms,

    CombinedWeight(w_j^(d)) = ln( ∏_{w_t^(q) ∈ Q} ( P(w_j^(d) | w_t^(q)) + 1 ) ),

are compared and the n best document terms are selected.

Evaluation Experiments were carried out on 41,942 documents from the Encarta encyclopedia and 4,839,704 user query sessions. For both short and long queries, the precision of the proposed query-log based query expansion, of query expansion based on pseudo-relevance feedback, as well as of the original queries without query expansion were compared against each other. In all cases log-based query expansion outperformed the other approaches, whereby the improvements for short queries were larger than for long queries.

3.3.3 Exploiting Click Streams

Wang, Chen, Tao, Ma and Wenyin [46] proposed a generalization of HITS to incorporate the notion of users. The underlying assumptions are

• The importance of a web page can be inferred from the frequency and expertise of the visiting users.

• The importance of a user can be inferred from the quality and quantity of web pages he has visited.

They perform the following iterative calculation:

    a(p) = β · ∑_{q→p} h(q) + (1 − β) · ∑_{r→p} u(r)

    h(p) = β · ∑_{p→q} a(q) + (1 − β) · ∑_{r→p} u(r)

    u(r) = (1 − β) · ( ∑_{r→p} a(p) + ∑_{r→q} h(q) )

Thus the importance weights of users, denoted by u(.), and the authority/hub weights of pages, denoted by a(.) and h(.) respectively, mutually reinforce each other.

Evaluation Experiments were performed on a 4-day log from a proxy server at Microsoft and showed precision improvements of 11.8% over the HITS algorithm. However, Wang et al. did not prove the convergence of their iterative computation in general, but tested the convergence experimentally.

In contrast, Xue, Zeng, Chen, Ma, Zhang, and Lu [53] modified PageRank by replacing the hyperlink-based web graph with an implicit link graph where edges connect subsequently visited pages. Each edge (d_i, d_j) in the implicit link graph is weighted with the conditional probability P(d_j | d_i) of page d_j being visited given the current page d_i. For retrieving implicit links from click-through data, a two-item sequential pattern mining strategy is used:

Consider the explicit user browsing path (d_i1, d_i2, . . . , d_ik). From this path all possible ordered pairs of subsequently visited pages are generated ((i1, i2), (i1, i3), . . ., (i1, ik), (i2, i3), (i2, i4), . . ., (i2, ik), . . .), and stored together with their total occurrence frequency over all explicit paths of length k (k is called the gliding window size). The set of implicit links is then formed by the ordered pairs whose frequency exceeds some minimum support threshold.
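The pair generation and support filtering can be sketched as follows (a simplified illustration with hypothetical browsing paths; the original system additionally preprocesses the logs and bounds paths by the gliding window size k):

```python
from collections import Counter
from itertools import combinations

# Sketch of two-item sequential pattern mining over browsing paths.
def implicit_links(paths, min_support):
    counts = Counter()
    for path in paths:
        # combinations keeps input order, so each pair is
        # (earlier page, later page) within the path.
        counts.update(combinations(path, 2))
    return {pair for pair, freq in counts.items() if freq >= min_support}

paths = [["a", "b", "c"], ["a", "c", "d"], ["b", "c", "d"]]  # toy sessions
links = implicit_links(paths, min_support=2)
```

With support threshold 2, only the pairs (a,c), (b,c), and (c,d) survive as implicit links.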

In order to compute authoritative pages, the PageRank algorithm is applied to the adjacency matrix obtained from the implicit link structure. The overall ranking of pages with respect to a given query is then obtained either by a linear combination of the content-based similarity score and the PageRank score, or by a linear combination of the pages' positions in two lists, with one list being sorted by similarity scores and the other by PageRank scores, thus yielding an order-based combination of similarity and PageRank scores.

Evaluation Experiments were conducted on 4-month click-through logs of the website at http://www.cs.berkeley.edu/logs/. After some preprocessing steps and two-item sequential pattern mining they obtained 336,812 implicit links, out of which 22,122 links were also explicit links. Seven volunteer graduate students were asked to evaluate the recommendation quality of implicit links with respect to explicit ones, as well as the search results for 10 selected queries obtained from five different ranking schemes (full text search, PageRank, DirectHit [19], modified-HITS [6] and implicit link-based PageRank). Their experimental results revealed clear gains in search performance; however, Xue, Zeng, Chen, Ma, Zhang, and Lu themselves emphasized that the applicability of their method is restricted to small web search. The main benefit of introducing the notion of implicit links is their ability to capture recommendation information in a small web where hyperlinks no longer exhibit a kind of recommendation, but mere navigational functionality.


Chapter 4

Incorporating Queries into PageRank

For an enhancement of search result quality we propose exploiting the implicit relevance feedback contained in query logs by incorporating the notion of queries into the standard PageRank model. In contrast to the identification of FAQs and manual result preparation presented by Ji-Rong Wen and Hong-Jiang Zhang [47, 48], our approach needs no additional intellectual input. Also we avoid any further processing steps at query-time. The QRank scores can be computed offline and used for ranking just as ordinary PageRank scores. In this manner query-log based relevance feedback is naturally integrated into the automated process of authority analysis.
In the following section we introduce our QRank model by first giving an intuitive description of the enhanced web graph we model and then turning to its formal definition. Section 4.2 comprises some desirable properties of our model and their proofs.

4.1 QRank Model

As depicted in Figure 4.1, queries are added in a natural manner to the web graph by introducing query nodes. Thus the set of nodes V is formed by the union of the set of document nodes D and the set of query nodes Q. Furthermore we differentiate between several edge types that represent various relationships between queries and documents.

Explicit Links

Explicit links that are represented by directed edges reflect different kinds of recommen-dation information. We introduce three explicit link types representing

Query Refinement A directed edge from query q1 to query q2 represents a situation, observed in the query history, that a user after posing query q1 did not get satisfying search results, so that she reformulated her query and posed query q2.

Query Result Clicks A directed edge from query q to document d, obtained from the query and click-stream log, indicates that a user posed query q and then clicked on the document d after having seen its summary in the top-10 result list.



Figure 4.1: QRank graph

Citation This edge type is also present in the ordinary PageRank graph. An edge (d1, d2) represents a hyperlink that points from d1 to d2 and thus indicates that the author of d1 appreciates the content of d2.

Implicit links

In addition to the various forms of explicit links, we introduce implicit links that capture associative relationships between graph nodes, based on content similarity. These relationships are bidirectional, so the implicit links are undirected edges. We introduce an implicit link from document u to document v if the textual similarity between both pages, in terms of the tf*idf-based cosine measure [32, 13], exceeds a specified threshold. Analogously two queries are connected as soon as their ontology-based content similarity suggests a meaningful correlation. Thus the strongest associative links among documents, respectively queries, are captured to meet the need of an early combination of similarity and authority.

4.1.1 Notation

For the formal definition of QRank some extra notation has to be introduced. Let v, v′ ∈ V, d ∈ D and q, q′ ∈ Q. Then

• explicitIN(v) denotes the set of predecessors of node v via explicit links.

• implicitIN(v) denotes the set of predecessors of node v via implicit links.

Page 45: Query-log based Authority Analysis for Web Information Searchdomino.mpi-inf.mpg.de/imprs/imprspubl.nsf/a1f10a9f...be leveraged by a link-analysis algorithm. The problem that we study

4.1. QRANK MODEL 43

• implicitOUT(v) denotes the set of successors of node v via implicit links. Actually implicitIN(v) = implicitOUT(v), as all implicit links are undirected. This notation is only introduced for the sake of clarification.

• explicitOutdegree(v) denotes the number of explicit links leaving node v.

• explicitDocOutdegree(v) denotes the number of explicit links leaving node v and pointing to a document node.

• explicitQueryOutdegree(v) denotes the number of explicit links leaving node v and pointing to a query node.

• sim(v′, v) is an overloaded similarity function symbol

    sim : Q × Q → [0, 1]
    sim : D × D → [0, 1]

with

    ∑_{v′ ∈ implicitOUT(v)} sim(v, v′) = 1   ∀ v ∈ V,

i.e., the similarities on all outgoing implicit links of v are normalized to sum up to 1.

• click(q, d) denotes the normalized click frequency that reflects the probability that a user clicks on document d after posing query q.

    click : Q × D → [0, 1]

with

    ∑_{d ∈ explicitOUT(q) ∧ d ∈ D} click(q, d) = 1   ∀ q ∈ Q,

i.e., the click frequencies on all explicit links from q to document nodes are normalized to sum up to 1.

• refine(q, q′) denotes the normalized refinement frequency that reflects the probability of a user posing query q′ after query q.

    refine : Q × Q → [0, 1]

with

    ∑_{q′ ∈ explicitOUT(q) ∧ q′ ∈ Q} refine(q, q′) = 1   ∀ q ∈ Q,

i.e., the refinement frequencies on all explicit links from q to other query nodes are normalized to sum up to 1.


4.1.2 Formalized Description

Before the PageRank model is adjusted to the new graph structure, recall the recursivedefinition of ordinary PageRank from Section 2.2.2:

PageRank

In standard PageRank there are no query nodes, and only explicit links between documentsrespresenting hyperlinks are considered, i.e., V = D and

~p_0^(PageRank) = ~r

For n > 0:

~p_n^(PageRank)(d) = ε · ~r(d) + (1 − ε) · ∑_{d' ∈ explicitIN(d) ∧ d' ∈ D} ~p_{n−1}^(PageRank)(d') · weight(d', d)

The random jump vector ~r is usually defined as

~r(d) = 1/|D|  ∀ d ∈ D,

i.e., random jumps are performed according to a uniform probability distribution. Furthermore the edge weights are normalized, thus

weight(d', d) = { 1/outdegree(d')  , if d' ∈ explicitIN(d) ∧ d' ∈ D
               { 0                , otherwise
∀ d ∈ D

QRank

Now we are ready to introduce the equations of the QRank Markov model. We model a random expert user as follows. The session starts with either a query node or a document node (i.e., a web page). With probability ε the user makes a random jump to a uniformly chosen node, and with probability 1 − ε she follows an outgoing link. If she makes a random jump, then the target is a query node with probability β and a document node with probability 1 − β. If the user follows a link, it is an explicit link with probability α or an implicit link with probability 1 − α. In the case of an implicit link, the target node is chosen in a non-uniform way with a bias proportional to the content similarity between nodes, which is either a similarity between two documents or between two queries. When the user follows an explicit link, the behavior depends on whether the user currently resides on a query node or a document node. In the case of a document node, she simply follows a uniformly chosen outgoing hyperlink as in the standard PageRank model. If she currently resides on a query node, she can either visit another query node, thus refining or reformulating her previous query, or visit one of the documents that were clicked on (and thus implicitly considered relevant) after the same query by some other user in the overall history. From the history we have estimated relative frequencies of these refinement and click events, and these frequencies are proportional to the bias for non-uniformly choosing the target node in the random walk. This informal description is mathematically captured in the following definition.
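The random walk just described can be sketched as a single sampling step. This is an illustrative sketch, not the thesis code — all names (RandomExpertStep, pickUniform, the parameters) are assumptions, and the non-uniform bias by sim, click, and refine weights is simplified here to uniform choices over the respective link sets.

```java
import java.util.List;
import java.util.Random;

// Sketch of one transition of the random expert user: with probability eps a
// random jump (query target with probability beta), otherwise an explicit link
// with probability alpha or an implicit (similarity) link with probability 1-alpha.
public class RandomExpertStep {
    static final Random RNG = new Random(42);

    static <T> T pickUniform(List<T> xs) { return xs.get(RNG.nextInt(xs.size())); }

    // The caller passes the outgoing link sets of the current node;
    // all lists are assumed to be non-empty in this sketch.
    public static String step(double eps, double alpha, double beta,
                              List<String> queries, List<String> docs,
                              List<String> explicitOut, List<String> implicitOut) {
        if (RNG.nextDouble() < eps) {                        // random jump
            return RNG.nextDouble() < beta ? pickUniform(queries) : pickUniform(docs);
        }
        if (RNG.nextDouble() < alpha) {                      // explicit link
            return pickUniform(explicitOut);                 // (click/refine-biased in the full model)
        }
        return pickUniform(implicitOut);                     // implicit link (sim-biased in the full model)
    }
}
```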


Definition 10 (QRank). Let v ∈ V, d ∈ D and q ∈ Q. Then the QRank vector ~p^(QRank), in the following called ~p for the sake of simplicity, is recursively defined as

~p_0 = ~r

For n > 0:

~p_n(v) = ε · ~r(v) + (1 − ε) · [ α · ∑_{v' ∈ explicitIN(v)} ~p_{n−1}(v') · weight(v', v)
                                + (1 − α) · ∑_{v' ∈ implicitIN(v)} ~p_{n−1}(v') · sim(v', v) ]

In full detail, with documents and queries notationally distinguished, this means:

~p_n(d) = ε · ~r(d) + (1 − ε) · [ α · ∑_{d' ∈ explicitIN(d) ∧ d' ∈ D} ~p_{n−1}(d') · weight(d', d)
                                + α · ∑_{q' ∈ explicitIN(d) ∧ q' ∈ Q} ~p_{n−1}(q') · weight(q', d)
                                + (1 − α) · ∑_{d' ∈ implicitIN(d) ∧ d' ∈ D} ~p_{n−1}(d') · sim(d', d) ]

~p_n(q) = ε · ~r(q) + (1 − ε) · [ α · ∑_{q' ∈ explicitIN(q) ∧ q' ∈ Q} ~p_{n−1}(q') · weight(q', q)
                                + (1 − α) · ∑_{q' ∈ implicitIN(q) ∧ q' ∈ Q} ~p_{n−1}(q') · sim(q', q) ]

with

weight(v', v) = { 1/explicitOutdegree(v')                                            , if v', v ∈ D ∧ v' ∈ explicitIN(v)
               { (explicitQueryOutdegree(v')/explicitOutdegree(v')) · refine(v', v) , if v', v ∈ Q ∧ v' ∈ explicitIN(v)
               { (explicitDocOutdegree(v')/explicitOutdegree(v')) · click(v', v)    , if v' ∈ Q, v ∈ D ∧ v' ∈ explicitIN(v)
               { 0                                                                  , otherwise


and

~r(v) = { β/|Q|        , if v ∈ Q
        { (1 − β)/|D|  , if v ∈ D

In contrast to ordinary PageRank, our modified version contains two additional tuning parameters. The parameter β regulates the behavior of the random expert user when performing random jumps, whereas α determines how much stress should be laid on implicit links with respect to explicit ones.

To get a feeling for the parameter choice, consider the particular extremes: the effects of setting α and β to zero and one.

For β = 0, we ignore query nodes in the context of random jumps, i.e.,

~r(v) = { 0      , if v ∈ Q
        { 1/|D|  , if v ∈ D

whereas for β = 1 only query nodes are targets of random jumps, i.e.,

~r(v) = { 1/|Q|  , if v ∈ Q
        { 0      , if v ∈ D

This last case models a random expert user who either poses a query or follows outgoing links. This seems to be a quite natural model, as queries often serve as the starting point of a real web session.

The extreme values for α result in the total exclusion of either implicit links or explicit ones. Thus α and β determine the degree to which QRank deviates from standard PageRank.

4.2 Properties of QRank

The aim of this section is to formally prove some desirable properties of QRank, especially its convergence, and to theoretically examine the relationship between PageRank and QRank. Beforehand we need some technical lemmas, which are given in the following.

Lemma 1. ∀ d ∈ D : ∑_{d' ∈ D} weight(d, d') = 1

Proof. Choose d ∈ D arbitrarily but fixed. Then

∑_{d' ∈ D} weight(d, d')
  = ∑_{d' ∈ explicitOUT(d)} weight(d, d') + ∑_{d' ∉ explicitOUT(d)} weight(d, d')
  = ∑_{d' ∈ explicitOUT(d)} weight(d, d') + ∑_{d' ∉ explicitOUT(d)} 0
  = ∑_{d' ∈ explicitOUT(d)} 1/explicitOutdegree(d)
  = |explicitOUT(d)| · 1/explicitOutdegree(d)
  = 1

Lemma 2. ∀ q ∈ Q : ∑_{d ∈ D} weight(q, d) = explicitDocOutdegree(q)/explicitOutdegree(q)

Proof. Choose q ∈ Q arbitrarily but fixed. Then

∑_{d ∈ D} weight(q, d)
  = ∑_{d ∈ explicitOUT(q) ∩ D} weight(q, d) + ∑_{d ∉ explicitOUT(q) ∩ D} weight(q, d)
  = ∑_{d ∈ explicitOUT(q) ∩ D} (explicitDocOutdegree(q)/explicitOutdegree(q)) · click(q, d) + ∑_{d ∉ explicitOUT(q) ∩ D} 0
  = (explicitDocOutdegree(q)/explicitOutdegree(q)) · ∑_{d ∈ explicitOUT(q) ∩ D} click(q, d)
  = explicitDocOutdegree(q)/explicitOutdegree(q)

Lemma 3. ∀ q ∈ Q : ∑_{q' ∈ Q} weight(q, q') = explicitQueryOutdegree(q)/explicitOutdegree(q)

Proof. Choose q ∈ Q arbitrarily but fixed. Then

∑_{q' ∈ Q} weight(q, q')
  = ∑_{q' ∈ explicitOUT(q) ∩ Q} weight(q, q') + ∑_{q' ∉ explicitOUT(q) ∩ Q} weight(q, q')
  = ∑_{q' ∈ explicitOUT(q) ∩ Q} (explicitQueryOutdegree(q)/explicitOutdegree(q)) · refine(q, q') + ∑_{q' ∉ explicitOUT(q) ∩ Q} 0
  = (explicitQueryOutdegree(q)/explicitOutdegree(q)) · ∑_{q' ∈ explicitOUT(q) ∩ Q} refine(q, q')
  = explicitQueryOutdegree(q)/explicitOutdegree(q)


Lemma 4. ∑_{v ∈ V} ~r(v) = 1

Proof.

∑_{v ∈ V} ~r(v) = ∑_{d ∈ D} ~r(d) + ∑_{q ∈ Q} ~r(q)
               = ∑_{d ∈ D} (1 − β)/|D| + ∑_{q ∈ Q} β/|Q|
               = (1 − β) + β
               = 1

Now we are ready to show the convergence of QRank in the course of the following theorems.

Theorem 3. QRank defines a probabilistic transition matrix T, i.e., ∀ v ∈ V : ∑_{v' ∈ V} T(v, v') = 1. Thus the QRank model is based on a time-discrete finite-state homogeneous Markov chain.

Proof. First, consider the definition of the transition matrix T suggested by the previously defined QRank Markov model. T_ij represents the probability of visiting node j after node i, i.e.,

T_ij = { ε · ~r(j) + (1 − ε) · [α · weight(i, j) + (1 − α) · sim(i, j)]  , if i, j ∈ D
       { ε · ~r(j) + (1 − ε) · [α · weight(i, j) + (1 − α) · sim(i, j)]  , if i, j ∈ Q
       { ε · ~r(j)                                                       , if i ∈ D, j ∈ Q
       { ε · ~r(j) + (1 − ε) · α · weight(i, j)                          , if i ∈ Q, j ∈ D

It remains to show:

∀ i ∈ V : ∑_{j ∈ V} T_ij = 1

Case 1: i ∈ D

∑_{j ∈ V} T_ij
  = ∑_{j ∈ V} { ε · ~r(j) + (1 − ε) · [α · weight(i, j) + (1 − α) · sim(i, j)]  , if j ∈ D
              { ε · ~r(j)                                                       , if j ∈ Q
  = ∑_{j ∈ V} ε · ~r(j) + ∑_{j ∈ D} (1 − ε) · [α · weight(i, j) + (1 − α) · sim(i, j)]
  = ε + ∑_{j ∈ D} (1 − ε) · [α · weight(i, j) + (1 − α) · sim(i, j)]                      (Lemma 4)
  = ε + (1 − ε) · α · ∑_{j ∈ D} weight(i, j) + (1 − ε) · (1 − α) · ∑_{j ∈ D} sim(i, j)
  = ε + (1 − ε) · α + (1 − ε) · (1 − α)                       (Lemma 1 and the normalization of sim)
  = 1

Case 2: i ∈ Q

∑_{j ∈ V} T_ij
  = ∑_{j ∈ V} { ε · ~r(j) + (1 − ε) · [α · weight(i, j) + (1 − α) · sim(i, j)]  , if j ∈ Q
              { ε · ~r(j) + (1 − ε) · α · weight(i, j)                          , if j ∈ D
  = ε · ∑_{j ∈ V} ~r(j) + (1 − ε) · [ α · ∑_{j ∈ D} weight(i, j) + α · ∑_{j ∈ Q} weight(i, j) + (1 − α) · ∑_{j ∈ Q} sim(i, j) ]
  = ε + (1 − ε) · [ α · explicitDocOutdegree(i)/explicitOutdegree(i)
                  + α · explicitQueryOutdegree(i)/explicitOutdegree(i) + (1 − α) ]        (Lemmas 2, 3, 4)
  = ε + (1 − ε) · [ α · (explicitDocOutdegree(i) + explicitQueryOutdegree(i))/explicitOutdegree(i) + (1 − α) ]
  = ε + (1 − ε) · [α + (1 − α)]
  = ε + (1 − ε)
  = 1

Theorem 4. For ε ≠ 0, β ≠ 0 and β ≠ 1, QRank converges.

Proof. By Theorem 3 the QRank model defines a probabilistic transition matrix, and thus a finite-state time-discrete homogeneous Markov chain. For ε ≠ 0, β ≠ 0 and β ≠ 1 this Markov chain is aperiodic and irreducible, i.e., every node is reachable from every other node and has period 1. This is ensured by the random jumps. By Theorem 1 these properties imply that the defined Markov chain is positive recurrent and ergodic, i.e., there exist stationary visiting probabilities for all nodes; thus QRank converges.

Thus Theorem 4 proves that the QRank model has a unique and computable solution, and we can compute QRank scores by the standard method of power iteration on the QRank transition matrix (see, e.g., [13] for this standard method).
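Power iteration for a row-stochastic transition matrix can be sketched in a few lines; the class and method names below are our own illustrative assumptions, and the matrix in the usage example is a toy instance, not the actual QRank matrix.

```java
// Sketch of power iteration: repeatedly multiply the score vector with the
// row-stochastic transition matrix T until (approximate) convergence,
// i.e., p_{t+1}(j) = sum_i p_t(i) * T(i, j).
public class PowerIteration {
    public static double[] iterate(double[][] T, double[] p0, int iters) {
        int n = p0.length;
        double[] p = p0.clone();
        for (int it = 0; it < iters; it++) {
            double[] next = new double[n];
            for (int i = 0; i < n; i++)
                for (int j = 0; j < n; j++)
                    next[j] += p[i] * T[i][j];  // distribute mass along transitions
            p = next;
        }
        return p;
    }
}
```

Since T is row-stochastic (Theorem 3), the vector stays a probability distribution in every iteration; the thesis implementation runs 100 iterations.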

By the following theorem we show that our approach encapsulates the original PageRank model, i.e., that ordinary PageRank is a special case contained in our more general QRank model.

Theorem 5. For α = 1 and β = 0, QRank converges to PageRank.

Proof. First, we simply substitute α = 1 into the definition of QRank, yielding:

~p(d) = ε · ~r(d) + (1 − ε) · [ ∑_{d' ∈ explicitIN(d) ∧ d' ∈ D} ~p(d') · weight(d', d)
                              + ∑_{q' ∈ explicitIN(d) ∧ q' ∈ Q} ~p(q') · weight(q', d) ]

~p(q) = ε · ~r(q) + (1 − ε) · ∑_{q' ∈ explicitIN(q) ∧ q' ∈ Q} ~p(q') · weight(q', q)        (4.1)

Thus, for α = 1 implicit links are ignored; the parameter α is a tuning parameter that determines the weight of explicit with respect to implicit links.

For β = 0 the random jump vector ~r ignores query nodes:

~r(i) = { 0      , if i ∈ Q
        { 1/|D|  , if i ∈ D        (4.2)

Note: Strictly speaking, the Markov chain defined by the QRank model might not be irreducible for β = 0 or β = 1. In these cases either query nodes or document nodes are not targets of random jumps, and we cannot argue that every node is reachable from every other node. For all other choices of β every node is reachable by a random jump, and the Markov chain is certainly irreducible. To be fully rigorous we would have to consider β converging to 0, but for the sake of simplicity we set β to 0 instead of β → 0.

Incorporating (4.2) into the equations (4.1) for ~p(d) and ~p(q) yields:


~p(d) = ε/|D| + (1 − ε) · [ ∑_{d' ∈ explicitIN(d) ∧ d' ∈ D} ~p(d') · weight(d', d)
                          + ∑_{q' ∈ explicitIN(d) ∧ q' ∈ Q} ~p(q') · weight(q', d) ]

~p(q) = (1 − ε) · ∑_{q' ∈ explicitIN(q) ∧ q' ∈ Q} ~p(q') · weight(q', q)        (4.3)

Let ~p_t(v) denote the QRank of node v after t iterations. We want to show that

∀ q ∈ Q, ∀ t ∈ N : β = 0 ⇒ ~p_t(q) = 0

Proof by induction over t.

t = 0: The initial QRank of all nodes is set to the random jump vector. Thus

~p_0(q) = ~r(q) = β/|Q| = 0  ∀ q ∈ Q

t−1 → t: From (4.3) we have

~p_t(q) = (1 − ε) · ∑_{q' ∈ explicitIN(q) ∧ q' ∈ Q} ~p_{t−1}(q') · weight(q', q)

By the induction hypothesis it holds that ∀ q ∈ Q : ~p_{t−1}(q) = 0.

⇒ ~p_t(q) = 0

From this follows:

∀ t > 0 : ~p_t(d) = ε/|D| + (1 − ε) · ∑_{d' ∈ explicitIN(d) ∧ d' ∈ D} ~p_{t−1}(d') · weight(d', d) = ~p_t^(PageRank)(d)


Chapter 5

Implementation

We integrated our query-log enhanced authority model, the computation of QRank scores, into the BINGO! toolkit [38], which already comprises the capabilities for crawling and indexing web data as well as for searching the indexed documents. BINGO! runs upon an Oracle9i database, so the integration was easily done by extending the underlying database schema. Our implementation mainly consists of three components:

1. a parser that mines the browser histories obtained from the UrlSearch tool for query sessions and refinements,

2. an implicit link constructor that computes weighted correlations between queries or documents, and

3. the actual QRank computation.

All components store intermediate results in the database for modularity and reuse of computations (see Tables 5.1 and 5.2 for the respective database schemas). After the pre-processing steps (history mining, associative link generation) the actual QRank computation proceeds in three steps: first the enhanced web graph is loaded into memory, then QRank is computed by power iteration using 100 iterations, and finally the resulting QRank score vector is written to the database (see Figure 5.1 for a schematic illustration).

Throughout the whole implementation the running time could be improved by using the data structures of the trove package [44], which provides faster implementations of hash sets and hash maps for primitive types (int, long, and so on), so that the overhead of using wrapper classes became unnecessary. Furthermore a refinement of all SQL statements

ClickStreams
  Query ID    integer
  Doc ID      integer
  Frequency   float

Refinements
  Query ID          integer
  RefinedQuery ID   integer
  Frequency         float

Table 5.1: Database schema capturing query click streams and refinements


[Figure: system architecture. The HistoryParser yields query click streams, query refinements, and document/query similarities; the ImplicitLinkRetriever and the BingoCrawler (documents, hyperlinks) feed the GraphCreator, whose output enters the QRank computation.]

Figure 5.1: Overview of system architecture

BingoDocuments
  ID        integer
  URL       varchar2(3000)
  Title     varchar2(255)
  PageRank  number
  QRank     number

DocumentFeatures
  ID      integer
  Term    varchar2(200)
  TF      number
  RTF     number
  TF_IDF  number

Links
  ID       integer
  Link_ID  integer

Figure 5.2: Excerpt of essential BINGO! tables


ImplicitQueryLinks
  Query ID    integer
  Query ID    integer
  Similarity  float

ImplicitDocLinks
  Doc ID      integer
  Doc ID      integer
  Similarity  float

Table 5.2: Database schema capturing query and document similarity links

used, as well as the use of appropriate indexes on selected database tables, resulted in reduced running times.

Some crucial implementation issues arose in the computation of similarity between queries as well as documents, and in the extraction of query sessions from browser histories. These are described in more detail in the following sections.

5.1 Extracting Query Sessions from Browser Histories

When trying to extract query click streams and query refinements (see Section 1.3.1) from browser histories, the first difficulty to handle is how to determine the end of a query click stream, respectively how to determine whether a query refines its preceding query. There are no hard rules that are applicable, but there are some assumptions that usually coincide with reality. With the use of time windows as a first heuristic, quite good results can be obtained. We distinguish two time windows, T_long and T_short, such that for a query q at time t_0, in the absence of any interleaving query, all subsequently visited documents that fall into the time window [t_0, t_0 + T_long] belong to the query click stream of q. Analogously, q and a follow-up query q' form a query refinement if q' is posed at time t_1 ∈ [t_0, t_0 + T_short].
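The T_long heuristic can be sketched as a single pass over a flattened history; the Entry record, the class name, and the flat list representation are our own illustrative assumptions, not the thesis parser.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: documents visited within tLong ms after a query (and before the
// next query) are assigned to that query's click stream.
public class SessionExtractor {
    public record Entry(boolean isQuery, String id, long time) {}

    public static List<List<String>> clickStreams(List<Entry> history, long tLong) {
        List<List<String>> streams = new ArrayList<>();
        List<String> current = null;
        long queryTime = Long.MIN_VALUE;
        for (Entry e : history) {
            if (e.isQuery()) {                   // a query opens a new click stream
                current = new ArrayList<>();
                current.add(e.id());
                streams.add(current);
                queryTime = e.time();
            } else if (current != null && e.time() - queryTime <= tLong) {
                current.add(e.id());             // document inside the time window
            }
        }
        return streams;
    }
}
```

The refinement heuristic with T_short works analogously on pairs of consecutive query entries.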

However, during our preliminary experiments we encountered several difficulties that raised the need for more sophisticated methods for the mining of query sessions. Due to a storage-saving policy of web browsers, multiple visits to one website are stored only once, with the time the website was last visited. Combined with navigation via the browser's back button, this policy confused the chronology of events inferred from the browser history and resulted in clicked documents occurring before the corresponding query. In reaction to the changed order of events, we augmented our history parser with a bidirectional parsing method that replaces the time window [t_0, t_0 + T_long] with [t_0 − T_long, t_0 + T_long]. However, this measure only accounts for the disturbed order of clicked documents with respect to their corresponding query. The problem of query refinements preceding original queries (see the scenario in Tables 5.7 and 5.8), as well as the deletion of queries occurring multiple times, cannot be handled during parsing. Fortunately, these special events usually appear seldom and can be circumvented by some small changes in the search interface. For our final experiment we therefore simplified the query session extraction by adapting our search engine to the mining of browser histories. First, we encoded a timestamp into the URL of our search engine (id=<time in milliseconds>) to prevent the extinction of queries posed multiple times by users. Moreover, we designed the search interface in a way that


guided the user more explicitly through the search process. We assigned our volunteers the responsibility to decide whether the queries they pose are refinements of previously asked queries (in this case they press the submit button titled "Refine", and the URL contains the string "submit=Refine") or represent the beginning of a new query session (in this case "submit=Search").

Furthermore we experimented with different levels of restrictiveness:

Relaxed mode In this mode only the time window heuristic is applied. To decide on an appropriate choice of the tuning parameters T_long and T_short, we tried out several settings and empirically chose the one that resulted in a good precision-recall balance. Of course, the parameter setting highly depends on the habits of the user and should be adapted for each user's history separately.

Restrictive mode In this mode the time window heuristic serves as a first-stage measure to identify candidate documents, respectively queries, that are examined for further properties.

Query Click Stream Given query q, a candidate document d_cand qualifies for the query click stream of q if it belongs to the result set of q. A depth parameter d regulates the extent of the considered result set R. At depth d = 0, R only contains the direct answer set of query q (R_0 refers to this set). At depth d = 1, R_0 is augmented by the documents reachable via one link, the successor pages of documents in R_0 (R_1 refers to this set, thus R_1 = successors(R_0) ∪ R_0). Analogously, the result sets for depth d > 1 are defined recursively by R_d = R_{d−1} ∪ successors(R_{d−1}). Successors are considered in order to account for the situation that the desired information is not contained in the direct answer set R_0 but is linked from R_0-documents, e.g., when R_0 contains a good hub page on the query topic.

Query Refinement Given query q, a candidate follow-up query q' represents a query refinement if q and q' share a common query term.
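The recursive result-set expansion R_d = R_{d−1} ∪ successors(R_{d−1}) amounts to a bounded breadth-first traversal; the class and method names below are illustrative assumptions.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Sketch: expand the direct answer set r0 by following outgoing links
// up to the given depth, accumulating R_d = R_{d-1} ∪ successors(R_{d-1}).
public class ResultExpansion {
    public static Set<String> expand(Set<String> r0,
                                     Map<String, List<String>> successors, int depth) {
        Set<String> r = new HashSet<>(r0);
        Set<String> frontier = new HashSet<>(r0);
        for (int d = 0; d < depth; d++) {
            Set<String> next = new HashSet<>();
            for (String v : frontier)
                next.addAll(successors.getOrDefault(v, List.of()));
            next.removeAll(r);        // only newly reached pages stay in the frontier
            r.addAll(next);
            frontier = next;
        }
        return r;
    }
}
```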

Consider the following case study to get a feeling for the phenomena that may occur. Each case is described by a pair of tables, one displaying the subsequent steps taken by the user, and one showing the resulting information stored in the browser history file.

Simple case scenario

In the most trivial case the user visits the search engine's start page, either by typing its URL or by clicking on a corresponding bookmark entry, poses a query, and reads the desired information. To satisfy a second information need she repeats these steps. An example of this simple user behavior is given in Table 5.3, which is captured quite accurately in the browser history as shown in Table 5.4. Only the multiple visits to the search engine's start page are lost, but this is not crucial for query session mining.

More involved scenario

A more involved scenario occurs as soon as the browser's back button is used to return to the search engine's start page. For example, consider the user behavior displayed in Table


Step  Action
1     http://.../TopKSearch/index.jsp
2     http://...?query=president+central+bank&submit=Search&id=1089981178462
3     http://.../wikidump/e/eu/european_central_bank.html
4     http://.../TopKSearch/index.jsp
5     http://...?query=Whoopi+Goldberg&submit=Search&id=1089968308263
6     http://.../wikidump/w/wh/whoopi_goldberg.html
7     http://.../TopKSearch/index.jsp

Table 5.3: User actions in the simple case

Step  Action
2     http://...?query=president+central+bank&submit=Search&id=1089981178462
3     http://.../wikidump/e/eu/european_central_bank.html
5     http://...?query=Whoopi+Goldberg&submit=Search&id=1089968308263
6     http://.../wikidump/w/wh/whoopi_goldberg.html
7     http://.../TopKSearch/index.jsp

Table 5.4: Resulting history in the simple case

Step  Action
1     http://.../TopKSearch/index.jsp
2     http://...?query=mediterranian+sea+state&submit=Search&id=1089968308263
3     http://...?query=mediterranian+sea+neutral&submit=Refine&id=1089982158362
4     http://...?query=mediterranian+state+neutral&submit=Refine&id=1089982189096
5     http://...?query=1987+state+neutral&submit=Refine&id=1089982210409
6     http://.../wikidump/n/ne/neutral_country.html
7     http://...?query=1987+state+neutral&submit=Refine&id=1089982210409
8     http://...?query=1987+state+neutral+island&submit=Refine&id=1089982237690
9     http://...?query=mediterranian+island&submit=Refine&id=108998227797
10    http://.../wikidump/l/li/list_of_islands.html
11    http://.../wikidump/l/li/list_of_islands_in_the_mediterranean.html
12    http://.../wikidump/l/li/list_of_islands_of_malta.html
13    http://.../wikidump/m/ma/malta.html
14    http://.../wikidump/l/li/list_of_islands_of_malta.html
15    http://.../wikidump/l/li/list_of_islands_in_the_mediterranean.html
16    http://.../wikidump/l/li/list_of_islands.html
17    http://...?query=mediterranian+island&submit=Refine&id=108998227797
18    http://.../TopKSearch/jsp/index.jsp

Table 5.5: User actions in the involved case


Step  Action
2     http://...?query=mediterranian+sea+state&submit=Search&id=1089968308263
3     http://...?query=mediterranian+sea+neutral&submit=Refine&id=1089982158362
4     http://...?query=mediterranian+state+neutral&submit=Refine&id=1089982189096
6     http://.../wikidump/n/ne/neutral_country.html
7     http://...?query=1987+state+neutral&submit=Refine&id=1089982210409
8     http://...?query=1987+state+neutral+island&submit=Refine&id=1089982237690
13    http://.../wikidump/m/ma/malta.html
14    http://.../wikidump/l/li/list_of_islands_of_malta.html
15    http://.../wikidump/l/li/list_of_islands_in_the_mediterranean.html
16    http://.../wikidump/l/li/list_of_islands.html
17    http://...?query=mediterranian+island&submit=Refine&id=108998227797
18    http://.../TopKSearch/jsp/index.jsp

Table 5.6: Resulting history in the involved case

5.5. The steps marked in bold face indicate that the back button was used to return to this page; e.g., after posing the query "1987 state neutral" and visiting the result page about neutral countries, the user re-visits the query page via the back button. The omission of the first occurrences of pages visited multiple times disturbs the naive parsing algorithm that scans the history chronologically to filter out query click streams. In this scenario the order of the queries posed is still preserved, so that our bidirectional parsing method succeeds in these cases.

Worst case scenario

The exemplary user behavior in Tables 5.7 and 5.8 clarifies the worst case scenario with a complete rearrangement of the sequence of posed queries. As soon as the browser's back button is not only used to return from a query result to the query page, but also to jump between several queries, the assumption of query refinements occurring in chronological order is no longer valid. Now only the encoded timestamps facilitate the recovery of the correct query order.
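Recovering the true query order from the encoded timestamps can be sketched as follows. The URL layout matches the id=<time in milliseconds> convention described above; the class name and everything else are our own illustrative assumptions.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch: sort query URLs by the millisecond timestamp encoded in
// their id parameter, restoring the order in which they were posed.
public class QueryOrderRecovery {
    private static final Pattern ID = Pattern.compile("id=(\\d+)");

    public static List<String> sortByEncodedTime(List<String> queryUrls) {
        List<String> sorted = new ArrayList<>(queryUrls);
        sorted.sort(Comparator.comparingLong((String u) -> {
            Matcher m = ID.matcher(u);
            // URLs without an id parameter are sorted last
            return m.find() ? Long.parseLong(m.group(1)) : Long.MAX_VALUE;
        }));
        return sorted;
    }
}
```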

At this stage, one might ask whether it would have been easier to switch from analyzing browser histories to analyzing server log files. We considered that, too, but decided to stick to browser histories for several reasons. First, using server log files would have committed us to a certain search engine, while the methods for parsing browser histories are, for the most part, generally applicable. Second, it is erroneous to think that parsing server log files is less problematic. With server logs one has to account for user identification, interleaving user sessions, and so on. Apart from that, in most cases the search engine and the searched corpus do not reside on the same server, so different log files have to be joined, or other measures, e.g., the use of cookies, have to be adopted.


Step  Action
2     http://...?query=Neil+Diamond&submit=Search&id=1089990894212
3     http://...?query=Neil+Diamond+1988&submit=Refine&id=1089990947165
4     http://...?query=Neil+Diamond+UB40&submit=Refine&id=1089990951775
5     http://.../wikidump/c/ca/canadian_music_hall_of_fame.html
6     http://...?query=Neil+Diamond+UB40&submit=Refine&id=1089990951775
7     http://.../wikidump/u/ub/ub40.html
8     http://.../wikidump/c/co/cover_version.html
9     http://.../wikidump/u/ub/ub40.html
10    http://.../wikidump/n/ne/neil_diamond.html
11    http://.../wikidump/u/ub/ub40.html
12    http://...?query=Neil+Diamond+UB40&submit=Refine&id=1089990951775
13    http://...?query=Neil+Diamond+1988&submit=Refine&id=1089990947165
14    http://...?query=Neil+Diamond&submit=Search&id=1089990894212
15    http://.../wikidump/t/th/the_jazz_singer.html
16    http://...?query=Neil+Diamond&submit=Search&id=1089990894212

Table 5.7: User actions in the worst case

Step  Action
5     http://.../wikidump/c/ca/canadian_music_hall_of_fame.html
8     http://.../wikidump/c/co/cover_version.html
10    http://.../wikidump/n/ne/neil_diamond.html
11    http://.../wikidump/u/ub/ub40.html
12    http://...?query=Neil+Diamond+UB40&submit=Refine&id=1089990951775
13    http://...?query=Neil+Diamond+1988&submit=Refine&id=1089990947165
15    http://.../wikidump/t/th/the_jazz_singer.html
16    http://...?query=Neil+Diamond&submit=Search&id=1089990894212

Table 5.8: Resulting history in the worst case


5.2 Computation of Query Similarity

Standard textual similarity measures like tf*idf-based cosine similarity yield no reasonable results when used for the computation of similarity between queries, as queries usually contain only very few keywords. On the other hand, the alternative of defining query-query similarity via commonly clicked-on documents in the history would only add redundant information to the enhanced web graph, which already reflects this relationship by explicit links. Therefore we decided to make use of an ontology: we first map the query keywords onto concepts and then aggregate the semantic similarities between the corresponding concepts. The ontology that we are using captures hypernym/hyponym and holonym (part-of) relationships in a graph structure. The nodes and edges are derived from WordNet [22], a thesaurus handcrafted by cognitive scientists. WordNet distinguishes between words and word senses: one word is normally associated with several senses, where each sense, also referred to as a concept, is characterized by a synset, i.e., a set of synonyms (words with the same sense), and a short descriptive sentence. In addition, the ontology quantifies edges by computing Dice similarities [32] between concepts and their descriptions based on a large corpus. Details of this ontology service can be found in [40]. Depending on the desired granularity of query similarity links, we made use of one of the following methods provided by the ontology service:

• findConceptByName(term_a): This method returns a set of concepts, or, to be more precise, a set of synsets term_a occurs in.

• getConceptSimilarityByTerm(term_a, term_b): This method returns a floating-point similarity value, the weight of the shortest path between a concept of term_a and a concept of term_b.

We considered two different strategies for computing similarity links between queries. The first, more naive one decomposes a query into its keywords, retrieves the concepts for all keywords using the method findConceptByName(term_a) of the ontology service, and introduces an unweighted edge between two queries whenever the concepts of their keywords overlap. Afterwards all undirected edges of one query node are weighted according to a uniform probability distribution. See the pseudo-code depicted in Listing 5.1 for a better understanding. Clearly, the running time of this approach is quadratic in the number and size of queries, as well as in the number of concepts per query term, but with the number of queries being quite low compared to the number of documents, and the average query consisting of only two keywords, this is not a critical issue. However, the retrieval of undirected edges between query nodes and their weights is not very fine-grained in this first naive approach. To better exploit the ontology, by not only considering the provided synsets but also taking into account the relationships between concepts captured in the ontology's graph structure, we considered a second approach. This time edges are introduced if the maximal similarity between query terms obtained from the ontology service exceeds some threshold (see Listing 5.2). This similarity weight is assigned to the edge, and later on all edge weights are normalized. Empirically, we found a threshold of 0.5 or 0.6 to be meaningful.


naiveQuerySimilarity(String query_a, String query_b) {
    String[] termsA = query_a.decompose();
    String[] termsB = query_b.decompose();
    for (each term_a in termsA) {
        Concept[] conceptsA = ontology.findConceptByName(term_a);
        for (each term_b in termsB) {
            Concept[] conceptsB = ontology.findConceptByName(term_b);
            for (each concept_a in conceptsA) {
                for (each concept_b in conceptsB) {
                    // Two concepts overlap if they share a common descriptor,
                    // i.e., synset(a) ∩ synset(b) ≠ ∅
                    if (overlap(concept_a, concept_b)) {
                        introduceUndirectedEdge(query_a, query_b);
                    }
                }
            }
        }
    }
}

Listing 5.1: Naive query similarity links


weightedQuerySimilarity(String query_a, String query_b) {
    String[] termsA = query_a.decompose();
    String[] termsB = query_b.decompose();
    HashMap links = new HashMap();
    float threshold = 0.5;
    float sim = 0.0;

    for (each term_a in termsA) {
        for (each term_b in termsB) {
            sim = ontology.getConceptSimilarityByTerm(term_a, term_b);
            if (sim > threshold) {
                // current_weight is initialized with 0
                links.add(query_a, query_b, max(current_weight, sim));
            }
        }
    }
}

Listing 5.2: Weighted query similarity links
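The weighted strategy of Listing 5.2 can also be sketched in runnable form. The following is a minimal Python sketch, not the actual implementation; the ontology object and its get_concept_similarity_by_term method are hypothetical stand-ins for the ontology service, and normalize_edges illustrates the subsequent normalization of all edge weights of one query node:

```python
def weighted_query_similarity(query_a, query_b, ontology, threshold=0.5):
    """Return the maximal term-pair similarity between two queries,
    or None if it does not exceed the threshold (no edge is introduced)."""
    best = 0.0
    for term_a in query_a.split():
        for term_b in query_b.split():
            sim = ontology.get_concept_similarity_by_term(term_a, term_b)
            best = max(best, sim)
    return best if best > threshold else None

def normalize_edges(edges):
    """Normalize the edge weights of one query node to sum to 1."""
    total = sum(edges.values())
    return {target: w / total for target, w in edges.items()} if total else {}
```
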

5.3 Computation of Document Similarity

It turns out that computing pairwise document similarity, with its quadratic running time and quadratic number of database accesses (a linear number of database accesses consumes too much working memory), is far too slow due to the huge number of documents, even though the computation has to be conducted only once after a new crawl, as all implicit links and their weights are then stored in the database.

Therefore we restricted ourselves to a reasonably small set of source documents, and for each of these we computed the k most similar documents. The selection of source documents is based on the rationale that only authoritative documents have high enough QRank scores to be propagated along associative links. Introducing implicit links between "insignificant" documents would result in a negligible overall effect. Thus, we chose the set of clicked-on documents as well as the top-m documents with the highest standard PageRank scores as our set of source documents. Then an approximate top-k search is performed for each source document S according to the following procedure:

1. the n terms of S with highest tf*idf scores are retrieved,

2. an approximation of the top-k ranked documents with respect to these terms is efficiently computed using Fagin's median rank algorithm [21], and

3. for each document in the top-k result set and S, a tf*idf-based cosine similarity score is computed, using all document terms, and stored in the database.
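Step 3 above can be sketched as follows. This is a minimal Python sketch, not the actual implementation; it assumes (hypothetically) that a document is represented as a dictionary mapping terms to their tf*idf weights:

```python
import math

def cosine_similarity(doc_a, doc_b):
    """tf*idf-based cosine similarity; doc_a and doc_b map term -> tf*idf weight."""
    # Iterate over the smaller vector; only shared terms contribute to the dot product.
    if len(doc_a) > len(doc_b):
        doc_a, doc_b = doc_b, doc_a
    dot = sum(w * doc_b[t] for t, w in doc_a.items() if t in doc_b)
    norm_a = math.sqrt(sum(w * w for w in doc_a.values()))
    norm_b = math.sqrt(sum(w * w for w in doc_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
```
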


medRank(set of terms) {
    for (each term) {
        retrieve sorted index list (docid, score);
    }

    Hashtable count = new Hashtable();
    ArrayList finished_docs = new ArrayList();
    while (finished_docs.size() < k) {
        for (each index list i) {
            docid = indexlist[i].pop();
            count(docid)++;
            if (count(docid) >= floor(#terms / 2) + 1) {
                finished_docs.add(docid);
            }
        }
    }
}

Listing 5.3: Fagin’s median rank algorithm

Fagin's median rank algorithm (see Listing 5.3) determines the k documents with the highest median rank with respect to the n index lists for the most discriminative terms of the source document S. Thereby the median rank of a document d with respect to index lists l_1, ..., l_n, denoted medrank(d), is defined as the median of l_1(d), ..., l_n(d), i.e.,

    |{i | l_i(d) ≥ medrank(d)}| = ceil(n/2)

and

    |{i | l_i(d) ≤ medrank(d)}| = floor(n/2)

Thus Fagin's median rank algorithm yields a fast approximation of the top-k answer set.
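The counting scheme of Listing 5.3 can be sketched as runnable Python. This is a simplification of the actual implementation: index lists are given as in-memory lists of document ids in rank order, and a document is finished once it has appeared in a majority (floor(n/2) + 1) of the lists:

```python
def med_rank(index_lists, k):
    """Return the first k documents whose number of appearances, while the
    lists are popped round-robin, reaches floor(n/2) + 1 (Fagin et al.)."""
    n = len(index_lists)
    majority = n // 2 + 1
    counts = {}
    finished = []
    iters = [iter(lst) for lst in index_lists]
    exhausted = False
    while len(finished) < k and not exhausted:
        exhausted = True
        for it in iters:  # one round-robin pass over all lists
            doc = next(it, None)
            if doc is None:
                continue
            exhausted = False
            counts[doc] = counts.get(doc, 0) + 1
            if counts[doc] == majority:
                finished.append(doc)
    return finished[:k]
```
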


Chapter 6

Experiments

Experiments were carried out on an Intel Pentium 3 GHz computer with 1 GB working memory under Windows XP, with an Oracle 9i database and an Apache Tomcat 4.1 server running on the same machine. Each experiment consisted of two phases. In the first phase, volunteers were asked to generate query logs by searching our indexed data set. The test users could freely choose their queries, but to provide some guidance about what interesting information one could search for, we gave them a number of quiz questions from the popular Trivial Pursuit game. After this data collection phase, the second phase was devoted to the actual evaluation.

6.1 Document Collection

All experiments were based on an excerpt of the web pages of the Wikipedia Encyclopedia [49]. This open-content encyclopedia was started in January 2001 and contains, in the English version as of July 2004, 307959 articles on a wide variety of subjects. The decision against the use of a standard document collection like TREC [43] and in favor of creating our own document collection based on Wikipedia was driven by the aptitude of the data for query-log generation; our main consideration was that most users should be able to pose meaningful queries on the data set. With the Wikipedia data covering most topics of user interest and exhibiting a clear structure, the encyclopedia became the data of choice. To avoid burdening the Wikipedia server with a large series of requests in the process of crawling, we made use of the available database dump to set up a Wikipedia server within our local network. This likewise facilitated short response times for the users in the first experimental phase of query session generation.

6.2 Tools

6.2.1 Search Engine

For better compatibility and integration of our QRank measure into the BINGO! toolkit, we first used SEEV, a search engine contained in the BINGO! package, to search our indexed Wikipedia data. However, SEEV provided many features unnecessary for our


purposes that caused quite a heavy computational load. Therefore we wrote a rudimentary search interface, based on JSP, that comprised the needed functionality (results ordered by relevance, each with a small excerpt of the content, result page navigation, choice between conjunctive or disjunctive query term combination, choice between different ranking schemes). This interface worked on the database schema of BINGO! and used simple SQL statements for query processing. Nevertheless the response times for disjunctive queries were still unsatisfactory due to the need for a join between the table "BingoDocuments", which comprises all document information, and the table "DocumentFeatures", which represents the inverted index of all terms associated with the IDs of the documents they occur in. For conjunctive queries this join was uncritical, as the amount of data that had to be joined was naturally reduced, though for some queries at the cost of low recall. Consider Figure 5.2 for the underlying table schema and the following SQL statements for the two-keyword query "american quilt" as an illustration of this point.

Disjunctive Query Combination

select * from (
  select T.idf * D.QRank as ranking, D.title, D.url, D.text
  from BingoDocuments D,
       (select id, sum(tf_idf) as idf
        from DocumentFeatures
        where term = 'american' or term = 'quilt'
        group by id) T
  where D.id = T.id
  order by ranking desc
) where rownum < 50

Conjunctive Query Combination

select * from (
  select T.idf * D.QRank as ranking, D.title, D.url, D.text
  from BingoDocuments D,
       (select A.id, (sum(A.tf_idf) + sum(B.tf_idf)) as idf
        from DocumentFeatures A, DocumentFeatures B
        where A.term = 'american' and B.term = 'quilt' and A.id = B.id
        group by A.id) T
  where T.id = D.id
  order by ranking desc
) where rownum < 50

In the course of our second experiment we therefore replaced the query execution via SQL statements with the "top-k" package [42]. This package is based on the sorted version of Fagin's threshold algorithm [21] and works on special index structures that facilitate fast access to index lists. An index list for a certain term contains all the documents the term occurs in, sorted in descending order of the product of the tf*idf-based term weight and the document's QRank- or PageRank-based authority. By sequentially considering the top-ranked documents of the index lists for all query terms in a round-robin fashion until a certain threshold is exceeded, the top-k ranked documents are retrieved within a short running time. Apart from improving the response times, we enhanced the search interface with an explicit distinction between query refinements and new query sessions to simplify the parsing of browser histories.
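The round-robin scan with early termination can be sketched roughly as follows. This is only an illustration of the threshold test, under the assumption that each index list is an in-memory list of (doc_id, score) pairs sorted by descending score, where the score is the product of the tf*idf term weight and the authority score; the real package [42] works on database-backed index structures:

```python
import heapq

def top_k(index_lists, k):
    """Threshold-algorithm sketch: scan the sorted lists round-robin,
    resolve full scores by random access, stop once the k-th best
    aggregated score reaches the threshold at the scan positions."""
    lookup = [dict(lst) for lst in index_lists]  # random access: doc -> score
    best = {}                                    # doc -> aggregated score
    max_depth = max(len(lst) for lst in index_lists)
    for depth in range(max_depth):
        # Sorted access: next entry of every list at the current depth.
        for lst in index_lists:
            if depth >= len(lst):
                continue
            doc, _ = lst[depth]
            if doc not in best:
                best[doc] = sum(l.get(doc, 0.0) for l in lookup)
        # Threshold: sum of the per-list scores at the scan positions.
        threshold = sum(lst[depth][1] for lst in index_lists if depth < len(lst))
        top = heapq.nlargest(k, best.items(), key=lambda kv: kv[1])
        if len(top) == k and top[-1][1] >= threshold:
            return [doc for doc, _ in top]
    return [doc for doc, _ in heapq.nlargest(k, best.items(), key=lambda kv: kv[1])]
```
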


6.2.2 Evaluation Interface

For both experiments we wrote a JSP-based evaluation interface in which the top-ten search results of the methods to be compared were placed randomly side-by-side. The task of the test persons was twofold: first, they were asked to evaluate the relevance of each entry in the presented top-ten rankings with respect to the considered query. Second, they were to compare the top-10 lists as a whole against each other. The obtained assessments were directly stored in the database for later precision computations via SQL.

6.3 Preliminary Experiment

In the course of a preliminary experiment, eight test persons participated in searching, and four test persons evaluated the rankings for some independently selected queries. Here we compared the original ranking, which resulted from a combination of tf*idf-based cosine similarity and standard PageRank, against the ranking produced by a combination of tf*idf-based similarity and our QRank scores.

6.3.1 Data Acquisition

To obtain test data we used BINGO! to crawl 226345 articles from our locally available Wikipedia server. We obtained the browser history files of all test users and extracted a total of 265 different queries, 429 query result clicks and 106 query refinements via bidirectional parsing in restrictive mode on query refinements (see Section 5.1). With a similarity threshold of 0.5 we computed 131 query similarity links, and starting from the set of clicked documents we introduced 3322 similarity links between documents.

The test users in the second phase provided the intellectual quality assessment of the result rankings produced by the standard method versus QRank. We initially considered only one variant of QRank, with parameters set as follows: ε = 0.25, α = 0.5, β = 1.0. Studies with parameter variations were conducted in our second experiment. For each query the test users were shown an HTML page that displays the query and, in two columns, the top-10 results under the two methods we compared. The two rankings were randomly placed in the left and right columns without any explicit labels, so the assessors did not know which method produced which ranking. The users were then asked to mark all documents that they considered relevant for the given query (possibly after looking at the page itself if the summary on the HTML overview page was not informative enough), and also to identify the one of the two rankings that they generally considered of higher quality.

6.3.2 Results

We first focused on the assessors' preferences as to which of the two rankings had the higher-quality results. For 32.9% of the evaluated queries the assessors preferred standard PageRank, whereas QRank was considered better in 61.4% of the cases. 5.7% of the query rankings were perceived as fairly similar, so that no preference could be stated. This result shows that QRank significantly outperformed PageRank.


Query      mars   bolivar   virtual reality   anthrax bacteria   guatemala   Average
PageRank   0.06   0.2       0.5               0.75               0.1         0.30
QRank      0.16   0.35      0.55              0.75               0.5         0.38

Table 6.1: Standard PageRank vs. QRank

Query 1: "Expo"                              Query 2: "Schumacher"
PageRank        QRank                        PageRank             QRank
1998            World's fair                 1969                 Michael Schumacher
1992            EXPO                         Michael Schumacher   German Grand Prix
Computer        International Expositions    Joel Schumacher      Formula 1
1967            World Fair                   Toni Schumacher      Ralf Schumacher
World's Fair    World Exhibition             Ralf Schumacher      Joel Schumacher

Table 6.2: Top-5 result sets

From the relevance assessments we also computed the precision (for the top-10 results) for each query and the micro-average over all queries (which is equal to the macro-average for the top-10 results; see Section 2.3 and [32]). The results are shown in Table 6.1 for selected queries and the average. We see that QRank yields an improvement of search results with respect to precision, although the extent depends on the actual query. The precision for some queries is quite low for both PageRank and QRank due to the restricted data set.
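The remark that micro- and macro-averaging coincide for fixed-size top-10 lists can be checked with a small sketch (Python; relevant_counts is a hypothetical list giving the number of relevant results per query):

```python
def micro_average(relevant_counts, list_size=10):
    """Micro-average: pool all judged results, then take one global precision."""
    return sum(relevant_counts) / (len(relevant_counts) * list_size)

def macro_average(relevant_counts, list_size=10):
    """Macro-average: mean of the per-query precisions."""
    return sum(r / list_size for r in relevant_counts) / len(relevant_counts)
```

Both averages agree whenever every query contributes the same number of results, since each per-query precision then carries the same denominator.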

Table 6.2 provides some anecdotal evidence by giving the top-5 results (titles of web pages) for each of the two rankings for two example queries. On the Wikipedia data the title of a page is a good descriptor of its content, as each document is devoted to one special subject, comparable to a single entry in an encyclopedia. For the query "Expo", QRank produced clearly better results than standard PageRank, which boosts summary pages of years because they are highly linked. The second query "Schumacher" shows how a strong user interest in "formula 1" (car racing) can influence the result for such an ambiguous term. This effect might be a strength with respect to personalized search. However, in the general case, such a topic drift might be undesirable. We perceived similar, partially more serious, phenomena due to the limited number of search topics the test persons could consider in the first phase.

6.4 Comprehensive Experiment

To further study the influence of the tuning parameters α and β on the query result quality, and to eliminate problems encountered in the course of the preliminary experiment, we conducted a second experiment. This time we tried to reduce the set of documents with respect to the set of queries to create a well-balanced model that was more likely to reflect the ratio of queries to documents in the real web environment. As the probability of query nodes being targets of random jumps is equally distributed among all queries, a small set of queries entails a relatively large, disproportionate influence of each single query. To account for the limited number of queries we could consider, we focused the crawling


Group   Ranking 1   Ranking 2         Ranking 3
A       PageRank    QRank-Q/D-Sim     QRank-Q/D-NoSim
B       PageRank    QRank-Q-Sim       QRank-Q-NoSim
C       PageRank    QRank-Q/D-Sim     QRank-Q-Sim
D       PageRank    QRank-Q/D-NoSim   QRank-Q-NoSim

Table 6.3: Evaluation groups

process on the categories geography, history and entertainment, and asked volunteers to pose queries related to these subjects. This was motivated by the consideration that very heterogeneous topics create more serious topic drifts in a small set of queries than they would in a realistic application with a larger number of queries.

6.4.1 Data Acquisition

18 volunteers, students from various backgrounds (law, psychology, intercultural communication, ...), were asked to search on 72482 indexed documents retrieved from our locally available Wikipedia server. Using BINGO!, we crawled the Wikipedia data starting from overview pages about geography, history, film, and music down to a depth of 5. For the search we used the "Top-k Search Engine" described in Section 6.2.1, enriched by the distinction between a new query session and a query refinement, which users were asked to explicitly mark. We provided some creativity help in the form of Trivial Pursuit questions and asked the volunteers to concentrate on the categories geography, history, and entertainment. But they were still allowed to freely choose their queries or follow some personal interests to simulate real web search. Parsing the obtained history files bidirectionally in restrictive mode on query click streams (see Section 5.1) with Tlong = 2 minutes and depth = 1, a total of 544 queries, 657 query result clicks and 290 query refinements could be retrieved. With a similarity threshold of 0.6 we computed 294 query similarity links, and starting from the 3625 most authoritative pages (the 0.05-quantile of all documents, such that ∀ d ∈ 0.05-quantile: PageRank(d) ≥ max{PageRank(d′) | d′ ∉ 0.05-quantile}), as well as the set of clicked documents, we introduced 15504 similarity links between documents. We chose the 0.05-quantile as the PageRank scores of these documents were orders of magnitude higher than those of the rest, while the number of source documents remained much smaller than the whole data set.
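The selection of source documents can be sketched as follows (Python; the pagerank mapping and the clicked set are hypothetical in-memory stand-ins for the corresponding database tables):

```python
def source_documents(pagerank, clicked, quantile=0.05):
    """Select the top `quantile` fraction of documents by PageRank score,
    united with the set of clicked documents.
    pagerank: dict mapping doc_id -> PageRank score; clicked: set of doc_ids."""
    m = max(1, int(len(pagerank) * quantile))
    ranked = sorted(pagerank, key=pagerank.get, reverse=True)
    return set(ranked[:m]) | set(clicked)
```
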

6.4.2 Evaluation Methodology

In our evaluation we distinguished between trained and untrained queries. Whereas the former had been used for our QRank computation, having been posed by volunteers in the first phase of the experiment, the latter were extracted from two browser history files that had not been incorporated into the QRank computation. The aim of studying untrained queries was to assess the side-effects of learned query result clicks on new queries. As we expected the rankings of completely unrelated queries to mostly differ only slightly from the ordinary PageRank-induced ranking, the search process of one user who generated


Ranking           α     β
QRank-Q/D-Sim     0.5   0.5
QRank-Q-Sim       0.5   1.0
QRank-Q/D-NoSim   1.0   0.5
QRank-Q-NoSim     1.0   1.0

Table 6.4: QRank variations

LocalEvaluation                 GlobalEvaluation
User_ID     integer             User_ID     integer
EvalGroup   varchar2(10)        EvalGroup   varchar2(10)
Query_ID    integer             Query_ID    integer
Ranking     varchar2(50)        Ranking     varchar2(50)
Pos         number              Pos         number
Relevance   integer

Table 6.5: Database schemes used for evaluation

untrained queries was guided by Trivial Pursuit questions that the test persons in the first phase had used, too. That way we obtained some implicitly trained queries, i.e., queries that are thematically related to trained ones. In contrast, the other user freely posed queries of personal interest.

For the evaluation of both trained and untrained queries, four different evaluation groups were formed from the volunteer students. Each group examined three ranking schemes: PageRank as the baseline algorithm, as well as two different QRank variations (see Table 6.3 for the evaluation group settings). In total four different QRank variations (see Table 6.4) were subject to evaluation. Each evaluation group was assessed by 5 distinct volunteers. They determined for each top-10 result of the three rankings whether it was irrelevant (-1), relevant (1) or somewhat in-between (0), and ranked the top-10 lists as a whole against each other (1 = best ranking, 2 = in-between, 3 = worst ranking). To also account for fairly similar rankings, the same rank could be assigned to multiple ranking schemes. The precision of a certain top-10 query result set was obtained by adding the five assessments for each item in the list and interpreting a value greater than or equal to zero as relevant. In total 30 trained and 10 untrained queries were assessed. Thereby evaluation groups A and B assessed the same half of the queries, and correspondingly for groups C and D. All assessments were directly stored in the database for later evaluation via SQL (see Table 6.5 for the schema; the table "GlobalEvaluation" comprises the comparisons between whole rankings, whereas "LocalEvaluation" contains the relevance values of the single entries of the top-10 query results).
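The precision computation can be sketched as follows (Python; the assessments structure is a hypothetical in-memory stand-in for the LocalEvaluation table):

```python
def precision_at_10(assessments):
    """assessments: one list per top-10 entry, each holding the five assessor
    judgements in {-1, 0, 1}; an entry counts as relevant if its sum is >= 0.
    Returns the fraction of relevant entries in the result list."""
    relevant = sum(1 for judgements in assessments if sum(judgements) >= 0)
    return relevant / len(assessments)
```
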


[Bar chart — PageRank: 0.339, QRank-Q/D-NoSim: 0.423, QRank-Q-NoSim: 0.416, QRank-Q/D-Sim: 0.494, QRank-Q-Sim: 0.503]

Figure 6.1: Average precisions of top-10 result sets on trained queries

6.4.3 Results

Trained Queries

To get a first impression of how the different ranking schemes performed on trained queries relative to each other, we computed for each of them the average precision over the 30 different queries. Figure 6.1 displays the results we obtained, clearly justifying two major conclusions:

1. QRank outperformed PageRank regardless of the parameter setting in use.

2. QRank incorporating implicit similarity links outperformed QRank solely based on explicit links. This result advocates the benefits of an early combination of authority with similarity.

However, the obtained average precisions allow no clear preference for the choice of parameter β; the quality of search results seems to be insensitive to how much weight is put on query nodes. While there are noticeable jumps in precision between PageRank, the NoSim-QRank variations and the Sim-QRank variations, only negligible precision differences occur between QRank-Q-NoSim and QRank-Q/D-NoSim, as well as between QRank-Q-Sim and QRank-Q/D-Sim. As the right value of β highly depends on the number of query nodes, its choice could be more decisive in a different setting.

Considering the average precisions at a different recall level, namely averaging the precisions of the top-5 query results, leads to the same conclusions (Figure 6.2).


[Bar chart — PageRank: 0.452, QRank-Q/D-NoSim: 0.542, QRank-Q-NoSim: 0.600, QRank-Q/D-Sim: 0.671, QRank-Q-Sim: 0.665]

Figure 6.2: Average precisions of top-5 result sets on trained queries

Ranking A       Ranking B         A beats B   B beats A   A = B
PageRank        QRank-Q/D-Sim     11.25%      81.25%      7.5%
PageRank        QRank-Q/D-NoSim   13.75%      75%         11.25%
QRank-Q/D-Sim   QRank-Q/D-NoSim   55%         28.75%      16.25%

Table 6.6: Evaluation group A on trained queries

Ranking A     Ranking B       A beats B   B beats A   A = B
PageRank      QRank-Q-Sim     27.5%       68.75%      3.75%
PageRank      QRank-Q-NoSim   28.75%      66.25%      5%
QRank-Q-Sim   QRank-Q-NoSim   48.75%      46.25%      5%

Table 6.7: Evaluation group B on trained queries


Ranking A       Ranking B       A beats B   B beats A   A = B
PageRank        QRank-Q/D-Sim   22.67%      74.67%      2.67%
PageRank        QRank-Q-Sim     24.0%       73.33%      2.67%
QRank-Q/D-Sim   QRank-Q-Sim     40%         52.0%       8.0%

Table 6.8: Evaluation group C on trained queries

Ranking A         Ranking B         A beats B   B beats A   A = B
PageRank          QRank-Q/D-NoSim   26.7%       68%         5.3%
PageRank          QRank-Q-NoSim     26.7%       66.7%       6.6%
QRank-Q/D-NoSim   QRank-Q-NoSim     22.7%       60%         17.3%

Table 6.9: Evaluation group D on trained queries

To further underpin this first interpretation, we evaluated the judgements of our volunteers when asked to explicitly rank the three top-10 query result lists as a whole against each other. Table 6.6 shows the outcome for evaluation group A. For each pair of ranking schemes out of the three rankings, the preferences of the assessors on the 15 evaluated queries are listed as the percentage of assessors that prefer a specific ranking. Analogous information is displayed in Tables 6.7, 6.8 and 6.9 for the other evaluation groups.

The central purpose of evaluation groups A and B was to assess the effect of taking similarity links into account. Both groups suggest an amelioration of query results due to the incorporation of similarity information, even though this effect is weakened in group B. A very probable explanation of this phenomenon is the decrease in influence of the similarity links between documents when query nodes are the only targets of random jumps.

In contrast, the aim of evaluation groups C and D was to investigate the influence of the random jump parameter β, i.e., the weight of query nodes with respect to document nodes, on the query result ranking. Again both evaluation groups agree in their assessment and prefer a high weight on query nodes, albeit to a lower extent when similarity links are taken into account. This endorses our earlier observation that the early consideration of similarity significantly boosts the ranking quality, so that the right choice of β becomes less important. Furthermore, all QRank variations are consistently preferred over PageRank.

Figures 6.3 and 6.4 display the precisions each ranking scheme yields for each of the 30 evaluated queries. When looking at these top-10 precision values we face a very diverse picture. Counting the cases in which the ranking schemes that put a high weight on queries, QRank-Q-Sim and QRank-Q-NoSim, outperform those with a lower weight on query nodes, QRank-Q/D-Sim and QRank-Q/D-NoSim, yields no clearly favorable query node weight, just as the average precisions suggest. In about 10 cases the QRank variations that take similarity into account, i.e., QRank-Q-Sim and QRank-Q/D-Sim, are the dominating ones. In contrast, there is no query for which both similarity-based rankings are outperformed by the QRank variations that do not account for similarity. This coincides


[Bar chart of the per-query top-10 precisions under QRank-Q/D-NoSim, QRank-Q-NoSim, QRank-Q/D-Sim, QRank-Q-Sim and PageRank for the queries "Rome", "Friends", "Watergate", "Philosophy", "Brazil cities", "Cuba", "Cold War", "Egypt pyramids", "Napoleon exile", "Mandela prison", "Cannes festival", "EU institutions", "The Matrix film", "Panama Canal", "UNICEF programs" and "Pope John Paul II"]

Figure 6.3: Precisions of top-10 result sets on trained queries


[Bar chart of the per-query top-10 precisions under QRank-Q/D-NoSim, QRank-Q-NoSim, QRank-Q/D-Sim, QRank-Q-Sim and PageRank for the queries "Luciano Pavarotti", "Quentin Tarantino", "Bosnia-Herzegovina", "First cloned organ", "Afghanistan Taliban", "Tom Hanks Cast Away", "Robin Williams film", "Harrison Ford movie", "White House offices", "Bangladesh inhabitants", "Japan high speed train", "Gorki Russian physician", "Political system of China", "Constitutional Supreme Court" and "Prime Minister of Philippines"]

Figure 6.4: Precisions of top-10 result sets on trained queries


[Graph excerpt — query nodes such as "Brazil cities", "Brasil cities", "City in Brazil", "City in Brazil with largest population", "largest population in the world", "South America", "Lula" and "Girls just wanna have fun", and document nodes such as "Brazil", "Sao Paulo", "Salvador, Brazil", "Rio de Janeiro", "Rio de Janeiro state", "Santiago", "Lula da Silva", "History of Brazil", "Population density" and "Cyndi Lauper", connected by citations, similarity links, query refinements and query result clicks]

Figure 6.5: Excerpt of the QRank graph in the neighborhood of query "Brazil cities"

with our previous observations and strengthens our second bottom line. Furthermore, PageRank is surpassed by at least one QRank variation for the vast majority of queries, encouraging our first bottom line as well.

To get a better understanding of possible explanations for the performance of the regarded QRank variations, we surveyed the rankings themselves. See Tables A.1 to A.30 in the appendix for the top-10 query results under the five rankings examined. In the following we discuss the result sets for a selected query, revealing insights on how QRank's performance is influenced by the enhanced web graph.

Case Study: Brazil Cities

The top-10 query results for the query "Brazil cities" under the five ranking schemes considered are listed in Table A.5, with PageRank yielding a precision of 0.1, QRank-


Ranking A       Ranking B         A beats B   B beats A   A = B
PageRank        QRank-Q/D-Sim     80%         20%         0%
PageRank        QRank-Q/D-NoSim   55%         15%         30%
QRank-Q/D-Sim   QRank-Q/D-NoSim   35%         55%         10%

Table 6.10: Evaluation group A on untrained queries

Ranking A     Ranking B       A beats B   B beats A   A = B
PageRank      QRank-Q-Sim     80%         20%         0%
PageRank      QRank-Q-NoSim   73.3%       13.35%      13.35%
QRank-Q-Sim   QRank-Q-NoSim   26.6%       66.7%       6.7%

Table 6.11: Evaluation group B on untrained queries

Q/D-NoSim of 0.4, QRank-Q-NoSim of 0.6, QRank-Q/D-Sim of 0.7 and QRank-Q-Sim of 0.9. Figure 6.5 shows a small excerpt of the QRank graph in the neighborhood of the query node "Brazil cities". In accordance with our expectations, PageRank boosts highly linked pages, such as overview pages of countries and years. The ranking naturally closest to PageRank, QRank-Q/D-NoSim, still contains some PageRank authorities, like "2003", "Population density", "Argentina" and "Mexico", albeit at lower ranks. Here the clicked documents "Brazil" and "Sao Paulo" (for the query "City in Brazil with largest population"), "Salvador, Brazil" (for the query "Brasil cities"), "Santiago" (for the query "Brazil cities"), "History of Brazil" (for the query "South America"), as well as "Lula da Silva" (for the query "Lula") gain in weight. With an even stronger emphasis on query nodes, other clicked documents, as well as documents in the neighborhood of clicked ones, appear at high ranks, moving the PageRank authorities "2003", "Argentina" and "Mexico" down in the ranking. Additionally incorporating implicit similarity links further changes the weights of documents; however, the probable causes are more difficult to retrace. Perhaps the better rank of "History of Brazil" is a consequence of the implicit links by which additional authority is propagated from "Brazil", a page that is highly linked from both documents and query nodes, and "Lula da Silva", a page about the then-current president of Brazil. But not only query-related pages profit from the re-distribution of authority by similarity links. As a side-effect, a document about the singer "Cyndi Lauper" that was clicked for another query ("Girls just wanna have fun") qualifies for the query, as it contains the keywords "Brazil" and "city". This effect is attributed to the relatively small set of queries, which puts disproportionate emphasis on singular queries and the topics they cover.

Untrained Queries

When examining the precisions of the top-10 result sets for the assessed untrained queries (see Figure 6.8), as well as the corresponding rankings themselves (refer to Tables B.1 to B.10 in the appendix), we encountered mostly minor differences between the five rankings due to the limited number of queries and subjects that could be covered by our test persons. This observation is also reflected in the average precisions (see Figure 6.6, as well as Figure 6.7), which are closer to each other than in the trained case. Moreover, the large fraction of users that gave two different rankings the same rank also results from the closeness of the different ranking schemes on untrained queries.

Ranking A | Ranking B | A beats B | B beats A | A = B
PageRank | QRank-Q/D-Sim | 46.6% | 53.4% | 0%
PageRank | QRank-Q-Sim | 53.3% | 40% | 6.7%
QRank-Q/D-Sim | QRank-Q-Sim | 60% | 20% | 20%

Table 6.12: Evaluation group C on untrained queries

Ranking A | Ranking B | A beats B | B beats A | A = B
PageRank | QRank-Q/D-NoSim | 26.7% | 66.7% | 6.6%
PageRank | QRank-Q-NoSim | 26.7% | 53.3% | 20%
QRank-Q/D-NoSim | QRank-Q-NoSim | 40% | 33.3% | 26.7%

Table 6.13: Evaluation group D on untrained queries

In the event of permutations in the ranks of top-10 result pages, these were only partially beneficial. Depending on the query under consideration, documents that had been clicked for trained queries and happened to contain the query keywords improved or worsened the result quality. The effect of rather unrelated clicked documents occurring at high ranks is attributed to the strong bias towards the small set of clicked documents and should be less crucial with much larger query logs and click streams. The fact that most users preferred PageRank over any QRank variation for evaluation groups A and B (see Tables 6.10 and 6.11), whereas the opposite was true for evaluation groups C and D (see Tables 6.12 and 6.13), is another indication of the diversity of query results, as A and B rely on the same set of queries, and correspondingly for C and D. Therefore it seems difficult to infer a favorable parameter setting from the assessments on untrained queries. The only preference that can be consistently deduced from both the average precisions at a cutoff of 5 and 10 documents, as well as from evaluation groups C and D, is a low weight on query nodes.
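The per-query precision comparisons above are standard precision-at-k values over the assessed top-k result lists. As a small illustration (the function and the toy data are ours, not part of the thesis system), the measure can be computed as follows:

```python
def precision_at_k(ranked_docs, relevant_docs, k):
    """Fraction of the top-k ranked documents that were judged relevant."""
    top_k = ranked_docs[:k]
    return sum(1 for doc in top_k if doc in relevant_docs) / k

# Toy assessment: 10 ranked documents, 4 judged relevant by a test person.
ranking = ["d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9", "d10"]
relevant = {"d1", "d3", "d4", "d9"}

print(precision_at_k(ranking, relevant, 5))   # 3 of the top 5 are relevant -> 0.6
print(precision_at_k(ranking, relevant, 10))  # 4 of the top 10 are relevant -> 0.4
```

Averaging these values per ranking scheme at cutoffs 5 and 10 yields the kind of averages reported in Figures 6.6 and 6.7.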

As an example, we discuss the two result sets for the queries ”Colonial Power” and ”Formula One racing”, which are displayed in Tables B.3 and B.5, respectively. While the untrained query ”Colonial Power” profits from a topic-related trained query on cultural theory, resulting in a high rank of a document on postcolonial theory, it suffers at the same time from a bias towards Brazil, which had been clicked for several queries. In contrast, the different rankings for the query ”Formula One racing” are only slightly disturbed by unrelated endorsements of summary pages on years. A remarkable phenomenon with the similarity-based QRank variations is the appearance of the pages ”1986 in sports” and ”Lola, racing car company”, which seems to be a result of the changed lines of authority propagation via similarity links.

[Figure 6.6: bar chart over the five ranking schemes (Pagerank, QRank-Q/D-NoSim, QRank-Q-NoSim, QRank-Q/D-Sim, QRank-Q-Sim); the extracted data labels read 0.41, 0.43, 0.42, 0.47, 0.4; y-axis: Average Precision.]

Figure 6.6: Average precisions of top-10 result sets on untrained queries

[Figure 6.7: bar chart over the five ranking schemes (Pagerank, QRank-Q/D-NoSim, QRank-Q-NoSim, QRank-Q/D-Sim, QRank-Q-Sim); the extracted data labels read 0.46, 0.52, 0.46, 0.5, 0.48; y-axis: Average Precision.]

Figure 6.7: Average precisions of top-5 result sets on untrained queries

[Figure 6.8: grouped bar chart (y-axis: Precision, ticks 0 to 0.9) over the ten untrained queries ”Mel Gibson what women want”, ”Largest country South America”, ”Colonial power”, ”Airbus company”, ”Formula One racing”, ”Epidemic Britain 2001”, ”Escobar”, ”Balkan”, ”East Germany 1989” and ”US presidential candidate 1992”, with one bar per ranking scheme: QRank-Q/D-NoSim, QRank-Q-NoSim, QRank-Q/D-Sim, QRank-Q-Sim, PageRank; the individual bar heights are not recoverable from the extraction.]

Figure 6.8: Precisions of top-10 result sets on untrained queries

6.5 Discussion

In summary, the outcome of the described experiments clearly shows the benefits of our proposed model. QRank yields promising rankings, especially in comparison with PageRank, even though the choice of the right parameters exposes shortcomings of our experimental environment:

The limited number of available query logs results in a disproportionate authority of a few clicked documents, entailing some strange query results, e.g., ”Cyndi Lauper” appearing in the top-10 answer to ”Brazil cities”. This effect should be weakened as soon as more data becomes available for query session mining.

The QRank model that relates a query with its clicked-on documents does not make use of this relation at query-time, for the sake of short response times. Instead, these relationships are used to infer the general authority of a web page. Thus a document that was clicked for a certain query can occur at a high rank for a rather unrelated query when it accidentally contains the desired keywords. However, considering the extracted relations at query-time, too, might be a starting point for further improvements.

The Wikipedia data was ideal for the first experiment phase, but exhibited some drawbacks when the QRank model was applied. First, most hyperlinks in Wikipedia are structural, navigational links that do not really represent endorsements. Second, Wikipedia barely comprises any real ”junk” pages, unlike the real Web. The main consequence was that the real power of QRank, separating good-content pages from worthless ones, had few points of application where it could take beneficial effect.

Nevertheless our experiments allow us to draw the following conclusions towards an optimal parameter setting:

• QRank clearly benefits from incorporating similarity, i.e., a parameter α < 1.

• The choice of β is difficult and highly depends on the ratio of query nodes to documents in the QRank graph.
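To make the roles of α and β concrete, here is a toy power-iteration sketch of a QRank-style random walk over a mixed document/query graph. The graph, the edge-handling details and all names are illustrative assumptions of ours; the precise transition model is the one defined for QRank earlier in this thesis.

```python
def qrank(nodes, out_links, sim_links, query_nodes,
          alpha=0.8, beta=0.3, eps=0.15, iters=100):
    """Toy QRank-style power iteration (an illustrative sketch, not the thesis code).

    alpha : weight of explicit links versus implicit similarity links
    beta  : share of the random-jump mass given to query nodes
    eps   : random-jump probability, as in PageRank
    """
    docs = [n for n in nodes if n not in query_nodes]
    queries = [n for n in nodes if n in query_nodes]
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iters):
        nxt = {n: 0.0 for n in nodes}
        leaked = 0.0  # mass of nodes without outgoing edges
        for u in nodes:
            follow = (1.0 - eps) * rank[u]
            exp, sim = out_links.get(u, []), sim_links.get(u, [])
            if not exp and not sim:
                leaked += follow
                continue
            # split the followed mass between explicit and similarity edges
            w_exp = alpha if exp else 0.0
            w_sim = (1.0 - alpha) if sim else 0.0
            norm = w_exp + w_sim
            for v in exp:
                nxt[v] += follow * (w_exp / norm) / len(exp)
            for v in sim:
                nxt[v] += follow * (w_sim / norm) / len(sim)
        # random jump: a beta fraction of the jump mass goes to query nodes
        jump = eps + leaked
        for q in queries:
            nxt[q] += jump * beta / len(queries)
        for d in docs:
            nxt[d] += jump * (1.0 - beta) / len(docs)
        rank = nxt
    return rank

# Tiny graph: one query node with click edges, three documents.
nodes = ["q:brazil cities", "Brazil", "Sao Paulo", "2003"]
query_nodes = {"q:brazil cities"}
out_links = {"q:brazil cities": ["Brazil", "Sao Paulo"],  # click edges
             "Brazil": ["2003", "Sao Paulo"],
             "Sao Paulo": ["Brazil"]}
sim_links = {}  # no implicit similarity links in this toy example

scores = qrank(nodes, out_links, sim_links, query_nodes)
# the clicked documents end up above the purely link-endorsed page "2003"
```

In this sketch a low weight on query nodes corresponds to a small beta, matching the preference deduced from the untrained-query assessments above.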

Chapter 7

Conclusion and Future Work

We presented a new method that naturally integrates query-log and click-stream information into the automated process of authority analysis. Our approach incorporates implicit relevance feedback obtained from browser histories or server-log files and demands no user participation at query-time. The conducted experiments revealed the benefits of our model over PageRank. However, the question of the optimal parameter setting could not be exhaustively answered due to the difficulty of modelling the real web environment. The right choice of α and β highly depends on the actual QRank graph and has to be adjusted empirically.

Possible directions of future work towards an amelioration of QRank-based search results include:

• The incorporation of more sophisticated probabilistic models for computing edge weights; e.g., the uniform distribution of random jumps on query nodes could be replaced with a biased distribution that visits popular queries, i.e., frequently asked queries, with higher probability.

• A more personalized search, i.e., topic-specific QRank scores could be used to grant pages of user interest higher authority. Topics could be obtained by extracting strongly-connected components from the part of the QRank graph that is formed by query nodes. Considering topic-specific variations of the QRank graph with the set of queries being reduced to a certain strongly-connected component would then yield topic-specific QRank scores. These could be linearly combined according to user profiles.

• The additional exploitation of the QRank graph at query-time to profit from the transitive relations deducible from query result clicks and query refinements.

• The augmentation of QRank by negative user feedback, when users explicitly mark certain query results as irrelevant.
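The first item above can be read as normalizing query-log frequencies into a non-uniform jump vector over the query nodes; a minimal sketch (the counts below are made up for illustration):

```python
def biased_query_jump(query_counts):
    """Jump probabilities proportional to a query's frequency in the log."""
    total = sum(query_counts.values())
    return {q: count / total for q, count in query_counts.items()}

# Hypothetical query-log frequencies
counts = {"brazil cities": 8, "rome": 3, "formula one racing": 1}
jump = biased_query_jump(counts)
# frequently asked queries now attract proportionally more random-jump mass
```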

List of Figures

2.1 Web structure observed by Broder et al.

3.1 Edges with authority and hub weights
3.2 Document citation model
3.3 Construction of the bipartite SALSA graph

4.1 QRank graph

5.1 Overview of system architecture
5.2 Excerpt of essential BINGO! tables

6.1 Average precisions of top-10 result sets on trained queries
6.2 Average precisions of top-5 result sets on trained queries
6.3 Precisions of top-10 result sets on trained queries
6.4 Precisions of top-10 result sets on trained queries
6.5 Excerpt of the QRank graph in the neighborhood of query ”Brazil cities”
6.6 Average precisions of top-10 result sets on untrained queries
6.7 Average precisions of top-5 result sets on untrained queries
6.8 Precisions of top-10 result sets on untrained queries

Bibliography

[1] Serge Abiteboul, Mihai Preda, Gregory Cobena: Adaptive on-line page importance computation. WWW Conference 2003: 280-290

[2] Dimitris Achlioptas, Amos Fiat, Anna R. Karlin, Frank McSherry: Web Search via Hub Synthesis. IEEE FOCS 2001: 500-509

[3] R. Albert, H. Jeong, A.-L. Barabasi: Diameter of the World Wide Web. Nature 401: 130-131, Sep 1999

[4] Arnold Allen: Probability, Statistics, and Queuing Theory with Computer Science Applications. Academic Press 1990

[5] Yossi Azar, Amos Fiat, Anna Karlin, Frank McSherry, Jared Saia: Spectral Analysis of Data. ACM STOC 2001

[6] Krishna Bharat, Monika R. Henzinger: Improved Algorithms for Topic Distillation in a Hyperlinked Environment. ACM SIGIR 1998: 104-111

[7] Doug Beeferman, Adam Berger: Agglomerative clustering of a search engine query log. ACM SIGKDD 2000: 407-416

[8] Bettina Berendt, Myra Spiliopoulou: Analysis of Navigation Behaviour in Web Sites Integrating Multiple Information Systems. VLDB Journal 9(1) 2000: 56-75

[9] Allan Borodin, Gareth O. Roberts, Jeffrey S. Rosenthal, Panayiotis Tsaparas: Finding authorities and hubs from link structures on the World Wide Web. WWW Conference 2001: 415-429

[10] Andrei Broder, Ravi Kumar, Farzin Maghoul, Prabhakar Raghavan, Sridhar Rajagopalan, Raymie Stata, Andrew Tomkins, Janet Wiener: Graph structure in the web. Computer Networks 33(1-6): 309-320, 2000

[11] Sergey Brin, Lawrence Page, Rajeev Motwani, Terry Winograd: The PageRank Citation Ranking: Bringing Order to the Web. Technical Report 1998

[12] Sergey Brin, Lawrence Page: The Anatomy of a Large-Scale Hypertextual Web Search Engine. WWW Conference 1998

[13] Soumen Chakrabarti: Mining the Web: Discovering Knowledge from Hypertext Data. Morgan Kaufmann, 2002

[14] David Cohn, Huan Chang: Learning to Probabilistically Identify Authoritative Documents. ICML 2000: 167-174

[15] David Cohn, Thomas Hofmann: The Missing Link - A Probabilistic Model of Document Content and Hypertext Connectivity. NIPS 2000: 430-436

[16] Hang Cui, Ji-Rong Wen, Jian-Yun Nie, Wei-Ying Ma: Query Expansion by Mining User Logs. IEEE Transactions on Knowledge and Data Engineering 15(4) 2003

[17] Arthur Dempster, Nan Laird, Donald Rubin: Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39, 1977: 1-38

[18] Chris H. Q. Ding, Xiaofeng He, Parry Husbands, Hongyuan Zha, Horst D. Simon: PageRank, HITS and a Unified Framework for Link Analysis. SIAM International Conference on Data Mining 2003

[19] DirectHit. http://www.directhit.com

[20] Ronald Fagin, Ravi Kumar, Kevin S. McCurley, Jasmine Novak, D. Sivakumar, John A. Tomlin, David P. Williamson: Searching the workplace web. WWW Conference 2003: 366-375

[21] Ronald Fagin, Ravi Kumar, D. Sivakumar: Efficient similarity search and classification via rank aggregation. ACM SIGMOD 2003: 301-312

[22] Christiane Fellbaum (Editor): WordNet: An Electronic Lexical Database. MIT Press, 1998

[23] Taher Haveliwala: Topic-Sensitive PageRank: A Context-Sensitive Ranking Algorithm for Web Search. IEEE Transactions on Knowledge and Data Engineering 15(4) 2003: 784-796

[24] Taher Haveliwala: Efficient encodings for document ranking vectors. Stanford Technical Report 2002

[25] Jun Hirai, Sriram Raghavan, Hector Garcia-Molina, Andreas Paepcke: WebBase: A repository of web pages. WWW Conference 2000

[26] Thomas Hofmann: Probabilistic Latent Semantic Indexing. ACM SIGIR 1999: 50-57

[27] Glen Jeh, Jennifer Widom: Scaling personalized web search. WWW Conference 2003

[28] Sepandar Kamvar, Taher Haveliwala, Christopher Manning, Gene Golub: Extrapolation Methods for Accelerating PageRank Computations. WWW Conference 2003

[29] Sepandar Kamvar, Taher Haveliwala, Christopher Manning, Gene Golub: Exploiting the Block Structure of the Web for Computing PageRank. Stanford Technical Report 2003

[30] J. M. Kleinberg: Authoritative Sources in a Hyperlinked Environment. ACM-SIAM Symposium on Discrete Algorithms 1998: 668-677

[31] Ronny Lempel, Shlomo Moran: SALSA: the stochastic approach for link-structure analysis. ACM Transactions on Information Systems 19(2) 2001: 131-160

[32] Christopher D. Manning, Hinrich Schütze: Foundations of Statistical Natural Language Processing. MIT Press, 1999

[33] Andrew McCallum, Kamal Nigam, Jason Rennie, Kristie Seymore: Building Domain-Specific Search Engines with Machine Learning Techniques. AAAI 1999

[34] Open Directory Project: Web directory for over 2.5 million URLs. http://www.dmoz.org

[35] Christos Papadimitriou, Prabhakar Raghavan, Hisao Tamaki, Santosh Vempala: Latent semantic indexing: A probabilistic analysis. Proceedings of the ACM Symposium on Principles of Database Systems 1997

[36] Matt Richardson, Pedro Domingos: The Intelligent Surfer: Probabilistic Combination of Link and Content Information in PageRank. NIPS 2001: 1441-1448

[37] L. Saul, F. Pereira: Aggregate and mixed-order Markov models for statistical language processing. Proceedings of the 2nd International Conference on Empirical Methods in Natural Language Processing 1997

[38] Sergej Sizov, Michael Biwer, Jens Graupmann, Stefan Siersdorfer, Martin Theobald, Gerhard Weikum, Patrick Zimmer: The BINGO! System for Information Portal Generation and Expert Web Search. CIDR 2003

[39] Myra Spiliopoulou, Carsten Pohle: Data Mining for Measuring and Improving the Success of Web Sites. Data Mining and Knowledge Discovery 5(1/2) 2001: 85-114

[40] Martin Theobald, Ralf Schenkel, Gerhard Weikum: Exploiting Structure, Annotation, and Ontological Knowledge for Automatic Classification of XML Data. WebDB 2003: 1-6

[41] Martin Theobald, Ralf Schenkel, Gerhard Weikum: Classification and Focused Crawling for Semistructured Data. In: H. Blanken et al. (Eds.), Intelligent Search on XML Data, Springer-Verlag 2003: 145-157

[42] Martin Theobald, Gerhard Weikum, Ralf Schenkel: Top-K Query Evaluation with Probabilistic Guarantees. VLDB Conference 2004

[43] Text REtrieval Conference (TREC). http://trec.nist.gov/

[44] GNU Trove. http://trove4j.sourceforge.net/

[45] H. Ulbrich: UrlSearch 2.4.6. http://people.freenet.de/h.ulbrich/

[46] Jidong Wang, Zheng Chen, Li Tao, Wei-Ying Ma, Liu Wenyin: Ranking User's Relevance to a Topic through Link Analysis on Web Logs. WIDM 2002: 49-54

[47] Ji-Rong Wen, Hong-Jiang Zhang: Query Clustering in the Web Context. Information Retrieval and Clustering, Kluwer Academic Publishers 2002

[48] Ji-Rong Wen, Jian-Yun Nie, Hong-Jiang Zhang: Query Clustering Using User Logs. ACM Transactions on Information Systems 20(1) 2002: 59-81

[49] Wikipedia, The Free Encyclopedia. http://en.wikipedia.org/wiki/Main_Page

[50] Julia Luxenburger, Gerhard Weikum: Query-log based Authority Analysis for Web Information Search. Web Information Systems Engineering (WISE) Conference 2004

[51] Yi-Hung Wu, Arbee L. P. Chen: Prediction of Web Page Accesses by Proxy Server Log. World Wide Web 5(1) 2002: 67-88

[52] Jinxi Xu, W. Bruce Croft: Query expansion using local and global document analysis. ACM SIGIR 1996

[53] Gui-Rong Xue, Hua-Jun Zeng, Zheng Chen, Wei-Ying Ma, Hong-Jiang Zhang, Chao-Jun Lu: Implicit Link Analysis for Small Web Search. ACM SIGIR 2003: 56-63

[54] Osmar R. Zaiane, Jaideep Srivastava, Myra Spiliopoulou, Brij M. Masand: WEBKDD 2002 - Mining Web Data for Discovering Usage Patterns and Profiles. 4th International Workshop 2002, Revised Papers. Springer 2003

Appendix A

Top-10 Rankings for trained Queries

# | PageRank | QRank-Q-Sim | QRank-Q/D-Sim | QRank-Q-NoSim | QRank-Q/D-NoSim
1 | Rome | Rome | Rome | Rome | Rome
2 | 5th century | Ancient Rome-related topics | Ancient Rome-related topics | Roman | 5th century
3 | Ancient Rome-related topics | Roman | Roman | Roman Empire | Roman Empire
4 | Roman Empire | Italy | Roman Empire | 5th century | Ancient Rome-related topics
5 | Roman | Roman mythology | 64 | Ancient Rome-related topics | Roman
6 | Roman Republic | 64 | 5th century | Italy | Italy
7 | Italy | Roman Empire | Great fire of Rome | St. Valentine's Day | Roman Republic
8 | Latin | Great fire of Rome | Roman Republic | Holy See | Holy See
9 | March | Statute | Roman mythology | Statute | Roman mythology
10 | 410s | List of firsts | Capitoline Hill | Roman mythology | Vatican City

Table A.1: Top-10 result rankings for the query ”Rome”

# | PageRank | QRank-Q-Sim | QRank-Q/D-Sim | QRank-Q-NoSim | QRank-Q/D-NoSim
1 | GNU | Friends World Program | Friends World Program | Friends World Program | Friends World Program
2 | Digital rights | Friends (TV show) | Friends (TV show) | Friends (TV show) | Friends (TV show)
3 | Telecommunication | Jennifer Aniston | Jennifer Aniston | Jennifer Aniston | Jennifer Aniston
4 | Personal relationship | Environmental health | GNU | White Nights | GNU
5 | 1994 | White Nights | Telecommunication | Expo 86 | Digital rights
6 | 2002 | Expo 86 | Open Source Initiative | GNU | Telecommunication
7 | 1947 | Batman | Environmental health | Environmental health | Religious Society of Friends
8 | Religious Society of Friends | Cyndi Lauper | White Nights | Nelson Mandela | 1994
9 | False friend | Nelson Mandela | Expo 86 | Batman | White Nights
10 | Society | Jean Cocteau | Batman | Murder on the Orient Express | Expo 86

Table A.2: Top-10 result rankings for the query ”Friends”

# | PageRank | QRank-Q-Sim | QRank-Q/D-Sim | QRank-Q-NoSim | QRank-Q/D-NoSim
1 | 1974 | Richard Nixon | Richard Nixon | Richard Nixon | 1974
2 | 1973 | Watergate scandal | 1974 | Watergate scandal | Richard Nixon
3 | 1975 | 1974 | Watergate scandal | 1974 | Watergate scandal
4 | 1970s | The Conversation | 1973 | 1973 | 1973
5 | Richard Nixon | Constitutional Crisis | 1975 | Constitutional Crisis | 1975
6 | Watergate scandal | 1973 | The Conversation | The Conversation | 1970s
7 | 1988 | November 17 | 1970s | 1975 | Constitutional Crisis
8 | 1993 | 1975 | Constitutional Crisis | November 17 | The Conversation
9 | 1982 | 1970s | November 17 | 1970s | 1988
10 | May 29 | November 1 | 1988 | 1993 | November 17

Table A.3: Top-10 result rankings for the query ”Watergate”

# | PageRank | QRank-Q-Sim | QRank-Q/D-Sim | QRank-Q-NoSim | QRank-Q/D-NoSim
1 | Philosophy | Free software license | Free software license | Philosophy | Philosophy
2 | Free software license | Philosophy | Philosophy | Early modern philosophy | Free software license
3 | Free Software Foundation | Early modern philosophy | Early modern philosophy | Free software license | Early modern philosophy
4 | Richard Stallman | Ancient philosophy | Ancient philosophy | Mysticism | Mysticism
5 | Aristotle | Mysticism | Continental philosophy | Ancient philosophy | Free Software Foundation
6 | Philosopher | Continental philosophy | Postmodern philosophy | The Matrix | Ancient philosophy
7 | Enlightenment | The Matrix | Mysticism | Continental philosophy | The Matrix
8 | Copyright | Postmodern philosophy | Marxist philosophy | Cognitive bias | Continental philosophy
9 | Government | Cognitive bias | Richard Stallman | Marxist philosophy | Richard Stallman
10 | Debian | Ontology | Cognitive bias | Alternative school | Cognitive bias

Table A.4: Top-10 result rankings for the query ”Philosophy”

# | PageRank | QRank-Q-Sim | QRank-Q/D-Sim | QRank-Q-NoSim | QRank-Q/D-NoSim
1 | Brazil | Brazil | Brazil | Brazil | Brazil
2 | Population density | Brazil History | Brazil History | Salvador, Brazil | Salvador, Brazil
3 | 2003 | Salvador, Brazil | Salvador, Brazil | Sao Paulo | Sao Paulo
4 | Acre | Sao Paulo | Sao Paulo | Brazil History | Population density
5 | Argentina | Santiago | Santiago | Santiago | Brazil History
6 | Mexico | Luiz da Silva | Luiz da Silva | Luiz da Silva | Santiago
7 | Portugal | Capitals, larger cities | Population density | 2012 Summer Olympics | 2003
8 | Metropolitan areas | Cyndi Lauper | Capitals, larger cities | Rio de Janeiro (state) | Luiz da Silva
9 | Russia | Rio de Janeiro (state) | Argentina | Capitals, larger cities | Argentina
10 | Portuguese language | Bahia de Todos | Cyndi Lauper | Population density | Mexico

Table A.5: Top-10 result rankings for the query ”Brazil cities”

# | PageRank | QRank-Q-Sim | QRank-Q/D-Sim | QRank-Q-NoSim | QRank-Q/D-NoSim
1 | United States | Cuban Missile Crisis | United States | 20th century | United States
2 | World War I | 20th century | Cuban Missile Crisis | United States | 20th century
3 | 20th century | United States | World War I | World War I | World War I
4 | 1950s | Vietnam War | 20th century | Cuban Missile Crisis | 1965
5 | Spain | World War I | Vietnam War | 1965 | Cuban Missile Crisis
6 | 1965 | 1965 | 1965 | 1990 | 1950s
7 | Vietnam War | 1990 | 1950s | 1950s | Spain
8 | 1990 | 1950s | 1990 | Spain | 1990
9 | 1959 | US History (1945-1964) | Spain | Vietnam War | Vietnam War
10 | 1961 | Law | 1959 | 1985 | 1959

Table A.6: Top-10 result rankings for the query ”Cuba Cold War”

# | PageRank | QRank-Q-Sim | QRank-Q/D-Sim | QRank-Q-NoSim | QRank-Q/D-NoSim
1 | Egypt | Pyramid | Egypt | Great Giza Pyramid | Egypt
2 | Ancient Egypt | Great Giza Pyramid | Pyramid | Egypt | Great Giza Pyramid
3 | Pyramid | Egypt | Great Giza Pyramid | Pyramid | Pyramid
4 | Great Giza Pyramid | Ancient Egypt History | Egypt-related topics | Ancient Egypt History | Ancient Egypt History
5 | Cairo | Egypt-related topics | Ancient Egypt History | Egypt-related topics | Ancient Egypt
6 | 27th century BC | Old Kingdom | Ancient Egypt | Ancient Egypt | Egypt-related topics
7 | Pharaoh | Ancient Egypt | Old Kingdom | Cairo | Cairo
8 | Giza | 27th century BC | Cairo | Buildings | 27th century BC
9 | Djoser Pyramid | Buildings | Djoser Pyramid | Giza | Giza
10 | 3rd millennium BC | Cairo | 27th century BC | 27th century BC | Pharaoh

Table A.7: Top-10 result rankings for the query ”Egypt pyramids”

# | PageRank | QRank-Q-Sim | QRank-Q/D-Sim | QRank-Q-NoSim | QRank-Q/D-NoSim
1 | 1815 | Exile | Exile | Exile | Exile
2 | Napolean I | Waterloo Battle | 1815 | Waterloo Battle | 1815
3 | 1814 | Modena | Waterloo Battle | Napoleon I | Napoleon I
4 | Napoleon III | 1815 | Napoleon I | 1815 | Napoleon III
5 | Exile | Napolean I | 1814 | Modena | 1814
6 | Italy | 1814 | Modena | Napoleon III | Waterloo Battle
7 | 1812 | Arthur Wellesley | Napoleon III | Italy | Italy
8 | Spain | November 17 | Italy | 1814 | 1812
9 | Jew | Italy | 1812 | November 17 | Modena
10 | Norway | Napolean III | Arthur Wellesley | Arthur Wellesley | Spain

Table A.8: Top-10 result rankings for the query ”Napolean exile”

# | PageRank | QRank-Q-Sim | QRank-Q/D-Sim | QRank-Q-NoSim | QRank-Q/D-NoSim
1 | 1990 | Nelson Mandela | Nelson Mandela | Nelson Mandela | Nelson Mandela
2 | Nelson Mandela | 1990 | 1990 | 1990 | 1990
3 | 1994 | February 11 | February 11 | February 11 | 1994
4 | 1993 | South Africa | 1994 | 1994 | February 11
5 | South Africa | 1994 | South Africa | South Africa | 1993
6 | 1918 | 1993 | 1993 | 1993 | South Africa
7 | February 11 | 1918 | 1918 | 1918 | 1918
8 | June 10 | Protest song | Apartheid | Protest song | June 10
9 | June 12 | Apartheid | June 10 | June 10 | Cape Town
10 | Cape Town | June 10 | June 12 | Cape Town | June 12

Table A.9: Top-10 result rankings for the query ”Mandela prison”

# | PageRank | QRank-Q-Sim | QRank-Q/D-Sim | QRank-Q-NoSim | QRank-Q/D-NoSim
1 | Film festival | Film festival | Film festival | Film Festival | Film Festival
2 | Cannes | List of film festivals | List of film festivals | List of film festivals | List of film festivals
3 | List of festivals | Cannes Film Festival | Cannes Film Festival | Cannes Film Festival | Cannes Film Festival
4 | Reggae | The Conversation | List of festivals | The Conversation | Cannes
5 | Cannes Film Festival | The Lost Weekend | Cannes | Cannes | The Conversation
6 | Sundance Film Festival | List of festivals | The Lost Weekend | The Lost Weekend | The Lost Weekend
7 | List of film festivals | Apocalypse Now | The Conversation | Pulp Fiction | Reggae
8 | Palme d'Or | Quentin Tarantino | Sundance Film Festival | Sundance Film Festival | Sundance Film Festival
9 | September 20 | Cannes | Quentin Tarantino | Quentin Tarantino | Pulp Fiction
10 | The Piano | Jean Cocteau | List of movie awards | Reggae | List of festivals

Table A.10: Top-10 result rankings for the query ”Cannes festival”

# | PageRank | QRank-Q-Sim | QRank-Q/D-Sim | QRank-Q-NoSim | QRank-Q/D-NoSim
1 | European Union | European Union | European Union | European Union | European Union
2 | France | European Economic Area | European Economic Area | European Economic Area | France
3 | Spain | European constitution | France | France | European Economic Area
4 | Belgium | Verse protocol | Open source | EU history | EU history
5 | Poland | EU history | EU history | Verse protocol | Verse protocol
6 | Germany | France | EU constitution | EU constitution | EU constitution
7 | Open source | Maastricht treaty | Verse protocol | Maastricht Treaty | Germany
8 | Portugal | Open source | Maastricht Treaty | Germany | Spain
9 | European Economic Area | Amsterdam Treaty | Amsterdam Treaty | Portugal | Maastricht Treaty
10 | Japan | Germany | Germany | Belgium | Belgium

Table A.11: Top-10 result rankings for the query ”EU institutions”

# | PageRank | QRank-Q-Sim | QRank-Q/D-Sim | QRank-Q-NoSim | QRank-Q/D-NoSim
1 | Matrix | The Matrix | The Matrix | The Matrix | The Matrix
2 | The Matrix | Liquid crystal display | Liquid crystal display | Liquid crystal display | Liquid crystal display
3 | 1999 | The Matrix Revolutions | The Matrix Revolutions | 1999 | Matrix
4 | Television | Matrix video game | Matrix video game | Mysticism | 1999
5 | Years in film | The Matrix Reloaded | The Matrix Reloaded | Matrix | Television
6 | Action movie | Blade Runner | Matrix | Blade Runner | Mysticism
7 | Keanu Reeves | Agent Smith | 1999 | Television | Keanu Reeves
8 | Matrix video game | 1999 | Blade Runner | The Matrix Revolutions | Blade Runner
9 | The Matrix Revolutions | Mysticism | Agent Smith | Keanu Reeves | The Matrix Revolutions
10 | Nebuchadnezzar | Greatest movies | Greatest movies | Matrix video game | Matrix video game

Table A.12: Top-10 result rankings for the query ”The Matrix film”

# | PageRank | QRank-Q-Sim | QRank-Q/D-Sim | QRank-Q-NoSim | QRank-Q/D-NoSim
1 | Panama | Panama Canal Zone | Panama Canal Zone | Panama Canal Zone | Panama Canal Zone
2 | Canal | Panama | Panama | Panama | Panama
3 | North America | North America | Canal | Panama Canal | Canal
4 | Suez Canal | Canal | North America | Canal | North America
5 | Panama Canal | Panama Canal | Suez Canal | North America | Suez Canal
6 | 19th century | Theodore Roosevelt | Panama Canal | Suez Canal | Panama Canal
7 | 1903 | Suez Canal | 1900s | 1977 | 19th century
8 | 1900s | 1977 | Panama City | 19th century | 1903
9 | 1977 | Panama City | 1903 | Hay-Bunau Varilla Treaty | 1977
10 | 1978 | 1900s | 1977 | 1903 | 1900s

Table A.13: Top-10 result rankings for the query ”Panama Canal”

# | PageRank | QRank-Q-Sim | QRank-Q/D-Sim | QRank-Q-NoSim | QRank-Q/D-NoSim
1 | UN Children's Fund | UN Children's Fund | UN Children's Fund | UN Children's Fund | UN Children's Fund
2 | 1965 | UN Economic/Social Council | 1965 | 1965 | 1965
3 | 1946 | 1965 | UN Economic/Social Council | UN Economic/Social Council | UN Economic/Social Council
4 | United Nations | United Nations | 1946 | United Nations | 1946
5 | UN Economic/Social Council | 1946 | United Nations | 1946 | United Nations
6 | Gulf War | Gulf War | Gulf War | Gulf War | Gulf War
7 | Bahá'í Faith | Bahá'í Faith | Cambodia History | Bahá'í Faith | Bahá'í Faith
8 | Cambodia History | Cambodia History | Bahá'í Faith | Cambodia History | Cambodia History
9 | Free software license | Free software license | Free software license | Free software license | Free software license
10 | GNU | GNU License | GNU License | Friends World Program | GNU

Table A.14: Top-10 result rankings for the query ”UNICEF programs”

# | PageRank | QRank-Q-Sim | QRank-Q/D-Sim | QRank-Q-NoSim | QRank-Q/D-NoSim
1 | List of popes | List of popes | List of popes | List of popes | List of popes
2 | Pope | Pope John Paul II | Pope John Paul II | Pope John Paul II | Pope John Paul II
3 | 1970s | 20th century | Pope | 20th century | Pope
4 | Roman Catholic Church | Pope | 1970s | Pope | 20th century
5 | 20th century | February 11 | 20th century | 1970s | 1970s
6 | 1960s | 1970s | 1960s | February 11 | Roman Catholic Church
7 | 1990s | Sorbonne | Roman Catholic Church | 1992 | 1960s
8 | 1980s | List of Saints | 1990s | 2003 | 1990s
9 | 1963 | 1992 | 1963 | 1960s | 1980s
10 | 1978 | 1960s | 1980s | Roman Catholic Church | 2003

Table A.15: Top-10 result rankings for the query ”Pope John Paul II”

# | PageRank | QRank-Q-Sim | QRank-Q/D-Sim | QRank-Q-NoSim | QRank-Q/D-NoSim
1 | Luciano Pavarotti | Luciano Pavarotti | Luciano Pavarotti | Luciano Pavarotti | Luciano Pavarotti
2 | 1935 | Placido Domingo | Placido Domingo | Modena | Modena
3 | Wax museum | Modena | Modena | 1935 | 1935
4 | Tenor | 1935 | 1935 | Placido Domingo | Tenor
5 | October 12 | Jose Carreras | Wax museum | Jose Carreras | Placido Domingo
6 | Tamagno | Tenor | Tenor | Tenor | Jose Carreras
7 | La donna mobile | Grammy Awards 1989 | La donna mobile | October 12 | October 12
8 | Placido Domingo | December 2003 | Tamagno | La donna mobile | Wax museum
9 | Colon Theater | La donna mobile | Jose Carreras | December 2003 | La donna mobile
10 | Jose Carreras | October 12 | October 12 | Wax museum | Tamagno

Table A.16: Top-10 result rankings for the query "Luciano Pavarotti"

# | PageRank | QRank-Q-Sim | QRank-Q/D-Sim | QRank-Q-NoSim | QRank-Q/D-NoSim
1 | 1990s | Quentin Tarantino | Quentin Tarantino | Quentin Tarantino | Quentin Tarantino
2 | Film director | Pulp Fiction | Pulp Fiction | Pulp Fiction | Pulp Fiction
3 | Quentin Tarantino | Film director | Film director | Film director | Film director
4 | Pulp Fiction | From Dusk Till Dawn | 1990s | 1990s | 1990s
5 | 1963 | 1990s | From Dusk Till Dawn | 1963 | 1963
6 | Reservoir Dogs | Cannes Film Festival | 1963 | Reservoir Dogs | Reservoir Dogs
7 | Jackie Brown | 1963 | Reservoir Dogs | Cannes Film Festival | Jackie Brown
8 | Soundtrack | Reservoir Dogs | Jackie Brown | Jackie Brown | Soundtrack
9 | From Dusk Till Dawn | Jackie Brown | Cannes Film Festival | Soundtrack | From Dusk Till Dawn
10 | Academy Award | 2004 in film | Stealers Wheel | From Dusk Till Dawn | Cannes Film Festival

Table A.17: Top-10 result rankings for the query "Quentin Tarantino"

# | PageRank | QRank-Q-Sim | QRank-Q/D-Sim | QRank-Q-NoSim | QRank-Q/D-NoSim
1 | Bosnia Herzegovina | Bosnia Herzegovina | Bosnia Herzegovina | Bosnia Herzegovina | Bosnia Herzegovina
2 | Europe | Europe | Europe | Europe | Europe
3 | Population density | PAL | Mediterranean Sea | PAL | France
4 | France | Mediterranean Sea | Herzegovina | Serbia | Population density
5 | Italy | Germany | Political divisions Bosnia Herzegovina | Germany | Italy
6 | Mediterranean Sea | Italy | PAL | 1992 | Germany
7 | Germany | Serbia | Population density | France | Cabinet
8 | Currency | 1992 | France | Italy | Mediterranean Sea
9 | Cabinet | Cabinet | Italy | Cabinet | 1992
10 | Herzegovina | Political divisions Bosnia Herzegovina | Cabinet | Mediterranean Sea | Currency

Table A.18: Top-10 result rankings for the query "Bosnia-Herzegovina"


APPENDIX A. TOP-10 RANKINGS FOR TRAINED QUERIES

# | PageRank | QRank-Q-Sim | QRank-Q/D-Sim | QRank-Q-NoSim | QRank-Q/D-NoSim
1 | 2003 | Human cloning | Human cloning | Human cloning | Human cloning
2 | Cloning | Cloning | Cloning | 2003 | 2003
3 | 1990s | 2003 | 2003 | Cloning | Cloning
4 | Human cloning | Blade Runner | 1990s | 20th century | 20th century
5 | 20th century | 20th century | 20th century | 1990s | 1990s
6 | Molecular biology | February 2004 | Blade Runner | Blade Runner | Blade Runner
7 | May 28 | Cryonics | February 2004 | Cryonics | Cryonics
8 | Developmental biology | 1990s | May 28 | February 2004 | Molecular biology
9 | Genetic engineering | List of biology topics | Cryonics | May 28 | May 28
10 | Anarchism | May 28 | List of biology topics | Anarchism | Genetic engineering

Table A.19: Top-10 result rankings for the query "First cloned organ"

# | PageRank | QRank-Q-Sim | QRank-Q/D-Sim | QRank-Q-NoSim | QRank-Q/D-NoSim
1 | Afghanistan | Mazar-e Sharif | Afghanistan | Mazar-e Sharif | Afghanistan
2 | 2002 | Afghanistan | Mazar-e Sharif | Afghanistan | Mazar-e Sharif
3 | Taliban | Taliban | Taliban | Al-Qaida | 2002
4 | 2001 | Al-Qaida | 2002 | Taliban | Taliban
5 | 1996 | 2001 | 2001 | 2002 | 2001
6 | 1998 | Kabul | Al-Qaida | 2001 | Al-Qaida
7 | US invasion of Afghanistan | 2002 | US invasion of Afghanistan | Kabul | Kabul
8 | George W Bush | US invasion of Afghanistan | Kabul | September 11, 2001 attack | 1996
9 | Propaganda | Prison | 1996 | Prison | September 11, 2001 attack
10 | Mazar-e Sharif | September 11, 2001 attack | History of Afghanistan | 1996 | 1998

Table A.20: Top-10 result rankings for the query "Taliban Afghanistan"

# | PageRank | QRank-Q-Sim | QRank-Q/D-Sim | QRank-Q-NoSim | QRank-Q/D-NoSim
1 | Tom Hanks | Cast Away | Cast Away | Cast Away | Cast Away
2 | Academy Award Best Actor | Tom Hanks | Tom Hanks | Tom Hanks | Tom Hanks
3 | Saturday Night Live | Academy Award Best Actor | Academy Award Best Actor | Academy Award Best Actor | Academy Award Best Actor
4 | Cast Away | Steven Spielberg | Steven Spielberg | Saturday Night Live | Saturday Night Live
5 | Steven Spielberg | 2004 in film | 2004 in film | Saving Private Ryan | Saving Private Ryan
6 | Saving Private Ryan | Saving Private Ryan | Saturday Night Live | Catch me if you can | Steven Spielberg
7 | Roy Orbison | Role-playing game | Saving Private Ryan | Steven Spielberg | Roy Orbison
8 | Role-playing game | Catch me if you can | Role-playing game | Forrest Gump | Catch me if you can
9 | Years in film | Saturday Night Live | Catch me if you can | Roy Orbison | Role-playing game
10 | 2000 in film | Years in film | Years in film | Role-playing game | Forrest Gump

Table A.21: Top-10 result rankings for the query "Tom Hanks Cast Away"


# | PageRank | QRank-Q-Sim | QRank-Q/D-Sim | QRank-Q-NoSim | QRank-Q/D-NoSim
1 | Actor | Actor | What Dreams May Come | What Dreams May Come | Actor
2 | Public domain | Public domain | Actor | Actor | Public domain
3 | 1994 | What Dreams May Come | The Birdcage | Public domain | What Dreams May Come
4 | 1990 | 1994 | Awakenings | The Birdcage | 1994
5 | Robin Hood | 1990 | Best Picture Academy Award | Awakenings | 1990
6 | 1949 | 1949 | 1990 | 1990 | 1949
7 | Best Picture Academy Award | The Birdcage | Public domain | 1994 | The Birdcage
8 | 1946 | Robin Hood | 1949 | 1949 | Robin Hood
9 | 1941 | Awakenings | 1994 | Best Picture Academy Award | Awakenings
10 | Best Actor Academy Award | Best Picture Academy Award | Robin Hood | 1946 | Best Picture Academy Award

Table A.22: Top-10 result rankings for the query "Robin Williams film"

# | PageRank | QRank-Q-Sim | QRank-Q/D-Sim | QRank-Q-NoSim | QRank-Q/D-NoSim
1 | Harrison Ford | Harrison Ford | Harrison Ford | Harrison Ford | Harrison Ford
2 | 1985 | Raiders of the Lost Ark | Raiders of the Lost Ark | The Conversation | The Conversation
3 | Movie-related topics | Apocalypse Now | Apocalypse Now | Richard Nixon | Richard Nixon
4 | Richard Nixon | The Conversation | Movie-related topics | American Graffiti | American Graffiti
5 | Kazakhstan | Richard Nixon | The Conversation | Apocalypse Now | 1985
6 | Steven Spielberg | American Graffiti | Richard Nixon | Raiders of the Lost Ark | Apocalypse Now
7 | 1943 | Steven Spielberg | American Graffiti | Ronald Reagan | Raiders of the Lost Ark
8 | Chicago, Illinois | Movie-related topics | Steven Spielberg | 1985 | Ronald Reagan
9 | Raiders of the Lost Ark | Blade Runner | Ronald Reagan | Kazakhstan | Kazakhstan
10 | Ronald Reagan | Ronald Reagan | 1985 | Blade Runner | Movie-related topics

Table A.23: Top-10 result rankings for the query "Harrison Ford movie"


# | PageRank | QRank-Q-Sim | QRank-Q/D-Sim | QRank-Q-NoSim | QRank-Q/D-NoSim
1 | United States | White House | White House | White House | White House
2 | US House of Representatives | US President | US House of Representatives | United States | United States
3 | White House | US House of Representatives | United States | US President | US House of Representatives
4 | US Congress | US State Department | US President | US House of Representatives | US President
5 | US President | US Federal Government | US Congress | US Cabinet | US Congress
6 | France | US Cabinet | US Department of State | US Department of State | US Cabinet
7 | 1974 | US Congress | Speaker of the US House of Representatives | US Congress | US Department of State
8 | Washington, DC | Speaker of the US House of Representatives | US Federal Government | US Representatives' Speaker | France
9 | 1998 | United States | US Cabinet | Richard Nixon | 1974
10 | 2003 | US presidential line of succession | US presidential line of succession | US presidential line of succession | Speaker of the US House of Representatives

Table A.24: Top-10 result rankings for the query "White House offices"

# | PageRank | QRank-Q-Sim | QRank-Q/D-Sim | QRank-Q-NoSim | QRank-Q/D-NoSim
1 | Asia | Asia | Asia | Asia | Asia
2 | Population density | Population density | Population density | Population density | Population density
3 | Australia | List of countries by population | Australia | India | Australia
4 | India | India | Ethnic groups | Australia | India
5 | Commonwealth of Nations | South Africa | Countries by population | Countries by population | Commonwealth of Nations
6 | South Africa | Australia | India | South Africa | South Africa
7 | Malta | Capitals and larger cities | South Africa | Commonwealth of Nations | Countries by population
8 | Java (island) | Ethnic groups | Commonwealth of Nations | Capitals and larger cities | Countries by area
9 | Countries by area | Commonwealth of Nations | Capitals and larger cities | Countries by area | Malta
10 | Countries by population | Countries by area | Countries by area | Countries by population | Java (island)

Table A.25: Top-10 result rankings for the query "Bangladesh inhabitants"


# | PageRank | QRank-Q-Sim | QRank-Q/D-Sim | QRank-Q-NoSim | QRank-Q/D-NoSim
1 | 2002 | Tilting train | Tilting train | Tilting train | Tilting train
2 | 1998 | Winter Olympic Games | Winter Olympic Games | Fukuoka | 2002
3 | Anime | Fukuoka | Fukuoka | 2002 | Fukuoka
4 | World War I | Narita Airport | 2002 | Narita Airport | 1998
5 | 1964 | 2002 | Narita Airport | Winter Olympic Games | Winter Olympic Games
6 | Tilting train | 1998 | 1998 | 1998 | Narita Airport
7 | Horse | February 2004 | Anime | World War I | Anime
8 | Magnetic levitation | December 2003 | World War I | December 2003 | World War I
9 | Winter Olympic Games | Horse | Horse | 1964 | 1964
10 | Bicycle | Narrow gauge | 1964 | Anime | Horse

Table A.26: Top-10 result rankings for the query "Japan high speed train"

# | PageRank | QRank-Q-Sim | QRank-Q/D-Sim | QRank-Q-NoSim | QRank-Q/D-NoSim
1 | Russians | Russians | Russians | Russians | Russians
2 | Maxim Gorky | Nizhny Novgorod | Nizhny Novgorod | Nizhny Novgorod | Nizhny Novgorod
3 | 1910 | White Sea-Baltic Canal | Maxim Gorky | Maxim Gorky | Maxim Gorky
4 | April 1 | Andrei Sakharov | White Sea-Baltic Canal | White Sea-Baltic Canal | White Sea-Baltic Canal
5 | 1869 | Maxim Gorky | Russian literature | Andrei Sakharov | Andrei Sakharov
6 | December 31 | Russian literature | Andrei Sakharov | February 24 | February 24
7 | 1891 | February 24 | February 24 | April 1 | 1910
8 | 1854 | List of Jews | List of Jews | March 15 | April 1
9 | 1799 | December 31 | December 31 | 1910 | December 31
10 | 1605 | December 2 | 1854 | December 31 | 1869

Table A.27: Top-10 result rankings for the query "Gorki Russian physician"

# | PageRank | QRank-Q-Sim | QRank-Q/D-Sim | QRank-Q-NoSim | QRank-Q/D-NoSim
1 | China | China | China | China | China
2 | People's Republic of China | One country, two systems | People's Republic of China | One country, two systems | People's Republic of China
3 | 19th century | People's Republic of China | One country, two systems | People's Republic of China | One country, two systems
4 | India | Law | Law | Communist state | 19th century
5 | Law | Communist state | Communist state | Law | Law
6 | County | Party discipline | Party discipline | India | Communist state
7 | List of countries | Prison | 19th century | Prison | India
8 | Hong Kong | Special Administrative Region | Special Administrative Region | 19th century | Prison
9 | Chinese language | Falun Gong | India | Party discipline | List of countries
10 | President | India | Prison | Falun Gong | Hong Kong

Table A.28: Top-10 result rankings for the query "Political system of China"


# | PageRank | QRank-Q-Sim | QRank-Q/D-Sim | QRank-Q-NoSim | QRank-Q/D-NoSim
1 | Fair use | Constitution | Constitution | US Supreme Court | US Supreme Court
2 | US Supreme Court | US Supreme Court | US Supreme Court | Constitution | Constitution
3 | United States | US constitutional law | US constitutional law | US constitutional law | Fair use
4 | Constitution | Supreme court | Supreme court | Court of Appeals | United States
5 | Court | Court of Appeals | Court of Appeals | United States | Supreme Court
6 | Copyright | Article IV of US constitution | United States | Supreme court | Court
7 | US constitution | United States | Article IV of US constitution | Fair use | US Constitution
8 | Supreme court | Court | Court | US Constitution | Court of Appeals
9 | Canada | Law | Fair use | Court | US constitutional law
10 | 2003 | US Constitution | US Constitution | Law | Copyright

Table A.29: Top-10 result rankings for the query "Constitutional Supreme Court"

# | PageRank | QRank-Q-Sim | QRank-Q/D-Sim | QRank-Q-NoSim | QRank-Q/D-NoSim
1 | 2000 | State leaders in 2002 | State leaders in 2002 | Office-holders | 2000
2 | 1980s | Office-holders | Office-holders | State leaders in 2002 | 1980s
3 | 2004 | 2000 | 2000 | 2000 | Office-holders
4 | Minister | 1980s | 1980s | 1980s | State leaders in 2002
5 | 1983 | State leaders in 2003 | State leaders in 2003 | Minister | Minister
6 | 1998 | December 2003 | Minister | 2004 | 2004
7 | 1990 | 1990 | 2004 | 1990 | 1990
8 | 1945 | State leaders in 2004 | 1990 | Turkey | 1983
9 | 1995 | February 2004 | State leaders in 2004 | 1983 | 1998
10 | 1967 | Minister | 1945 | 1987 | 1989

Table A.30: Top-10 result rankings for the query "Prime Minister of Philippines"


Appendix B

Top-10 Rankings for untrained Queries

In the following rankings, documents that have been clicked for a training query are written in bold face.
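As an aside, rankings such as those listed below can be compared quantitatively. A minimal sketch (an illustrative helper, not the evaluation code used in this thesis) measures how much two top-k result lists agree by their overlap:

```python
def top_k_overlap(ranking_a, ranking_b, k=10):
    """Fraction of documents shared by the top-k prefixes of two rankings."""
    return len(set(ranking_a[:k]) & set(ranking_b[:k])) / k

# Hypothetical excerpts of two rankings for the same query:
pagerank = ["Germany", "England", "World War II", "World War I"]
qrank = ["Germany", "Ostmark", "East Germany", "World War I"]
print(top_k_overlap(pagerank, qrank, k=4))  # 2 shared out of 4 -> 0.5
```

An overlap of 1.0 means the two methods retrieve the same top-k set (possibly in a different order); lower values indicate that query-log information has reshuffled the result set.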

# | PageRank | QRank-Q-Sim | QRank-Q/D-Sim | QRank-Q-NoSim | QRank-Q/D-NoSim
1 | 1990s | 1987 | 1990s | 1987 | 1990s
2 | Mel Gibson | 1990s | 1970s | 1990s | Mel Gibson
3 | 1970s | 1970s | Mel Gibson | 1970s | 1970s
4 | 1980s | 1980s | 1980s | Mel Gibson | 1980s
5 | 1995 | Mel Gibson | 1987 | 1980s | 1987
6 | 1987 | 1995 | 1995 | 1995 | 1995
7 | January 3 | Deaths in 2003 | January 3 | January 3 | January 3
8 | 2000 in film | January 3 | Deaths in 2003 | Deaths in 2003 | 2000 in film
9 | Jesus | Academy Award for Best Picture | 2000 in film | Academy Award for Directing | Jesus
10 | Academy Award for Directing | Academy Award for Directing | Jesus | 2000 in film | Academy Award for Directing

Table B.1: Top-10 result rankings for the query "Mel Gibson what women want"



# | PageRank | QRank-Q-Sim | QRank-Q/D-Sim | QRank-Q-NoSim | QRank-Q/D-NoSim
1 | United States | United States | United States | United States | United States
2 | North America | Latin America | Latin America | Latin America | North America
3 | South America | South America | South America | South America | South America
4 | Europe | Brazil | North America | Brazil | Latin America
5 | Canada | North America | Europe | North America | Europe
6 | Latin America | Canada | Brazil | Europe | Canada
7 | Mexico | Europe | Canada | Canada | Brazil
8 | Spain | Ireland | Mexico | Mexico | Mexico
9 | Netherlands | Mexico | Netherlands | Netherlands | Netherlands
10 | 16th century | Chile | Ireland | 20th century | Spain

Table B.2: Top-10 result rankings for the query "Largest country South America"

# | PageRank | QRank-Q-Sim | QRank-Q/D-Sim | QRank-Q-NoSim | QRank-Q/D-NoSim
1 | United States | Postcolonial theory | Postcolonial theory | Postcolonial theory | Postcolonial theory
2 | Australia | Colony | United States | United States | United States
3 | United Kingdom | United States | Australia | Colony | Australia
4 | 17th century | History of Brazil | Colony | United Kingdom | United Kingdom
5 | 18th century | Australia | United Kingdom | Australia | Colony
6 | France | United Kingdom | 17th century | France | 17th century
7 | 19th century | Brazil | 18th century | 17th century | France
8 | Colony | Sicherheitsdienst | France | 18th century | 18th century
9 | Spain | France | British Empire | 19th century | 19th century
10 | World War I | Suharto | New England | Brazil | Spain

Table B.3: Top-10 result rankings for the query "Colonial power"

# | PageRank | QRank-Q-Sim | QRank-Q/D-Sim | QRank-Q-NoSim | QRank-Q/D-NoSim
1 | Airbus Industrie | DaimlerChrysler | DaimlerChrysler | DaimlerChrysler | DaimlerChrysler
2 | Aircraft Manufacturers | February 2004 | Airbus Industrie | Airbus Industrie | Airbus Industrie
3 | Air France | Airbus Industrie | Aircraft Manufacturers | February 2004 | Aircraft Manufacturers
4 | Boeing | Aircraft Manufacturers | International Aero Engines | Aircraft | Air France
5 | DaimlerChrysler | Aircraft | February 2004 | Aircraft Manufacturers | Boeing
6 | International Aero Engines | International Aero Engines | Air France | Air France | International Aero Engines
7 | Airbus A 340 | Air France | Boeing | Boeing | Airbus A 340
8 | Scandinavian Airlines | Airbus A 340 | Air 2000 | Airbus A 340 | Boeing 747
9 | Airbus A 380 | Air 2000 | Airbus A 340 | Boeing 747 | Scandinavian Airlines
10 | Boeing 747 | Boeing | Airbus A 380 | International Aero Engines | February 2004

Table B.4: Top-10 result rankings for the query "Airbus company"


# | PageRank | QRank-Q-Sim | QRank-Q/D-Sim | QRank-Q-NoSim | QRank-Q/D-NoSim
1 | Formula One | 1986 in sports | Auto Racing | 1985 | Formula One
2 | Auto Racing | 1985 | Formula One | Formula One | Auto Racing
3 | Formula | Auto Racing | Formula | Auto Racing | 1985
4 | 1985 | Formula One | 1985 | 1994 | Formula
5 | Racing | 1949 | Racing | 1949 | Racing
6 | 1994 | February 2004 | 1949 | Formula | 1994
7 | 1971 | Formula | 1994 | Racing | 1949
8 | 1949 | 1994 | 1986 in sports | 1971 | 1971
9 | 1930 | 1955 | 1955 | 1977 | 1977
10 | June 11 | Racing | Lola (racing car company) | 1955 | 1930

Table B.5: Top-10 result rankings for the query "Formula One racing"

# | PageRank | QRank-Q-Sim | QRank-Q/D-Sim | QRank-Q-NoSim | QRank-Q/D-NoSim
1 | 1980s | South Africa | 1980s | 1980s | 1980s
2 | South Africa | 1980s | South Africa | South Africa | South Africa
3 | Black Death | AIDS | AIDS | AIDS | AIDS
4 | AIDS | History of Europe | History of Europe | Black Death | Black Death
5 | Medicine | Medicine | Medicine | Medicine | Medicine
6 | History of England | Black Death | Black Death | BSE | BSE
7 | BSE | History of England | History of the world | History of England | History of England
8 | Devon | History of the world | History of England | Spanish Flu | Devon
9 | Great Plague | November 2003 | BSE | Claudius II | Great Plague
10 | Dutch elm disease | BSE | Patrick Trevor-Roper | Devon | Dutch elm disease

Table B.6: Top-10 result rankings for the query "Epidemic Britain 2001"

# | PageRank | QRank-Q-Sim | QRank-Q/D-Sim | QRank-Q-NoSim | QRank-Q/D-NoSim
1 | 1993 | 1992 | 1993 | 1992 | 1993
2 | 1992 | December 2 | Sixto Escobar | 1993 | 1992
3 | Pablo Escobar | 1993 | December 2 | December 2 | December 2
4 | December 2 | Chile | 1992 | Chile | Pablo Escobar
5 | Sixto Escobar | Sixto Escobar | Pablo Escobar | Pablo Escobar | Chile
6 | Chile | Pablo Escobar | Chile | July 22 | Sixto Escobar
7 | Isabela, Puerto Rico | State leaders in 2002 | La Catedral | Isabela, Puerto Rico | Isabela, Puerto Rico
8 | La Catedral | History of Colombia | Medellin Cartel | Sixto Escobar | July 22
9 | Medellin Cartel | La Catedral | Isabela, Puerto Rico | Cocaine | La Catedral
10 | Corporate leaders | Salvadoran presidential election 2004 | Presidents of Paraguay | Corporate leaders | Corporate leaders

Table B.7: Top-10 result rankings for the query "Escobar"


# | PageRank | QRank-Q-Sim | QRank-Q/D-Sim | QRank-Q-NoSim | QRank-Q/D-NoSim
1 | Balkans | Balkans | Balkans | Balkans | Balkans
2 | Bulgaria | Europe | Europe | Bulgaria | Bulgaria
3 | Europe | Bulgaria | Bulgaria | Europe | Europe
4 | Ottoman Empire | 1912 | 1912 | Ottoman Empire | Ottoman Empire
5 | 1912 | Eastern Europe | Ottoman Empire | 1912 | 1912
6 | Greece | 1913 | Eastern Europe | Greece | Greece
7 | 1990s | Ottoman Empire | Balkanization | 1913 | 1990s
8 | Balkanization | 1990s | 1913 | 1990s | 1913
9 | 1913 | Greece | Greece | Eastern Europe | Eastern Europe
10 | Middle East | Balkanization | October 18 | Spain | Balkanization

Table B.8: Top-10 result rankings for the query "Balkan"

# | PageRank | QRank-Q-Sim | QRank-Q/D-Sim | QRank-Q-NoSim | QRank-Q/D-NoSim
1 | Germany | Germany | Germany | Germany | Germany
2 | England | Ostmark | East Germany | Ostmark | Ostmark
3 | World War II | East Germany | Ostmark | Weimar | East Germany
4 | World War I | German reunification | German reunification | East Germany | West Germany
5 | West Germany | Weimar | Weimar | German reunification | Weimar
6 | East Germany | West Germany | West Germany | West Germany | World War II
7 | Europe | President of Germany | World War I | Hamburg | England
8 | United States | Hamburg | World War II | 20th century | World War I
9 | Prussia | PAL | England | World War II | German reunification
10 | Berlin | World War I | Prussia | World War I | 20th century

Table B.9: Top-10 result rankings for the query "East Germany 1989"


# | PageRank | QRank-Q-Sim | QRank-Q/D-Sim | QRank-Q-NoSim | QRank-Q/D-NoSim
1 | US President | US President | US President | US President | US President
2 | 2000 | Political convention | Primary election | Primary election | US Democratic Party
3 | US Democratic Party | US Electoral College | Political convention | US Electoral College | 2000
4 | United States | Primary election | US presidential election 2004 | Political convention | United States
5 | 2004 | US presidential election 2004 | US Electoral College | US presidential election 2004 | US Electoral College
6 | President | US presidential nominating convention | US Democratic Party | US Democratic Party | 2004
7 | US Republican Party | February 2004 | US presidential election 1900 | US Democratic Party presidential nomination 2004 | Primary election
8 | 1960 | US Democratic Party presidential nomination 2004 | US presidential nominating convention | US presidential nominating convention | Political convention
9 | US Vice President | US Democratic Party | 2000 | 2000 | US presidential election 2004
10 | 1992 | US Senate | US Vice President | US Senate | President

Table B.10: Top-10 result rankings for the query "US presidential candidate 1992"