
ORGANIZING USER SEARCH HISTORIES

A project report submitted to

JAWAHARLAL NEHRU TECHNOLOGICAL UNIVERSITY-A, ANANTAPUR

In partial fulfillment of the requirements for the award of the degree of

BACHELOR OF TECHNOLOGY

IN COMPUTER SCIENCE AND ENGINEERING

Submitted by

IV B.Tech II Semester

Under the Esteemed Guidance of

Prof. D. Jatin Das, B.E., M.Sc(Tech-CS)

Prof & Dean(Freshmen)

Department of Computer Science and Engineering

(Autonomous)

(Affiliated to JNTUA, Anantapur and approved by AICTE, New Delhi)

Sree Sainathnagar, A.Rangampet, Tirupathi-517102.

(2009-2013)

M.VANI (09121A0565)

C.MOHAN BABU (09121A0519)

K.MANOJ KUMAR (09121A0543)

B.PAVAN KUMAR (09121A0516)


(Autonomous)

(Affiliated to JNTUA, Anantapur and approved by AICTE, New Delhi)

Sree Sainathnagar, A.Rangampet, Tirupathi-517102,Chittoor Dist.,A.P.

Department of Computer Science and Engineering

Certificate

This is to certify that the project work entitled

“ORGANIZING USER SEARCH HISTORIES”

is the bonafide work done by the students listed below in the Department of
Computer Science and Engineering, Sree Vidyanikethan Engineering College,
A.Rangampet (affiliated to JNTU-Anantapur), in partial fulfillment of the
requirements for the award of Bachelor of Technology in Computer Science and
Engineering during 2009-2013.

This work has been carried out under my guidance and supervision.

The results embodied in this project report have not been submitted to any
other University or Organization for the award of any degree or diploma.

INTERNAL GUIDE:

PROF. D. JATIN DAS, B.E., M.Sc(Tech-CS)

PROFESSOR & DEAN(FRESHMEN)

DEPT OF CSE

S.V.E.C

A.RANGAMPET

INTERNAL EXAMINER EXTERNAL EXAMINER

M.VANI (09121A0565)

C.MOHAN BABU (09121A0519)

K.MANOJ KUMAR (09121A0543)

B.PAVAN KUMAR (09121A0516)

HEAD OF DEPARTMENT:

DR. A.SENGUTTUVAN M.E, PH.D.

PROFESSOR & HEAD

DEPT OF CSE

S.V.E.C

A.RANGAMPET


ACKNOWLEDGEMENT

Before getting into the thick of things, we would like to thank the
people who were part of our project in numerous ways and who gave us
outstanding support from the birth of the project.

We are extremely thankful to our beloved Chairman and Founder,
Padmasri Dr. M. Mohan Babu, and Special Officer Prof. T. Gopal Rao, for
providing the necessary infrastructure and resources for the accomplishment
of our project at Sree Vidyanikethan Engineering College, Tirupathi.

We are highly indebted to Dr. P.C.Krishnama Chary, Principal of

Sree Vidyanikethan Engineering College, for his support during the

tenure of the project.

We are very much obliged to our beloved Dr. A. Senguttuvan,
Head of the Department of Computer Science & Engineering, Sree
Vidyanikethan Engineering College, for providing the opportunity to
undertake this project and for his encouragement in completing it.

We hereby wish to express our deep sense of gratitude to

Prof.D.Jatin Das, Prof & Dean(Freshmen), Department of Computer

Science and Engineering, Sree Vidyanikethan Engineering College for the

esteemed guidance, moral support and invaluable advice provided by him

for the success of the project.

We are also thankful to all the staff members of the Computer Science
and Engineering department who have cooperated in making our project
a success. We would also like to thank all our parents and friends who
extended their help, encouragement and moral support, either directly or
indirectly, in our project work.

Thanks for Your Valuable Guidance and kind support.


DECLARATION

We hereby declare that the project report entitled “ORGANIZING
USER SEARCH HISTORIES” is a genuine project work carried out by us
in the B.Tech (Computer Science and Engineering) degree course of
JAWAHARLAL NEHRU TECHNOLOGICAL UNIVERSITY, ANANTAPUR,
and has not been submitted by us to any other course or University
for the award of any degree.

Signature of the Student

1.

2.

3.

4.


ABSTRACT

Users are increasingly pursuing complex task-oriented goals on the

Web, such as making travel arrangements, managing finances or planning

purchases. To this end, they usually break down the tasks into a few
co-dependent steps and issue multiple queries around these steps repeatedly

over long periods of time.

To better support users in their long-term information quests on the

Web, search engines keep track of their queries and clicks while searching

online. In this paper, we study the problem of organizing a user’s

historical queries into groups in a dynamic and automated fashion.

Automatically identifying query groups is helpful for a number of different

search engine components and applications, such as query suggestions,

result ranking, query alterations, sessionization, and collaborative search.

In our approach, we go beyond approaches that rely on textual

similarity or time thresholds, and we propose a more robust approach

that leverages search query logs. We experimentally study the

performance of different techniques, and showcase their potential,

especially when combined together.


CONTENTS

Chapter  Name of the Chapter                                        Page

1.  Introduction                                                     01
    1.1 Introduction                                                 02
    1.2 Statement of the Problem                                     03
    1.3 Objectives                                                   03
    1.4 Scope                                                        03
    1.5 Applications                                                 04
    1.6 Limitations                                                  04
2.  Literature Survey                                                05
    2.1 Information Re-Retrieval                                     06
    2.2 Query Chains                                                 07
    2.3 Query Clustering Using Click-Through Graph                   08
    2.4 Beyond the Session Timeout                                   09
3.  Analysis                                                         11
    3.1 Existing System                                              12
    3.2 Disadvantages                                                12
    3.3 Proposed System                                              12
    3.4 Advantages                                                   13
    3.5 System Used                                                  13
4.  Design                                                           14
    4.1 UML Diagrams                                                 15
5.  Implementation                                                   19
    5.1 Query Group                                                  20
    5.2 Search History                                               21
    5.3 Query Relevance and Search Logs                              21
    5.4 Dynamic Query Grouping                                       22
6.  Testing                                                          23
    6.1 Types of Tests                                               24
    6.2 Test Strategy and Approach                                   26
    6.3 Test Results                                                 26
7.  Results and Performance Evaluation                               27
    7.1 Results                                                      28
    7.2 Performance Evaluation                                       28
8.  Conclusion and Future Work                                       29
9.  Appendix                                                         31
    9.1 Object Oriented Analysis and Design through
        Unified Modeling Language                                    32
    9.2 Software Environment                                         36
    9.3 Coding                                                       46
    9.4 List of Figures                                              67
    9.5 List of Tables                                               68
    9.6 List of Abbreviations                                        69
    9.7 Screen Shots                                                 70
10. References                                                       91
11. Base Paper                                                       92

1. INTRODUCTION

1.1 INTRODUCTION

Users are increasingly relying on web search engines. People carry out a
search task by dividing it into co-dependent steps and issuing multiple
queries, and the search engine needs to keep track of all of these queries.
However, the primary means of accessing information online is still through
keyword queries to a search engine.

A complex task such as travel arrangement has to be broken down

into a number of co-dependent steps over a period of time.

For instance, a user may first search on possible destinations, timeline,

events, etc. After deciding when and where to go, the user may then search

for the most suitable arrangements for air tickets, rental cars, lodging,

meals, etc. Each step requires one or more queries, and each query results

in one or more clicks on relevant pages.

One important step towards enabling services and features that can

help users during their complex search quests online is the capability to

identify and group related queries together.

Recently, some of the major search engines have introduced a new
“Search History” feature, which allows users to track their online searches
by recording their queries and clicks; the Bing search engine, for example,
introduced such a feature in February 2010. This history includes a sequence
of queries displayed in reverse chronological order together with their
corresponding clicks.


In addition to viewing their search history, users can manipulate it by

manually editing and organizing related queries and clicks into groups, or by

sharing them with their friends. While these features are helpful, the manual

efforts involved can be disruptive and will be untenable as the search history

gets longer over time.

1.2 STATEMENT OF THE PROBLEM

There are many query grouping algorithms which can be used by many
users, but users do not know which query grouping algorithm suits query
grouping well. Time-based query grouping works well in only a few cases,
since a user can perform multitask activities on a system. Similarly,
keyword-based query grouping works well only in some cases, because the
same keywords can have different meanings. Therefore, query grouping should
be done based on relevance. This work is often tedious and complex, since
users' searches differ from one user to another.

1.3 OBJECTIVES

Our main objective is to group queries in a dynamic fashion. Query
grouping can be done on queries that are closely related, based on relevant
queries and query clicks. Relevance among query groups can be measured
using search logs, such as query reformulations and clicks. Our measure of
relevance is aimed at capturing two important properties of relevant
queries, namely: (1) queries that frequently appear together as
reformulations and (2) queries that have induced the users to click on
similar sets of pages.


Query Reformulation + Query Click → Query Fusion

1.4 SCOPE

Our work can also be extended to store data in cloud computing and
can be used in any large business industry. We can also calculate the
number of visitors who clicked a particular site. We can extend our work to
learn each user's taste, so that, based on each user's searches, we can
also add suggestions while searching. For now, we implement query grouping
in a small area, i.e., within a particular database such as travelling.

1.5 APPLICATIONS

Enterprise business applications.

Web applications.

Calculating the number of visitors who clicked a particular site.

Choosing the right keywords for searching.

1.6 LIMITATIONS

Search by relevance filters results according to the information
that a particular user has given, which rarely provides an accurate
reflection of a user's real interests.

We considered a small database, travelling, with some subfields
such as Marriage, Pilgrims, Sight-Seeing Places, Job Searching,
Colleges, etc.

Search results can be effectively retrieved using keywords based on
these subfields.

2. LITERATURE SURVEY

The literature survey is an important step in the software development
process. Before developing the tool, it is necessary to determine the time
factor, economy and company strength. Once these things are satisfied, the
next step is to determine which operating system and language can be used
for developing the tool. Once the programmers start building the tool, they
need a lot of external support. This support can be obtained from senior
programmers, from books or from websites. Before building the system, the
above considerations are taken into account for developing the proposed
system.

2.1 Information Re-Retrieval:

People often repeat Web searches, both to find new information on

topics they have previously explored and to re-find information they have

seen in the past. The query associated with a repeat search may differ from

the initial query but can nonetheless lead to clicks on the same results.
Re-finding appears to be an important behavior for search engines to
explicitly

support, and we explore how this can be done. We demonstrate that

changes to search engine results can hinder re-finding, and provide a way to

automatically detect repeat searches and predict repeat clicks. Log analysis

has shown that Web site re-visitation is very common Log analysis allows

researchers to observe a greater variety of behavior than laboratory and

observational studies, and gives a very realistic picture of people’s ones,

though it gives no insight into people’s underlying motivation. To

successfully identify repeat queries in this data, it was necessary to

associate queries by inferring the intent of the user, rather than relying on

Page 14: B.Tech Project Documentation

LITERATURE SURVEY

Department of Computer Science & Engineering 7

the exact query string being repeated. Query strings used to re-find can

differ from their original forms. To understand how a change in a result’s

rank affected click behavior, we looked at how likely a result was to be

clicked again in many ways. We compared the probability that any given

click would be a repeat click for these queries under two conditions: (i) when

a change in rank was observed among one of the overlapping clicks and (ii)

where no rank change was observed.

The strongest predictors for a click on a new result included the

number of times the query was issued previously (and if it was issued more

than once before), whether any previously viewed result was clicked more

than once, and several features that were the same for queries that were

repeated only twice:

Number of clicks the first time the query was issued.

Number of clicks the previous time the query was issued.

Number of unique clicks the previous time.
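The core idea above — that a repeat search should be recognized by the user's intent (e.g., the results clicked) rather than by an exact match of the query string — can be sketched as below. The class and method names are ours for illustration only, and a shared clicked result stands in for the richer intent-inference features discussed in the survey.

```java
import java.util.*;

// Sketch: associate queries by their clicked results rather than the exact
// query string, so re-finding queries with different wording are still
// recognized as repeats of the same search.
public class RepeatSearchSketch {

    // Maps each query string to the set of result URLs the user clicked.
    private final Map<String, Set<String>> clicks = new HashMap<>();

    public void recordClick(String query, String url) {
        clicks.computeIfAbsent(query, q -> new HashSet<>()).add(url);
    }

    // Two queries are treated as the same search intent when their
    // clicked-result sets overlap, even if the query strings differ.
    public boolean sameIntent(String q1, String q2) {
        Set<String> c1 = clicks.getOrDefault(q1, Collections.emptySet());
        Set<String> c2 = clicks.getOrDefault(q2, Collections.emptySet());
        for (String url : c1) {
            if (c2.contains(url)) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        RepeatSearchSketch s = new RepeatSearchSketch();
        s.recordClick("jntu anantapur results", "jntua.ac.in/results");
        s.recordClick("jntua results 2013", "jntua.ac.in/results");
        s.recordClick("train timings", "indianrail.gov.in");
        System.out.println(s.sameIntent("jntu anantapur results", "jntua results 2013")); // true
        System.out.println(s.sameIntent("jntu anantapur results", "train timings"));      // false
    }
}
```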

2.2 QUERY CHAINS:

Designing effective ranking functions for free-text retrieval has proved
notoriously difficult. Retrieval functions designed for one collection and
application often do not work well on other collections without additional
time-consuming modifications.

We refer to a sequence of reformulated queries as a query chain. When

queries are considered independently, Log files only provide implicit

feedback on a few results at the top of the result set for each query.

The key contribution of this work is recognizing that we can successfully

use evidence of query chains that is present in search engine log files to

learn better retrieval functions.


In order to infer implicit preference judgments from log files, we need
to understand how users access search results. Clearly, we can only derive
valid feedback for results that the user actually looked at and assessed.
An eye tracking study was performed to observe how users formulate queries
and assess the results returned by the search engine. Finally, the model
trained using query chains outperforms the model trained without query
chains with over 99% confidence, using the same test.

2.3 Query Clustering Using Click-through Graph:

Users who pose a query to a web search engine often have specific

information needs in mind, such as finding the address of a business, an

article about a historic event, a company’s home page, and so on. Users

select pages by clicking on links on a search engine result page that deem to

be closely relevant to their intended information needs. Considering

collective filtering, it is reasonable to assume a frequently clicked set of

pages for a query reflects the kinds of information that the users intend to

find by posing the query. Further, it is observed that users of similar

information needs click on a similar set of pages, even though the queries

they pose may vary, thus forming a cluster of queries and clicked pages that

are more strongly connected to each other than with the rest of queries and

clicked pages.

We design a query clustering method that takes into account the query
and clicked-page relationship, without considering syntactic or semantic
features of the query, such as keywords. The graph consists of a set of web
search queries, a set of pages selected for the queries, and a set of
directed edges that connect a query node and a page node clicked by a user
for the query.

The proposed method extracts all maximal bipartite cliques (bicliques) from

a click-through graph and computes an equivalence set of queries from the

maximal bicliques. A cluster of queries is formed from the queries in a

biclique. We present a scalable algorithm that enumerates all maximal

bicliques from the click-through graph. We represent the query and
click-through page relationships by a directed bipartite graph that
consists of a

set of queries, a set of web page URLs, and a set of edges that connect a

query node to a page node in the graph. The proposed query clustering

method involves maximal biclique enumeration problem.

Maximal biclique generation from a bipartite graph is a special
case of the maximal clique generation problem on a general graph. Let
G = (V1 ∪ V2, E) be a bipartite graph, where V1 and V2 are the two disjoint
sets of nodes, and E is a set of edges connecting nodes in V1 and V2. To
generate maximal bicliques, G is transformed into a general graph
G' = (V1 ∪ V2, E'), where E' = E ∪ (V1 × V1) ∪ (V2 × V2).

At each step of the biclique generation algorithm we either generate a

biclique, or exclude a node and the associated edges from the graph. After

generating a biclique, the sub graph corresponding to the biclique is

removed from the click-through graph. For each biclique generated, the

query set, forms an equivalence set that becomes a query cluster.
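Full maximal-biclique enumeration is too involved for a short listing, but the final step above — the query set of a biclique forming an equivalence set that becomes a cluster — can be sketched in a much-simplified form: queries whose clicked-URL sets coincide exactly form one cluster. This covers only a special case of the biclique method described above, and all names are illustrative.

```java
import java.util.*;

// Much-simplified sketch of click-through-graph clustering: queries whose
// clicked-URL sets are identical form an equivalence set (one query cluster).
// Full maximal-biclique enumeration, as described above, is more general.
public class ClickGraphClusterSketch {

    public static Collection<List<String>> cluster(Map<String, Set<String>> clickGraph) {
        // Group queries by their exact clicked-URL set.
        Map<Set<String>, List<String>> byClicks = new HashMap<>();
        for (Map.Entry<String, Set<String>> e : clickGraph.entrySet()) {
            byClicks.computeIfAbsent(e.getValue(), k -> new ArrayList<>()).add(e.getKey());
        }
        return byClicks.values();
    }

    public static void main(String[] args) {
        Map<String, Set<String>> g = new HashMap<>();
        g.put("tirupathi pilgrimage", new HashSet<>(Arrays.asList("tirumala.org", "ttdsevaonline.com")));
        g.put("tirumala darshan", new HashSet<>(Arrays.asList("tirumala.org", "ttdsevaonline.com")));
        g.put("rental cars", new HashSet<>(Arrays.asList("cars.example.com")));
        // Two clusters: the two pilgrimage queries together, "rental cars" alone.
        System.out.println(cluster(g));
    }
}
```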


2.4 Beyond the Session Timeout: Automatic Hierarchical Segmentation of
Search Topics in Query Logs

Web search engines attempt to satisfy users' information needs by
ranking web pages with respect to queries. But the reality of web search is
that it is often a process of querying, learning and reformulating.

If we are able to accurately identify sets of queries with the same

information-seeking intent, then we will be in a better position to evaluate

the performance of a web search engine from the user’s point of view. To

this end, we built classifiers to identify task and subtasks boundaries, as well

as pairs of queries which correspond to the same task.

Historically, a session was a set of queries issued in succession to
satisfy a single information need. A separate body of work models the
formal syntax of users' interactions with the search engine rather than
making distinctions regarding what they seek. A search goal is an atomic
information need, resulting in one or more queries.

We are able to build highly accurate classifiers for goal and mission

boundaries as well as identifying pairs of queries from the same goal or

mission. Finally, we have shown that a diverse set of syntactic, temporal,

query log and web search features in combination can predict goal and

mission boundaries well. Additionally, we have shown that the task of

matching queries within the same interleaved goal or mission is harder than

identifying boundaries.

3. ANALYSIS

3.1 Existing System

With the growth of computer technology, large amounts of data are
stored in databases. Users can manually organize their data into groups
with their own effort. Also, query grouping is done in an iterative manner.

3.2 Disadvantages

There are problems with the existing system. First, query grouping
may have the undesirable effect of changing a user's existing query groups,
potentially undoing the user's own manual efforts in organizing her
history. Second, it involves a high computational cost, since we would have
to repeat a large number of query group similarity computations for every
new query.

3.3 Proposed System:

The following are the contributions made in our proposed system:

1. We investigate how signals from search logs such as query

reformulations and clicks can be used together to determine the

relevance among query groups. We study two potential ways of using

clicks in order to enhance this process by fusing the query

reformulation graph and the query click graph into a single graph that

we refer to as the query fusion graph, and by expanding the query set

when computing relevance to also include other queries with similar

clicked URLs.

Page 20: B.Tech Project Documentation

ANALYSIS

Department of Computer Science & Engineering 13

2. We show through comprehensive experimental evaluation the

effectiveness and the robustness of our proposed search log-based

method, especially when combined with approaches using other

signals such as text similarity.
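One plausible way to realize the fusion named in contribution 1 is to combine the edge weights of the reformulation graph and the click graph with a mixing parameter. The sketch below assumes this convex-combination form; the parameter value, edge representation, and names are our own illustrative choices, not the report's exact formulation.

```java
import java.util.*;

// Sketch of fusing the query reformulation graph and the query click graph
// into a single query fusion graph. Edge weights are combined with a mixing
// parameter alpha; the value 0.5 in main is illustrative, not from the report.
public class QueryFusionSketch {

    // Each graph is a map from "from->to" edge keys to a weight.
    public static Map<String, Double> fuse(Map<String, Double> reform,
                                           Map<String, Double> click,
                                           double alpha) {
        Map<String, Double> fusion = new HashMap<>();
        Set<String> edges = new HashSet<>(reform.keySet());
        edges.addAll(click.keySet());
        for (String edge : edges) {
            double wr = reform.getOrDefault(edge, 0.0);
            double wc = click.getOrDefault(edge, 0.0);
            fusion.put(edge, alpha * wr + (1 - alpha) * wc);
        }
        return fusion;
    }

    public static void main(String[] args) {
        Map<String, Double> reform = new HashMap<>();
        reform.put("goa flights->goa hotels", 0.8);
        Map<String, Double> click = new HashMap<>();
        click.put("goa flights->goa hotels", 0.4);
        click.put("goa flights->goa beaches", 0.6);
        // Fused weights for both edges, mixed half-and-half.
        System.out.println(fuse(reform, click, 0.5));
    }
}
```

An edge present in only one graph still appears in the fusion graph, with the missing weight treated as zero.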

3.4 Advantages

Using the proposed system, we will focus on evaluating the

effectiveness of the proposed algorithms in capturing query relevance.

User need not group manually his search history into groups.

In this system, online query grouping process is done which results in

high efficiency of search.

3.5 System Used:

The following are the requirements used to run the proposed system.

Hardware System Configuration:

Any contemporary PC

Software System Configuration:

Operating System : Windows XP / UNIX
Front End        : HTML, Java, JSP
Scripts          : JavaScript
Database         : RDBMS
Tools            : RSA (IBM)

4. DESIGN

4.1 UML DIAGRAMS:

Class Diagram:

Fig.4.1. Class Diagram


Use Case Diagram:

Fig.4.2. Use case Diagram


Use Case Diagram 2:

Fig.4.3 Use case Diagram


Sequence Diagram:

Fig.4.4. Sequence Diagram

5. IMPLEMENTATION

Module Description:

1. Query Group

2. Search history

3. Query Relevance and Search logs

4. Dynamic Query Grouping

5.1 Query Group:

We need a relevance measure that is robust enough to identify similar

query groups beyond the approaches that simply rely on the textual content

of queries or time interval between them. Our approach makes use of search

logs in order to determine the relevance between query groups more

effectively. In fact, the search history of a large number of users contains

signals regarding query relevance, such as which queries tend to be issued

closely together (query reformulations), and which queries tend to lead to

clicks on similar URLs (query clicks). Such signals are user-generated and

are likely to be more robust, especially when considered at scale. We

suggest measuring the relevance between query groups by exploiting the

query logs and the click logs simultaneously.

Fig. 5.1 Query Grouping


5.2 Search History:

We study the problem of organizing a user’s search history into a set

of query groups in an automated and dynamic fashion. Each query group is

a collection of queries by the same user that are relevant to each other

around a common informational need. These query groups are dynamically

updated as the user issues new queries, and new query groups may be

created over time.

Fig. 5.2 Search History

5.3 Query Relevance and Search logs:

We now develop the machinery to define the query relevance based on

Web search logs. Our measure of relevance is aimed at capturing two

important properties of relevant queries, namely: (1) queries that frequently

appear together as reformulations and (2) queries that have induced the

users to click on similar sets of pages. We start our discussion by introducing

three search behavior graphs that capture the aforementioned properties.

Following that, we show how we can use these graphs to compute query


relevance and how we can incorporate the clicks following a user’s query in

order to enhance our relevance metric.
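Two of the search behavior graphs mentioned above can be sketched as being built from a toy search log: a reformulation graph over consecutive queries in a session, and a click graph from queries to clicked URLs. Edge weights here are simple counts, and the log format and names are our own assumptions; the report's normalization details are not reproduced.

```java
import java.util.*;

// Sketch of building two search behavior graphs from a toy search log:
// a reformulation graph over consecutive queries, and a click graph from
// queries to clicked URLs. Weights are raw counts in this simplification.
public class BehaviorGraphSketch {

    public final Map<String, Integer> reformEdges = new HashMap<>();
    public final Map<String, Integer> clickEdges = new HashMap<>();

    // A log session is an ordered list of {query, clickedUrl} entries.
    public void addSession(List<String[]> session) {
        String prevQuery = null;
        for (String[] entry : session) {
            String query = entry[0], url = entry[1];
            // Consecutive distinct queries form a reformulation edge.
            if (prevQuery != null && !prevQuery.equals(query)) {
                reformEdges.merge(prevQuery + "->" + query, 1, Integer::sum);
            }
            clickEdges.merge(query + "->" + url, 1, Integer::sum);
            prevQuery = query;
        }
    }

    public static void main(String[] args) {
        BehaviorGraphSketch g = new BehaviorGraphSketch();
        g.addSession(Arrays.asList(
            new String[]{"goa flights", "air.example.com"},
            new String[]{"goa hotels", "stay.example.com"}));
        System.out.println(g.reformEdges); // one reformulation edge
        System.out.println(g.clickEdges);  // one click edge per query
    }
}
```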

5.4 Dynamic Query Grouping:

One approach to the identification of query groups is to first treat

every query in a user’s history as a singleton query group, and then merge

these singleton query groups in an iterative fashion. However, this is

impractical in our scenario for two reasons. First, it may change a user's
existing query groups, potentially undoing the user's own manual efforts in
organizing her history.

Second, it involves a high computational cost, since we would have to repeat

a large number of query group similarity computations for every new query.
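The online alternative implied above — attaching each new query to its most relevant existing group, or starting a new singleton group — can be sketched as follows. Jaccard similarity over clicked URLs stands in for the report's search-log relevance measure, and the threshold and all names are our own illustrative simplifications.

```java
import java.util.*;

// Sketch of dynamic query grouping: each new query joins the most relevant
// existing group (relevance above a threshold) or starts a singleton group.
// Jaccard similarity on clicked URLs is a stand-in for the report's measure.
public class DynamicGroupingSketch {

    public static class QueryGroup {
        public final List<String> queries = new ArrayList<>();
        public final Set<String> clickedUrls = new HashSet<>();
    }

    private final List<QueryGroup> groups = new ArrayList<>();
    private final double threshold;

    public DynamicGroupingSketch(double threshold) { this.threshold = threshold; }

    public void addQuery(String query, Set<String> clicks) {
        QueryGroup best = null;
        double bestSim = threshold;
        for (QueryGroup g : groups) {
            double sim = jaccard(g.clickedUrls, clicks);
            if (sim >= bestSim) { bestSim = sim; best = g; }
        }
        if (best == null) { best = new QueryGroup(); groups.add(best); }
        best.queries.add(query);
        best.clickedUrls.addAll(clicks);
    }

    static double jaccard(Set<String> a, Set<String> b) {
        if (a.isEmpty() && b.isEmpty()) return 0.0;
        Set<String> inter = new HashSet<>(a); inter.retainAll(b);
        Set<String> union = new HashSet<>(a); union.addAll(b);
        return (double) inter.size() / union.size();
    }

    public int groupCount() { return groups.size(); }

    public static void main(String[] args) {
        DynamicGroupingSketch s = new DynamicGroupingSketch(0.3);
        s.addQuery("goa flights", new HashSet<>(Arrays.asList("air.example.com")));
        s.addQuery("cheap goa flights", new HashSet<>(Arrays.asList("air.example.com", "fly.example.com")));
        s.addQuery("java tutorial", new HashSet<>(Arrays.asList("docs.example.com")));
        System.out.println(s.groupCount()); // 2
    }
}
```

Because each new query is compared only against existing groups (not against every past query), the per-query cost stays proportional to the number of groups, which addresses the computational-cost objection above.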

6. TESTING

PURPOSE:

The purpose of testing is to discover errors. Testing is the process of
trying to discover every conceivable fault or weakness in a work product.
It provides a way to check the functionality of components, subassemblies,
assemblies and/or a finished product. It is the process of exercising
software with the intent of ensuring that the software system meets its
requirements and user expectations and does not fail in an unacceptable
manner. There are various types of test; each test type addresses a
specific testing requirement.

6.1 Types of Tests

Unit testing:

Unit testing involves the design of test cases that validate that the
internal program logic is functioning properly, and that program inputs
produce valid outputs. All decision branches and internal code flow should
be validated. It is the testing of individual software units of the
application, and it is done after the completion of an individual unit,
before integration. This is structural testing that relies on knowledge of
the unit's construction and is invasive. Unit tests perform basic tests at
the component level and test a specific business process, application,
and/or system configuration. Unit tests ensure that each unique path of a
business process performs accurately to the documented specifications and
contains clearly defined inputs and expected results.


Functional Testing:

Functional tests provide systematic demonstrations that functions

tested are available as specified by the business and technical requirements,

system documentation, and user manuals.

Functional testing is centered on the following items:

Valid Input: Identified classes of valid input must be accepted.

Invalid Input: Identified classes of invalid input must be rejected.

Functions: Identified functions must be exercised.

Output: Identified classes of application outputs must be exercised.

Systems/Procedures: Interfacing systems or procedures must be invoked.

Organization and preparation of functional tests is focused on
requirements, key functions, or special test cases. In addition, systematic
coverage pertaining to identifying business process flows, data fields,
predefined processes, and successive processes must be considered for
testing. Before functional testing is complete, additional tests are
identified and the effective value of current tests is determined.


6.2 Test Strategy and Approach

Field-testing will be performed manually and functional tests will be

written in detail.

Test Objectives:

All field entries must work properly.

Pages must be activated from the identified link.

The entry screen, messages and responses must not be delayed.

Features to be tested:

Verify that the entries are of the correct format.

No duplicate entries should be allowed.

All links should take the user to the correct page.

6.3 Test Results

All the test cases mentioned above passed successfully. No defects

encountered.

7. RESULTS AND PERFORMANCE EVALUATION

7.1 RESULTS

Using the proposed system, the following results can be observed.

Result of Travelling Database:

C:\Program Files\Java\jdk1.6.0\bin>javac databasecon.java

C:\Program Files\Java\jdk1.6.0\bin>java databasecon

Colname    Datatype   Length   Decimals   Allow Null
Id         int        20       0          NOT NULL (primary key)
Category   varchar    50       0          NULL
S.cat      varchar    200      0          NULL
Keyword    varchar    225      0          NULL
Des        longtext   0        0          NULL
Currtime   varchar    30       0          NULL
Currdate   varchar    20       0          NULL
Status     varchar    20       0          NULL

7.2 PERFORMANCE EVALUATION

When running normal search engines, no query grouping is done. But
when we apply this relevance-based searching algorithm, query grouping is
performed. Users feel comfortable using this query-group searching. Thus,
searching becomes more efficient.

8. CONCLUSION & FUTURE WORK

8.1 CONCLUSION:

In this project work, we observe that the query reformulation and click
graphs contain useful information on user behavior when searching online,
and we show how such information can be used effectively for the task of
organizing

user search histories into query groups. More specifically, we propose

combining the two graphs into a query fusion graph. We further show that

our approach that is based on probabilistic random walks over the query

fusion graph outperforms time-based and keyword similarity based

approaches. We also find value in combining our method with keyword

similarity-based methods, especially when there is insufficient usage

information about the queries.
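The probabilistic random walk mentioned above can be sketched on a toy fusion graph: all probability mass starts on one query, is spread along outgoing edge weights for a few steps, and the resulting mass on other queries is read as relevance. The graph, weights, and step count below are illustrative only, and the report's normalization and restart details are omitted.

```java
import java.util.*;

// Sketch of a probabilistic random walk over a tiny query fusion graph.
// transitions.get(q) maps each neighbor of q to a probability (rows sum to 1).
public class RandomWalkSketch {

    public static Map<String, Double> walk(Map<String, Map<String, Double>> transitions,
                                           String start, int steps) {
        Map<String, Double> mass = new HashMap<>();
        mass.put(start, 1.0); // all probability mass starts on the query node
        for (int i = 0; i < steps; i++) {
            Map<String, Double> next = new HashMap<>();
            for (Map.Entry<String, Double> e : mass.entrySet()) {
                Map<String, Double> row =
                    transitions.getOrDefault(e.getKey(), Collections.emptyMap());
                if (row.isEmpty()) { // a dangling node keeps its mass
                    next.merge(e.getKey(), e.getValue(), Double::sum);
                    continue;
                }
                for (Map.Entry<String, Double> t : row.entrySet()) {
                    next.merge(t.getKey(), e.getValue() * t.getValue(), Double::sum);
                }
            }
            mass = next;
        }
        return mass;
    }

    public static void main(String[] args) {
        Map<String, Map<String, Double>> g = new HashMap<>();
        g.put("goa flights", Map.of("goa hotels", 0.7, "goa beaches", 0.3));
        g.put("goa hotels", Map.of("goa beaches", 1.0));
        // Mass after two steps, read as relevance to "goa flights".
        System.out.println(walk(g, "goa flights", 2));
    }
}
```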

8.2 FUTURE WORK:

As future work, we intend to investigate the usefulness of the

knowledge gained from these query groups in various applications such as

providing query suggestions and biasing the ranking of search results.


9 APPENDIX


9.1 Introduction to UML:

UML Approach:

UML stands for Unified Modeling Language. UML is a language for specifying, visualizing and documenting a system. Modeling is the step that follows analysis when developing any product; its goal is to produce a model of the entities involved in the project which later need to be built.

Definition: UML is a general purpose visual modeling language that is used

to specify, visualize, construct, and document the artifacts of the software

system.

UML is a language: It provides a vocabulary and rules for communication. This vocabulary and these rules focus on the conceptual and physical representation of a system, so UML is a standard language for software blueprints.

UML is a language for visualization: The UML is more than just a collection of graphical symbols; behind each symbol in the UML notation is a well-defined semantics. In this manner, one developer can write a model in UML and other developers, or other tools, can interpret that model.

UML is a language for specifying: Specifying means building models that are precise, unambiguous and complete. In particular, the UML addresses the specification of all the important analysis, design and implementation decisions that must be made in developing and deploying a software-intensive system.


UML is a language for constructing: UML models can be directly connected to a variety of programming languages, and the UML is sufficiently expressive and free from ambiguity to permit the direct execution of models.

UML is a language for documenting: UML provides a variety of documents in addition to raw executable code. These artifacts include:

Requirements

Architecture

Design

Source Code

Project Plans

Tests

Prototypes

Goals of UML:

The primary goals in the design of the UML are:

Provide users with a ready-to-use, expressive visual modeling

language so they can develop and exchange meaningful models.

Provide extensibility and specialization mechanisms to extend the core

concepts.

Be independent of particular programming languages and development

processes.

Provide a formal basis for understanding the modeling language.

Encourage the growth of the OO tools market.

Support higher-level development concepts such as collaborations,

frameworks, patterns and components.


Uses of UML:

The UML is intended primarily for software-intensive systems. It has been used effectively in such domains as

1. Enterprise information systems

2. Banking and financial services

3. Telecommunications

4. Transportation

5. Defense/Aerospace

6. Retail

7. Medical electronics

8. Scientific fields

9. Distributed Web-based services

Rules of UML:

The UML has semantic rules for

Names: what we can call things, relationships and diagrams.

Scope: the context that gives specific meaning to a name.

Visibility: how those names can be seen and used by others.

Integrity: how things properly and consistently relate to one another.

Execution: what it means to run or simulate a dynamic model.

Building blocks of UML:

The vocabulary of the UML encompasses 3 kinds of building blocks:

1. Things

2. Relationships

3. Diagrams


1. Things: things are the abstractions that are first-class citizens in a model. Things are of 4 types

Structural Things

Behavioral Things

Grouping Things

Annotational Things

2. Relationships: Relationships tie the things together.

Relationships in the UML are

Dependency

Association

Generalization

Realization

3. Diagrams: Diagrams in the UML are of 2 types

Static Diagrams

Dynamic Diagrams

Static Diagrams consists of

Class Diagram

Object Diagram

Component Diagram

Deployment Diagram

Dynamic Diagrams consists of

Use case Diagram

Sequence Diagram

Collaboration Diagram

State chart Diagram

Activity Diagram


9.2 SOFTWARE ENVIRONMENT

OVERVIEW OF JAVA SCRIPT:

JavaScript is a general-purpose, prototype-based, object-oriented scripting language developed jointly by Sun and Netscape for the World Wide Web. It is designed to be embedded in diverse applications and systems without consuming much memory. JavaScript borrows most of its syntax from Java but also inherits from AWK and Perl, with some indirect influence from Self in its object prototype system.

JavaScript is dynamically typed: programs do not declare variable types, and the type of a variable is unrestricted and can change at runtime. Source can be generated at run time and evaluated against an arbitrary scope. Typical implementations compile by translating source into a specified byte-code format, to check syntax and source consistency. Note that the ability to generate and interpret programs at runtime implies the presence of a compiler at runtime.

JavaScript is a high-level scripting language that does not depend on or expose particular machine representations or operating-system services. It provides automatic storage management, typically using a garbage collector.

Features:

JavaScript is embedded into HTML documents and is executed within them.

JavaScript is browser dependent.


JavaScript is an interpreted language that is interpreted by the browser at runtime.

JavaScript is a loosely typed language.

JavaScript is an object-based language.

JavaScript is an event-driven language and supports event handlers, for example to specify the functionality of a button.

Advantages:

JavaScript can be used for client-side applications.

JavaScript provides a means to create multi-frame windows for presentation of the web.

JavaScript provides basic data validation before data is sent to the server, e.g. login and password checking, or checking whether the entered values are correct or whether all fields in a form are filled; this reduces network traffic.

It creates interactive forms and client-side lookup tables.


OVERVIEW OF JAVA/SERVLETS:

Java has had a major impact on the computing scene. Java can be easily incorporated into web systems and is capable of supporting animation, graphics, games and other special effects. The web has become more dynamic and interactive with the support of Java. We can run a Java program on a remote machine over the internet with the support of the web.

SERVLET:

Servlets are modules of software written in Java that extend and

enhance the functionality of a server. They are typically used for

request/response applications and have no graphical user interface.

They are widely used with HTTP servers

Servlets abilities

Allow 2-way interaction between a client and a server

Collaboration between people

Can handle multiple requests at the same time

Can synchronize requests

e.g., on-line conferencing

Forwarding requests

Balance load among several servers with similar content

Partition a single service over several servers according to task type

Benefits of servlets

Written in Java

Capable of running in-process

Compiled into Java byte codes

Run on every popular Web server

Durable


Remain in memory until destroyed

Unused servlets do not consume server resources

Easily deployed

Multithreaded

Protocol independent

Secure

Servlet API

The Servlet API is a stand-alone Java library, maintained as part of the Apache Tomcat project. It is a specification that defines the classes and interfaces used to create and execute servlets. It provides strong support for common HTTP functionality but does not sacrifice the ability to support other protocols.

JAVA SERVER PAGES (JSP):

Java Server Pages (JSP) is a technology that lets you mix regular,

static HTML with dynamically-generated HTML. Many Web pages that are

built by CGI (Common Gateway Interface) programs are mostly static, with the dynamic part limited to a few small locations. But most CGI variations,

including servlets, make you generate the entire page via your program,

even though most of it is always the same.

Features of JSP:

Portability:

Java Server Pages files can be run on any web server or web-enabled application server that provides support for them. Dubbed the JSP engine, this support involves recognition of, and management of, the Java Server Pages life cycle and its interaction components.

Components:

Java Server Pages architecture can include reusable java components.

The architecture also allows for the embedding of a scripting language

directly into a Java Server Pages file. The components currently supported include JavaBeans and Servlets.

Processing:

A Java Server Pages file is essentially an HTML document with JSP scripting or tags. A Java Server Pages file has a .jsp extension, which identifies it to the server as a Java Server Pages file. Before the page is served, the Java Server Pages syntax is parsed and processed into a servlet on the server side. The servlet that is generated outputs real content in straight HTML for responding to the client.

Access Models:

A Java Server Pages file may be accessed in at least two different ways. In one scenario, a client's request comes directly into a Java Server Page; the page accesses reusable JavaBean components that perform particular well-defined computations, such as accessing a database. The results of the Bean's computations, called result sets, are stored within the Bean as properties. The page uses the Beans to generate dynamic content and present it back to the client. In both of the above cases, the page could also contain any valid Java code. Java Server Pages encourages separation of content from presentation.

Step in the execution of a JSP Application:

1. The client sends a request to the web server for a JSP file by giving the name of the JSP file within the form tag of an HTML page.

2. This request is transferred to the Java Web Server. At the server side, when the Java Web Server receives the request for a JSP file, it gives the request to the JSP engine.

3. The JSP engine is a program which understands the tags of the JSP and converts those tags into a servlet program, which is stored at the server side. This servlet is loaded into memory and executed, and the result is given back to the Java Web Server, which then transfers it back to the client.

JDBC connectivity:

The JDBC API provides database-independent connectivity between the J2EE platform and a wide range of tabular data sources. The overall system is planned to be in the form of a distributed architecture with a homogeneous database platform. JDBC technology allows an Application Component Provider to:

Perform connection and authentication to a database server

Manage transactions

Move SQL statements to a database engine for preprocessing

and execution


Execute stored procedures

Inspect and modify the results from Select statements

JDBC Technology Drivers

To use the JDBC API with a particular database management system, you need a JDBC technology-based driver to mediate between JDBC technology and the database. Depending on various factors, a driver might be written purely in the Java programming language or in a mixture of the Java programming language and Java Native Interface (JNI) native methods.

There are four types of JDBC drivers, each having its own functionality. Note that they do not substitute for one another; each has its own suitability aspects. They are classified based on how they access data from the database.

1. Native JDBC Driver: a JDBC driver which is partly written in Java and mostly implemented using native methods to access the database. This is useful for Java applications that run only on some specific platforms. Writing this type is easier when compared to writing other drivers.

2. All-Java JDBC Net Driver: a JDBC net driver uses a common network protocol to connect to an intermediate server, which in turn employs native calls to connect to the database. This approach is suited to applets, where the request must go through the intermediate server.

3. JDBC-ODBC Bridge Driver: a bridge driver provided with JDBC can convert JDBC calls into equivalent ODBC calls using native methods. Since ODBC provides a connection to any ODBC-compliant database, connecting to a number of databases simultaneously is a very simple matter. This approach is a recommended one, since ODBC drivers, which are industry standards as of now, make it portable across databases.

4. Native-Protocol Pure Java Driver: a driver written entirely in Java that converts JDBC calls directly into the network protocol used by the database. The MySQL Connector/J driver used in this project (com.mysql.jdbc.Driver) is of this type.


OVERVIEW OF ORACLE:

Oracle is a popular relational database. It provides complete control over organizing data storage to obtain good performance, using indexing and clustering. It provides reliable and secure data management for applications ranging from small departmental systems to high-volume on-line transaction systems and query-intensive data warehouse applications. It also provides tools for systems management, the flexibility to distribute data efficiently, and the scalability to obtain optimal performance from computing resources. Oracle's open, standards-based Network Computing Architecture (NCA) enables companies to spend less time struggling with administration and more time deploying solutions.

Oracle also provides the following database management solutions.

Manage very large databases

Decision support system

Data warehousing applications

Features of Oracle:

Large amounts of data complicate administrative tasks and affect the availability of the database. To improve availability, ease administration and enhance query and DML performance, the Oracle Enterprise Edition allows tables and indexes to be partitioned, or broken up into smaller parts based on ranges of key values, for improved data warehouse performance. Oracle introduces new features that improve data warehousing performance:

Enhanced star-query processing

New parallel operations

Increased database size


Object-Relational Technology:

Corporate management of information becomes a difficult task of integrating different relational objects and different applications, possibly from different vendors, into a more coherent end-user data model. By enhancing the relational database with object extensions, Oracle addresses the need to simplify data modeling and extend the database with new data types. The new object-relational features include the following:

Object types and views

Calling external procedures from within the database

Client-side support for objects

Evolution of relational environments

Development tools for object modeling

Java

Extensibility

Migration and Interoperability:

A simple and fast migration utility rebuilds the data dictionary and converts the control files, log files and data blocks. The migration utility converts one Oracle database into another; Oracle database applications run unchanged against either of the Oracle products.

Other Enhancements:

Index-Organized tables

Reverse key indexes

Improved constraint processing

Two character sets in one database


9.3 CODING

Group.jsp

<%@ page import="java.sql.*,java.io.*,java.util.*,java.util.Date,java.text.SimpleDateFormat,java.text.ParseException,javax.servlet.*,javax.servlet.http.*,databaseconnection.*,com.oreilly.servlet.*" %>

<html>

<head>

<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />

<title>Organizing User Search Histories</title>

<link rel="stylesheet" type="text/css" href="style.css" />

<script language="JavaScript">

function valid()

{

var a = document.f.scat.value;

if(document.f.scat.selectedIndex==0)

{ alert("Please select a query group");

document.f.scat.focus();

return false;

}


}

</script>

</head>

<%

Thread.sleep(500);

%>

<body>

<div id="wrapper">

<div id="header">

<div id="logo">Organizing User Search Histories

</div>

<div id="nav">

<ul>

<li><a href="adminpage.jsp">Database Home</a></li>

<li><a href="financial.jsp">Add Details</a></li>

<li><a href="group.jsp">Query Groups</a></li>

<li><a href="signout.jsp">Logout</a></li>

</ul>

</div>

</div>

<div id="r1"><div id="r2"><div id="r3"><div id="r4"><div id="r5"><div id="r6"><div id="r7"><div id="r8"><div id="r9">

<div id="bar">


</div>

<div id="content">

<div id="leftcolumn">

<table align="center" width="90%">

<tr>

<td width="45%" height="523" valign="top">

<table align="left" width="793" style="border:1px solid #ddd;">

<tr>

<td width="785" height="98" class="paragraping">

<form name="f" method="get" action="groupcheck.jsp" onSubmit="return valid();">

Select Query Group:&nbsp;&nbsp;

<select name="scat" class="input">

<option value="">--------Select--------</option>

<option value="Pilgrims">Pilgrims</option>

<option value="Job Searching">Job Searching</option>

<option value="Sightseeing">Sightseeing</option>

<option value="Colleges">Colleges</option>

<option value="Marriage">Marriage</option>

<option value="Hotels">Hotels</option>

</select>

&nbsp;&nbsp;

<input type="submit" name="sub" value="Submit Query" id="button">&nbsp;&nbsp;<input type="reset" name="Clear" value="Reset" id="button">

</form>

</td>

</tr>

<tr>

<td height="250">&nbsp;</td>

</tr>

</table>

</td>

</tr>

</table>

</td>

</tr>

</table>

<div class="clear"></div>

</div>

</div>

<div class="clear"></div></div>

<div id="footer" align="center"></div>

</div>

</div></div></div></div></div></div></div></div>

</div>


</body>

</html>

Groupcheck.jsp

<%@ page import="java.sql.*,java.io.*,java.util.*,java.util.Date,java.text.SimpleDateFormat,java.text.ParseException,javax.servlet.*,javax.servlet.http.*,databaseconnection.*,com.oreilly.servlet.*" %>

<html>

<head>

<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />

<title>Organizing User Search Histories</title>

<link rel="stylesheet" type="text/css" href="style.css" />

<script type="text/javascript">

function displayDate()

{

document.getElementById("demo").innerHTML=Date();

}

</script>

</head>

<body>


<div id="wrapper">

<div id="header">

<div id="logo">Organizing User Search Histories

</div>

<div id="nav">

<ul>

<li><a href="adminpage.jsp">Database Home</a></li>

<li><a href="financial.jsp">Add Details</a></li>

<li><a href="group.jsp">Query Groups</a></li>

<li><a href="signout.jsp">Logout</a></li>

</ul></div>

</div>

<div id="r1"><div id="r2"><div id="r3"><div id="r4"><div id="r5"><div id="r6"><div id="r7"><div id="r8"><div id="r9">

<div id="bar">

</div>

<div id="content">

<div id="leftcolumn">

<table align="left" width="97%">

<tr>

<td width="67%" valign="top">

<table align="center" width="659" style="border:1px solid #ddd;">

<tr>


<td class="paragraping">Related Search</td>

</tr>

<%

String scat=request.getParameter("scat");

Connection con1=null;

PreparedStatement st1=null;

ResultSet rs1=null;

// a parameterized query avoids SQL injection through the scat parameter

String sql1="select * from financial where scat=?";

try

{

con1=databasecon.getconnection();

st1=con1.prepareStatement(sql1);

st1.setString(1,scat);

rs1=st1.executeQuery();

while(rs1.next())

{

%>

<tr style="font-family:verdana;font-size:12px;" class="paragraping">

<td height="45">&nbsp;&nbsp;<img src="images/ok.png">&nbsp;&nbsp;<%out.println(rs1.getString(2));%><br>

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<font color="#006FDD"><%out.println(rs1.getString(3));%></font><br>


&nbsp;Title:&nbsp;<font color="#006600"><%out.println(rs1.getString(4));%></font><br>

Description:&nbsp;<font color="red"><%out.println(rs1.getString(5));%></font><br>

</td>

</tr>

<tr>

<td>--------------------------------------------------------------------------------------------------------------------------</td>

</tr>

<%

}

}

catch(SQLException e1)

{

out.println("Your given query didn't match to our database");

System.out.println(e1);

}

finally

{

st1.close();

con1.close();

}

%>


</table>

</td>

<td width="33%" align="justify" valign="top">

</td>

</tr>

</table>

</td>

</tr>

</table>

<div class="clear"></div>

</div>

</div>

<div class="clear"></div></div>

<div id="footer" align="center"></div>

</div>

</div></div></div></div></div></div></div></div>

</div>

</body>

</html>

User.jsp

<%@ page import="java.sql.*,java.io.*,java.util.*,javax.servlet.*,javax.servlet.http.*,java.text.SimpleDateFormat,databaseconnection.*,com.oreilly.servlet.*" %>

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">

<html>

<head>

<title></title>

<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">

</head>

<body>

<%

Connection con=null;

PreparedStatement psmt1=null;

String a = request.getParameter("uid");

String b = request.getParameter("name");

String c = request.getParameter("user");

String d = request.getParameter("pass");

String e = request.getParameter("email");

String f = request.getParameter("mobile");

String g = request.getParameter("date");

try{

con=databasecon.getconnection();


psmt1=con.prepareStatement("insert into signup(uid,name,user,pass,email,mobile,date) values(?,?,?,?,?,?,?)");

psmt1.setString(1,a);

psmt1.setString(2,b);

psmt1.setString(3,c);

psmt1.setString(4,d);

psmt1.setString(5,e);

psmt1.setString(6,f);

psmt1.setString(7,g);

psmt1.executeUpdate();

response.sendRedirect("register.jsp?message=success");

}

catch(Exception ex)

{

out.println("Error in connection : "+ex);

} %>

</body>

</html>

Databasecon.java

package databaseconnection;

import java.sql.*;

public class databasecon

{

static Connection con;


public static Connection getconnection()

{

try

{

Class.forName("com.mysql.jdbc.Driver");

con = DriverManager.getConnection("jdbc:mysql://localhost:3306/history","root","admin");

}

catch(Exception e)

{

System.out.println("Database connection error: "+e);

}

return con;

}

}

UserRegister.jsp

<%@ page import="java.sql.*,java.io.*,java.util.*,java.util.Date,java.text.SimpleDateFormat,java.text.ParseException,javax.servlet.*,javax.servlet.http.*,databaseconnection.*,com.oreilly.servlet.*" %>

<html>

<head>


<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />

<title>Organizing User Search Histories</title>

<link rel="stylesheet" type="text/css" href="style.css" />

<script language="JavaScript">

function valid()

{

var a = document.f.name.value;

var b = document.f.user.value;

var c = document.f.pass.value;

var d = document.f.email;

var e = document.f.mobile.value;

if(a=="")

{

alert("Enter your Name");

document.f.name.focus();

return false;

}

if(b=="")

{

alert("Enter your Username");

document.f.user.focus();

return false;

}


if(c=="")

{

alert("Enter your Password");

document.f.pass.focus();

return false;

}

if (d.value == "")

{

window.alert("Please enter a valid e-mail address.");

d.focus();

return false;

}

if (d.value.indexOf("@", 0) < 0)

{

window.alert("Please enter a valid e-mail address.");

d.focus();

return false;

}

if (d.value.indexOf(".", 0) < 0)

{

window.alert("Please enter a valid e-mail address.");

d.focus();

return false;


}

if(e=="")

{

alert("Please enter the Mobile number");

document.f.mobile.focus();

return false;

}

if(isNaN(e))

{

alert("Please enter the Correct Mobile number");

document.f.mobile.focus();

return false;

}

if (e.length!=10)

{

alert("Enter 10 Integers");

document.f.mobile.focus();

return false;

}

}

</script>

</head>

<body>


<div id="wrapper">

<div id="header">

<div id="logo">Organizing User Search Histories

</div>

<div id="nav">

<ul>

<li><a href="index.html">Home</a></li>

<li><a href="register.jsp">Register</a></li>

<li><a href="userlogin.jsp">User Login</a></li>

<li><a href="admin.jsp">Database Login</a></li>

</ul>

</div>

</div>

<div id="r1"><div id="r2"><div id="r3"><div id="r4"><div id="r5"><div id="r6"><div id="r7"><div id="r8"><div id="r9">

<div id="bar">

</div>

<div id="content">

<div id="leftcolumn">

<table align="center" width="90%">

<tr>

<td valign="top">

<%


java.util.Date now = new java.util.Date();

String DATE_FORMAT1 = "dd/MM/yyyy";

SimpleDateFormat sdf1 = new SimpleDateFormat(DATE_FORMAT1);

String strDateNew1 = sdf1.format(now);

String u=null;int u2=0,u1=0;

try

{

Connection con=databasecon.getconnection();

PreparedStatement ps=con.prepareStatement("select * from signup");

ResultSet rs=ps.executeQuery();

while(rs.next())

{

u=rs.getString("uid");

}

if(u==null)

{

u2=u1+1;

}

else

{

u1=Integer.parseInt(u);

u2=u1+1;

}


String u3=Integer.toString(u2);

session.setAttribute("u3",u3);

%>

<table align="center" width="506" style="border:1px solid green;">

<form name="f" action="user.jsp" method="post" onSubmit="return valid();">

<tr>

<td height="35" colspan="2" align="center" bgcolor="#FFFFCC" class="paragraping">User Signup Here</td>

</tr>

<tr>

<td class="paragraping" colspan="2" align="center"><font size="2"><b><%

String message=request.getParameter("message");

if(message!=null && message.equalsIgnoreCase("success"))

{

out.println("<font color='Green'>Successfully Registered</font>");

}

%></b></font></td>

</tr>

<tr>

<td width="147" height="44" class="paragraping">ID:</td>

<td width="347"><input type="text" name="uid" value="<%=u3%>" class="input"></td>

</tr>


<tr>

<td height="41" class="paragraping">Name:</td>

<td><input type="text" name="name" value="" class="input"></td>

</tr>

<tr>

<td height="43" class="paragraping">Username:</td>

<td><input type="text" name="user" value="" class="input"></td>

</tr>

<tr>

<td height="39" class="paragraping">Password:</td>

<td><input type="password" name="pass" value="" class="input"></td>

</tr>

<tr>

<td height="45" class="paragraping">Email:</td>

<td><input type="text" name="email" value="" class="input"></td>

</tr>

<tr>

<td height="41" class="paragraping">Mobile:</td>

<td><input type="text" name="mobile" value="" class="input"></td>

</tr>

<tr>

<td height="40" class="paragraping">Date:</td>


<td><input type="text" name="date" value="<%=strDateNew1%>" class="input"></td>

</tr>

<tr>

<td height="40"></td>

<td><input type="submit" name="sub" value="Submit" id="button">&nbsp;&nbsp;<input type="reset" name="clear" value="Clear" id="button">

</td>

</tr>

</form>

</table>

<%

}

catch(Exception e1)

{

out.println(e1.getMessage());

}

%>

</td>

<td valign="top" align="justify">&nbsp;</td>

</tr>

</table>

</td>


</tr>

</table>

<div class="clear"></div>

</div>

</div>

<div class="clear"></div></div>

<div id="footer" align="center"></div>

</div>

</div></div></div></div></div></div></div></div>

</div>

</body>

</html>


9.4 List of Figures

S. No Name of the Figure Page No

4.1 Class Diagram 15

4.2 Usecase Diagram 16

4.3 Usecase Diagram 17

4.4 Sequence Diagram 18

5.1 Query Grouping 20

5.2 Search History 21

9.1 Home Page 69

9.2 Input Query 70

9.3 Query Result 71

9.4 Public History 72

9.5 Registration Page 73

9.6 User Login Page 74

9.7 User Home Page 75

9.8 User Details Page 76

9.9 User Search History by Date 77

9.10 Database Login Page 78

9.11 Database Home Page 79

9.12 User Details 80

9.13 Add Travelling Details 81

9.14 View History Graph 82

9.15 Password Update 83

9.16 Query Groups 84


9.5 List of Tables

S. No Name of the table Page No.

9.1.1 Travelling Database Structure 85

9.1.2 Travelling Database 86

9.1.3 Registration Database Structure 87

9.1.4 Registration Database 87

9.1.5 User History Database Structure 88

9.1.6 User History Database 88

9.1.7 Public History Database Structure 89

9.1.8 Public History Database 89


9.6 LIST OF ABBREVIATIONS

UML UNIFIED MODELING LANGUAGE

SQL STRUCTURED QUERY LANGUAGE

DBMS DATABASE MANAGEMENT SYSTEM

JDBC JAVA DATABASE CONNECTIVITY

HTML HYPERTEXT MARKUP LANGUAGE

CSS CASCADING STYLE SHEETS

RDBMS RELATIONAL DATABASE MANAGEMENT SYSTEM


9.7 SCREENSHOTS

Homepage:

Fig. 9.1 Home page


Home Page-Input Query:

Fig. 9.2 Input Query


Home Page-Output Results:

Fig. 9.3 Query Result


Home Page –View Public History:

Fig. 9.4 Public History


User Registration Page:

Fig. 9.5 Registration Page


User Login Page:

Fig. 9.6 User Login Page


User Home Page:

Fig. 9.7 User Home Page


User Update Details:

Fig. 9.8 User Details Page


User Search History:

Fig. 9.9 User Search History by Date


Database Login Page:

Fig. 9.10 Database Login Page


Database Homepage:

Fig. 9.11 Database Home Page


Database Page – User Details:

Fig. 9.12 User Details


Database Page – Add Travelling Details:

Fig. 9.13 Add Travelling Details


Database Page – View History Graph:

Fig. 9.14 View History Graph


Database Password Update:

Fig. 9.15 Password Update


Database Query Groups:

Fig. 9.16 Query Groups


Travelling Database Structure:

Table 9.1.1 Travelling Database Structure


Travelling Database:

Table 9.1.2 Travelling Database


Registration Database Structure:

Table 9.1.3 Registration Database Structure

Registration Database:

Table 9.1.4 Registration Database


User History Database Structure:

Table 9.1.5 User History Database Structure

User History Database:

Table 9.1.6 User History Database


Public History Database Structure:

Table 9.1.7 Public History Database Structure

Public History Database:

Table 9.1.8 Public History Database


REFERENCES

1. J. Teevan, E. Adar, R. Jones, and M.A.S. Potts, "Information Re-Retrieval: Repeat Queries in Yahoo's Logs," Proc. 30th Ann. Int'l ACM SIGIR Conf. Research and Development in Information Retrieval (SIGIR '07), pp. 151-158, 2007.

2. A. Spink, M. Park, B.J. Jansen, and J. Pedersen, "Multitasking during Web Search Sessions," Information Processing and Management, vol. 42, no. 1, pp. 264-275, 2006.

3. P. Boldi, F. Bonchi, C. Castillo, D. Donato, A. Gionis, and S. Vigna, "The Query-Flow Graph: Model and Applications," Proc. 17th ACM Conf. Information and Knowledge Management (CIKM), 2008.

4. J. Yi and F. Maghoul, "Query Clustering Using Click-through Graph," Proc. 18th Int'l Conf. World Wide Web (WWW '09), 2009.

5. N. Craswell and M. Szummer, "Random Walks on the Click Graph," Proc. 30th Ann. Int'l ACM SIGIR Conf. Research and Development in Information Retrieval (SIGIR '07), 2007.

Organizing User Search Histories

Heasoo Hwang, Hady W. Lauw, Lise Getoor, and Alexandros Ntoulas

Abstract—Users are increasingly pursuing complex task-oriented goals on the web, such as making travel arrangements, managing finances, or planning purchases. To this end, they usually break down the tasks into a few codependent steps and issue multiple queries around these steps repeatedly over long periods of time. To better support users in their long-term information quests on the web, search engines keep track of their queries and clicks while searching online. In this paper, we study the problem of organizing a user's historical queries into groups in a dynamic and automated fashion. Automatically identifying query groups is helpful for a number of different search engine components and applications, such as query suggestions, result ranking, query alterations, sessionization, and collaborative search. In our approach, we go beyond approaches that rely on textual similarity or time thresholds, and we propose a more robust approach that leverages search query logs. We experimentally study the performance of different techniques, and showcase their potential, especially when combined together.

Index Terms—User history, search history, query clustering, query reformulation, click graph, task identification.


1 INTRODUCTION

As the size and richness of information on the web grows, so does the variety and the complexity of tasks that users try to accomplish online. Users are no longer content with issuing simple navigational queries. Various studies on query logs (e.g., Yahoo's [1] and AltaVista's [2]) reveal that only about 20 percent of queries are navigational. The rest are informational or transactional in nature. This is because users now pursue much broader informational and task-oriented goals such as arranging for future travel, managing their finances, or planning their purchase decisions. However, the primary means of accessing information online is still through keyword queries to a search engine. A complex task such as travel arrangement has to be broken down into a number of codependent steps over a period of time. For instance, a user may first search on possible destinations, timeline, events, etc. After deciding when and where to go, the user may then search for the most suitable arrangements for air tickets, rental cars, lodging, meals, etc. Each step requires one or more queries, and each query results in one or more clicks on relevant pages.

One important step toward enabling services and features that can help users during their complex search quests online is the capability to identify and group related queries together. Recently, some of the major search engines have introduced a new "Search History" feature, which allows users to track their online searches by recording their queries and clicks. For example, Fig. 1 illustrates a portion of a user's history as it is shown by the Bing search engine in February 2010. This history includes a sequence of four queries displayed in reverse chronological order together with their corresponding clicks. In addition to viewing their search history, users can manipulate it by manually editing and organizing related queries and clicks into groups, or by sharing them with their friends. While these features are helpful, the manual efforts involved can be disruptive and will be untenable as the search history gets longer over time.

In fact, identifying groups of related queries has applications beyond helping the users to make sense of and keep track of queries and clicks in their search history. First and foremost, query grouping allows the search engine to better understand a user's session and potentially tailor that user's search experience according to her needs. Once query groups have been identified, search engines can have a good representation of the search context behind the current query using queries and clicks in the corresponding query group. This will help to improve the quality of key components of search engines such as query suggestions, result ranking, query alterations, sessionization, and collaborative search. For example, if a search engine knows that a current query "financial statement" belongs to a {"bank of america," "financial statement"} query group, it can boost the rank of the page that provides information about how to get a Bank of America statement instead of the Wikipedia article on "financial statement," or the pages related to financial statements from other banks.

Query grouping can also assist other users by promoting task-level collaborative search. For instance, given a set of query groups created by expert users, we can select the ones that are highly relevant to the current user's query activity and recommend them to her. Explicit collaborative search can also be performed by allowing users in a trusted community to find, share, and merge relevant query groups to perform larger, long-term tasks on the web.

912 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 24, NO. 5, MAY 2012

. H. Hwang is with the Samsung Advanced Institute of Technology, Yongin-si, Gyeonggi-do 446-712, South Korea. E-mail: [email protected].

. H.W. Lauw is with the Institute for Infocomm Research, 1 Fusionopolis Way, #21-01 Connexis (South Tower), Singapore 138632. E-mail: [email protected].

. L. Getoor is with the Department of Computer Science, University of Maryland, AV Williams Bldg, Rm 3217, College Park, MD 20742. E-mail: [email protected].

. A. Ntoulas is with Microsoft Research, Silicon Valley, 1065 La Avenida St, SVC-6/1040, Mountain View, CA 94043. E-mail: [email protected].

Manuscript received 20 Mar. 2010; revised 4 Oct. 2010; accepted 12 Nov. 2010; published online 21 Dec. 2010. Recommended for acceptance by R. Kumar. For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference IEEECS Log Number TKDE-2010-03-0169. Digital Object Identifier no. 10.1109/TKDE.2010.251.

1041-4347/12/$31.00 © 2012 IEEE. Published by the IEEE Computer Society.


In this paper, we study the problem of organizing a user's search history into a set of query groups in an automated and dynamic fashion. Each query group is a collection of queries by the same user that are relevant to each other around a common information need. These query groups are dynamically updated as the user issues new queries, and new query groups may be created over time. To better illustrate our goal, we show in Fig. 2a a set of queries from the activity of a real user on the Bing search engine over the period of one day, together with the corresponding query groups in Fig. 2b: the first query group contains all the queries that are related to saturn automobiles. The other groups, respectively, pertain to barbados vacation, sprint phone, financials, and Wii game console.

Organizing the query groups within a user's history is challenging for a number of reasons. First, related queries may not appear close to one another, as a search task may span days or even weeks. This is further complicated by the interleaving of queries and clicks from different search tasks due to users' multitasking [3], opening multiple browser tabs, and frequently changing search topics. For instance, in Fig. 2a, the related queries "hybrid saturn vue" and "saturn dealers" are separated by many unrelated queries. This limits the effectiveness of approaches relying on time or sequence to identify related queries. Second, related queries may not be textually similar. For example, in Fig. 2b, the related queries "tripadvisor barbados" and "caribbean cruise" in Group 2 have no words in common. Therefore, relying solely on string similarity is also insufficient. Finally, as users may also manually alter their respective query groups, any automated query grouping has to respect the manual efforts or edits by the users.

To achieve more effective and robust query grouping, we do not rely solely on textual or temporal properties of queries. Instead, we leverage search behavioral data as captured within a commercial search engine's log. In particular, we develop an online query grouping method over the query fusion graph that combines a probabilistic query reformulation graph, which captures the relationship between queries frequently issued together by the users, and a query click graph, which captures the relationship between queries frequently leading to clicks on similar URLs. Related to our problem are the problems of session identification [4], [5] and query clustering [6], [7], which have also used similar graphs in the past. We extend previous work in two ways. First, we use information from both the query reformulation graph and the query click graph in order to better capture


Fig. 2. Search history of a real user over the period of one day together with the query groups.

Fig. 1. Example of search history feature in Bing.


various important signals of query relevance. Second, we follow an unsupervised approach where we do not require training data to bootstrap our model.

In this paper, we make the following contributions:

. We motivate and propose a method to perform query grouping in a dynamic fashion. Our goal is to ensure good performance while avoiding disruption of existing user-defined query groups.

. We investigate how signals from search logs such as query reformulations and clicks can be used together to determine the relevance among query groups. We study two potential ways of using clicks in order to enhance this process: 1) by fusing the query reformulation graph and the query click graph into a single graph that we refer to as the query fusion graph, and 2) by expanding the query set when computing relevance to also include other queries with similar clicked URLs.

. We show through comprehensive experimental evaluation the effectiveness and the robustness of our proposed search log-based method, especially when combined with approaches using other signals such as text similarity.

The rest of the paper is organized as follows. In Section 2, we state the goal of our paper, identifying query groups in a search history, and provide an overview of our solution. In Section 3, we discuss how we can construct the query reformulation graph and the query click graph from search logs, and how to use them to determine relevance between queries or query groups within a user's history. In Section 4, we describe our algorithm to perform query grouping using the notion of relevance based on search logs. In Section 5, we present our experimental evaluation results. In Section 6, we review the related work, and we conclude with a discussion of our results and future research directions in Section 7.

2 PRELIMINARIES

2.1 Goal

Our goal is to automatically organize a user's search history into query groups, each containing one or more related queries and their corresponding clicks. Each query group corresponds to an atomic information need that may require a small number of queries and clicks related to the same search goal. For example, in the case of navigational queries, a query group may involve as few as one query and one click (e.g., "cnn" and www.cnn.com). For broader informational queries, a query group may involve a few queries and clicks (e.g., Group 5 queries in Fig. 2b are all about where to buy Wii console and games). This definition of query groups follows closely the definition of search goals given in [4].

Definition 2.1 (Query Group). A query group is an ordered list of queries, qi, together with the corresponding set of clicked URLs, clki, of qi. A query group is denoted as s = ⟨{q1, clk1}, ..., {qk, clkk}⟩.

The specific formulation of our problem is as follows:

. Given: a set of existing query groups of a user, S = {s1, s2, ..., sn}, and her current query and clicks, {qc, clkc},

. Find: the query group for {qc, clkc}, which is either one of the existing query groups in S that it is most related to, or a new query group sc = {qc, clkc} if there does not exist a query group in S that is sufficiently related to {qc, clkc}.

Below, we will motivate the dynamic nature of this formulation, and give an overview of the solution. The core of the solution is a measure of relevance between two queries (or query groups). We will further motivate the need to go beyond baseline relevance measures that rely on time or text, and instead propose a relevance measure based on signals from search logs.

2.2 Dynamic Query Grouping

One approach to the identification of query groups is to first treat every query in a user's history as a singleton query group, and then merge these singleton query groups in an iterative fashion (in a k-means or agglomerative way [8]). However, this is impractical in our scenario for two reasons. First, it may have the undesirable effect of changing a user's existing query groups, potentially undoing the user's own manual efforts in organizing her history. Second, it involves a high computational cost, since we would have to repeat a large number of query group similarity computations for every new query.

As in online clustering algorithms [9], we perform the grouping in a similar dynamic fashion, whereby we first place the current query and clicks into a singleton query group sc = {qc, clkc}, and then compare it with each existing query group si within a user's history (i.e., si ∈ S). The overall process of identifying query groups is presented in Fig. 3. Given sc, we determine if there are existing query groups sufficiently relevant to sc. If so, we merge sc with the query group s having the highest similarity σmax above or equal to the threshold τsim. Otherwise, we keep sc as a new singleton query group and insert it into S.
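The dynamic grouping loop described above (and formalized in Fig. 3) can be sketched as follows. This is an illustrative Python rendering, not the authors' implementation: the relevance function sim and the merge threshold tau_sim are supplied by the caller, and a query group is modeled simply as a list of (query, clicked-URLs) pairs.

```python
def assign_to_group(groups, qc_clkc, sim, tau_sim):
    """Place the singleton group {qc, clkc} into the most relevant
    existing query group, or keep it as a new group (cf. Fig. 3)."""
    sc = [qc_clkc]                       # singleton query group for the new query
    best, sigma_max = None, 0.0
    for si in groups:                    # compare against every existing group
        s = sim(sc, si)
        if s > sigma_max:
            best, sigma_max = si, s
    if best is not None and sigma_max >= tau_sim:
        best.append(qc_clkc)             # merge into the most relevant group
    else:
        groups.append(sc)                # otherwise start a new query group
    return groups
```

Here sim can be any of the relevance measures discussed below (time, text, or the log-based measure of Section 3).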


Fig. 3. Algorithm for selecting the query group that is the most similar to the given query and clicked URLs.


2.3 Query (or Query Group) Relevance

To ensure that each query group contains closely related and relevant queries and clicks, it is important to have a suitable relevance measure sim between the current query singleton group sc and an existing query group si ∈ S. There are a number of possible approaches to determine the relevance between sc and si. Below, we outline a number of different relevance metrics that we will later use as baselines in experiments (see Section 5). We will also discuss the pros and cons of such metrics as well as our proposed approach of using search logs (see Section 3).

Time. One may assume that sc and si are somehow relevant if the queries appear close to each other in time in the user's history. In other words, we assume that users generally issue very similar queries and clicks within a short period of time. In this case, we define the following time-based relevance metric simtime that can be used in place of sim in Fig. 3.

Definition 2.2 (Time). simtime(sc, si) is defined as the inverse of the time interval (e.g., in seconds) between the times that qc and qi are issued, as follows:

simtime(sc, si) = 1 / |time(qc) − time(qi)|.

The queries qc and qi are the most recent queries in sc and si, respectively. Higher simtime values imply that the queries are temporally closer.

Text. On a different note, we may assume that two query groups are similar if their queries are textually similar. Textual similarity between two sets of words can be measured by metrics such as the fraction of overlapping words (Jaccard similarity [10]) or characters (Levenshtein similarity [11]). We can thus define the following two text-based relevance metrics that can be used in place of sim in Fig. 3.

Definition 2.3 (Jaccard). simjaccard(sc, si) is defined as the fraction of common words between qc and qi as follows:

simjaccard(sc, si) = |words(qc) ∩ words(qi)| / |words(qc) ∪ words(qi)|.

Definition 2.4 (Levenshtein). simedit(sc, si) is defined as 1 − distedit(qc, qi). The edit distance distedit is the number of character insertions, deletions, or substitutions required to transform one sequence of characters into another, normalized by the length of the longer character sequence (see [11] for more details).
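Both text baselines (Definitions 2.3 and 2.4) are easy to make concrete. A minimal sketch (the function names are ours, not from the paper):

```python
def sim_jaccard(qc, qi):
    """Fraction of common words between the two queries (Definition 2.3)."""
    wc, wi = set(qc.split()), set(qi.split())
    return len(wc & wi) / len(wc | wi) if wc | wi else 0.0

def sim_edit(qc, qi):
    """1 - normalized Levenshtein distance (Definition 2.4)."""
    m, n = len(qc), len(qi)
    # classic dynamic-programming edit distance, row by row
    prev = list(range(n + 1))
    for r in range(1, m + 1):
        cur = [r] + [0] * n
        for c in range(1, n + 1):
            cur[c] = min(prev[c] + 1,                             # deletion
                         cur[c - 1] + 1,                          # insertion
                         prev[c - 1] + (qc[r - 1] != qi[c - 1]))  # substitution
        prev = cur
    return 1.0 - prev[n] / max(m, n, 1)
```

As the paper notes, these capture "ipod" vs. "apple ipod" but score "ipod" vs. "apple store" as zero.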

Although the above time-based and text-based relevance metrics may work well in some cases, they cannot capture certain aspects of query similarity. For instance, simtime assumes that a query is always followed by a related query. However, this may not be the case when the user is multitasking (i.e., having more than one tab open in her browser, or digressing to an irrelevant topic and then resuming her searches). Similarly, the text-based metrics, simjaccard and simedit, can capture the relevance between query groups around textually similar queries such as "ipod" and "apple ipod," but will fail to identify relevant query groups around queries such as "ipod" and "apple store," since they are not textually similar. Additionally, the text-based metrics may mistakenly identify query groups around, say, "jaguar car manufacturer" and "jaguar animal reserve" as relevant, since they share some common text.

Therefore, we need a relevance measure that is robust enough to identify similar query groups beyond the approaches that simply rely on the textual content of queries or the time interval between them. Our approach makes use of search logs in order to determine the relevance between query groups more effectively. In fact, the search history of a large number of users contains signals regarding query relevance, such as which queries tend to be issued closely together (query reformulations), and which queries tend to lead to clicks on similar URLs (query clicks). Such signals are user generated and are likely to be more robust, especially when considered at scale. We suggest measuring the relevance between query groups by exploiting the query logs and the click logs simultaneously. We will discuss our proposed relevance measure in greater detail in Sections 3 and 4.

In fact, the idea of making use of signals in query logs to measure similarity between queries has been explored in previous work, although not to the same extent as our proposed approach. Here, we outline two such methods, Co-Retrieval (CoR) and the Asymmetric Traveling Salesman Problem (ATSP), which will also be compared against in our experimental section (see Section 5).

CoR. CoR is based on the principle that a pair of queries are similar if they tend to retrieve similar pages on a search engine. This approach is similar to the ones discussed in [12], [13].

Definition 2.5 (CoR). simcor(sc, si) is the Jaccard coefficient of qc's set of retrieved pages retrieved(qc) and qi's set of retrieved pages retrieved(qi), and is defined as:

simcor(sc, si) = |retrieved(qc) ∩ retrieved(qi)| / |retrieved(qc) ∪ retrieved(qi)|.

Unlike [12], which relies on textual comparison, we compare two queries based on the overlap in pages retrieved. We consider a page to be retrieved by a search engine if it has not only been shown to some users, but has also been clicked at least once in the past one year. Notice that this is a stronger definition that favors CoR as a baseline because of the relevance signals in the form of clicks. Differently from our approach, CoR makes use of neither reformulation signals (whether one query frequently follows another) nor click signals (whether queries frequently lead to clicks on similar pages).

ATSP. This technique is based on the principle that two queries issued in succession in the search logs are closely related. In [5], the authors present a solution that first reorders a sequence of user queries to group similar queries together by solving an instance of the ATSP. Once the queries are reordered, query groups are generated by determining "cut points" in the chain of queries, i.e., two successive queries whose similarity is less than a threshold τ. Note that ATSP needs to operate on the whole set of queries that we are interested in grouping, as it involves an initial reordering step.

Definition 2.6 (ATSP). simATSP(sc, si) is defined as the number of times the two queries, qc and qi, appear in succession in the search logs over the number of times qc appears. More formally:

simATSP(sc, si) = freq(qc, qi) / freq(qc).
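The two log-based baselines can likewise be sketched directly from their definitions. This is an illustrative simplification: the flat list-of-queries log format and the function names are ours, and a production system would compute these counts over a full search log rather than a Python list.

```python
def sim_cor(retrieved_c, retrieved_i):
    """Jaccard coefficient of the two retrieved-page sets (Definition 2.5)."""
    union = retrieved_c | retrieved_i
    return len(retrieved_c & retrieved_i) / len(union) if union else 0.0

def sim_atsp(log, qc, qi):
    """freq(qc, qi) / freq(qc): how often qi directly follows qc
    in the search log (Definition 2.6)."""
    succ = sum(1 for a, b in zip(log, log[1:]) if (a, b) == (qc, qi))
    freq_qc = log.count(qc)
    return succ / freq_qc if freq_qc else 0.0
```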

In our work, we consider both query pairs having common clicked URLs and query reformulations, through a combined query fusion graph.

3 QUERY RELEVANCE USING SEARCH LOGS

We now develop the machinery to define the query relevance based on web search logs. Our measure of relevance is aimed at capturing two important properties of relevant queries, namely: 1) queries that frequently appear together as reformulations and 2) queries that have induced the users to click on similar sets of pages. We start our discussion by introducing three search behavior graphs that capture the aforementioned properties. Following that, we show how we can use these graphs to compute query relevance and how we can incorporate the clicks following a user's query in order to enhance our relevance metric.

3.1 Search Behavior Graphs

We derive three types of graphs from the search logs of a commercial search engine. The query reformulation graph, QRG, represents the relationship between a pair of queries that are likely reformulations of each other. The query click graph, QCG, represents the relationship between two queries that frequently lead to clicks on similar URLs. The query fusion graph, QFG, merges the information in the previous two graphs. All three graphs are defined over the same set of vertices VQ, consisting of queries which appear in at least one of the graphs, but their edges are defined differently.

3.1.1 Query Reformulation Graph

One way to identify relevant queries is to consider query reformulations that are typically found within the query logs of a search engine. If two queries that are issued consecutively by many users occur frequently enough, they are likely to be reformulations of each other. To measure the relevance between two queries issued by a user, the time-based metric, simtime, makes use of the interval between the timestamps of the queries within the user's search history. In contrast, our approach is defined by the statistical frequency with which two queries appear next to each other in the entire query log, over all of the users of the system.

To this end, based on the query logs, we construct the query reformulation graph, QRG = (VQ, EQR), whose set of edges, EQR, is constructed as follows: for each query pair (qi, qj), where qi is issued before qj within a user's day of activity, we count the number of such occurrences across all users' daily activities in the query logs, denoted countr(qi, qj). Assuming infrequent query pairs are not good reformulations of each other, we filter out infrequent pairs and include only the query pairs whose counts exceed a threshold value, τr. For each (qi, qj) with countr(qi, qj) ≥ τr, we add a directed edge from qi to qj to EQR. The edge weight, wr(qi, qj), is defined as the normalized count of the query transitions:

wr(qi, qj) := countr(qi, qj) / Σ_(qi,qk)∈EQR countr(qi, qk).
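A toy version of this construction, with pair counting over user-days, thresholding, and row normalization, might look like the following. The per-day session format, the name build_qrg, and counting every ordered pair within a day are illustrative assumptions; the paper works over a full commercial search log.

```python
from collections import Counter, defaultdict

def build_qrg(daily_sessions, tau_r):
    """Count ordered query pairs (qi issued before qj) within each
    user-day, drop pairs below the threshold tau_r, and row-normalize
    to obtain the reformulation weights w_r(qi, qj)."""
    count_r = Counter()
    for day in daily_sessions:           # one list of queries per user-day
        for i, qi in enumerate(day):
            for qj in day[i + 1:]:
                if qi != qj:
                    count_r[(qi, qj)] += 1
    # keep only sufficiently frequent pairs
    kept = {pair: c for pair, c in count_r.items() if c >= tau_r}
    totals = defaultdict(int)            # out-going count per source query
    for (qi, _), c in kept.items():
        totals[qi] += c
    return {(qi, qj): c / totals[qi] for (qi, qj), c in kept.items()}
```

By construction, the weights on the surviving out-edges of each query sum to 1.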

3.1.2 Query Click Graph

A different way to capture relevant queries from the search logs is to consider queries that are likely to induce users to click frequently on the same set of URLs. For example, although the queries "ipod" and "apple store" do not share any text or appear temporally close in a user's search history, they are relevant because they are likely to have resulted in clicks about the ipod product. In order to capture such a property of relevant queries, we construct a graph called the query click graph, QCG.

We first start by considering a bipartite click-through graph, CG = (VQ ∪ VU, EC), used by Fuxman et al. [14]. CG has two distinct sets of nodes corresponding to queries, VQ, and URLs, VU, extracted from the click logs. There is an edge (qi, uk) ∈ EC if query qi was issued and URL uk was clicked by some users. We weight each edge (qi, uk) by the number of times qi was issued and uk was clicked, countc(qi, uk). As before, we filter out infrequent pairs using a threshold τc. In this way, using the CG, we identify pairs of queries that frequently lead to clicks on similar URLs.

Next, from CG, we derive our query click graph, QCG = (VQ, EQC), where the vertices are the queries, and a directed edge from qi to qj exists if there exists at least one URL, uk, that both qi and qj link to in CG. The weight of edge (qi, qj) in QCG, wc(qi, qj), is defined as the weighted asymmetric Jaccard similarity [10] as follows:

wc(qi, qj) = Σ_uk min(countc(qi, uk), countc(qj, uk)) / Σ_uk countc(qi, uk).

This captures the intuition that qj is more related to qi if more of qi's clicks fall on the URLs that are also clicked for qj.
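The weighted asymmetric Jaccard weight can be computed directly from a table of click counts. A minimal sketch under an assumed representation (count_c maps (query, URL) pairs to click counts; the function name is ours):

```python
def qcg_weight(count_c, qi, qj):
    """Weighted asymmetric Jaccard w_c(qi, qj):
    sum of min click counts on shared URLs, over qi's total clicks."""
    urls_i = {u for (q, u) in count_c if q == qi}
    urls_j = {u for (q, u) in count_c if q == qj}
    total_i = sum(count_c[(qi, u)] for u in urls_i)
    if total_i == 0:
        return 0.0
    shared = sum(min(count_c[(qi, u)], count_c[(qj, u)])
                 for u in urls_i & urls_j)
    return shared / total_i
```

Note the asymmetry: the same shared clicks are a large fraction of a rare query's clicks but a small fraction of a popular query's clicks.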

3.1.3 Query Fusion Graph

The query reformulation graph, QRG, and the query click graph, QCG, capture two important properties of relevant queries, respectively. In order to make more effective use of both properties, we combine the query reformulation information within QRG and the query-click information within QCG into a single graph, QFG = (VQ, EQF), that we refer to as the query fusion graph. At a high level, EQF contains the set of edges that exist in either EQR or EQC. The weight of edge (qi, qj) in QFG, wf(qi, qj), is taken to be a linear sum of the edge's weights, wr(qi, qj) in EQR and wc(qi, qj) in EQC, as follows:

wf(qi, qj) = α · wr(qi, qj) + (1 − α) · wc(qi, qj).

The relative contribution of the two weights is controlled by α, and we denote a query fusion graph constructed with a particular value of α as QFG(α). The effects of varying α are explored further in Section 5.


3.2 Computing Query Relevance

Having introduced the search behavior graphs in the previous section, we now compute the relevance between two queries. More specifically, for a given user query q, we compute a relevance vector using QFG, where each entry corresponds to the relevance value of each query qj ∈ VQ to q.

The edges in QFG correspond to pairs of relevant queries extracted from the query logs and the click logs. However, it is not sufficiently effective to use the pairwise relevance values directly expressed in QFG as our query relevance scores. Let us consider a vector rq, where each entry, rq(qj), is wf(q, qj) if there exists an edge from q to qj in QFG, and 0 otherwise. One straightforward approach for computing the relevance of qj to q is to use this rq(qj) value. However, although this may work well in some cases, it will fail to capture relevant queries that are not directly connected in QFG (and thus have rq(qj) = 0).

Therefore, for a given query q, we suggest a more generic approach of obtaining query relevance by defining a Markov chain for q, MCq, over the given graph, QFG, and computing the stationary distribution of the chain. We refer to this stationary distribution as the fusion relevance vector of q, relFq, and use it as a measure of query relevance throughout this paper.

In a typical scenario, the stationary probability distribution of MCq can be estimated using the matrix multiplication method, where the matrix corresponding to MCq is multiplied by itself iteratively until the resulting matrix reaches a fixpoint. However, given our setting of having thousands of users issuing queries and clicks in real time, and the huge size of QFG, it is infeasible to perform the expensive matrix multiplication to compute the stationary distribution whenever a new query comes in. Instead, we pick the most efficient Monte Carlo random walk simulation method among the ones presented in [15], and use it on QFG to approximate the stationary distribution for q. Fig. 4 outlines our algorithm.

The algorithm in Fig. 4 computes the fusion relevance vector of a given query q, rel^F_q. It requires the following inputs in addition to the QFG. First, we introduce a jump vector of q, g_q, that specifies the probability that a query is selected as the starting point of a random walk. Since we set g_q(q') to 1 if q' = q, and 0 otherwise, q will always be selected; in the next section we will generalize g_q to have multiple starting points by considering both q and the clicks for q. A damping factor, d ∈ [0, 1] (similar to the original PageRank algorithm [16]), determines the probability of a random walk restart at each node.

Two additional inputs control the accuracy and the time budget of the random walk simulation: the total number of random walks, numRWs, and the size of the neighborhood explored, maxHops. As numRWs increases, the approximation accuracy of the fusion relevance vector improves by the law of large numbers. We limit the length of each random walk to maxHops, assuming that a transition from q to q' is very unlikely if no user in the search logs followed q by q' in fewer than maxHops intermediate queries. In practice, we typically use numRWs = 1,000,000 and maxHops = 5, but we can reduce the number of random walk samples or the lengths of random walks by decreasing these parameters for a faster computation of rel^F_q.

The random walk simulation then proceeds as follows: we use the jump vector g_q to pick the starting point of the random walk. At each node v, for a given damping factor d, the random walk either continues by following one of the outgoing edges of v with probability d, or stops and restarts at one of the starting points in g_q with probability (1 − d). Each outgoing edge, (v, q_i), is selected with probability w_f(v, q_i), and the random walk always restarts if v has no outgoing edge. The selection of the next node to visit, based on the outgoing edges of the current node v in the QFG and the damping factor d, is performed by the SelectNextNodeToVisit procedure in Step (7) of the algorithm, which is illustrated in Fig. 5. Notice that each random walk simulation is independent of the others, so they can be parallelized.

After simulating numRWs random walks on the QFG starting from the node corresponding to the given query q, we normalize the number of visits of each node by the total number of visits, finally obtaining rel^F_q, the fusion relevance vector of q. Each entry of the vector, rel^F_q(q'), corresponds to the fusion relevance score of a query q' ∈ V_Q to the given query q. It is the probability that node q' is visited along a random walk originating from node q over the QFG.
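The simulation loop described above can be sketched in Python. This is a minimal illustration, not the paper's implementation: the QFG is assumed to be a dict mapping each query to a list of weighted outgoing edges (with the w_f weights of each node summing to 1), and the jump vector a dict of restart probabilities; all function and variable names are hypothetical.

```python
import random

def simulate_fusion_relevance(qfg, jump_vector, d=0.6, num_rws=10000, max_hops=5, seed=0):
    """Approximate the fusion relevance vector by Monte Carlo random walks.

    qfg: dict query -> list of (next_query, w_f) pairs; per-node weights sum to 1.
    jump_vector: dict query -> probability of being a walk's starting point.
    """
    rng = random.Random(seed)
    starts = list(jump_vector.keys())
    start_weights = list(jump_vector.values())
    visits, total = {}, 0
    for _ in range(num_rws):
        # pick the starting point according to the jump vector
        v = rng.choices(starts, weights=start_weights)[0]
        for _ in range(max_hops):
            visits[v] = visits.get(v, 0) + 1
            total += 1
            out = qfg.get(v)
            # restart with probability (1 - d), or always if v has no outgoing edge
            if not out or rng.random() >= d:
                break
            nxt, wts = zip(*out)
            v = rng.choices(nxt, weights=wts)[0]  # follow edge (v, q_i) with prob w_f(v, q_i)
    # normalize visit counts into a probability vector
    return {q: c / total for q, c in visits.items()}
```

On a toy chain q → a → b, the walk visits q on every simulation and reaches a and b only by surviving the damping test, so the relevance scores decay with distance from q, as intended.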

Lastly, we show that there exists a unique fusion relevance vector of a given query q, rel^F_q. It is well known that for a finite ergodic Markov chain, there exists a unique stationary distribution. In fact, the random walk simulation algorithm described in Fig. 4 approximates rel^F_q, which corresponds to the stationary distribution of the Markov chain for q, MC_q. To prove the uniqueness of rel^F_q, it is sufficient to show that MC_q is ergodic.

HWANG ET AL.: ORGANIZING USER SEARCH HISTORIES 917

Fig. 4. Algorithm for calculating the query relevance by simulating random walks over the query fusion graph.

Fig. 5. Algorithm for selecting the next node to visit.

Given a query q and a damping factor d, the Markov chain for q, MC_q, is defined as follows: first, the finite state space of MC_q, denoted Σ_q, contains all the queries reachable from the given query q in the QFG (Σ_q ⊆ V_Q). Then, we define the transition matrix of MC_q. For each pair of states q_i and q_j in Σ_q, the transition probability from state q_i to state q_j, MC_q(q_i, q_j), is defined as

    MC_q(q_i, q_j) = d · w_f(q_i, q_j)                if q_j ≠ q;
    MC_q(q_i, q_j) = d · w_f(q_i, q_j) + (1 − d)      if q_j = q.

If q_i has no outgoing edge in the QFG, we set MC_q(q_i, q_j) to 1 for the next state q_j = q and 0 otherwise. Also note that if q_i and q_j are not directly connected in the QFG, w_f(q_i, q_j) = 0. As in Boldi et al. [17], we assume that the transition matrix of MC_q is aperiodic. Also, each state in Σ_q has a positive transition probability to state q (in fact, MC_q(q_i, q) ≥ 1 − d for all q_i ∈ Σ_q), so any state in MC_q can reach any other state in MC_q through state q. Thus, MC_q is ergodic, which guarantees the existence of a unique stationary distribution of MC_q. However, we want to mention that MC_q is a conceptual model, and we do not materialize MC_q for each query q in the QFG to calculate rel^F_q in practice. Instead, for a given query q, we simply adjust edge weights in the QFG accordingly, and set state q as the start state of every random walk to ensure that only states of MC_q among the nodes in the QFG are visited.
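The case analysis above maps directly to code. The following is a minimal sketch of materializing MC_q (which, as noted, the paper does only conceptually), assuming the QFG is represented as a dict of per-node outgoing-edge weights w_f that sum to 1; the function name and representation are illustrative.

```python
def markov_chain_for_query(qfg, q, d=0.6):
    """Build the transition matrix of MC_q as nested dicts.

    qfg: dict node -> {next_node: w_f}, with each node's w_f summing to 1.
    States are the queries reachable from q in the QFG, including q itself.
    """
    # collect the reachable state space with a DFS
    states, stack = {q}, [q]
    while stack:
        v = stack.pop()
        for u in qfg.get(v, {}):
            if u not in states:
                states.add(u)
                stack.append(u)
    mc = {}
    for qi in states:
        out = qfg.get(qi, {})
        row = {}
        if not out:
            row[q] = 1.0  # dangling node: always transition back to q
        else:
            for qj, wf in out.items():
                row[qj] = row.get(qj, 0.0) + d * wf   # follow edge with prob d
            row[q] = row.get(q, 0.0) + (1 - d)        # restart mass goes to q
        mc[qi] = row
    return mc
```

Every row sums to 1, and every state carries at least (1 − d) of transition mass back to q, which is exactly the property used in the ergodicity argument.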

3.3 Incorporating Current Clicks

In addition to query reformulations, user activities also include clicks on the URLs following each query submission. The clicks of a user may further help us infer her search interests behind a query q and thus identify queries and query groups relevant to q more effectively. In this section, we explain how we can use the click information of the current user to expand the random walk process and improve our query relevance estimates. Note that the approach we introduce in this section is independent of modeling the query click information as the QCG in Section 3.1.2 to build the QFG. Here, we use the clicks of the current user to better understand her search intent behind the currently issued query, while the clicks of massive numbers of users in the click logs are aggregated into the QCG to capture the degree of relevance of query pairs through commonly clicked URLs.

We give a motivating example that illustrates why it may be helpful to take into account the clicked URLs of q when computing the query relevance. Let us consider a user who submitted the query “jaguar.” If we compute the relevance scores of each query in V_Q with respect to the given query only, both the queries related to the car “jaguar” and those related to the animal “jaguar” get high fusion relevance scores. This happens because we do not know the actual search interest of the current user when she issues the query “jaguar.” However, if we know the URLs clicked by the current user following the query “jaguar” (e.g., the Wikipedia article on the animal “jaguar”), we can infer the search interest behind the current query and assign query relevance scores to queries in V_Q accordingly. In this way, by making use of the clicks, we can give much higher query relevance scores to queries related to the animal “jaguar” than to those related to the car “jaguar.” This idea of biasing the random walks toward a certain subset of the graph nodes is similar in spirit to topic-sensitive PageRank [18].

We now describe how we use the URLs clicked by the current user together with the given query q to better capture her search intent. First, we identify the set of URLs, clk, that were clicked by the current user after issuing q. Then, we use clk and the click-through graph CG to expand the space of queries considered when we compute the fusion relevance vector of q. Unlike the jump vector g_q in Section 3.2, which reflects the given query q only, we now consider both q and clk together when we set a new jump vector.

Given q and clk, we employ a click jump vector, g_clk, that represents the queries in CG that have also induced clicks to the URLs within clk. Each entry in g_clk, g_clk(q_i), corresponds to the relevance of query q_i to the URLs in clk. Using CG, we define g_clk(q_i) as the proportion of the number of clicks to clk induced by q_i (q_i ∈ V_Q \ {q}) to the total number of clicks to clk induced by all the queries in V_Q \ {q}:

    g_clk(q_i) := Σ_{u_k ∈ clk} count_c(q_i, u_k) / Σ_{q_j ∈ V_Q, q_j ≠ q} Σ_{u_k ∈ clk} count_c(q_j, u_k).

Since the given query q is already captured in g_q, we set the entry in g_clk corresponding to q to 0 (g_clk(q) = 0).

Now, we introduce a new jump vector, g_(q,clk), that considers both q and clk by incorporating g_clk, which biases the random jump probabilities toward queries related to the clicks, clk. In particular, we combine g_q and g_clk by defining g_(q,clk) as the weighted sum of g_q in Section 3.2 and the click jump vector g_clk. We control the importance of query and click by using w_query and w_click (w_query + w_click = 1), thus g_(q,clk)(q) = w_query and g_(q,clk)(q') = w_click · g_clk(q') for every query q' ∈ V_Q \ {q}. Once g_(q,clk) is set, we simulate random walks and estimate the fusion relevance vector in a similar way as before, with one difference. Notice that in Section 3.2, when calculating rel^F_q, all the random walks start from the node corresponding to q, because g_q(q) is the only nonzero entry in the jump vector g_q (g_q(q) = 1). Now, however, the random walk simulation can start from any query node q' for which g_(q,clk)(q') > 0, with a probability of g_(q,clk)(q'). We denote this alternate query fusion vector obtained from g_(q,clk) as rel^F_(q,clk).
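Under the assumption that the click log is available as per-(query, URL) click counts (a hypothetical representation; the paper does not fix one), g_clk and the combined jump vector g_(q,clk) could be computed as follows; names are illustrative.

```python
def click_jump_vector(cg_counts, q, clk):
    """g_clk: relevance of each query to the clicked URLs in clk.

    cg_counts: dict (query, url) -> click count from the click logs.
    The entry for the current query q itself is left out (g_clk(q) = 0).
    """
    scores = {}
    for (qi, u), c in cg_counts.items():
        if qi != q and u in clk:
            scores[qi] = scores.get(qi, 0) + c
    total = sum(scores.values())
    # normalize by the total clicks to clk induced by all queries other than q
    return {qi: c / total for qi, c in scores.items()} if total else {}

def combined_jump_vector(q, g_clk, w_click=0.2):
    """g_(q,clk): weighted sum of the query jump vector and the click jump vector."""
    w_query = 1.0 - w_click
    g = {qi: w_click * v for qi, v in g_clk.items()}
    g[q] = w_query
    return g
```

The resulting g_(q,clk) sums to 1 and can be fed directly to the random walk simulation as its jump vector in place of g_q.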

In the following sections, the fusion relevance vectors rel^F_q and rel^F_(q,clk) are referred to as rel_q and rel_(q,clk), respectively, assuming that we, by default, use the query fusion graph QFG, not the QRG or QCG, to compute relevance vectors.

4 QUERY GROUPING USING THE QFG

In this section, we outline our proposed similarity function sim_rel to be used in the online query grouping process outlined in Section 2. For each query, we maintain a query image, which represents the relevance of other queries to this query. For each query group, we maintain a context vector, which aggregates the images of its member queries to form an overall representation. We then propose a similarity function sim_rel for two query groups based on these concepts of context vectors and query images. Note that our proposed definitions of the query reformulation graph, query images, and context vectors are crucial ingredients, which lend significant novelty to the Markov chain process for determining relevance between queries and query groups.

918 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 24, NO. 5, MAY 2012

Context Vector. For each query group, we maintain a context vector, which is used to compute the similarity between the query group and the user's latest singleton query group. The context vector for a query group s, denoted cxt_s, contains the relevance scores of each query in V_Q to the query group s, and is obtained by aggregating the fusion relevance vectors of the queries and clicks in s. If s is a singleton query group containing only {q_s1, clk_s1}, it is defined as the fusion relevance vector rel_(q_s1, clk_s1). For a query group s = ⟨{q_s1, clk_s1}, ..., {q_sk, clk_sk}⟩ with k > 1, there are a number of different ways to define cxt_s. For instance, we can define it as the fusion relevance vector of the most recently added query and clicks, rel_(q_sk, clk_sk). Other possibilities include the average or the weighted sum of all the fusion relevance vectors of the queries and clicks in the query group. In our experiments, we calculate the context vector of a query group s by weighting the queries and the clicks in s by recency, as follows:

    cxt_s = w_recency · Σ_{j=1..k} (1 − w_recency)^{k−j} · rel_(q_sj, clk_sj).

Note that if {q_sk, clk_sk} are the most recent query and clicks added to the query group, this can be rewritten as

    cxt_s = w_recency · rel_(q_sk, clk_sk) + (1 − w_recency) · cxt_s',

where s' = ⟨{q_s1, clk_s1}, ..., {q_s(k−1), clk_s(k−1)}⟩. In our implementation we used w_recency = 0.3.
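The recursive form of the context-vector update lends itself to a simple incremental implementation. A minimal sketch, with relevance vectors represented as sparse dicts (an assumed representation for illustration):

```python
def update_context_vector(cxt_prev, rel_new, w_recency=0.3):
    """Recency-weighted update: cxt_s = w_recency * rel_(q_k, clk_k)
    + (1 - w_recency) * cxt_s', where cxt_s' is the previous context vector."""
    keys = set(cxt_prev) | set(rel_new)
    return {q: w_recency * rel_new.get(q, 0.0) + (1 - w_recency) * cxt_prev.get(q, 0.0)
            for q in keys}
```

Applying this update each time a query and its clicks are added to a group reproduces the closed-form sum above, since each older vector is discounted by another factor of (1 − w_recency) per update.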

Query Image. The fusion relevance vector of a given query q, rel_q, captures the degree of relevance of each query q' ∈ V_Q to q. However, we observed that it is neither effective nor robust to use rel_q itself as a relevance measure for our online query grouping. For instance, let us consider two relevant queries, “financial statement” (“fs”) and “bank of america” (“boa”), in Fig. 2b. We may use the relevance value in the fusion relevance vectors, rel_"fs"("boa") or rel_"boa"("fs"). Usually, however, this is a very tiny number that does not comprehensively express the relevance of the search tasks of the queries, and thus is not an adequate relevance measure for effective and robust online query grouping. Instead, we want to capture the fact that both queries highly pertain to financials.

To this end, we introduce a new concept, the image of q, denoted I(q), that expresses q as the set of queries in V_Q that are considered highly relevant to q. We generate I(q) by including every query q' whose relevance value to q, rel_q(q'), is within the top-X percentage. To do this, we sort the queries by relevance, and find the cutoff such that the sum of the relevance values of the most relevant queries accounts for X% of the total probability mass. We break ties randomly. In our experiments, X = 99%. We found that even with this high percentage, the size of the image of a query is typically very small compared to the total number of possible queries in the QFG. The image of a query group s, I(s), is defined in the same way as I(q), except that the context vector of s, cxt_s, is used in place of rel_(q,clk).
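The top-X cutoff described above might look as follows in Python; the sparse-dict representation of the relevance vector is an assumption for illustration, and for simplicity ties are broken by sort order rather than randomly.

```python
def query_image(rel, top_x=0.99):
    """I(q): the smallest prefix of the relevance-sorted queries whose
    cumulative relevance mass reaches top_x of the vector's total mass."""
    total = sum(rel.values())
    image, mass = set(), 0.0
    # sort queries by relevance, highest first (ties broken by sort order)
    for q, score in sorted(rel.items(), key=lambda kv: kv[1], reverse=True):
        if mass >= top_x * total:
            break  # cutoff reached: remaining queries are excluded
        image.add(q)
        mass += score
    return image
```

Because fusion relevance scores are heavily concentrated on a few related queries, even a 99% cutoff typically yields a small image relative to |V_Q|.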

Now, we define the relevance metric for query groups, sim_rel (∈ [0, 1]), based on the QFG. Two query groups are similar if their common image occupies high probability mass in both of the context vectors of the query groups. We use the above definitions of context vector and query image to capture this intuition.

Definition 4.1. sim_rel(s_1, s_2), the relevance between two query groups s_1 and s_2, is defined as follows:

    sim_rel(s_1, s_2) = Σ_{q ∈ I(s_1) ∩ I(s_2)} cxt_s1(q) × Σ_{q ∈ I(s_1) ∩ I(s_2)} cxt_s2(q).

Then, the relevance between the user's latest singleton query group s_c = {q_c, clk_c} and an existing query group s_i ∈ S will be

    sim_rel(s_c, s_i) = Σ_{q ∈ I(s_c) ∩ I(s_i)} rel_(q_c, clk_c)(q) × Σ_{q ∈ I(s_c) ∩ I(s_i)} cxt_si(q).

The relevance metric sim_rel is used in Step (5) of the algorithm in Fig. 3 in place of sim. In this way, the latest singleton query group s_c will be attached to the query group s that has the highest similarity sim_rel.
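Definition 4.1 can be sketched directly, assuming context vectors are sparse dicts and images are sets (illustrative representations, as above):

```python
def sim_rel(cxt1, cxt2, image1, image2):
    """Definition 4.1: the product of the probability masses that the common
    image I(s1) ∩ I(s2) occupies in each group's context vector."""
    common = image1 & image2
    mass1 = sum(cxt1.get(q, 0.0) for q in common)
    mass2 = sum(cxt2.get(q, 0.0) for q in common)
    return mass1 * mass2
```

Since each mass lies in [0, 1], the product also lies in [0, 1]; it is high only when the shared queries dominate both context vectors, which is exactly the stated intuition.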

Online Query Grouping. The similarity metric that we described in Definition 4.1 operates on the images of a query and a query group. Some applications, such as query suggestion, may be facilitated by fast on-the-fly grouping of user queries. For such applications, we can avoid performing the random walk computation of the fusion relevance vector for every new query in real time, and instead precompute and cache these vectors for some queries in our graph. This works especially well for popular queries. In this case, we are essentially trading off disk storage for runtime performance. We estimate that caching the fusion relevance vectors of 100 million queries would require disk storage space in the hundreds of gigabytes. This additional storage space is insignificant relative to the overall storage requirements of a search engine. Meanwhile, retrieval of fusion relevance vectors from the cache can be done in milliseconds. Hence, for the remainder of this paper, we will focus on evaluating the effectiveness of the proposed algorithms in capturing query relevance.

5 EXPERIMENTS

5.1 Experimental Setup

In this section, we study the behavior and performance of our algorithms on partitioning a user's query history into one or more groups of related queries. For example, for the sequence of queries “caribbean cruise”; “bank of america”; “expedia”; “financial statement”, we would expect two output partitions: first, {“caribbean cruise,” “expedia”}, pertaining to travel-related queries, and, second, {“bank of america,” “financial statement”}, pertaining to money-related queries.

Data. To this end, we obtained the query reformulation and query click graphs by merging a number of monthly search logs from a commercial search engine. Each monthly snapshot of the query log adds approximately 24 percent new nodes and edges to the graph compared to the immediately preceding monthly snapshot, while approximately 92 percent of the mass of the graph is obtained by merging nine monthly snapshots. To reduce the effect of noise and outliers, we pruned the query reformulation graph by keeping only query pairs that appeared at least two times (θ_q = 2), and the query click graph by keeping only query-click edges that had at least 10 clicks (θ_c = 10). This produced query and click graphs that were 14 and 16 percent smaller than their original respective graphs. Based on these two graphs, we constructed the query fusion graph as described in Section 3 for various settings of the mixing parameter α.

In order to create test cases for our algorithms, we used the search activity (comprising at least two queries) of a set of 200 users (henceforth called the Rand200 data set) from our search log. To generate this set, users were picked randomly from our logs, and two human labelers examined their queries and assigned them to either an existing group, or to a new group if the labelers deemed that no related group was present. A user's queries were included in the Rand200 data set only if both labelers were in agreement, in order to reduce bias and subjectivity in the grouping. The labelers were allowed access to the web in order to determine whether two seemingly distant queries were actually related (e.g., “alexander the great” and “gordian knot”). The average number of groups in the data set was 3.84, with 30 percent of the users having queries grouped into more than three groups.

Performance Metric. To measure the quality of the output groupings, for each user, we start by computing the query pairs in the labeled and output groupings. Two queries form a pair if they belong to the same group, with lone queries pairing with a special “null” query.

To evaluate the performance of our algorithms against the groupings produced by the labelers, we use the Rand Index [19] metric, which is a commonly employed measure of similarity between two partitions. The Rand Index similarity between two partitions X, Y of the same n elements is defined as RandIndex(X, Y) = (a + b) / C(n, 2), where a is the number of pairs that are in the same set in X and the same set in Y, and b is the number of pairs that are in different sets in X and in different sets in Y. Higher RandIndex values indicate a better ability of a given algorithm to group related queries together.
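As a sketch, the Rand Index over two partitions can be computed from group labels. For simplicity, this version takes each partition as an element-to-group-label mapping rather than materializing the query pairs (the “null” pairing for lone queries is implicit in the labels); the representation is an assumption for illustration.

```python
from itertools import combinations

def rand_index(part_x, part_y):
    """RandIndex(X, Y) = (a + b) / C(n, 2) over the same n elements.

    part_x, part_y: dicts mapping each element to its group label; both must
    cover the same element set.
    """
    items = list(part_x)
    pairs = list(combinations(items, 2))
    agree = 0
    for i, j in pairs:
        same_x = part_x[i] == part_x[j]
        same_y = part_y[i] == part_y[j]
        # count pairs grouped identically in both partitions (both together
        # or both apart), covering the a and b terms of the definition
        if same_x == same_y:
            agree += 1
    return agree / len(pairs)
```

Identical partitions score 1.0, and each pair that one partition groups together but the other separates lowers the score by 1 / C(n, 2).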

Default values. In the following, we will study different aspects of our proposed algorithms. Unless we explicitly specify otherwise, we use the following default parameters: damping factor d = 0.6, top-X = 99%, α = 0.7, click importance w_click = 0.2, recency weight w_recency = 0.3, and similarity threshold τ_sim = 0.9. We picked these values by repeating a set of experiments with varying values for these parameters and selecting the ones that allowed our algorithm to achieve the best performance on Rand200 based on the RandIndex metric. We followed the same approach for the baselines that we implemented as well. We will also evaluate the approaches on additional test sets (Lo100, Me100, Hi100), which will be described later. Since our method involves a random walk, we also tested for statistical significance of each configuration across runs. The results that we present in the remainder of the section are statistically significant at the 0.001 level according to the t-test [20] across runs.

5.2 Using Search Logs

As discussed in Section 3, our query grouping algorithm relies heavily on the use of search logs in two ways: first, to construct the query fusion graph used in computing query relevance, and, second, to expand the set of queries considered when computing query relevance. We start our experimental evaluation by investigating how we can make the most out of the search logs.

In our first experiment, we study how we should combine the query graphs coming from the query reformulations and the clicks within our query log. Since combining the two graphs is captured by the α parameter as discussed in Section 3, we evaluated our algorithm over the graphs that we constructed for increasing values of α. The result is shown in Fig. 6; the horizontal axis represents α (i.e., how much weight we give to the query edges coming from the query reformulation graph), while the vertical axis shows the performance of our algorithm in terms of the RandIndex metric. As we can see from the graph, our algorithm performs best (RandIndex = 0.86) when α is around 0.7, with the two extremes (only edges from clicks, i.e., α = 0.0, or only edges from reformulations, i.e., α = 1.0) performing lower. It is interesting to note that, based on the shape of the graph, edges coming from query reformulations are deemed to be slightly more helpful than edges from clicks. This is because there are 17 percent fewer click-based edges than reformulation-based edges, which means that random walks performed on the query reformulation graph can identify richer query images, as there are more available paths to follow in the graph.

We now turn to study the effect of expanding the query set based on the user clicks when computing query relevance. To this end, we evaluated the performance of our algorithm for increasing values of the click importance w_click, and we show the result in Fig. 7. Based on this figure, we observe that, in general, taking user clicks into account to expand the considered query set helps to improve performance. Performance rises up to a point (w_click = 0.3), after which it starts degrading. At the two extremes (when only queries from user clicks are used to seed the random walks, i.e., w_click = 1, or when only the current query is used, i.e., w_click = 0), performance is generally lower.


Fig. 6. Varying mix of query and click graphs.


5.3 Varying the Parameters

Given the previous results on how to utilize the information from search logs, we now turn to studying the remaining parameters of our algorithms.

Damping Factor. The damping factor d is the probability of continuing a random walk, instead of starting over from one of the query nodes in the jump vector. As shown in Fig. 8, the RandIndex is lower for very low damping factors, increases together with the damping factor, and maxes out for damping factors between 0.6 and 0.8. This confirms our intuition that related queries are close to the current query in our query fusion graph and that they can be captured with short random walks (small d) from the current query. At the extreme where the damping factor is 0, we observe lower performance, as the query image is essentially computed on a random sample from the jump vector without exploiting the link information of the query fusion graph.

Top-X. Top-X is the fraction of the sum of relevance scores of related queries that are included in the image of a query. As Fig. 9 shows, we get better performance for very high X, such as 0.99. We pick a high X in order to keep most of the related queries that can be potentially useful for capturing query similarities. Even though we use a very high X value such as 0.99, the number of related queries in a query image is still much smaller than |V_Q|, as related queries obtain much higher relevance scores than irrelevant ones.

Similarity Threshold. The similarity threshold τ_sim helps us determine whether we should start a new group for the current query or attach it to an existing one. We show how performance varies for increasing similarity thresholds in Fig. 10. In general, as the similarity threshold increases, the RandIndex value becomes higher. This is expected, as the higher the similarity is, the more likely it is that a session will include query groups containing highly related queries. A high threshold is also useful for avoiding the effect of unrelated but very popular queries (e.g., “ebay,” “yahoo”) that may appear frequently as reformulations of each other. As τ_sim increases from 0.8 to 1, the RandIndex drops, since such a τ_sim is too strict to group related queries together, resulting in many small groups.

Recency Weight. Finally, we study the recency weight w_recency, which affects how much weight we give to the fusion relevance vectors within an existing query group. Larger values of w_recency mean that we favor the latest query that was assigned to a given query group. We show how performance varies for increasing w_recency values in Fig. 11. Overall, we observe that we get the best performance for w_recency values between 0.3 and 0.6.


Fig. 7. Varying the click importance w_click.

Fig. 8. Varying the damping factor d.

Fig. 9. Varying the fraction of related queries in Top-X.

Fig. 10. Varying the similarity threshold τ_sim.


5.4 Performance Comparison

We now compare the performance of our proposed methods against five different baselines. For these baselines, we use the same SelectBestQueryGroup as in Fig. 3 with varying relevance metrics.

As the first baseline, we use a time-based method (henceforth referred to as Time) that groups queries based on whether the time difference between a query and the most recent previous query is above a threshold. It is essentially the same as the Time metric introduced in Section 2, except that instead of measuring similarity as the inverse of the time interval, we measure the distance in terms of the time interval (in seconds). Fig. 12 shows the performance of this method for varying time thresholds (measured in seconds). We will use 600 secs (the highest RandIndex value in Fig. 12) as the default threshold for this method.

The next two baselines are based on text similarity. Jaccard similarity uses the fraction of overlapping keywords between two queries, while Levenshtein similarity calculates the edit distance, normalized by the maximum length of the two queries being compared. It may capture misspellings and typographical errors that elude the word-based Jaccard. Fig. 13 shows their performance as we vary the similarity threshold. As with Time, the optimal performance is reached at an intermediate threshold: 0.1 (default) in the case of Jaccard, and 0.4 (default) for Levenshtein.

Our last two baselines exploit click and query graphs. More specifically, we have implemented the coretrieval baseline (henceforth referred to as CoR) to assign a query to the group with the highest overlap in the retrieved results, as described in Section 2. We have also implemented the method based on the Asymmetric Traveling Salesman Problem (henceforth referred to as ATSP) as described in [5]. Since both of these baselines are threshold based, we study their performance for increasing threshold values in Fig. 13, and then set the similarity threshold for CoR to 0.7 (default) and for ATSP to 0.7 (default).
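The two text-similarity baselines can be sketched as follows; the whitespace tokenization for Jaccard is an assumption, since the paper does not specify its tokenizer.

```python
def jaccard_similarity(q1, q2):
    """Fraction of overlapping keywords between two queries."""
    a, b = set(q1.lower().split()), set(q2.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0

def levenshtein_similarity(q1, q2):
    """Edit distance normalized by the longer query's length,
    turned into a similarity in [0, 1]."""
    m, n = len(q1), len(q2)
    if max(m, n) == 0:
        return 1.0
    # standard dynamic-programming edit distance, row by row
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if q1[i - 1] == q2[j - 1] else 1
            cur[j] = min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + cost)
        prev = cur
    return 1.0 - prev[n] / max(m, n)
```

Character-level Levenshtein catches misspellings ("expedia" vs. "expediia") that word-level Jaccard misses, while Jaccard handles word reorderings better; this is why the two baselines behave differently in the comparison.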

We compare the baseline methods with our method that uses the query fusion graph. For our method (denoted as QFG), we use the default parameters specified in Section 5.1. We report the results on the Rand200 data set in the first row of Table 1, where we use boldface to denote the best performance for a data set (we will discuss the remaining rows in the next section). Overall, Time and Levenshtein perform worse than the rest of the algorithms. This is an indication that the queries issued by the users are interleaved in terms of their topics (hence Time performs badly), and also that the edit distance between queries is not able to capture related queries very well. Jaccard performs slightly better than these two, but it also cannot capture the groupings very well, with the CoR method coming next. Finally, our QFG method and the ATSP method perform the best, with QFG performing slightly better than ATSP.

The techniques that we have studied so far fall into different categories and attempt to capture different aspects of query similarity; Time simply looks at the time intervals, Jaccard and Levenshtein exploit textual similarities of queries, while CoR, ATSP, and QFG use the search logs. Therefore, given the different natures of these algorithms, it is reasonable to hypothesize that they do well for different kinds of queries. In particular, since our QFG method relies on the accurate estimation of a query image within the query fusion graph, it is expected to perform better when the estimation was based on more information and is therefore more accurate. On the other hand, if there are queries that are rare in the search logs or do not have many outgoing edges in our graph to facilitate the random walk, the graph-based techniques may perform worse due to the lack of edges. We study how the structure of the graph affects the performance of the algorithms as follows.

Fig. 11. Varying the recency weight w_recency.

Fig. 12. Varying the time threshold.

Fig. 13. Varying the similarity threshold.

TABLE 1. Comparative Performance (RandIndex) of Our Methods. Best performance in each data set is shown in bold.

5.5 Varying Graph Connectivity

In order to better estimate the query transition probabilities in our query fusion graph, it is helpful to have as much usage information encoded in the graph as possible. More specifically, if the queries within a user's session are issued more frequently, they are also more likely to have more outgoing edges in the graph and thus facilitate random walks going out of these queries. At the same time, more popular queries will have more accurate counts in the graph, and this may lead to higher confidence when we compute the query images.

To gain a measure of the usage information for a given user, we look at the average outdegree of the user's queries (average outdegree), as well as the average counts among the outgoing links (average weight) in the query reformulation graph. In order to study the effects of usage information on the performance of our algorithms, we created three additional test sets of 100 users each. These sets were also manually labeled as described in Section 5.1. The first set, Lo100, contains the search activity of 100 users with average outdegree < 5 and average weight < 5. Similarly, Me100 contains activity for users having 5 ≤ average outdegree < 10 and 5 ≤ average weight < 10, while Hi100 contains activity for users with average outdegree ≥ 10 and average weight ≥ 10.

Based on these data sets, we evaluate again the performance of our algorithms and show the results in the bottom three lines of Table 1. As we can see from the table, for QFG, subsets with higher usage information also tend to have higher RandIndex values: Hi100 (RandIndex = 0.88) performs better than Me100 (RandIndex = 0.868), which in turn outperforms Lo100 (RandIndex = 0.821). ATSP shows a similar trend (higher usage shows better performance), and it outperforms QFG on the Lo100 data set. CoR's performance is more or less similar across the different data sets, which is expected as it does not use the graphs directly. Jaccard is most effective when the connectivity around the queries within a user's session is relatively low. We do not observe any significant difference in the performance of the other baselines (Time and Levenshtein) on these new data sets.

Overall, we observe that different techniques might be more appropriate for different degrees of usage information (and hence connectivity) in the graph. Higher connectivity implies that the queries are well known and may be well connected in the graph, while lower connectivity might imply that the query is new or not very popular. Since our goal is to achieve good performance across the board for all queries, we study the combination of these methods next.

5.6 Combining the Methods

The results of the previous experiment point out the contrast between the performance of the different methods. This suggests that a combination of two methods may yield better performance than either method individually. We explore combining two methods by merging their output query groups as follows: given the output groups of any two methods, query pairs that belong to a group within one or within the other will belong to the same group in the combined output.
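Merging the output groups of two methods as described, so that any pair grouped together by either method stays together, amounts to taking connected components over the union of the same-group pairs. A union-find sketch (illustrative, not the paper's code):

```python
def combine_groupings(groups_a, groups_b):
    """Merge two groupings (each a list of lists of queries): queries grouped
    together by either method end up in the same combined group."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    def union(x, y):
        parent[find(x)] = find(y)

    # every query in a group is linked to that group's first member
    for grouping in (groups_a, groups_b):
        for group in grouping:
            for q in group:
                union(group[0], q)
    # collect the connected components
    out = {}
    for q in parent:
        out.setdefault(find(q), set()).add(q)
    return list(out.values())
```

Note that this merge is transitive: if method A groups q1 with q2 and method B groups q2 with q3, the combined output puts all three together, which is why the combination can only merge groups, never split them.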

Table 2 shows the performance gained by combining QFG with each baseline. For QFG + Jaccard and QFG + Levenshtein, the combination performs better than the individual methods. QFG + Time performs better than Time but worse than QFG.

Interestingly, for QFG + Jaccard, we now get a more consistent performance across the three test sets (Lo100, Me100, Hi100) at around 0.89. The biggest boost to QFG’s performance is obtained for Lo100; it is larger than for Me100 or Hi100. This result is noteworthy as it implies that the combination method performs gracefully across queries for which we may have different amounts of usage information in the graph. Combining QFG with CoR and ATSP improves their performance slightly, but not as much as the combination of QFG and Jaccard. This is mostly due to the fact that CoR captures similar information to the click portion of QFG, while ATSP captures similar information to the query reformulation portion of QFG.

In summary, from the experimental results we observe that using the click graph in addition to the query reformulation graph in a unified query fusion graph helps improve performance. Additionally, the query fusion graph performs better for queries with higher usage information and handily beats the time-based and keyword similarity-based baselines for such queries. Finally, keyword similarity-based methods complement our method well, providing high and stable performance regardless of the usage information.

6 RELATED WORK

While we are not aware of any previous work with the same objective of organizing user history into query groups, there has been prior work on determining whether two queries belong to the same search task. In recent work, Jones and Klinkner [4] and Boldi et al. [5] investigate the search-task identification problem. More specifically, Jones and Klinkner [4] considered a search session to consist of a number of tasks (missions), each of which further consists of a number of subtasks (goals). They trained a binary classifier with features based on time, text, and query logs


TABLE 2: Performance (RandIndex) of Combined Methods. Best performance in each data set is shown in bold.


to determine whether two queries belong to the same task. Boldi et al. [5] employed similar features to construct a query flow graph, where two queries linked by an edge were likely to be part of the same search mission.

Our work differs from these prior works in the following aspects. First, the query-log based features in [4], [5] are extracted from co-occurrence statistics of query pairs. In our work, we additionally consider query pairs having common clicked URLs, and we exploit both co-occurrence and click information through a combined query fusion graph. Second, Jones and Klinkner [4] cannot break ties when an incoming query is considered relevant to two existing query groups. Additionally, our approach does not involve learning, and thus does not require manual labeling and retraining as more search data come in; our Markov random walk approach essentially requires only maintaining an updated query fusion graph. Finally, our goal is to provide users with useful query groups on the fly while respecting existing query groups, whereas search task identification is mostly done on the server side with goals such as personalization, query suggestions [5], etc.

Some prior work has also looked at the problem of how to segment a user’s query stream into “sessions.” In most cases, this segmentation was based on a “time-out threshold” [21], [22], [23], [24], [25], [26], [27]. Some of these, such as [23], [26], looked at the segmentation of a user’s browsing activity rather than search activity. Silverstein et al. [27] proposed a time-out threshold value of 5 minutes, while others [21], [22], [24], [25] used various threshold values. As shown in Section 5, time is not a good basis for identifying query groups, as users may be multitasking when searching online [3], resulting in interleaved query groups.
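The time-out baseline these works rely on is simple to state: start a new session whenever the gap between consecutive queries exceeds a threshold. A minimal sketch (function name and sample stream are ours), using the 5-minute threshold attributed to Silverstein et al. [27]:

```python
from datetime import datetime, timedelta

def segment_by_timeout(queries, timeout=timedelta(minutes=5)):
    """Split a time-ordered stream of (text, timestamp) queries into
    sessions, starting a new session whenever the gap between
    consecutive queries exceeds `timeout`."""
    sessions = []
    for text, ts in queries:
        if sessions and ts - sessions[-1][-1][1] <= timeout:
            sessions[-1].append((text, ts))  # same session
        else:
            sessions.append([(text, ts)])    # gap too large: new session
    return sessions

t0 = datetime(2012, 5, 1, 9, 0)
stream = [("ipod", t0),
          ("apple store", t0 + timedelta(minutes=2)),
          ("jaguar", t0 + timedelta(minutes=20))]
print(len(segment_by_timeout(stream)))  # → 2
```

The weakness noted in the text is visible here: if a user interleaves two tasks within a short time window, this method lumps them into one session regardless of topic.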

The notion of using text similarity to identify related queries has been proposed in prior work. He et al. [24] and Ozmutlu and Cavdur [28] used the overlap of terms between two queries to detect changes in the topics of the searches. Lau and Horvitz [29] studied different refinement classes based on the keywords in queries, and attempted to predict these classes using a Bayesian classifier. Radlinski and Joachims [30] identified query sequences (called chains) by employing a classifier that combines a time-out threshold with textual similarity features of the queries, as well as the results returned by those queries. While text similarity may work in some cases, it may fail to capture cases where there is “semantic” similarity between queries (e.g., “ipod” and “apple store”) but no textual similarity. In Section 5, we investigate how we can use textual similarity to complement approaches based on search logs to obtain better performance.
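A token-level Jaccard measure, of the kind used by the keyword similarity baselines in Section 5, makes the limitation above easy to see. A small sketch (the function name is ours; tokenization here is plain whitespace splitting, an assumption):

```python
def jaccard_similarity(q1, q2):
    """Keyword (Jaccard) similarity between two queries:
    |terms1 ∩ terms2| / |terms1 ∪ terms2| over whitespace tokens."""
    a, b = set(q1.lower().split()), set(q2.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0

print(jaccard_similarity("apple ipod nano", "apple ipod touch"))  # → 0.5
print(jaccard_similarity("ipod", "apple store"))                  # → 0.0
```

The second pair is semantically related but shares no tokens, so any purely textual measure scores it zero; this is exactly the gap that the graph-based signals are meant to fill.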

The problem of online query grouping is also related to query clustering [13], [31], [6], [7], [32]. The authors in [13] found query clusters to be used as possible questions for a FAQ feature in an Encarta reference website by relying on both text and click features. In Beeferman and Berger [6] and Baeza-Yates and Tiberi [7], commonly clicked URLs on a query-click bipartite graph are used to cluster queries. The authors in [31] defined clusters as bicliques in the click graph. Unlike online query grouping, the queries to be clustered are provided in advance, and might come from many different users; the query clustering process is also a batch process that can be accomplished offline. While these prior works make use of click graphs, our approach is richer in that we use the click graph in combination with the reformulation graph, and we also consider indirect relationships between queries connected beyond one hop in the click graph. This problem is also related to document clustering [33], [34], with the major difference being the focus on clustering queries (only a few words) as compared to clustering documents, for which term distributions can be estimated well.

Graphs based on query and click logs [35] have also been used in previous work for different applications such as query suggestions [5], query expansion [36], ranking [37], and keyword generation [14]. In several cases, variations of random walks have been applied on the graph in order to identify the most important nodes. In Craswell and Szummer [37], a Markov random walk was applied on the click graph to improve ranking. In Fuxman et al. [14], a random walk was applied on the click-through graph to determine useful keywords, while in Collins-Thompson and Callan [36], a random walk was applied for query suggestion/expansion, with the node having the highest stationary probability being the best candidate for suggestion. As we discussed in Section 3, we take advantage of the stationary probabilities computed from the graph as a descriptive vector (image) for each query in order to determine similarity among query groups.
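The idea of a per-query “image” can be sketched as a random walk with restart (personalized PageRank) whose stationary distribution serves as the query’s descriptive vector. This is an illustrative simplification, not the paper’s implementation: the graph, damping factor, and iteration count below are our assumptions, and the graph is assumed to give every node at least one outgoing edge.

```python
import math

def query_image(graph, source, damping=0.85, iters=50):
    """Approximate stationary distribution of a random walk that
    restarts at `source` with probability (1 - damping), over a
    weighted digraph {node: {neighbor: weight}}."""
    nodes = list(graph)
    p = {n: 0.0 for n in nodes}
    p[source] = 1.0
    for _ in range(iters):
        nxt = {n: (1 - damping) * (n == source) for n in nodes}
        for n in nodes:
            total = sum(graph[n].values())
            for m, w in graph[n].items():
                nxt[m] += damping * p[n] * w / total
        p = nxt
    return p

def cosine(u, v):
    """Cosine similarity between two image vectors over the same nodes."""
    dot = sum(u[k] * v[k] for k in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv)

# Toy fusion-style graph with two topical components (weights are made up).
g = {"ipod": {"apple store": 1.0, "ipod nano": 2.0},
     "ipod nano": {"apple store": 1.0},
     "apple store": {"ipod": 1.0},
     "jaguar": {"jaguar car": 1.0},
     "jaguar car": {"jaguar": 1.0}}
im1 = query_image(g, "ipod")
im2 = query_image(g, "ipod nano")
im3 = query_image(g, "jaguar")
print(cosine(im1, im2) > cosine(im1, im3))  # → True
```

Queries in the same topical component produce overlapping images and hence high cosine similarity, while queries in disconnected components score zero, which is the intuition behind comparing query images rather than query strings.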

7 CONCLUSION

The query reformulation and click graphs contain useful information on user behavior when searching online. In this paper, we show how such information can be used effectively for the task of organizing user search histories into query groups. More specifically, we propose combining the two graphs into a query fusion graph. We further show that our approach, based on probabilistic random walks over the query fusion graph, outperforms time-based and keyword similarity-based approaches. We also find value in combining our method with keyword similarity-based methods, especially when there is insufficient usage information about the queries. As future work, we intend to investigate the usefulness of the knowledge gained from these query groups in various applications such as providing query suggestions and biasing the ranking of search results.

ACKNOWLEDGMENTS

This work was done while H. Hwang, H.W. Lauw, and L. Getoor were at Microsoft Research, Silicon Valley.

REFERENCES

[1] J. Teevan, E. Adar, R. Jones, and M.A.S. Potts, “Information Re-Retrieval: Repeat Queries in Yahoo’s Logs,” Proc. 30th Ann. Int’l ACM SIGIR Conf. Research and Development in Information Retrieval (SIGIR ’07), pp. 151-158, 2007.

[2] A. Broder, “A Taxonomy of Web Search,” SIGIR Forum, vol. 36, no. 2, pp. 3-10, 2002.

[3] A. Spink, M. Park, B.J. Jansen, and J. Pedersen, “Multitasking during Web Search Sessions,” Information Processing and Management, vol. 42, no. 1, pp. 264-275, 2006.

[4] R. Jones and K.L. Klinkner, “Beyond the Session Timeout: Automatic Hierarchical Segmentation of Search Topics in Query Logs,” Proc. 17th ACM Conf. Information and Knowledge Management (CIKM), 2008.



[5] P. Boldi, F. Bonchi, C. Castillo, D. Donato, A. Gionis, and S. Vigna, “The Query-Flow Graph: Model and Applications,” Proc. 17th ACM Conf. Information and Knowledge Management (CIKM), 2008.

[6] D. Beeferman and A. Berger, “Agglomerative Clustering of a Search Engine Query Log,” Proc. Sixth ACM SIGKDD Int’l Conf. Knowledge Discovery and Data Mining (KDD), 2000.

[7] R. Baeza-Yates and A. Tiberi, “Extracting Semantic Relations from Query Logs,” Proc. 13th ACM SIGKDD Int’l Conf. Knowledge Discovery and Data Mining (KDD), 2007.

[8] J. Han and M. Kamber, Data Mining: Concepts and Techniques. Morgan Kaufmann, 2000.

[9] W. Barbakh and C. Fyfe, “Online Clustering Algorithms,” Int’l J. Neural Systems, vol. 18, no. 3, pp. 185-194, 2008.

[10] M. Berry and M. Browne, eds., Lecture Notes in Data Mining. World Scientific Publishing Company, 2006.

[11] V.I. Levenshtein, “Binary Codes Capable of Correcting Deletions, Insertions and Reversals,” Soviet Physics Doklady, vol. 10, pp. 707-710, 1966.

[12] M. Sahami and T.D. Heilman, “A Web-Based Kernel Function for Measuring the Similarity of Short Text Snippets,” Proc. 15th Int’l Conf. World Wide Web (WWW ’06), pp. 377-386, 2006.

[13] J.-R. Wen, J.-Y. Nie, and H.-J. Zhang, “Query Clustering Using User Logs,” ACM Trans. Information Systems, vol. 20, no. 1, pp. 59-81, 2002.

[14] A. Fuxman, P. Tsaparas, K. Achan, and R. Agrawal, “Using the Wisdom of the Crowds for Keyword Generation,” Proc. 17th Int’l Conf. World Wide Web (WWW ’08), 2008.

[15] K. Avrachenkov, N. Litvak, D. Nemirovsky, and N. Osipova, “Monte Carlo Methods in PageRank Computation: When One Iteration Is Sufficient,” SIAM J. Numerical Analysis, vol. 45, no. 2, pp. 890-904, 2007.

[16] L. Page, S. Brin, R. Motwani, and T. Winograd, “The PageRank Citation Ranking: Bringing Order to the Web,” technical report, Stanford Univ., 1998.

[17] P. Boldi, M. Santini, and S. Vigna, “PageRank as a Function of the Damping Factor,” Proc. 14th Int’l Conf. World Wide Web (WWW ’05), 2005.

[18] T.H. Haveliwala, “Topic-Sensitive PageRank,” Proc. 11th Int’l Conf. World Wide Web (WWW ’02), 2002.

[19] W.M. Rand, “Objective Criteria for the Evaluation of Clustering Methods,” J. Am. Statistical Assoc., vol. 66, no. 336, pp. 846-850, 1971.

[20] D.D. Wackerly, W. Mendenhall III, and R.L. Scheaffer, Mathematical Statistics with Applications, sixth ed. Duxbury Advanced Series, 2002.

[21] P. Anick, “Using Terminological Feedback for Web Search Refinement: A Log-Based Study,” Proc. 26th Ann. Int’l ACM SIGIR Conf. Research and Development in Information Retrieval, 2003.

[22] B.J. Jansen, A. Spink, C. Blakely, and S. Koshman, “Defining a Session on Web Search Engines: Research Articles,” J. Am. Soc. for Information Science and Technology, vol. 58, no. 6, pp. 862-871, 2007.

[23] L.D. Catledge and J.E. Pitkow, “Characterizing Browsing Strategies in the World-Wide Web,” Computer Networks and ISDN Systems, vol. 27, no. 6, pp. 1065-1073, 1995.

[24] D. He, A. Goker, and D.J. Harper, “Combining Evidence for Automatic Web Session Identification,” Information Processing and Management, vol. 38, no. 5, pp. 727-742, 2002.

[25] R. Jones and F. Diaz, “Temporal Profiles of Queries,” ACM Trans. Information Systems, vol. 25, no. 3, p. 14, 2007.

[26] A.L. Montgomery and C. Faloutsos, “Identifying Web Browsing Trends and Patterns,” Computer, vol. 34, no. 7, pp. 94-95, July 2001.

[27] C. Silverstein, H. Marais, M. Henzinger, and M. Moricz, “Analysis of a Very Large Web Search Engine Query Log,” SIGIR Forum, vol. 33, no. 1, pp. 6-12, 1999.

[28] H.C. Ozmutlu and F. Cavdur, “Application of Automatic Topic Identification on Excite Web Search Engine Data Logs,” Information Processing and Management, vol. 41, no. 5, pp. 1243-1262, 2005.

[29] T. Lau and E. Horvitz, “Patterns of Search: Analyzing and Modeling Web Query Refinement,” Proc. Seventh Int’l Conf. User Modeling (UM), 1999.

[30] F. Radlinski and T. Joachims, “Query Chains: Learning to Rank from Implicit Feedback,” Proc. ACM Conf. Knowledge Discovery and Data Mining (KDD), 2005.

[31] J. Yi and F. Maghoul, “Query Clustering Using Click-through Graph,” Proc. 18th Int’l Conf. World Wide Web (WWW ’09), 2009.

[32] E. Sadikov, J. Madhavan, L. Wang, and A. Halevy, “Clustering Query Refinements by User Intent,” Proc. 19th Int’l Conf. World Wide Web (WWW ’10), 2010.

[33] T. Radecki, “Output Ranking Methodology for Document-Clustering-Based Boolean Retrieval Systems,” Proc. Eighth Ann. Int’l ACM SIGIR Conf. Research and Development in Information Retrieval, pp. 70-76, 1985.

[34] V.R. Lesser, “A Modified Two-Level Search Algorithm Using Request Clustering,” Report No. ISR-11 to the Nat’l Science Foundation, Section 7, Dept. of Computer Science, Cornell Univ., 1966.

[35] R. Baeza-Yates, “Graphs from Search Engine Queries,” Proc. 33rd Conf. Current Trends in Theory and Practice of Computer Science (SOFSEM), vol. 4362, pp. 1-8, 2007.

[36] K. Collins-Thompson and J. Callan, “Query Expansion Using Random Walk Models,” Proc. 14th ACM Int’l Conf. Information and Knowledge Management (CIKM), 2005.

[37] N. Craswell and M. Szummer, “Random Walks on the Click Graph,” Proc. 30th Ann. Int’l ACM SIGIR Conf. Research and Development in Information Retrieval (SIGIR ’07), 2007.

Heasoo Hwang received the PhD degree in computer science from the University of California at San Diego. Her main research interests include effective and efficient search over large-scale graph-structured data. She is a research staff member at the Samsung Advanced Institute of Technology.

Hady W. Lauw received the PhD degree in computer science from Nanyang Technological University in 2008 on an A*STAR graduate fellowship. He is a researcher at the Institute for Infocomm Research in Singapore. Previously, he was a postdoctoral researcher at Microsoft Research Silicon Valley.

Lise Getoor received the PhD degree in computer science from Stanford University. She is an associate professor at the University of Maryland, College Park. Her research interests include machine learning and reasoning under uncertainty, with applications to information integration, database management, and social media.

Alexandros Ntoulas received the PhD degree in computer science from the University of California, Los Angeles. He is a researcher at Microsoft Research, Silicon Valley. His research interests include systems and algorithms that facilitate the monitoring, collection, management, mining, and searching of information on the web.

For more information on this or any other computing topic, please visit our Digital Library at www.computer.org/publications/dlib.
