query suggestion

Query Suggestion

Naama Kraus

Slides are based on the papers:
Baeza-Yates, Hurtado, Mendoza, Improving Search Engines by Query Clustering
Boldi, Bonchi, Castillo, Donato, Vigna, The Query Flow Graph: Model and Applications

The Problem
• User queries are an imperfect description of their information needs
• Examples:
  – Ambiguous queries: jaguar
  – General queries: haifa
  – Terminology differences (synonyms) between user and corpus: stars - planets

Query Suggestions

Assist the user in phrasing her information need

Query: jaguar
Suggestions: Jaguar car, Jaguar xf, Jaguar animal, Jaguar cat

Example: Google Related Searches

Query suggestion algorithms
• Query suggestions are extracted from the query log
  – There are methods that use different data sources, such as a corpus; these are not covered today
• Topic (cluster) based – identify groups of similar queries
• Sequence based – mine and analyze the query log for likely query sequences

Improving Search Engines by Query Clustering - Baeza-Yates et al.

• Algorithm outline
• Offline:
  – Represent queries as term-weighted vectors
  – Cluster queries
  – Rank queries in each cluster
• Online:
  – Given user’s query q
  – Find cluster C containing q
  – Suggest top k queries in cluster C
    • Based on their rank and similarity to q

Query Model

• Given query q
• Let U be the set of URLs clicked for q (for all users and sessions)
  – Information is extracted from the query log
• q’s term-weighted vector has a nonzero entry for any term that appears in some URL in U
• Terms are weighted according to
  – Term frequency and URL popularity
  – Formula in next slide …

Query Model (2)

• Pop(u,q) – the number of clicks on u for the query q
• Note: the paper proposes a refinement to Pop(u,q) which is not biased by the search engine’s ranking

Query similarity is computed by some measure, e.g. cosine similarity.
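The query model above can be illustrated with a minimal Python sketch. The exact weighting formula is not preserved in these notes, so the weight used here (click count times term frequency normalized by the URL's maximum term frequency) is an assumption that only follows the description "term frequency and URL popularity". The names `query_vector`, `cosine`, `clicks` and `url_terms` are hypothetical.

```python
import math
from collections import Counter, defaultdict


def query_vector(clicks, url_terms):
    """Build a sparse term-weighted vector for one query.

    clicks:    dict url -> number of clicks on that URL for the query, i.e. Pop(u, q)
    url_terms: dict url -> list of terms appearing in that URL's text
    """
    vec = defaultdict(float)
    for url, pop in clicks.items():
        tf = Counter(url_terms.get(url, []))
        if not tf:
            continue
        max_tf = max(tf.values())
        for term, freq in tf.items():
            # Assumed weighting: URL popularity times normalized term frequency.
            vec[term] += pop * freq / max_tf
    return dict(vec)


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
```

Two queries that share no terms can still be similar under this model if users clicked on URLs with overlapping vocabulary, which is the point of building the vector from clicked URLs rather than from the query text.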

Query Support

• The fraction of the documents returned by the query that captured the attention of users (clicked documents)

• Denotes how ‘good’ a query is
  – A ‘global score’

• Queries within a cluster are ranked according to their similarity to q as well as their support
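A small sketch of ranking within a cluster, under stated assumptions: the slides say only that both similarity and support are used, so combining them by a simple product is an assumption made here for illustration. The function names and the `similarity` callback are hypothetical.

```python
def support(returned_urls, clicked_urls):
    """Fraction of the documents returned for a query that were clicked."""
    returned = set(returned_urls)
    if not returned:
        return 0.0
    return len(returned & set(clicked_urls)) / len(returned)


def suggest(q, cluster, similarity, supports, k=5):
    """Return the top k queries of q's cluster, excluding q itself.

    similarity: function (q, candidate) -> float, e.g. cosine similarity
    supports:   dict candidate -> support value
    """
    candidates = [c for c in cluster if c != q]
    # Assumed combination: similarity to q multiplied by the global support.
    candidates.sort(key=lambda c: similarity(q, c) * supports[c], reverse=True)
    return candidates[:k]
```

A product lets a well-supported query outrank a slightly more similar one that users rarely found useful; a weighted sum would be an equally plausible choice.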

Query Flow Graph – Boldi et al.

• Main idea:
• Aggregate the (massive) raw data in the query log
  – Many queries of many users
• Model user query behavior
• Use sophisticated techniques to infer query relatedness

Query Flow Graph Model

• G=(V, E, w) a directed graph where:
• V – nodes, representing a distinct set of queries Q
  – Queries are extracted from the query log
• E – a set of directed edges
• Two queries q,q’ are connected with an edge if q’ follows q in at least one session

QFG Illustration

[Figure: example query flow graph with nodes q0 to q5. Nodes are queries; edges connect queries. Example node labels: "apple ipod", "apple store".]

Weighting Function

• w : E -> (0..1] a weighting function that assigns a weight to every edge (q,q’)

• For each edge (q,q’), assign a probability that q’ follows q in the same session
  – Extracted from the observed query log sessions

$$ w(q,q') = \frac{\mathrm{count}(q,q')}{d(q)}, \qquad d(q) = \sum_{q'} \mathrm{count}(q,q') $$
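The weighting function can be sketched in a few lines of Python. This is a minimal illustration that assumes a session is an ordered list of query strings and that "q’ follows q" means q’ is the next query in the same session; `build_qfg` and `sessions` are hypothetical names.

```python
from collections import defaultdict


def build_qfg(sessions):
    """Return edge weights as a nested dict: weights[q][q_next] = w(q, q_next)."""
    counts = defaultdict(lambda: defaultdict(int))
    for session in sessions:
        # Count each consecutive pair (q, q') within one session.
        for q, q_next in zip(session, session[1:]):
            counts[q][q_next] += 1

    weights = {}
    for q, out in counts.items():
        d_q = sum(out.values())  # d(q): total count over q's outgoing edges
        weights[q] = {q_next: c / d_q for q_next, c in out.items()}
    return weights


sessions = [
    ["apple", "apple ipod"],
    ["apple", "apple store"],
    ["apple", "apple ipod", "apple store"],
]
qfg = build_qfg(sessions)
```

By construction the outgoing weights of every node with at least one successor sum to 1, so each row can be read as a transition probability distribution.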

Illustration

[Figure: the example graph over q0 to q5 with edge weights. Weights shown: 0.5, 0.25, 0.25; 0.1, 0.55, 0.35; 0.2, 0.8; 1.0; 1.0.]

Random walk on the QFG

• A random surfer executes a random walk on the graph as follows:
  – Start at some node
  – Move along an edge with probability d
    • Choose an edge by its probability (weight)
  – Or teleport to a random node with probability 1-d
    • Choose the node uniformly
• The stationary distribution
  – The probability of being at node q in the limit of an infinitely long walk
  – Random walk score vector – query absolute scores

Random Walk Relative to a Node

• Random walk with restart to a single node:
  – Start at node q
  – Instead of teleporting to any node, always teleport to q
• The score of node q’ for this random walk measures relatedness of q’ to q
  – The probability of getting from q to q’ in the limit of an infinitely long walk
  – Can normalize a node’s relative score by its absolute score; somewhat similar to tf-idf – avoids highly popular queries (not related to q)
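Both random walks can be approximated by power iteration. The sketch below is illustrative only: the damping value d=0.85, the fixed iteration count, the handling of nodes without outgoing edges, and normalization by plain division are all assumptions not stated in the slides. `random_walk_scores` and `relative_scores` are hypothetical names, and `weights` is a nested dict `weights[q][q_next]` of edge probabilities.

```python
def random_walk_scores(weights, d=0.85, restart=None, iterations=100):
    """Approximate the stationary distribution of the walk by power iteration.

    restart=None  -> teleport uniformly to any node (absolute scores)
    restart=q     -> always teleport back to q (scores relative to q)
    """
    nodes = set(weights)
    for out in weights.values():
        nodes.update(out)
    nodes = sorted(nodes)

    if restart is None:
        jump = {n: 1.0 / len(nodes) for n in nodes}
    else:
        jump = {n: 1.0 if n == restart else 0.0 for n in nodes}

    scores = dict(jump)
    for _ in range(iterations):
        nxt = {n: (1.0 - d) * jump[n] for n in nodes}
        for n in nodes:
            out = weights.get(n)
            if out:
                for m, w in out.items():
                    nxt[m] += d * scores[n] * w
            else:
                # Assumption: a node with no outgoing edges teleports instead.
                for m in nodes:
                    nxt[m] += d * scores[n] * jump[m]
        scores = nxt
    return scores


def relative_scores(weights, q, d=0.85):
    """Relative score of each node with respect to q, divided by its absolute score."""
    absolute = random_walk_scores(weights, d)
    relative = random_walk_scores(weights, d, restart=q)
    return {n: relative[n] / absolute[n] for n in relative if n != q}
```

Dividing by the absolute score penalizes queries that are popular regardless of the starting point, which is the tf-idf-like effect described above; other normalizations are possible.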

The Full Picture

• Off-line stage
  – For each node q in the graph
    • Compute the stationary distribution vector of q
      – A random walk score relative to q
    • Store suggestions for q, alternatives:
      – top k scored nodes
      – nodes having a score above some threshold
• On-line stage
  – User submits query q
  – Suggest queries stored for q
    • Queries most related to q
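The two storage alternatives and the on-line lookup can be sketched as follows. This assumes the relative scores for each query have already been computed off-line and are available as a dict; `select_suggestions`, `build_suggestion_table` and `lookup` are hypothetical names.

```python
def select_suggestions(scores, q, k=None, threshold=None):
    """Pick suggestions for q from its relative scores.

    scores:    dict node -> score relative to q
    k:         keep only the top k scored nodes (if given)
    threshold: keep only nodes scoring above this value (if given)
    """
    candidates = {n: s for n, s in scores.items() if n != q}
    if threshold is not None:
        candidates = {n: s for n, s in candidates.items() if s > threshold}
    ranked = sorted(candidates, key=candidates.get, reverse=True)
    return ranked if k is None else ranked[:k]


def build_suggestion_table(scores_by_query, k=5):
    """Off-line stage: store the top k suggestions for every query."""
    return {q: select_suggestions(s, q, k=k) for q, s in scores_by_query.items()}


def lookup(table, q):
    """On-line stage: return the stored suggestions, or none for an unseen query."""
    return table.get(q, [])
```

Because all random walks are run off-line, the on-line cost is a single dictionary lookup; queries that never appeared in the log simply get no suggestions in this sketch.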
