query suggestion


Upload: gabe

Post on 24-Feb-2016



TRANSCRIPT

Page 1: Query Suggestion

Query Suggestion

Naama Kraus

Slides are based on the papers:
• Baeza-Yates, Hurtado, Mendoza, Improving search engines by query clustering
• Boldi, Bonchi, Castillo, Donato, Vigna, The Query Flow Graph: Model and Applications

Page 2: Query Suggestion

The Problem

• User queries are an imperfect description of their information needs
• Examples:
– Ambiguous queries: jaguar
– General queries: haifa
– Terminology differences (synonyms) between user and corpus: stars - planets

Page 3: Query Suggestion

Query Suggestions

Assist the user to phrase her information need

jaguar

Jaguar car
Jaguar xf
Jaguar animal
Jaguar cat

Page 4: Query Suggestion

Example: Google Related Searches

Page 5: Query Suggestion

Query suggestion algorithms

• Query suggestions are extracted from the query log
– There are methods that use different data sources such as a corpus, not covered today
• Topic (cluster) based – identify groups of similar queries
• Sequence based – mine and analyze the query log for likely query sequences

Page 6: Query Suggestion

Improving Search Engines by Query Clustering - Baeza-Yates et al.

• Algorithm outline
• Offline:
– Represent queries as term weighted vectors
– Cluster queries
– Rank queries in each cluster
• Online:
– Given user’s query q
– Find cluster C containing q
– Suggest top k queries in cluster C
  • Based on their rank and similarity to q

Page 7: Query Suggestion

Query Model

• Given query q
• Let U be the set of URLs clicked for q (for all users and sessions)
– Information is extracted from the query log
• q’s term weighted vector has a nonzero entry for any term that appears in some URL in U
• Terms are weighted according to
– Term frequency and URL popularity
– Formula in next slide …

Page 8: Query Suggestion

Query Model (2)

• Pop(u,q) – the number of clicks on u for the query q

Note: the paper proposes a refinement to Pop(u,q) which is not biased by the search engine’s ranking

Query similarity is computed by some measure, e.g. cosine similarity.
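The query model can be sketched in code. This is a minimal illustration, not the paper's implementation: it assumes the weight of term t in q's vector is the sum over clicked URLs u of Pop(u,q) · Tf(t,u) / max Tf(·,u), which fits the description "term frequency and URL popularity" but should be checked against the paper. The function names, URLs, and click counts are made-up toy data.

```python
import math
from collections import Counter

def query_vector(clicks, url_terms):
    """Build a term-weighted vector for one query.

    clicks:    {url: Pop(u, q)}, number of clicks on url for the query
    url_terms: {url: list of terms appearing in that page}
    Assumed weighting: sum over u of Pop(u, q) * Tf(t, u) / max Tf(., u).
    """
    vec = Counter()
    for url, pop in clicks.items():
        tf = Counter(url_terms[url])
        if not tf:
            continue  # page with no terms contributes nothing
        max_tf = max(tf.values())
        for term, freq in tf.items():
            vec[term] += pop * freq / max_tf
    return dict(vec)

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Toy data (hypothetical pages and click counts).
url_terms = {
    "cars.example/jaguar": ["jaguar", "car", "car", "dealer"],
    "zoo.example/jaguar": ["jaguar", "cat", "animal"],
}
v_car = query_vector({"cars.example/jaguar": 3}, url_terms)
v_animal = query_vector({"zoo.example/jaguar": 2}, url_terms)
v_mixed = query_vector(
    {"cars.example/jaguar": 1, "zoo.example/jaguar": 1}, url_terms
)
```

Two queries whose users click pages with overlapping vocabulary end up with similar vectors even if the query strings share no words, which is the point of representing a query by its clicked URLs rather than by its own terms.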

Page 9: Query Suggestion

Query Support

• The fraction of the documents returned by the query that captured the attention of users (clicked documents)
• Denotes how ‘good’ a query is
– A ‘global score’
• Queries within a cluster are ranked according to their similarity to q as well as their support
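A small sketch of support and in-cluster ranking follows. The slides do not say how similarity and support are combined, so the product used here is only an assumption chosen for illustration; the function names and numbers are hypothetical.

```python
def support(returned_urls, clicked_urls):
    """Fraction of the returned documents that were clicked."""
    returned = set(returned_urls)
    if not returned:
        return 0.0
    return len(returned & set(clicked_urls)) / len(returned)

def rank_cluster(similarity, supports, k=3):
    """Rank the queries of one cluster for an input query q.

    similarity: {candidate: similarity to q}
    supports:   {candidate: support of the candidate}
    Assumption: the combined score is similarity * support.
    """
    scored = {c: similarity[c] * supports.get(c, 0.0) for c in similarity}
    # Sort by descending score, breaking ties alphabetically.
    return sorted(scored, key=lambda c: (-scored[c], c))[:k]

# Toy data: 2 of the 4 returned documents were clicked.
sup = support(["u1", "u2", "u3", "u4"], ["u1", "u3"])

ranked = rank_cluster(
    {"jaguar car": 0.9, "jaguar xf": 0.8, "jaguar cat": 0.4},
    {"jaguar car": 0.5, "jaguar xf": 0.7, "jaguar cat": 0.9},
    k=2,
)
```

Any monotone combination (a weighted sum, or sorting by one score and filtering by the other) would fit the slide equally well.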

Page 10: Query Suggestion

Query Flow Graph – Boldi et al.

• Main idea:
• Aggregate the (massive) raw data in the query log
– Many queries of many users
• Model user query behavior
• Use sophisticated techniques to infer query relatedness

Page 11: Query Suggestion

Query Flow Graph Model

• G=(V, E, w) a directed graph where:
• V – nodes, representing a distinct set of queries Q
– Queries are extracted from the query log
• A set of directed edges E
• Two queries q,q’ are connected with an edge if q’ follows q in at least one session
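As an illustration, the edge set can be collected from query-log sessions as below. The sketch assumes a session is an ordered list of query strings and that "follows" means "immediately follows"; the session data is made up.

```python
from collections import defaultdict

def build_edge_counts(sessions):
    """Count how often each query q2 immediately follows q1 in a session.

    Returns {q1: {q2: count}}; an edge (q1, q2) exists when count >= 1.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for session in sessions:
        # Pair every query with its successor in the same session.
        for q, q_next in zip(session, session[1:]):
            counts[q][q_next] += 1
    return {q: dict(nxt) for q, nxt in counts.items()}

# Toy sessions (hypothetical).
sessions = [
    ["apple", "apple ipod", "apple store"],
    ["apple", "apple store"],
    ["apple", "apple ipod"],
]
counts = build_edge_counts(sessions)
```

Queries that only ever end a session, such as "apple store" here, are nodes with no outgoing edges.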

Page 12: Query Suggestion

QFG Illustration

[Figure: a directed graph over the nodes q0, q1, q2, q3, q4, q5; two example nodes are labeled "apple ipod" and "apple store"]

Nodes are queries
Edges connect queries

Page 13: Query Suggestion

Weighting Function

• w : E -> (0..1] a weighting function that assigns a weight to every edge (q,q’)

• For each edge (q,q’) assign a probability that q’ follows q in the same session– Extracted from the observed query log sessions

$$w(q, q') = \frac{\mathrm{count}(q, q')}{d(q)}, \qquad d(q) = \sum_{q'} \mathrm{count}(q, q')$$
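A minimal sketch of the weighting function, assuming w(q,q’) = count(q,q’) / d(q), where d(q) is the total number of observed transitions out of q. The input format and the example counts are hypothetical.

```python
def edge_weights(counts):
    """Turn transition counts {q: {q2: count}} into probabilities.

    Each outgoing count is divided by d(q), the sum of all outgoing
    counts of q, so the weights on q's outgoing edges sum to 1.
    """
    weights = {}
    for q, nxt in counts.items():
        d_q = sum(nxt.values())
        weights[q] = {q2: c / d_q for q2, c in nxt.items()}
    return weights

# Toy counts (made up): "apple" was followed 4 times in total.
counts = {
    "apple": {"apple ipod": 2, "apple store": 1, "apple pie": 1},
    "apple ipod": {"apple store": 1},
}
w = edge_weights(counts)
```

Because every weight is a positive count divided by a total that includes it, each weight lies in (0, 1], matching the range of w.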

Page 14: Query Suggestion

Illustration

[Figure: the query flow graph over the nodes q0 to q5 with edge weights 0.5, 0.25, 0.25, 0.1, 0.55, 0.35, 0.2, 0.8, 1.0 and 1.0]

Page 15: Query Suggestion

Random walk on the QFG

• A random surfer executes a random walk on the graph as follows:
– Start at some node
– Move along an edge with probability d
  • Choose the edge according to its probability (weight)
– Or teleport to a random node with probability 1-d
  • Choose the node uniformly
• The stationary distribution
– The probability of being at node q in the limit of an infinitely long walk
– Random walk score vector – query absolute scores
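The stationary distribution of such a walk can be approximated by power iteration. The sketch below is one common way to do it, not necessarily the paper's; d = 0.85, the iteration count, the uniform handling of nodes without outgoing edges, and the three-node graph are all assumptions.

```python
def stationary_distribution(weights, nodes, d=0.85, iters=100):
    """Approximate the absolute random-walk score of every node.

    weights: {q: {q2: w(q, q2)}} with outgoing weights summing to 1
    d:       probability of following an edge (1 - d is the teleport)
    """
    n = len(nodes)
    score = {q: 1.0 / n for q in nodes}
    for _ in range(iters):
        # Teleport mass is spread uniformly over all nodes.
        nxt = {q: (1.0 - d) / n for q in nodes}
        for q in nodes:
            out = weights.get(q)
            if out:
                for q2, w in out.items():
                    nxt[q2] += d * score[q] * w
            else:
                # Assumption: a node with no outgoing edge jumps uniformly.
                for q2 in nodes:
                    nxt[q2] += d * score[q] / n
        score = nxt
    return score

# Toy graph (hypothetical node names and weights).
nodes = ["a", "b", "c"]
weights = {"a": {"b": 0.5, "c": 0.5}, "b": {"c": 1.0}, "c": {"a": 1.0}}
absolute = stationary_distribution(weights, nodes)
```

In this toy graph node c collects probability from both a and b, so it ends up with a higher absolute score than b.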

Page 16: Query Suggestion

Random Walk Relative to a Node

• Random walk with restart to a single node:
– Start at node q
– Instead of teleporting to any node, always teleport to q
• The score of node q’ for this random walk measures the relatedness of q’ to q
– The probability of being at q’ in the limit of an infinitely long walk that restarts at q
– Can normalize a node’s relative score by its absolute score; somewhat similar to tf×idf – avoids highly popular queries that are unrelated to q
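The relative walk differs from the absolute one only in where the teleport lands. A minimal sketch under assumed parameters (d = 0.85, a fixed number of iterations, nodes without outgoing edges restarting at q) is shown below; the graph and the absolute scores passed to `normalize` are made-up toy values.

```python
def random_walk_with_restart(weights, nodes, start, d=0.85, iters=100):
    """Score every node relative to `start` (always teleport to start)."""
    score = {q: 0.0 for q in nodes}
    score[start] = 1.0
    for _ in range(iters):
        nxt = {q: 0.0 for q in nodes}
        nxt[start] = 1.0 - d  # all teleport mass goes to the start node
        for q in nodes:
            out = weights.get(q)
            if out:
                for q2, w in out.items():
                    nxt[q2] += d * score[q] * w
            else:
                # Assumption: a node with no outgoing edge restarts at start.
                nxt[start] += d * score[q]
        score = nxt
    return score

def normalize(relative, absolute):
    """Divide each relative score by the node's absolute score."""
    return {q: relative[q] / absolute[q]
            for q in relative if absolute.get(q, 0.0) > 0}

# Toy graph with two unconnected topics (hypothetical).
weights = {
    "apple": {"apple ipod": 0.5, "apple store": 0.5},
    "apple ipod": {"apple store": 1.0},
    "apple store": {"apple": 1.0},
    "weather": {"weather haifa": 1.0},
    "weather haifa": {"weather": 1.0},
}
nodes = list(weights)
rel = random_walk_with_restart(weights, nodes, start="apple")
```

Queries that cannot be reached from q, such as the weather queries here, get a relative score of zero, while the normalization step demotes queries that score high for every starting node simply because they are popular.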

Page 17: Query Suggestion

The Full Picture

• Off-line stage
– For each node q in the graph
  • Compute the stationary distribution vector of q
    – A random walk score relative to q
  • Store suggestions for q, alternatives:
    – top k scored nodes
    – nodes having a score above some threshold
• On-line stage
– User submits query q
– Suggest queries stored for q
  • Queries most related to q
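The two stages can be sketched as a precomputed lookup table. In this illustration the relative scorer is passed in as a function, and a hard-coded dictionary of made-up scores stands in for the random walk; the names `build_suggestion_table` and `suggest` are hypothetical.

```python
def build_suggestion_table(nodes, score_relative_to, k=None, threshold=None):
    """Off-line stage: store suggestions for every query in the graph.

    score_relative_to(q) must return {candidate: relative score}.
    Keep candidates scoring at least `threshold` (if given), then the
    top `k` of them (if given).
    """
    table = {}
    for q in nodes:
        scores = score_relative_to(q)
        cands = [(s, c) for c, s in scores.items() if c != q and s > 0]
        if threshold is not None:
            cands = [(s, c) for s, c in cands if s >= threshold]
        cands.sort(key=lambda sc: (-sc[0], sc[1]))
        if k is not None:
            cands = cands[:k]
        table[q] = [c for _, c in cands]
    return table

def suggest(table, query):
    """On-line stage: look up the stored suggestions for the query."""
    return table.get(query, [])

# Made-up relative scores standing in for the random-walk results.
toy_scores = {
    "apple": {"apple": 0.4, "apple ipod": 0.2, "apple store": 0.35,
              "weather": 0.0},
    "weather": {"weather": 0.6, "weather haifa": 0.4, "apple": 0.0},
}
table = build_suggestion_table(list(toy_scores), lambda q: toy_scores[q], k=2)
top = suggest(table, "apple")
```

All expensive computation happens off-line, so answering a user query is a single dictionary lookup; a query that never appeared in the log simply has no stored suggestions.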