TRANSCRIPT
Exploring Session Context using Distributed
Representations of Queries and Reformulations
Bhaskar Mitra
Microsoft
(Paper: http://research.microsoft.com/apps/pubs/default.aspx?id=244728)
Intuitively, the following query reformulation (or intent shift)
is similar to
london → things to do in london
new york → new york tourist attractions
Questions
• Can we learn intuitively “meaningful” vector representations for query reformulations?
• Can we use them to model session context for tasks such as query auto-completion (QAC)?
Session Context for QAC
[Illustration: after the previous query "muscle cars", the prefix "f" should promote automobile-related completions ("ford", "ford mustang", "fast and furious") over globally popular ones ("fandango", "forever 21", "fox news").]
What’s the more likely query after “big ben”?
Topical disambiguation (symmetric) vs. transition likelihood (asymmetric)
big ben
big ben height
london clock tower
Distributed Representation
A (low-dimensional) vector representation for items (e.g., words, sentences, images, etc.) such that all the values in a vector are necessary to determine the exact item.
Also called embeddings.
Imaginary example: [ 6 3 0 4 1 7 2 8 ]
As opposed to…
One-hot representation scheme, where all except one of the values of the vector are zeros.
Imaginary example: [ 0 1 0 0 0 0 0 0 ]
For Neural Networks…
Localist Representations
• One neuron to represent each item
• One-to-one relationship
• For few items / classes only
Distributed Representations
• Multiple neurons to represent each item
• Many-to-many relationship
• For many items with shared attributes
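The contrast between the two schemes can be sketched as follows (the toy vocabulary, dimensionality, and random embeddings are illustrative, not from the paper):

```python
import numpy as np

vocab = ["london", "new york", "big ben", "ford"]

# Localist / one-hot: one dimension per item, exactly one non-zero value.
def one_hot(word):
    v = np.zeros(len(vocab))
    v[vocab.index(word)] = 1.0
    return v

# Distributed: a dense low-dimensional vector; every value carries
# information, and similar items can share attribute dimensions.
rng = np.random.default_rng(0)
embedding = {w: rng.normal(size=4) for w in vocab}

print(one_hot("big ben"))    # a single 1 among zeros
print(embedding["big ben"])  # a dense 4-d vector
```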
Vector Algebra on Word Embeddings
Word2vec linguistic regularities:
vector(“king”) – vector(“man”) + vector(“woman”) = vector(“queen”)
T. Mikolov, et al. Efficient estimation of word representations in vector space. arXiv preprint, 2013.
T. Mikolov, et al. Distributed representations of words and phrases and their compositionality. NIPS, 2013.
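The analogy arithmetic above can be sketched with toy vectors (the 3-d embeddings below are hand-constructed so the offset regularity holds; real word2vec vectors are learned from large corpora):

```python
import numpy as np

# Hand-made toy embeddings: dims loosely read as (royalty, male, female).
vec = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "man":   np.array([0.1, 0.8, 0.1]),
    "woman": np.array([0.1, 0.1, 0.9]),
    "queen": np.array([0.9, 0.1, 0.9]),
    "ford":  np.array([0.0, 0.5, 0.2]),  # distractor word
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# king - man + woman, then find the nearest remaining word by cosine.
target = vec["king"] - vec["man"] + vec["woman"]
best = max((w for w in vec if w not in {"king", "man", "woman"}),
           key=lambda w: cosine(target, vec[w]))
print(best)  # queen
```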
Convolutional Latent Semantic Model
• DNN trained on clickthrough data
• Maximize cosine similarity
• Tri-gram hashing over raw terms
• Convolutional-pooling structure
P.-S. Huang, et al. Learning deep structured semantic models for web search using clickthrough data. CIKM, 2013.
Y. Shen, et al. Learning semantic representations using convolutional neural networks for web search. WWW, 2014.
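The tri-gram hashing step mentioned above maps each raw term to its letter trigrams, padded with boundary marks, which keeps the input layer compact and robust to unseen words; a minimal sketch:

```python
def letter_trigrams(term):
    """Letter-trigram hashing as in DSSM/CLSM: pad the term with
    boundary marks '#' and take every 3-character window."""
    padded = "#" + term + "#"
    return [padded[i:i + 3] for i in range(len(padded) - 2)]

print(letter_trigrams("cat"))  # ['#ca', 'cat', 'at#']
```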
Main Contributions
• CLSM models trained on Session Pairs (SP)
• Demonstrate semantic regularities in the CLSM query embedding space
• Leverage the regularities to explicitly represent query reformulations as vectors
• Improved Mean Reciprocal Rank (MRR) for session context-aware QAC ranking by more than 10% using CLSM-based features
Training on Session Pairs
• Pairs of consecutive queries from search sessions
• Pre-Query and Post-Query model
• Symmetric vs. Asymmetric models
q1 → q2 → q3 → q4  (pairs: q1→q2, q2→q3, q3→q4)
Advantages
1. Demonstrates higher levels of reformulation regularities (discussed next)
2. Train on time-stamped query log, no need for clickthrough data
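Extracting session pairs from a query log can be sketched as below (session segmentation, e.g. by an inactivity timeout, is assumed to have been done already):

```python
def session_pairs(sessions):
    """Yield (pre-query, post-query) pairs of consecutive queries.
    A session [q1, q2, q3, q4] yields (q1, q2), (q2, q3), (q3, q4)."""
    for session in sessions:
        for pre, post in zip(session, session[1:]):
            yield pre, post

pairs = list(session_pairs([
    ["london", "things to do in london"],
    ["big ben", "big ben height", "london clock tower"],
]))
print(pairs)
# [('london', 'things to do in london'),
#  ('big ben', 'big ben height'),
#  ('big ben height', 'london clock tower')]
```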
Embeddings for Query Reformulations
Explicit vector representation
k-means clustering of 65K in-session query pairs shows intuitive clusters
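A sketch of the explicit reformulation representation, assuming (as the vector-offset regularity suggests) that a reformulation is the difference between the post-query and pre-query embeddings; the toy embeddings below are made up, whereas the paper derives them from a session-pair-trained CLSM:

```python
import numpy as np

# Made-up 3-d query embeddings standing in for CLSM outputs.
q = {
    "london": np.array([0.8, 0.1, 0.0]),
    "things to do in london": np.array([0.8, 0.9, 0.0]),
    "new york": np.array([0.1, 0.1, 0.7]),
    "new york tourist attractions": np.array([0.1, 0.9, 0.7]),
}

def reformulation(pre, post):
    """Explicit reformulation vector: offset between query embeddings."""
    return q[post] - q[pre]

r1 = reformulation("london", "things to do in london")
r2 = reformulation("new york", "new york tourist attractions")

# Similar intent shifts yield nearly identical offset vectors, which is
# why k-means over such vectors groups reformulations into intuitive clusters.
print(np.allclose(r1, r2))  # True
```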
Session Context-Aware QAC
• Evaluation setup based on Shokouhi (SIGIR, 2013)
• Temporally separated background, train, validation and test sets
• Sample queries and extract all possible prefixes
• Submitted query as ground truth
• Re-rank top N suggestion candidates using a LambdaMART model
• Two testbeds: search logs from AOL & Bing
M. Shokouhi. Learning to personalize query auto-completion. SIGIR, 2013.
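The MRR metric used for the re-ranking evaluation can be sketched as follows (the candidate lists and ground truths here are illustrative, not from the testbeds):

```python
def mrr(ranked_lists, ground_truths):
    """Mean Reciprocal Rank: reciprocal of the rank of the submitted
    query in each re-ranked suggestion list (0 if absent), averaged."""
    total = 0.0
    for ranked, truth in zip(ranked_lists, ground_truths):
        total += 1.0 / (ranked.index(truth) + 1) if truth in ranked else 0.0
    return total / len(ground_truths)

score = mrr([["fandango", "ford mustang", "fox news"],
             ["big ben height", "big ben"]],
            ["ford mustang", "big ben height"])
print(score)  # (1/2 + 1/1) / 2 = 0.75
```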
Features
Non-contextual features: prefix length, suggestion length, vowels-to-alphabet ratio, contains numeric, etc.
N-gram similarity features: character trigram similarity between previous queries and the suggestion candidate
Pairwise frequency feature: pairwise frequency based on popular session pairs in the background data
CLSM topical similarity features: CLSM similarity between previous queries and the suggestion candidate
CLSM reformulation features: values along each dimension of the reformulation vector based on the previous query and the suggestion candidate
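The n-gram similarity feature could be computed, for instance, as Jaccard overlap between character-trigram sets; the exact similarity function used in the paper may differ, so treat this as a hedged stand-in:

```python
def char_trigrams(text):
    """Set of all 3-character windows of the string."""
    return {text[i:i + 3] for i in range(len(text) - 2)}

def trigram_similarity(prev_query, candidate):
    """Jaccard similarity between character-trigram sets of the
    previous query and a suggestion candidate."""
    a, b = char_trigrams(prev_query), char_trigrams(candidate)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)

print(trigram_similarity("big ben", "big ben height"))
```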
Summary of Contributions
• CLSM models trained on Session Pairs (SP)
• Demonstrate semantic regularities in the CLSM query embedding space
• Leverage the regularities to explicitly represent query reformulations as vectors
• Improved Mean Reciprocal Rank (MRR) for session context-aware QAC ranking by more than 10% using CLSM-based features
Potential Future Work
• Studying search trails (White et al.) in the embedding space
• Query change retrieval model (Guan et al.) using reformulation embeddings
• Generating user embeddings for search personalization
• Study how reformulations vary by user expertise and device