TRANSCRIPT
Exploring Session Context using Distributed
Representations of Queries and Reformulations
Bhaskar Mitra
Microsoft
(Paper: http://research.microsoft.com/apps/pubs/default.aspx?id=244728)
Intuitively, the following query reformulation (or intent shift)
is similar to
london → things to do in london
new york → new york tourist attractions
Questions
• Can we learn intuitively “meaningful” vector representations for query reformulations?
• Can we use them to model session context for tasks such as query auto-completion (QAC)?
Session Context for QAC
[Illustration: after the previous query "muscle cars", the prefix "f" should promote automobile-related completions ("ford", "ford mustang", "fast and furious") over globally popular ones ("fandango", "forever 21", "fox news").]
What’s the more likely query after “big ben”?
Topical disambiguation (symmetric) vs. transition likelihood (asymmetric)
big ben
big ben height
london clock tower
Distributed Representation
A (low-dimensional) vector representation for items (e.g., words, sentences, images, etc.) such that all the values in a vector are necessary to determine the exact item.
Also called embeddings.
Imaginary example: [ 6 3 0 4 1 7 2 8 ]
As opposed to…
One-hot representation scheme, where all except one of the values of the vector are zeros.
Imaginary example: [ 0 1 0 0 0 0 0 0 ]
For Neural Networks…
Localist Representations
• One neuron to represent each item
• One-to-one relationship
• For few items / classes only
Distributed Representations
• Multiple neurons to represent each item
• Many-to-many relationship
• For many items with shared attributes
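The contrast between the two schemes can be sketched as follows (the toy vocabulary, dimensionality, and random embeddings are illustrative, not from the paper):

```python
import numpy as np

vocab = ["london", "new york", "big ben", "ford"]

# Localist / one-hot: one dimension per item, exactly one non-zero value.
def one_hot(word):
    v = np.zeros(len(vocab))
    v[vocab.index(word)] = 1.0
    return v

# Distributed: a dense low-dimensional vector; every value carries
# information, and similar items can share attribute dimensions.
rng = np.random.default_rng(0)
embedding = {w: rng.normal(size=4) for w in vocab}

print(one_hot("big ben"))    # a single 1 among zeros
print(embedding["big ben"])  # a dense 4-d vector
```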
Vector Algebra on Word Embeddings
Word2vec linguistic regularities:
vector(“king”) – vector(“man”) + vector(“woman”) = vector(“queen”)
T. Mikolov, et al. Efficient estimation of word representations in vector space. arXiv preprint, 2013.
T. Mikolov, et al. Distributed representations of words and phrases and their compositionality. NIPS, 2013.
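The analogy arithmetic above can be sketched with toy vectors (the 3-d embeddings below are hand-constructed so the offset regularity holds; real word2vec vectors are learned from large corpora):

```python
import numpy as np

# Hand-made toy embeddings: dims loosely read as (royalty, male, female).
vec = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "man":   np.array([0.1, 0.8, 0.1]),
    "woman": np.array([0.1, 0.1, 0.9]),
    "queen": np.array([0.9, 0.1, 0.9]),
    "ford":  np.array([0.0, 0.5, 0.2]),  # distractor word
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# king - man + woman, then find the nearest remaining word by cosine.
target = vec["king"] - vec["man"] + vec["woman"]
best = max((w for w in vec if w not in {"king", "man", "woman"}),
           key=lambda w: cosine(target, vec[w]))
print(best)  # queen
```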
Convolutional Latent Semantic Model
• DNN trained on clickthrough data
• Maximize cosine similarity
• Tri-gram hashing over raw terms
• Convolutional-pooling structure
P.-S. Huang, et al. Learning deep structured semantic models for web search using clickthrough data. CIKM, 2013.
Y. Shen, et al. Learning semantic representations using convolutional neural networks for web search. WWW, 2014.
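The tri-gram hashing step mentioned above maps each raw term to its letter trigrams, padded with boundary marks, which keeps the input layer compact and robust to unseen words; a minimal sketch:

```python
def letter_trigrams(term):
    """Letter-trigram hashing as in DSSM/CLSM: pad the term with
    boundary marks '#' and take every 3-character window."""
    padded = "#" + term + "#"
    return [padded[i:i + 3] for i in range(len(padded) - 2)]

print(letter_trigrams("cat"))  # ['#ca', 'cat', 'at#']
```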
Main Contributions
• CLSM models trained on Session Pairs (SP)
• Demonstrate semantic regularities in the CLSM query embedding space
• Leverage the regularities to explicitly represent query reformulations as vectors
• Improved Mean Reciprocal Rank (MRR) for session context-aware QAC ranking by more than 10% using CLSM-based features
Training on Session Pairs
• Pairs of consecutive queries from search sessions
• Pre-Query and Post-Query model
• Symmetric vs. Asymmetric models
q1 → q2 → q3 → q4  (pairs: q1→q2, q2→q3, q3→q4)
Advantages
1. Demonstrates higher levels of reformulation regularities (discussed next)
2. Train on time-stamped query log, no need for clickthrough data
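Extracting session pairs from a query log can be sketched as below (session segmentation, e.g. by an inactivity timeout, is assumed to have been done already):

```python
def session_pairs(sessions):
    """Yield (pre-query, post-query) pairs of consecutive queries.
    A session [q1, q2, q3, q4] yields (q1, q2), (q2, q3), (q3, q4)."""
    for session in sessions:
        for pre, post in zip(session, session[1:]):
            yield pre, post

pairs = list(session_pairs([
    ["london", "things to do in london"],
    ["big ben", "big ben height", "london clock tower"],
]))
print(pairs)
# [('london', 'things to do in london'),
#  ('big ben', 'big ben height'),
#  ('big ben height', 'london clock tower')]
```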
Embeddings for Query Reformulations
Explicit vector representation
k-means clustering of 65K in-session query pairs shows intuitive clusters
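A sketch of the explicit reformulation representation, assuming (as the vector-offset regularity suggests) that a reformulation is the difference between the post-query and pre-query embeddings; the toy embeddings below are made up, whereas the paper derives them from a session-pair-trained CLSM:

```python
import numpy as np

# Made-up 3-d query embeddings standing in for CLSM outputs.
q = {
    "london": np.array([0.8, 0.1, 0.0]),
    "things to do in london": np.array([0.8, 0.9, 0.0]),
    "new york": np.array([0.1, 0.1, 0.7]),
    "new york tourist attractions": np.array([0.1, 0.9, 0.7]),
}

def reformulation(pre, post):
    """Explicit reformulation vector: offset between query embeddings."""
    return q[post] - q[pre]

r1 = reformulation("london", "things to do in london")
r2 = reformulation("new york", "new york tourist attractions")

# Similar intent shifts yield nearly identical offset vectors, which is
# why k-means over such vectors groups reformulations into intuitive clusters.
print(np.allclose(r1, r2))  # True
```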
Session Context-Aware QAC
• Evaluation setup based on Shokouhi (SIGIR, 2013)
• Temporally separated background, train, validation and test sets
• Sample queries and extract all possible prefixes
• Submitted query as ground truth
• Re-rank top N suggestion candidates using a LambdaMART model
• Two testbeds: search logs from AOL & Bing
M. Shokouhi. Learning to personalize query auto-completion. SIGIR, 2013.
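The MRR metric used for the re-ranking evaluation can be sketched as follows (the candidate lists and ground truths here are illustrative, not from the testbeds):

```python
def mrr(ranked_lists, ground_truths):
    """Mean Reciprocal Rank: reciprocal of the rank of the submitted
    query in each re-ranked suggestion list (0 if absent), averaged."""
    total = 0.0
    for ranked, truth in zip(ranked_lists, ground_truths):
        total += 1.0 / (ranked.index(truth) + 1) if truth in ranked else 0.0
    return total / len(ground_truths)

score = mrr([["fandango", "ford mustang", "fox news"],
             ["big ben height", "big ben"]],
            ["ford mustang", "big ben height"])
print(score)  # (1/2 + 1/1) / 2 = 0.75
```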
Features
Non-contextual features: prefix length, suggestion length, vowels-to-alphabet ratio, contains numeric, etc.
N-gram similarity features: character trigram similarity between previous queries and the suggestion candidate
Pairwise frequency feature: pairwise frequency based on popular session pairs in the background data
CLSM topical similarity features: CLSM similarity between previous queries and the suggestion candidate
CLSM reformulation features: values along each dimension of the reformulation vector based on the previous query and the suggestion candidate
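The n-gram similarity feature could be computed, for instance, as Jaccard overlap between character-trigram sets; the exact similarity function used in the paper may differ, so treat this as a hedged stand-in:

```python
def char_trigrams(text):
    """Set of all 3-character windows of the string."""
    return {text[i:i + 3] for i in range(len(text) - 2)}

def trigram_similarity(prev_query, candidate):
    """Jaccard similarity between character-trigram sets of the
    previous query and a suggestion candidate."""
    a, b = char_trigrams(prev_query), char_trigrams(candidate)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)

print(trigram_similarity("big ben", "big ben height"))
```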
Summary of Contributions
• CLSM models trained on Session Pairs (SP)
• Demonstrate semantic regularities in the CLSM query embedding space
• Leverage the regularities to explicitly represent query reformulations as vectors
• Improved Mean Reciprocal Rank (MRR) for session context-aware QAC ranking by more than 10% using CLSM-based features
Potential Future Work
• Studying search trails (White et al.) in the embedding space
• Query change retrieval model (Guan et al.) using reformulation embeddings
• Generating user embeddings for search personalization
• Study how reformulations vary by user expertise and device