context-aware query classification

Context-Aware Query Classification

Huanhuan Cao1, Derek Hao Hu2, Dou Shen3, Daxin Jiang4, Jian-Tao Sun4, Enhong Chen1 and Qiang Yang2

1University of Science and Technology of China, 2Hong Kong University of Science and Technology,

3Microsoft Corporation, 4Microsoft Research Asia

Motivation

• Understanding Web users' information needs is one of the most important problems in Web search.

• Such information can help improve the quality of many Web search services, such as:
– Ranking
– Online advertising
– Query suggestion

Challenges

• The main challenges of query classification:
– Lack of feature information
– Ambiguity
– Multiple intents

• The first problem has been studied widely:
– Query expansion by top search results
– Leveraging a web directory

• However, the second and third problems are far from solved.

Why is context useful?

• For a given query, context means the previous queries and clicked URLs in the same session.

• It is assumed that:
– Context has a semantic relation with the current query.
– Context may help to label appropriate categories for the current query.

• It makes sense to exploit context for specifying the current query.

Example

Overview

• Problem statement
• Model query context by CRF
• Features of CRF
• Experiment
• Conclusion and future work

Problem Statement: Context

• In a user search session, suppose the user has issued a series of queries q1q2…qT-1 and clicked some returned URLs U1U2…UT-1.

• If the user issues a query qT at time T, we call q1q2…qT-1 and U1U2…UT-1 the query context of qT.

• We call each qt (t ∈ [1, T - 1]) a contextual query of qT.
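The definitions above can be sketched as a small data structure. This is a minimal illustration that assumes a session is stored as parallel lists of queries and clicked-URL sets; the names `Session` and `split_context`, and the example data, are hypothetical and do not come from the slides.

```python
from dataclasses import dataclass
from typing import List, Set, Tuple


@dataclass
class Session:
    queries: List[str]      # q1 ... qT, in the order they were issued
    clicks: List[Set[str]]  # U1 ... UT, clicked URLs for each query


def split_context(s: Session) -> Tuple[str, List[str], List[Set[str]]]:
    """Return (qT, contextual queries q1..qT-1, clicked URL sets U1..UT-1)."""
    if not s.queries:
        raise ValueError("empty session")
    last = len(s.queries) - 1
    return s.queries[last], s.queries[:last], s.clicks[:last]


# Made-up two-query session, for illustration only.
session = Session(
    queries=["jaguar", "jaguar habitat"],
    clicks=[{"example.org/big-cats"}, set()],
)
current, ctx_queries, ctx_clicks = split_context(session)
```

With this layout, a session of length 1 yields an empty context, which matches the limitation noted in the conclusion for the first query of a session.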

Query Context

[Figure: query context of qT]

Problem Statement: QC with context and Taxonomy

• The objective of query classification (QC) with context is to classify a user query qT into a ranked list of K categories cT1, cT2, ..., cTK, among Nc categories {c1,c2,…,cNc}, given the context of qT .

• A target taxonomy Υ is a tree of categories where {c1,c2,…,cNc} are leaf nodes of this tree.

Modeling Query Context by CRF

where q represents q1q2…qt
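The equation for this slide did not survive extraction. For reference, a standard linear-chain CRF over the label sequence c = c1c2…cT has the form below. The symbols f_k (feature functions), λ_k (feature weights) and Z(q) (normalization factor) follow the usual convention and are assumptions here, not notation recovered from the slide.

```latex
p(\mathbf{c} \mid \mathbf{q})
  = \frac{1}{Z(\mathbf{q})}
    \exp\!\left( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(c_{t-1}, c_t, \mathbf{q}, t) \right),
\qquad
Z(\mathbf{q})
  = \sum_{\mathbf{c}'}
    \exp\!\left( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(c'_{t-1}, c'_t, \mathbf{q}, t) \right)
```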

Why CRF?

• The two main advantages of CRF are:
– 1) It can incorporate general feature functions to model the relation between observations and unobserved states;
– 2) It doesn't need prior knowledge of the type of conditional distribution.

• Given 1), we can incorporate some external web knowledge.

• Given 2), we don't need to assume any particular form for p(c|q).

Features of CRF

• When we use CRF to model query context, one of the most important parts is choosing effective feature functions.

• We should consider:
– Relevance between queries and category labels, to leverage the local information of queries;
– Relevance between adjacent labels, to leverage contextual information.

Relevance between queries and category labels

• Term occurrence
– The terms of qt are obvious features for supporting ct.
– Because the training data is limited in size, many useful terms that indicate category information may not be covered.

• General label confidence
– Leverage an external web directory such as Google Directory.
– Here M is the number of returned results, and Mct,qt is the number of returned results with label ct after mapping.
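The formula for general label confidence is missing from the extracted slide. The sketch below assumes it is the ratio Mct,qt / M implied by the two definitions above. The function name, the input format (one mapped target-taxonomy label per returned directory result) and the example labels are illustrative assumptions.

```python
from typing import Sequence


def general_label_confidence(mapped_labels: Sequence[str], label: str) -> float:
    """Fraction of returned results whose mapped label equals `label`.

    mapped_labels holds one target-taxonomy label per returned result,
    so len(mapped_labels) plays the role of M.
    """
    m = len(mapped_labels)
    if m == 0:
        return 0.0  # no results returned: no evidence for any label
    m_label = sum(1 for c in mapped_labels if c == label)
    return m_label / m


# Made-up mapped labels for four returned results.
results = [
    "Computers/Software",
    "Computers/Software",
    "Computers/Hardware",
    "Living/Travel",
]
conf = general_label_confidence(results, "Computers/Software")
```

Returning 0.0 for an empty result list is a design choice of this sketch; the slides do not say how that case is handled.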

Relevance between queries and category labels

• Click-aware label confidence
– Combine click information with the knowledge of an external web directory.
– CConf(ct, ut) can be calculated by multiple approaches.
– Here, we use the vector space model (VSM) to calculate the cosine similarity between the term vectors of ct and ut.
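The cosine similarity used for CConf(ct, ut) can be sketched as follows. This is a minimal version that assumes raw term-frequency vectors built by whitespace tokenization. The slides do not say how the term vectors of ct and ut are constructed or weighted, so `term_vector` and the example texts are hypothetical.

```python
import math
from collections import Counter
from typing import Dict


def term_vector(text: str) -> Dict[str, int]:
    """Raw term-frequency vector from lowercased, whitespace-split text."""
    return Counter(text.lower().split())


def cosine_similarity(a: Dict[str, int], b: Dict[str, int]) -> float:
    """Cosine of the angle between two sparse term vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty vector is treated as unrelated to everything
    return dot / (norm_a * norm_b)


# Made-up texts standing in for a category description and a clicked page.
category_terms = term_vector("travel hotel flight travel guide")
url_terms = term_vector("paris travel guide hotel deals")
cconf = cosine_similarity(category_terms, url_terms)
```

In practice a TF-IDF weighting is a common alternative to raw counts; the choice here is only to keep the sketch short.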

Relevance between Adjacent Labels

• Direct relevance between adjacent labels
– Occurrence of the adjacent label pair <ct-1,ct>
– The weight implies how likely the two labels are to co-occur.

• Taxonomy-based relevance between adjacent labels
– Limited by the sampling approach and the size of the training data, some reasonable adjacent label pairs may occur disproportionately rarely, or not at all.
– Consider indirect relevance between adjacent labels by using the taxonomy.
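Both kinds of adjacent-label evidence can be sketched in a few lines. The pair counting follows the slide directly. The taxonomy check is only one possible reading: it assumes labels are written as "Parent/Child" paths and treats two labels as indirectly related when they share a parent. The slides do not define the taxonomy-based feature precisely, so `share_parent` and the example labels are hypothetical.

```python
from collections import Counter
from typing import Iterable, List


def adjacent_pair_counts(label_seqs: Iterable[List[str]]) -> Counter:
    """Count occurrences of each adjacent label pair <c_{t-1}, c_t>."""
    counts: Counter = Counter()
    for seq in label_seqs:
        counts.update(zip(seq, seq[1:]))
    return counts


def share_parent(a: str, b: str) -> bool:
    """True if two distinct 'Parent/Child' labels have the same parent."""
    return a != b and a.rsplit("/", 1)[0] == b.rsplit("/", 1)[0]


# Made-up labeled sessions, one label per query.
training = [
    ["Living/Travel", "Living/Travel", "Living/Food"],
    ["Computers/Software", "Computers/Hardware"],
]
pairs = adjacent_pair_counts(training)
```

A pair such as <Living/Food, Living/Travel> never occurs in this toy training set, yet `share_parent` still flags it as plausible, which is the kind of gap the taxonomy-based feature is meant to cover.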

Experiment

• Data set:
– 10,000 randomly selected sessions from one day's search log of a commercial search engine.
– Three labelers first label all possible categories from the KDDCUP'05 taxonomy for each unique query of the training data.

Examples of multiple category queries

A large ratio of multiple category queries implies the difficulty of QC without context.

Label Sessions

• Then the three human labelers are asked to cross-label each session of the data set with a sequence of level-2 category labels.

• For each query, a labeler gives the most appropriate category label by considering:
– The query itself;
– The query context;
– The clicked URLs of the query.

Tested Approaches

• Baselines:
– Non-context-aware baseline: the bridging classifier (BC) proposed by Shen et al.
– Naïve context-aware baseline: the collaborating classifier (CC), which combines a test query with the previous query and classifies the result with BC.

• CRFs:
– CRF-B: CRF with basic features (term occurrence, general label confidence, and direct relevance between adjacent labels)
– CRF-B-C: CRF with basic features + click-aware label confidence
– CRF-B-C-T: CRF with basic features + click-aware label confidence + taxonomy-based relevance

Evaluation Metrics

• Given a test session q1q2…qT, we let qT be the test query and let the queries q1q2…qT-1 and the corresponding clicked URL sets U1U2…UT-1 be the query context.

• For qT, we evaluate a tested approach by:
– Precision (P): δ(cT ∈ CT,K)/K
– Recall (R): δ(cT ∈ CT,K)
– F1 score (F1): 2*P*R/(P+R)
where cT is the ground truth label and CT,K is the set of the top K labels. δ(*) is a Boolean function indicating whether * is true (=1) or false (=0).
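The three metrics can be computed for a single test query as in the sketch below. It follows the definitions on this slide, with one ground truth label per query. The function name, the zero-division guard for F1 and the example labels are assumptions added for illustration.

```python
from typing import List, Tuple


def evaluate_top_k(truth: str, ranked: List[str], k: int) -> Tuple[float, float, float]:
    """Precision, recall and F1 for one test query with one true label."""
    if k <= 0:
        raise ValueError("k must be positive")
    hit = 1.0 if truth in ranked[:k] else 0.0  # δ(cT ∈ CT,K)
    precision = hit / k
    recall = hit
    denom = precision + recall
    # When the true label is missed, P = R = 0 and F1 is defined as 0 here.
    f1 = 2 * precision * recall / denom if denom > 0 else 0.0
    return precision, recall, f1


# Made-up ranked output for one test query.
ranked_labels = ["Living/Travel", "Information/Local", "Shopping/Auctions"]
p, r, f1 = evaluate_top_k("Information/Local", ranked_labels, k=2)
```

Because there is a single true label, precision can never exceed 1/K, so larger K trades precision for recall by construction.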

Overall results

1) The naïve context-aware baseline consistently outperforms the non context-aware baseline.

2) CRFs consistently outperform the two baselines.

3) CRF-B-C-T > CRF-B-C > CRF-B: click information and taxonomy-based relevance are useful.

Case study

Context about travel

Click a travel guide web page

Give the most appropriate label in the first position

Efficiency of Our Approach

• Offline training:
– Each iteration takes about 300 ms.
– The time cost of training a CRF is acceptable.

• Online cost:
– Calculating features
  • Label confidence

Conclusion and Future work

• In this paper, we propose a novel approach to query classification that models query context with CRFs.

• Experiments on a real search log clearly show that our approach outperforms both a non-context-aware baseline and a naïve context-aware baseline.

• The current approach cannot leverage contextual information for the first queries of sessions, which motivates our future research on leveraging contextual information from outside the session.

Thanks
