Query Dependent Ranking using K-Nearest Neighbor

Xiubo Geng, Tie-Yan Liu, Tao Qin, Andrew Arnold, Hang Li, Heung-Yeung Shum, Query Dependent Ranking with K Nearest Neighbor, Proc. of SIGIR 2008, 115-122.


TRANSCRIPT

Page 1: Query Dependent Ranking using K-Nearest Neighbor

Query Dependent Ranking using K-Nearest Neighbor

Xiubo Geng, Tie-Yan Liu, Tao Qin, Andrew Arnold, Hang Li, Heung-Yeung Shum, Query Dependent Ranking with K Nearest Neighbor, Proc. of SIGIR 2008, 115-122.

Page 2: Query Dependent Ranking using K-Nearest Neighbor

Introduction

• Problem:
– Most existing methods do not take into account the significant differences that exist between queries, and resort to a single function for ranking documents

• Solution:
– query-dependent ranking: different ranking models for different queries
– propose a K-Nearest Neighbor (KNN) method for query dependent ranking

[Figure: each training query is associated with its own ranking model; a test query is matched against the training queries to choose a ranking function.]

Page 3: Query Dependent Ranking using K-Nearest Neighbor

Introduction

• Why the method enhances accuracy:
– ranking for a query is conducted by leveraging useful information from similar queries and avoiding negative effects from dissimilar ones

Page 4: Query Dependent Ranking using K-Nearest Neighbor

Related Work

• Query dependent ranking
– There has not been much previous work on query dependent ranking

• Query classification
– queries were classified according to users’ search needs
– queries were classified according to topics, for instance Computers, Entertainment, Information, etc.
– query classification was not extensively applied to query dependent ranking, probably due to the difficulty of the query classification problem

Page 5: Query Dependent Ranking using K-Nearest Neighbor

Query dependent ranking method

• Straightforward approach
– employ a ‘hard’ classification approach in which queries are classified into categories and a ranking model is trained for each category
• However, it is hard to draw clear boundaries between queries in different categories:
– queries in different categories are mixed together and cannot be separated by hard classification boundaries
• With high probability, a query belongs to the same category as its neighbors
– the locality property of queries

Page 6: Query Dependent Ranking using K-Nearest Neighbor

Query dependent ranking method

• KNN approach (K-Nearest Neighbor method)
– Given a new test query q, find the k closest training queries to it in terms of Euclidean distance
– Train a local ranking model online using the neighboring training queries (Ranking SVM)
– Rank the documents of the test query using the trained local model

Page 7: Query Dependent Ranking using K-Nearest Neighbor

KNN online algorithm
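The algorithm figure for this slide did not survive the transcript. As a rough sketch of the three online steps described on the previous slide (find the k nearest training queries, fit a local model on their judged documents, rank the test documents), the following Python uses a pairwise perceptron as a simple stand-in for Ranking SVM; all function names are hypothetical, not from the paper.

```python
import numpy as np

def k_nearest_queries(test_q, train_qs, k):
    # Step 1: k closest training queries by Euclidean distance
    # between query-level feature vectors.
    dists = np.linalg.norm(train_qs - test_q, axis=1)
    return np.argsort(dists)[:k]

def train_local_ranker(doc_feats, labels, epochs=50, lr=0.1):
    # Step 2: fit a local model on the neighbors' judged documents.
    # A pairwise perceptron stands in for Ranking SVM: whenever a
    # higher-rated document scores no better than a lower-rated one,
    # push the weight vector toward their feature difference.
    w = np.zeros(doc_feats[0].shape[1])
    for _ in range(epochs):
        for X, y in zip(doc_feats, labels):
            for i in range(len(y)):
                for j in range(len(y)):
                    if y[i] > y[j] and X[i] @ w <= X[j] @ w:
                        w += lr * (X[i] - X[j])
    return w

def rank_documents(w, X):
    # Step 3: sort the test query's documents by descending model score.
    return np.argsort(-(X @ w))
```

Training the local model happens entirely at query time here, which is exactly the cost the two offline variants on the following slides try to remove.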

Page 8: Query Dependent Ranking using K-Nearest Neighbor

KNN offline-1

• To reduce complexity, we further propose two algorithms, which move the time-consuming steps offline.

[Figure: for each training query q_i, the k nearest neighbors N_k(q_i) are found and a local model h_{q_i} is trained offline. Online, the neighborhood N_k(q) of the test query is compared with each N_k(q_i), and the model whose neighborhood is most similar is selected:

q_{i*} = argmax_{q_i} S(N_k(q), N_k(q_i)) ]

Page 9: Query Dependent Ranking using K-Nearest Neighbor

KNN offline-1
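The Offline-1 algorithm figure is also missing from the transcript. A minimal sketch of the selection step, assuming the similarity S between two neighborhoods is measured by the size of their overlap (an assumption made here for concreteness; the paper defines S precisely):

```python
import numpy as np

def neighborhood(q, train_qs, k):
    # Indices of the k training queries closest to q (Euclidean distance).
    d = np.linalg.norm(train_qs - q, axis=1)
    return set(np.argsort(d)[:k].tolist())

def select_offline1_model(test_q, train_qs, models, k):
    # Offline: a neighborhood N_k(q_i) and a local model h_{q_i} are
    # precomputed for every training query q_i. Online: pick the model
    # whose neighborhood is most similar to N_k(q) of the test query
    # (here: largest overlap, an assumed form of S).
    nk_test = neighborhood(test_q, train_qs, k)
    nbhds = [neighborhood(q, train_qs, k) for q in train_qs]  # done offline in practice
    overlaps = [len(nk_test & nb) for nb in nbhds]
    return models[int(np.argmax(overlaps))]
```

The expensive model training is now offline; only the neighborhood search and comparison remain online.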

Page 10: Query Dependent Ranking using K-Nearest Neighbor

KNN offline-2

• In KNN Offline-1, we still need to find the k nearest neighbors of the test query online, which is also time-consuming

[Figure: as in Offline-1, the neighborhoods N_k(q_i) and local models h_{q_i} are built offline for the training queries; online, the test query q is mapped directly to the model h_{q_i} of its closest training query, avoiding the online computation of N_k(q).]

Page 11: Query Dependent Ranking using K-Nearest Neighbor

KNN offline-2
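Again the algorithm figure is lost. Under the reading that Offline-2 removes the online neighborhood search entirely, the online step reduces to a single nearest-neighbor lookup over the training queries, reusing that query's precomputed local model (a sketch under that assumption; function names are mine):

```python
import numpy as np

def select_offline2_model(test_q, train_qs, models):
    # Offline-2: local models h_{q_i} are trained offline as before.
    # Online, only the single nearest training query is located, and
    # its precomputed model is applied to the test query's documents.
    d = np.linalg.norm(train_qs - test_q, axis=1)
    return models[int(np.argmin(d))]
```

This keeps only one linear scan over the m training queries online, matching the complexity comparison on the next slide.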

Page 12: Query Dependent Ranking using K-Nearest Neighbor

Time Complexities of Testing

• n denotes the number of documents to be ranked for the test query
• k denotes the number of nearest neighbors
• m denotes the number of queries in the training data

Page 13: Query Dependent Ranking using K-Nearest Neighbor

Experiment

• Experimental Setting
– Data from a commercial search engine
• DataSet1: 1500 training queries / 400 test queries
• DataSet2: 3000 training queries / 800 test queries
• Five levels of relevance: perfect, excellent, good, fair and bad
• A query-document pair: 200 features

– LETOR data (LEarning TO Rank)
• released by Microsoft Research Asia
• extracted features for each query-document pair in the OHSUMED and TREC collections
• also released an evaluation tool which can compute precision (P@n and MAP) and normalized discounted cumulative gain (NDCG)

Page 14: Query Dependent Ranking using K-Nearest Neighbor

Experiment

• Parameter selection
– parameter k is tuned automatically based on a validation set

• Evaluation Measure
– NDCG (normalized discounted cumulative gain):

N(n) = Z_n * sum_{j=1..n} G(j), where G(j) = 2^{r(j)} - 1 for j = 1, 2 and G(j) = (2^{r(j)} - 1) / log2(j) for j > 2

– j: position in the document list
– r(j): relevance score of the j-th document in the list (e.g., perfect: 4, ..., bad: 0)
– Z_n: normalization factor, set so that a perfectly ordered list gives N(n) = 1

Example: ratings (3, 2, 0, 1, 4) at positions 1-5:

N(1) = Z_1 (2^3 - 1)
N(2) = Z_2 [(2^3 - 1) + (2^2 - 1)]
N(3) = Z_3 [(2^3 - 1) + (2^2 - 1) + (2^0 - 1)/log2 3]
N(4) = Z_4 [(2^3 - 1) + (2^2 - 1) + (2^0 - 1)/log2 3 + (2^1 - 1)/log2 4]
N(5) = Z_5 [(2^3 - 1) + (2^2 - 1) + (2^0 - 1)/log2 3 + (2^1 - 1)/log2 4 + (2^4 - 1)/log2 5]
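The NDCG measure on this slide (gain 2^r - 1, no discount at positions 1 and 2, a log2(j) discount below that) can be sketched in a few lines of Python; the function names are mine, not from the paper or the LETOR tool.

```python
import math

def discounted_gain(r, j):
    # Gain 2^r - 1 of a document rated r at 1-based position j;
    # positions 1 and 2 are undiscounted, later positions are
    # divided by log2(j).
    g = 2 ** r - 1
    return g if j <= 2 else g / math.log2(j)

def ndcg(ratings, n):
    # N(n) = Z_n * sum of discounted gains over the top n positions,
    # where Z_n normalizes so a perfectly ordered list scores 1.
    dcg = sum(discounted_gain(r, j)
              for j, r in enumerate(ratings[:n], start=1))
    ideal = sum(discounted_gain(r, j)
                for j, r in enumerate(sorted(ratings, reverse=True)[:n], start=1))
    return dcg / ideal if ideal > 0 else 0.0
```

For the slide's example ratings (3, 2, 0, 1, 4), NDCG at position 1 is 7/15: the actual top document contributes gain 2^3 - 1 = 7, while the ideal ordering would put the rating-4 document (gain 15) first.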

Page 15: Query Dependent Ranking using K-Nearest Neighbor

Experiment

• Single: single model approach
• QC: query classification approach
• Result:
– the proposed three methods (KNN Online, KNN Offline-1 and KNN Offline-2) perform comparably well with each other, and all of them almost always outperform the baselines

Page 16: Query Dependent Ranking using K-Nearest Neighbor

Experiment Result

• The better results of KNN over Single indicate that query dependent ranking does help, and an approach like KNN can indeed effectively accomplish the task.

• The superior results of KNN to QC indicate that an approach based on soft classification of queries like KNN is more successful than an approach based on hard classification of queries like QC.

• QC cannot work better than Single, mainly due to the relatively low accuracy of query classification.

Page 17: Query Dependent Ranking using K-Nearest Neighbor

Experiment Result

• When only a small number of neighbors is used, the performance of KNN suffers due to insufficient training data.

• As the number of neighbors increases, performance gradually improves, because more information is used.

• However, when too many neighbors are used (approaching 1500, which is equivalent to Single), performance begins to deteriorate. This again indicates that query dependent ranking really helps.

Page 18: Query Dependent Ranking using K-Nearest Neighbor

Conclusion

• Ranking of documents in search should be conducted by using different models based on different properties of queries
• The complexity of the online processing is still high
• It is also a common practice to use a fixed radius in KNN
• Examine the many other potentially helpful approaches