2005.12.27
Learning to Rank
Ming-Feng Tsai
National Taiwan University
Ranking

Ranking vs. Classification
- Training samples are not independent and identically distributed.
- The training criterion is not compatible with IR evaluation measures.

Many ML approaches have been applied to ranking:
- RankSVM: T. Joachims, SIGKDD, 2002 (SVMlight)
- RankBoost: Y. Freund and R. Iyer et al., Journal of Machine Learning Research, 2003
- RankNet: C.J.C. Burges et al., ICML, 2005 (MSN Search)
Motivation

RankNet
- Pros: probabilistic ranking model with good properties
- Cons: training is not efficient; the training criterion is not compatible with IR measures

Motivation: build on the probabilistic ranking model, and improve both efficiency and the loss function.
Probabilistic Ranking Model
- Model the posterior $P(x_i \triangleright x_j)$ by $P_{ij}$.
- The map from model outputs to probabilities is modeled using a sigmoid function.

Define $o_i \equiv f(x_i)$ and $o_{ij} \equiv o_i - o_j = f(x_i) - f(x_j)$; then

$$P_{ij} = \frac{e^{o_{ij}}}{1 + e^{o_{ij}}}$$

Properties: combined probabilities satisfy the consistency requirements.
- If P(A>B)=0.5 and P(B>C)=0.5, then P(A>C)=0.5.
- Confidence, or lack of confidence, builds as expected: if P(A>B)=0.6 and P(B>C)=0.6, then P(A>C)>0.6.
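The consistency properties can be checked numerically. A minimal sketch (function names are illustrative): map pair probabilities to score differences with the inverse sigmoid, add the differences, and map back.

```python
import math

def pair_prob(o_ij):
    """Sigmoid map from score difference o_ij = f(x_i) - f(x_j) to P(x_i > x_j)."""
    return math.exp(o_ij) / (1.0 + math.exp(o_ij))

def logit(p):
    """Inverse map: the score difference that yields pair probability p."""
    return math.log(p / (1.0 - p))

# Consistency: P(A>B) = P(B>C) = 0.5 gives P(A>C) = 0.5,
# while P(A>B) = P(B>C) = 0.6 gives P(A>C) > 0.6 (confidence builds).
o_ab = logit(0.6)
p_ac = pair_prob(o_ab + o_ab)  # o_AC = o_AB + o_BC
```

Since score differences add along a chain, certainty at 0.5 stays at 0.5, while any confidence above 0.5 compounds.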
Probabilistic Ranking Model

Cross-entropy loss function. Let $\bar P_{ij}$ be the desired target values. Then

$$C_{ij} \equiv C(o_{ij}) = -\bar P_{ij} \log P_{ij} - (1 - \bar P_{ij}) \log(1 - P_{ij}) = -\bar P_{ij}\, o_{ij} + \log\!\left(1 + e^{o_{ij}}\right)$$

Total cost function: $C = \sum_{ij} C_{ij}$.

- RankNet applied this loss function with a neural network (BP network).
- Here, the same loss function is applied with an additive model.
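The algebraic identity above (expanding $P_{ij}$ inside the cross entropy gives the closed form in $o_{ij}$) can be verified directly; a minimal sketch with illustrative names:

```python
import math

def sigmoid(o):
    return 1.0 / (1.0 + math.exp(-o))

def cross_entropy_pair(o_ij, p_bar):
    """-P̄ log P - (1 - P̄) log(1 - P) with P = sigmoid(o_ij)."""
    p = sigmoid(o_ij)
    return -p_bar * math.log(p) - (1.0 - p_bar) * math.log(1.0 - p)

def cross_entropy_closed(o_ij, p_bar):
    """Equivalent closed form: -P̄ o_ij + log(1 + e^{o_ij})."""
    return -p_bar * o_ij + math.log(1.0 + math.exp(o_ij))
```

Both forms agree for any score difference and target, which is why the derivation below works directly with the closed form.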
Derivation of the cross-entropy loss function for an additive model

With the additive model $f_k(x) = f_{k-1}(x) + \alpha_k h_k(x)$, let $f^{k-1}_{i,j} \equiv f_{k-1}(x_i) - f_{k-1}(x_j)$ and $h^{k}_{i,j} \equiv h_k(x_i) - h_k(x_j)$, so that $o_{ij} = f_k(x_i) - f_k(x_j) = f^{k-1}_{i,j} + \alpha_k h^{k}_{i,j}$.

$$C = \sum_{ij} C_{ij} = \sum_{ij} \left( -\bar P_{ij}\, o_{ij} + \log\!\left(1 + e^{o_{ij}}\right) \right)$$

$$J(\alpha_k) = \sum_{ij} \left( -\bar P_{ij} \left( f^{k-1}_{i,j} + \alpha_k h^{k}_{i,j} \right) + \log\!\left( 1 + e^{f^{k-1}_{i,j} + \alpha_k h^{k}_{i,j}} \right) \right)$$

$$\frac{\partial J}{\partial \alpha_k} = \sum_{ij} \left( -\bar P_{ij}\, h^{k}_{i,j} + \frac{h^{k}_{i,j}\, e^{f^{k-1}_{i,j} + \alpha_k h^{k}_{i,j}}}{1 + e^{f^{k-1}_{i,j} + \alpha_k h^{k}_{i,j}}} \right) = 0$$

This equation has no closed-form solution for $\alpha_k$; with some relaxations (bounding the sigmoid factor $e^{x}/(1 + e^{x})$), an approximate closed form for $\alpha_k$ can be obtained.
Candidates for Loss Functions
- Cross entropy:

$$H(X; q) = H(X) + D(p \,\|\, q) = -\sum_{x \in X} p(x) \log q(x)$$

- KL-divergence (this loss function is equivalent to cross entropy, up to the constant $H(X)$):

$$D(p \,\|\, q) = \sum_{x \in X} p(x) \log \frac{p(x)}{q(x)}$$

- Information radius: KL-divergence and cross entropy are asymmetric; the information radius is symmetric, that is, $\mathrm{IRad}(p, q) = \mathrm{IRad}(q, p)$:

$$\mathrm{IRad}(p, q) = D\!\left(p \,\Big\|\, \frac{p + q}{2}\right) + D\!\left(q \,\Big\|\, \frac{p + q}{2}\right)$$

- Minkowski norm (this seems simpler than cross entropy in the mathematical derivation for boosting):

$$L(p, q) = \sum_{x \in X} \left|\, p(x) - q(x) \,\right|$$
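The symmetry contrast can be checked on a small example; a minimal sketch over two-point distributions (names are illustrative):

```python
import math

def kl(p, q):
    """KL-divergence D(p || q) for discrete distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def irad(p, q):
    """Information radius: D(p || m) + D(q || m) with the midpoint m = (p + q)/2."""
    m = [(pi + qi) / 2.0 for pi, qi in zip(p, q)]
    return kl(p, m) + kl(q, m)

p, q = [0.7, 0.3], [0.4, 0.6]
# KL is asymmetric; IRad is symmetric: IRad(p, q) == IRad(q, p).
```

Swapping the arguments changes the KL value but leaves the information radius unchanged, since both distributions are compared against their common midpoint.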
Fidelity Loss Function

Fidelity
- A more reasonable loss function, inspired by quantum computation.
- Holds the same properties as the probabilistic ranking model proposed by Chris Burges et al.
- New properties: $F(p, q) = F(q, p)$; the loss is between 0 and 1; it attains the minimum loss value 0; and the loss converges.

$$F(p, q) = 1 - \sum_{x \in X} p(x)^{1/2}\, q(x)^{1/2}$$

For a pair:

$$F_{ij} = 1 - \left(\bar P_{ij} \cdot \frac{e^{o_{ij}}}{1 + e^{o_{ij}}}\right)^{1/2} - \left(\left(1 - \bar P_{ij}\right) \cdot \frac{1}{1 + e^{o_{ij}}}\right)^{1/2}$$
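The pair-level fidelity loss is a one-liner to compute; a minimal sketch (the function name is illustrative):

```python
import math

def fidelity_pair_loss(o_ij, p_bar):
    """Fidelity loss for one pair: 1 - sqrt(P̄ · P) - sqrt((1 - P̄)(1 - P)),
    with the pair probability P = e^o / (1 + e^o)."""
    p = math.exp(o_ij) / (1.0 + math.exp(o_ij))
    return 1.0 - math.sqrt(p_bar * p) - math.sqrt((1.0 - p_bar) * (1.0 - p))

# The loss attains 0 exactly when the model probability P matches the target P̄.
zero_loss = fidelity_pair_loss(math.log(0.7 / 0.3), 0.7)
```

Unlike the cross-entropy loss, which is unbounded, this stays in [0, 1] and reaches 0 at the target, matching the properties listed above.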
Fidelity Loss Function

Properties
- Pair-level loss is considered, e.g., the loss between (5, 4, 3, 2, 1) and (4, 3, 2, 1, 0) is zero.
- Query-level loss is also considered.
- More penalty for a pair with a larger grade gap, e.g., (5, 0) vs. (5, 4).

Total loss function:

$$F = \frac{1}{|Q|} \sum_{q} \frac{1}{|\#\text{ of pairs in } q|} \sum_{ij} F_{ij}$$

| | query1 | query2 | Loss |
|---|---|---|---|
| Case 1 | 1000 | 0 | 0.5 |
| Case 2 | 990 | 10 | 0.005 |
Derivation for the Additive Model

We denote

$$D(i, j) = \frac{1}{|Q| \cdot |\#\text{ of pairs in } q|}$$

With $H_k(x) = H_{k-1}(x) + \alpha_k h_k(x)$, so that $o_{ij} = f^{k-1}_{i,j} + \alpha_k h^{k}_{i,j}$, the total fidelity loss becomes

$$J(H_{k-1} + \alpha_k h_k) = \sum_{ij} D(i, j) \left[\, 1 - \left(\bar P_{ij} \cdot \frac{e^{f^{k-1}_{i,j} + \alpha_k h^{k}_{i,j}}}{1 + e^{f^{k-1}_{i,j} + \alpha_k h^{k}_{i,j}}}\right)^{1/2} - \left(\left(1 - \bar P_{ij}\right) \cdot \frac{1}{1 + e^{f^{k-1}_{i,j} + \alpha_k h^{k}_{i,j}}}\right)^{1/2} \right]$$
Derivation for the Additive Model

Differentiating $J$ with respect to $\alpha_k$ and setting the derivative to zero:

$$\frac{\partial J}{\partial \alpha_k} = \sum_{ij} D(i, j)\, \frac{h^{k}_{i,j}}{2} \cdot \frac{\left(1 - \bar P_{ij}\right)^{1/2} e^{f^{k-1}_{i,j} + \alpha_k h^{k}_{i,j}} - \bar P_{ij}^{1/2}\, e^{\left(f^{k-1}_{i,j} + \alpha_k h^{k}_{i,j}\right)/2}}{\left(1 + e^{f^{k-1}_{i,j} + \alpha_k h^{k}_{i,j}}\right)^{3/2}} = 0$$

This has no closed-form solution for $\alpha_k$. Evaluating the sigmoid-dependent factors at $f^{k-1}_{i,j}$ (a relaxation), the terms in $e^{\alpha_k}$ can be collected into a closed form of the type

$$\alpha_k = \frac{1}{2} \ln \frac{\sum_{i,j:\, h^{k}_{i,j} > 0} W_{i,j}}{\sum_{i,j:\, h^{k}_{i,j} < 0} W_{i,j}}$$

where $W_{i,j}$ is a pair weight built from $D(i, j)$, $\bar P_{ij}$, and $f^{k-1}_{i,j}$.
FRank

Algorithm: FRank
- Given: ground truth pairs $\{((x_i, x_j), \bar P_{ij})\}$ for queries $q_1, \ldots, q_{|Q|}$
- Initialize: $D(i, j) = \frac{1}{|Q| \cdot |\#\text{ of pairs in } q|}$
- For $t = 1, 2, \ldots, T$:
  - (a) For each weak-learner candidate $h_i(x)$:
    - (a.1) Compute the optimal $\alpha_{t,i}$
    - (a.2) Compute the fidelity loss
  - (b) Choose the weak learner $h_{t,i}(x)$ with the minimal loss as $h_t(x)$
  - (c) Choose the corresponding $\alpha_{t,i}$ as $\alpha_t$
  - (d) Update the pair weights $W_{i,j}$
- Output the final ranking function $H(x) = \sum_{t=1}^{T} \alpha_t h_t(x)$
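The loop above can be sketched in a few dozen lines. This is a minimal illustration, not the actual implementation: threshold stumps over single features serve as weak-learner candidates, $\alpha$ is chosen by a small grid search rather than the closed-form expression, and the pair weights are held uniform.

```python
import math

def pair_prob(score_diff):
    return math.exp(score_diff) / (1.0 + math.exp(score_diff))

def fidelity_loss(scores, pairs, weights):
    """Weighted fidelity loss over pairs given as (i, j, P̄_ij) triples."""
    loss = 0.0
    for w, (i, j, p_bar) in zip(weights, pairs):
        p = pair_prob(scores[i] - scores[j])
        loss += w * (1.0 - math.sqrt(p_bar * p)
                         - math.sqrt((1.0 - p_bar) * (1.0 - p)))
    return loss

def frank_train(X, pairs, n_rounds=10, alpha_grid=(0.1, 0.3, 1.0)):
    """FRank-style loop: each round, pick the (stump, alpha) combination
    with minimal weighted fidelity loss and add it to the ensemble."""
    n, d = len(X), len(X[0])
    weights = [1.0 / len(pairs)] * len(pairs)  # uniform in this sketch
    scores = [0.0] * n
    model = []  # list of (feature, threshold, alpha)
    for _ in range(n_rounds):
        best = None
        for f in range(d):
            for thr in sorted({x[f] for x in X}):
                h = [1.0 if x[f] > thr else 0.0 for x in X]
                for a in alpha_grid:
                    trial = [s + a * hv for s, hv in zip(scores, h)]
                    loss = fidelity_loss(trial, pairs, weights)
                    if best is None or loss < best[0]:
                        best = (loss, f, thr, a)
        _, f, thr, a = best
        scores = [s + a * (1.0 if x[f] > thr else 0.0)
                  for s, x in zip(scores, X)]
        model.append((f, thr, a))
    return model, scores

# Toy example: one feature that already orders the three items.
X = [[2.0], [1.0], [0.0]]
pairs = [(0, 1, 1.0), (1, 2, 1.0), (0, 2, 1.0)]  # (i, j, P̄): i above j
model, scores = frank_train(X, pairs, n_rounds=3)
```

After a few rounds the ensemble scores reproduce the target order. The real algorithm replaces the grid search with the closed-form $\alpha_t$ and reweights pairs each round.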
Implementation

Finished:
- Fast threshold implementation: 4x faster
- Fast alpha implementation: 120 seconds faster per weak learner in 3w
- Fast total-loss implementation: 3x faster
- Resume training

Planned:
- Multi-threaded implementation (parallel computation)
- Margin consideration (fast, but with some loss)
Preliminary Experimental Results

Data set from the BestRank competition:
- Training data: about 2,500 queries
- Validation data: about 1,100 queries
- Features: 64 features

Evaluation: NDCG
Preliminary Experimental Results
Results of Validation Data
Next step
Interesting Analogy
- Loss function: pair-level loss, query-level loss, other considerations
- Learning model: boosting (additive model), LogitBoost, Boosted Lasso, SVM, neural network
- A whole new model: the dependence among retrieved web pages
Pairwise Training
- Ranking is reduced to a classification problem by using pairwise items as training samples.
- This increases the data complexity from O(n) to O(n²).
- Suppose there are n samples evenly distributed over k ranks; the total number of pairwise samples is roughly n²/k.

(Figure: pairwise model F(x_i, x_j) over item pairs vs. pointwise model F(x_i).)
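The blow-up in sample count is easy to see by generating the pairs; a minimal sketch that forms every cross-rank pair (the count is Θ(n²), consistent with the complexity noted above):

```python
from itertools import combinations

def pairwise_samples(labels):
    """Build pairwise training samples from graded items: every pair of
    indices with different ranks, ordered so the first item ranks higher."""
    pairs = []
    for i, j in combinations(range(len(labels)), 2):
        if labels[i] > labels[j]:
            pairs.append((i, j))
        elif labels[j] > labels[i]:
            pairs.append((j, i))
    return pairs

# n samples evenly distributed over k ranks.
n, k = 100, 5
labels = [r for r in range(k) for _ in range(n // k)]
pairs = pairwise_samples(labels)  # n^2 (k-1) / (2k) cross-rank pairs
```

For n = 100 and k = 5 this yields 4,000 pairs from 100 items, so the quadratic growth dominates training cost quickly.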
Pairwise Training
- F(x, y) is a more general function than F(x) − F(y). Find the properties that should be modeled by F(x, y):
- Nonlinear relations between x and y, e.g., margin(r1, r30) > margin(r1, r10) > margin(r21, r30), ...
Pairwise Testing
- In the testing phase, the rank must be reconstructed from a partial-order graph, which may be inconsistent and incomplete.
- Topological sorting can handle only a DAG, in linear time.
- Problems:
  - Inconsistency: how to find the best spanning tree
  - Incompleteness: how to deal with nodes without labels
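The linear-time DAG case can be sketched with Kahn's algorithm; on an inconsistent (cyclic) preference graph it detects that no total order exists, which is exactly where the problems above begin.

```python
from collections import deque

def topological_sort(n, edges):
    """Kahn's algorithm: linear-time ordering of a DAG given as
    (winner, loser) pairwise preferences over nodes 0..n-1.
    Returns None if the preferences are inconsistent (contain a cycle)."""
    adj = [[] for _ in range(n)]
    indeg = [0] * n
    for u, v in edges:          # u is ranked above v
        adj[u].append(v)
        indeg[v] += 1
    queue = deque(i for i in range(n) if indeg[i] == 0)
    order = []
    while queue:
        u = queue.popleft()
        order.append(u)
        for v in adj[u]:
            indeg[v] -= 1
            if indeg[v] == 0:
                queue.append(v)
    return order if len(order) == n else None
```

A consistent chain like 0 > 1 > 2 sorts cleanly; the two-node cycle 0 > 1, 1 > 0 returns None, so a real system needs a repair strategy (e.g., the spanning-tree ideas above) before ordering.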
Pairwise: Spanning-Tree-Related Content
- Colley's Bias-Free College Football Ranking Method
- Tree reconstruction via partial order
- ...
Thanks for your attention. Q&A
Additive Model: AdaBoost
- Construct a classifier H(x) as a linear combination of base classifiers h(x):

$$H(x) = \sum_{t=1}^{T} \alpha_t h_t(x), \qquad H_T(x) = H_{T-1}(x) + \alpha_T h_T(x)$$

- To obtain the optimal base classifiers $\{h_T(x)\}$ and linear combination coefficients $\{\alpha_T\}$, we need to minimize the training error.
- For binary classification problems ($y \in \{1, -1\}$), the training error of the classifier $H(x)$ can be written as

$$err = \frac{1}{N} \sum_{i=1}^{N} \left[\, \mathrm{sign}(H(x_i)) \ne y_i \,\right]$$
Additive Model: AdaBoost
- For simplicity of computation, it uses the exponential cost function as the objective function.
- Apparently, the exponential cost function upper-bounds the training error err:

$$err \le \frac{1}{N} \sum_{i=1}^{N} e^{-H_T(x_i)\, y_i} = \frac{1}{N} \sum_{i=1}^{N} e^{-H_{T-1}(x_i)\, y_i} \left\{ e^{-\alpha_T}\, I(h_T(x_i), y_i) + e^{\alpha_T} \left(1 - I(h_T(x_i), y_i)\right) \right\}$$

where the function $I$ is defined as

$$I(x_1, x_2) = \begin{cases} 1 & \text{if } x_1 = x_2 \\ 0 & \text{otherwise} \end{cases}$$
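The upper bound holds pointwise: $e^{-yH(x)} \ge [\mathrm{sign}(H(x)) \ne y]$, since a misclassified point has a negative margin and hence $e^{-yH} > 1$. A quick numeric check (names are illustrative):

```python
import math

def zero_one(h, y):
    """0/1 loss of the real-valued score h against label y in {+1, -1}."""
    return 1.0 if math.copysign(1.0, h) != y else 0.0

def exp_bound(h, y):
    """Exponential surrogate e^{-y h}, the per-sample AdaBoost objective."""
    return math.exp(-h * y)

# The surrogate dominates the 0/1 loss for every margin value.
margins = [(-2.0, 1), (-0.1, 1), (0.1, 1), (2.0, 1), (0.5, -1)]
```

Summing over the training set gives the bound on err used above, which is why minimizing the exponential cost drives down the training error.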
Additive Model: AdaBoost

By setting the derivative of the equation above with respect to $\alpha_T$ to zero, we have the expression

$$\alpha_T = \frac{1}{2} \ln \frac{\sum_{i=1}^{N} e^{-H_{T-1}(x_i)\, y_i}\, I(h_T(x_i), y_i)}{\sum_{i=1}^{N} e^{-H_{T-1}(x_i)\, y_i}\, \left(1 - I(h_T(x_i), y_i)\right)}$$

With the expression of the data distribution

$$W_i^{T} = \frac{e^{-H_{T-1}(x_i)\, y_i}}{\sum_{j=1}^{N} e^{-H_{T-1}(x_j)\, y_j}}$$

the linear combination coefficient $\alpha_T$ can be written as

$$\alpha_T = \frac{1}{2} \ln \frac{\sum_{i} W_i^{T}\, I(h_T(x_i), y_i)}{\sum_{i} W_i^{T}\, \left(1 - I(h_T(x_i), y_i)\right)} = \frac{1}{2} \ln \frac{1 - \varepsilon_T}{\varepsilon_T}$$

where $\varepsilon_T$ stands for the weighted error rate under the weight distribution $W^{T}$ for the base classifier $h_T(x)$ in iteration $T$.
Additive Model

Given: $(x_i, y_i)$
1. Initialize: $W_i^{1} = 1/N$
2. For $t = 1, 2, \ldots, T$:
   - (a) Train a weak learner using distribution $W^{t}$
   - (b) Compute $\varepsilon_t = \sum_{i=1}^{N} W_i^{t}\, I(h_t(x_i) \ne y_i)$
   - (c) Compute $\alpha_t = \frac{1}{2} \ln \frac{1 - \varepsilon_t}{\varepsilon_t}$
   - (d) Update $W_i^{t+1} \propto W_i^{t}\, e^{-\alpha_t\, y_i h_t(x_i)}$
3. Output: $H(x) = \mathrm{sign}\!\left(\sum_{t=1}^{T} \alpha_t h_t(x)\right)$

Back
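The four-step loop above can be sketched directly; a minimal illustration that takes a fixed pool of candidate weak learners (functions mapping x to ±1) rather than training one from scratch each round:

```python
import math

def adaboost(X, y, stumps, n_rounds=10):
    """Minimal AdaBoost sketch. `stumps` is a list of candidate weak
    learners, each a function x -> +1/-1; each round picks the one with
    the lowest weighted error, then reweights the samples."""
    n = len(X)
    w = [1.0 / n] * n
    model = []  # (alpha, weak learner) pairs
    for _ in range(n_rounds):
        # (a)-(b): weak learner with minimal weighted error under w
        best_eps, best_h = None, None
        for h in stumps:
            eps = sum(wi for wi, xi, yi in zip(w, X, y) if h(xi) != yi)
            if best_eps is None or eps < best_eps:
                best_eps, best_h = eps, h
        eps = max(best_eps, 1e-12)  # guard against a perfect learner
        # (c): alpha = 1/2 ln((1 - eps) / eps)
        alpha = 0.5 * math.log((1.0 - eps) / eps)
        model.append((alpha, best_h))
        # (d): reweight misclassified samples up, correct ones down
        w = [wi * math.exp(-alpha * yi * best_h(xi))
             for wi, xi, yi in zip(w, X, y)]
        z = sum(w)
        w = [wi / z for wi in w]
    def H(x):
        s = sum(a * h(x) for a, h in model)
        return 1 if s >= 0 else -1
    return H

# Toy example: 1-D threshold stumps separating two classes.
X, y = [0.0, 1.0, 2.0, 3.0], [-1, -1, 1, 1]
stumps = [lambda x, t=t: 1 if x > t else -1 for t in (0.5, 1.5, 2.5)]
H = adaboost(X, y, stumps, n_rounds=3)
```

On this toy data the stump at 1.5 is already perfect, so the ensemble reproduces the labels; with noisier data the reweighting step forces later rounds to focus on the hard samples.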
NDCG
K. Järvelin and J. Kekäläinen, ACM Transactions on Information Systems, 2002.

Example: assume that the relevance scores 0-3 are used.
- G' = <3, 2, 3, 0, 0, 1, 2, 2, 3, 0, ...>
- Cumulated Gain (CG): CG' = <3, 5, 8, 8, 8, 9, 11, 13, 16, 16, ...>

$$CG[i] = \begin{cases} G[1], & \text{if } i = 1 \\ CG[i-1] + G[i], & \text{otherwise} \end{cases}$$
NDCG

Discounted Cumulated Gain (DCG), let b = 2:
- DCG' = <3, 5, 6.89, 6.89, 6.89, 7.28, 7.99, 8.66, 9.61, 9.61, ...>

$$DCG[i] = \begin{cases} G[1], & \text{if } i = 1 \\ DCG[i-1] + G[i]/\log_b i, & \text{otherwise} \end{cases}$$

Normalized Discounted Cumulated Gain (NDCG), with the ideal vector:
- I' = <3, 3, 3, 2, 2, 2, 1, 1, 1, 1, 0, 0, 0, ...>
- CGI' = <3, 6, 9, 11, 13, 15, 16, 17, 18, 19, 19, 19, 19, ...>
- DCGI' = <3, 6, 7.89, 8.89, 9.75, 10.52, 10.88, 11.21, 11.53, 11.83, 11.83, ...>
- NDCG' = <1, 0.83, 0.89, 0.73, 0.62, 0.6, 0.69, 0.76, 0.89, 0.84, ...>

Back
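The DCG recursion can be sketched directly; a minimal implementation with b = 2 (note that log_2 2 = 1, so position 2 is undiscounted, matching the vectors above):

```python
import math

def dcg(gains, b=2):
    """Discounted cumulated gain: DCG[1] = G[1];
    DCG[i] = DCG[i-1] + G[i] / log_b(i) for i >= 2."""
    out = []
    for i, g in enumerate(gains, start=1):
        out.append(g if i == 1 else out[-1] + g / math.log(i, b))
    return out

def ndcg(gains, ideal, b=2):
    """Normalize position-by-position against the ideal gain vector."""
    d, di = dcg(gains, b), dcg(ideal, b)
    return [x / y for x, y in zip(d, di)]
```

Running `dcg` on G' = <3, 2, 3, 0, 0, 1, 2, 2, 3, 0> reproduces the DCG' vector above to two decimals.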