Learning to Rank (wnzhang.net/teaching/ee448/slides/9-learning-to-rank.pdf)
TRANSCRIPT
[Page 1]
Learning to Rank
Weinan Zhang
Shanghai Jiao Tong University
http://wnzhang.net
2019 EE448, Big Data Mining, Lecture 9
http://wnzhang.net/teaching/ee448/index.html
[Page 2]
Content of This Course
• Another ML problem: ranking
• Learning to rank
• Pointwise methods
• Pairwise methods
• Listwise methods
[Page 3]
Ranking Problem
Learning to rank
Pointwise methods
Pairwise methods
Listwise methods

Sincere thanks to Dr. Tie-Yan Liu
[Page 4]
The Probability Ranking Principle
• https://nlp.stanford.edu/IR-book/html/htmledition/the-probability-ranking-principle-1.html
[Page 5]
Regression and Classification
• Two major problems for supervised learning
• Regression
  $L(y_i, f_\theta(x_i)) = \frac{1}{2}\left(y_i - f_\theta(x_i)\right)^2$
• Supervised learning
  $\min_\theta \frac{1}{N}\sum_{i=1}^{N} L(y_i, f_\theta(x_i))$
• Classification
  $L(y_i, f_\theta(x_i)) = -y_i \log f_\theta(x_i) - (1 - y_i)\log(1 - f_\theta(x_i))$
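The two losses can be checked numerically. A minimal sketch; the labels and predictions below are toy values, not from the slides:

```python
import numpy as np

def squared_loss(y, f):
    # Regression loss: L(y, f) = 1/2 (y - f)^2
    return 0.5 * (y - f) ** 2

def cross_entropy_loss(y, f):
    # Binary classification loss: L(y, f) = -y log f - (1 - y) log(1 - f)
    return -y * np.log(f) - (1 - y) * np.log(1 - f)

# Supervised learning minimizes the average loss over N samples
y_true = np.array([1.0, 0.0, 1.0])
f_pred = np.array([0.9, 0.2, 0.6])
avg_regression = np.mean(squared_loss(y_true, f_pred))
avg_classification = np.mean(cross_entropy_loss(y_true, f_pred))
```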
[Page 6]
Learning to Rank Problem
• Input: a set of instances
  $X = \{x_1, x_2, \ldots, x_n\}$
• Output: a ranked list of these instances
  $Y = \{x_{r_1}, x_{r_2}, \ldots, x_{r_n}\}$
• Ground truth: a correct ranking of these instances
  $Y = \{x_{y_1}, x_{y_2}, \ldots, x_{y_n}\}$
[Page 7]
A Typical Application
• Webpage ranking given a query
• Page ranking
[Page 8]
Webpage Ranking
[Diagram: a query $q$ (e.g. "ML in China") is sent to the ranking model over an indexed document repository $D = \{d_i\}$, which returns a ranked list of documents:
$d_{q_1}$ = https://www.crunchbase.com
$d_{q_2}$ = https://www.reddit.com
...
$d_{q_n}$ = https://www.quora.com]
[Page 9]
Model Perspective
• In most existing work, learning to rank is defined as having the following two properties
• Feature-based
  • Each instance (e.g. a query-document pair) is represented with a list of features
• Discriminative training
  • Estimate the relevance given a query-document pair: $y_i = f_\theta(x_i)$
  • Rank the documents based on the estimation
[Page 10]
Learning to Rank
• Input: features of query and documents
  • Query, document, and combination features
• Output: the documents ranked by a scoring function $y_i = f_\theta(x_i)$
• Objective: relevance of the ranking list
  • Evaluation metrics: NDCG, MAP, MRR, ...
• Training data: the query-doc features and relevance ratings
[Page 11]
Training Data
• The query-doc features and relevance ratings (Query = 'ML in China')

| Rating | Document | Query Length | Doc PageRank | Doc Length | Title Rel. | Content Rel. |
|--------|----------|--------------|--------------|------------|------------|--------------|
| 3 | d1 = http://crunchbase.com | 0.30 | 0.61 | 0.47 | 0.54 | 0.76 |
| 5 | d2 = http://reddit.com | 0.30 | 0.81 | 0.76 | 0.91 | 0.81 |
| 4 | d3 = http://quora.com | 0.30 | 0.86 | 0.56 | 0.96 | 0.69 |

• Query features: Query Length; document features: Doc PageRank, Doc Length; query-doc features: Title Rel., Content Rel.
[Page 12]
Learning to Rank Approaches
• Learn (not define) a scoring function to optimally rank the documents given a query
• Pointwise
  • Predict the absolute relevance (e.g. RMSE)
• Pairwise
  • Predict the ranking of a document pair (e.g. AUC)
• Listwise
  • Predict the ranking of a document list (e.g. cross entropy)

Tie-Yan Liu. Learning to Rank for Information Retrieval. Springer, 2011. http://www.cda.cn/uploadfile/image/20151220/20151220115436_46293.pdf
[Page 13]
Pointwise Approaches
• Predict the expert ratings: $y_i = f_\theta(x_i)$
• As a regression problem:
  $\min_\theta \frac{1}{2N}\sum_{i=1}^{N}\left(y_i - f_\theta(x_i)\right)^2$
• Training data (Query = 'ML in China'):

| Rating | Document | Query Length | Doc PageRank | Doc Length | Title Rel. | Content Rel. |
|--------|----------|--------------|--------------|------------|------------|--------------|
| 3 | d1 = http://crunchbase.com | 0.30 | 0.61 | 0.47 | 0.54 | 0.76 |
| 5 | d2 = http://reddit.com | 0.30 | 0.81 | 0.76 | 0.91 | 0.81 |
| 4 | d3 = http://quora.com | 0.30 | 0.86 | 0.56 | 0.96 | 0.69 |
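As a sketch (not the slides' actual model), a linear scoring function can be fit to the three rated documents in the table above by least squares and then used to rank; a real system would train on many more query-doc pairs:

```python
import numpy as np

# Feature rows taken from the table above (query length, doc PageRank,
# doc length, title relevance, content relevance)
X = np.array([
    [0.30, 0.61, 0.47, 0.54, 0.76],  # d1 = crunchbase.com
    [0.30, 0.81, 0.76, 0.91, 0.81],  # d2 = reddit.com
    [0.30, 0.86, 0.56, 0.96, 0.69],  # d3 = quora.com
])
y = np.array([3.0, 5.0, 4.0])        # expert ratings

# min_theta 1/(2N) sum_i (y_i - theta^T x_i)^2, solved in closed form
theta, *_ = np.linalg.lstsq(X, y, rcond=None)
scores = X @ theta

# Rank documents by descending predicted relevance: d2, d3, d1
ranking = np.argsort(-scores)
```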
[Page 14]
Point Accuracy != Ranking Accuracy
• The same squared error might lead to different rankings
[Figure: Doc 1 has ground truth relevance 3, Doc 2 has 4. The prediction pairs 2.4/4.6 and 3.6/3.4 have the same per-document error of 0.6, but the latter interchanges the ranking.]
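A quick numeric check of this point, using the figure's values:

```python
# Two documents with known ground-truth relevance
truth = {"doc1": 3.0, "doc2": 4.0}

pred_a = {"doc1": 2.4, "doc2": 4.6}  # keeps doc2 above doc1
pred_b = {"doc1": 3.6, "doc2": 3.4}  # interchanges the ranking

def sq_err(pred):
    # Total squared error against the ground truth
    return sum((truth[d] - pred[d]) ** 2 for d in truth)

same_error = abs(sq_err(pred_a) - sq_err(pred_b)) < 1e-9
order_a = pred_a["doc1"] < pred_a["doc2"]   # correct order preserved
order_b = pred_b["doc1"] < pred_b["doc2"]   # ranking interchanged
```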
[Page 15]
Pairwise Approaches
• Do not care about the absolute relevance but the relative preference on a document pair
• A binary classification problem
• For a query $q^{(i)}$ with rated documents
  $\left[\, d^{(i)}_1, 5;\; d^{(i)}_2, 3;\; \ldots;\; d^{(i)}_{n^{(i)}}, 2 \,\right]$
  transform into document pairs
  $\left\{ (d^{(i)}_1, d^{(i)}_2),\; (d^{(i)}_1, d^{(i)}_{n^{(i)}}),\; \ldots,\; (d^{(i)}_2, d^{(i)}_{n^{(i)}}) \right\}$
  with preferences $5 > 3$, $5 > 2$, $3 > 2$
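The transform can be sketched in a few lines; the document ids and ratings below follow the slide's example with a three-document list:

```python
from itertools import combinations

# One query's rated documents: (document id, rating)
rated_docs = [("d1", 5), ("d2", 3), ("d3", 2)]

# Turn the rated list into labeled preference pairs; pairs with equal
# ratings carry no preference and are skipped
pairs = []
for (di, ri), (dj, rj) in combinations(rated_docs, 2):
    if ri > rj:
        pairs.append((di, dj))   # di preferred over dj
    elif rj > ri:
        pairs.append((dj, di))
# pairs encodes the preferences 5 > 3, 5 > 2, 3 > 2
```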
[Page 16]
Binary Classification for Pairwise Ranking
• Given a query $q$ and a pair of documents $(d_i, d_j)$
• Target probability
  $y_{i,j} = \begin{cases} 1 & \text{if } d_i \rhd d_j \\ 0 & \text{otherwise} \end{cases}$
• Modeled probability
  $P_{i,j} = P(d_i \rhd d_j \mid q) = \frac{\exp(o_{i,j})}{1 + \exp(o_{i,j})}$
  where $o_{i,j} \equiv f(x_i) - f(x_j)$ and $x_i$ is the feature vector of $(q, d_i)$
• Cross entropy loss
  $L(q, d_i, d_j) = -y_{i,j} \log P_{i,j} - (1 - y_{i,j}) \log(1 - P_{i,j})$
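A minimal sketch of the modeled probability and loss; the scores below are made-up numbers:

```python
import math

def pair_prob(score_i, score_j):
    # o_ij = f(x_i) - f(x_j); P_ij = exp(o)/(1 + exp(o)), i.e. a sigmoid of o
    o_ij = score_i - score_j
    return 1.0 / (1.0 + math.exp(-o_ij))

def pair_loss(score_i, score_j, y_ij):
    # Cross entropy between target y_ij and modeled P_ij
    p = pair_prob(score_i, score_j)
    return -y_ij * math.log(p) - (1 - y_ij) * math.log(1 - p)

# If f ranks d_i above d_j and the label agrees (y_ij = 1), the loss is small;
# a confidently wrong ordering is penalized heavily
loss_good = pair_loss(2.0, 0.0, 1)
loss_bad = pair_loss(0.0, 2.0, 1)
```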
[Page 17]
RankNet
• The scoring function $f_\theta(x_i)$ is implemented by a neural network

Burges, Christopher JC, et al. "Learning to rank using gradient descent." ICML 2005.
• Modeled probability
  $P_{i,j} = P(d_i \rhd d_j \mid q) = \frac{\exp(o_{i,j})}{1 + \exp(o_{i,j})}$, with $o_{i,j} \equiv f(x_i) - f(x_j)$
• Cross entropy loss
  $L(q, d_i, d_j) = -y_{i,j} \log P_{i,j} - (1 - y_{i,j}) \log(1 - P_{i,j})$
• Gradient by chain rule (backpropagation in the neural network)
  $\frac{\partial L(q, d_i, d_j)}{\partial \theta} = \frac{\partial L(q, d_i, d_j)}{\partial P_{i,j}} \frac{\partial P_{i,j}}{\partial o_{i,j}} \frac{\partial o_{i,j}}{\partial \theta} = \frac{\partial L(q, d_i, d_j)}{\partial P_{i,j}} \frac{\partial P_{i,j}}{\partial o_{i,j}} \left( \frac{\partial f_\theta(x_i)}{\partial \theta} - \frac{\partial f_\theta(x_j)}{\partial \theta} \right)$
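A minimal RankNet-style sketch with a linear scoring function standing in for the neural network; the feature vectors are invented for illustration. For the cross entropy loss above, the chain rule collapses to the gradient (P_ij - y_ij)(x_i - x_j) when f(x) = w @ x:

```python
import numpy as np

# Linear stand-in for the neural scoring function: f(x) = w @ x
w = np.zeros(4)

x_i = np.array([1.0, 0.5, 0.2, 0.9])   # features of the preferred document
x_j = np.array([0.2, 0.4, 0.8, 0.1])   # features of the other document
y_ij = 1.0                              # label: d_i should rank above d_j

lr = 0.1
for _ in range(100):
    o_ij = w @ x_i - w @ x_j
    p_ij = 1.0 / (1.0 + np.exp(-o_ij))
    # Chain rule through P_ij and o_ij: dL/dw = (P_ij - y_ij) * (x_i - x_j)
    grad = (p_ij - y_ij) * (x_i - x_j)
    w -= lr * grad

# After training, the model prefers d_i to d_j with high probability
final_p = 1.0 / (1.0 + np.exp(-(w @ x_i - w @ x_j)))
```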
[Page 18]
Shortcomings of Pairwise Approaches
• Each document pair is regarded with the same importance
[Figure: a list of documents with ratings such as 2, 4, 3, 2, 4; the same pair-level error can correspond to different list-level error]
[Page 19]
Ranking Evaluation Metrics
• Precision@k for query $q$
  $P@k = \frac{\#\{\text{relevant documents in top } k \text{ results}\}}{k}$
• For binary labels
  $y_i = \begin{cases} 1 & \text{if } d_i \text{ is relevant to } q \\ 0 & \text{otherwise} \end{cases}$
• Average precision for query $q$
  $AP = \frac{\sum_k P@k \cdot y_{i(k)}}{\#\{\text{relevant documents}\}}$
  where $i(k)$ is the document id at the $k$-th position
• Example: for the ranked binary labels 1, 0, 1, 0, 1,
  $AP = \frac{1}{3}\left(\frac{1}{1} + \frac{2}{3} + \frac{3}{5}\right)$
• Mean average precision (MAP): average of AP over all queries
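The worked example above can be reproduced directly from the two definitions:

```python
def precision_at_k(labels, k):
    # P@k = number of relevant documents in the top k results / k
    return sum(labels[:k]) / k

def average_precision(labels):
    # AP = sum_k P@k * y_{i(k)} / number of relevant documents
    n_rel = sum(labels)
    return sum(precision_at_k(labels, k + 1) * y
               for k, y in enumerate(labels)) / n_rel

# Ranked binary relevance labels from the slide's example
labels = [1, 0, 1, 0, 1]
ap = average_precision(labels)   # (1/1 + 2/3 + 3/5) / 3
```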
[Page 20]
Ranking Evaluation Metrics
• Normalized discounted cumulative gain (NDCG@k) for query $q$
  $NDCG@k = Z_k \sum_{j=1}^{k} \frac{2^{y_{i(j)}} - 1}{\log(j + 1)}$
  with normalizer $Z_k$, gain $2^{y_{i(j)}} - 1$, and discount $1/\log(j + 1)$
• For score labels, e.g., $y_i \in \{0, 1, 2, 3, 4\}$
• $i(j)$ is the document id at the $j$-th position
• $Z_k$ is set to normalize the DCG of the ground truth ranking to 1
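A sketch of the formula above; natural log is used here, and the log base cancels between the DCG and the normalizer $Z_k$ anyway. The graded labels are an invented example:

```python
import math

def dcg_at_k(labels, k):
    # sum over positions j = 1..k of gain (2^y - 1) times discount 1/log(j + 1)
    return sum((2 ** y - 1) / math.log(j + 1)
               for j, y in enumerate(labels[:k], start=1))

def ndcg_at_k(labels, k):
    ideal = sorted(labels, reverse=True)   # the ground truth ranking
    z_k = 1.0 / dcg_at_k(ideal, k)         # normalizer: ideal DCG maps to 1
    return z_k * dcg_at_k(labels, k)

# Graded labels in predicted rank order (hypothetical)
predicted_order = [3, 2, 3, 0, 1]
score = ndcg_at_k(predicted_order, 5)
```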
[Page 21]
Shortcomings of Pairwise Approaches
• Same pair-level error but different list-level error
  $NDCG@k = Z_k \sum_{j=1}^{k} \frac{2^{y_{i(j)}} - 1}{\log(j + 1)}$
[Figure: two rankings of documents with binary labels $y = 0$ and $y = 1$, showing different NDCG for the same pair-level error]
[Page 22]
Listwise Approaches
• Training loss is directly built on the difference between the predicted list and the ground truth list
• Straightforward target
  • Directly optimize the ranking evaluation measures
• Complex model
[Page 23]
ListNet
• Train the scoring function $y_i = f_\theta(x_i)$
• Rankings are generated based on the scores $\{y_i\}_{i=1 \ldots n}$
• Each possible $k$-length ranking list has a probability
  $P_f([j_1, j_2, \ldots, j_k]) = \prod_{t=1}^{k} \frac{\exp(f(x_{j_t}))}{\sum_{l=t}^{n} \exp(f(x_{j_l}))}$
• List-level loss: cross entropy between the predicted distribution and the ground truth
  $L(y, f(x)) = -\sum_{g \in G_k} P_y(g) \log P_f(g)$
• Complexity issue: many possible rankings

Cao, Zhe, et al. "Learning to rank: from pairwise approach to listwise approach." Proceedings of the 24th International Conference on Machine Learning. ACM, 2007.
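The ranking probability above (the Plackett-Luce model) can be sketched for the full-permutation case $k = n$; the scores are invented. Enumerating all $n!$ rankings is exactly the complexity issue the slide mentions, which is why ListNet in practice uses the top-one distribution:

```python
import math
from itertools import permutations

def perm_prob(ranking, scores):
    # P_f([j_1..j_n]) = prod_t exp(f(x_{j_t})) / sum_{l=t..n} exp(f(x_{j_l}))
    p = 1.0
    for t in range(len(ranking)):
        remaining = ranking[t:]
        p *= math.exp(scores[ranking[t]]) / sum(math.exp(scores[j])
                                                for j in remaining)
    return p

scores = [2.0, 1.0, 0.5]   # f(x_0), f(x_1), f(x_2)
probs = {r: perm_prob(list(r), scores) for r in permutations(range(3))}
# The probabilities over all 3! rankings sum to 1, and the ranking that
# sorts documents by score, (0, 1, 2), is the most likely one
```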
[Page 24]
Distance between Ranked Lists
• A similar distance: KL divergence
Tie-Yan Liu @ Tutorial at WWW 2008
[Page 25]
Pairwise vs. Listwise
• Pairwise approach shortcoming
  • Pair-level loss is far from the IR list-level evaluations
• Listwise approach shortcoming
  • Hard to define a list-level loss under a low model complexity
• A good solution: LambdaRank
  • Pairwise training with listwise information
Burges, Christopher JC, Robert Ragno, and Quoc Viet Le. "Learning to rank with nonsmooth cost functions." NIPS. Vol. 6. 2006.
[Page 26]
LambdaRank
• Pairwise approach gradient, with $o_{i,j} \equiv f(x_i) - f(x_j)$:
  $\frac{\partial L(q, d_i, d_j)}{\partial \theta} = \underbrace{\frac{\partial L(q, d_i, d_j)}{\partial P_{i,j}} \frac{\partial P_{i,j}}{\partial o_{i,j}}}_{\lambda_{i,j}} \left( \frac{\partial f_\theta(x_i)}{\partial \theta} - \frac{\partial f_\theta(x_j)}{\partial \theta} \right)$
  where $\lambda_{i,j}$ comes from the pairwise ranking loss and the parenthesized term from the scoring function itself
• LambdaRank basic idea
  • Add listwise information by replacing $\lambda_{i,j}$ with $h(\lambda_{i,j}, g_q)$, where $g_q$ is the current ranking list:
  $\frac{\partial L(q, d_i, d_j)}{\partial \theta} = h(\lambda_{i,j}, g_q) \left( \frac{\partial f_\theta(x_i)}{\partial \theta} - \frac{\partial f_\theta(x_j)}{\partial \theta} \right)$
[Page 27]
LambdaRank for Optimizing NDCG
• A choice of lambda for optimizing NDCG:
  $h(\lambda_{i,j}, g_q) = \lambda_{i,j} \cdot \Delta NDCG_{i,j}$
  where $\Delta NDCG_{i,j}$ is the NDCG change obtained by swapping $d_i$ and $d_j$ in the current ranking list
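A hedged sketch of the $\Delta NDCG$ weight (the graded labels below are invented): swapping a pair near the top of the list changes NDCG far more than the same swap lower down, so that pair receives a larger update.

```python
import math

def dcg(labels):
    # DCG with gain 2^y - 1 and discount 1/log(j + 1)
    return sum((2 ** y - 1) / math.log(j + 1)
               for j, y in enumerate(labels, start=1))

def delta_ndcg(labels, i, j):
    # |NDCG after swapping positions i and j - NDCG before the swap|
    z = 1.0 / dcg(sorted(labels, reverse=True))   # normalizer from ground truth
    swapped = labels[:]
    swapped[i], swapped[j] = swapped[j], swapped[i]
    return abs(z * dcg(swapped) - z * dcg(labels))

current = [0, 3, 2, 1]                    # labels in the current ranked order
top_swap = delta_ndcg(current, 0, 1)      # swap at the top of the list
bottom_swap = delta_ndcg(current, 2, 3)   # swap near the bottom
# h = lambda_ij * |delta NDCG| weights the top swap much more heavily
```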
[Page 28]
LambdaRank vs. RankNet
[Figure: performance comparison of LambdaRank and RankNet with linear nets]

Burges, Christopher JC, Robert Ragno, and Quoc Viet Le. "Learning to rank with nonsmooth cost functions." NIPS. Vol. 6. 2006.
[Page 29]
LambdaRank vs. RankNet
Burges, Christopher JC, Robert Ragno, and Quoc Viet Le. "Learning to rank with nonsmooth cost functions." NIPS. Vol. 6. 2006.
[Page 30]
Summary of Learning to Rank
• Pointwise, pairwise and listwise approaches for learning to rank
• Pairwise approaches are still the most popular
  • A balance of ranking effectiveness and training efficiency
• LambdaRank is a pairwise approach with list-level information
  • Easy to implement, easy to improve and adjust