A Support Vector Method for Optimizing Average Precision
SIGIR 2007
Yisong Yue, Cornell University
In Collaboration With:
Thomas Finley, Filip Radlinski, Thorsten Joachims
(Cornell University)
Motivation
• Learn to Rank Documents
• Optimize for IR performance measures
• Mean Average Precision
• Leverage Structural SVMs [Tsochantaridis et al. 2005]
MAP vs Accuracy
• Average Precision is the average of the precision scores at the rank positions of each relevant document.
• Ex: [example ranking figure with its Average Precision]
• Mean Average Precision (MAP) is the mean of the Average Precision scores for a group of queries.
• A machine learning algorithm optimizing for Accuracy might learn a very different model than one optimizing for MAP.
• Ex: [second ranking figure] has an Average Precision of about 0.64, but a maximum Accuracy of 0.8 vs. 0.6 for the ranking above.
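The Average Precision and MAP definitions above can be made concrete with a short sketch (my Python illustration, not code from the talk):

```python
# A minimal sketch of Average Precision and MAP (not the authors' code).
# `ranking` is a list of 0/1 relevance labels in ranked order.

def average_precision(ranking):
    hits = 0
    precisions = []
    for rank, rel in enumerate(ranking, start=1):
        if rel:
            hits += 1
            precisions.append(hits / rank)  # precision at each relevant doc's rank
    return sum(precisions) / hits if hits else 0.0

def mean_average_precision(rankings):
    # MAP: mean of per-query Average Precision scores
    return sum(average_precision(r) for r in rankings) / len(rankings)
```

For example, `average_precision([1, 0, 1, 0, 1])` averages precisions 1/1, 2/3, and 3/5.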
Recent Related Work
Greedy & Local Search
• [Metzler & Croft 2005] – optimized for MAP using gradient descent; expensive for large numbers of features.
• [Caruana et al. 2004] – iteratively built an ensemble to greedily improve arbitrary performance measures.
Surrogate Performance Measures
• [Burges et al. 2005] – used neural nets optimizing for cross entropy.
• [Cao et al. 2006] – used SVMs optimizing for a modified ROC-Area.
Relaxations
• [Xu & Li 2007] – used Boosting with an exponential loss relaxation.
Conventional SVMs
• Input examples denoted by x (a high-dimensional point)
• Output targets denoted by y (either +1 or -1)
• SVMs learn a hyperplane w; predictions are sign(w^T x)
• Training involves finding the w which minimizes
$$\frac{1}{2}\|w\|^2 + \frac{C}{N}\sum_{i=1}^{N}\xi_i$$
• subject to
$$\forall i:\ y_i(w^T x_i) \ge 1 - \xi_i,\quad \xi_i \ge 0$$
• The sum of slacks $\sum_i \xi_i$ upper bounds the accuracy loss
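The objective above can be optimized by subgradient descent; a minimal sketch (my illustration with toy settings, not the solver used in the paper):

```python
import numpy as np

def train_linear_svm(X, y, C=1.0, lr=0.01, epochs=200):
    """Minimize 0.5*||w||^2 + (C/N) * sum(max(0, 1 - y_i * w.x_i))
    by subgradient descent (bias term omitted for brevity)."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        margins = y * (X @ w)
        viol = margins < 1.0          # examples with nonzero hinge loss
        # subgradient: regularizer pulls w to 0, violated examples push it out
        grad = w - (C / n) * (y[viol, None] * X[viol]).sum(axis=0)
        w = w - lr * grad
    return w
```

On linearly separable toy data, the learned w recovers the correct signs for all training points.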
Adapting to Average Precision
• Let x denote the set of documents/query examples for a query
• Let y denote a (weak) ranking, with each $y_{ij} \in \{-1, 0, +1\}$
• Same objective function:
$$\frac{1}{2}\|w\|^2 + \frac{C}{N}\sum_i \xi_i$$
• Constraints are defined for each incorrect labeling y′ over the set of documents x:
$$\forall y' \ne y:\ w^T\Psi(x, y) \ge w^T\Psi(x, y') + \Delta(y', y) - \xi$$
• The joint discriminant score for the correct labeling must be at least as large as that of any incorrect labeling plus its performance loss.
Adapting to Average Precision
• Minimize
$$\frac{1}{2}\|w\|^2 + \frac{C}{N}\sum_i \xi_i$$
subject to
$$\forall y' \ne y:\ w^T\Psi(x, y) \ge w^T\Psi(x, y') + \Delta(y', y) - \xi$$
where
$$\Psi(x, y') = \sum_{i:\,\mathrm{rel}}\ \sum_{j:\,\mathrm{nonrel}} y'_{ij}(x_i - x_j)$$
and
$$\Delta(y', y) = 1 - \mathrm{AvgPrec}(y')$$
• The sum of slacks upper bounds the MAP loss.
• After learning w, a prediction is made by sorting documents by $w^T x_i$.
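The joint feature map Ψ above can be computed directly from its definition; a small sketch (my illustration, with hypothetical toy inputs):

```python
import numpy as np

def joint_feature_map(X_rel, X_non, y_pair):
    """Psi(x, y') = sum over (rel i, nonrel j) of y'_ij * (x_i - x_j),
    where y_pair[i][j] = +1 if relevant doc i is ranked above
    non-relevant doc j, and -1 otherwise."""
    psi = np.zeros(X_rel.shape[1])
    for i, xi in enumerate(X_rel):
        for j, xj in enumerate(X_non):
            psi += y_pair[i][j] * (xi - xj)
    return psi
```

With the perfect ranking (all $y'_{ij} = +1$), Ψ is simply the sum of all pairwise feature differences.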
Too Many Constraints!
• For Average Precision, the true labeling is a ranking where the relevant documents are all ranked at the front, e.g., [ranking figure]
• An incorrect labeling would be any other ranking, e.g., [ranking figure]
• This ranking has an Average Precision of about 0.8, with $\Delta(y', y) \approx 0.2$
• There is an exponential number of rankings, and thus an exponential number of constraints!
Structural SVM Training
• STEP 1: Solve the SVM objective function using only the current working set of constraints.
• STEP 2: Using the model learned in STEP 1, find the most violated constraint from the exponential set of constraints.
• STEP 3: If the constraint returned in STEP 2 is violated by more than the most violated constraint in the working set by some small constant, add it to the working set.
• Repeat STEPs 1-3 until no additional constraints are added. Return the most recent model trained in STEP 1.
STEPs 1-3 are guaranteed to loop for at most a polynomial number of iterations. [Tsochantaridis et al. 2005]
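The working-set loop above can be sketched generically (a schematic of cutting-plane training in the style of [Tsochantaridis et al. 2005]; `solve_qp`, `find_most_violated`, and `violation` are hypothetical stand-ins, not the authors' code):

```python
def cutting_plane_train(solve_qp, find_most_violated, violation,
                        eps=1e-3, max_iter=100):
    """Generic working-set loop for Structural SVM training.

    solve_qp(working_set)        -> model fit to the current constraints (STEP 1)
    find_most_violated(model)    -> most violated constraint under model (STEP 2)
    violation(model, constraint) -> how much `model` violates `constraint`
    """
    working_set = []
    model = solve_qp(working_set)
    for _ in range(max_iter):
        c = find_most_violated(model)                             # STEP 2
        worst = max((violation(model, w) for w in working_set), default=0.0)
        if violation(model, c) <= worst + eps:                    # STEP 3 test
            break            # no sufficiently violated constraint remains
        working_set.append(c)
        model = solve_qp(working_set)                             # re-solve
    return model
```

The loop terminates once the oracle can no longer find a constraint that is violated by more than eps beyond the working set.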
Illustrative Example
Original SVM Problem
• Exponential number of constraints
• Most are dominated by a small set of “important” constraints
Structural SVM Approach
• Repeatedly finds the next most violated constraint…
• …until the set of constraints is a good approximation.
Finding Most Violated Constraint
• The Structural SVM is an oracle framework.
• It requires a subroutine to find the most violated constraint.
  – Dependent on the formulation of the loss function and the joint feature representation.
• Exponential number of constraints!
• There is an efficient algorithm in the case of optimizing MAP.
Finding Most Violated Constraint
Observation
• MAP is invariant to the order of documents within a relevance class.
  – Swapping two relevant (or two non-relevant) documents does not change MAP.
• The joint SVM score is optimized by sorting by document score, $w^T x$.
• Finding the most violated constraint thus reduces to finding an interleaving of two sorted lists of documents:
$$H(y'; w) = \Delta(y', y) + \sum_{i:\,\mathrm{rel}}\ \sum_{j:\,\mathrm{nonrel}} y'_{ij}(w^T x_i - w^T x_j)$$
Finding Most Violated Constraint
$$H(y'; w) = \Delta(y', y) + \sum_{i:\,\mathrm{rel}}\ \sum_{j:\,\mathrm{nonrel}} y'_{ij}(w^T x_i - w^T x_j)$$
• Start with the perfect ranking.
• Consider swapping adjacent relevant/non-relevant documents.
• Find the best feasible ranking of the non-relevant document.
• Repeat for the next non-relevant document.
• Never swap past the previous non-relevant document.
• Repeat until all non-relevant documents have been considered.
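For intuition about what the oracle computes, here is a brute-force sketch (my illustration, not the paper's efficient greedy procedure) that maximizes H(y′; w) over order-preserving interleavings of the two score-sorted lists, for tiny inputs:

```python
from itertools import combinations

def average_precision(labels):
    hits, total = 0, 0.0
    for rank, rel in enumerate(labels, start=1):
        if rel:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0

def most_violated_ranking(rel_scores, non_scores):
    """Brute force over interleavings of the score-sorted relevant and
    non-relevant lists, maximizing H(y') = (1 - AvgPrec(y')) + pairwise terms."""
    rel = sorted(rel_scores, reverse=True)
    non = sorted(non_scores, reverse=True)
    n = len(rel) + len(non)
    best, best_h = None, float("-inf")
    for rel_pos in combinations(range(n), len(rel)):  # positions of relevant docs
        labels, scores = [0] * n, [0.0] * n
        it_rel, it_non = iter(rel), iter(non)
        for p in range(n):
            if p in rel_pos:
                labels[p], scores[p] = 1, next(it_rel)
            else:
                labels[p], scores[p] = 0, next(it_non)
        # pairwise term: +(s_i - s_j) if rel i is above nonrel j, else -(s_i - s_j)
        pair = 0.0
        for i in range(n):
            for j in range(n):
                if labels[i] == 1 and labels[j] == 0:
                    sign = 1.0 if i < j else -1.0
                    pair += sign * (scores[i] - scores[j])
        h = (1.0 - average_precision(labels)) + pair
        if h > best_h:
            best_h, best = h, labels
    return best, best_h
```

This enumeration is exponential; the contribution of the paper is a greedy interleaving that finds the same maximizer in time linear in the number of relevant/non-relevant document pairs.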
Quick Recap
SVM Formulation
• SVMs optimize a tradeoff between model complexity and MAP loss.
• Exponential number of constraints (one for each incorrect ranking).
• Structural SVMs find a small subset of important constraints.
• Requires a sub-procedure to find the most violated constraint.
Finding the Most Violated Constraint
• The loss function is invariant to re-ordering of relevant documents.
• The SVM score imposes an ordering of the relevant documents.
• Reduces to finding an interleaving of two sorted lists.
• The loss function has certain monotonic properties.
• An efficient algorithm exists.
Experiments
• Used TREC 9 & 10 Web Track corpus.
• Features of document/query pairs computed from the outputs of existing retrieval functions (Indri Retrieval Functions & TREC Submissions).
• Goal is to learn a recombination of outputs which improves mean average precision.
Comparison with Best Base Methods
[Figure: bar chart of Mean Average Precision (y-axis, 0.10–0.30) for SVM-MAP vs. the three best base methods (Base 1, Base 2, Base 3) on six datasets: TREC 9 Indri, TREC 10 Indri, TREC 9 Submissions, TREC 10 Submissions, TREC 9 Submissions (without best), TREC 10 Submissions (without best).]
Comparison with Other SVM Methods
[Figure: bar chart of Mean Average Precision (y-axis, 0.10–0.30) for SVM-MAP vs. SVM-ROC, SVM-ACC, SVM-ACC2, SVM-ACC3, and SVM-ACC4 on the same six datasets.]
Moving Forward
• Approach also works (in theory) for other measures.
• Some promising results when optimizing for NDCG (with only 1 level of relevance).
• Currently working on optimizing for NDCG with multiple levels of relevance.
• Preliminary MRR results not as promising.
Conclusions
• A principled approach to optimizing Average Precision (avoids difficult-to-control heuristics).
• Performs at least as well as alternative SVM methods.
• Can be generalized to a large class of rank-based performance measures.
• Software available at http://svmrank.yisongyue.com