TRANSCRIPT
Advances on the Development of Evaluation Measures
Ben Carterette Evangelos Kanoulas Emine Yilmaz
Information Retrieval Systems
Match information seekers with
the information they seek
“What you can’t measure you can’t improve”
Lord Kelvin
Why is Evaluation so Important?
Most retrieval systems are tuned to optimize for an objective evaluation metric.
Outline
• Intro to evaluation
– Different approaches to evaluation
– Traditional evaluation measures
• User model based evaluation measures
• Session Evaluation
• Novelty and Diversity
Online Evaluation
• Design interactive experiments
• Use users’ actions (click / no-click) to evaluate result quality
Online Evaluation
• Standard click metrics
– Clickthrough rate
– Queries per user
– Probability user skips over results they have considered (pSkip)
• Result interleaving
What is result interleaving?
• A way to compare rankers online
– Given the two rankings produced by two methods
– Present a combination of the rankings to users
• Result interleaving
– Credit assignment based on clicks
Team Draft Interleaving (Radlinski et al., 2008)
• Interleaving two rankings
– Input: Two rankings
• Repeat:
– Toss a coin to see which team picks next
– Winner picks their best remaining player
– Loser picks their best remaining player
– Output: One ranking
• Credit assignment
– Ranking providing more of the clicked results wins
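The draft-and-credit procedure above can be sketched in Python (a minimal illustration, not the authors' implementation; document IDs and the coin toss are placeholders):

```python
import random

def team_draft_interleave(ranking_a, ranking_b, rng=random):
    """Team draft: each round a coin toss decides which ranking picks first;
    each side then appends its best remaining (not yet picked) document."""
    combined, team = [], {}          # team maps doc -> 'A' or 'B'
    rankings = {'A': ranking_a, 'B': ranking_b}
    while any(d not in team for r in rankings.values() for d in r):
        order = ('A', 'B') if rng.random() < 0.5 else ('B', 'A')
        for name in order:
            remaining = [d for d in rankings[name] if d not in team]
            if remaining:
                combined.append(remaining[0])
                team[remaining[0]] = name
    return combined, team

def credit(team, clicked):
    """Credit assignment: the ranking providing more of the clicked results wins."""
    a = sum(1 for d in clicked if team.get(d) == 'A')
    b = sum(1 for d in clicked if team.get(d) == 'B')
    return 'A' if a > b else 'B' if b > a else 'tie'
```

Given the observed clicks, `credit` declares the winning ranker for the session.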
Team Draft Interleaving

Ranking A
1. Napa Valley – The authority for lodging... www.napavalley.com
2. Napa Valley Wineries – Plan your wine... www.napavalley.com/wineries
3. Napa Valley College www.napavalley.edu/homex.asp
4. Been There | Tips | Napa Valley www.ivebeenthere.co.uk/tips/16681
5. Napa Valley Wineries and Wine www.napavintners.com
6. Napa Country, California – Wikipedia en.wikipedia.org/wiki/Napa_Valley

Ranking B
1. Napa Country, California – Wikipedia en.wikipedia.org/wiki/Napa_Valley
2. Napa Valley – The authority for lodging... www.napavalley.com
3. Napa: The Story of an American Eden... books.google.co.uk/books?isbn=...
4. Napa Valley Hotels – Bed and Breakfast... www.napalinks.com
5. NapaValley.org www.napavalley.org
6. The Napa Valley Marathon www.napavalleymarathon.org

Presented Ranking
1. Napa Valley – The authority for lodging... www.napavalley.com
2. Napa Country, California – Wikipedia en.wikipedia.org/wiki/Napa_Valley
3. Napa: The Story of an American Eden... books.google.co.uk/books?isbn=...
4. Napa Valley Wineries – Plan your wine... www.napavalley.com/wineries
5. Napa Valley Hotels – Bed and Breakfast... www.napalinks.com
6. Napa Valley College www.napavalley.edu/homex.asp
7. NapaValley.org www.napavalley.org
Team Draft Interleaving (same example, with user clicks)
Credit goes to the ranking that provided more of the clicked results: B wins!
Offline Evaluation
• Controlled laboratory experiments
• The user’s interaction with the engine is only simulated
– Ask experts to judge each query result
– Predict how users behave when they search
– Aggregate judgments to evaluate
Offline Evaluation
• Ask experts to judge each query result
• Predict how users behave when they search
• Aggregate judgments to evaluate
[Diagram: documents are judged; a user model turns the judgments into an evaluation score]
Online vs. Offline Evaluation
| | Online | Offline |
| --- | --- | --- |
| Pros | Cheap; measures actual user reactions | Fast to evaluate; easy to try new ideas; portable |
| Cons | Need to go live; noisy; slow; not duplicable | Needs ground truth; slow to obtain judgments (“expensive”); “inconsistent”; difficult to model how users behave |
Outline
• Intro to evaluation
– Different approaches to evaluation
– Traditional evaluation measures
• User model based evaluation measures
• Session Evaluation
• Novelty and Diversity
Traditional Experiment
[Diagram: search engines return results; judges assess them, asking “How many good docs have I missed/found?”]
Depth-k Pooling
[Diagram: ranked lists of systems sys1, sys2, sys3, …, sysM down to depth k; the documents in the top k of every list (A, B, C, D, …) form the pool of documents to judge]
Depth-k Pooling
[Diagram: the same ranked lists; the pooled documents are handed to the judges]
Depth-k Pooling
[Diagram: the pooled documents now carry judgments, relevant (R) or nonrelevant (N); some documents remain unjudged (“?”)]
Depth-k Pooling
[Diagram: the same lists with every document labeled; previously unjudged documents are treated as nonrelevant (N)]
Reusable Test Collections
• Document Corpus
• Topics
• Relevance Judgments
[Diagram: relevance judgments organized per topic: Topic 1, Topic 2, …, Topic N]
Evaluation Metrics: Precision vs Recall
Retrieved list (ranks 1–10): R, N, R, N, N, R, N, N, N, R, …
Visualizing Retrieval Performance: Precision-Recall Curves
List (ranks 1–10): R, N, R, N, N, R, N, N, N, R
Evaluation Metrics: Average Precision
List (ranks 1–10): R, N, R, N, N, R, N, N, N, R
Outline
• Intro to evaluation
– Different approaches to evaluation
– Traditional evaluation measures
• User model based evaluation measures
• Session Evaluation
• Novelty and Diversity
User models Behind Traditional Metrics
• Precision@k
– Users always look at top k documents
– What fraction of the top k documents are relevant?
• Recall
– Users would like to find all the relevant documents.
– What fraction of these documents have been retrieved by the search engine?
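Both definitions translate directly into code (a minimal sketch with binary relevance, 1 = relevant):

```python
def precision_at_k(rels, k):
    """Fraction of the top k documents that are relevant."""
    return sum(rels[:k]) / k

def recall_at_k(rels, k, total_relevant):
    """Fraction of all relevant documents retrieved in the top k."""
    return sum(rels[:k]) / total_relevant
```

On the earlier example list (R, N, R, N, N, R, N, N, N, R), `precision_at_k([1,0,1,0,0,1,0,0,0,1], 5)` is 2/5; recall additionally needs the collection-wide count of relevant documents.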
User Model of Average Precision (Robertson ‘08)
1. User steps down a ranked list one-by-one
2. Stops browsing documents due to satisfaction
– stops with a certain probability after observing a relevant document
3. Gains utility from each relevant document
User Model of Average Precision (Robertson ‘08)
• The probability that the user stops browsing is uniform over all the relevant documents
• The utility a user gains when stopping at a relevant document at rank n is the precision at rank n
• AP can be written as:
U(n) = (1/n) · Σ_{k=1}^{n} rel(k)      (precision at rank n)

P(n) = 1/R if the document at rank n is relevant, 0 otherwise

AP = Σ_{n} P(n) · U(n)
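Under this model AP is an expected utility, which is the usual sum of precisions at relevant ranks; a small sketch over a binary relevance list:

```python
def average_precision(rels, total_relevant=None):
    """AP = sum over relevant ranks n of P(n) * U(n), with P(n) = 1/R
    and U(n) = precision at rank n."""
    R = total_relevant if total_relevant is not None else sum(rels)
    ap, found = 0.0, 0
    for n, rel in enumerate(rels, start=1):
        if rel:
            found += 1
            ap += found / n        # precision at rank n
    return ap / R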
![Page 28: Advances on the Development of Evaluation Measures](https://reader031.vdocuments.net/reader031/viewer/2022020621/61e74dde4f6a846e2a4bf5fa/html5/thumbnails/28.jpg)
User Model Based Evaluation Measures
• Directly aim at evaluating user satisfaction
– An effectiveness measure should be correlated to the user’s experience
• Thus interest in effectiveness measures based on explicit models of user interaction
– Devise a user model correlated with user behavior
– Infer an evaluation metric from the user model
![Page 29: Advances on the Development of Evaluation Measures](https://reader031.vdocuments.net/reader031/viewer/2022020621/61e74dde4f6a846e2a4bf5fa/html5/thumbnails/29.jpg)
Basic User Model
• Simple model of user interaction:
1. User steps down ranked results one-by-one
2. Stops at a document at rank k with some probability P(k)
3. Gains some utility U(k) from relevant documents
M U(k)P(k)k1
![Page 30: Advances on the Development of Evaluation Measures](https://reader031.vdocuments.net/reader031/viewer/2022020621/61e74dde4f6a846e2a4bf5fa/html5/thumbnails/30.jpg)
Basic User Model
1. Discount: What is the chance a user will visit a document?
– Model of the browsing behavior
2. Utility: What does the user gain by visiting a document?
![Page 31: Advances on the Development of Evaluation Measures](https://reader031.vdocuments.net/reader031/viewer/2022020621/61e74dde4f6a846e2a4bf5fa/html5/thumbnails/31.jpg)
Model Browsing Behavior
Position-based models
The chance of observing a document depends on the position
it is presented in the ranked list.
black powder
ammunition
1
2
3
4
5
6
7
8
9
10
…
![Page 32: Advances on the Development of Evaluation Measures](https://reader031.vdocuments.net/reader031/viewer/2022020621/61e74dde4f6a846e2a4bf5fa/html5/thumbnails/32.jpg)
Rank Biased Precision black powder
ammunition
1
2
3
4
5
6
7
8
9
10
…
Query
Stop View Next
Item
![Page 33: Advances on the Development of Evaluation Measures](https://reader031.vdocuments.net/reader031/viewer/2022020621/61e74dde4f6a846e2a4bf5fa/html5/thumbnails/33.jpg)
Rank Biased Precision black powder
ammunition
1
2
3
4
5
6
7
8
9
10
…
)-(1= RBP1=i
1
i
irel
![Page 34: Advances on the Development of Evaluation Measures](https://reader031.vdocuments.net/reader031/viewer/2022020621/61e74dde4f6a846e2a4bf5fa/html5/thumbnails/34.jpg)
Discounted Cumulative Gain black powder
ammunition
1
2
3
4
5
6
7
8
9
10
…
HR
R
N
N
HR
R
N
R
N
N
…
Discounted Gain
3
0.63
0
0
1.14
0.35
0
0.31
0
0
1/log2(r+1)
Relevance Score
2
1
0
0
2
1
0
1
0
0
Discount by rank
Relevance Gain
3
1
0
0
3
1
0
1
0
0
12 rrel
5.46
![Page 35: Advances on the Development of Evaluation Measures](https://reader031.vdocuments.net/reader031/viewer/2022020621/61e74dde4f6a846e2a4bf5fa/html5/thumbnails/35.jpg)
Discounted Cumulative Gain
• DCG can be written as:
• Discount function models the probability that the user visits (clicks on) the document at rank r
– Currently, P(user clicks on doc r) = 1/log2(r+1)
r
N
r
UtilityrdocvisitsuserP
) (1
![Page 36: Advances on the Development of Evaluation Measures](https://reader031.vdocuments.net/reader031/viewer/2022020621/61e74dde4f6a846e2a4bf5fa/html5/thumbnails/36.jpg)
Discounted Cumulative Gain
• Instead of stopping probability, think about viewing probability
• This fits in discounted gain model framework:
![Page 37: Advances on the Development of Evaluation Measures](https://reader031.vdocuments.net/reader031/viewer/2022020621/61e74dde4f6a846e2a4bf5fa/html5/thumbnails/37.jpg)
Normalised Discounted Cumulative Gain
black powder
ammunition
1
2
3
4
5
6
7
8
9
10
…
HR
R
N
N
HR
R
N
R
N
N
…
Discounted Gain
3
0.63
0
0
1.14
0.35
0
0.31
0
0
1/log2(r+1)
Relevance Score
2
1
0
0
2
1
0
1
0
0
Discount by rank
Relevance Gain
3
1
0
0
3
1
0
1
0
0
12 rrel
optDCG
DCGNDCG
![Page 38: Advances on the Development of Evaluation Measures](https://reader031.vdocuments.net/reader031/viewer/2022020621/61e74dde4f6a846e2a4bf5fa/html5/thumbnails/38.jpg)
Model Browsing Behavior
Cascade-based models
black powder
ammunition
1
2
3
4
5
6
7
8
9
10
…
• The user views search results from top to bottom
• At each rank i, the user has a certain probability of being satisfied. • Probability of satisfaction proportional to the
relevance grade of the document at rank i.
• Once the user is satisfied with a document, he terminates the search.
![Page 39: Advances on the Development of Evaluation Measures](https://reader031.vdocuments.net/reader031/viewer/2022020621/61e74dde4f6a846e2a4bf5fa/html5/thumbnails/39.jpg)
Rank Biased Precision
Query
Stop
View Next Item
black powder
ammunition
1
2
3
4
5
6
7
8
9
10
…
![Page 40: Advances on the Development of Evaluation Measures](https://reader031.vdocuments.net/reader031/viewer/2022020621/61e74dde4f6a846e2a4bf5fa/html5/thumbnails/40.jpg)
Expected Reciprocal Rank [Chapelle et al CIKM09]
Query
Stop
Relevant?
View Next Item
no somewhat highly
black powder
ammunition
1
2
3
4
5
6
7
8
9
10
…
![Page 41: Advances on the Development of Evaluation Measures](https://reader031.vdocuments.net/reader031/viewer/2022020621/61e74dde4f6a846e2a4bf5fa/html5/thumbnails/41.jpg)
Expected Reciprocal Rank [Chapelle et al CIKM09]
black powder
ammunition
rrank at
document"perfect the" finding of Utility :(r)
1/r (r)
)position at stopsuser (1
1
rPr
ERRn
r
1
11
)1(1 r
i
ri
n
r
RRr
ERR
document r theof grade relevance : th
rg
) positionat stopsP(user 2
12 doc of relevance of Prob.
maxrRr
g
g
r
r
1
2
3
4
5
6
7
8
9
10
…
![Page 42: Advances on the Development of Evaluation Measures](https://reader031.vdocuments.net/reader031/viewer/2022020621/61e74dde4f6a846e2a4bf5fa/html5/thumbnails/42.jpg)
Metrics derived from Query Logs
• Use the query logs to understand how users behave
• Learn the parameters of the user model from the query logs
– Utility, discount, etc.
![Page 43: Advances on the Development of Evaluation Measures](https://reader031.vdocuments.net/reader031/viewer/2022020621/61e74dde4f6a846e2a4bf5fa/html5/thumbnails/43.jpg)
Metrics derived from Query Logs
• Users tend to stop search if they are satisfied or frustrated
• P(observe a doc at rank r) highly affected by snippet quality
Relevance P(C|R)
Bad 0.50
Fair 0.49
Good 0.45
Excellent 0.59
Perfect 0.79
Relevance P(Stop|R)
Bad 0.49
Fair 0.41
Good 0.37
Excellent 0.53
Perfect 0.76
![Page 44: Advances on the Development of Evaluation Measures](https://reader031.vdocuments.net/reader031/viewer/2022020621/61e74dde4f6a846e2a4bf5fa/html5/thumbnails/44.jpg)
Metrics derived from Query Logs
• Users behave differently for different queries
– Informational queries
– Navigational queries
Navigational Informational
P(C|R) P(Stop|R) P(C|R) P(Stop|R)
Bad 0.632 0.587 0.516 0.431
Fair 0.569 0.523 0.455 0.357
Good 0.526 0.483 0.442 0.349
Excellent 0.700 0.669 0.533 0.458
Perfect 0.809 0.786 0.557 0.502
![Page 45: Advances on the Development of Evaluation Measures](https://reader031.vdocuments.net/reader031/viewer/2022020621/61e74dde4f6a846e2a4bf5fa/html5/thumbnails/45.jpg)
DEBU (r) P(Er ) P(C | Rr )
EBU DEBU (r)r 1
n
Rr
Expected Browsing Utility (Yilmaz et al. CIKM’10)
![Page 46: Advances on the Development of Evaluation Measures](https://reader031.vdocuments.net/reader031/viewer/2022020621/61e74dde4f6a846e2a4bf5fa/html5/thumbnails/46.jpg)
Basic User Model
1. Discount: What is the chance a user will visit a document?
– Model of the browsing behavior
2. Utility: What does the user gain by visiting a document?
– Mostly ad-hoc, no clear user model
![Page 47: Advances on the Development of Evaluation Measures](https://reader031.vdocuments.net/reader031/viewer/2022020621/61e74dde4f6a846e2a4bf5fa/html5/thumbnails/47.jpg)
Graded Average Precision (Robertson et al. SIGIR’10)
• One document is more useful than another • One possible meaning:
– one document is useful to more users than another
• Hence the following: – assume grades of relevance... ... but that user has a threshold relevance grade
which defines a binary view
– different users have different thresholds
described by a probability distribution over users
![Page 48: Advances on the Development of Evaluation Measures](https://reader031.vdocuments.net/reader031/viewer/2022020621/61e74dde4f6a846e2a4bf5fa/html5/thumbnails/48.jpg)
Graded Average Precision [Robertson et al. SIGIR10]
• User has binary view of relevance
– by thresholding the relevance scale
Considered relevant with probability g1
Irrelevant
Relevant
Highly Relevant
Relevance Scale
![Page 49: Advances on the Development of Evaluation Measures](https://reader031.vdocuments.net/reader031/viewer/2022020621/61e74dde4f6a846e2a4bf5fa/html5/thumbnails/49.jpg)
Graded Average Precision [Robertson et al. SIGIR10]
• User has binary view of relevance
– by thresholding the relevance scale
Irrelevant
Relevant
Highly Relevant
Relevance Scale
Considered relevant with probability g2
![Page 50: Advances on the Development of Evaluation Measures](https://reader031.vdocuments.net/reader031/viewer/2022020621/61e74dde4f6a846e2a4bf5fa/html5/thumbnails/50.jpg)
Graded Average Precision
• Assume relevance grades {0...c} – 0 for non-relevant, + c positive grades
• gi = P(user threshold is at i) for i ∈ {1...c} – i.e. user regards grades {i...c} as relevant, grades {0...(i-1)}
as not relevant – gis sum to one
• Step down the ranked list, stopping at documents that may be relevant – then calculate expected precision at each of these
(expected over the population of users)
![Page 51: Advances on the Development of Evaluation Measures](https://reader031.vdocuments.net/reader031/viewer/2022020621/61e74dde4f6a846e2a4bf5fa/html5/thumbnails/51.jpg)
Graded Average Precision (GAP)
1 HR
2 R
3 N
4 N
5 R
6 HR
7 R
Relevance
![Page 52: Advances on the Development of Evaluation Measures](https://reader031.vdocuments.net/reader031/viewer/2022020621/61e74dde4f6a846e2a4bf5fa/html5/thumbnails/52.jpg)
1 HR
2 R
3 N
4 N
5 R
6 HR
7 R
Relevance
1 Rel
2 Rel
3 N
4 N
5 Rel
6 Rel
7 Rel
with prob.
g1
6
4prec6
Graded Average Precision (GAP)
![Page 53: Advances on the Development of Evaluation Measures](https://reader031.vdocuments.net/reader031/viewer/2022020621/61e74dde4f6a846e2a4bf5fa/html5/thumbnails/53.jpg)
1 HR
2 R
3 N
4 N
5 R
6 HR
7 R
Relevance
1 Rel
2 N
3 N
4 N
5 N
6 Rel
7 N
with prob.
g2
6
2prec6
Graded Average Precision (GAP)
![Page 54: Advances on the Development of Evaluation Measures](https://reader031.vdocuments.net/reader031/viewer/2022020621/61e74dde4f6a846e2a4bf5fa/html5/thumbnails/54.jpg)
1 HR
2 R
3 NR
4 NR
5 R
6 HR
7 R
Relevance
wprec6 4
6 g1
2
6 g2
Graded Average Precision (GAP)
![Page 55: Advances on the Development of Evaluation Measures](https://reader031.vdocuments.net/reader031/viewer/2022020621/61e74dde4f6a846e2a4bf5fa/html5/thumbnails/55.jpg)
Probability Models
• Almost all the measures we’ve discussed are based on probabilistic models of users
– Most have one or more parameters representing something about user behavior
– Is there a way to incorporate variability in the user population?
• How do we estimate parameter values?
– Is a single point estimate good enough?
![Page 56: Advances on the Development of Evaluation Measures](https://reader031.vdocuments.net/reader031/viewer/2022020621/61e74dde4f6a846e2a4bf5fa/html5/thumbnails/56.jpg)
Choosing Parameter Values
• Parameter θ models a user – Higher θ more patience, more results viewed
– Lower θ less patience, fewer results viewed
• Different approaches: – Minimize variance in evaluation (Kanoulas & Aslam,
CIKM ‘09)
– Use click log; fit a model to gaps between clicks (Zhang et al., IRJ, 2010)
– All try to infer a single value for the parameters
![Page 57: Advances on the Development of Evaluation Measures](https://reader031.vdocuments.net/reader031/viewer/2022020621/61e74dde4f6a846e2a4bf5fa/html5/thumbnails/57.jpg)
Distribution of “Patience” for RBP
• Form a distribution P(θ)
• Sampling from P(θ) is like sampling a “user” defined by their patience
• How can we form a proper distribution of θ?
• Idea: mine logged search engine user data – Look at ranks users are clicking
– Estimate patience based on absence or presence of clicks
![Page 58: Advances on the Development of Evaluation Measures](https://reader031.vdocuments.net/reader031/viewer/2022020621/61e74dde4f6a846e2a4bf5fa/html5/thumbnails/58.jpg)
Modeling Patience from Log Data
• We will assume we have a flat prior θ that we want to update using log data L
• Decompose L into individual search sessions
– For each session q, count:
• cq, the total number of clicks
• rq, the total number of no-clicks
– Model cq with a negative binomial distribution conditional on rq and θ:
![Page 59: Advances on the Development of Evaluation Measures](https://reader031.vdocuments.net/reader031/viewer/2022020621/61e74dde4f6a846e2a4bf5fa/html5/thumbnails/59.jpg)
Modeling Patience from Log Data
• Marginalize P(θ|L) over r:
• Apply Bayes’ rule to P(θ | r, L):
• P(L | θ, r) is the likelihood of the observed clicks
![Page 60: Advances on the Development of Evaluation Measures](https://reader031.vdocuments.net/reader031/viewer/2022020621/61e74dde4f6a846e2a4bf5fa/html5/thumbnails/60.jpg)
Complete Model Expression
• Model components result in three equations to estimate P(θ | L)
![Page 61: Advances on the Development of Evaluation Measures](https://reader031.vdocuments.net/reader031/viewer/2022020621/61e74dde4f6a846e2a4bf5fa/html5/thumbnails/61.jpg)
Empirical Patience Profiles: Navigational Queries
![Page 62: Advances on the Development of Evaluation Measures](https://reader031.vdocuments.net/reader031/viewer/2022020621/61e74dde4f6a846e2a4bf5fa/html5/thumbnails/62.jpg)
Empirical Patience Profiles: Informational Queries
![Page 63: Advances on the Development of Evaluation Measures](https://reader031.vdocuments.net/reader031/viewer/2022020621/61e74dde4f6a846e2a4bf5fa/html5/thumbnails/63.jpg)
Extend to ERR Parameters
![Page 64: Advances on the Development of Evaluation Measures](https://reader031.vdocuments.net/reader031/viewer/2022020621/61e74dde4f6a846e2a4bf5fa/html5/thumbnails/64.jpg)
Evaluation Using Parameter Distributions
• Monte Carlo procedure:
– Sample a parameter value from P(θ | L)
• Or a vector of values for ERR
– Compute the measure with the sampled value
– Iterate to form distribution P(RBP) or P(ERR)
![Page 65: Advances on the Development of Evaluation Measures](https://reader031.vdocuments.net/reader031/viewer/2022020621/61e74dde4f6a846e2a4bf5fa/html5/thumbnails/65.jpg)
Marginal Distribution Analysis
• S1=[R N N N N N N N N N]
• S2=[N R R R R R R R R R]
![Page 66: Advances on the Development of Evaluation Measures](https://reader031.vdocuments.net/reader031/viewer/2022020621/61e74dde4f6a846e2a4bf5fa/html5/thumbnails/66.jpg)
Distribution of RBP
![Page 67: Advances on the Development of Evaluation Measures](https://reader031.vdocuments.net/reader031/viewer/2022020621/61e74dde4f6a846e2a4bf5fa/html5/thumbnails/67.jpg)
Distribution of ERR
![Page 68: Advances on the Development of Evaluation Measures](https://reader031.vdocuments.net/reader031/viewer/2022020621/61e74dde4f6a846e2a4bf5fa/html5/thumbnails/68.jpg)
Marginal Distribution Analysis
• Given two systems, over all choices of θ
– What is P(M1 > M2)?
– What is P((M1 - M2)>t)?
![Page 69: Advances on the Development of Evaluation Measures](https://reader031.vdocuments.net/reader031/viewer/2022020621/61e74dde4f6a846e2a4bf5fa/html5/thumbnails/69.jpg)
Marginal Distribution Analysis
![Page 70: Advances on the Development of Evaluation Measures](https://reader031.vdocuments.net/reader031/viewer/2022020621/61e74dde4f6a846e2a4bf5fa/html5/thumbnails/70.jpg)
Outline
• Intro to evaluation
– Different approaches to evaluation
– Traditional evaluation measures
• User model based evaluation measures
• Session Evaluation
• Novelty and Diversity
71
![Page 71: Advances on the Development of Evaluation Measures](https://reader031.vdocuments.net/reader031/viewer/2022020621/61e74dde4f6a846e2a4bf5fa/html5/thumbnails/71.jpg)
Why sessions?
• Current evaluation framework – Assesses the effectiveness of systems over one-
shot queries
• Users reformulate their initial query
• Still fine if … – optimizing system for one-shot queries led to
optimal performance over an entire session
![Page 72: Advances on the Development of Evaluation Measures](https://reader031.vdocuments.net/reader031/viewer/2022020621/61e74dde4f6a846e2a4bf5fa/html5/thumbnails/72.jpg)
When was the DuPont Science Essay Contest created?
Initial Query : DuPont Science Essay Contest
Reformulation : When was the DSEC created?
• e.g. retrieval systems should accumulate information along a session
Why sessions?
![Page 73: Advances on the Development of Evaluation Measures](https://reader031.vdocuments.net/reader031/viewer/2022020621/61e74dde4f6a846e2a4bf5fa/html5/thumbnails/73.jpg)
Paris Luxurious Hotels Paris Hilton J Lo Paris
![Page 74: Advances on the Development of Evaluation Measures](https://reader031.vdocuments.net/reader031/viewer/2022020621/61e74dde4f6a846e2a4bf5fa/html5/thumbnails/74.jpg)
Extend the evaluation framework
From one query evaluation
To multi-query sessions evaluation
![Page 75: Advances on the Development of Evaluation Measures](https://reader031.vdocuments.net/reader031/viewer/2022020621/61e74dde4f6a846e2a4bf5fa/html5/thumbnails/75.jpg)
Construct appropriate test collections
Rethink of evaluation measures
![Page 76: Advances on the Development of Evaluation Measures](https://reader031.vdocuments.net/reader031/viewer/2022020621/61e74dde4f6a846e2a4bf5fa/html5/thumbnails/76.jpg)
• A set of information needs A friend from Kenya is visiting you and you'd like to surprise him with
by cooking a traditional swahili dish. You would like to search online to decide which dish you will cook at home.
– A static sequence of m queries
Basic test collection
Initial Query : 1st Reformulation : 2nd Reformulation : … (m-1)th Reformulation :
kenya cooking traditional
kenya cooking traditional swahili
www.allrecipes.com
…
kenya swahili traditional food recipes
![Page 77: Advances on the Development of Evaluation Measures](https://reader031.vdocuments.net/reader031/viewer/2022020621/61e74dde4f6a846e2a4bf5fa/html5/thumbnails/77.jpg)
Basic Test Collection
Factual/Amorphous, Known-item search
Intellectual/Amorphous, Explanatory search
Factual/Amorphous, Known-item search
![Page 78: Advances on the Development of Evaluation Measures](https://reader031.vdocuments.net/reader031/viewer/2022020621/61e74dde4f6a846e2a4bf5fa/html5/thumbnails/78.jpg)
Experiment
kenya cooking
traditional swahili
kenya swahili
traditional food
recipes
kenya cooking
traditional
1
2
3
4
5
6
7
8
9
10
…
![Page 79: Advances on the Development of Evaluation Measures](https://reader031.vdocuments.net/reader031/viewer/2022020621/61e74dde4f6a846e2a4bf5fa/html5/thumbnails/79.jpg)
Experiment
1
2
3
4
5
6
7
8
9
10
…
kenya cooking
traditional swahili
kenya cooking
traditional
kenya swahili
traditional food
recipes
![Page 80: Advances on the Development of Evaluation Measures](https://reader031.vdocuments.net/reader031/viewer/2022020621/61e74dde4f6a846e2a4bf5fa/html5/thumbnails/80.jpg)
Construct appropriate test collections
Rethink of evaluation measures
![Page 81: Advances on the Development of Evaluation Measures](https://reader031.vdocuments.net/reader031/viewer/2022020621/61e74dde4f6a846e2a4bf5fa/html5/thumbnails/81.jpg)
What is a good system?
![Page 82: Advances on the Development of Evaluation Measures](https://reader031.vdocuments.net/reader031/viewer/2022020621/61e74dde4f6a846e2a4bf5fa/html5/thumbnails/82.jpg)
How can we measure “goodness”?
![Page 83: Advances on the Development of Evaluation Measures](https://reader031.vdocuments.net/reader031/viewer/2022020621/61e74dde4f6a846e2a4bf5fa/html5/thumbnails/83.jpg)
Measuring “goodness”
The user steps down a ranked list of documents and
observes each one of them until a decision point and either
a) abandons the search, or
b) reformulates
While stepping down or sideways, the user accumulates utility
![Page 84: Advances on the Development of Evaluation Measures](https://reader031.vdocuments.net/reader031/viewer/2022020621/61e74dde4f6a846e2a4bf5fa/html5/thumbnails/84.jpg)
What are the challenges?
![Page 85: Advances on the Development of Evaluation Measures](https://reader031.vdocuments.net/reader031/viewer/2022020621/61e74dde4f6a846e2a4bf5fa/html5/thumbnails/85.jpg)
Evaluation over a single ranked list
[Figure: the session's queries ("kenya cooking traditional swahili", "kenya cooking traditional", "kenya swahili traditional food recipes") evaluated as a single ranked list of results (ranks 1-10, …)]
![Page 86: Advances on the Development of Evaluation Measures](https://reader031.vdocuments.net/reader031/viewer/2022020621/61e74dde4f6a846e2a4bf5fa/html5/thumbnails/86.jpg)
![Page 87: Advances on the Development of Evaluation Measures](https://reader031.vdocuments.net/reader031/viewer/2022020621/61e74dde4f6a846e2a4bf5fa/html5/thumbnails/87.jpg)
Session DCG [Järvelin et al ECIR 2008]
For a session with two queries, "kenya cooking traditional swahili" (RL1) and "kenya cooking traditional" (RL2):

DCG(RL_q) = \sum_{r=1}^{k} \frac{2^{rel(r)} - 1}{\log_b(r + b - 1)}

sDCG = \frac{1}{\log_c(1 + c - 1)}\, DCG(RL_1) + \frac{1}{\log_c(2 + c - 1)}\, DCG(RL_2)
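A minimal Python sketch of Session DCG; the function names and the b = 2, c = 4 defaults are illustrative choices, not necessarily the paper's exact parameterization:

```python
import math

def dcg(gains, b=2):
    # Per-query DCG: sum of (2^rel(r) - 1) / log_b(r + b - 1) over ranks r = 1..k.
    return sum((2 ** g - 1) / math.log(r + b - 1, b)
               for r, g in enumerate(gains, start=1))

def sdcg(session, b=2, c=4):
    # Session DCG: each query's DCG is discounted by 1 / log_c(q + c - 1),
    # where q is the query's 1-indexed position in the session.
    return sum(dcg(gains, b) / math.log(q + c - 1, c)
               for q, gains in enumerate(session, start=1))
```

Here `session` is a list of per-query relevance-grade lists; note that the first query (q = 1) and the first rank (r = 1) receive no discount, since log_c(c) = log_b(b) = 1.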
![Page 88: Advances on the Development of Evaluation Measures](https://reader031.vdocuments.net/reader031/viewer/2022020621/61e74dde4f6a846e2a4bf5fa/html5/thumbnails/88.jpg)
Session Metrics
• Session DCG [Järvelin et al ECIR 2008]
The user steps down the ranked list until rank k and reformulates [Deterministic; no early abandonment]
![Page 89: Advances on the Development of Evaluation Measures](https://reader031.vdocuments.net/reader031/viewer/2022020621/61e74dde4f6a846e2a4bf5fa/html5/thumbnails/89.jpg)
Model-based measures
Probabilistic space of users following
different paths
• Ω is the space of all paths
• P(ω) is the prob of a user following a path ω in Ω
• U(ω) is the utility of path ω in Ω
[Yang and Lad ICTIR 2009]
EGU = \sum_{\omega \in \Omega} P(\omega)\, U(\omega)
![Page 90: Advances on the Development of Evaluation Measures](https://reader031.vdocuments.net/reader031/viewer/2022020621/61e74dde4f6a846e2a4bf5fa/html5/thumbnails/90.jpg)
Expected Global Utility [Yang and Lad ICTIR 2009]
1. User steps down ranked results one-by-one
2. Stops browsing documents based on a stochastic process that defines a stopping probability distribution over ranks and reformulates
3. Gains something from relevant documents, accumulating utility
![Page 91: Advances on the Development of Evaluation Measures](https://reader031.vdocuments.net/reader031/viewer/2022020621/61e74dde4f6a846e2a4bf5fa/html5/thumbnails/91.jpg)
Expected Global Utility [Yang and Lad ICTIR 2009]
• The probability of a user following a path ω:
P(ω) = P(r1, r2, ..., rK)
ri is the stopping and reformulation point in list i
– Assumption: stopping positions in each list are independent
P(r1, r2, ..., rK) = P(r1)P(r2)...P(rK)
– Use geometric distribution (RBP) to model the stopping and reformulation behaviour
P(r_i = r) = (1 - p)\, p^{\,r-1}
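The geometric (RBP-style) stopping model and the independence assumption can be sketched as follows; the parameter name `p` and function names are illustrative:

```python
def stop_prob(r, p):
    # RBP-style geometric stopping: the user moves to the next rank with
    # probability p, so stops (and reformulates) at rank r with
    # probability (1 - p) * p^(r - 1).
    return (1 - p) * p ** (r - 1)

def path_prob(stops, p):
    # Independence assumption: stopping positions r_1..r_K in the session's
    # K ranked lists are independent, so the path probability factorizes.
    prob = 1.0
    for r in stops:
        prob *= stop_prob(r, p)
    return prob
```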
![Page 92: Advances on the Development of Evaluation Measures](https://reader031.vdocuments.net/reader031/viewer/2022020621/61e74dde4f6a846e2a4bf5fa/html5/thumbnails/92.jpg)
Expected Global Utility

[Figure: a session of three ranked lists Q1, Q2, Q3 (Q1: all non-relevant; Q2: relevant at ranks 1-5, non-relevant below; Q3: all relevant); the stopping rank in each list is geometric with parameter θ.]
![Page 93: Advances on the Development of Evaluation Measures](https://reader031.vdocuments.net/reader031/viewer/2022020621/61e74dde4f6a846e2a4bf5fa/html5/thumbnails/93.jpg)
Session Metrics
• Session DCG [Järvelin et al ECIR 2008]
The user steps down the ranked list until rank k and reformulates [Deterministic; no early abandonment]
• Expected global utility [Yang and Lad ICTIR 2009]
The user steps down a ranked list of documents until a decision point and reformulates [Stochastic; no early abandonment]
![Page 94: Advances on the Development of Evaluation Measures](https://reader031.vdocuments.net/reader031/viewer/2022020621/61e74dde4f6a846e2a4bf5fa/html5/thumbnails/94.jpg)
Model-based measures
Probabilistic space of users following
different paths
• Ω is the space of all paths
• P(ω) is the prob of a user following a path ω in Ω
• Mω is a measure over a path ω
[Kanoulas et al. SIGIR 2011]
esM = \sum_{\omega \in \Omega} P(\omega)\, M_\omega
![Page 95: Advances on the Development of Evaluation Measures](https://reader031.vdocuments.net/reader031/viewer/2022020621/61e74dde4f6a846e2a4bf5fa/html5/thumbnails/95.jpg)
Probability of a path
P(path) = P(abandoning at reformulation 2) × P(reformulating at rank 3)

[Figure: the session's ranked lists Q1, Q2, Q3; term (1) marks the abandonment probability, term (2) the reformulation-rank probability.]
![Page 96: Advances on the Development of Evaluation Measures](https://reader031.vdocuments.net/reader031/viewer/2022020621/61e74dde4f6a846e2a4bf5fa/html5/thumbnails/96.jpg)
(1) Probability of abandoning the session at reformulation i: geometric with parameter p_reform

[Figure: the session's ranked lists Q1, Q2, Q3 as before.]
![Page 97: Advances on the Development of Evaluation Measures](https://reader031.vdocuments.net/reader031/viewer/2022020621/61e74dde4f6a846e2a4bf5fa/html5/thumbnails/97.jpg)
(1) Probability of abandoning the session at reformulation i: truncated geometric with parameter p_reform

[Figure: the session's ranked lists Q1, Q2, Q3 as before.]
![Page 98: Advances on the Development of Evaluation Measures](https://reader031.vdocuments.net/reader031/viewer/2022020621/61e74dde4f6a846e2a4bf5fa/html5/thumbnails/98.jpg)
(2) Probability of reformulating at rank j: geometric with parameter p_down (abandonment across reformulations: truncated geometric with parameter p_reform)

[Figure: the session's ranked lists Q1, Q2, Q3 as before.]
![Page 99: Advances on the Development of Evaluation Measures](https://reader031.vdocuments.net/reader031/viewer/2022020621/61e74dde4f6a846e2a4bf5fa/html5/thumbnails/99.jpg)
Session Metrics
• Session DCG [Järvelin et al ECIR 2008]
The user steps down the ranked list until rank k and reformulates [Deterministic; no early abandonment]
• Expected global utility [Yang and Lad ICTIR 2009]
The user steps down a ranked list of documents until a decision point and reformulates [Stochastic; no early abandonment]
• Expected session measures [Kanoulas et al. SIGIR 2011]
The user steps down a ranked list of documents until a decision point and either abandons the query or reformulates [Stochastic; allows early abandonment]
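A simplified sketch of an expected session measure of this form, enumerating paths (abandon at query i, at rank j) with truncated-geometric weights; the decomposition and all names here are illustrative, not the paper's exact estimator:

```python
def truncated_geometric(p, n):
    # Geometric P(i) proportional to (1 - p) * p^(i - 1), renormalized over i = 1..n.
    raw = [(1 - p) * p ** (i - 1) for i in range(1, n + 1)]
    z = sum(raw)
    return [x / z for x in raw]

def expected_session_measure(session, p_reform, p_down, measure):
    # esM = sum over paths of P(path) * M(path); a path (i, j) means the
    # user abandons the session at query i after stepping down to rank j.
    n = len(session)
    p_q = truncated_geometric(p_reform, n)      # which query ends the session
    total = 0.0
    for i, gains in enumerate(session, start=1):
        p_r = truncated_geometric(p_down, len(gains))  # stopping rank in list i
        for j in range(1, len(gains) + 1):
            # Documents on this path: all of queries 1..i-1, then top j of query i.
            path = [g for gs in session[:i - 1] for g in gs] + gains[:j]
            total += p_q[i - 1] * p_r[j - 1] * measure(path)
    return total
```

The `measure` argument is any per-path measure, e.g. precision over the documents the user actually viewed.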
![Page 100: Advances on the Development of Evaluation Measures](https://reader031.vdocuments.net/reader031/viewer/2022020621/61e74dde4f6a846e2a4bf5fa/html5/thumbnails/100.jpg)
Outline
• Intro to evaluation
– Different approaches to evaluation
– Traditional evaluation measures
• User model based evaluation measures
• Session Evaluation
• Novelty and Diversity
101
![Page 101: Advances on the Development of Evaluation Measures](https://reader031.vdocuments.net/reader031/viewer/2022020621/61e74dde4f6a846e2a4bf5fa/html5/thumbnails/101.jpg)
Novelty
• The redundancy problem:
– the first relevant document contains some useful information
– every document with the same information after that is worth less to the user
• but worth the same to traditional evaluation measures
• Novelty retrieval attempts to ensure that ranked results do not have much redundancy
![Page 102: Advances on the Development of Evaluation Measures](https://reader031.vdocuments.net/reader031/viewer/2022020621/61e74dde4f6a846e2a4bf5fa/html5/thumbnails/102.jpg)
Example
• query: "oil-producing nations"
  – members of OPEC
  – North Atlantic nations
  – South American nations
• 10 relevant articles about OPEC probably not as useful as one relevant article about each group
  – And one relevant article about all oil-producing nations might be even better
![Page 103: Advances on the Development of Evaluation Measures](https://reader031.vdocuments.net/reader031/viewer/2022020621/61e74dde4f6a846e2a4bf5fa/html5/thumbnails/103.jpg)
How to Evaluate?
• One approach:
– List subtopics, aspects, or facets of the topic
– Judge each document relevant or not to each possible subtopic
• For oil-producing nations, subtopics could be names of nations
– Saudi Arabia, Russia, Canada, …
![Page 104: Advances on the Development of Evaluation Measures](https://reader031.vdocuments.net/reader031/viewer/2022020621/61e74dde4f6a846e2a4bf5fa/html5/thumbnails/104.jpg)
Subtopic Relevance Example
![Page 105: Advances on the Development of Evaluation Measures](https://reader031.vdocuments.net/reader031/viewer/2022020621/61e74dde4f6a846e2a4bf5fa/html5/thumbnails/105.jpg)
Evaluation Measures
• Subtopic recall and precision (Zhai et al., 2003)
  – Subtopic recall at rank k:
    • Count unique subtopics in top k documents
    • Divide by total number of known unique subtopics
  – Subtopic precision at recall r:
    • Find least k at which subtopic recall r is achieved
    • Find least k at which subtopic recall r could possibly be achieved (by a perfect system)
    • Divide latter by former
  – Models a user that wants all subtopics
    • and doesn't care about redundancy as long as they are seeing new information
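The recall side is easy to sketch in Python; the ideal-ranking rank used in the precision numerator requires a set-cover computation, which is omitted here (names are illustrative):

```python
def subtopic_recall(ranking, n_subtopics, k):
    # ranking: list of sets, the subtopics each document is relevant to.
    # Subtopic recall at k = unique subtopics in top k / all known subtopics.
    covered = set().union(*ranking[:k]) if k > 0 else set()
    return len(covered) / n_subtopics

def min_rank(ranking, n_subtopics, r):
    # Least rank k at which subtopic recall r is reached (None if never).
    # Subtopic precision at r divides this quantity for an ideal ordering
    # (a set-cover computation) by the same quantity for the system.
    for k in range(1, len(ranking) + 1):
        if subtopic_recall(ranking, n_subtopics, k) >= r:
            return k
    return None
```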
![Page 106: Advances on the Development of Evaluation Measures](https://reader031.vdocuments.net/reader031/viewer/2022020621/61e74dde4f6a846e2a4bf5fa/html5/thumbnails/106.jpg)
Subtopic Relevance Evaluation
Copyright © Ben Carterette
![Page 107: Advances on the Development of Evaluation Measures](https://reader031.vdocuments.net/reader031/viewer/2022020621/61e74dde4f6a846e2a4bf5fa/html5/thumbnails/107.jpg)
Diversity
• Short keyword queries are inherently ambiguous
– An automatic system can never know the user’s intent
• Diversification attempts to retrieve results that may be relevant to a space of possible intents
![Page 108: Advances on the Development of Evaluation Measures](https://reader031.vdocuments.net/reader031/viewer/2022020621/61e74dde4f6a846e2a4bf5fa/html5/thumbnails/108.jpg)
Evaluation Measures
• Subtopic recall and precision
– This time with judgments to “intents” rather than subtopics
• Measures that know about intents:
– “Intent-aware” family of measures (Agrawal et al.)
– D, D♯ measures (Sakai et al.)
– α-nDCG (Clarke et al.)
– ERR-IA (Chapelle et al.)
![Page 109: Advances on the Development of Evaluation Measures](https://reader031.vdocuments.net/reader031/viewer/2022020621/61e74dde4f6a846e2a4bf5fa/html5/thumbnails/109.jpg)
Intent-Aware Measures
• Assume there is a probability distribution P(i | Q) over intents for a query Q
– Probability that a randomly-sampled user means intent i when submitting query Q
• The intent-aware version of a measure is its weighted average over this distribution
![Page 110: Advances on the Development of Evaluation Measures](https://reader031.vdocuments.net/reader031/viewer/2022020621/61e74dde4f6a846e2a4bf5fa/html5/thumbnails/110.jpg)
P@10-IA = 0.35*0.3 + 0.35*0.3 + 0.2*0.2 + 0.08*0.1 + 0.02*0.1 = 0.26
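Intent-aware averaging is simply an expectation of the per-intent metric under P(i | Q); a minimal sketch (function name illustrative):

```python
def intent_aware(metric_by_intent, intent_probs):
    # IA version of a measure: weighted average of the per-intent metric
    # values, with weights P(i | Q) summing to 1 over the intents.
    return sum(p * m for p, m in zip(intent_probs, metric_by_intent))
```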
![Page 111: Advances on the Development of Evaluation Measures](https://reader031.vdocuments.net/reader031/viewer/2022020621/61e74dde4f6a846e2a4bf5fa/html5/thumbnails/111.jpg)
D-measure
• Take the idea of intent-awareness and apply it to computing document gain
– The gain for a document is the (weighted) average of its gains for subtopics it is relevant to
• D-nDCG is nDCG computed using intent-aware gains
![Page 112: Advances on the Development of Evaluation Measures](https://reader031.vdocuments.net/reader031/viewer/2022020621/61e74dde4f6a846e2a4bf5fa/html5/thumbnails/112.jpg)
D-DCG = 0.35/log 2 + 0.35/log 3 + …
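The D-measure idea, per-document gains averaged over intents before the usual discount, can be sketched as follows; the log2(rank + 1) discount and all names are assumptions for illustration:

```python
import math

def d_dcg(gains_by_doc, intent_probs):
    # D-DCG sketch: each document's gain is the P(i|Q)-weighted sum of its
    # per-intent gains; the standard DCG log discount is then applied.
    total = 0.0
    for rank, per_intent in enumerate(gains_by_doc, start=1):
        g = sum(p * gi for p, gi in zip(intent_probs, per_intent))
        total += g / math.log2(rank + 1)
    return total
```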
![Page 113: Advances on the Development of Evaluation Measures](https://reader031.vdocuments.net/reader031/viewer/2022020621/61e74dde4f6a846e2a4bf5fa/html5/thumbnails/113.jpg)
α-nDCG
• α-nDCG is a generalization of nDCG that accounts for both novelty and diversity
• α is a geometric penalization for redundancy
– Redefine the gain of a document:
• +1 for each subtopic it is relevant to
• ×(1-α) for each document higher in the ranking that subtopic already appeared in
• Discount is the same as usual
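The α-nDCG gain redefinition can be sketched as follows (function name illustrative); the usual log discount is then applied to these gains:

```python
def alpha_gains(ranking, alpha=0.5):
    # ranking: list of sets, the subtopics each document is relevant to.
    # Each subtopic contributes (1 - alpha)^(number of earlier documents
    # already covering it): +1 the first time, then a geometric penalty.
    seen = {}                      # subtopic -> count of earlier occurrences
    gains = []
    for subtopics in ranking:
        gains.append(sum((1 - alpha) ** seen.get(s, 0) for s in subtopics))
        for s in subtopics:
            seen[s] = seen.get(s, 0) + 1
    return gains
```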
![Page 114: Advances on the Development of Evaluation Measures](https://reader031.vdocuments.net/reader031/viewer/2022020621/61e74dde4f6a846e2a4bf5fa/html5/thumbnails/114.jpg)
[Figure: example per-document gains down a ranking: +1, +1, +1, +1, +(1-α), +1, +(1-α), +(1-α), +(1-α)², +(1-α)²]
![Page 115: Advances on the Development of Evaluation Measures](https://reader031.vdocuments.net/reader031/viewer/2022020621/61e74dde4f6a846e2a4bf5fa/html5/thumbnails/115.jpg)
ERR-IA
• Intent-aware version of ERR
• But it has appealing properties other IA measures do not have:
– ranges between 0 and 1
– submodularity: diminishing returns for relevance to a given subtopic -> built-in redundancy penalization
• Also has appealing properties over α-nDCG:
– Easily handles graded subtopic judgments
– Easily handles intent distributions
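A sketch of ERR and its intent-aware average, assuming the standard ERR stopping probability R = (2^g - 1) / 2^g_max for a grade g (names illustrative):

```python
def err(grades, g_max=1):
    # Expected Reciprocal Rank: the user examines ranks in order, stops at a
    # document with probability R = (2^g - 1) / 2^g_max, and gains 1/rank.
    p_continue, total = 1.0, 0.0
    for rank, g in enumerate(grades, start=1):
        r = (2 ** g - 1) / 2 ** g_max
        total += p_continue * r / rank
        p_continue *= 1 - r
    return total

def err_ia(grades_by_intent, intent_probs, g_max=1):
    # ERR-IA: expectation of per-intent ERR under the intent distribution.
    return sum(p * err(g, g_max) for p, g in zip(intent_probs, grades_by_intent))
```

Because ERR's stopping model discounts documents by how much relevance appeared above them, relevance to an already well-served intent yields diminishing returns, which is the built-in redundancy penalty noted above.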
![Page 116: Advances on the Development of Evaluation Measures](https://reader031.vdocuments.net/reader031/viewer/2022020621/61e74dde4f6a846e2a4bf5fa/html5/thumbnails/116.jpg)
Granularity of Judging
• What exactly is a “subtopic”?
– Perhaps any piece of information a user may be interested in finding?
• At what granularity should subtopics be defined?
– For example:
• “cardinals” has many possible meanings
• “cardinals baseball team” is still very broad
• “cardinals baseball team schedule” covers 6 months
• “cardinals baseball team schedule august” covers ~25 games
• “cardinals baseball team schedule august 12th”
![Page 117: Advances on the Development of Evaluation Measures](https://reader031.vdocuments.net/reader031/viewer/2022020621/61e74dde4f6a846e2a4bf5fa/html5/thumbnails/117.jpg)
Preference Judgments for Novelty
• What about evaluating novelty with no subtopic judgments?
• Preference judgments:
– Is document A more relevant than document B?
• Conditional preference judgments:
– Is document A better than document B given that I’ve just seen document C?
– Assumption: preference is based on novelty over C
• Is it true? Come to our presentation on Wednesday…
![Page 118: Advances on the Development of Evaluation Measures](https://reader031.vdocuments.net/reader031/viewer/2022020621/61e74dde4f6a846e2a4bf5fa/html5/thumbnails/118.jpg)
![Page 119: Advances on the Development of Evaluation Measures](https://reader031.vdocuments.net/reader031/viewer/2022020621/61e74dde4f6a846e2a4bf5fa/html5/thumbnails/119.jpg)
Conclusions
• Strong interest in using evaluation measures to model user behavior and satisfaction
  – Driven by availability of user logs, increased computational power, good abstract models
  – DCG, RBP, ERR, EBU, session measures, diversity measures all model users in different ways
• Cranfield-style evaluation is still important!
• But there is still much to understand about users and how they derive satisfaction
![Page 120: Advances on the Development of Evaluation Measures](https://reader031.vdocuments.net/reader031/viewer/2022020621/61e74dde4f6a846e2a4bf5fa/html5/thumbnails/120.jpg)
Conclusions
• Ongoing and future work:
  – Models with more degrees of freedom
  – Direct simulation of users from start of session to finish
  – Application to other domains
• Thank you!
  – Slides will be available online
  – http://ir.cis.udel.edu/SIGIR12tutorial