Lecture 11: Evaluation Cont.
2013.03.04 – SLIDE 1 – IS 240, Spring 2013
Prof. Ray Larson University of California, Berkeley
School of Information
Principles of Information Retrieval
Lecture 11: Evaluation Cont.
Mini-TREC
• Proposed Schedule
  – February 11 – Database and previous queries
  – March 4 – Report on system acquisition and setup
  – March 6 – New queries for testing
  – April 22 – Results due
  – April 24 – Results and system rankings
  – April 29 – Group reports and discussion
Overview
• Evaluation of IR Systems – Review
• Blair and Maron
• Calculating Precision vs. Recall
• Using TREC_eval
• Theoretical limits of precision and recall
• Measures for very large-scale systems
Overview
• Evaluation of IR Systems – Review
• Blair and Maron
• Calculating Precision vs. Recall
• Using TREC_eval
• Theoretical limits of precision and recall
What to Evaluate?
What can be measured that reflects users' ability to use the system? (Cleverdon 1966)
– Coverage of information
– Form of presentation
– Effort required / ease of use
– Time and space efficiency
– Recall: proportion of relevant material actually retrieved
– Precision: proportion of retrieved material actually relevant
(Recall and precision together characterize retrieval effectiveness.)
Relevant vs. Retrieved
[Venn diagram: the set of Relevant docs and the set of Retrieved docs overlapping within All docs]
Precision vs. Recall
[Venn diagram, as above: precision = |Relevant ∩ Retrieved| / |Retrieved|; recall = |Relevant ∩ Retrieved| / |Relevant|]
Relation to Contingency Table
• Accuracy: (a + d) / (a + b + c + d)
• Precision: a / (a + b)
• Recall: ? (from the table: a / (a + c))
• Why don't we use accuracy for IR?
  – (Assuming a large collection)
  – Most docs aren't relevant
  – Most docs aren't retrieved
  – This inflates the accuracy value

                        Doc is relevant    Doc is NOT relevant
Doc is retrieved               a                   b
Doc is NOT retrieved           c                   d
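The table measures above can be sketched directly in code; the counts below are hypothetical, chosen to show why accuracy misleads in a large collection:

```python
# Contingency-table measures from the slide.
# a = relevant & retrieved, b = non-relevant & retrieved,
# c = relevant & not retrieved, d = non-relevant & not retrieved.

def accuracy(a, b, c, d):
    return (a + d) / (a + b + c + d)

def precision(a, b):
    return a / (a + b)

def recall(a, c):
    return a / (a + c)

# In a large collection most documents are neither relevant nor
# retrieved, so d dominates and accuracy is inflated.
a, b, c, d = 10, 90, 40, 999_860   # hypothetical counts, ~1M-doc collection
print(round(accuracy(a, b, c, d), 4))  # ~0.9999 despite poor retrieval
print(precision(a, b))                 # 0.1
print(recall(a, c))                    # 0.2
```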
The F-Measure
• Another single measure that combines precision and recall:
  F_β = ((β² + 1) · P · R) / (β² · P + R)
• Where P is precision and R is recall
• "Balanced" when β = 1, giving the harmonic mean:
  F₁ = 2PR / (P + R)
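A minimal sketch of the F-measure as defined above (the helper name `f_measure` is mine):

```python
def f_measure(precision, recall, beta=1.0):
    """F_beta = (beta^2 + 1) * P * R / (beta^2 * P + R)."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (b2 + 1) * precision * recall / (b2 * precision + recall)

# Balanced case (beta = 1) reduces to the harmonic mean 2PR/(P+R):
print(f_measure(0.5, 0.5))              # 0.5
print(f_measure(0.25, 0.75))            # 0.375
```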
TREC
• Text REtrieval Conference/Competition
  – Run by NIST (National Institute of Standards & Technology)
  – 2000 was the 9th year; 10th TREC in November
• Collection: 5 gigabytes (5 CD-ROMs), >1.5 million docs
  – Newswire & full-text news (AP, WSJ, Ziff, FT, San Jose Mercury, LA Times)
  – Government documents (Federal Register, Congressional Record)
  – FBIS (Foreign Broadcast Information Service)
  – US patents
Sample TREC queries (topics)
<num> Number: 168
<title> Topic: Financing AMTRAK

<desc> Description: A document will address the role of the Federal Government in financing the operation of the National Railroad Transportation Corporation (AMTRAK).

<narr> Narrative: A relevant document must provide information on the government's responsibility to make AMTRAK an economically viable entity. It could also discuss the privatization of AMTRAK as an alternative to continuing government subsidies. Documents comparing government subsidies given to air and bus transportation with those provided to AMTRAK would also be relevant.
[Slides 12–18: figures only; no transcript text]
TREC Results
• Differ each year
• For the main (ad hoc) track:
  – Best systems not statistically significantly different
  – Small differences sometimes have big effects
    • how good the hyphenation model was
    • how document length was taken into account
  – Systems were optimized for longer queries and all performed worse for shorter, more realistic queries
• Ad hoc track suspended in TREC 9
Overview
• Evaluation of IR Systems – Review
• Blair and Maron
• Calculating Precision vs. Recall
• Using TREC_eval
• Theoretical limits of precision and recall
Blair and Maron 1985
• A classic study of retrieval effectiveness
  – earlier studies used unrealistically small collections
• Studied an archive of documents for a legal suit
  – ~350,000 pages of text
  – 40 queries
  – focus on high recall
  – used IBM's STAIRS full-text system
• Main result: the system retrieved less than 20% of the relevant documents for a particular information need, while the lawyers thought they had 75%
• But many queries had very high precision
Blair and Maron, cont.
• How they estimated recall
  – generated partially random samples of unseen documents
  – had users (unaware these were random) judge them for relevance
• Other results:
  – two lawyers' searches had similar performance
  – lawyers' recall was not much different from the paralegals'
Blair and Maron, cont.
• Why recall was low
  – users can't foresee the exact words and phrases that will indicate relevant documents
    • an "accident" was referred to by those responsible as an "event," "incident," "situation," "problem," …
    • differing technical terminology
    • slang, misspellings
  – Perhaps the value of higher recall decreases as the number of relevant documents grows, so more detailed queries were not attempted once the users were satisfied
Overview
• Evaluation of IR Systems – Review
• Blair and Maron
• Calculating Precision vs. Recall
• Using TREC_eval
• Theoretical limits of precision and recall
How Test Runs are Evaluated
• First ranked doc (d123) is relevant, which is 10% of the total relevant; therefore precision at the 10% recall level is 100%
• Next relevant (d56, rank 3) gives 66.7% precision at the 20% recall level
• Etc.
Ranked result list (* = relevant):
 1. d123*   2. d84     3. d56*    4. d6      5. d8
 6. d9*     7. d511    8. d129    9. d187   10. d25*
11. d38    12. d48    13. d250   14. d113   15. d3*

Rq = {d3, d5, d9, d25, d39, d44, d56, d71, d89, d123} : 10 relevant
Examples from Chapter 3 in Baeza-Yates
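Replaying the slide's walk-through on the ranked list above, a short sketch prints precision at each recall level reached as we scan down the ranking:

```python
# Ranked list and relevant set from the Baeza-Yates example.
ranking = ["d123", "d84", "d56", "d6", "d8", "d9", "d511", "d129",
           "d187", "d25", "d38", "d48", "d250", "d113", "d3"]
relevant = {"d3", "d5", "d9", "d25", "d39", "d44", "d56", "d71", "d89", "d123"}

found = 0
for rank, doc in enumerate(ranking, start=1):
    if doc in relevant:
        found += 1
        recall = found / len(relevant)      # fraction of all relevant seen
        precision = found / rank            # fraction of retrieved that are relevant
        print(f"recall {recall:.0%}  precision {precision:.1%}")

# Matches the slide: 100.0% precision at 10% recall, 66.7% at 20%,
# 50.0% at 30%, 40.0% at 40%, 33.3% at 50%.
```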
Graphing for a Single Query
[Graph: precision (y-axis, 0–100%) vs. recall (x-axis, 0–100%) for the single query above]
![Page 27: Lecture 11: Evaluation Cont](https://reader035.vdocuments.net/reader035/viewer/2022081505/56815b47550346895dc923c4/html5/thumbnails/27.jpg)
2013.03.04 - SLIDE 27IS 240 – Spring 2013
Averaging Multiple Queries

Average precision at recall level r over N_q queries: P̄(r) = Σ_{i=1}^{N_q} P_i(r) / N_q
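A sketch of this averaging, with made-up interpolated-precision values for two hypothetical queries at the 11 standard recall levels:

```python
# Each inner list holds a query's interpolated precision at recall
# levels 0.0, 0.1, ..., 1.0 (values are invented for illustration).
per_query = [
    [1.0, 1.0, 0.67, 0.5, 0.4, 0.33, 0.33, 0.2, 0.2, 0.2, 0.2],   # query 1
    [1.0, 0.5, 0.5, 0.33, 0.33, 0.25, 0.25, 0.2, 0.2, 0.2, 0.2],  # query 2
]
n_q = len(per_query)

# Average across queries at each recall level.
averaged = [sum(q[level] for q in per_query) / n_q for level in range(11)]
print([round(v, 3) for v in averaged])
```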
Interpolation
Rq = {d3, d56, d129}

Same ranked list as before (for this query the relevant docs fall at ranks 3, 8, and 15):
 1. d123    2. d84     3. d56*    4. d6      5. d8
 6. d9      7. d511    8. d129*   9. d187   10. d25
11. d38    12. d48    13. d250   14. d113   15. d3*
• First relevant doc is d56 at rank 3, which gives 33.3% recall at 33.3% precision
• Next relevant (d129, rank 8) gives 66.7% recall at 25% precision
• Next (d3, rank 15) gives 100% recall at 20% precision
• How do we figure out the precision at the 11 standard recall levels?
Interpolation

The interpolated precision at a standard recall level r is the highest precision observed at any recall level r′ ≥ r:
P_interp(r) = max { P(r′) : r′ ≥ r }
Interpolation
• So, at recall levels 0%, 10%, 20%, and 30% the interpolated precision is 33.3%
• At recall levels 40%, 50%, and 60% interpolated precision is 25%
• And at recall levels 70%, 80%, 90% and 100%, interpolated precision is 20%
• Giving graph…
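The three observed (recall, precision) points can be interpolated to the 11 standard levels with the max-over-higher-recall rule; this sketch reproduces the values on the slide:

```python
# Observed (recall, precision) pairs for Rq = {d3, d56, d129}.
observed = [(1/3, 1/3), (2/3, 0.25), (1.0, 0.20)]

def interpolated(observed, level):
    """Max precision at any observed recall >= the standard level."""
    return max((p for r, p in observed if r >= level), default=0.0)

for step in range(11):              # recall levels 0.0, 0.1, ..., 1.0
    level = step / 10
    print(f"{level:.1f}  {interpolated(observed, level):.3f}")

# Levels 0.0-0.3 -> 0.333, levels 0.4-0.6 -> 0.250, levels 0.7-1.0 -> 0.200,
# matching the slide.
```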
Interpolation
[Graph: interpolated precision (y-axis, 0–100%) vs. recall (x-axis, 0–100%) for the example above]
Overview
• Evaluation of IR Systems – Review
• Blair and Maron
• Calculating Precision vs. Recall
• Using TREC_eval
• Theoretical limits of precision and recall
Using TREC_EVAL
• Developed from the SMART evaluation programs for use in TREC
  trec_eval [-q] [-a] [-o] trec_qrel_file top_ranked_file
• NOTE: many other options in the current version
• Uses:
  – List of top-ranked documents, one line per entry:
    QID iter docno rank sim runid
    030 Q0 ZF08-175-870 0 4238 prise1
  – QRELS file for the collection:
    QID iter docno rel
    251 0 FT911-1003 1
    251 0 FT911-101 1
    251 0 FT911-1300 0
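This is not trec_eval itself, just a minimal Python sketch of reading the two input formats shown above and computing the Retrieved/Relevant/Rel_ret counts that trec_eval reports (helper names are mine; the extra qrels lines are illustrative):

```python
from collections import defaultdict

def read_run(lines):
    """Run file lines: QID iter docno rank sim runid."""
    run = defaultdict(list)
    for line in lines:
        qid, _iter, docno, _rank, _sim, _runid = line.split()
        run[qid].append(docno)
    return run

def read_qrels(lines):
    """Qrels lines: QID iter docno rel; keep docs judged relevant."""
    rel = defaultdict(set)
    for line in lines:
        qid, _iter, docno, judgment = line.split()
        if int(judgment) > 0:
            rel[qid].add(docno)
    return rel

run = read_run(["030 Q0 ZF08-175-870 0 4238 prise1"])
qrels = read_qrels(["030 0 ZF08-175-870 1", "030 0 ZF08-200-001 1"])

qid = "030"
retrieved = len(run[qid])                    # docs returned for the query
num_relevant = len(qrels[qid])               # docs judged relevant
rel_ret = len(set(run[qid]) & qrels[qid])    # relevant docs retrieved
print(retrieved, num_relevant, rel_ret)      # 1 2 1
```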
Running TREC_EVAL
• Options
  – -q gives evaluation for each query
  – -a gives additional (non-TREC) measures
  – -d gives the average document precision measure
  – -o gives the "old style" display shown here
Running TREC_EVAL
• Output:
  – Retrieved: number retrieved for the query
  – Relevant: number relevant in the qrels file
  – Rel_ret: relevant items that were retrieved
Running TREC_EVAL – Output

Total number of documents over all queries
  Retrieved: 44000
  Relevant:   1583
  Rel_ret:     635
Interpolated Recall - Precision Averages:
  at 0.00   0.4587
  at 0.10   0.3275
  at 0.20   0.2381
  at 0.30   0.1828
  at 0.40   0.1342
  at 0.50   0.1197
  at 0.60   0.0635
  at 0.70   0.0493
  at 0.80   0.0350
  at 0.90   0.0221
  at 1.00   0.0150
Average precision (non-interpolated) for all rel docs (averaged over queries): 0.1311
Plotting Output (using Gnuplot)

[Recall–precision plots, not transcribed]
Gnuplot code
set title "Individual Queries"
set ylabel "Precision"
set xlabel "Recall"
set xrange [0:1]
set yrange [0:1]
set xtics 0,.5,1
set ytics 0,.2,1
set grid
plot 'Group1/trec_top_file_1.txt.dat' title "Group1 trec_top_file_1" with lines 1
pause -1 "hit return"
trec_top_file_1.txt.dat:
0.00 0.4587
0.10 0.3275
0.20 0.2381
0.30 0.1828
0.40 0.1342
0.50 0.1197
0.60 0.0635
0.70 0.0493
0.80 0.0350
0.90 0.0221
1.00 0.0150
Overview
• Evaluation of IR Systems – Review
• Blair and Maron
• Calculating Precision vs. Recall
• Using TREC_eval
• Theoretical limits of precision and recall
Problems with Precision/Recall
• Can't know the true recall value – except in small collections
• Precision and recall are related
  – A combined measure (like F or MAP) is sometimes more appropriate
• Assumes batch mode
  – Interactive IR is important and has different criteria for successful searches
  – We will touch on this in the UI section
• Assumes a strict rank ordering matters
Relationship between Precision and Recall
[2×2 contingency table: retrieved / not retrieved × relevant / not relevant, with no counts shown]
Buckland & Gey, JASIS: Jan 1994
Recall Under various retrieval assumptions
Buckland & Gey, JASIS: Jan 1994
[Graph: recall (y-axis, 0.0–1.0) vs. proportion of documents retrieved (x-axis, 0.0–1.0), under Random, Perfect, Perverse, Tangent, and Parabolic Recall retrieval assumptions; 1000 documents, 100 relevant]
Precision under various assumptions
[Graph: precision (y-axis) vs. proportion of documents retrieved (x-axis), same five retrieval assumptions; 1000 documents, 100 relevant]
Recall-Precision
[Graph: precision (y-axis) vs. recall (x-axis), same five retrieval assumptions; 1000 documents, 100 relevant]
CACM Query 25

[Figure not transcribed]
Relationship of Precision and Recall

[Figure not transcribed]
Measures for Large-Scale Eval

• Typical user behavior in web search systems has shown a preference for high precision
• Graded scales of relevance also seem more useful than just "yes/no"
• Measures have been devised to help evaluate situations taking these into account
Cumulative Gain measures
• If we assume that highly relevant documents are more useful when appearing earlier in a result list (are ranked higher)
• And, highly relevant documents are more useful than marginally relevant documents, which are in turn more useful than non-relevant documents
• Then measures that take these factors into account would better reflect user needs
Simple CG
• Cumulative Gain is simply the sum of all the graded relevance values the items in a ranked search result list
• The CG at a particular rank p is
• Where i is the rank and reli is the relevance score
Discounted Cumulative Gain
• DCG measures the gain (usefulness) of a document based on its position in the result list
  – The gain is accumulated (as in simple CG), with the gain of each result discounted at lower ranks
• The idea is that highly relevant docs appearing lower in the result list should be penalized in proportion to their position in the results
Discounted Cumulative Gain
• The DCG is reduced logarithmically in proportion to the position p in the ranking:
  DCG_p = rel_1 + Σ_{i=2}^{p} rel_i / log₂(i)
• Why logs? No real reason except a smooth reduction. Another formulation is:
  DCG_p = Σ_{i=1}^{p} (2^{rel_i} − 1) / log₂(i + 1)
• This puts a stronger emphasis on high ranks
Normalized DCG
• Because search result lists vary in size depending on the query, comparing results across queries doesn't work with DCG alone
• To do this, DCG is normalized across the query set:
  – First create an "ideal" result by sorting the result list by relevance score
  – Use that ideal DCG value to normalize the observed DCG
Normalized DCG
• Using the ideal DCG (IDCG) at a given position and the observed DCG at the same position:
  nDCG_p = DCG_p / IDCG_p
• The nDCG values for all test queries can then be averaged to give a measure of the average performance of the system
• If a system ranks perfectly, DCG equals IDCG, so nDCG is 1 at every rank (nDCG_p ranges from 0 to 1)
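A compact sketch of CG, DCG (the log₂ formulation above), and nDCG; the graded judgments below are invented for illustration:

```python
import math

def cg(rels, p):
    """Simple cumulative gain at rank p."""
    return sum(rels[:p])

def dcg(rels, p):
    """DCG_p = rel_1 + sum_{i=2..p} rel_i / log2(i)."""
    return rels[0] + sum(rels[i - 1] / math.log2(i) for i in range(2, p + 1))

def ndcg(rels, p):
    """Observed DCG normalized by the DCG of the ideal (sorted) ranking."""
    ideal = sorted(rels, reverse=True)
    idcg = dcg(ideal, p)
    return dcg(rels, p) / idcg if idcg > 0 else 0.0

rels = [3, 2, 3, 0, 1, 2]   # hypothetical graded judgments for one result list
print(cg(rels, 6))          # 11
print(round(dcg(rels, 6), 3))
print(round(ndcg(rels, 6), 3))   # 1.0 only for a perfect ranking
```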
Next time
• Issues in Evaluation – Has there been improvement in systems?