Lessons Learned from Evaluation of Summarization Systems: Nightmares and Pleasant Surprises
Kathleen McKeown
Department of Computer Science
Columbia University
Major contributors: Ani Nenkova, Becky Passonneau
Questions
What kinds of evaluation are possible?
What are the pitfalls? Are evaluation metrics fair? Is real research progress possible?
What are the benefits?
Should we evaluate our systems?
What is the feel of the evaluation?
Is it competitive?
Does it foster a feeling of community?
Are the guidelines clearly established ahead of time?
Are the metrics fair? Do they measure what you want to measure?
The night Max wore his wolf suit and made mischief of one kind
and another and another
His mother called him “WILD THING” and he said “I’LL EAT YOU UP!” so he was sent to bed without eating anything.
DARPA GALE: Global Autonomous Language Exploitation
Three large teams: BBN, IBM, SRI
SRI: UC Berkeley, U Washington, UCSD, Columbia, NYU, UMASS, NCRI, Systran, Fair Isaacs, Ohio State
Generate responses to open-ended questions
17 templates: definitions, biographies, events, relationships, reactions, etc.
Using English, Chinese, and Arabic text and speech, blogs to news
Find all instances when a fact is mentioned (redundancy)
GALE Evaluation
Can systems do at least 50% as well as a human?
If not, the GALE program will not continue
The team that does worst may be cut
Independent evaluator: BAE
Has never done text evaluation before
Has experience with task-based evaluation
Gold standard: system responses graded by two judges; relevant facts added to the pool
Granularity of scoring: nuggets
Metrics: weighted variants of precision/recall, document citations, redundancy
Year 1: Sample Q&A
LIST FACTS ABOUT [The Trial of Saddam Hussein]
The judge , however, that all people should have heard voices, the order of a court to solve technical problems. (Chi)
His account of events surrounding the torture and execution of more than 140 men and teenage boys from the Dujail , appeared to do little to advance the prosecution's goal of establishing Saddam 's "command responsibility" for the deaths.
A trial without Saddam could be an embarrassment for the U.S. government, which has worked hard to help create a tribunal that would be perceived by Iraqis as independent and fair.
As the trial got under way, a former secret police officer testified that he had not received any orders from Saddam during the investigations that followed an assassination attempt against him in Dujail in 1982 .
Year 1: Results
F-value (Beta of 1)
Machine average: 0.230 Human average: 0.353
Machine to Human average: 0.678
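For reference, the F-value combines precision P and recall R, and with a beta of 1 it reduces to their harmonic mean. The formula below is the standard textbook definition, not something taken from the slides; how GALE weighted individual nuggets is not specified here.

```latex
F_\beta = \frac{(1 + \beta^2)\, P\, R}{\beta^2 P + R},
\qquad
F_1 = \frac{2\, P\, R}{P + R}
```

Note that the ratio of the two reported averages, 0.230 / 0.353 ≈ 0.652, is not the reported 0.678. One plausible reading (an assumption, not stated on the slide) is that the machine-to-human figure was averaged per question rather than computed from the overall averages.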
DUC – Document Understanding Conference
Established and funded by DARPA TIDES
Run by independent evaluator NIST
Open to summarization community
Annual evaluations on common datasets, 2001-present
Tasks: single document summarization, headline summarization, multi-document summarization, multi-lingual summarization, focused summarization, update summarization
DUC is changing direction again
DARPA GALE effort cutting back participation in DUC
Considering co-locating with TREC QA
Considering new data sources and tasks
DUC Evaluation
Gold standard: human summaries written by NIST, from 2 to 9 summaries per input set
Multiple metrics
Manual: coverage (early years), pyramids (later years), responsiveness (later years), quality questions
Automatic: ROUGE (-1, -2, -skip-bigrams, LCS, BE)
Granularity
Manual: sub-sentential elements
Automatic: sentences
TREC definition pilot
Long answer to request for a definition
As a pilot, less emphasis on results
Part of TREC QA
Evaluation Methods
Pool system responses and break into nuggets
A judge scores nuggets as vital, OK or invalid
Measure information precision and recall
Can a judge reliably determine which facts belong in a definition?
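The nugget-based scoring described above can be sketched roughly as follows. This is a simplified illustration under stated assumptions: recall is taken over vital nuggets only, precision is the share of returned nuggets judged vital or OK, and the function name, labels, and data layout are hypothetical. The official track scoring is not described on the slide and may differ, for instance in how precision is estimated.

```python
def nugget_scores(judgments, returned, beta=1.0):
    """Score one system response against judged nuggets.

    judgments: dict mapping nugget id -> "vital", "ok", or "invalid"
    returned:  set of nugget ids found in the system response
    Returns (precision, recall, f_beta). Simplified, assumed definitions.
    """
    vital = {n for n, label in judgments.items() if label == "vital"}
    acceptable = {n for n, label in judgments.items() if label in ("vital", "ok")}

    # Recall: how many of the vital nuggets did the response cover?
    recall = len(returned & vital) / len(vital) if vital else 0.0
    # Precision: how many returned nuggets were judged vital or OK?
    precision = len(returned & acceptable) / len(returned) if returned else 0.0

    if precision == 0.0 and recall == 0.0:
        return precision, recall, 0.0
    f_beta = (1 + beta**2) * precision * recall / (beta**2 * precision + recall)
    return precision, recall, f_beta


# Hypothetical judgments for one definition question.
judgments = {"n1": "vital", "n2": "vital", "n3": "ok", "n4": "invalid"}
precision, recall, f1 = nugget_scores(judgments, {"n1", "n3", "n4"})
print(precision, recall, f1)
```

Because everything hinges on which nuggets a single judge marks as vital, a different judge can change the recall denominator, which is exactly the reliability question raised on the slide.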
Considerations Across Evaluations
Independent evaluator
Not always as knowledgeable as researchers
Impartial determination of approach
Extensive collection of resources
Determination of task
Appealing to a broad cross-section of community
Changes over time
DUC 2001-2002: single and multi-document
DUC 2003: headlines, multi-document
DUC 2004: headlines, multilingual and multi-document, focused
DUC 2005: focused summarization
DUC 2006: focused and a new task, up for discussion
How long do participants have to prepare?
When is a task dropped?
Scoring of text at the sub-sentential level
Task-based Evaluation
Use the summarization system as a browser to do another task
Newsblaster: write a report given a broad prompt
DARPA utility evaluation: given a request for information, use question answering to write report
Task Evaluation
Hypothesis: multi-document summaries enable users to find information efficiently
Task: fact-gathering given topic and questions Resembles intelligence analyst task
User Study: Objectives
Does multi-document summarization help?
Do summaries help the user find information needed to perform a report writing task?
Do users use information from summaries in gathering their facts?
Do summaries increase user satisfaction with the online news system?
Do users create better quality reports with summaries?
How do full multi-document summaries compare with minimal 1-sentence summaries such as Google News?
User Study: Design
Compared 4 parallel news browsing systems
Level 1: Source documents only
Level 2: One-sentence multi-document summaries (e.g., Google News) linked to documents
Level 3: Newsblaster multi-document summaries linked to documents
Level 4: Human-written multi-document summaries linked to documents
All groups write reports given four scenarios
A task similar to analysts
Can only use Newsblaster for research
Time-restricted
User Study: Execution
4 scenarios: 4 event clusters each (2 directly relevant, 2 peripherally relevant), average 10 documents/cluster
45 participants: balance between liberal arts and engineering, 138 reports
Exit survey: multiple-choice and open-ended questions
Usage tracking: each click logged, on or off-site
“Geneva” Prompt
The conflict between Israel and the Palestinians has been difficult for government negotiators to settle. Most recently, implementation of the “road map for peace”, a diplomatic effort sponsored by ……
Who participated in the negotiations that produced the Geneva Accord?
Apart from direct participants, who supported the Geneva Accord preparations and how?
What has the response been to the Geneva Accord by the Palestinians?
Measuring Effectiveness
Score report content and compare across summary conditions
Compare user satisfaction per summary condition
Compare where subjects took report content from
Newsblaster
User Satisfaction
More effective than a web search with Newsblaster
Not true with documents only or single-sentence summaries
Easier to complete the task with summaries than with documents only
More likely to have enough time with summaries than with documents only
Summaries helped most: 5% with single-sentence summaries, 24% with Newsblaster summaries, 43% with human summaries
User Study: Conclusions
Summaries measurably improve a news browser’s effectiveness for research
Users are more satisfied with Newsblaster summaries than with single-sentence summaries like those of Google News
Users want search, which was not included in the evaluation
Potential Problems
That very night in Max’s room a forest grew
And grew
And grew until the ceiling hung with vines and the walls became the world all around
And an ocean tumbled by with a private boat for Max and he sailed all through the night and day
And he sailed in and out of weeks and almost over a year to where the wild things are
And when he came to where the wild things are they roared their terrible roars and gnashed their terrible teeth
Comparing Text Against Text
Which human summary makes a good gold standard? Many summaries are good
At what granularity is the comparison made?
When can we say that two pieces of text match?
Measuring variation
Types of variation between humans, by application
Translation: same content, different wording
Summarization: different content??, different wording
Generation: different content??, different wording
Human variation: content words (Ani Nenkova)
• Summaries differ in vocabulary; differences cannot be explained by paraphrase
• 7 translations, 20 documents
• 7 summaries, 20 document sets
• Faster vocabulary growth in summarization
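One way to picture "vocabulary growth" is to count how many distinct content words have been seen as each additional human text is added. The sketch below is only an illustration: the stopword list, the tokenizer, and the function names are assumptions and are not the procedure used in the study.

```python
import re

# Tiny illustrative stopword list (an assumption, not the study's list).
STOPWORDS = {"the", "a", "an", "of", "in", "to", "and", "at", "on",
             "for", "has", "have", "been", "was", "were", "is", "by"}


def content_words(text):
    """Lowercase, tokenize on letters, and drop stopwords."""
    return {w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS}


def vocabulary_growth(texts):
    """Cumulative number of distinct content words after each text."""
    seen = set()
    growth = []
    for text in texts:
        seen |= content_words(text)
        growth.append(len(seen))
    return growth


# Two near-paraphrases add no new content words; a third text with
# different content does.
print(vocabulary_growth([
    "Pinochet arrested in London",
    "Pinochet has been arrested in London",
    "Britain caused controversy",
]))  # [3, 3, 6]
```

If the curve for a set of summaries keeps climbing while the curve for a set of translations flattens, the summarizers are choosing different content rather than just rewording the same content.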
Variation impacts evaluation
Comparing content is hard: all kinds of judgment calls
Paraphrases
VP vs. NP: "Ministers have been exchanged" vs. "Reciprocal ministerial visits"
Length and constituent type: "Robotics assists doctors in the medical operating theater" vs. "Surgeons started using robotic assistants"
Nightmare: only one gold standard
System may have chosen an equally good sentence but not in the one gold standard
Pinochet arrested in London on Oct 16 at a Spanish judge’s request for atrocities against Spaniards in Chile.
Former Chilean dictator Augusto Pinochet has been arrested in London at the request of the Spanish government
In DUC 2001 (one gold standard), human model had significant impact on scores (McKeown et al)
Five human summaries needed to avoid changes in rank (Nenkova and Passonneau)
DUC2003 data: 3 topic sets (1 highest scoring and 2 lowest scoring), 10 model summaries
How many summaries are enough?
Scoring
Two main approaches used in DUC
ROUGE (Lin and Hovy)
Pyramids (Nenkova and Passonneau)
Problems: Are the results stable? How difficult is it to do the scoring?
ROUGE: Recall-Oriented Understudy for Gisting Evaluation
ROUGE: n-gram co-occurrence metrics measuring content overlap
$$\text{ROUGE-}n = \frac{\text{count of } n\text{-gram overlaps between candidate and model summaries}}{\text{total } n\text{-grams in model summaries}}$$
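The ratio above can be sketched in a few lines. This is a minimal illustration of n-gram recall against one or more model summaries, assuming whitespace tokenization and clipped counts; it is not the official ROUGE toolkit, which offers further options (stemming, stopword handling, and so on) that are not reproduced here.

```python
from collections import Counter


def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(candidate, models, n=1):
    """Overlapping n-grams divided by total n-grams in the model summaries."""
    cand = ngrams(candidate.lower().split(), n)
    overlap = 0
    total = 0
    for model in models:
        ref = ngrams(model.lower().split(), n)
        # Clip each match by how often the n-gram occurs in the candidate.
        overlap += sum(min(count, cand[gram]) for gram, count in ref.items())
        total += sum(ref.values())
    return overlap / total if total else 0.0


model = ["Pinochet was arrested in London"]
print(rouge_n("Pinochet arrested in London", model, n=1))  # 0.8
print(rouge_n("Pinochet arrested in London", model, n=2))  # 0.5
```

Because the denominator counts n-grams in the model summaries, the measure is recall-oriented: a candidate is rewarded for covering the models, not penalized for extra material.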
ROUGE
Experimentation with different units of comparison: unigrams, bigrams, longest common subsequence, skip-bigrams, basic elements
Automatic and thus easy to apply
Important to consider confidence intervals when determining differences between systems
Scores falling within the same interval are not significantly different
ROUGE scores place systems into large groups: it can be hard to say definitively that one is better than another
Sometimes results are unintuitive
Multilingual scores as high as English scores
Use in speech summarization shows no discrimination
Good for training regardless of intervals: can see trends
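The point about confidence intervals can be illustrated with a percentile bootstrap over per-topic scores. This is one common way to obtain such intervals, shown here as a sketch; the function names, the number of resamples, and the overlap check are assumptions and are not taken from the slides.

```python
import random


def bootstrap_ci(scores, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean of per-topic scores."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    n = len(scores)
    means = []
    for _ in range(n_resamples):
        # Resample topics with replacement and record the mean score.
        sample = [scores[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lower_index = int((alpha / 2) * n_resamples)
    upper_index = max(lower_index, int((1 - alpha / 2) * n_resamples) - 1)
    return means[lower_index], means[upper_index]


def intervals_overlap(a, b):
    """True when two (low, high) intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


# Hypothetical per-topic scores for two systems.
system_a = [0.12, 0.15, 0.11, 0.14, 0.13, 0.16, 0.12, 0.15]
system_b = [0.13, 0.14, 0.12, 0.15, 0.12, 0.15, 0.13, 0.14]
ci_a = bootstrap_ci(system_a)
ci_b = bootstrap_ci(system_b)
print(ci_a, ci_b, intervals_overlap(ci_a, ci_b))
```

When the intervals overlap, as they usually will for systems with similar averages, the scores alone do not support ranking one system above the other.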
Pyramids
Uses multiple human summaries
Information is ranked by its importance
Allows for multiple good summaries
A pyramid is created from the human summaries
Elements of the pyramid are content units
System summaries are scored by comparison with the pyramid
Content units: better study of variation than sentences
Semantic units
Link different surface realizations with the same meaning
Emerge from the comparison of several texts
Content unit example
S1 Pinochet arrested in London on Oct 16 at a Spanish judge’s request for atrocities against Spaniards in Chile.
S2 Former Chilean dictator Augusto Pinochet has been arrested in London at the request of the Spanish government.
S3 Britain caused international controversy and Chilean turmoil by arresting former Chilean dictator Pinochet in London.
SCU: A cable car caught fire (Weight = 4)
A. The cause of the fire was unknown.
B. A cable car caught fire just after entering a mountainside tunnel in an alpine resort in Kaprun, Austria on the morning of November 11, 2000.
C. A cable car pulling skiers and snowboarders to the Kitzsteinhorn resort, located 60 miles south of Salzburg in the Austrian Alps, caught fire inside a mountain tunnel, killing approximately 170 people.
D. On November 10, 2000, a cable car filled to capacity caught on fire, trapping 180 passengers inside the Kitzsteinhorn mountain, located in the town of Kaprun, 50 miles south of Salzburg in the central Austrian Alps.
SCU: The cause of the fire is unknown (Weight = 1)
A. The cause of the fire was unknown.
B. A cable car caught fire just after entering a mountainside tunnel in an alpine resort in Kaprun, Austria on the morning of November 11, 2000.
C. A cable car pulling skiers and snowboarders to the Kitzsteinhorn resort, located 60 miles south of Salzburg in the Austrian Alps, caught fire inside a mountain tunnel, killing approximately 170 people.
D. On November 10, 2000, a cable car filled to capacity caught on fire, trapping 180 passengers inside the Kitzsteinhorn mountain, located in the town of Kaprun, 50 miles south of Salzburg in the central Austrian Alps.
Idealized representation
Tiers of differentially weighted SCUs
Top: few SCUs, high weight
Bottom: many SCUs, low weight
[Figure: pyramid with tiers labeled W=3 (top), W=2, W=1 (bottom)]
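Scoring a system summary against such a pyramid can be sketched as a ratio: the summed weight of the SCUs the summary expresses, divided by the best weight attainable with a given number of SCUs. The sketch below is an illustration under assumptions. The SCU labels and weights other than the two shown on the slides are hypothetical, and the optional `ideal_size` parameter is a stand-in for the idea of normalizing by a typical model-summary size rather than by the size of the summary being scored.

```python
def pyramid_score(pyramid_weights, summary_scus, ideal_size=None):
    """Observed SCU weight divided by the best weight for a summary of that size.

    pyramid_weights: dict mapping SCU label -> weight (number of human
                     summaries that expressed it)
    summary_scus:    SCU labels found in the system summary; labels missing
                     from the pyramid contribute weight 0
    ideal_size:      number of SCUs used for the ideal summary; defaults to
                     the number of distinct SCUs in the system summary
    """
    expressed = set(summary_scus)
    observed = sum(pyramid_weights.get(scu, 0) for scu in expressed)
    size = len(expressed) if ideal_size is None else ideal_size
    # An ideal summary takes SCUs from the top tiers of the pyramid first.
    best = sum(sorted(pyramid_weights.values(), reverse=True)[:size])
    return observed / best if best else 0.0


# Two weights come from the slides; the other two are hypothetical.
pyramid = {
    "a cable car caught fire": 4,
    "the fire was in Kaprun, Austria": 3,        # hypothetical
    "the fire was inside a mountain tunnel": 2,  # hypothetical
    "the cause of the fire is unknown": 1,
}
summary = ["a cable car caught fire", "the cause of the fire is unknown"]
print(pyramid_score(pyramid, summary))                # 5 / 7
print(pyramid_score(pyramid, summary, ideal_size=3))  # 5 / 9
```

A summary that spends its space on low-weight SCUs scores below one that picks from the top tiers, which is how the method tolerates multiple good summaries while still ranking content by importance.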
Comparison of Scoring Methods in DUC05
Analysis of scores for the 20 pyramid sets
Columbia prepared pyramids
Participants scored systems against pyramids
Comparisons between Pyramid (original, modified), responsiveness, and ROUGE-SU4
Pyramid score computed from multiple humans
Responsiveness is just one human’s judgment
ROUGE-SU4 equivalent to ROUGE-2
Creation of pyramids
Done at Columbia for each of 20 out of 50 sets
Primary annotator, secondary checker
Held round-table discussions of problematic constructions that occurred in this data set
Comma-separated lists: "Extractive reserves have been formed for managed harvesting of timber, rubber, Brazil nuts, and medical plants without deforestation."
General vs. specific: Eastern Europe vs. Hungary, Poland, Lithuania, and Turkey
Characteristics of the Responses
Proportion of SCUs of weight 1 is large: 44% (D324) to 81% (D695)
Mean SCU weight: 1.9
Agreement among human responders is quite low
SCU Weights
# of SCUs at each weight
Preview of Results
Manual metrics: large differences between humans and machines
No single system is the clear winner, but a top group is identified by all metrics
Significant differences: different predictions from manual and automatic metrics
Correlations between metrics: some correlation, but one cannot be substituted for another (this is good)
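Agreement between two metrics' system rankings can be quantified with a rank correlation. The sketch below computes Spearman's coefficient from scratch, assuming no tied scores; the function names and the toy scores are hypothetical, and the slides do not say which correlation statistic was used.

```python
def ranks(scores):
    """Map each system to its rank (1 = best) under one metric."""
    order = sorted(scores, key=scores.get, reverse=True)
    return {system: position for position, system in enumerate(order, start=1)}


def spearman(scores_a, scores_b):
    """Spearman rank correlation over systems scored by both metrics.

    Assumes at least two shared systems and no tied scores.
    """
    systems = sorted(set(scores_a) & set(scores_b))
    n = len(systems)
    rank_a = ranks({s: scores_a[s] for s in systems})
    rank_b = ranks({s: scores_b[s] for s in systems})
    d_squared = sum((rank_a[s] - rank_b[s]) ** 2 for s in systems)
    return 1 - 6 * d_squared / (n * (n**2 - 1))


# Hypothetical scores for three systems under two metrics.
manual = {"s1": 0.30, "s2": 0.20, "s3": 0.10}
automatic_same = {"s1": 0.14, "s2": 0.13, "s3": 0.12}
automatic_reversed = {"s1": 0.12, "s2": 0.13, "s3": 0.14}
print(spearman(manual, automatic_same))      # 1.0
print(spearman(manual, automatic_reversed))  # -1.0
```

A coefficient well below 1 between a manual and an automatic metric is the quantitative form of "one cannot be substituted for another".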
Human performance/Best sys

| | Pyramid | Modified | Resp | ROUGE-SU4 |
|---|---|---|---|---|
| Human | B: 0.5472 | B: 0.4814 | A: 4.895 | A: 0.1722 |
| Human | A: 0.4969 | A: 0.4617 | B: 4.526 | B: 0.1552 |
| Best system | 14: 0.2587 | 10: 0.2052 | 4: 2.85 | 15: 0.139 |

Best system ~50% of human performance on manual metrics
Best system ~80% of human performance on ROUGE
| Pyramid original | Modified | Resp | ROUGE-SU4 |
|---|---|---|---|
| 14: 0.2587 | 10: 0.2052 | 4: 2.85 | 15: 0.139 |
| 17: 0.2492 | 17: 0.1972 | 14: 2.8 | 4: 0.134 |
| 15: 0.2423 | 14: 0.1908 | 10: 2.65 | 17: 0.1346 |
| 10: 0.2379 | 7: 0.1852 | 15: 2.6 | 19: 0.1275 |
| 4: 0.2321 | 15: 0.1808 | 17: 2.55 | 11: 0.1259 |
| 7: 0.2297 | 4: 0.177 | 11: 2.5 | 10: 0.1278 |
| 16: 0.2265 | 16: 0.1722 | 28: 2.45 | 6: 0.1239 |
| 6: 0.2197 | 11: 0.1703 | 21: 2.45 | 7: 0.1213 |
| 32: 0.2145 | 6: 0.1671 | 6: 2.4 | 14: 0.1264 |
| 21: 0.2127 | 12: 0.1664 | 24: 2.4 | 25: 0.1188 |
| 12: 0.2126 | 19: 0.1636 | 19: 2.4 | 21: 0.1183 |
| 11: 0.2116 | 21: 0.1613 | 6: 2.4 | 16: 0.1218 |
| 26: 0.2106 | 32: 0.1601 | 27: 2.35 | 24: 0.118 |
| 19: 0.2072 | 26: 0.1464 | 12: 2.35 | 12: 0.116 |
| 28: 0.2048 | 3: 0.145 | 7: 2.3 | 3: 0.1198 |
| 13: 0.1983 | 28: 0.1427 | 25: 2.2 | 28: 0.1203 |
| 3: 0.1949 | 13: 0.1424 | 32: 2.15 | 27: 0.110 |
| 1: 0.1747 | 25: 0.1406 | 3: 2.1 | 13: 0.1097 |
Significant Differences
Manual metrics
  Few differences between systems
  Pyramid: 23 is worse
  Responsive: 23 and 31 are worse
  Both humans better than all systems
Automatic (Rouge-SU4)
  More differences between systems
  One human indistinguishable from 5 systems
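The slides do not say which significance test was used. As one common option, a paired sign-flip randomization test over per-topic scores can be sketched as follows. The function name, the trial count, and the per-topic scores are all hypothetical and are not taken from the evaluation.

```python
import random

def paired_randomization_test(scores_a, scores_b, trials=10000, seed=0):
    """Two-sided paired sign-flip test on the mean per-topic difference.

    Returns an estimated p-value for the null hypothesis that the two
    systems are interchangeable on each topic.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    observed = abs(sum(diffs) / n)
    hits = 0
    for _ in range(trials):
        # Randomly swap the two systems' scores on each topic.
        total = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(total / n) >= observed - 1e-12:
            hits += 1
    # Add-one smoothing keeps the estimate away from exactly zero.
    return (hits + 1) / (trials + 1)

# Hypothetical per-topic scores for two systems on ten topics.
system_x = [0.31, 0.28, 0.35, 0.30, 0.27, 0.33, 0.29, 0.32, 0.34, 0.26]
system_y = [0.21, 0.20, 0.24, 0.22, 0.19, 0.25, 0.18, 0.23, 0.22, 0.17]

p_value = paired_randomization_test(system_x, system_y)
print(f"estimated p-value: {p_value:.4f}")
```

With few topics or small differences such a test will often fail to separate systems, which would be consistent with the "few differences between systems" observation for the manual metrics.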
Correlations: Pearson’s, 25 systems
           Pyr-mod   Resp-1   Resp-2   R-2    R-SU4
Pyr-orig   0.96      0.77     0.86     0.84   0.80
Pyr-mod              0.81     0.90     0.90   0.86
Resp-1                        0.83     0.92   0.92
Resp-2                                 0.88   0.87
R-2                                           0.98
Questionable that responsiveness could be a gold standard
Pyramid and responsiveness
High correlation, but the metrics are not mutually substitutable
Pyramid and Rouge
High correlation, but the metrics are not mutually substitutable
Correlations
Original and modified pyramid scores can substitute for each other
High correlation between manual and automatic, but automatic not yet a substitute
Similar patterns between pyramid and responsiveness
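For reference, Pearson's r between two metrics is computed over per-system scores. A minimal sketch, using the five systems (14, 17, 15, 10, 4) whose original and modified pyramid scores both appear near the top of the ranking table. This subset is for illustration only and is not expected to reproduce the 25-system values in the correlation table; the helper name `pearson` is an assumption.

```python
import math

def pearson(xs, ys):
    """Pearson's r for two equal-length lists of scores."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# system id -> (original pyramid score, modified pyramid score),
# copied from the ranking table
scores = {
    14: (0.2587, 0.1908),
    17: (0.2492, 0.1972),
    15: (0.2423, 0.1808),
    10: (0.2379, 0.2052),
    4: (0.2321, 0.177),
}
pyr_orig = [pair[0] for pair in scores.values()]
pyr_mod = [pair[1] for pair in scores.values()]

r = pearson(pyr_orig, pyr_mod)
print(f"Pearson r on this 5-system subset: {r:.2f}")
```

A high r over all systems says two metrics order systems similarly on the whole. It does not guarantee that they agree on any particular pair of systems, which may be part of why the slides treat high correlation and substitutability as separate questions.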
Nightmare
Scoring metric that is not stable used to decide funding
Insignificant differences between systems determine funding
Is Task Evaluation Nightmare Free?
Impact of user interface issues
  Can have more impact than the summary
Controlling for proper mix of subjects
Number of subjects and time needed to carry out are large
Till Max said “Be still!” and tamed them with the magic trick
Of staring into their yellow eyes without blinking once
And they were frightened and called him the most wild thing of all
And made him king of all wild things
“And now,” cried Max “Let the wild rumpus start!”
Are we having fun yet?
Benefits of evaluation
Emergence of evaluation methods
  ROUGE
  Pyramids
  Nuggeteer
Research into characteristics of metrics
Analyses of sub-sentential units
Paraphrase as a research issue
Available Data
DUC data sets
  4 years of summary/document set pairs
  Multidocument summarization training data not available beforehand
  4 years of scoring patterns
  Led to analysis of human summaries
Pyramids
  Pyramids and peers for 40 topics (DUC04, DUC05)
  Many more from Nenkova and Passonneau
  Training data for paraphrase
  Training data for abstraction -> see systems moving away from pure sentence extraction
Wrapping up
Lessons Learned
Evaluation environment is important
  Find a task with broad appeal
  Use independent evaluator
  At least a committee
Use multiple gold standards
Compare text at the content unit level
Evaluate the metrics
Look at significant differences
Is Evaluation Worth It?
DUC: creation of a community
  From ~15 participants year 1 -> 30 participants year 5
  No longer impacts funding
Enables research into evaluation
  At start, no idea how to evaluate summaries
But, results do not tell us everything
And he sailed back over a year, in and out of weeks and through a day
And into the night of his very own room where he found his supper waiting for him .. And it was still warm.