Progressive Summarization: Summarizing relevant and
novel information
by
Praveen Bysani
Thesis submitted in partial fulfillment
of the requirements for the degree of
Master of Science (By Research) in
Computer Science and Engineering
Search and Information Extraction Lab
International Institute of Information Technology, Hyderabad
December 2010
Copyright © Praveen Bysani, 2010
All Rights Reserved
International Institute of Information Technology
Hyderabad, India
CERTIFICATE
It is certified that the work contained in this thesis, titled "Progressive Summarization: Summarizing relevant and novel information" by Praveen Bysani, submitted in partial fulfillment for the award of the degree of Master of Science (By Research) in Computer Science and Engineering, has been carried out under my supervision and has not been submitted elsewhere for a degree.
Date Adviser: Dr. Vasudeva Varma
Associate Professor
IIIT Hyderabad
To all the good, bad and evil people around me
Acknowledgments
I heartily thank Dr. Vasudeva Varma, my thesis advisor, for the guidance, support and encouragement he provided throughout my journey in SIEL. He helped me transform from an average undergraduate into a successful postgraduate. I sincerely acknowledge Dr. Prasad Pingali for his valuable suggestions and the discussions that stimulated me to work towards my thesis. It would be a crime if I did not mention Rahul Katragadda, for he is such a responsible mentor and helped me a great deal with my communication skills. I thank everyone in SIEL for their assistance: Mr. Mahender, for doing all the painful paperwork, and Mr. Babji, the system administrator of the lab.
I take this opportunity to thank all the anonymous reviewers of my work at ICON 2009 and NAACL 2010. I thank Dr. Rajeev Sangal for providing me an opportunity to travel to Los Angeles, California, to present my work. I will always be thankful to the Student Research division of NAACL for supporting my travel and stay at NAACL 2010.
I feel blessed to be in a group of really good friends. I cherish each and every moment of my non-academic life in OBH. I feel lucky to have been associated with Vijay Bharat during my initial days at SIEL. He is a major contributor to the preliminary work in my thesis and also to restructuring the code base of summarization. I am greatly indebted to Sai Krishna, our senior, who shared invaluable thoughts and mentored me during my honors and semester projects. I also thank my peers and my juniors for supporting me during the Text Analysis Conference (TAC) and for their valuable inputs during my thesis.
If it were not for my family, I would not have sustained all the pressure with such ease. I am fortunate to have my parents, Sai and Rani, and my sister Anusha, who helped me evolve into a responsible person.
Abstract
The amount of textual and multimedia information on the World Wide Web has been increasing manyfold every year. A user seeking information on the web is often overloaded by search engines and information retrieval systems with a colossal number of related documents meant to satisfy his information need. In this context, it has become increasingly important to develop information access systems that provide focused and precise answers to the user. Text summarization is a popular information access solution to the information overload problem.
The Internet allows its users to follow any popular, temporally evolving topic on the web. A temporal topic has many publishing sources, and the user cannot cope with the huge amount of raw information from news aggregators and blogs. In such a scenario it is not sensible to wait for the topic to complete before producing a summary, nor does it make sense to produce an overall summary at every time interval. In this thesis we study a variant of text summarization, "Progressive Summarization", that focuses only on relevant and novel information and produces an informative, non-redundant summary of the topic. We provide progressive summaries at regular time intervals that help update the user's knowledge about the topic.
In this work we focus only on extractive methods of summarization, where only text units from the document collection are used to produce a summary. A sentence is considered the basic text unit of the summary. Extractive summarizers generally follow a sequential framework that includes Pre-processing of text for sentence boundary identification and extraction; a Feature Extraction stage, where several statistical, linguistic and heuristic models are employed to score sentences; a Sentence Ranking stage, which estimates sentence importance through a weighted linear combination of the features; and finally Summary Extraction, during which a subset of the ranked sentences is selected into the summary.
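The weighted-linear ranking stage can be sketched in a few lines. This is only an illustrative sketch: the feature names, weights, query and sentences below are hypothetical toy values, not the features or weights used by the thesis system.

```python
# Sketch of an extractive ranking step: each sentence receives feature scores,
# which are combined by a weighted linear sum; the top-ranked subset forms the
# summary. Feature names and weights here are hypothetical.

def rank_sentences(sentences, features, weights):
    """Score each sentence as a weighted linear combination of its features."""
    scored = []
    for s in sentences:
        score = sum(weights[name] * f(s) for name, f in features.items())
        scored.append((score, s))
    return [s for _, s in sorted(scored, reverse=True)]

# Two toy features: a capped length score and keyword overlap with a query.
query = {"storm", "damage"}
features = {
    "length": lambda s: min(len(s.split()) / 20.0, 1.0),
    "overlap": lambda s: len(query & set(s.lower().split())) / len(query),
}
weights = {"length": 0.3, "overlap": 0.7}

docs = ["The storm caused heavy damage to the coast.",
        "Officials met on Tuesday.",
        "Relief work continues."]
# Summary extraction here is simply "take the top-ranked sentence".
summary = rank_sentences(docs, features, weights)[:1]
```

In the thesis framework the feature set is far richer (Chapters 4 and 5) and the weights are learned by regression rather than fixed by hand, but the combination step has this same linear form.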
The most important factor in extracting important items for a progressive summary is the identification of novel information. Progressive summarization requires differentiating among information that is Relevant and Novel, Non-Relevant and Novel, and Relevant but Redundant. Since the existing features succeed in capturing only relevance, we devise multiple new features (NF, NW, HKLID) to capture the novelty of a sentence along with its relevance. We build alternative methods to incorporate novelty into the conventional summarization framework. We devise a new re-ranking measure, Proximity Re-ranking (ProximRank), that computes the rank of a sentence based on the relevant novelty of its surrounding sentences. We model novelty detection in the context of progressive summarization as an information filtering task: sentences that possibly contain prior information are filtered out of the summary by creating a Novelty Pool (NP). These methods are applied at different stages of summarization and evaluated against each other to find the best. In this thesis we also discover the importance of prepositions in determining the salience and relevance of a sentence to the topic. In addition, we use a machine learning technique (regression) to estimate sentence importance from its feature vector, thus overcoming the problem of determining the ideal weight combination, which requires an ample amount of experiments and human judgment when several features are used.
The techniques described in this thesis are used in building a progressive summarization system (Siel 09) that outperformed all 43 other participating systems in the Text Analysis Conference (TAC 2009) in manual evaluations.
Contents

1 Introduction
  1.1 Automatic Text Summarization
  1.2 Types of Summarization
  1.3 Novelty Detection
  1.4 Evaluation
    1.4.1 Summary Evaluation
      1.4.1.1 Evaluation Workshops
    1.4.2 Novelty Evaluation
  1.5 Organization of thesis
2 Problem Statement
  2.1 Motivation
  2.2 Problem Definition
  2.3 Outline of the Solution
  2.4 Contributions
3 Related Work
  3.1 State of the art approaches in Summarization
    3.1.1 Machine Learning Approaches
  3.2 Novelty Detection Approaches
  3.3 Approaches to Progressive/Temporal Summarization
4 Supervised sentence ranking using Regression
  4.1 Summarization Framework
    4.1.1 Stages of Framework
  4.2 Feature combination using Support Vector Regression
    4.2.1 Motivation
    4.2.2 Support Vector Regression (SVR)
      4.2.2.1 Sentence Importance Estimation
    4.2.3 Feature Combination
  4.3 Extraction of sentence relevancy features
    4.3.1 Sentence position
    4.3.2 TF-IDF
    4.3.3 Document Frequency Score (DFS)
    4.3.4 Sentence Frequency Score (SFS)
    4.3.5 Probabilistic hyperspace analogue to language (PHAL)
    4.3.6 Kullback-Leibler divergence (KLD)
    4.3.7 Prepositional Importance (PrepImp)
    4.3.8 Oracle Summaries
  4.4 Summary
5 Progressive Summarization: Summarization with Novelty Detection
  5.1 Feature Extraction level
    5.1.1 Novelty Factor (NF)
    5.1.2 New Word Measure (NW)
    5.1.3 Hybrid Kullback-Leibler Information Divergence (HKLID)
  5.2 Sentence Ranking Level
    5.2.1 Redundancy Re-ranking
    5.2.2 Proximity Re-ranking (ProximRank)
  5.3 Summary Extraction Level
  5.4 Summary
6 Evaluation
  6.1 Dataset
  6.2 Evaluation Metrics
    6.2.1 ROUGE
    6.2.2 Pyramids
    6.2.3 Readability and overall responsiveness
  6.3 Evaluation of Supervised ranking
    6.3.1 Kernel Functions
    6.3.2 Regression vs. Weighted Linear scoring
    6.3.3 Combination of features
  6.4 Evaluation of Progressive Summarization
7 Conclusions and Future directions
  7.1 Future Directions
Bibliography
List of Figures

4.1 Sample news article from AQUAINT news corpus
4.2 Stages in a Multi Document Summarizer
5.1 Novelty detection at different stages in a Multi Document Summarizer
6.1 Sample topic and narrative in TAC 2008
List of Tables

5.1 Statistics of relevant, novel and consecutive relevant sentences in TREC 2003
6.1 ROUGE-2, ROUGE-SU4 scores of pdocs using different kernels
6.2 ROUGE-2, ROUGE-SU4 scores of ndocs using different kernels
6.3 ROUGE-2 scores of pdocs and ndocs while using Regression to estimate the sentence importance
6.4 ROUGE-SU4 scores of pdocs and ndocs while using Regression to estimate the sentence importance
6.5 ROUGE scores of pdocs for different combinations of features
6.6 ROUGE scores of ndocs for different combinations of features
6.7 ROUGE scores of different configurations with novelty detection techniques
6.8 Automated and Manual evaluation results of TAC systems
Chapter 1
Introduction
With ever-growing content on the World Wide Web, it has become increasingly difficult for users to search for useful information. The rapid growth of news portals, blogs and social networking sites has led to an enormous surge of online content. Search engines, which are supposed to satisfy users' information needs, have more information to offer than what is required. This problem is referred to as information overload. In this context, it has become increasingly important to develop information access solutions that provide easy and efficient access to users. Automatic summarization systems address the information overload problem by producing a summary of related documents that provides an overall understanding of the topic without having to go through every document.
1.1 Automatic Text Summarization
Text summarization is the process of condensing text to its most essential points. Although the definition of summarization is obvious, it needs to be emphasized that summarizing is a hard problem. A summarization system has to interpret the source content, where content is a matter of both information and expression, and identify important information, where importance is a matter of both salience and essence. Summarization is challenging for its inherent cognitive process, as an ideal summarization system has to mimic a human mind in the process of abstracting. Summarization is also interesting for its practical and real-
life applications. Researchers [22] have postulated summarization as a tripartite processing model:
1. Topic Identification: An initial exploration to identify the genre and topics of source text.
Most important units of text are identified using several independent modules.
2. Interpretation: Important topics are fused, and expressed in a new formulation using
concepts that are not explicitly contained in the input.
3. Summary Generation: Unreadable abstract representations from interpretation are trans-
formed into a coherent human readable format.
Each major process may subsume several sub-processes depending on the context and purpose.
1.2 Types of Summarization
Summarization systems have been categorized into several types based on input factors, like language, media and genre, and purpose factors, like audience, use and situation. The following are a few popular types of summarization, classified based on the medium of content:
• Document Summarization: Summarizing information in the form of digital text is referred to as document summarization. It is the most focused area of text summarization, with almost five decades of research. Document summarization branched out into single-document and multi-document summarization over the course of time. News article summarization and scientific paper abstraction are two popular areas of document summarization. Focused workshops like the Document Understanding Conference (DUC) provided a common platform and set evaluation benchmarks that have cultivated interest and enabled researchers to participate in large-scale experiments.
• Opinion Summarization/Blog Summarization: With the advent of Web 2.0 and the flourishing growth of blogs and forums, people are now able to express their opinions through blog posts and reviews. It is important to understand the opinions of people on a particular product, for a business organization to devise commercial strategies, or for an individual to analyse the reviews on a topic of his interest. Since there are millions of people writing their opinions every day, mining knowledge from this huge amount of information is challenging. In this scenario, an opinion summarization system that extracts, analyses and summarizes opinions will be useful. Recently, opinion mining has received huge interest in the information systems and language technologies communities through the International Conference on Weblogs and Social Media (ICWSM) and the Text Analysis Conference (TAC) opinion summarization and opinion question answering tasks.
• Book Summarization: Books represent one of the oldest forms of written communication and have been used as a means to store and transmit information. An increasingly large number of books is becoming available in electronic format through projects such as Gutenberg 1 and the Million Books project 2. This escalates the need for language processing techniques that can handle very large documents such as books. A book summarization system can be used to produce short abstracts of every chapter in a novel or a technical book. A user can then skim through summaries of previous chapters to refresh his memory of the book so far. Alternatively, it can be used to produce a summary of the whole book.
• Speech Summarization: There has been an explosive growth of multimedia content on the World Wide Web due to the availability of broadcast radio and news channels. The amount of audio content is only going to grow with the availability of cheap mass storage. This has necessitated systems that can efficiently process huge amounts of audio data. Speech summarization is one solution to this problem, with a wide variety of applications. Broadcast news summarization is a popular area within speech summarization, where it serves the purpose of summarizing the important content of a news show. It can also be used to summarize long voice mails, saving a lot of time for the user.
• Video/Multimedia Summarization: The growing availability of multimedia software and hardware platforms makes multimedia summarization an important application area of summarization. There is a huge amount of multimedia content available on the web in
1 http://gutenberg.org
2 http://archive.org/details/millionbooks
the form of images, speech, video and flash. Research in this area is evolving rapidly, with many developments taking place outside the summarization community, within the digital libraries, speech understanding, multimedia and other communities. A multimedia summarizer could summarize movies and video lectures and allow users to skip the lengthy, boring parts.
Apart from the medium of content, summaries are also classified into many categories depending on the context within which the summary is intended to be used. Below we discuss a few such popular dimensions of summarization:
• Extract vs. Abstract: Abstractive methods generate a summary from an abstract representation of the source documents, which may contain sentences that are not necessarily present in the document set. Extractive methods rely only on sentences in the original document set. Extraction is the process of selecting important units from the original documents and presenting them in the form of a summary. Although there have been some efforts to generate abstract summaries [18], extraction still remains the most feasible approach, and the dominant portion of the work in summarization is based on extraction. The focus of this work is on extraction-based document summarization, with sentences as the primary units.
• Single Document vs. Multi Document: Text summarization has progressed from single-document summarization to the more challenging problem of multi-document summarization. Generating a summary for a set of multiple related documents on a topic is a more difficult task, as the documents are likely to contain similar content. Concatenation of individual single-document summaries does not necessarily produce a multi-document summary.
• Query Focused vs. Generic: A generic summarizer produces summaries that encapsulate the most salient points of the source document set. On the contrary, a query-focused summarization system has access to the user's information need in the form of a query and tailors its summary accordingly. With the growth of online search and retrieval, query-focused summarizers would provide a better output than generic systems.
• Personalized: The interpretation of a piece of text depends on the domain knowledge and personal interests of a human. The notion of importance and relevance changes from person to person. Normal summarization systems produce a uniform summary for all users, irrespective of their personal interests. A personalized summarizer caters to the user's personal background and interests. Hence, a personalized summary changes in accordance with the preferences of the reader.
• Progressive Summarization/Temporal Summarization: Temporal summarization is targeted at users who have access to a rapidly flowing stream of articles on a topic and have no time to look at each article. In such a situation, a person would prefer to be updated on events within the topic, and dive into details only when reported events trigger enough interest.
It is not sensible to wait for the topic to complete before producing a summary, nor does it make sense to produce an overall summary at every time interval; after all, the user has already been informed about prior events. A temporal summarization system produces revised summaries on a topic at regular time intervals and updates the user's knowledge. Although there are prior attempts in this dimension, it gained a lot of focus after its introduction as "Update Summarization" in the DUC workshops.
We coin the term Progressive Summarization for extractive, query-focused, multi-document temporal summarization, around which this thesis is centered. Details and related work in this dimension are thoroughly studied in Chapters 2 and 3. Detecting novel and relevant information is a major challenge in temporal summarization.
1.3 Novelty Detection
Novelty is an inherently difficult phenomenon to operationalize. Detecting novel information from source documents, given the user's prior knowledge of the topic, is termed novelty detection. It is not sensible to identify new information that is not relevant to the user's interest; hence, novel information is generally regarded as relevant novel information. The problem of novelty detection has long been a significant challenge in information retrieval systems. As the task of finding new information from a pool of relevant information is difficult even for experienced human assessors [46], novelty detection still remains an active area of research. Document-level novelty, while intuitive, is rarely useful because nearly every document contains something new. Hence novelty detection is usually performed at two levels:
1. At event level: The National Institute of Standards and Technology (NIST), along with the Linguistic Data Consortium (LDC), started a project named Topic Detection and Tracking (TDT) to understand and discover topical structure in unsegmented streams of news reports across different sources and languages. TDT tasks consider each news story as a set of events occurring over a course of time. One of the tasks under this study, First Story Detection (FSD), requires constant monitoring of news to identify the onset of a new event in a particular topic. FSD is an inherent first step for TDT.
FSD is the process of detecting, within a corpus of news articles, the stories that are the first to describe an event. FSD is a major leap in event-level novelty detection that fostered efficient techniques in text processing.
2. At sentence level: The "selective dissemination of information" (SDI) paradigm assumes that people want to be able to track new information relating to known topics as their primary task. While most SDI and information filtering systems in the literature have focused on similarity to a topical profile [46], or a community of users with shared interests, recent efforts have looked at the retrieval of specifically novel information. The Novelty track, conducted as part of the Text Retrieval Conference (TREC) during 2002-2004, promoted the task of highlighting sentences containing relevant and new information in a topical document stream. The basic task is to return sentences that are both relevant and novel, given a topic and an ordered set of related documents on that topic, segmented into sentences. There are two major problems that participants must solve in this task. The first is identifying relevant sentences, while the second is identifying those relevant sentences that contain new information. The operational definition of "new" here is information that has not appeared previously in the topic's set of documents. Since each sentence adds to the user's knowledge, and later sentences are to be retrieved only if they contain new information, novelty detection can be looked upon as a filtering task. In many ways, the Novelty track can be viewed as the sentence-level analogue of the First Story Detection task.
As our focus is on sentence-extractive summarization, we concentrate on sentence-level novelty detection techniques in this thesis. Successful novelty detection techniques are employed in summarization to produce progressive summaries.
1.4 Evaluation
As with other language understanding technologies, evaluation offers many advantages to the field of text summarization. It can foster the creation of infrastructure and reusable resources, and it provides an environment for comparing peer results.
1.4.1 Summary Evaluation
Evaluation of a summary is a non-trivial task, principally because there is no "ideal" summary as such. Studies in the past [21] show that human summarizers tend to agree only about 60% of the time, and in only 82% of cases did humans agree with their own judgment. Also, there is always the possibility of a system generating a better summary that is quite different from the reference human summary used as an approximation of the ideal output summary.
Research in summarization evaluation has broadly been classified into two major categories, intrinsic and extrinsic. Intrinsic evaluation techniques test the summary in itself, usually through its content, readability and coherence. The second method, extrinsic evaluation, tests the summarization system based on how it affects other language processing tasks, like relevance assessment, reading assessment and text categorization. Intrinsic evaluation is the widely accepted mechanism for evaluating summaries throughout the literature; hence in this thesis we focus only on intrinsic evaluation of the content and coherence of a summary.
Evaluating Coherence: Summaries have two main characteristics, content and form. Coherence evaluation refers to quantifying the form of a summary. Coherence can be assessed by having humans grade summaries on some criteria. Extractive summaries usually cause coherence problems, such as dangling anaphors and a broken rhetorical structure. Subjects grade the coherence of a summary based on the presence of dangling anaphors, lack of preservation of integrity, and presence of incomplete statements in the text.
There have not been many efforts on automating the coherence/readability aspects of summary evaluation. [6] investigated the discourse-level constraints on adjacent sentences that are indicative of coherence. A recent study by [40] investigated the impact of linguistic, syntactic and discourse features on readability using the Wall Street Journal (WSJ) corpus.
Evaluating Content: Content evaluation refers to quantifying the informativeness of a summary. Measuring informativeness means assessing how much information from the source or a human-written summary is preserved by the system summary. It is the most accepted and popular evaluation criterion, used for comparing summarization systems at large scale. There exist manual metrics for content evaluation, like 'Pyramid Evaluation' and Content Responsiveness. Since manual methods are time consuming and non-repeatable, automated counterparts have been introduced that are inexpensive and repeatable. Over the years, research on automated content evaluation has produced useful evaluation tools like ROUGE and Basic Elements. A detailed description of these metrics is presented in Chapter 6.
1.4.1.1 Evaluation Workshops:
In the late 1990s, much interest and activity was aimed at developing multipurpose information systems. Several government organizations, like the Defense Advanced Research Projects Agency (DARPA) and NIST, started programmes focusing on Translingual Information Detection, Extraction and Summarization (TIDES), Text Retrieval (TREC) and First Story Detection (FSD), among others. These tasks required their own evaluation designs and data, thus creating an evaluation framework over the years. Initial TIDES workshops focused on document understanding and explored different ways of summarization. Additionally, the brainstorming sessions conducted during the workshops led to a focused evaluation effort in summarization, the Document Understanding Conference (DUC).
DUC was the first large-scale summarization evaluation forum; it provided a common ground for researchers to explore various approaches in summarization and evaluate them at large scale. DUC was organized from 2001 through 2007 by NIST and later transformed into the Text Analysis Conference (TAC). The first TAC workshop was conducted in 2008 by NIST, carrying on the tradition of DUC, Recognizing Textual Entailment (RTE) and the TREC Question Answering (Q&A) tracks. TAC workshops have designed interesting and challenging tasks for the summarization community, like Opinion Summarization in 2008, Update Summarization in 2009 and Guided Summarization in 2010. Many popular evaluation schemes, like ROUGE and Pyramids, besides novel summarization techniques, were developed during these workshops.
A task on automated evaluation of summaries of peers (AESOP) was introduced recently, during TAC 2009. Automated evaluation of content provides a platform for tracking incremental developments in state-of-the-art summarization systems. The purpose of the first edition of AESOP was to promote research and development of systems that evaluate the quality of content in summaries. The focus is on developing automatic metrics, like ROUGE, that act as surrogates for human evaluation.
1.4.2 Novelty Evaluation
The Novelty track conducted by TREC provided an ideal setting to evaluate sentence-level novelty detection techniques. The track is divided into four tasks: the first is to identify the relevant and novel sentences given the document set on a particular topic; the second is to identify all the novel sentences given the relevant sentences; the third is to identify relevant and novel sentences given the relevant sentences of the first five documents; and the final task is to retrieve novel sentences given all relevant sentences and the novel sentences from the first five documents. These tasks are designed such that participants can test their techniques at varying levels of training.
The series of Novelty workshops provides relevance and novelty judgments for a set of news articles in the AQUAINT corpus, from which the precision and recall of each technique are calculated. The sentences selected manually by the NIST assessors (the judgments) are considered the truth data, and are referred to as new relevant in the discussion below. Agreement between these sentences and those found by the systems is used as input for recall and precision:
precision = |new relevant ∩ sentences retrieved| / |sentences retrieved|

recall = |new relevant ∩ sentences retrieved| / |new relevant|
The official measure used by the Novelty track to assess the efficiency of a particular technique is its
F-measure,

F-measure = (2 × precision × recall) / (precision + recall)
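For concreteness, the computation can be sketched as follows; the sentence identifiers are hypothetical, and in the track both the assessor judgments and the system output are sets of sentence IDs:

```python
def precision_recall_f(new_relevant, retrieved):
    """TREC Novelty track metrics over sets of sentence identifiers."""
    overlap = len(new_relevant & retrieved)
    precision = overlap / len(retrieved) if retrieved else 0.0
    recall = overlap / len(new_relevant) if new_relevant else 0.0
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f

# Hypothetical assessor judgments (truth) and system output.
truth = {"s1", "s2", "s3", "s4"}
system = {"s2", "s3", "s5"}
p, r, f = precision_recall_f(truth, system)
```

Here two of the three retrieved sentences are in the truth data, so precision is 2/3 and recall is 2/4.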
1.5 Organization of the thesis
The main focus of this thesis is to produce informative and human-readable summaries of
a set of topically related documents, given that the user has prior knowledge of the
topic and has previously read some articles on it. In this thesis we also aim to estimate
sentence importance through an optimal combination of several sentence scoring features, without
any manual effort. The rest of this thesis is organized as follows.
Chapter 2 describes the motivation behind choosing this research problem and explains the
challenges involved. We define our problem statement and the exact goals of
this thesis. We briefly explain our approach to the underlying research problems and state the
major contributions of this thesis.
Chapter 3 provides a detailed survey of relevant literature in the context of this thesis. State-of-
the-art approaches in summarization, including lexical chains, graph-based models, and language
modeling approaches, are described in this chapter. We discuss representative previous
work in the field of machine learning as applied to summarization. Later in this chapter
we describe the levels of novelty detection and successful approaches in this field. At
the end of the chapter we describe efforts in the direction of temporal summarization,
which is close to the problem we address in this thesis.
In Chapter 4, we describe a general summarization framework, explaining the various stages of
a summarizer. We explain support vector regression and how it is useful in predicting
sentence importance. Details are provided about several sentence scoring features, both existing and newly
devised, that are used for producing summaries.
Chapter 5 describes the role and importance of novelty detection in progressive summariza-
tion. We describe how novelty detection is integrated into the summarization framework and why
it is important. The sections of this chapter describe the novelty detection techniques used
at the feature extraction, sentence ranking, and summary extraction stages of the summarizer.
Chapter 6 provides details about the data set and evaluation metrics used for the experi-
ments in this thesis. We discuss several experiments conducted to evaluate re-
gression and to determine the significance of the proposed novelty detection techniques over a generic
summarizer. Evaluation results of the experiments are compared to state-of-the-art approaches
in summarization.
Finally, Chapter 7 concludes this thesis, summarizing the work done and discussing
the results of our experiments. It also provides details about foreseeable future work arising from this
thesis.
Chapter 2
Problem Statement
2.1 Motivation
The Internet allows its users to follow virtually any news story of interest. There are
numerous news portals that periodically aggregate information about every category
and domain, in several languages. Unlike scientific articles and blogs, a news topic has multiple
information sources, paraphrasing the same information in various surface forms. Each news topic
has a particular longevity depending on its nature and popularity.
For instance, consider the news topic of “Michael Jackson's death”. On the first day of
reporting the incident, the topic started with an article about his tragic death due to excessive
drugs. Over a period of time, the news reports covered the details of police investigations,
the mourning of celebrities, financial troubles, details of the funeral, and so forth. To provide suf-
ficient background knowledge for the reader, news reporters usually include prior information
about the topic while describing new events or proceedings. Such reporting leads to repeti-
tion of information in future articles on the topic. Below, we provide snippets of news articles
from Reuters1 to illustrate our discussion [48].
Article 1 (On 26th June 2009):
The 50-year-old, whose towering legacy was tarnished by often bizarre behavior was pro-
nounced dead on Thursday in Los Angeles after going into cardiac arrest. An autopsy was
1. http://www.reuters.com/
conducted on Friday, and while investigators will not know results of toxicology tests for six to
eight weeks, speculation turned to his prescription drug use as a culprit. Mourning his death
were legions of fans around the world, including U.S. President Barack Obama, who called the
”Thriller” singer a ”spectacular performer” and offered his condolences to Jackson’s family.
Article 2 (On 27th June 2009):
The King of Pop died suddenly on Thursday at the age of 50, after a career spanning 40 years
that included the biggest-selling pop album of all time, ”Thriller.” Despite taking in hundreds
of millions of dollars as one of the most successful pop musicians of all time, Jackson racked
up about 500 million of debt, according to sources cited by The Wall Street Journal earlier this
month.
Article 3 (On 28th June 2009):
Jackson, 50, was stricken Thursday at his rented chateau in Holmby Hills, above Sunset Boule-
vard, and died after suffering what his brother Jermaine Jackson said was cardiac arrest.
Families who obtain a second autopsy often do so because they want to confirm the cause of
death. A second autopsy can also give relatives information much faster than an autopsy con-
ducted by law enforcement officials, said Michael Baden including the criminal trials of O.J.
Simpson and Phil Spector.
Articles 1, 2, and 3 were published by the same news source, ordered chronologically by
date of publication. In order to provide some context, articles 2 and 3 both included
prior information that is already known to the reader through a previous article (article 1).
Consider a scenario where the user intends to follow one such temporal news topic. The
topic has many related articles generated by numerous news aggregators, blogs, or stand-alone
news websites, and the volume of articles increases with time. Since the user cannot deal with this
huge amount of raw information, there is a great need for a summarizer that processes all these
articles and produces a targeted, informative summary of the topic. With a sophisticated
summarizer, a user can access information in the form of a summary instead of going
through all the articles, saving productive time. As the life span of these news topics
can range from weeks to months to years depending on their nature, the user is expected to use
the summarizer periodically to produce a summary of recent articles (since the previous
summary).
2.2 Problem Definition
Automatic summarization is an information access technique used to present only the most
important information from multiple documents, thereby reducing the need to refer to the source doc-
uments. A normal multi-document summarizer calculates the importance of a text unit solely in terms
of its relevance to the topic. In a real-world scenario, a reader needs to keep track of a popular
temporal topic, but a normal summarizer fails to produce a good summary, since it cannot
handle the prior information reported in earlier articles. Progressive summarization addresses
this problem by producing quality summaries that convey only the progression, or updates, on a
particular topic. A progressive summarizer measures the importance of a text unit both in terms
of its relevance to the topic and its novelty to the user.
In this thesis we aim to reduce the problem of information overload by periodically pro-
ducing multi-document summaries to update the user's knowledge. The goal here is to take
clusters of chronologically divided documents related to the same topic and generate short,
concise summaries that can be read in lieu of the original document set.
The most important factor in extracting important items for a progressive summary is the
identification of novel information. Progressive summarization requires differentiating be-
tween relevant and novel vs. non-relevant and novel vs. relevant and redundant informa-
tion. The summary needs to contain only relevant and novel information, which is feasible only
by combining efficient novelty detection methods with summarization. In this thesis, we
strive to devise efficient sentence-level novelty detection methods in the context of progressive
summarization.
2.3 Outline of the Solution
Summarization can be achieved either through abstraction or through extraction of information from
source documents. While abstractive summaries could provide a more readable and coherent sum-
mary, state-of-the-art systems are all extractive summarizers, owing to the robustness and scala-
bility of these approaches. Extractive approaches to summarization can employ various levels
of granularity, such as keywords, sentences, or paragraphs. As keywords hardly provide a read-
able summary, and paragraphs are unlikely to cover enough information under space constraints,
sentences have emerged as the most popular unit of text for summaries.
Extractive summarizers generally follow a sequential architecture that includes preprocess-
ing of text for sentence boundary identification and extraction; a feature extraction stage,
where several statistical, linguistic, and heuristic models are employed to score sentences; a
sentence ranking stage, which estimates sentence importance through a weighted linear
combination of the features; and finally summary extraction, during which a subset of the ranked
sentences is selected into the summary.
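A minimal sketch of this sequential architecture follows; the feature functions, weights, and sentences are illustrative placeholders, not the features used in this thesis:

```python
# Minimal sketch of the preprocessing -> features -> ranking -> extraction
# pipeline. Feature functions and weights are illustrative placeholders.

def preprocess(text):
    # Naive sentence boundary identification.
    return [s.strip() for s in text.split(".") if s.strip()]

def extract_features(sentence, topic_words):
    overlap = sum(w in sentence.lower() for w in topic_words)
    return [overlap, len(sentence.split())]

def rank(sentences, topic_words, weights=(1.0, 0.05)):
    # Weighted linear combination of the features.
    return sorted(((sum(w * f for w, f in zip(weights, extract_features(s, topic_words))), s)
                   for s in sentences), reverse=True)

def extract_summary(ranked, budget=2):
    return [s for _, s in ranked[:budget]]

doc = "Jackson died on Thursday. An autopsy was conducted. Fans mourned worldwide."
summary = extract_summary(rank(preprocess(doc), ["jackson", "autopsy"]))
```

Each stage is deliberately independent of the others, which is what allows novelty detection to be slotted in at different points of the pipeline later.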
Since the existing features succeed in capturing only relevance, we devise mul-
tiple new features to capture the novelty of a sentence along with its relevance. Novelty is an
inherently difficult phenomenon to operationalize; determining ground truth for novelty is
a more difficult task than for relevance. It is hard even for a human, who must try to re-
member everything he has read. We build alternative methods to incorporate novelty detection
into the conventional summarization framework. These methods are related to different
stages of summarization and evaluated against each other to find the best.
The use of more and more features for estimating sentence importance makes the weighted
linear combination a critical aspect of summarization. The process of determining the ideal
set of weights takes a good amount of human effort and ample experimentation. We
overcome this problem by employing a machine learning algorithm to learn the rank of a
sentence from a training set of features. Hence our approach is robust to the weights assigned
and the number of features employed. A detailed description of our summarizer and the newly
devised novelty detection methods is provided in chapters 4 and 5.
2.4 Contributions
The major contributions of this thesis, which include devising new methods for detecting
novelty and relevance, are provided below.
1. Regression is widely used in a broad spectrum of fields, including information retrieval, to
predict an unknown variable from a set of observed variables. In chapter 4 we successfully
use regression to estimate sentence importance from its feature vector. Detailed analysis
and experiments are carried out using several combinations of features. According to
evaluation results, the successful combinations are on par with the best summarizers.
2. We treat the problem of progressive summarization in a unique way by highlighting the
importance of having a novelty detection module in the summarization framework. In this
thesis we follow a systematic approach for comparing different novelty detection tech-
niques and relate them to various stages of a summarization framework.
3. To the best of our knowledge, the role of prepositions has never been explored in determining
the importance of a sentence. We identify that the frequency of prepositions implicitly
achieves the effect of Named Entity Recognition (NER) in a sentence. We develop a
new feature, PrepImp, that scores a sentence based on the frequency of prepositions it
contains.
4. Conventional scoring features capture only the relevance of a sentence. In chapter 5 we
devise new scoring features, Novelty Factor (NF) and Hybrid Kullback-Leibler Informa-
tion Divergence (HKLID), to capture the novelty of a sentence along with its relevance.
• NF is a statistical feature that measures the importance of a sentence in terms of the docu-
ment frequencies of the words it contains.
• HKLID is an extension of the popular KL divergence that scores a sentence based
on the divergence of its sentence and document language models from the prior cluster
of documents.
5. We make a new hypothesis, based on the statistics of the document collection, that new in-
formation often spans a group of sentences belonging to a context. Based on
this hypothesis we devise a new re-ranking measure, Proximity Re-ranking, that com-
putes the rank of a sentence based on the relevant novelty of its surrounding sentences.
6. We model novelty detection in the context of progressive summarization as an informa-
tion filtering task. Sentences that possibly contain prior information are filtered out of the
summary by creating a Novelty Pool (NP). The NP contains sentences whose words are
dominant in the new cluster of documents compared to the previous documents.
Chapter 3
Related Work
Summarization has been a popular area of research in information retrieval for a very long
time. The early work in summarization in the late 1950s and early 1960s by [36] [14] suggested
that text summarization by computer was feasible, though not trivial. Progress in language
processing, along with the exponential increase in computer memory and speed and the growing
presence of text on the web, renewed interest in automatic text summarization. In this thesis
we use summary as a generic term for text that is produced from one or more source texts, that
contains a significant portion of the information in the original texts, and that is no longer than
half of the original texts.
Effective summarization requires an explicit analysis of context and the purpose of sum-
maries. Text summarization has seen a lot of research in the past two decades, and the ap-
proaches have been categorised at many levels. Since it is not feasible to list the full
range of approaches in summarization, we provide below only the state-of-the-art and most popular
approaches.
3.1 State of the art approaches in Summarization
The introduction of summarization tracks at TAC and DUC allowed researchers to compare
their results and induced a notion of competition that resulted in an enormous increase in the
number of approaches. The spectrum of summarization approaches encompasses several cate-
gories, such as heuristic, discourse-based, machine learning, and language modeling
approaches, among others. Below we describe some of the popular approaches.
1. Lexical Chains: [5] describe a work that used a considerable amount of linguistic analysis
to perform the task of summarization. The authors describe the notion of cohesion in
text as a means of sticking together different parts of the text. Cohesion occurs not only at
the word level but also over word sequences, resulting in lexical chains. They made use of
lexical chains, sequences of related words in a text spanning short or long distances, to
identify important information. After segmenting the input text, lexical chains are identified,
and sentences containing strong lexical chains are selected for extraction.
Semantically related words and word sequences were identified in the document, and
several chains were extracted, which form a representation of the document. WordNet [39]
distance is used as the relatedness measure to find lexical chains.
2. Graph Spreading Activation: [37] describe a graph-based method to find similarities and
dissimilarities in pairs of documents. This is a topic-driven approach, with topics rep-
resented through a set of entry nodes in the graph. Each document is represented as a
graph, with each node representing the occurrence of a single word. Each node has sev-
eral links encoding its adjacency, semantic relatedness, and co-references with other nodes
in the graph. Once the graph is built, the search for semantically related text is propagated
from the entry nodes to the other nodes of the graph through spreading activation 1. Salient
words and phrases are initialized according to their TF-IDF scores. The weight of neigh-
boring nodes becomes an exponentially decaying function of the traversed path. Given
a pair of document graphs, the algorithm computes two scores reflecting the presence of
common and different nodes. Sentences having higher scores are highlighted, with the user
able to specify the number of sentences in the summary.
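The propagation step can be sketched as follows; the toy word graph, seed weights, and decay constant are illustrative assumptions, with activation decaying exponentially in the number of traversed links:

```python
from collections import deque

# Toy sketch of spreading activation: activation decays by a constant
# factor per traversed link, so it is an exponentially decaying function
# of path length from the entry (topic) nodes.

def spread(graph, entry_weights, decay=0.5):
    """graph: {node: [neighbors]}; entry_weights: initial TF-IDF-like scores."""
    activation = dict(entry_weights)
    queue = deque(entry_weights)
    while queue:
        node = queue.popleft()
        for nbr in graph.get(node, []):
            propagated = activation[node] * decay
            if propagated > activation.get(nbr, 0.0):
                activation[nbr] = propagated
                queue.append(nbr)
    return activation

g = {"jackson": ["died", "singer"], "died": ["thursday"], "singer": [], "thursday": []}
act = spread(g, {"jackson": 1.0})
```

Nodes one hop from the entry node receive half its activation, two hops a quarter, and so on.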
3. Centroid-Based Summarization: [42] exploited the use of cluster centroids to summa-
rize documents. News articles describing the same event are grouped together using an
agglomerative clustering algorithm that operates over TF-IDF vector representations of the
1. The name spreading activation is borrowed from a method used in information retrieval to expand the search vocabulary.
documents. The centroids of these clusters are then used to identify sentences that are central
to the topic of the cluster.
Two metrics, cluster-based relative utility (CBRU) and cross-sentence information sub-
sumption (CSIS), are introduced to calculate the importance of a sentence. Three sentence-
level features, centroid value, positional value, and first-sentence overlap, are used to approx-
imate these metrics. The final score of each sentence, computed by combining
these scores along with a redundancy penalty, is used for ranking sentences. The
approach is well known as MEAD, and has been open-sourced for research purposes.
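The centroid-value feature can be sketched as follows; the cluster centroid is the average TF-IDF vector of its documents, and a sentence scores the sum of the centroid values of its words. The tiny corpus and the TF-IDF variant are illustrative assumptions:

```python
import math
from collections import Counter

# Sketch of the centroid-value feature: build the average TF-IDF vector
# of a cluster, then score a sentence by the centroid values of its words.

docs = [
    "jackson died of cardiac arrest",
    "jackson was pronounced dead in los angeles",
]

def tfidf_vectors(docs):
    tokenized = [d.split() for d in docs]
    df = Counter(w for toks in tokenized for w in set(toks))
    n = len(tokenized)
    return [{w: tf * math.log(1 + n / df[w]) for w, tf in Counter(toks).items()}
            for toks in tokenized]

def centroid(vectors):
    c = Counter()
    for v in vectors:
        c.update(v)          # Counter.update adds values per key.
    return {w: s / len(vectors) for w, s in c.items()}

def centroid_score(sentence, c):
    return sum(c.get(w, 0.0) for w in sentence.split())

c = centroid(tfidf_vectors(docs))
```

Sentences mentioning words that are central to the whole cluster (here, "jackson") score highest, which is the intuition behind MEAD's centroid value.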
4. Probabilistic language models for summarization: [28] define summarization in terms
of a probabilistic language model and use the definition to automatically generate topic
hierarchies. The authors use a language model to characterize the documents to be sum-
marized and then apply a graph-theoretic algorithm to determine the best topic words for
the summary. An approximation of the relative entropy (KL divergence) with a bi-
gram model is used to compare the language models of the topic set to a general English corpus.
Language models are used to define the 'topicality' and 'predictiveness' of a word, which re-
flect topic orientedness and the existence of subtopic hierarchies for the word.
More recently, Jagarlamudi [20] has shown how a relevance-based language modeling
paradigm can be applied to query-focused multi-document summarization through the Prob-
abilistic Hyperspace Analogue to Language model (PHAL). PHAL is a natural
extension of the Hyperspace Analogue to Language model, as term co-occurrence counts
can be used to define conditional probabilities. PHAL can be interpreted as: given a
word w, what is the probability of observing another word w′ with w in a window of size
K. Details about PHAL can be found in chapter 4.
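The windowed co-occurrence probability underlying this interpretation can be sketched as follows; the corpus and window size are illustrative, and this is only the co-occurrence estimate, not the full PHAL model:

```python
from collections import Counter

# Sketch of the windowed co-occurrence probability: P(w' | w) estimated
# from counts of w' appearing within K words of each occurrence of w.

def cooccurrence_prob(tokens, w, K=2):
    pair_counts, w_count = Counter(), 0
    for i, tok in enumerate(tokens):
        if tok == w:
            w_count += 1
            window = tokens[max(0, i - K):i] + tokens[i + 1:i + 1 + K]
            pair_counts.update(window)
    return {wp: c / w_count for wp, c in pair_counts.items()} if w_count else {}

tokens = "the king of pop died the king of rock lived".split()
probs = cooccurrence_prob(tokens, "king", K=2)
```

Since "of" follows both occurrences of "king", its conditional probability is 1.0, while "pop", seen near only one occurrence, gets 0.5.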
5. Other Approaches: There are also some unconventional approaches that investigate the
details that underlie the summarization process rather than aiming to build a full summa-
rization system. [51] present a system that generated headline-style summaries for
publicly available news articles from Reuters and the Associated Press. The system learned
statistical models of the relationship between source text units and headline units. It at-
tempted to model both the order and the likelihood of the appearance of tokens in the
target documents.
For content selection, a translation model was learned between a document and its sum-
mary. This model, in the simplest case, can be thought of as a mapping between a word
in the document and the likelihood of some word appearing in the summary. A bigram
model was used for surface realization. Viterbi beam search was used to efficiently find
a near-optimal summary. The Markov assumption was violated by using backtracking at
every state to strongly discourage paths that repeated terms. The two models were used
to co-constrain each other during the search in the summary generation task.
Sentence position information is a simple but powerful heuristic in summarization.
Sentence position has been extensively studied since its introduction to summarization
by [14]. [32] empirically characterized the position feature as genre-dependent
and derived a position policy, an ordering of priority of sentence importance. Most
recently, [24] described a Sub-optimal Sentence Position Policy (SPP) based on pyramid
annotation data and implemented the SPP as an algorithm to show that a position policy
thus formed is a good representative of the genre and thus performs well above median
performance.
3.1.1 Machine Learning Approaches
Recent advances in the field of machine learning have been adapted to summarization
throughout the literature to identify important sentences. Representative work in this area
includes:
1. Naive Bayes Methods: [26] modeled the summarization process as a classification problem,
where sentences are classified as summary or non-summary sentences based on a set of
features, using a naive Bayes classifier. Let s be a particular sentence, S the set of sentences
that make up the summary, and F1, ..., Fk the features. Assuming independence of the
features, the importance of each sentence is computed as:

P(s ∈ S | F1, ..., Fk) = ( ∏_{i=1}^{k} P(Fi | s ∈ S) · P(s ∈ S) ) / ∏_{i=1}^{k} P(Fi)
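This score can be sketched directly; the binary feature and all probability estimates below are hypothetical values that would normally be estimated from a labeled corpus:

```python
import math

# Sketch of the naive-Bayes sentence score: product of per-feature
# likelihood ratios times the prior, computed in log space for stability.

def nb_score(feature_values, p_f_given_summary, p_f, p_summary):
    """P(s in S | F1..Fk) = prod_i P(Fi | s in S) * P(s in S) / prod_i P(Fi)."""
    log_score = math.log(p_summary)
    for f, v in feature_values.items():
        log_score += math.log(p_f_given_summary[f][v]) - math.log(p_f[f][v])
    return math.exp(log_score)

# Hypothetical estimates: feature -> {feature value: probability}.
p_f_given_s = {"has_cue_phrase": {True: 0.7, False: 0.3}}
p_f = {"has_cue_phrase": {True: 0.4, False: 0.6}}
score = nb_score({"has_cue_phrase": True}, p_f_given_s, p_f, p_summary=0.2)
```

A feature value that is more common in summary sentences than overall (0.7 vs. 0.4 here) raises the score above the prior.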
[4] also incorporated naive Bayes with a rich feature set to derive signature words 2. The au-
thors also employed some shallow discourse analysis, such as references to the same entities in
the text, to maintain cohesion. The references were resolved at a very shallow level, by
linking name aliases within a document.
2. Neural Networks: [47] propose an algorithm based on neural nets and the use of third-
party data sets to tackle the problem of extractive summarization. A trained model is
built from the labels and features of each sentence of an article, which could then infer the
proper ranking of sentences in a test document. The ranking was accomplished using
RankNet [8], a pair-based neural network algorithm designed to rank a set of inputs, which
uses gradient descent for training. The similarity score between a human-written
judgment and a sentence in the training document is used as a soft label for training.
The novelty of this framework lay in the use of features that derived information from query
logs of Microsoft's news search engine 3 and Wikipedia 4 entries. The authors conjec-
ture that if a document sentence contains keywords used in the news search engine, or
entities found in Wikipedia articles, then there is a greater chance of having that sentence
in the summary.
3. Hidden Markov Models: [11] modeled the problem of extracting a sentence from a doc-
ument using a hidden Markov model (HMM). The basic motivation for using a sequential
model is to account for local dependencies between sentences. The HMM contained
2s + 1 states, alternating between s summary states and s + 1 non-summary states. The au-
thors allowed “hesitation” only in non-summary states and “skipping the next state” only in
summary states. The authors obtained the maximum-likelihood estimate for each tran-
sition probability, forming the transition matrix estimate M, whose element (i, j) is the
empirical probability of transitioning from state i to state j.
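The maximum-likelihood transition estimate can be sketched as follows; the labeled state sequences (0 = non-summary, 1 = summary) are illustrative:

```python
from collections import Counter

# Sketch of the ML transition-matrix estimate: M[i][j] is the empirical
# probability of moving from state i to state j, counted over labeled
# state sequences.

def transition_mle(sequences, states):
    counts = {i: Counter() for i in states}
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):
            counts[a][b] += 1
    M = {}
    for i in states:
        total = sum(counts[i].values())
        M[i] = {j: (counts[i][j] / total if total else 0.0) for j in states}
    return M

# Toy labeled sequences: 0 = non-summary state, 1 = summary state.
M = transition_mle([[0, 1, 0, 0, 1], [0, 0, 1, 1]], states=[0, 1])
```

Each row of M is a probability distribution over successor states, normalized by how often state i was left.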
Associated with each state i is an output function, b_i(O) = Pr(O | state i), where O is an
observed vector of features. They made a simplifying assumption that the features are
multivariate normal. The output function for each state was thus estimated by using the
training data to compute the maximum likelihood estimate of its mean and covariance
matrix.

2. Words that indicate key concepts in a document.
3. search.live.com/news
4. www.wikipedia.org
CLASSY [13], the best system at DUC 2004 and MSE 2005, also uses a hidden Markov
model for selecting sentences from each document and a pivoted QR algorithm for gen-
erating a multi-document summary.
3.2 Novelty Detection Approaches
Progressive summaries are generated at regular time intervals to update the user's knowledge
about a particular topic. Novelty detection is an inherent component of progressive summa-
rization, used to identify sentences containing relevant and new information. First Story Detection
in the TDT task allowed many researchers to work on the problem of event-level novelty detec-
tion. Since we deal with sentence-level extractive summarization, we cite here some influential
work in sentence-level novelty detection. Most of the techniques listed here were developed
during the TREC Novelty track.
[30] proposed a novelty detection approach based on the identification of sentence-level
information patterns. The approach is motivated by the intuition that information patterns in
sentences, such as combinations of query words, sentence lengths, named entities and phrases,
and other sentence patterns, may carry more important and relevant information than single
words. The proposed novelty detection approach focuses on the identification of previously
unseen query-related patterns in sentences. Specifically, a query is preprocessed and repre-
sented with patterns that include both query words and required answer types. These patterns
are used to retrieve sentences, which are then determined to be novel if it is likely that a new
answer is present.
[44] demonstrated the importance of context in novelty detection systems. The idea stems
from the fact that novelty often comes in bursts, which is not surprising since articles are
composed of some number of smaller, coherent segments. Each segment is started by some
kind of introductory passage, and that is where the authors expect to find the novel words. Novel
words are identified by comparing the current sentence's words against a table of all words
seen in the input up to that point. Subsequent passages are likely to continue the novel discussion
whether or not they contain novel words; they may contain pronominal or other
anaphoric references to the novel entity. In order to determine whether information within
a sentence has been seen in material read previously, the authors integrate information about the
context of the sentence with the novel words and named entities within the sentence, and use a
specialized learning algorithm to tune the system parameters.
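The seen-words table at the core of this idea can be sketched as follows; scoring a sentence by the fraction of previously unseen words is an illustrative simplification, and the sentence stream is hypothetical:

```python
# Sketch of the seen-words table: each sentence's novelty is the fraction
# of its words not seen in any earlier sentence of the stream.

def novelty_scores(sentences):
    seen, scores = set(), []
    for s in sentences:
        words = s.lower().split()
        new = [w for w in words if w not in seen]
        scores.append(len(new) / len(words) if words else 0.0)
        seen.update(words)
    return scores

stream = ["jackson died on thursday",
          "jackson died after cardiac arrest",
          "jackson died on thursday"]
scores = novelty_scores(stream)
```

The first sentence is fully novel, the second is partially novel, and the verbatim repeat at the end scores zero; as the surrounding text notes, anaphoric continuations of a novel discussion would be missed by word counting alone, which is why context is needed.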
In addition to straightforward counts of named entities and noun phrases in a sentence, [15]
tried several experiments, one using synonyms in addition to the words for novelty compar-
isons, and one using word sense disambiguation. They expanded all noun phrases using
WordNet [39] and used the corresponding synsets for comparisons. [49] utilized a method based on
variants of employing an information retrieval (IR) system to find relevant and novel sentences.
A sentence is treated as a query over a reference corpus, and similarity between sentences
is measured in terms of the weighting vectors of the document lists ranked by the IR system. A dy-
namic threshold-setting approach, based on the percentage of relevant sentences within a
document set, is used to decide thresholds for extracting novel sentences. [12] used their
hidden Markov model based sentence retrieval model [11] for extracting relevant sentences, and
tested pivoted QR decomposition 5 and the Maximal Marginal Relevance algorithm [9] to identify
a set of sentences containing new information.
Unlike other works on novelty detection, [2] investigated the sensitivity of novelty detec-
tion to the presence of non-relevant sentences in the documents. The authors explored the task of
the TREC Novelty track in much greater depth than was done for the TREC workshop, with
substantial focus on the problem of how novelty detection degrades as the quality of relevant in-
formation drops. They experimented with three well-known retrieval models: the vector space
model with tf-idf weighting [43], a language modeling approach with the KL divergence [25]
as the scoring function, and a two-stage language modeling approach [53]. For detecting novelty,
the authors used several measures, including simple new-word counts; a cosine distance met-
ric, where the negative of the cosine of the angle between a sentence vector and each previously
seen sentence vector determines the novelty score for that sentence; and language-model
based novelty measures with interpolated, Dirichlet, and shrinkage smoothing models. These
models differ in the language models they compare while measuring KL divergence.

5. QR decomposition of a matrix is a decomposition of the matrix into an orthogonal and an upper triangular matrix, often used to solve the linear least squares problem; it is the basis for a particular eigenvalue algorithm.
The difference between the two groups of measures is that one just counts words while the
other looks at the distribution of words. When non-relevant sentences are added, the probabil-
ity distribution of the vocabulary shifts, so that arriving sentences have more and more dissimilar
distributions, suggesting that they are novel. On the other hand, word counting approaches are
less distracted by the new words. Relevant sentences that are not novel will generally reuse vo-
cabulary from earlier relevant sentences, and will not be sidetracked by the random vocabulary
introduced by the non-relevant sentences. The authors anticipate that as the density of relevant
documents drops, the word counting measures will continue to perform best.
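The cosine-based novelty measure mentioned above can be sketched as follows; using bag-of-words vectors and expressing the score as one minus the maximum similarity to previously seen sentences (equivalent to taking the most negative cosine) is an illustrative simplification:

```python
import math
from collections import Counter

# Sketch of cosine-based novelty: a sentence is novel to the extent that
# its bag-of-words vector is dissimilar to every previously seen sentence.

def cosine(a, b):
    dot = sum(a[w] * b.get(w, 0) for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def cosine_novelty(sentence, seen):
    vec = Counter(sentence.lower().split())
    seen_vecs = [Counter(s.lower().split()) for s in seen]
    return 1.0 - max((cosine(vec, sv) for sv in seen_vecs), default=0.0)

seen = ["jackson died on thursday"]
```

An exact repeat of a seen sentence gets novelty 0, while a sentence sharing no vocabulary gets novelty 1; unlike the new-word counts, this measure responds to the whole distribution of words.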
3.3 Approaches to Progressive/Temporal Summarization
Progressive summarization shares much similarity with temporal, or time-based, processing
of news topics in summarization. Regular summarizers deal with a static set of documents,
but a progressive summarizer receives a stream of news articles, so the document set is dy-
namic. Progressive summarization is a relatively new area of research within summarization,
which gained a lot of focus through the introduction of the “Update Summarization” track at
DUC 2007. We use the term progressive summarization in lieu of update or temporal sum-
marization in this thesis, since the term 'progressive' expresses the essence of the task in a better
manner: the user is updated with the progress of events in the topic, hence “pro-
gressive summarization”. We present here some influential work in this direction from the
recent past.
[9] is the first known work combining query relevance and information novelty in the context
of retrieval and summarization. The authors coined the term “relevant novelty” and explained the
need to compute the importance of an element through a combined criterion of query relevance
and information novelty. The linear combination of independently measured
relevance and novelty of an element is referred to as its “marginal relevance”. The method
described in this work strives to maximize marginal relevance, and is hence labeled “Maximal
Marginal Relevance (MMR)”. The MMR criterion for multi-document summarization is:
MMR = arg max_{si ∈ S} [ λ · Sim1(si, Q) − (1 − λ) · max_{sj ∈ S} Sim2(si, sj) ]
where S is the set of sentences in the document cluster, Q is the information need of the user
represented as a query, Sim1 is the similarity metric measuring the relevance of a sentence to the
query, and Sim2 can be the same as Sim1 or a different metric. For intermediate values of
the parameter λ in the interval [0,1], a linear combination of both relevance and novelty is
optimized. Users wishing to sample the information space around the query, emphasizing novelty,
should set λ to a smaller value, while those wishing to focus on the most relevant sentences
should set λ closer to 1.
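Greedy selection under the MMR criterion can be sketched as follows; using word-overlap (Jaccard) similarity for both Sim1 and Sim2, and the particular sentences, query, and λ value, are illustrative assumptions:

```python
# Greedy sketch of MMR selection: at each step, pick the sentence that
# maximizes lambda * relevance - (1 - lambda) * similarity to the
# already-selected sentences.

def jaccard(a, b):
    a, b = set(a.split()), set(b.split())
    return len(a & b) / len(a | b) if a | b else 0.0

def mmr_select(sentences, query, k=2, lam=0.7):
    selected = []
    candidates = list(sentences)
    while candidates and len(selected) < k:
        best = max(candidates,
                   key=lambda s: lam * jaccard(s, query)
                   - (1 - lam) * max((jaccard(s, t) for t in selected), default=0.0))
        selected.append(best)
        candidates.remove(best)
    return selected

sents = ["jackson died on thursday",
         "jackson died on thursday morning",
         "fans mourned jackson worldwide"]
picked = mmr_select(sents, "jackson died", k=2)
```

Although the second sentence is more relevant to the query than the third, its redundancy with the first selection is penalized, so the third sentence is chosen instead.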
[1] uses a language model based approach to produce a revised summary at regular time
intervals. The goal is to model the topic and events from sentences and to identify the occurrence
of new events (novelty) within the topic (usefulness). The authors proposed different language
models for characterizing usefulness and novelty and combined them into a single measure
of interestingness.
[35] modeled summarization using information distance theory and produced summaries with minimal conditional information distance to the prior document set. Summarization is converted into an optimization problem constrained by the summary’s information content, and solved by approximating Kolmogorov complexity. [17] describes a maximum coverage model for summarization inspired by the well-known set cover problem6. Simple word bigrams valued by their document frequency are modeled as concepts. Sentences are selected into the summary such that they best cover the relevant concepts in the document set. The maximization problem is approximately solved using Integer Linear Programming (ILP). The value of a summary is computed as the number of unique concepts it contains, thus limiting redundancy implicitly and providing scope for novelty. The authors hypothesize that articles about topics that have already been in the news tend to state new information first before recapping past details. Sentence position is used as a unique feature to identify novel sentences.
6www.wikipedia.org/wiki/Set_cover_problem
[7] proposes a statistical method based on maximization-minimization of similarity measures between sentence vectors. Cross-summary sentence redundancy is minimized to limit the redundancy of the progressive summary with the previous summary, while the newness of the information in the summary is maximized. Sentences close to the topic description are chosen to sustain relevance.
In this thesis we address progressive summarization by devising multiple novelty detection techniques at various summarization stages and combining them to generate an informative summary. Unlike most of the previous work, our approach has the advantage of using more than one technique to detect novel information, integrated within the summarization framework.
Chapter 4
Supervised sentence ranking using Regression
4.1 Summarization Framework
Summarization can be viewed from different perspectives: as a decision theory problem, as a problem of classifying summary and non-summary sentences, as a data compression problem of lossy and lossless compression of sentences, or as an Information Retrieval problem of extracting relevant sentences. In this thesis we use a general model allowing these different views to be implemented as individual features of the summarization framework. We use a machine learning method (regression) to combine all these distinctive features and produce a final informative summary. In this chapter, we explain the methodology of our framework and provide details about the various features used in our experiments.
4.1.1 Stages of Framework
As the focus of this thesis is only on extractive summarization, the term summarization/summarizer implies sentence-extractive multi-document summarization. The model of our summarizer is inspired by the MEAD architecture, an elaborate publicly available platform for multi-document multilingual text summarization [41]. The flexible nature of our framework allows us to implement arbitrary algorithms in a standardized manner. Our summarizer has four major modules:
• Pre-processing:
Articles collected from the web or any publicly available corpus contain unnecessary article headers and HTML tags that provide no information about the article. Each article is represented as a document in the framework and parsed to extract the content. Standard sentence boundary identifiers and word breakers are used to split each document into sentences. Stop words are removed from sentences, and the Porter stemmer is used to derive root words by stripping suffixes. Figure 4.1 shows a sample news article from the AQUAINT corpus1.
• Scoring/Feature Extraction:
Sentences extracted during the pre-processing stage are considered the units of the summary. Each sentence is assigned scores by several scoring features, reflecting its relevance on either a positive or a negative scale. These features may include probabilistic language models, heuristics derived from the corpus, entropy based measures, statistical information about the data, and linguistic and knowledge based measures, among others. Usually more than one feature is used in scoring to attain robustness. A close look at the sample news article (figure 4.1) reveals that the important information is often conveyed in either the top or bottom parts of the article. Since all the articles in the cluster are relevant to the topic, the importance of a concept is directly proportional to its occurrence across articles. These observations are leveraged as features (DFS, SFS, SL1, SL2) and described in detail in section 4.3.
Features are pluggable components of the framework; hence each combination of features becomes a unique configuration of the summarizer. The multiple feature scores of each sentence are combined into a single rank in the Sentence Ranking stage.
• Sentence Ranking:
The rank of a sentence is directly proportional to its importance and decides its membership in the summary. Conventionally, sentence rank is computed as a weighted linear combination of feature scores. As the feature space grows, it becomes very difficult to come up with an optimal set of weights for the combination.
To overcome the cost of numerous experiments and the manual effort of finding an ideal weight combination, we use a machine learning technique to estimate the rank of a sentence. The ranking procedure is explained in detail in section 4.2.
1The AQUAINT corpus consists of newswire text data in English, drawn from three sources: the Xinhua News Service (People’s Republic of China), the New York Times News Service, and the Associated Press Worldstream News Service. It was prepared by the LDC for the AQUAINT Project and is used in official benchmark evaluations conducted by the National Institute of Standards and Technology (NIST).
Figure 4.1 Sample news article from AQUAINT news corpus
• Summary Extraction:
Summary extraction is the final stage of summarization, where a subset of ranked sentences is selected into the summary until the desired summary length is reached. Only sentences complying with several constraints, such as minimum/maximum length and minimal redundancy with the already produced summary, are selected. As a result, the summary covers a wide range of aspects of the topic. Uninformative phrases are removed from sentences using simple heuristics and a minimal set of rules. Sentences are re-ordered based on their order of occurrence in the documents to improve the readability of the summary.
A pictorial representation of all the stages in extractive summarization is presented in figure 4.2.
Figure 4.2 Stages in a Multi Document Summarizer
4.2 Feature combination using Support Vector Regression
Recent advances in the field of machine learning have been adapted to summarization in the literature, using several features to identify sentence importance. Previously, machine learning models such as neural networks [47], naive Bayes classifiers [26], hidden Markov models [11], and most recently gradient boosted decision trees [38] have been used for sentence ranking. In this work we experiment with a popular machine learning technique, regression, to predict sentence importance.
4.2.1 Motivation
Regression is a statistical technique to model a dependent variable from a set of independent variables. It is very popular in forecasting and prediction tasks and is used over a broad spectrum of areas such as finance [50], biology, and weather prediction [19], among others. It has also been used in various Information Retrieval and Information Extraction tasks [27] [54].
Regression techniques are relatively less explored than other machine learning algorithms in the context of summarization. While classification approaches classify a sentence as relevant or non-relevant, regression predicts the exact real value of sentence importance. Other popular machine learning approaches like gradient boosted decision trees and neural networks become intractable as the feature space grows. The fact that regression techniques have been shown to perform on par with them [38] encourages us to use regression for predicting sentence importance. To the best of our knowledge, [29] and [45] are the only prior works that have used regression for predicting sentence importance. Our work goes beyond [45] by proposing more powerful features that are better predictors of sentence relevance, according to the evaluation results of summaries (Chapter 6). We also extend the regression SVM to predict sentence importance for progressive summaries.
Regression using support vectors is called Support Vector Regression (SVR). In the following sections we briefly explain SVR, our sentence importance estimation, and our summary extraction algorithms.
4.2.2 Support Vector Regression (SVR)
Regression analysis refers to techniques for predicting a real valued dependent variable from one or more independent variables. We model sentence importance as the dependent variable and the vector of feature scores as the independent variables. The theory behind support vector regression is briefly outlined below.
Consider the problem of approximating the set of training data
T = {(F1, i1), (F2, i2), ..., (Fs, is)} ⊂ F × R
where F is the space of feature vectors and R is the set of real numbers. A tuple (Fs, is) represents the feature vector Fs and importance score is of sentence s. Each sample is approximated by a linear function q(f) = ⟨w, f⟩ + b, with w ∈ F, b ∈ R. The optimal regression function is given by the minimum of the functional
Φ(w, ξ) = (1/2) ‖w‖² + C Σi (ξi⁻ + ξi⁺)
where C is a pre-specified value, and ξi⁻, ξi⁺ are slack variables representing upper and lower constraints on the outputs of the system.
Like other machine learning algorithms, support vector regression has two phases, training and testing. During the training phase we compute the feature vector of each sentence along with its importance. In the testing phase, feature vectors of all sentences are generated and the corresponding sentence importance is assessed by the trained model.
4.2.2.1 Sentence Importance Estimation
The importance score (is) is not pre-defined for sentences in the training data. We estimate it using gold-standard, human-written summaries (also known as models) on that topic.
ROUGE [33] is a recall oriented metric which automatically evaluates machine generated summaries based on their overlap with models. ROUGE-2 and ROUGE-SU4 scores correlate highly with human evaluation [31]. Hence we make the safe assumption that the importance of a sentence is directly proportional to its overlap with model summaries, and estimate sentence importance as the ROUGE-2 score of that sentence. The importance of a sentence s, denoted by is, is computed as follows:
is = ( Σ_{m ∈ models} |Bigram_m ∩ Bigram_s| ) / |s|    (4.1)
|Bigram_m ∩ Bigram_s| is the number of bigrams shared by model m and sentence s. This count is normalized by the sentence length |s|. The number of models may vary depending upon the resources. A more detailed description of ROUGE is provided in chapter 6, in our discussion of the content evaluation metrics for summarization.
4.2.3 Feature Combination
Sentence scores from different features are combined to compute the final rank of a sentence. Normally, feature scores are manually weighted to calculate the rank value. With SVR, the whole process is automated in three steps:
• Sentence tuple generation: Feature values of every sentence are extracted and its importance (is) is estimated as described in Section 4.2.2.1. Each sentence s in the training data is converted into a tuple of the form (Fs, is). Details about these features are given in Section 4.3. Fs is the vector of feature values of the sentence, Fs = {f1, f2, f3}. All the sentences in the document set are projected as sample points in this feature space.
• Model building: During this phase, a linear regression model q is built over the training vectors. The parameters of the regression model are not fine-tuned on the training data, in order to attain robustness. We used the epsilon-SVR component of the LibSVM package [10] for this purpose.
• Rank prediction: The importance of a sentence in the testing dataset is predicted by the trained model q. The estimated importance value is taken as the final rank of the sentence for further processing:
is = q(Fs)
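The three steps can be sketched end to end with a toy linear epsilon-SVR. This is a minimal pure-Python stand-in for LibSVM's epsilon-SVR, trained by subgradient descent on the primal objective given above; the feature tuples and all hyperparameter values are made up for illustration.

```python
def fit_linear_svr(X, y, epsilon=0.01, C=1.0, lr=0.01, epochs=2000):
    """Tiny linear epsilon-SVR: minimizes 0.5*||w||^2 + C * sum of
    epsilon-insensitive errors, by subgradient descent."""
    d = len(X[0])
    w = [0.0] * d
    b = 0.0
    for _ in range(epochs):
        gw = list(w)          # gradient of the 0.5*||w||^2 term is w itself
        gb = 0.0
        for xi, yi in zip(X, y):
            err = sum(wj * xj for wj, xj in zip(w, xi)) + b - yi
            if err > epsilon:        # prediction above the epsilon tube
                for j in range(d):
                    gw[j] += C * xi[j]
                gb += C
            elif err < -epsilon:     # prediction below the epsilon tube
                for j in range(d):
                    gw[j] -= C * xi[j]
                gb -= C
        w = [wj - lr * gj for wj, gj in zip(w, gw)]
        b -= lr * gb
    return w, b

def predict(w, b, x):
    """Rank prediction step: i_s = q(F_s)."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b

# Toy training tuples (F_s, i_s): feature vectors and estimated importances.
X_train = [[0.9, 0.8], [0.2, 0.1], [0.6, 0.5]]
i_train = [0.85, 0.10, 0.50]
w, b = fit_linear_svr(X_train, i_train)
```

A sentence with stronger feature scores then receives a higher predicted rank than one with weak scores.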
4.3 Extraction of sentence relevancy features
We describe here the several sentence scoring features used as part of this work. Some of these features are devised by us while others are inspired by previous work and implemented as part of our experiments. Motivation for these specific features is drawn from the observations made on news articles, as explained in section 4.1.1. Since we have a machine learning algorithm at our disposal to carry out the tedious job of combining features and scoring sentences, we are able to run numerous experiments without worrying about finding ideal weight combinations.
4.3.1 Sentence position
The position of a sentence in a document, or the position of a word in a sentence, gives good clues to the importance of the sentence or word respectively. Such features are called locational features. Locational features have been consistently used to identify the salience of a sentence; they are well studied and still used in most state-of-the-art summarization systems [24] [23]. We use the location information of a sentence in two separate ways to score a sentence.
Sentence Location 1 (SL1):
Sentence position is a very old and popular feature in summarization [14]. It relies on the presence of key sentences at specific locations in the text. According to our analysis of oracle summaries (Section 4.3.8), nearly 40% of all the sentences in the oracle summaries are picked from among the first three sentences of each document. This allows us to assume that the first three sentences of a document generally contain the most informative content of that document. We propose our first feature, Sentence Location 1:
SL1(s_nd) = 1 − n/N   if n ≤ 3
SL1(s_nd) = n/N       otherwise
where SL1(s_nd) is the score of a sentence s at position n in document d and N is the total number of sentences in the document collection. SL1 scores sentences such that
SL1(s_1d) > SL1(s_2d) > SL1(s_3d) ≫ SL1(s_nd)
Sentence Location 2 (SL2):
SL1 is a corpus-sensitive feature; it works under the heuristic that the most informative content lies at the head of a document. This heuristic works well in most cases, especially in the news genre, but it might not hold for other genres of documents such as novels or books. Sentence Location 2 (SL2) is a corpus-independent feature that assigns the positional index of a sentence in the document as its feature value. The trained model will learn the optimal sentence position for the corpus based on its genre, which need not be the head sentences. Hence this feature is not inclined towards the top or bottom few sentences of a document like SL1.
SL2(s_nd) = n
where s_n is the nth sentence in document d. SL2 is a very simple feature, but it is as effective as SL1 in determining sentence relevance.
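Both positional features reduce to one-liners. A sketch, assuming n is the 1-based sentence position and N the total sentence count as defined above:

```python
def sl1(n, N):
    """SL1: high scores for the first three sentences, per the
    head-of-document heuristic; later sentences score n/N."""
    return 1 - n / N if n <= 3 else n / N

def sl2(n):
    """SL2: the raw positional index; the regression model learns
    its optimal weight for the corpus genre."""
    return n
```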
4.3.2 TF-IDF
TF-IDF is a popular information retrieval technique to measure document relevance and consequently rank documents. We use a similar technique to measure relevance at the sentence level. The term frequency (TF) of a term t_i in a document d_j is the ratio of the number of times it occurs in d_j (n_i,j) to the total number of terms in d_j:
TF_i,j = n_i,j / Σ_k n_k,j
The inverse document frequency (IDF) of a term t_i is the logarithm of the ratio of the total number of documents in the cluster |D| to the number of documents in which the term occurs:
IDF_i = log ( |D| / |{d : t_i ∈ d}| )
While TF measures the importance of a term in a particular document, IDF measures the exclusiveness/informativeness of that term. The product of TF and IDF gives an overall measure of the salience of the word. The final score of a sentence s in document d_j is the average TF-IDF value of all the terms it contains:
TF-IDF(s) = ( Σ_{i ∈ s} TF_i,j · IDF_i ) / |s|
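As a sketch, the sentence-level TF-IDF score might be computed as follows, where the sentence and document are token lists and the corpus is a list of token lists (this tokenized representation is our assumption for illustration):

```python
import math
from collections import Counter

def tfidf_sentence_score(sentence, document, corpus):
    """Average TF-IDF of a sentence's terms; `document` is the token list
    of the document containing the sentence, `corpus` the document cluster."""
    tf = Counter(document)
    total = sum(tf.values())

    def idf(term):
        df = sum(1 for d in corpus if term in d)
        return math.log(len(corpus) / df) if df else 0.0

    if not sentence:
        return 0.0
    return sum((tf[w] / total) * idf(w) for w in sentence) / len(sentence)
```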
4.3.3 Document Frequency Score (DFS):
In conventional IR the document set is a mixture of relevant and non-relevant documents; hence IDF serves as a distinguishing feature between them. IDF is not useful in summarization, since the document collection consists only of relevant documents on a particular topic. [45] devised the Document Frequency Score (DFS), which works very well in summarization. The DFS of a word is defined as the ratio of the number of documents in which it occurs to the total number of documents in the collection. The dfs of a word w is given by
dfs(w) = |{d : w ∈ d}| / |D|
where d is a document and |D| is the total number of documents in the dataset. DFS is a simple statistical feature that exploits the relatedness of every document in the collection to compute the salience of a sentence.
4.3.4 Sentence Frequency score (SFS):
The Sentence Frequency Score (SFS) is a sentence-level variant of DFS. As every document in the collection is assumed to be relevant to the topic, it is also safe to assume that the majority of the sentences from these documents are relevant. SFS is devised to capture the most relevant of the relevant sentences in the document collection. The SFS of a word is defined as the ratio of the number of sentences in the document set in which the word occurs to the total number of sentences in the document set. The sfs score of a word w is given by
sfs(w) = |{s : w ∈ s}| / |N|
where s is a sentence and |N| is the total number of sentences in the dataset. The average sentence frequency score of all the words in a sentence is taken as its feature score:
Score(s) = ( Σ_{i ∈ s} sfs(w_i) ) / |s|
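DFS and SFS differ only in the unit of counting. A sketch, assuming documents are represented as word sets and sentences as token lists:

```python
def dfs(word, docs):
    """Document frequency score: fraction of documents containing the word."""
    return sum(1 for d in docs if word in d) / len(docs)

def sfs(word, sentences):
    """Sentence frequency score: fraction of sentences containing the word."""
    return sum(1 for s in sentences if word in s) / len(sentences)

def sfs_sentence_score(sentence, sentences):
    """Feature score: average sfs over the sentence's words."""
    return sum(sfs(w, sentences) for w in sentence) / len(sentence)
```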
4.3.5 Probabilistic hyperspace analogue to language (PHAL)
A language model, or statistical language model, is a probabilistic mechanism for generating text. A Hyperspace Analogue to Language (HAL) model constructs dependencies of a word w with other words based on their co-occurrence in a window of size k. PHAL is a probabilistic extension of HAL spaces, where term co-occurrence counts are used to compute conditional probabilities. We use the PHAL model proposed by [20] as a sentence scoring feature. PHAL can be interpreted as the probability of observing a word w′ with the word w in a window of size k:
PHAL(w′|w) = HAL(w′|w) / (n(w) × k)
Assuming word independence, the relevance of a sentence S given an information need Q is computed as
P(S|Q) ≈ Π_{w_i ∈ S} P(w_i|Q)
       ≈ Π_{w_i ∈ S} ( P(w_i) / P(Q) ) Π_{q_j} PHAL(q_j|w_i)
       ≈ Π_{w_i ∈ S} P(w_i) Π_{q_j} PHAL(q_j|w_i)
4.3.6 Kullback Leibler divergence (KLD)
Kullback-Leibler divergence, or relative entropy, is a non-symmetric measure of the difference between two probability distributions. KLD is used to calculate the generic, query-independent importance of information by a contrastive analysis of the given document set D with a random document set D′. If a term has a similar probability distribution in both D and D′, the generic importance of that term is assumed to be high. The KLD of a sentence s is computed as
KLD(s) = Σ_{i=1}^{|s|} P(w_i|D) log ( P(w_i|D) / P(w_i|D′) )
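A sketch of this computation, assuming unigram probabilities for D and D′ have been precomputed as dictionaries; the small floor value for unseen terms is our simplification, not the smoothing used in the thesis:

```python
import math

def kld_sentence(sentence, p_D, p_Dprime, eps=1e-9):
    """Sum of per-word KL terms between the topic collection D and a
    contrastive random collection D' (sketch of section 4.3.6)."""
    total = 0.0
    for w in sentence:
        pd = p_D.get(w, eps)        # P(w|D), floored for unseen words
        pdp = p_Dprime.get(w, eps)  # P(w|D')
        total += pd * math.log(pd / pdp)
    return total
```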
4.3.7 Prepositional Importance (PrepImp)
In English grammar, a preposition is a part of speech that links nouns and pronouns to other phrases in a sentence. A preposition generally represents the temporal, spatial or logical relationship of its object to the rest of the sentence. Observe the role of the prepositions on, of, to, in, from in the sentences below:
The book is on the table
The President of India lives in Delhi
The Indian cricket team is traveling from Australia to New Zealand
It is very interesting to observe how prepositions implicitly capture the key elements of a sentence. The preposition on in the first sentence conveys that there is a book, a table, and some relation between them. Similarly, the other two sentences carry key information about one or more entities, implicitly conveyed through the connecting prepositions. To the best of our knowledge, the role of prepositions has never before been explored to calculate sentence importance.
As a first step in this direction, we propose using the frequency of a small set of prepositions in a sentence as its feature score. The frequency of prepositions indirectly achieves the effect of performing Named Entity Recognition (NER) on a sentence, but without any additional processing cost or POS tags. The score of a sentence s calculated by PrepImp is given as
PrepImp(s) = ( Σ_{w_i ∈ s} IsPrep(w_i) ) / |s|
The list of prepositions used for calculating sentence importance is limited to simple single-word prepositions like in, on, of, at, for, from, to, by, with, chosen after a careful observation of the data.
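The feature is a simple ratio over the preposition list given above; a sketch, with whitespace tokenization as our simplifying assumption:

```python
# The single-word preposition list named in the text above.
PREPOSITIONS = {"in", "on", "of", "at", "for", "from", "to", "by", "with"}

def prep_imp(sentence):
    """PrepImp: fraction of a sentence's tokens that are prepositions."""
    tokens = sentence.lower().split()
    if not tokens:
        return 0.0
    return sum(1 for t in tokens if t in PREPOSITIONS) / len(tokens)
```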
4.3.8 Oracle Summaries
An oracle summary is the best sentence-extractive summary that can be generated by any sentence-extractive summarization system for a particular topic. We generated sentence-extractive oracle summaries using the document collection and the human-written summaries for each topic. Each sentence is scored using equation 4.1, and this score is taken as its final rank. Summaries are extracted as described in section 4.1.1. Oracle summaries serve as the upper limit of what can be achieved through extractive methods in summarization. They are generated to study the most informative sentences in the document collection and to depict the scope for improvement in summarization.
4.4 Summary
In this chapter we described a general multi-document summarization framework, explaining its pre-processing, feature extraction, sentence ranking and summary extraction stages. Normally, sentence rank is computed from a weighted linear combination of features. Instead, we use a supervised machine learning technique, regression, to estimate sentence importance from the feature vector. We provided details about the theory and mathematical formulation of the scoring features that will be used in the experiments of chapter 6. These features include existing features such as PHAL, KLD, DFS and TF-IDF, and newly devised features such as SFS, PrepImp, SL1 and SL2. Oracle summaries are generated with the objective of finding the upper limit of what can be achieved through extractive summarization, and thus depict the scope for improvement over existing approaches.
Chapter 5
Progressive Summarization: Summarization with Novelty
Detection
Summarization, in its basic essence, is to extract the essential information from a collection of textual content and present it in a user-readable format. It is a multi-disciplinary problem with roots in Information Retrieval, Natural Language Processing and Cognitive Science. Over the years, research in summarization has led to some interesting sub-problems, such as personalized summarization, multi-document summarization, and query-focused summarization, among others.
Progressive summarization is a relatively new area within summarization, designed to aid users who have access to a rapidly flowing stream of articles on a topic but no time to look at each article. In such a situation, a person would prefer to be updated on events within the topic, and dive into details only when the reported events trigger enough interest. The focus within the summarization community has shifted towards progressive summarization since the introduction of the Update Summarization track at the Document Understanding Conference in 2007.
It is not sensible to wait for the topic to conclude before producing a summary, nor does it make sense to produce an overall summary at every time interval; after all, the user has already been informed about prior events. Hence progressive summaries are generated at regular time intervals to update the user’s knowledge of a particular news topic.
The major challenge in progressive summarization lies in distinguishing relevant and novel vs. relevant and redundant vs. non-relevant information. Detecting novel information in the source documents, given the user’s prior knowledge of the topic, is termed novelty detection. A novelty detection module in the summarization framework is very important for identifying relevant new information.
Figure 5.1 Novelty detection at different stages in a Multi Document Summarizer
In this thesis we identify the possibility of novelty detection at the feature extraction, sentence ranking and summary extraction stages of summarization, as shown in figure 5.1. We propose different techniques at each stage; the details are given in the following sections.
5.1 Feature Extraction level
In general multi-document summarization systems, word-level or sentence-level features calculate their scores by measuring relevance to the topic. In the feature extraction stage of progressive summarization, however, features should be capable of capturing sentence novelty along with relevance. In our work we devised three such features: Novelty Factor (NF), New Words (NW) and Hybrid Kullback-Leibler Information Divergence (HKLID).
Imagine a set of articles published on an evolving news topic over a time period T, with t_d being the publishing timestamp of article d. All the articles published from time 0 to time t are assumed to have been read previously, and hence form the prior knowledge, pdocs (short for previous documents). Articles published in the interval t to T, which contain new information, are considered ndocs (short for new documents):
ndocs = {d : t_d > t}
pdocs = {d : t_d ≤ t}
5.1.1 Novelty Factor (NF)
We propose a new feature, the novelty factor (NF), that primarily focuses on the progressive summarization problem. The novelty factor is inspired by DFS in generic multi-document summarization. The essence of novelty is to find information that is dominant and relevant in the new cluster of documents (ndocs), rather than information already present in the prior knowledge. Novelty is directly proportional to relevance in ndocs and inversely proportional to dominance in pdocs. The NF of a word w is calculated as
NF(w) = |nd_t| / ( |pd_t| + |D| )
nd_t = {d : w ∈ d ∧ d ∈ ndocs}
pd_t = {d : w ∈ d ∧ d ∈ pdocs}
D = {d : t_d > t}
The numerator |nd_t| is the number of documents in the new cluster that contain the word w. It is directly proportional to the relevance of the term, since all the documents in the cluster are relevant to the topic. The term |pd_t| in the denominator penalizes any word that occurs frequently in previous clusters; in other words, it elevates the novelty of a term. |D| is the total number of documents in the current cluster, which smooths the novelty factor when w does not occur in the previous clusters. The NF score of a sentence is a measure of its relevance and novelty to the topic. The score of a sentence s is the average NF value of its content words:
Score(s) = ( Σ_{w_i ∈ s} NF(w_i) ) / |s|
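A sketch of the feature, assuming ndocs and pdocs are given as lists of word sets and sentences as token lists (the example words are made up):

```python
def novelty_factor(word, ndocs, pdocs):
    """NF(w) = |nd_t| / (|pd_t| + |D|), with |D| the current cluster size."""
    nd_t = sum(1 for d in ndocs if word in d)
    pd_t = sum(1 for d in pdocs if word in d)
    return nd_t / (pd_t + len(ndocs))

def nf_sentence_score(sentence, ndocs, pdocs):
    """Average NF over the sentence's content words."""
    return sum(novelty_factor(w, ndocs, pdocs) for w in sentence) / len(sentence)
```

Note how a word appearing only in the new cluster scores higher than one already seen in pdocs, matching the intuition above.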
5.1.2 New Word Measure (NW)
The motivation behind NW comes from the TREC Novelty track, where many systems estimate the newness of a sentence as the number of new words it contains. A word that never occurred before in the document cluster is considered new, so all the words that are not present in pdocs are regarded as new. The NW score of a sentence s is given by
Score(s) = ( Σ_{w ∈ s} NW(w) ) / |s|
NW(w) = 0      if w ∈ pdocs
NW(w) = n/N    otherwise
where n is the frequency of w in ndocs and N is the total term frequency of ndocs. The normalized term frequency of w is used in calculating the feature score of a sentence. Unlike NF, NW captures only the newness of a sentence, not its relevance; it has to be used in combination with other relevance features to calculate the relevant novelty of a sentence.
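A sketch of the measure, assuming pdocs and ndocs are given as flat token lists (the example tokens are made up):

```python
from collections import Counter

def nw_scores(ndocs_tokens, pdocs_tokens):
    """NW(w): 0 for words already seen in pdocs, else the word's
    frequency in ndocs normalized by the total ndocs term count."""
    seen = set(pdocs_tokens)
    freq = Counter(ndocs_tokens)
    total = sum(freq.values())
    return {w: (0.0 if w in seen else freq[w] / total) for w in freq}

def nw_sentence_score(sentence, nw):
    """Average NW over the sentence's tokens."""
    return sum(nw.get(w, 0.0) for w in sentence) / len(sentence)
```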
5.1.3 Hybrid Kullback-Leibler Information Divergence (HKLID)
Kullback-Leibler information divergence (KL) [25] is a popular technique to measure the difference between two probability distributions. The principle behind KL is used to assess the generic importance of a sentence in standard summarizers, as explained in section 4.3.
We use an extension of KL to measure the divergence between hybrid language models (LMs) of two sentences, built over pdocs and ndocs. A hybrid language model is the combination of document and sentence language models, for better divergence calculation. The HKLID between the LMs of two sentences s_i in ndocs and s_j in pdocs is calculated as
HKLID(s_i ‖ s_j) = Σ_{w ∈ s_i} P(w|s_i) P(w|ndocs) · log ( P(w|s_i) P(w|ndocs) / ( P(w|s_j) P(w|pdocs) ) )
HKLID measures the importance of a sentence in ndocs conditioned on the sentences in pdocs: the greater the divergence between these hybrid language models, the greater the novelty of the sentence. The average HKLID between a sentence s_i in ndocs and all the sentences in pdocs is used as its novelty score. Probability distributions are smoothed using the Dirichlet principle [52].
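A sketch of the divergence, with sentences as token lists and collection unigram probabilities passed in as dictionaries. The maximum-likelihood sentence LMs and the small floor for unseen terms are our simplifications; the thesis uses Dirichlet smoothing.

```python
import math

def hklid(si, sj, p_ndocs, p_pdocs, eps=1e-9):
    """Hybrid KL divergence between a sentence si (from ndocs)
    and a sentence sj (from pdocs)."""
    lsi = {w: si.count(w) / len(si) for w in set(si)}  # ML sentence LM of si
    lsj = {w: sj.count(w) / len(sj) for w in set(sj)}  # ML sentence LM of sj
    total = 0.0
    for w in lsi:
        num = lsi[w] * p_ndocs.get(w, eps)             # P(w|si) P(w|ndocs)
        den = lsj.get(w, eps) * p_pdocs.get(w, eps)    # P(w|sj) P(w|pdocs)
        total += num * math.log(num / den)
    return total
```

Identical sentences under identical collection models diverge by zero, while a sentence made of unseen words diverges strongly, which is the behavior the feature relies on.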
5.2 Sentence Ranking Level
The sentence ranker combines a multitude of scoring features into a single rank that is directly proportional to the importance of a sentence. In progressive summarization, the ranked sentence list is improved by re-ordering sentences through two different techniques, “Redundancy Re-ranking” and “Proximity Re-ranking”. The goal of re-ordering is to promote sentences carrying new information over stale ones in the ranked list.
5.2.1 Redundancy Re-ranking
In redundancy re-ranking, the ranked set is re-ordered using the Maximal Marginal Relevance (MMR) [9] criterion. MMR computes the importance of an element through a combined criterion of query relevance and information novelty. The linear combination of independently measured relevance and novelty of an element is referred to as its “marginal relevance”; the method strives to maximize marginal relevance, hence the name.
The final rank of a sentence is computed as a weighted linear combination of the original sentence rank and the redundancy measure of that sentence:
Rank_{s_i} = µ · score_{s_i} − (1 − µ) · redundancy_score_{s_i}
where “score” is the original sentence score predicted by the regression model as described in section 4.2, and “redundancy score” is an estimate of the amount of prior information a sentence contains. µ is a balancing parameter to adjust the relevance and novelty of a sentence. In this work the redundancy score of a sentence is calculated by its Information Theoretic Similarity (ITSim) and Cosine Similarity (CoSim) to the previous sentences.
Information Theoretic Similarity (ITSim)
[34] presented an information theoretic definition of similarity and demonstrated its application in various domains. This definition of similarity does not assume a particular domain or type of problem; it is applicable as long as the domain has a probability model. Unlike other similarity measures it is not defined by a formula, but derived from a set of assumptions about similarity between two entities. The similarity between A and B is measured by the ratio between the amount of information needed to state the commonality of A and B and the information needed to fully describe what A and B are:
sim(A, B) = log P(common(A, B)) / log P(description(A, B))    (5.1)
According to information theory, entropy quantifies the amount of information carried by a message. Extending this analogy to text content, the information I(w) of a word w is calculated as
I(w) = −p(w) · log(p(w)),   p(w) = n/N
Motivated by the information theoretic definition of similarity, we extend the similarity described in equation 5.1 to two sentences s1 and s2:
ITSim(s1, s2) = 2 · Σ_{w ∈ s1 ∧ s2} I(w) / ( Σ_{w ∈ s1} I(w) + Σ_{w ∈ s2} I(w) )
The information of a sentence is calculated as the entropy of all the words it contains. The numerator is proportional to the commonality between s1 and s2; the denominator measures the description of both sentences.
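A sketch of ITSim, assuming sentences are token lists and p is a precomputed corpus unigram distribution (the toy probabilities in the test are made up):

```python
import math

def information(w, p):
    """I(w) = -p(w) * log(p(w)) for a word under distribution p."""
    pw = p[w]
    return -pw * math.log(pw)

def itsim(s1, s2, p):
    """Information-theoretic similarity between two tokenized sentences:
    twice the shared information over the total information of both."""
    common = set(s1) & set(s2)
    num = 2 * sum(information(w, p) for w in common)
    den = (sum(information(w, p) for w in set(s1)) +
           sum(information(w, p) for w in set(s2)))
    return num / den if den else 0.0
```

Identical sentences score 1 and sentences sharing no words score 0, as the ratio form requires.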
Cosine Similarity (CoSim)
Cosine similarity is a popular, long-standing technique often used to compare documents in text
mining. It measures the similarity between two n-dimensional vectors by the cosine of
the angle between them. Given two vectors A,B, the cosine similarity is represented using a
dot product and magnitude as,
Sim(A, B) = cos(θ) = (A · B) / (|A| |B|)
Sentences are represented as tf-idf vectors [43] of their constituent words in an n-dimensional
space. The term frequency of each word determines its component value, and the number of unique
words in the document collection determines the dimension n. The cosine similarity between two
sentences is measured as,
CoSim(s1, s2) = Σ_{w ∈ s1 ∧ s2} tfidf(w, s1) * tfidf(w, s2) / sqrt( Σ_{w ∈ s1} tfidf(w)² * Σ_{w ∈ s2} tfidf(w)² )
The maximum similarity value (ITSim or CoSim) of a sentence in ndocs with all sentences in pdocs
is taken as its redundancy score.
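A minimal sketch of CoSim and the redundancy-based re-ranking above might look as follows. For simplicity it assumes a single precomputed global tf-idf weight per word, and µ = 0.7 is only a placeholder, since the thesis sets the balancing parameter manually:

```python
import math
from collections import Counter

def cosim(s1, s2, tfidf):
    """Cosine similarity between two sentences in tf-idf space.
    `tfidf` maps each word to one precomputed global weight (a simplification)."""
    w1, w2 = Counter(s1.split()), Counter(s2.split())
    dot = sum(tfidf.get(w, 0.0) ** 2 for w in w1.keys() & w2.keys())
    n1 = math.sqrt(sum(tfidf.get(w, 0.0) ** 2 for w in w1))
    n2 = math.sqrt(sum(tfidf.get(w, 0.0) ** 2 for w in w2))
    return dot / (n1 * n2) if n1 and n2 else 0.0

def rerank(ndoc_sentences, scores, pdoc_sentences, tfidf, mu=0.7):
    """Rank(si) = mu * score(si) - (1 - mu) * redundancy_score(si), where the
    redundancy score is the maximum similarity to any pdocs sentence."""
    ranked = []
    for s, score in zip(ndoc_sentences, scores):
        redundancy = max((cosim(s, p, tfidf) for p in pdoc_sentences), default=0.0)
        ranked.append((mu * score - (1 - mu) * redundancy, s))
    return sorted(ranked, reverse=True)
```

Swapping `cosim` for an `itsim` implementation yields the ITSim variant of the same re-ranking.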
5.2.2 Proximity Re-ranking (ProximRank)
TREC Novelty track
NIST has created gold standard data for evaluations as part of the Novelty track at TREC
2002, 2003 and 2004, consisting of about 50 topics. Each topic has a set of relevant and irrelevant
documents, and sentences marked with relevance and novelty judgments. [46] investigated the
percentage of relevant and novel sentences in the document collection and the adjacency of
relevant sentences for the 2003 and 2004 data. Table 5.1 shows excerpts of this analysis.
             2003   2004
Relevant     0.39   0.20
Consecutive  0.91   0.70
Novelty      0.68   0.40

Table 5.1 Statistics of relevant, novel and consecutive relevant sentences in TREC 2003 and 2004
Almost 40% of the sentences were selected as relevant, and in particular 90% of the relevant
sentences were adjacent. The analysis also shows a huge disparity in the fraction of relevant
and novel sentences between 2003 and 2004. However, the authors did not explore the
importance of proximity among novel sentences.
We carried out experiments on TREC 2004 Novelty track data and found that about 75% of
novel sentences occur in pairs and approximately 61% occur in groups of three. These statistics
indicate that new information often spans a group of sentences belonging to the same context.
Hence it is intuitive to compute the rank of a sentence using the relevant novelty of its
surrounding sentences. The final rank of a sentence si after proximity re-ranking is,
Rank(si) = λ * score(si) + (1 − λ) * ( score(si−1) + score(si+1) ) / 2
where score(si) is the relevant-novelty score estimated by the regression model and λ is a
balancing parameter that is set manually.
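Proximity re-ranking reduces to a one-pass smoothing of the score list. In this sketch the boundary handling (a sentence reusing its own score for a missing neighbour) is our assumption, as the text does not specify it, and λ = 0.8 is only a placeholder:

```python
def proxim_rank(scores, lam=0.8):
    """Rank(si) = lam * score(si) + (1 - lam) * (score(si-1) + score(si+1)) / 2.
    `scores` holds relevant-novelty scores of consecutive sentences in
    document order. Boundary sentences reuse their own score for the
    missing neighbour (an assumption, not specified in the thesis)."""
    n = len(scores)
    ranks = []
    for i, s in enumerate(scores):
        prev = scores[i - 1] if i > 0 else s
        nxt = scores[i + 1] if i < n - 1 else s
        ranks.append(lam * s + (1 - lam) * (prev + nxt) / 2)
    return ranks
```

A sentence surrounded by high-scoring neighbours is thus pulled up the ranked list even if its own score is modest.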
5.3 Summary Extraction Level
Summary extraction is the final stage of summarization, where sentences from the ranked list
are selected into the summary subject to redundancy, coherence and other constraints of the
framework. At this stage, novelty can be induced into the summary by selecting only sentences
that are estimated to contain relevant novelty given the background knowledge.
Novelty Pool (NP)
A progressive summarizer assumes that the user is most concerned with finding new information
and, because of his background knowledge, is intolerant of reading information he already knows.
Since each sentence adds to the user's knowledge, and later sentences should be retrieved only
if they contain new information, novelty retrieval resembles a filtering task.
We model novelty detection as a filtering task at the summary extraction stage. Sentences that
possibly contain prior information are filtered out of the summary by creating a Novelty Pool
(NP). We introduce the notion of dominant and novel words to explain the intuition behind NP.
A word w is considered dominant if its DFS is above half of the total documents
in the cluster. Two sets of dominant words are generated, one each for pdocs and ndocs,
domndocs = {w : DFSndocs(w) > ndocs/2}
dompdocs = {w : DFSpdocs(w) > pdocs/2}
The difference of these two sets gives us the list of novelwords,
novelwords = domndocs − dompdocs
Thus we extract a set of novelwords that are both dominant and new. During summary
extraction we select the sentences having more novelwords than the average novelwords-per-
sentence ratio (npr). This set of sentences is referred to as the Novelty Pool (NP).
novelwordcount(si) = Σ_{j=1}^{|si|} isnovelword(wij)

npr = ( Σ_{i=1}^{|S|} novelwordcount(si) ) / |S|

NP = {si : novelwordcount(si) > npr}
S represents the set of sentences in ndocs, and |S| is the cardinality of set S.
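The Novelty Pool construction can be sketched as follows, representing each document as a list of sentence strings; the helper names are illustrative, not from the thesis:

```python
from collections import Counter

def document_frequency(docs):
    """DFS(w): the number of documents in a cluster that contain word w.
    Each document is a list of sentence strings."""
    df = Counter()
    for doc in docs:
        for w in {w for sent in doc for w in sent.split()}:
            df[w] += 1
    return df

def novelty_pool(pdocs, ndocs):
    """Sentences of ndocs containing more novelwords than the average
    novelwords-per-sentence ratio (npr)."""
    dom_n = {w for w, c in document_frequency(ndocs).items() if c > len(ndocs) / 2}
    dom_p = {w for w, c in document_frequency(pdocs).items() if c > len(pdocs) / 2}
    novelwords = dom_n - dom_p  # dominant in ndocs but not in pdocs
    sentences = [s for doc in ndocs for s in doc]
    counts = [sum(1 for w in s.split() if w in novelwords) for s in sentences]
    npr = sum(counts) / len(sentences)
    return [s for s, c in zip(sentences, counts) if c > npr]
```

During summary extraction, only sentences returned by `novelty_pool` would be eligible for selection.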
5.4 Summary
In this chapter we described the role and importance of novelty detection in progressive
summarization. We explained how novelty detection is integrated into the summarization frame-
work and why it is important for producing informative progressive summaries. The sections of
this chapter described the novelty detection techniques used at the feature extraction, sentence
ranking and summary extraction stages of the summarizer. We developed two new scoring features
to capture the relevancy along with the novelty of a sentence. We used content based and
proximity based re-ranking measures to improve the ranked list of sentences. Finally, at the
summary extraction stage we modeled novelty detection as a filtering task and filtered out
sentences probably containing prior information.
Chapter 6
Evaluation
Evaluation of summaries can broadly be classified into two classes. The first is extrinsic
evaluation, which measures the effect of summarization on the completion of other tasks such as
relevance assessment and text categorization. The second is intrinsic evaluation, which tests
the summarization itself, mainly assessing the informativeness and cohesiveness of the summary.
In this thesis we evaluate our progressive summarization techniques and the advantage of
supervised sentence ranking over the traditional weighted linear approach through the popular
intrinsic measure Recall-Oriented Understudy for Gisting Evaluation (ROUGE). This chapter
discusses the experimental setup and the evaluation results of our experiments and later
compares them to state-of-the-art approaches in summarization.
6.1 Dataset
We conducted all our experiments on the TAC update summarization track dataset, which serves
as an ideal testbed for evaluating progressive summaries. The update summarization scenario
described in TAC assumes that each user is an educated adult US native who is aware of current
events as they appear in the news. The user is interested in a particular news story and wants to
track it as it develops over time, so he subscribes to a news feed that sends him relevant articles
as they are submitted from various news services. However, either there is so much news that
he cannot keep up with it, or he has to leave for a while and then wants to catch up. Whenever he
checks up on the news, it bothers him that most articles keep repeating the same information;
he would like to read summaries that only talk about what’s new or different.
In this scenario, a user initially gives a topic statement (query and narrative) expressing
his information need. News articles about the story then arrive in batches over time (clusters
of articles), and the task is to write a 100-word summary for each cluster of articles that
addresses the information need of the user.
The test dataset comprises 48 topics. Each topic has a topic statement (query)
and 20 relevant documents, divided into two sets: cluster A (pdocs) and cluster
B (ndocs). Each document set has 10 documents, and all the documents in pdocs chrono-
logically precede the documents in ndocs. The documents come from the AQUAINT-2
collection of news articles. Figure 6.1 shows a sample topic statement and document set de-
scriptions. Each topic statement and its two document sets have four model summaries (gold
standard) written by professional NIST assessors. These model summaries are used by NIST
to evaluate the content of system generated summaries (peers/peer summaries).
Figure 6.1 Sample topic and narrative in TAC 2008
The task description and the structure of the data in TAC 2008 and TAC 2009 remain the
same, the difference being a clearer distinction of events across the clusters. This allowed us to
use TAC 2008 data for training the regression models and TAC 2009 data to carry out the experiments.
6.2 Evaluation Metrics
We evaluate the quality of summaries using both content and form based evaluation mea-
sures. The focus is mainly on intrinsic content based evaluations. Below we provide the details
about the evaluation metrics that are used in our experiments.
6.2.1 ROUGE
ROUGE is a recall oriented metric that automatically scores peer summaries based on pair-
wise comparisons with reference summaries. It provides several measures that count
the number of overlapping units such as n-grams, word sequences and word pairs between
peer and model summaries. The ROUGE [33] package is an openly available resource with four
major measures: ROUGE-N, ROUGE-L, ROUGE-W and ROUGE-S. Below we briefly describe
ROUGE-N and ROUGE-S, variants of which are popularly used to compare
summarization systems.
ROUGE-N is an n-gram recall between a candidate summary and a set of reference sum-
maries. ROUGE-N is computed as follows:
ROUGE-N = ( Σ_{m ∈ models} Σ_{gram_n ∈ m} Count_match(gram_n) ) / ( Σ_{m ∈ models} Σ_{gram_n ∈ m} Count(gram_n) )
where n is the length of the n-gram and Count_match(gram_n) is the maximum number of n-grams
co-occurring in the candidate summary and the model summary m. ROUGE is clearly a recall
oriented metric, since the denominator of the equation is the total number of n-grams occurring
in the model summaries. When multiple reference summaries are used, a pairwise summary-level
ROUGE-N between the candidate summary s and every model m in the set of model summaries is
computed. Then the maximum of the pairwise
summary-level ROUGE-N scores is treated as the final multiple-reference ROUGE-N score.
The jackknifing principle is used in computing the final ROUGE scores to make model summaries
comparable with peers.
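The ROUGE-N equation, in the summed multi-reference form given above, can be sketched as follows. Note that the official ROUGE toolkit additionally applies the pairwise maximum and jackknifing just described, which this illustrative version omits:

```python
from collections import Counter

def ngrams(text, n):
    """Multiset of word n-grams of a text."""
    toks = text.split()
    return Counter(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))

def rouge_n(candidate, models, n=2):
    """N-gram recall of a candidate summary against model summaries.
    Count_match clips each model n-gram's count at the candidate's count;
    the denominator is the total n-gram count of the models (recall)."""
    cand = ngrams(candidate, n)
    match = total = 0
    for m in models:
        ref = ngrams(m, n)
        match += sum(min(c, cand.get(g, 0)) for g, c in ref.items())
        total += sum(ref.values())
    return match / total if total else 0.0
```

Because only the denominator grows with the models, a longer candidate can only help, which is why ROUGE-N is a recall measure.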
ROUGE-S: Skip-bigrams are length-2 subsequences of a string, having the words in the same
order as the original sequence but with arbitrary gaps between them. Skip-bigram co-occurrence
statistics measure the overlap of skip-bigrams between a candidate summary and a set of refer-
ence summaries. For a better understanding of skip-bigrams, consider the following example:
• S1: sachin tendulkar plays cricket
• S2: sachin plays test cricket
• S3: plays cricket sachin always
• S4: plays cricket sachin tendulkar
Each sentence has 6 skip-bigrams. The first sentence has the skip-bigrams (sachin ten-
dulkar, sachin plays, sachin cricket, tendulkar plays, tendulkar cricket, plays cricket). S2 has
three matches with S1, S3 has one, while S4 has two. Given a model summary M of length m
and a peer summary S of length n, with SKIP2(M, S) the number of matching skip-bigrams
between M and S, ROUGE-S is computed as follows:
R_skip2 = SKIP2(M, S) / C(m, 2)
P_skip2 = SKIP2(M, S) / C(n, 2)
ROUGE-S = (2 * R_skip2 * P_skip2) / (R_skip2 + P_skip2)
ROUGE-SU is an improved version of ROUGE-S. One potential problem with ROUGE-S is
that it gives no credit to a candidate sentence that has no word pair co-occurring with the
models. To address this, a simple extension, ROUGE-SUn, is employed, where n is the maximum
skip distance for a bigram. ROUGE-SU counts all the bigrams counted by ROUGE-S plus all
unigrams, and hence removes the above problem.
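The skip-bigram count and the ROUGE-S F-measure can be sketched as follows, treating the skip-bigrams of a sentence as a set, which matches the distinct-word examples above:

```python
from itertools import combinations
from math import comb

def skip_bigrams(text):
    """All in-order word pairs of a sentence, with arbitrary gaps."""
    return set(combinations(text.split(), 2))

def rouge_s(model, peer):
    """Skip-bigram F-measure, with C(m,2) and C(n,2) as the
    recall and precision denominators respectively."""
    overlap = len(skip_bigrams(model) & skip_bigrams(peer))
    if not overlap:
        return 0.0
    r = overlap / comb(len(model.split()), 2)
    p = overlap / comb(len(peer.split()), 2)
    return 2 * r * p / (r + p)
```

Running it on the example sentences reproduces the counts given in the text: S1 has 6 skip-bigrams, of which S2 matches three, S3 one and S4 two.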
As a standard routine, the ROUGE-2 and ROUGE-SU4 scores of summaries are considered
yardsticks of evaluation throughout the DUC and TAC series of workshops. Hence we also
consider only these two scores in our evaluations.
6.2.2 Pyramids
The pyramid method of evaluation provides a unified way to handle semantic equivalence,
human variation, analytic granularity and other aspects of a summary at various levels
of granularity. The key assumption of pyramids, like ROUGE, is the need for multiple human
authored model summaries, which are considered a gold standard for peer summaries.
Summary Content Units (SCUs), also referred to as semantic content units, are
semantically motivated subsentential units of variable length. SCUs emerge from an-
notation of the set of model summaries for a topic. Sentences in the summaries are broken down
into clauses, each of which is an SCU in the pyramid. Each SCU has an associated weight indicat-
ing the number of model summaries in which it appeared. Repetition of information, through
changes as small as a modifier or as large as a clause in the model summaries, gives rise to a single
SCU. Peer summaries are evaluated based on the presence of SCUs and their corresponding
weights.
A key feature of a pyramid is that it quantitatively represents agreement among the human
summaries: SCUs that appear in more of the human summaries are weighted higher. Such
weighting allows differentiating important content from less important content, which is
necessary in summarization evaluation given the personal opinions of assessors while
writing summaries. Fine details about SCUs and the pyramid evaluation as a whole are described
in [3].
6.2.3 Readability and overall responsiveness
Along with the above mentioned content based evaluations, the readability of a summary is
also assessed during TAC evaluations. Readability was assessed using five linguistic quality
questions measuring qualities of the summary that do not involve comparison with a reference
summary or TAC topic. The linguistic qualities measured were Grammaticality, Non-redundancy,
Referential clarity, Focus, and Structure and coherence. Humans assessed peer summaries on
these questions and assigned a score on a five point scale, where 1 denotes the worst and 5 the
best summary.
NIST assessors assigned an overall responsiveness score to each of the automatic and human
summaries. The overall responsiveness score is an integer between 1 (very poor) and 10 (very
good) and is based on both the linguistic quality of the summary and the amount of information
in the summary that helps to satisfy the information need defined in the topic's narrative.
6.3 Evaluation of Supervised ranking
In this section we assess the supervised ranking technique, regression, discussed in chapter 4
and provide the evaluation results for various combinations of features.
6.3.1 Kernel Functions
Support Vector Machines construct a hyperplane or set of hyperplanes in a high or infinite
dimensional space, which can be used for classification, regression or other tasks. Intuitively,
a good separation is achieved by the hyperplane that has the largest distance to the nearest
training data points of any class (so-called functional margin), since in general the larger the
margin the lower the generalization error of the classifier.
Real world problems are often stated in a finite dimensional space in which the sets to be
discriminated are not linearly separable. For this reason it was proposed that the original
finite dimensional space be mapped into a much higher dimensional space, making the separation
easier there. SVM schemes use such a mapping while ensuring that dot products in the larger
space can be computed cheaply in terms of the variables in the original space, keeping the
computational load reasonable. The dot products in the larger space are defined in terms of a
kernel function. There are various kernel functions, out of which a suitable one needs to be
selected for the problem.
We experimented with four popular kernel functions, Linear, Sigmoid, Polynomial and
Radial Basis, for our regression problem. In tables 6.1 and 6.2 we present the ROUGE-2 and
ROUGE-SU4 scores of summaries generated using DFS as a single feature for pdocs and ndocs
respectively.
pdocs ROUGE-2 ROUGE SU4
Linear 0.10133 0.13839
Sigmoid 0.08208 0.12009
Polynomial 0.10133 0.13839
Radial Basis 0.10230 0.13927
Table 6.1 ROUGE-2, ROUGE SU4 scores of pdocs using different kernels
ndocs ROUGE-2 ROUGE SU4
Linear 0.02020 0.06068
Sigmoid 0.06714 0.10648
Polynomial 0.04845 0.09407
Radial Basis 0.08548 0.12680
Table 6.2 ROUGE-2, ROUGE SU4 scores of ndocs using different kernels
From the evaluation results, we observed that the radial basis function suits our problem
better than the linear, sigmoid or polynomial kernels. Hence we use the radial basis function
as our kernel in further experiments.
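The four kernels compared here can be written as plain functions over feature vectors. The hyperparameters shown (degree, gamma, coef0) are illustrative defaults, not the settings used in these experiments, which an SVR library would normally tune or expose:

```python
import math

def linear(x, y):
    """K(x, y) = x . y"""
    return sum(a * b for a, b in zip(x, y))

def polynomial(x, y, degree=3, coef0=1.0):
    """K(x, y) = (x . y + coef0)^degree"""
    return (linear(x, y) + coef0) ** degree

def rbf(x, y, gamma=0.5):
    """K(x, y) = exp(-gamma * ||x - y||^2)"""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))

def sigmoid(x, y, gamma=0.5, coef0=0.0):
    """K(x, y) = tanh(gamma * x . y + coef0)"""
    return math.tanh(gamma * linear(x, y) + coef0)
```

Each function replaces the dot product of the mapped vectors, so the regression never works in the high dimensional space explicitly.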
6.3.2 Regression vs. Weighted Linear Scoring
All the sentence scoring features described in chapter 4 are evaluated individually using
both regression and weighted linear scoring. We carried out our experiments at two levels:
first at the individual feature level and later on combinations of different features. At the
first level, the feature vector (Fs) of a sentence s has only one value. Evaluation results of
the summaries generated using regression are compared against summaries generated through
normal ranking of single features (the weight is 1, as there is only one feature). The results
are presented in tables 6.3 and 6.4 for pdocs and ndocs respectively.
                 pdocs                  ndocs
Feature          Regression  Normal    Regression  Normal
DFS 0.10230 0.09574 0.08548 0.08954
KL 0.09751 0.09499 0.08822 0.08601
PHAL 0.07736 0.09245 0.06815 0.08255
SL1 0.09402 0.09069 0.08491 0.08144
SL2 0.09142 0.03443 0.08334 0.03711
SFS 0.08768 0.08504 0.07497 0.07705
PrepImportance 0.04594 0.04683 0.05011 0.04847
TF-IDF 0.08075 0.03916 0.06779 0.03899
Table 6.3 ROUGE-2 scores of pdocs and ndocs using regression to estimate sentence importance
Analysis of Regression Evaluation:
As all the documents in a cluster are relevant to the topic, the intuition behind using DFS as a
relevancy feature worked well in generic summarization. Similarly the features SL1 and SL2, which
boost the first and last sentences of an article, are successful, as the most important information
is usually present at the top of a document. The query focused feature PHAL and the query
independent feature KL achieved good results, as expected from their past success in the DUC
shared tasks. PrepImportance, designed as a preliminary feature to exploit the usage of
prepositions in a sentence, did not fare well, which is not surprising as it considers only the
count of prepositions as a relevancy measure. It is evident from the evaluation results that
regression is at least as good as the normal ranking procedure and even better for features like
DFS, KL, SL1 and SFS. In particular, SL2 shows a huge gap between its regression model and the
normal ranking scheme, as SL2 scores a sentence by its relative position in the document; it does
not make much sense to use SL2 as an individual feature in a normal ranking procedure without a
learning model like regression. According to the results, support vector regression estimates
sentence importance better than the feature itself in most cases.
                 pdocs                  ndocs
Feature          Regression  Normal    Regression  Normal
DFS 0.13927 0.13374 0.12680 0.13039
KL 0.13875 0.13514 0.13124 0.12929
PHAL 0.11621 0.13050 0.11038 0.12929
SL1 0.12985 0.12653 0.12596 0.12222
SL2 0.12759 0.07654 0.12427 0.08256
SFS 0.13000 0.12657 0.12067 0.12034
PrepImportance 0.09203 0.07840 0.09907 0.09700
TF-IDF 0.12177 0.07968 0.11156 0.08228
Table 6.4 ROUGE-SU4 scores of pdocs and ndocs using regression to estimate sentence importance
6.3.3 Combination of features
The results of the first level of experiments promote the use of regression over the normal
ranking procedure in summarization. At the next level, we experimented with combinations of
the features used for sentence scoring. The feature vector (Fs) now has more than one value,
depending on the number of scoring features combined. Unlike weighted linear ranking,
regression allows us to combine any number of features without worrying about optimal weight
combinations. ROUGE scores for some of these combinations are provided in tables 6.5 and 6.6.
Analysis of Results:
The advantage of regression shows when two complementary features are combined to produce
a more desirable summary. For example, the combination DFS+SL1 produced better quality
summaries than either DFS or SL1 alone. The feature TF-IDF is not effective as a standalone
feature, but the combination DFS+SL1+TFIDF resulted in better summaries than DFS+SL1. It is
also to be observed that combining two successful features does not always give a better result,
as the features may not enhance the overall combination. Consider DFS and KL, both very
successful individual features, but the combination DFS+KL has not produced summaries as good
as DFS+SL1. This is because SL1 and DFS complement each other whereas DFS and KL do not.
Similarly the quality of the combination PHAL+KL drops when combined with TF-IDF
(PHAL+KL+TF-IDF). Although the feature PrepImp has not contributed much as an individual
feature, the combination of PrepImp with DFS+SL1 achieved the best results in the table.
Similar patterns are observed for both pdocs and ndocs. Although the ROUGE scores are
higher for pdocs than for ndocs, the effectiveness of the combinations is clear in both cases.
The reason for the lower ndocs scores is the lack of specific novelty detection measures, as all
the features used at this level are oriented towards generic multi document summarization.
Observing the results in tables 6.3, 6.4, 6.5 and 6.6, the maximum ROUGE-2 score increased
from 0.10230 (DFS) to 0.11041 (DFS+SL1) for pdocs and from 0.08548 (DFS) to 0.09607 (DFS+SL1)
for ndocs. Similarly the ROUGE-SU4 scores increased from 0.13927 and 0.12680 to 0.14628 and
0.13761 respectively. An approximately 8% improvement in ROUGE-2 and ROUGE-SU4 scores is
achieved for pdocs by combining features through regression. The purpose of this evaluation is
to find the best configuration of generic summarization, apply the novelty detection techniques
on top of it, and finally produce informative progressive summaries.
6.4 Evaluation of Progressive Summarization
Progressive summarization focuses on improving the summaries of ndocs given prior knowledge
in the form of pdocs. The progressive summaries are generated under the assumption that the
user has complete knowledge of the information presented in pdocs. In this section we evaluate
all the novelty detection techniques proposed in chapter 5.
We chose the combination DFS+SL1 as our baseline summarization configuration. This
configuration produced very good results for pdocs and reasonable scores for ndocs. The
combination DFS+SL1 is hereafter referred to as MultiDocSumm, shorthand for a normal multi
document summarizer. MultiDocSumm serves as a baseline to depict the effect of the proposed
novelty detection techniques.
pdocs ROUGE-2 ROUGE-SU4
DFS+SL1 0.11041 0.14628
DFS+SL2 0.10715 0.14270
DFS+KL 0.10155 0.14069
DFS+SFS 0.10494 0.14234
SFS+KL 0.09797 0.13872
SFS+SL1 0.10705 0.14497
PHAL+KL 0.10319 0.13988
PHAL+KL+DFS 0.10442 0.14145
PHAL+KL+SFS 0.09721 0.13693
PHAL+KL+PrepImp 0.10040 0.13817
PHAL+KL+TFIDF 0.09959 0.13828
DFS+SL1+PrepImp 0.11134 0.14757
DFS+SL1+KL 0.10786 0.14630
DFS+SL1+TFIDF 0.11021 0.14634
Table 6.5 ROUGE scores of pdocs for different combinations of features
Several configurations of summarizers are generated, each having one or more novelty de-
tection techniques at the scoring, ranking or summary extraction stages of summarization.
Brief descriptions of these configurations are provided below.
MultiDocSumm + Novelty Features: In this set of configurations, new scoring features
like NF, NW and HKLID are used along with the original features of MultiDocSumm to build
feature vectors.
MultiDocSumm + Re-ranking Measures: The ranked list of MultiDocSumm is reordered using
the similarity measures ITSim and CoSim. The proximity measure (ProximRank) is also used to
re-rank the original ranked list of MultiDocSumm during the sentence ranking stage. In this
set of configurations, the scoring features remain the same.
MultiDocSumm + Novelty Pool: Only sentences from the Novelty Pool (NP) are selected during
the summary extraction stage of MultiDocSumm.
ndocs ROUGE-2 ROUGE-SU4
DFS+SL1 0.09607 0.13761
DFS+SL2 0.09683 0.13675
DFS+KL 0.08954 0.13201
DFS+SFS 0.08368 0.12792
SFS+KL 0.09204 0.13429
SFS+SL1 0.09604 0.13716
PHAL+KL 0.08878 0.13019
PHAL+KL+DFS 0.08694 0.12841
PHAL+KL+SFS 0.08572 0.12587
PHAL+KL+PrepImp 0.08612 0.12696
PHAL+KL+TFIDF 0.08492 0.12582
DFS+SL1+PrepImp 0.09644 0.13867
DFS+SL1+KL 0.09464 0.13705
DFS+SL1+TFIDF 0.09616 0.13831
Table 6.6 ROUGE scores of ndocs for different combinations of features
MultiDocSumm + Novelty Features + Novelty Pool: Novelty features are used in conjunction
with the features of MultiDocSumm, and finally only sentences from the Novelty Pool are
extracted into the summary.
MultiDocSumm + Novelty Features + Re-ranking Measures: Sentences scored with novelty
features along with the original features of MultiDocSumm are re-ranked using the re-ranking
measures.
MultiDocSumm + Novelty Features + Re-ranking Measures + Novelty Pool: This configuration
combines all the proposed novelty detection techniques applied on MultiDocSumm.
Evaluation results of all these configurations in terms of ROUGE-2 and ROUGE-SU4 scores
are presented in table 6.7.
Configuration ROUGE-2 ROUGE-SU4
MultiDocSumm 0.09607 0.13761
MultiDocSumm+NF 0.09895 0.14004
MultiDocSumm+NW 0.09753 0.14045
MultiDocSumm+HKLID 0.09955 0.14023
MultiDocSumm+NF+NW 0.09885 0.14146
MultiDocSumm+NF+HKLID 0.10223 0.14266
MultiDocSumm+NW+HKLID 0.10057 0.14286
MultiDocSumm+NF+NW+HKLID 0.10102 0.14280
MultiDocSumm+ITSim 0.09461 0.13306
MultiDocSumm+CoSim 0.08338 0.12607
MultiDocSumm+ProximRank 0.09933 0.14067
MultiDocSumm+NP 0.09873 0.13977
MultiDocSumm+NF+NP 0.09875 0.14010
MultiDocSumm+NF+NP+ITSim 0.09764 0.13912
Table 6.7 ROUGE scores of different configurations with novelty detection techniques
We participated in the TAC 2009 update summarization track, considered the most reputed
summarization evaluation platform at present. Participating teams included the University of
Ottawa, Peking University, Thomson Reuters Research and EML Research, among others. A total
of 23 teams from around the world competed to produce the best update summaries for the given
test dataset. We compare our approach to the top two performing systems at TAC 2009, from the
International Computer Science Institute, Berkeley (ICSI) and Tsinghua University (THUSUM).
Below we provide brief descriptions of these two approaches.
ICSI: ICSI's approach [17] to sentence selection is based on the maximum coverage model for
summarization. The authors model a summary as the set of sentences that best covers the relevant
concepts in the document set, where concepts are simply word bigrams valued by their document
frequency. The value of a summary is the sum of the unique concept values it contains, thus
limiting redundancy implicitly. The resulting maximization problem is solved with Integer
Linear Programming (ILP). For update summarization they hypothesize that articles about
topics that have already been in the news tend to state new information first before recapping
past details. The values of concepts appearing in first sentences are upweighted according to
this inference.

System                   ROUGE-2  ROUGE-SU4  Overall Responsiveness  Avg Pyramid score
MultiDocSumm+NF+HKLID    0.10223  0.14266    4.614                   0.307
ICSI                     0.10417  0.13959    4.568                   0.290
THUSUM                   0.09608  0.13499    5.023                   0.296
Oracle Summary           0.17619  0.19877    –                       –
Model Summary            0.12436  0.16602    8.682                   0.616
Baseline                 0.05865  0.09333    3.636                   0.175

Table 6.8 Automated and manual evaluation results of TAC systems
THUSUM: The framework of THUSUM is based on the theory of conditional independence among
many objects. They propose an information distance to solve the summarization problem. A
detailed description of the system is presented in [35].
TAC 2009 also provided a baseline that returns the first 100 words of the most recent docu-
ment as the summary for a topic. The evaluation results of these systems at TAC 2009, along
with the best configuration from our experiments, are presented in table 6.8. We also present
the results of the Oracle Summaries, the best possible extractive summaries created in chapter 4,
and of one of the four human written model summaries that is considered a gold standard in the
evaluations. It is evident from the results that our progressive summarizer outperformed the
state of the art approaches in all content based evaluation metrics, including ROUGE, Pyramids
and the overall responsiveness score. Most combinations presented in table 6.7 have better
ROUGE scores than ICSI or THUSUM, showing that our novelty detection techniques are very
effective in detecting relevant novel information.
Analysis of Novelty Detection techniques:
All the configurations in table 6.7, other than the similarity based re-ranking measures,
showed significant improvement over MultiDocSumm. The best results are obtained for the
configuration MultiDocSumm+NF+HKLID, with a 6% improvement in ROUGE-2 and ROUGE-SU4
scores. The proximity based re-ranking technique enhanced the scores by approximately 3%.
The Novelty Pool technique (NP) allowed us to produce progressive summaries by selecting only
sentences with dominant novel words into the summary. The improvement in ROUGE scores is not
substantial when the novelty detection techniques at the scoring, ranking and extraction stages
are combined together: as novel sentences are already scored high through NF and HKLID, the
effect of the re-ranking and filtering techniques is not significant in the combination.
Below we provide the summaries generated by both the generic (MultiDocSumm) and progressive
(MultiDocSumm+NF+HKLID) configurations for a particular topic, "Michael Jackson's child
molestation trial", in the TAC 2009 dataset. The first cluster of documents (pdocs) contains
events about the allegations, investigations and DNA tests conducted on Michael Jackson as part
of the case. The next cluster of documents (ndocs) has articles focusing on events like the
trial, jury selection and Jackson's health issues.
Generic summary for ndocs
Jackson was rushed to the hospital after he vomited in his car as he was being driven to the
Santa Maria court, where he is on trial on charges of sexually molesting a 13-year-old boy.
Trial Judge Rodney Melville unveiled for the first time details of the charges of child molesta-
tion and conspiracy charges against pop icon Michael Jackson. Sneddon alleged that pop icon
Jackson had been in tremendous financial debt, which led him and his aides to hatch a plot to
kidnap the boy and his family and hold them against their will.
Progressive summary for ndocs
The long-awaited child molestation trial of pop superstar Michael Jackson officially got
underway Monday with the judge calling the court to order. Pop icon Michael Jackson was
Tuesday rushed to hospital suffering from the flu, his trial judge said, delaying jury selection
in his child sex trial. Well ahead of schedule, a jury was selected Wednesday for the child
molestation trial of pop star Michael Jackson. Michael Jackson health is in stable condition
but needs further care for persistent viral symptoms. Trial Judge Rodney Melville unveiled
for the first time details of the charges of child molestation
It is clear that the progressive summary informs the user more about the recent events in
the topic. While the generic summary only has information relating to Jackson's deteriorating
health, the progressive summary has information focusing on his health, jury selection and the
proceedings of the trial. It is evident that our novelty techniques are effective in finding
relevant new information for the user.
The huge gap between the oracle summaries and the best systems at TAC (in table 6.8) shows
that there is still much scope for improvement in extractive summarization. The results of the
participating teams are on par with some of the human model summaries in terms of ROUGE, but
far behind in manual evaluations like pyramid scores and overall responsiveness. Extractive
summarizers are hindered by coherence and readability issues, which affect the overall
responsiveness of the summary.
Chapter 7
Conclusions and Future directions
Text summarization is a well-studied problem spanning multiple disciplines
like cognitive science, information access and natural language processing. It has been viewed
as a decision theory problem, as a classification problem, as a (lossy/lossless) data compression
problem and as an information retrieval problem. It has been an active area of research
for the last four decades and has branched out into several areas. Progressive summarization
is a recent development in the text summarization community, much popularized after its intro-
duction at the Text Analysis Conference (TAC) in 2007. The task of progressive summarization
is to produce informative and human-readable summaries about a particular topic under the
assumption that the user has gained prior knowledge about the same topic by reading a set
of documents. The challenging part of progressive summarization is to identify information
that is both relevant and novel given the prior knowledge of the user, and then present it in the
form of a summary.
The traditional sentence ranking stage of summarization uses a weighted linear combination of
individual feature scores. As the feature space grows, it becomes more difficult to come up with
an ideal weight combination to compute the rank. In our work we use a supervised learning
algorithm, regression, to estimate sentence importance from feature vectors. This allowed
us to experiment with a wide variety of feature combinations without worrying about the optimal
weights to combine them. We used a good number of features, ranging from language model features
like PHAL and KL, to document collection statistics like DFS and SFS, and heuristic features
like SL1, SL2 and PrepImp. Experiments supported our intuition that regression estimates
sentence importance better than any single feature by itself. We carried out an extensive
analysis over all possible feature combinations and identified the most successful and stable
combination to be our baseline generic summarizer. This baseline (MultiDocSumm) is used to
depict the effect of our proposed novelty detection techniques in progressive summarization.
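The regression-based ranking described above can be sketched as follows. The thesis trains Support Vector Regression (via LIBSVM) on feature vectors; to keep this sketch dependency-free, plain least-squares linear regression stands in for SVR, and the feature values and importance targets below are invented toy numbers, not the actual PHAL/DFS/SL feature values.

```python
# Sketch of regression-based sentence ranking. Illustrative only: the
# thesis uses Support Vector Regression (LIBSVM); ordinary least squares
# stands in here, and all numbers are toy values.

def fit_linear(X, y):
    """Least-squares weights w for y ~ X.w via normal equations."""
    n = len(X[0])
    A = [[sum(x[i] * x[j] for x in X) for j in range(n)] for i in range(n)]
    b = [sum(x[i] * t for x, t in zip(X, y)) for i in range(n)]
    # Gaussian elimination with partial pivoting.
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, n):
            f = A[r][col] / A[col][col]
            for c in range(col, n):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    w = [0.0] * n
    for r in range(n - 1, -1, -1):
        w[r] = (b[r] - sum(A[r][c] * w[c] for c in range(r + 1, n))) / A[r][r]
    return w

def rank_sentences(sentences, feats, w):
    """Order sentences by the model's predicted importance, highest first."""
    scored = [(sum(wi * fi for wi, fi in zip(w, f)), s)
              for s, f in zip(sentences, feats)]
    return [s for _, s in sorted(scored, reverse=True)]

# Toy training data: each vector holds two hypothetical feature scores
# (e.g. a DF-based score and a position score); the targets play the role
# of importance labels such as ROUGE overlap with model summaries.
X_train = [[0.9, 1.0], [0.2, 0.1], [0.6, 0.5], [0.1, 0.9]]
y_train = [0.95, 0.10, 0.55, 0.40]
w = fit_linear(X_train, y_train)

sents = ["s1", "s2", "s3"]
feats = [[0.8, 0.9], [0.1, 0.2], [0.5, 0.4]]
ranking = rank_sentences(sents, feats, w)
```

The advantage claimed in the text shows up here: adding a feature only widens the vectors; the learner re-estimates the weights, so no manual weight tuning is needed.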
In this thesis, we addressed the problem of progressive summarization by devising novelty
detection techniques at various stages of extractive summarization. We treated the problem in
a unique way, by projecting the importance of having a novelty detection module in the sum-
marization framework. At the feature extraction stage, new sentence scoring
features like NF, HKLID and NW are devised to capture the novelty of a sentence along with its
relevance. Two re-ranking techniques, redundancy re-ranking and proximity re-ranking, are
also proposed in this work to reorder the list of ranked sentences, promoting novel sentences
in the ranked list. A new content-based similarity measure, Information Theoretic distance (IT-
Dist), is used along with the traditional cosine similarity measure for computing similarity between
sentences. At the summary extraction stage, a filtering strategy is adopted by
only selecting sentences from the Novelty Pool (NP) into the summary.
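The Novelty Pool filter at the summary extraction stage can be illustrated with a minimal sketch. The word-overlap novelty score and the 0.3 threshold below are assumptions of this sketch, not the thesis's exact formulation; only the overall shape — eligible sentences must clear a novelty bar against the prior documents (pdocs) — follows the description above.

```python
# Illustrative Novelty Pool (NP) filter: only sentences scoring above a
# novelty threshold against the prior-document vocabulary are eligible
# for the summary. Score and threshold are assumptions of this sketch.

def novelty_score(sentence, pdoc_vocab):
    """Fraction of a sentence's words unseen in the prior documents."""
    words = sentence.lower().split()
    return sum(1 for w in words if w not in pdoc_vocab) / len(words)

def build_novelty_pool(ranked_sentences, pdoc_vocab, threshold=0.3):
    """Keep ranked order, but drop sentences that are mostly old material."""
    return [s for s in ranked_sentences
            if novelty_score(s, pdoc_vocab) >= threshold]

pdoc_vocab = {"jackson", "trial", "charges", "molestation"}
ranked = [
    "jackson trial charges",       # every word previously seen -> dropped
    "jury selected for trial",     # mostly new -> kept
    "jackson rushed to hospital",  # mostly new -> kept
]
pool = build_novelty_pool(ranked, pdoc_vocab)
```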
By devising novelty detection techniques at various stages, we are able to combine multi-
ple novelty detection techniques, unlike most of the previous work. A detailed analysis of the
effect of detecting novelty at each of these stages is also performed, and the experimental re-
sults show that novelty is best detected at the feature extraction/scoring stage. The Novelty Pool (NP)
improved the quality of summaries by keeping probably redundant sentences out of the summary.
Proximity-based re-ranking helped us produce better progressive summaries by computing
the importance of a sentence based on the relevant novelty of its surrounding sentences. The
similarity measures used for re-ranking at the sentence ranking stage, ITSim and CoSim,
did not improve the quality of progressive summaries. Since CoSim is a word overlap mea-
sure, and novel information is often embedded within a sentence containing formerly known
information, the quality of progressive summaries declined. ITSim performs better than CoSim
because it considers the entropy of a word in similarity computations, which is a better estimate of
information.
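The contrast between CoSim and ITSim can be made concrete with a small sketch: plain cosine over word counts versus an entropy-weighted measure in the spirit of ITSim and Lin's information-theoretic similarity [34], where rare (high-information) words dominate the comparison. The corpus probabilities and the exact weighting below are illustrative assumptions, not the thesis's formula.

```python
# CoSim vs an ITSim-style measure (sketch under assumed corpus stats).
import math
from collections import Counter

def cosine(a, b):
    """Plain cosine similarity over word counts (CoSim)."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[w] * cb[w] for w in ca)
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0

def info(word, p_corpus):
    # Self-information -log p(w): rarer words carry more information.
    return -math.log(p_corpus.get(word, 1e-4))

def it_sim(a, b, p_corpus):
    """Entropy-weighted overlap in the spirit of ITSim (assumed form)."""
    shared = set(a) & set(b)
    total = (sum(info(w, p_corpus) for w in set(a)) +
             sum(info(w, p_corpus) for w in set(b)))
    return 2 * sum(info(w, p_corpus) for w in shared) / total if total else 0.0

s1 = "the trial of jackson".split()
s2 = "the verdict of melville".split()
# Toy unigram probabilities: function words frequent, content words rare.
p = {"the": 0.1, "of": 0.1, "trial": 0.01,
     "jackson": 0.001, "verdict": 0.001, "melville": 0.001}
```

Here the two sentences share only the frequent words "the" and "of": cosine rates them fairly similar (0.5), while the entropy-weighted measure discounts those low-information matches and scores the pair much lower.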
7.1 Future Directions
In this thesis we focused on progressive summarization of news topics. We tried to address
the problems that arise for a typical reader who is trying to follow a temporal topic span-
ning months, years or even longer. Although the techniques proposed here are adaptable
across other domains, it would make an interesting problem to apply progressive summarization
to online product reviews or novels/books. Progressive summaries of online product reviews
would be of great value to both customers and vendors. Similar to news topics, the vendor is
interested to know when there is a shift in the reviews of a particular product, and the user wants
to know about reviews that differ from the ones he has already read. Progressive
summarization is also applicable to summarizing chapters of novels/books.
The Novelty Factor (NF) described here is an extension of the popular Document Frequency
(DF) feature that uses the ratio of document frequencies of words in pdocs and
ndocs. In future, NF can be developed into a more sophisticated feature capturing language
models of both pdocs and ndocs. Currently we are only using the frequency of prepositions in
a sentence as its measure of importance (PrepImp), and that did not produce the effective results
we anticipated. But we strongly believe that prepositions are strong indicators of important
entities in a sentence and can be exploited in better ways in the future. Although the features
described in this work are simple, we believe that the novel treatment of the problem will
inspire a lot of new techniques at each stage of summarization.
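One plausible instantiation of NF, following the description above, is the share of a word's document frequency that comes from the new cluster (ndocs) versus the prior cluster (pdocs); the thesis's exact formula is defined in an earlier chapter, so treat this form as an assumption of the sketch.

```python
# Sketch of a Novelty Factor feature (assumed ratio form, not the
# thesis's exact formula): words frequent in ndocs but rare in pdocs
# score high; words already common in pdocs score low.

def doc_freq(word, docs):
    """Number of documents (word sets) containing the word."""
    return sum(1 for d in docs if word in d)

def novelty_factor(word, pdocs, ndocs):
    pdf = doc_freq(word, pdocs)
    ndf = doc_freq(word, ndocs)
    return ndf / (pdf + ndf) if (pdf + ndf) else 0.0

def sentence_nf(sentence, pdocs, ndocs):
    """Average NF over the sentence's words, as a sentence-level score."""
    words = sentence.split()
    return sum(novelty_factor(w, pdocs, ndocs) for w in words) / len(words)

# Toy clusters represented as sets of words per document.
pdocs = [{"jackson", "charges", "molestation"}, {"jackson", "accuser"}]
ndocs = [{"jury", "selection", "trial"}, {"jury", "verdict", "jackson"}]

old_sent = "jackson charges"  # vocabulary dominated by pdocs -> low NF
new_sent = "jury selection"   # appears only in ndocs -> high NF
```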
The similarity measures used in this work are simple content-overlap measures between two
units. As redundancy detection is a complex task, there is a need for sophisticated semantic
similarity measures that can capture the semantic relatedness between two text units. Exploiting
encyclopedic knowledge from Wikipedia or social bookmarking tags would help in computing
the semantic distance between two concepts.
The current summarization system does not produce summaries of high linguistic quality, since
no special care is taken concerning the readability of the summary. There is a lot
of scope to improve the grammatical quality and coherence of the summary through coreference
resolution, along with content quality.
The current state-of-the-art summarization systems are all extractive in nature, but the com-
munity is gradually progressing towards abstractive summarization [16]. Although complete
abstractive summarization would require deeper natural language understanding and process-
ing, a hybrid or shallow abstractive summarization can be achieved through sentence compres-
sion and textual entailment techniques. Textual entailment helps in detecting shorter versions
of a text that carry the same meaning as the original, and with it we can produce
more concise summaries. A recent development in summarization, intro-
duced at TAC 2010, is guided summarization, where the user's information need is represented
as a template of aspects instead of a query. The summary is expected to cover answers for
all the aspects, along with any other relevant information. The template of aspects may vary
depending upon the category of the topic. Guided summarization initiated the use of informa-
tion extraction techniques in summarization, which may very well lead to a shallow abstractive
summary.
Research in summarization continues to enhance diversity and information richness,
and strives to produce coherent and focused answers to the user's information need.
Related Publications
Praveen Bysani, Vijay Bharat, Vasudeva Varma. Modeling Novelty and Feature Combina-
tion Using Support Vector Regression for Update Summarization. The 7th International
Conference on Natural Language Processing (ICON 2009), India, December 2009
Praveen Bysani. Novelty Detection in the context of Progressive Summarization. At the Stu-
dent Research Workshop in the 11th annual Conference of the North American Chapter of the
Association for Computational Linguistics (NAACL-HLT 2010), Los Angeles, June 2010
Praveen Bysani, Kranthi Reddy, Vasudeva Varma, et al. IIIT Hyderabad at TAC 2009. In
Proceedings of the Text Analysis Conference (TAC 2009), Maryland, USA, November 2009
Bibliography
[1] J. Allan, R. Gupta, and V. Khandelwal. Topic models for summarizing novelty. 2001.
[2] J. Allan, C. Wade, and A. Bolivar. Retrieval and novelty detection at the sentence level.
In SIGIR ’03: Proceedings of the 26th annual international ACM SIGIR conference on
Research and development in information retrieval, pages 314–321, New York, NY, USA,
2003. ACM.
[3] A. Harnly, A. Nenkova, R. Passonneau, and O. Rambow. Automation of summary eval-
uation by the pyramid method. In Proceedings of the Conference of Recent Advances
in Natural Language Processing (RANLP), page 226, 2005.
[4] C. Aone, M. E. Okurowski, J. Gorlinsky, and B. Larsen. A trainable summarizer with
knowledge acquired from robust NLP techniques, pages 71–80. 1999.
[5] R. Barzilay and M. Elhadad. Using lexical chains for text summarization, 1997.
[6] R. Barzilay and M. Lapata. Modeling local coherence: an entity-based approach. In
ACL ’05: Proceedings of the 43rd Annual Meeting on Association for Computational
Linguistics, pages 141–148, Morristown, NJ, USA, 2005. Association for Computational
Linguistics.
[7] F. Boudin and J.-M. Torres-Moreno. A cosine maximization-minimization approach for
user-oriented multi-document update summarization. In Proceedings of Recent
Advances in Natural Language Processing (RANLP), 2007.
[8] C. Burges, T. Shaked, E. Renshaw, A. Lazier, M. Deeds, N. Hamilton, and G. Hullen-
der. Learning to rank using gradient descent. In ICML '05: Proceedings of the 22nd
international conference on Machine learning, pages 89–96, New York, NY, USA, 2005.
ACM.
[9] J. Carbonell and J. Goldstein. The use of mmr, diversity-based reranking for reordering
documents and producing summaries. In SIGIR ’98: Proceedings of the 21st annual inter-
national ACM SIGIR conference on Research and development in information retrieval,
pages 335–336, New York, NY, USA, 1998. ACM.
[10] C.-C. Chang and C.-J. Lin. LIBSVM: a library for support vector machines. 2001.
[11] J. Conroy and D. P. O'Leary. Text summarization via hidden markov models and pivoted
QR matrix decomposition. In SIGIR, 2001.
[12] J. M. Conroy. A hidden markov model for the trec novelty task. 2003.
[13] J. M. Conroy, J. Goldstein, J. D. Schlesinger, and D. P. O'Leary. Left-brain/right-brain
multi-document summarization. In Proceedings of the Document Understanding Con-
ference (DUC), 2004.
[14] H. P. Edmundson. New methods in automatic extracting. J. ACM, 16(2):264–285, 1969.
[15] D. Eichmann, Y. Zhang, S. Bradshaw, X. Y. Qiu, P. Srinivasan, and A. Kumar.
Novelty, question answering and genomics: The University of Iowa response. 2004.
[16] P.-E. Genest and G. Lapalme. Text generation for abstractive summarization. 2010.
[17] D. Gillick, B. Favre, D. Hakkani-Tur, B. Bohnet, Y. Liu, and S. Xie. The icsi/utd summa-
rization system at tac 2009. 2009.
[18] U. Hahn and I. Mani. The challenges of automatic summarization. Computer, 33(11):29–
36, 2000.
[19] C. Huang, D.-D. Liu, and J.-S. Wang. Forecast daily indices of solar activity, using support
vector regression method. In Research in Astronomy and Astrophysics, vol. 9. RAA, 2009.
[20] J. Jagarlamudi, P. Pingali, and V. Varma. A relevance-based language modeling approach
to DUC 2005. In Document Understanding Conference, 2005.
[21] H. Jing, R. Barzilay, K. McKeown, and M. Elhadad. Summarization evaluation methods:
Experiments and analysis. In AAAI Symposium on Intelligent Summarization, pages
60–68, 1998.
[22] K. S. Jones. Automatic summarising: Factors and directions. In Advances in Automatic
Text Summarization, pages 1–12. MIT Press, 1998.
[23] I. Kastner and C. Monz. Automatic single-document key fact extraction from newswire
articles. In Proceedings of the 12th Conference of the European Chapter of the ACL
(EACL 2009), pages 415–423, Athens, Greece, March 2009. Association for Computa-
tional Linguistics.
[24] R. Katragadda, P. Pingali, and V. Varma. Sentence position revisited: a robust light-weight
update summarization ’baseline’ algorithm. In CLIAWS3 ’09: Proceedings of the Third
International Workshop on Cross Lingual Information Access, pages 46–52, Morristown,
NJ, USA, 2009. Association for Computational Linguistics.
[25] S. Kullback and R. A. Leibler. On information and sufficiency. The Annals of Mathemat-
ical Statistics, pages 79–86, 1951.
[26] J. Kupiec, J. Pedersen, and F. Chen. A trainable document summarizer. In Proceedings
of ACM SIGIR '95, pages 68–73. ACM, 1995.
[27] R. R. Larson. A logistic regression approach to distributed ir. In SIGIR ’02: Proceedings
of the 25th annual international ACM SIGIR conference on Research and development in
information retrieval, pages 399–400, New York, NY, USA, 2002. ACM.
[28] D. J. Lawrie. Language models for hierarchical summarization, 2003.
[29] S. Li, Y. Ouyang, W. Wang, and B. Sun. Multi-document summarization using support
vector regression. In DUC 2007 notebook, 2007. Document Understanding Conference,
November 2007.
[30] X. Li and W. B. Croft. Novelty detection based on sentence level patterns. In CIKM ’05:
Proceedings of the 14th ACM international conference on Information and knowledge
management, pages 744–751, New York, NY, USA, 2005. ACM.
[31] C.-Y. Lin. Looking for a few good metrics: Automatic summarization evaluation
- how many samples are enough? In Proceedings of the NTCIR Workshop 4, June 2004.
[32] C.-Y. Lin and E. Hovy. Identifying topics by position, 1997.
[33] C.-Y. Lin. Rouge: A package for automatic evaluation of summaries. pages 74–81,
Barcelona, Spain, July 2004. Association for Computational Linguistics.
[34] D. Lin. An information-theoretic definition of similarity. In ICML ’98: Proceedings of the
Fifteenth International Conference on Machine Learning, pages 296–304, San Francisco,
CA, USA, 1998. Morgan Kaufmann Publishers Inc.
[35] C. Long, M. Huang, and X. Zhu. Tsinghua university at tac 2009: Summarizing multi-
documents by information distance. 2009.
[36] H. P. Luhn. The automatic creation of literature abstracts. IBM J. Res. Dev., 2(2):159–165,
1958.
[37] I. Mani. Multi-document summarization by graph search and matching. In In Proceedings
of the Fifteenth National Conference on Artificial Intelligence (AAAI-97), pages 622–628.
AAAI, 1997.
[38] D. Metzler and T. Kanungo. Machine learned sentence selection strategies for query-
biased summarization. sigir learning to rank workshop, 2008.
[39] G. A. Miller. Wordnet: a lexical database for english. Commun. ACM, 38(11):39–41,
1995.
[40] E. Pitler and A. Nenkova. Revisiting readability: A unified framework for predicting text
quality.
[41] D. Radev, T. Allison, S. Blair-Goldensohn, J. Blitzer, A. Çelebi, S. Dimitrov, E. Drabek,
A. Hakim, W. Lam, D. Liu, J. Otterbacher, H. Qi, H. Saggion, S. Teufel, A. Winkel, and
Z. Zhang. MEAD - a platform for multidocument multilingual text summarization. In
LREC 2004, 2004.
[42] D. Radev, H. Jing, and M. Budzikowska. Centroid-based summarization of multi-
ple documents: Sentence extraction, utility-based evaluation, and user studies. In
ANLP/NAACL Workshop on Summarization, pages 21–29, 2000.
[43] G. Salton and C. Buckley. Term weighting approaches in automatic text retrieval. Tech-
nical report, Ithaca, NY, USA, 1987.
[44] B. Schiffman and K. R. McKeown. Context and learning in novelty detection. In HLT ’05:
Proceedings of the conference on Human Language Technology and Empirical Methods
in Natural Language Processing, pages 716–723, Morristown, NJ, USA, 2005. Associa-
tion for Computational Linguistics.
[45] F. Schilder and R. Kondadandi. Fastsum: fast and accurate query-based multi-document
summarization. In Proceedings of the 46th Annual Meeting of the Association for Com-
putational Linguistics on Human Language Technologies. Human Language Technology
Conference, 2008.
[46] I. Soboroff and D. Harman. Novelty detection: The TREC experience. In HLT/EMNLP,
pages 105–112, 2005.
[47] K. M. Svore. Enhancing single-document summarization by combining ranknet and third-
party sources, 2007.
[48] S. Teufel and M. Moens. Summarizing scientific articles: experiments with relevance and
rhetorical status. Comput. Linguist., 28(4):409–445, 2002.
[49] M.-F. Tsai, M.-H. Hsu, and H.-H. Chen. Similarity computation in novelty detection.
2004.
[50] P. Venkataraman, S. Dulluri, and N. R. S. Raghavan. Short-term forecasting of nifty index
using support vector regression. In ICFAI Journal of Applied Finance, January 2006.
[51] M. J. Witbrock and V. O. Mittal. Ultra-summarization (poster abstract): a statistical
approach to generating highly condensed non-extractive summaries. In SIGIR ’99: Pro-
ceedings of the 22nd annual international ACM SIGIR conference on Research and de-
velopment in information retrieval, pages 315–316, New York, NY, USA, 1999. ACM.
[52] C. Zhai and J. Lafferty. A study of smoothing methods for language models applied to
information retrieval. ACM Trans. Inf. Syst., 22(2):179–214, 2004.
[53] C. Zhai and J. Lafferty. Two-stage language models for information retrieval. In Pro-
ceedings of the 25th Annual International ACM SIGIR Conference on Research and De-
velopment in Information Retrieval, Tampere, Finland, August 11–15, 2002.
[54] J. Zhang, Y. Yang, and J. Carbonell. New event detection with nearest neighbour, support
vector machines, and kernel regression. March 2003.