Progressive Summarization: Summarizing relevant and
novel information
by
Praveen Bysani
Thesis submitted in partial fulfillment
of the requirements for the degree of
Master of Science (By Research) in
Computer Science and Engineering
Search and Information Extraction Lab
International Institute of Information Technology, Hyderabad
December 2010
Copyright © Praveen Bysani, 2010
All Rights Reserved
International Institute of Information Technology
Hyderabad, India
CERTIFICATE
It is certified that the work contained in this thesis, titled "Progressive Summarization: Summarizing relevant and novel information" by Praveen Bysani, submitted in partial fulfillment for the award of the degree of Master of Science (By Research) in Computer Science and Engineering, has been carried out under my supervision and has not been submitted elsewhere for a degree.
Date Adviser: Dr. Vasudeva Varma
Associate Professor
IIIT Hyderabad
To all the good, bad and evil people around me
Acknowledgments
I heartily thank Dr. Vasudeva Varma, my thesis advisor, for the guidance, support and encouragement he provided throughout my journey in SIEL. He helped me transform from an average undergraduate into a successful postgraduate. I sincerely acknowledge Dr. Prasad Pingali for his valuable suggestions and the discussions that stimulated me to work towards my thesis. It would be a crime if I did not mention Rahul Katragadda, for he is such a responsible mentor and helped me a great deal with my communication skills. I thank everyone in SIEL for their assistance: Mr. Mahender, for doing all the painful paperwork, and Mr. Babji, the system administrator of the lab.
I take this opportunity to thank all the anonymous reviewers of my work at ICON 2009 and NAACL 2010. I thank Dr. Rajeev Sangal for providing me an opportunity to travel to Los Angeles, California, to present my work. I will always be thankful to the Student Research division of NAACL for supporting my travel and stay at NAACL 2010.
I feel blessed to be in a group of really good friends. I cherish each and every moment of my non-academic life in OBH. I feel lucky to have been associated with Vijay Bharat during my initial days at SIEL. He is a major contributor to the preliminary work in my thesis and also to restructuring the code base of summarization. I am greatly indebted to Sai Krishna, our senior, who shared invaluable thoughts and mentored me during my honors and semester projects. I also thank my peers and my juniors for supporting me during the Text Analysis Conference (TAC) and for their valuable inputs during my thesis.
If it were not for my family, I would not have sustained all the pressure with such ease. I am fortunate to have my parents, Sai and Rani, and my sister Anusha, who helped me evolve into a responsible person.
Abstract
The amount of textual and multimedia information on the World Wide Web has been increasing manyfold every year. A user seeking information on the web is often overloaded by search engines and information retrieval systems with a colossal number of related documents meant to satisfy his information need. In this context, it has become increasingly important to develop information access systems that provide focused and precise answers to the user. Text summarization is a popular information access solution to the information overload problem.
The Internet allows its users to follow any popular, temporally evolving topic on the web. A temporal topic has many publishing sources, and the user cannot cope with the huge amount of raw information from news aggregators and blogs. In such a scenario it is not sensible to wait for the topic to complete before producing a summary, nor does it make sense to produce an overall summary at every time interval. In this thesis we study a variant of text summarization, "Progressive Summarization", that focuses only on relevant and novel information and produces an informative, non-redundant summary of the topic. We provide progressive summaries at regular time intervals that help update the user's knowledge about the topic.
In this work we focus only on extractive methods of summarization, where only text units from the document collection are used to produce a summary. A sentence is considered the basic text unit of the summary. Extractive summarizers generally follow a sequential framework that includes Pre-processing of text for sentence boundary identification and extraction; a Feature Extraction stage, where several statistical, linguistic and heuristic models are employed to score sentences; a Sentence Ranking stage, which estimates sentence importance through a weighted linear combination of the features; and finally Summary Extraction, during which a subset of the ranked sentences is selected into the summary.
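The weighted-linear ranking stage can be sketched in a few lines. This is only an illustrative sketch: the feature names, weights, query and sentences below are hypothetical toy values, not the features or weights used by the thesis system.

```python
# Sketch of an extractive ranking step: each sentence receives feature scores,
# which are combined by a weighted linear sum; the top-ranked subset forms the
# summary. Feature names and weights here are hypothetical.

def rank_sentences(sentences, features, weights):
    """Score each sentence as a weighted linear combination of its features."""
    scored = []
    for s in sentences:
        score = sum(weights[name] * f(s) for name, f in features.items())
        scored.append((score, s))
    return [s for _, s in sorted(scored, reverse=True)]

# Two toy features: a capped length score and keyword overlap with a query.
query = {"storm", "damage"}
features = {
    "length": lambda s: min(len(s.split()) / 20.0, 1.0),
    "overlap": lambda s: len(query & set(s.lower().split())) / len(query),
}
weights = {"length": 0.3, "overlap": 0.7}

docs = ["The storm caused heavy damage to the coast.",
        "Officials met on Tuesday.",
        "Relief work continues."]
# Summary extraction here is simply "take the top-ranked sentence".
summary = rank_sentences(docs, features, weights)[:1]
```

In the thesis framework the feature set is far richer (Chapters 4 and 5) and the weights are learned by regression rather than fixed by hand, but the combination step has this same linear form.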
The most important factor in extracting important items for a progressive summary is the identification of novel information. Progressive summarization requires differentiating among information that is Relevant and Novel, Non-Relevant and Novel, and Relevant but Redundant. Since the existing features succeed in capturing only relevance, we devise multiple new features (NF, NW, HKLID) to capture the novelty of a sentence along with its relevance. We build alternative methods to incorporate novelty into the conventional summarization framework. We devise a new re-ranking measure, Proximity Re-ranking (ProximRank), that computes the rank of a sentence based on the relevant novelty of its surrounding sentences. We model novelty detection in the context of progressive summarization as an information filtering task: sentences that possibly contain prior information are filtered out of the summary by creating a Novelty Pool (NP). These methods are applied at different stages of summarization and evaluated against each other to find the best. In this thesis we also discover the importance of prepositions in determining the salience and relevance of a sentence to the topic. In addition, we use a machine learning technique (regression) to estimate sentence importance from its feature vector, thus overcoming the problem of determining the ideal weight combination, which requires an ample amount of experiments and human judgment when several features are used.
The techniques described in this thesis are used in building a progressive summarization system (Siel 09) that outperformed all 43 other participating systems in the Text Analysis Conference (TAC 2009) in manual evaluations.
Contents

1 Introduction
  1.1 Automatic Text Summarization
  1.2 Types of Summarization
  1.3 Novelty Detection
  1.4 Evaluation
    1.4.1 Summary Evaluation
      1.4.1.1 Evaluation Workshops
    1.4.2 Novelty Evaluation
  1.5 Organization of thesis
2 Problem Statement
  2.1 Motivation
  2.2 Problem Definition
  2.3 Outline of the Solution
  2.4 Contributions
3 Related Work
  3.1 State of the art approaches in Summarization
    3.1.1 Machine Learning Approaches
  3.2 Novelty Detection Approaches
  3.3 Approaches to Progressive/Temporal Summarization
4 Supervised sentence ranking using Regression
  4.1 Summarization Framework
    4.1.1 Stages of Framework
  4.2 Feature combination using Support Vector Regression
    4.2.1 Motivation
    4.2.2 Support Vector Regression (SVR)
      4.2.2.1 Sentence Importance Estimation
    4.2.3 Feature Combination
  4.3 Extraction of sentence relevancy features
    4.3.1 Sentence position
    4.3.2 TF-IDF
    4.3.3 Document Frequency Score (DFS)
    4.3.4 Sentence Frequency Score (SFS)
    4.3.5 Probabilistic hyperspace analogue to language (PHAL)
    4.3.6 Kullback-Leibler divergence (KLD)
    4.3.7 Prepositional Importance (PrepImp)
    4.3.8 Oracle Summaries
  4.4 Summary
5 Progressive Summarization: Summarization with Novelty Detection
  5.1 Feature Extraction level
    5.1.1 Novelty Factor (NF)
    5.1.2 New Word Measure (NW)
    5.1.3 Hybrid Kullback-Leibler Information Divergence (HKLID)
  5.2 Sentence Ranking Level
    5.2.1 Redundancy Re-ranking
    5.2.2 Proximity Re-ranking (ProximRank)
  5.3 Summary Extraction Level
  5.4 Summary
6 Evaluation
  6.1 Dataset
  6.2 Evaluation Metrics
    6.2.1 ROUGE
    6.2.2 Pyramids
    6.2.3 Readability and overall responsiveness
  6.3 Evaluation of Supervised ranking
    6.3.1 Kernel Functions
    6.3.2 Regression vs. Weighted Linear scoring
    6.3.3 Combination of features
  6.4 Evaluation of Progressive Summarization
7 Conclusions and Future directions
  7.1 Future Directions
Bibliography
List of Figures

4.1 Sample news article from AQUAINT news corpus
4.2 Stages in a Multi Document Summarizer
5.1 Novelty detection at different stages in a Multi Document Summarizer
6.1 Sample topic and narrative in TAC 2008
List of Tables

5.1 Statistics of relevant, novel and consecutive relevant sentences in TREC 2003
6.1 ROUGE-2, ROUGE-SU4 scores of pdocs using different kernels
6.2 ROUGE-2, ROUGE-SU4 scores of ndocs using different kernels
6.3 ROUGE-2 scores of pdocs and ndocs while using Regression to estimate the sentence importance
6.4 ROUGE-SU4 scores of pdocs and ndocs while using Regression to estimate the sentence importance
6.5 ROUGE scores of pdocs for different combinations of features
6.6 ROUGE scores of ndocs for different combinations of features
6.7 ROUGE scores of different configurations with novelty detection techniques
6.8 Automated and Manual evaluation results of TAC systems
Chapter 1
Introduction
With ever-growing content on the World Wide Web, it has become increasingly difficult for users to search for useful information. The rapid growth of news portals, blogs and social networking sites has led to an enormous surge of online content. Search engines, which are supposed to satisfy users' information needs, have more information to offer than what is required. This problem is referred to as information overload. In this context, it has become increasingly important to develop information access solutions that provide easy and efficient access to users. Automatic summarization systems address the information overload problem by producing a summary of related documents that provides an overall understanding of the topic without having to go through every document.
1.1 Automatic Text Summarization
Text summarization is the process of condensing text to its most essential points. Although the definition of summarization is obvious, it needs to be emphasized that summarizing is a hard problem. A summarization system has to interpret the source content, where content is a matter of both information and expression, and identify important information, where importance is a matter of both salience and essence. Summarization is challenging for its inherent cognitive process, as an ideal summarization system has to mimic a human mind in the process of abstracting. Summarization is also interesting for its practical and real-
life applications. Researchers [22] have postulated summarization as a tripartite processing model:
1. Topic Identification: An initial exploration to identify the genre and topics of source text.
Most important units of text are identified using several independent modules.
2. Interpretation: Important topics are fused, and expressed in a new formulation using
concepts that are not explicitly contained in the input.
3. Summary Generation: Unreadable abstract representations from interpretation are trans-
formed into a coherent human readable format.
Each major process may subsume several sub-processes depending on the context and purpose.
1.2 Types of Summarization
Summarization systems have been categorized into several types based on input factors, like language, media and genre, and purpose factors, like audience, use and situation. The following are a few popular types of summarization, classified based on the medium of content:
• Document Summarization: Summarizing information in the form of digital text is referred to as document summarization. It is the most focused area of text summarization, with almost five decades of research. Document summarization branched out into single-document and multi-document summarization over the course of time. News article summarization and scientific paper abstraction are two popular areas of document summarization. Focused workshops like the Document Understanding Conference (DUC) provided a common platform and set evaluation benchmarks that have cultivated interest and enabled researchers to participate in large-scale experiments.
• Opinion Summarization/Blog Summarization: With the advent of Web 2.0 and the flourishing growth of blogs and forums, people are now able to express their opinions through blog posts and reviews. It is important to understand the opinions of people on a particular product, for a business organization to devise commercial strategies, or for an individual to analyse the reviews on a topic of his interest. Since there are millions of people writing their opinions every day, mining knowledge from this huge amount of information is challenging. In this scenario, an opinion summarization system that extracts, analyses and summarizes opinions will be useful. Recently, opinion mining has received huge interest in the information systems and language technologies communities through the International Conference on Weblogs and Social Media (ICWSM) and the Text Analysis Conference (TAC) opinion summarization and opinion question answering tasks.
• Book Summarization: Books represent one of the oldest forms of written communication and have been used as a means to store and transmit information. An increasingly large number of books is becoming available in electronic format through projects such as Gutenberg 1 and the Million Books project 2. This escalates the need for language processing techniques that can handle very large documents such as books. A book summarization system can be used to produce short abstracts of every chapter in a novel or a technical book. A user can then skim through summaries of previous chapters to refresh his memory of the book so far. Alternatively, it can be used to produce a summary of the whole book.
• Speech Summarization: There has been an explosive growth of multimedia content on the World Wide Web due to the availability of broadcast radio and news channels. The amount of audio content is only going to grow with the availability of cheap mass storage. This has necessitated systems that can efficiently process huge amounts of audio data. Speech summarization is one solution to this problem, with a wide variety of applications. Broadcast news summarization is a popular area within speech summarization, where it serves the purpose of summarizing the important content of a news show. It can also be used to summarize long voice mails, saving a lot of time for the user.
• Video/Multimedia Summarization: The growing availability of multimedia software and hardware platforms makes multimedia summarization an important application area of summarization. There is a huge amount of multimedia content available on the web in
1 http://gutenberg.org
2 http://archive.org/details/millionbooks
the form of images, speech, video and flash. Research in this area is evolving rapidly, with many developments taking place outside the summarization community, within the digital libraries, speech understanding, multimedia and other communities. A multimedia summarizer could summarize movies and video lectures and allow users to skip the lengthy, boring parts.
Apart from the medium of content, summaries are also classified into many categories depending on the context within which the summary is intended to be used. Below we discuss a few such popular dimensions of summarization:
• Extract vs. Abstract: Abstractive methods generate a summary from an abstract representation of the source documents, which may contain sentences that are not necessarily present in the document set. Extractive methods rely only on sentences in the original document set. Extraction is the process of selecting important units from the original documents and presenting them in the form of a summary. Although there have been some efforts to generate abstract summaries [18], extraction still remains the most feasible approach, and the dominant portion of the work in summarization is based on extraction. The focus of this work is on extraction-based document summarization, with sentences as the primary units.
• Single Document vs. Multi Document: Text summarization has progressed from single-document summarization to the more challenging problem of multi-document summarization. Generating a summary for a set of multiple related documents on a topic is a more difficult task, as the documents are likely to contain similar content. Concatenation of individual single-document summaries does not necessarily produce a multi-document summary.
• Query Focused vs. Generic: A generic summarizer produces summaries that encapsulate the most salient points of the source document set. On the contrary, a query-focused summarization system has access to the user's information need in the form of a query and tailors its summary accordingly. With the growth of online search and retrieval, query-focused summarizers would provide a better output than generic systems.
• Personalized: The interpretation of a piece of text depends on the domain knowledge and personal interests of a human. The notion of importance and relevance changes from person to person. Normal summarization systems produce a uniform summary for all users, irrespective of their personal interests. A personalized summarizer caters to the user's personal background and interests. Hence, a personalized summary changes in accordance with the preferences of the reader.
• Progressive Summarization/Temporal Summarization: Temporal summarization is targeted at users who have access to a rapidly flowing stream of articles on a topic and have no time to look at each article. In such a situation, a person would prefer to be updated on events within the topic, and dive into details only when reported events trigger enough interest.
It is not sensible to wait for the topic to complete before producing a summary, nor does it make sense to produce an overall summary at every time interval; after all, the user has already been informed about prior events. A temporal summarization system produces revised summaries on a topic at regular time intervals and updates the user's knowledge. Although there are prior attempts in this dimension, it gained a lot of focus after its introduction as "Update Summarization" in the DUC workshops.
We coin the term Progressive Summarization for extractive, query-focused, multi-document temporal summarization, around which this thesis is centered. Details and related work in this dimension are thoroughly studied in Chapters 2 and 3. Detecting novel and relevant information is a major challenge in temporal summarization.
1.3 Novelty Detection
Novelty is an inherently difficult phenomenon to operationalize. Detecting novel information from source documents, given the user's prior knowledge of the topic, is termed novelty detection. It is not sensible to identify new information that is not relevant to the user's interest; hence, novel information is generally regarded as relevant novel information. The problem of novelty detection has long been a significant challenge in information retrieval systems. As the task of finding new information from a pool of relevant information is difficult even for experienced human assessors [46], novelty detection still remains an active area of research. Document-level novelty, while intuitive, is rarely useful because nearly every document contains something new. Hence novelty detection is usually performed at two levels:
1. At event level: The National Institute of Standards and Technology (NIST), along with the Linguistic Data Consortium (LDC), started a project named Topic Detection and Tracking (TDT) to understand and discover topical structure in unsegmented streams of news reports across different sources and languages. TDT tasks consider each news story as a set of events occurring over a course of time. One of the tasks under this study, First Story Detection (FSD), requires constant monitoring of news to identify the onset of a new event in a particular topic. FSD is an inherent first step for TDT.
FSD is the process of detecting, within a corpus of news articles, the stories that are the first to describe an event. FSD is a major leap in event-level novelty detection that fostered efficient techniques in text processing.
2. At sentence level: The "selective dissemination of information" (SDI) paradigm assumes that people want to be able to track new information relating to known topics as their primary task. While most SDI and information filtering systems in the literature have focused on similarity to a topical profile [46], or a community of users with shared interests, recent efforts have looked at the retrieval of specifically novel information. The Novelty track, conducted as part of the Text Retrieval Conference (TREC) during 2002-2004, promoted the task of highlighting sentences containing relevant and new information in a topical document stream. The basic task is to return sentences that are both relevant and novel, given a topic and an ordered set of related documents on that topic, segmented into sentences. There are two major problems that participants must solve in this task. The first is identifying relevant sentences, while the second is identifying those relevant sentences that contain new information. The operational definition of "new" here is information that has not appeared previously in the topic's set of documents. Since each sentence adds to the user's knowledge, and later sentences are to be retrieved only if they contain new information, novelty detection can be looked upon as a filtering task. In many ways, the Novelty track can be viewed as the sentence-level analogue of the First Story Detection task.
As our focus is on sentence-extractive summarization, we concentrate on sentence-level novelty detection techniques in this thesis. Successful novelty detection techniques are employed in summarization to produce progressive summaries.
1.4 Evaluation
As with other language understanding technologies, evaluation offers many advantages to the field of text summarization. It can foster the creation of infrastructure and reusable resources, and it provides an environment for comparing peer results.
1.4.1 Summary Evaluation
Evaluation of a summary is a non-trivial task, principally because there is no "ideal" summary as such. Studies in the past [21] show that human summarizers tend to agree only about 60% of the time, and in only 82% of cases did humans agree with their own judgment. Also, there is always the possibility of a system generating a better summary that is quite different from the reference human summary used as an approximation of the ideal output summary.
Research in summarization evaluation has broadly been classified into two major categories, intrinsic and extrinsic. Intrinsic evaluation techniques test the summary in itself, usually through its content, readability and coherence. The second method, extrinsic evaluation, tests the summarization system based on how it affects other language processing tasks, like relevance assessment, reading assessment and text categorization. Intrinsic evaluation is the widely accepted mechanism for evaluating summaries throughout the literature; hence in this thesis we focus only on intrinsic evaluation of the content and coherence of a summary.
Evaluating Coherence: Summaries have two main characteristics, content and form. Coherence evaluation refers to quantifying the form of a summary. Coherence can be assessed by having humans grade summaries on some criteria. Extractive summaries usually cause coherence problems, such as dangling anaphors and a broken rhetorical structure. Subjects grade the coherence of a summary based on the presence of dangling anaphors, lack of preservation of integrity, and presence of incomplete statements in the text.
There have not been many efforts on automating the coherence/readability aspects of summary evaluation. [6] investigated the discourse-level constraints on adjacent sentences that are indicative of coherence. A recent study by [40] investigated the impact of linguistic, syntactic and discourse features on readability using the Wall Street Journal (WSJ) corpus.
Evaluating Content: Content evaluation refers to quantifying the informativeness of a summary. Measuring informativeness means assessing how much information from the source or a human-written summary is preserved by the system summary. It is the most accepted and popular evaluation criterion, used for comparing summarization systems at large scale. There exist manual metrics for content evaluation, like 'Pyramid Evaluation' and Content Responsiveness. Since manual methods are time consuming and non-repeatable, automated counterparts have been introduced that are inexpensive and repeatable. Over the years, research on automated content evaluation has produced useful evaluation tools like ROUGE and Basic Elements. A detailed description of these metrics is presented in Chapter 6.
1.4.1.1 Evaluation Workshops:
In the late 1990s, much interest and activity was aimed at developing multipurpose information systems. Several government organizations, like the Defense Advanced Research Projects Agency (DARPA) and NIST, started programmes focusing on Translingual Information Detection, Extraction and Summarization (TIDES), Text Retrieval (TREC) and First Story Detection (FSD), among others. These tasks required their own evaluation designs and data, thus creating an evaluation framework over the years. Initial TIDES workshops focused on document understanding and explored different ways of summarization. Additionally, the brainstorming sessions conducted during the workshops led to a focused evaluation effort in summarization, the Document Understanding Conference (DUC).
DUC was the first large-scale summarization evaluation forum; it provided a common ground for researchers to explore various approaches in summarization and evaluate them at large scale. DUC was organized from 2001 through 2007 by NIST and later transformed into the Text Analysis Conference (TAC). The first TAC workshop was conducted in 2008 by NIST, carrying on the tradition of DUC, Recognizing Textual Entailment (RTE) and the TREC Question Answering (Q&A) tracks. TAC workshops have designed interesting and challenging tasks for the summarization community, like Opinion Summarization in 2008, Update Summarization in 2009 and Guided Summarization in 2010. Many popular evaluation schemes, like ROUGE and Pyramids, besides novel summarization techniques, were developed during these workshops.
A task on automated evaluation of summaries of peers (AESOP) was introduced recently, during TAC 2009. Automated evaluation of content provides a platform for tracking incremental developments in state-of-the-art summarization systems. The purpose of the first edition of AESOP was to promote research and development of systems that evaluate the quality of content in summaries. The focus is on developing automatic metrics, like ROUGE, that act as surrogates for human evaluation.
1.4.2 Novelty Evaluation
The Novelty track conducted by TREC provided an ideal setting to evaluate sentence-level novelty detection techniques. The track is divided into four tasks: the first is to identify the relevant and novel sentences given the document set on a particular topic; the second is to identify all the novel sentences given the relevant sentences; the third is to identify relevant and novel sentences given the relevant sentences of the first five documents; and the final task is to retrieve novel sentences given all relevant sentences and the novel sentences from the first five documents. These tasks are designed such that participants can test their techniques at varying levels of training.
The series of Novelty workshops provides relevance and novelty judgments for a set of news articles in the AQUAINT corpus, from which the precision and recall of each technique are calculated. The sentences selected manually by the NIST assessors (the judgments) are considered the truth data, and are referred to as new relevant in the discussion below. Agreement between these sentences and those found by the systems is used as input for recall and precision:
precision = |new relevant ∩ sentences retrieved| / |sentences retrieved|

recall = |new relevant ∩ sentences retrieved| / |new relevant|
The official measure used by the Novelty track to assess the efficiency of a particular technique is its
F-measure,

F-measure = (2 × precision × recall) / (precision + recall)
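For concreteness, the computation can be sketched as follows; the sentence identifiers are hypothetical, and in the track both the assessor judgments and the system output are sets of sentence IDs:

```python
def precision_recall_f(new_relevant, retrieved):
    """TREC Novelty track metrics over sets of sentence identifiers."""
    overlap = len(new_relevant & retrieved)
    precision = overlap / len(retrieved) if retrieved else 0.0
    recall = overlap / len(new_relevant) if new_relevant else 0.0
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f

# Hypothetical assessor judgments (truth) and system output.
truth = {"s1", "s2", "s3", "s4"}
system = {"s2", "s3", "s5"}
p, r, f = precision_recall_f(truth, system)
```

Here two of the three retrieved sentences are in the truth data, so precision is 2/3 and recall is 2/4.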
1.5 Organization of the thesis
The main focus of this thesis is to produce informative and human-readable summaries of
a set of topically related documents, given that the user has prior knowledge of the
topic and has previously read some articles on it. In this thesis we also aim to estimate
sentence importance through an optimal combination of several sentence scoring features, without
any manual effort. The rest of this thesis is organized as follows.
Chapter 2 describes the motivation behind choosing this research problem and explains the
challenges involved. We define our problem statement and the exact goals of
this thesis. We briefly explain our approach to the underlying research problems and state the
major contributions of this thesis.
Chapter 3 provides a detailed survey of relevant literature in the context of this thesis. State-of-
the-art approaches in summarization, including lexical chains, graph-based models, and language
modeling approaches, are described in this chapter. We discuss representative previous
work in the field of machine learning as applied to summarization. Later in this chapter
we describe the levels of novelty detection and successful approaches in this field. At
the end of the chapter we describe efforts in the direction of temporal summarization,
which is close to the problem we address in this thesis.
In Chapter 4, we describe a general summarization framework, explaining the various stages of
a summarizer. We explain support vector regression and how it is useful in predicting
sentence importance. Details are provided about several sentence scoring features, both existing and newly
devised, that are used for producing summaries.
Chapter 5 describes the role and importance of novelty detection in progressive summariza-
tion. We describe how novelty detection is integrated into the summarization framework and why
it is important. The sections of this chapter describe the novelty detection techniques used
at the feature extraction, sentence ranking, and summary extraction stages of the summarizer.
Chapter 6 provides details about the data set and evaluation metrics used for the experi-
ments in this thesis. We discuss several experiments conducted to evaluate re-
gression and to determine the significance of the proposed novelty detection techniques over a generic
summarizer. Evaluation results of the experiments are compared to state-of-the-art approaches
in summarization.
Finally, Chapter 7 concludes this thesis, summarizing the work done and discussing
the results of our experiments. It also provides details about foreseeable future work arising from this
thesis.
Chapter 2
Problem Statement
2.1 Motivation
The Internet allows its users to follow virtually any news story of interest. There are
numerous news portals that periodically aggregate information about every category
and domain, in several languages. Unlike scientific articles and blogs, a news topic has multiple
information sources, paraphrasing the same information in various surface forms. Each news topic
has a particular longevity depending on its nature and popularity.
For instance, consider the news topic of “Michael Jackson's death”. On the first day of
reporting the incident, the topic started with an article about his tragic death due to excessive
drugs. Over a period of time, the news reports covered the details of police investigations,
the mourning of celebrities, financial troubles, details of the funeral, and so forth. To provide suf-
ficient background knowledge for the reader, news reporters usually include prior information
about the topic while describing new events or proceedings. Such reporting leads to repeti-
tion of information in future articles on the topic. Below, we provide snippets of news articles
from Reuters1 to illustrate our discussion [48].
Article 1 (On 26th June 2009):
The 50-year-old, whose towering legacy was tarnished by often bizarre behavior was pro-
nounced dead on Thursday in Los Angeles after going into cardiac arrest. An autopsy was
1. http://www.reuters.com/
conducted on Friday, and while investigators will not know results of toxicology tests for six to
eight weeks, speculation turned to his prescription drug use as a culprit. Mourning his death
were legions of fans around the world, including U.S. President Barack Obama, who called the
”Thriller” singer a ”spectacular performer” and offered his condolences to Jackson’s family.
Article 2 (On 27th June 2009):
The King of Pop died suddenly on Thursday at the age of 50, after a career spanning 40 years
that included the biggest-selling pop album of all time, ”Thriller.” Despite taking in hundreds
of millions of dollars as one of the most successful pop musicians of all time, Jackson racked
up about 500 million of debt, according to sources cited by The Wall Street Journal earlier this
month.
Article 3 (On 28th June 2009):
Jackson, 50, was stricken Thursday at his rented chateau in Holmby Hills, above Sunset Boule-
vard, and died after suffering what his brother Jermaine Jackson said was cardiac arrest.
Families who obtain a second autopsy often do so because they want to confirm the cause of
death. A second autopsy can also give relatives information much faster than an autopsy con-
ducted by law enforcement officials, said Michael Baden including the criminal trials of O.J.
Simpson and Phil Spector.
Articles 1, 2, and 3 were published by the same news source, ordered chronologically by
date of publication. In order to provide some context, articles 2 and 3 both included
prior information that is already known to the reader through a previous article (article 1).
Consider a scenario where the user intends to follow one such temporal news topic. The
topic has many related articles generated by numerous news aggregators, blogs, or stand-alone
news websites, and the volume of articles increases with time. Since the user cannot deal with this
huge amount of raw information, there is a great need for a summarizer that processes all these
articles and produces a targeted, informative summary of the topic. With a sophisticated
summarizer, a user can access information in the form of a summary instead of going
through all the articles, saving productive time. As the life span of these news topics
can range from weeks to months to years depending on their nature, the user is expected to use
the summarizer periodically to produce a summary of recent articles (since the previous
summary).
2.2 Problem Definition
Automatic summarization is an information access technique used to present only the most
important information from multiple documents, thereby reducing the need to refer to the source doc-
uments. A normal multi-document summarizer calculates the importance of a text unit solely in terms
of its relevance to the topic. In a real-world scenario, a reader needs to keep track of a popular
temporal topic, but a normal summarizer fails to produce a good summary, since it cannot
handle the prior information reported in earlier articles. Progressive summarization addresses
this problem by producing quality summaries that convey only the progression, or updates, on a
particular topic. A progressive summarizer measures the importance of a text unit both in terms
of its relevance to the topic and its novelty to the user.
In this thesis we aim to reduce the problem of information overload by periodically pro-
ducing multi-document summaries to update the user's knowledge. The goal here is to take
clusters of chronologically divided documents related to the same topic and generate short,
concise summaries that can be read in lieu of the original document set.
The most important factor in extracting important items for a progressive summary is the
identification of novel information. Progressive summarization requires differentiating be-
tween relevant and novel vs. non-relevant and novel vs. relevant and redundant informa-
tion. The summary needs to contain only relevant and novel information, which is feasible only
by combining efficient novelty detection methods with summarization. In this thesis, we
strive to devise efficient sentence-level novelty detection methods in the context of progressive
summarization.
2.3 Outline of the Solution
Summarization can be achieved either through abstraction or through extraction of information from
source documents. While abstractive summaries could provide a more readable and coherent sum-
mary, state-of-the-art systems are all extractive summarizers, owing to the robustness and scala-
bility of these approaches. Extractive approaches to summarization can employ various levels
of granularity, such as keywords, sentences, or paragraphs. As keywords hardly provide a read-
able summary, and paragraphs are unlikely to cover enough information under space constraints,
sentences have emerged as the most popular unit of text for summaries.
Extractive summarizers generally follow a sequential architecture that includes preprocess-
ing of text for sentence boundary identification and extraction; a feature extraction stage,
where several statistical, linguistic, and heuristic models are employed to score sentences; a
sentence ranking stage, which estimates sentence importance through a weighted linear
combination of the features; and finally summary extraction, during which a subset of the ranked
sentences is selected into the summary.
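A minimal sketch of this sequential architecture follows; the feature functions, weights, and sentences are illustrative placeholders, not the features used in this thesis:

```python
# Minimal sketch of the preprocessing -> features -> ranking -> extraction
# pipeline. Feature functions and weights are illustrative placeholders.

def preprocess(text):
    # Naive sentence boundary identification.
    return [s.strip() for s in text.split(".") if s.strip()]

def extract_features(sentence, topic_words):
    overlap = sum(w in sentence.lower() for w in topic_words)
    return [overlap, len(sentence.split())]

def rank(sentences, topic_words, weights=(1.0, 0.05)):
    # Weighted linear combination of the features.
    return sorted(((sum(w * f for w, f in zip(weights, extract_features(s, topic_words))), s)
                   for s in sentences), reverse=True)

def extract_summary(ranked, budget=2):
    return [s for _, s in ranked[:budget]]

doc = "Jackson died on Thursday. An autopsy was conducted. Fans mourned worldwide."
summary = extract_summary(rank(preprocess(doc), ["jackson", "autopsy"]))
```

Each stage is deliberately independent of the others, which is what allows novelty detection to be slotted in at different points of the pipeline later.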
Since the existing features succeed in capturing only relevance, we devise mul-
tiple new features to capture the novelty of a sentence along with its relevance. Novelty is an
inherently difficult phenomenon to operationalize; determining ground truth for novelty is
a more difficult task than for relevance. It is hard even for a human, who must try to re-
member everything he has read. We build alternative methods to incorporate novelty detection
into the conventional summarization framework. These methods are related to different
stages of summarization and evaluated against each other to find the best.
The use of more and more features for estimating sentence importance makes the weighted
linear combination a critical aspect of summarization. The process of determining the ideal
set of weights takes a good amount of human effort and ample experimentation. We
overcome this problem by employing a machine learning algorithm to learn the rank of a
sentence from a training set of features. Hence our approach is robust to the weights assigned
and the number of features employed. A detailed description of our summarizer and the newly
devised novelty detection methods is provided in chapters 4 and 5.
2.4 Contributions
The major contributions of this thesis, which include devising new methods for detecting
novelty and relevance, are provided below.
1. Regression is widely used in a broad spectrum of fields, including information retrieval, to
predict an unknown variable from a set of observed variables. In chapter 4 we successfully
use regression to estimate sentence importance from its feature vector. Detailed analysis
and experiments are carried out using several combinations of features. According to
evaluation results, the successful combinations are on par with the best summarizers.
2. We treat the problem of progressive summarization in a unique way by highlighting the
importance of having a novelty detection module in the summarization framework. In this
thesis we follow a systematic approach for comparing different novelty detection tech-
niques and relate them to various stages of a summarization framework.
3. To the best of our knowledge, the role of prepositions has never been explored in determining
the importance of a sentence. We identify that the frequency of prepositions implicitly
achieves the effect of Named Entity Recognition (NER) in a sentence. We develop a
new feature, PrepImp, that scores a sentence based on the frequency of prepositions it
contains.
4. Conventional scoring features capture only the relevance of a sentence. In chapter 5 we
devise new scoring features, Novelty Factor (NF) and Hybrid Kullback-Leibler Informa-
tion Divergence (HKLID), to capture the novelty of a sentence along with its relevance.
• NF is a statistical feature that measures the importance of a sentence in terms of the docu-
ment frequencies of the words it contains.
• HKLID is an extension of the popular KL divergence that scores a sentence based
on the divergence of its sentence and document language models from the prior cluster
of documents.
5. We make a new hypothesis, based on the statistics of the document collection, that new in-
formation often spans a group of sentences belonging to a context. Based on
this hypothesis we devise a new re-ranking measure, Proximity Re-ranking, that com-
putes the rank of a sentence based on the relevant novelty of its surrounding sentences.
6. We model novelty detection in the context of progressive summarization as an informa-
tion filtering task. Sentences that possibly contain prior information are filtered out of the
summary by creating a Novelty Pool (NP). The NP contains sentences whose words are
dominant in the new cluster of documents compared to the previous documents.
Chapter 3
Related Work
Summarization has been a popular area of research in information retrieval for a very long
time. The early work in summarization in the late 1950s and early 1960s by [36] [14] suggested
that text summarization by computer was feasible, though not trivial. Progress in language
processing, along with the exponential increase in computer memory and speed and the growing
presence of text on the web, renewed interest in automatic text summarization. In this thesis
we use summary as a generic term for text that is produced from one or more source texts, that
contains a significant portion of the information in the original texts, and that is no longer than
half of the original texts.
Effective summarization requires an explicit analysis of context and the purpose of sum-
maries. Text summarization has seen a lot of research in the past two decades, and the ap-
proaches have been categorised at many levels. Since it is not feasible to list the full
range of approaches in summarization, we provide below only the state-of-the-art and most popular
approaches.
3.1 State of the art approaches in Summarization
The introduction of summarization tracks at TAC and DUC allowed researchers to compare
their results and induced a notion of competition that resulted in an enormous increase in the
number of approaches. The spectrum of summarization approaches encompasses several cate-
gories, such as heuristic, discourse-based, machine learning, and language modeling
approaches, among others. Below we describe some of the popular approaches.
1. Lexical Chains: [5] describe a work that used a considerable amount of linguistic analysis
to perform the task of summarization. The authors describe the notion of cohesion in
text as a means of sticking together different parts of the text. Cohesion occurs not only at
the word level but also over word sequences, resulting in lexical chains. They made use of
lexical chains, sequences of related words in a text spanning short or long distances, to
identify important information. After segmenting the input text, lexical chains are identified,
and sentences containing strong lexical chains are selected for extraction.
Semantically related words and word sequences were identified in the document, and
several chains were extracted, which form a representation of the document. WordNet [39]
distance is used as the relatedness measure to find lexical chains.
2. Graph Spreading Activation: [37] describe a graph-based method to find similarities and
dissimilarities in pairs of documents. This is a topic-driven approach, with topics rep-
resented through a set of entry nodes in the graph. Each document is represented as a
graph, with each node representing the occurrence of a single word. Each node has sev-
eral links encoding its adjacency, semantic relatedness, and co-references with other nodes
in the graph. Once the graph is built, the search for semantically related text is propagated
from the entry nodes to the other nodes of the graph through spreading activation 1. Salient
words and phrases are initialized according to their TF-IDF scores. The weight of neigh-
boring nodes becomes an exponentially decaying function of the traversed path. Given
a pair of document graphs, the algorithm computes two scores reflecting the presence of
common and different nodes. Sentences having higher scores are highlighted, with the user
able to specify the number of sentences in the summary.
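The propagation step can be sketched as follows; the toy word graph, seed weights, and decay constant are illustrative assumptions, with activation decaying exponentially in the number of traversed links:

```python
from collections import deque

# Toy sketch of spreading activation: activation decays by a constant
# factor per traversed link, so it is an exponentially decaying function
# of path length from the entry (topic) nodes.

def spread(graph, entry_weights, decay=0.5):
    """graph: {node: [neighbors]}; entry_weights: initial TF-IDF-like scores."""
    activation = dict(entry_weights)
    queue = deque(entry_weights)
    while queue:
        node = queue.popleft()
        for nbr in graph.get(node, []):
            propagated = activation[node] * decay
            if propagated > activation.get(nbr, 0.0):
                activation[nbr] = propagated
                queue.append(nbr)
    return activation

g = {"jackson": ["died", "singer"], "died": ["thursday"], "singer": [], "thursday": []}
act = spread(g, {"jackson": 1.0})
```

Nodes one hop from the entry node receive half its activation, two hops a quarter, and so on.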
3. Centroid-Based Summarization: [42] exploited the use of cluster centroids to summa-
rize documents. News articles describing the same event are grouped together using an
agglomerative clustering algorithm that operates over TF-IDF vector representations of the
1. The name spreading activation is borrowed from a method used in information retrieval to expand the search vocabulary.
documents. The centroids of these clusters are then used to identify sentences that are central
to the topic of the cluster.
Two metrics, cluster-based relative utility (CBRU) and cross-sentence information sub-
sumption (CSIS), are introduced to calculate the importance of a sentence. Three sentence-
level features, centroid value, positional value, and first-sentence overlap, are used to approx-
imate these metrics. The final score of each sentence, computed by combining
these scores along with a redundancy penalty, is used for ranking sentences. The
approach is well known as MEAD, and has been open-sourced for research purposes.
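The centroid-value feature can be sketched as follows; the cluster centroid is the average TF-IDF vector of its documents, and a sentence scores the sum of the centroid values of its words. The tiny corpus and the TF-IDF variant are illustrative assumptions:

```python
import math
from collections import Counter

# Sketch of the centroid-value feature: build the average TF-IDF vector
# of a cluster, then score a sentence by the centroid values of its words.

docs = [
    "jackson died of cardiac arrest",
    "jackson was pronounced dead in los angeles",
]

def tfidf_vectors(docs):
    tokenized = [d.split() for d in docs]
    df = Counter(w for toks in tokenized for w in set(toks))
    n = len(tokenized)
    return [{w: tf * math.log(1 + n / df[w]) for w, tf in Counter(toks).items()}
            for toks in tokenized]

def centroid(vectors):
    c = Counter()
    for v in vectors:
        c.update(v)          # Counter.update adds values per key.
    return {w: s / len(vectors) for w, s in c.items()}

def centroid_score(sentence, c):
    return sum(c.get(w, 0.0) for w in sentence.split())

c = centroid(tfidf_vectors(docs))
```

Sentences mentioning words that are central to the whole cluster (here, "jackson") score highest, which is the intuition behind MEAD's centroid value.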
4. Probabilistic language models for summarization: [28] define summarization in terms
of a probabilistic language model and use the definition to automatically generate topic
hierarchies. The authors use a language model to characterize the documents to be sum-
marized and then apply a graph-theoretic algorithm to determine the best topic words for
the summary. An approximation of the relative entropy (KL divergence) with a bi-
gram model is used to compare the language models of the topic set to a general English corpus.
Language models are used to define the 'topicality' and 'predictiveness' of a word, which re-
flect topic orientedness and the existence of subtopic hierarchies for the word.
More recently, Jagarlamudi [20] has shown how a relevance-based language modeling
paradigm can be applied to query-focused multi-document summarization through the Prob-
abilistic Hyperspace Analogue to Language model (PHAL). PHAL is a natural
extension of the Hyperspace Analogue to Language model, as term co-occurrence counts
can be used to define conditional probabilities. PHAL can be interpreted as: given a
word w, what is the probability of observing another word w′ with w in a window of size
K. Details about PHAL can be found in chapter 4.
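The windowed co-occurrence probability underlying this interpretation can be sketched as follows; the corpus and window size are illustrative, and this is only the co-occurrence estimate, not the full PHAL model:

```python
from collections import Counter

# Sketch of the windowed co-occurrence probability: P(w' | w) estimated
# from counts of w' appearing within K words of each occurrence of w.

def cooccurrence_prob(tokens, w, K=2):
    pair_counts, w_count = Counter(), 0
    for i, tok in enumerate(tokens):
        if tok == w:
            w_count += 1
            window = tokens[max(0, i - K):i] + tokens[i + 1:i + 1 + K]
            pair_counts.update(window)
    return {wp: c / w_count for wp, c in pair_counts.items()} if w_count else {}

tokens = "the king of pop died the king of rock lived".split()
probs = cooccurrence_prob(tokens, "king", K=2)
```

Since "of" follows both occurrences of "king", its conditional probability is 1.0, while "pop", seen near only one occurrence, gets 0.5.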
5. Other Approaches: There are also some unconventional approaches that investigate the
details that underlie the summarization process rather than aiming to build a full summa-
rization system. [51] present a system that generated headline-style summaries for
publicly available news articles from Reuters and the Associated Press. The system learned
statistical models of the relationship between source text units and headline units. It at-
tempted to model both the order and the likelihood of the appearance of tokens in the
target documents.
For content selection, a translation model was learned between a document and its sum-
mary. This model, in the simplest case, can be thought of as a mapping between a word
in the document and the likelihood of some word appearing in the summary. A bigram
model was used for surface realization. Viterbi beam search was used to efficiently find
a near-optimal summary. The Markov assumption was violated by using backtracking at
every state to strongly discourage paths that repeated terms. The two models were used
to co-constrain each other during the search in the summary generation task.
Sentence position information is a simple but powerful heuristic in summarization.
Sentence position has been extensively studied since its introduction to summarization
by [14]. [32] empirically characterized the position feature as genre-dependent
and derived a position policy, an ordering of priority of sentence importance. Most
recently, [24] described a Sub-optimal Sentence Position Policy (SPP) based on pyramid
annotation data and implemented the SPP as an algorithm to show that a position policy
thus formed is a good representative of the genre and thus performs well above median
performance.
3.1.1 Machine Learning Approaches
Recent advances in the field of machine learning have been adapted to summarization
throughout the literature to identify important sentences. Representative work in this area
includes:
1. Naive Bayes Methods: [26] modeled the summarization process as a classification problem,
where sentences are classified as summary or non-summary sentences based on a set of
features, using a naive Bayes classifier. Let s be a particular sentence, S the set of sentences
that make up the summary, and F1, ..., Fk the features. Assuming independence of the
features, the importance of each sentence is computed as:

P(s ∈ S | F1, ..., Fk) = ( ∏_{i=1}^{k} P(Fi | s ∈ S) · P(s ∈ S) ) / ∏_{i=1}^{k} P(Fi)
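This score can be sketched directly; the binary feature and all probability estimates below are hypothetical values that would normally be estimated from a labeled corpus:

```python
import math

# Sketch of the naive-Bayes sentence score: product of per-feature
# likelihood ratios times the prior, computed in log space for stability.

def nb_score(feature_values, p_f_given_summary, p_f, p_summary):
    """P(s in S | F1..Fk) = prod_i P(Fi | s in S) * P(s in S) / prod_i P(Fi)."""
    log_score = math.log(p_summary)
    for f, v in feature_values.items():
        log_score += math.log(p_f_given_summary[f][v]) - math.log(p_f[f][v])
    return math.exp(log_score)

# Hypothetical estimates: feature -> {feature value: probability}.
p_f_given_s = {"has_cue_phrase": {True: 0.7, False: 0.3}}
p_f = {"has_cue_phrase": {True: 0.4, False: 0.6}}
score = nb_score({"has_cue_phrase": True}, p_f_given_s, p_f, p_summary=0.2)
```

A feature value that is more common in summary sentences than overall (0.7 vs. 0.4 here) raises the score above the prior.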
[4] also incorporated naive Bayes with a rich feature set to derive signature words 2. The au-
thors also employed some shallow discourse analysis, such as references to the same entities in
the text, to maintain cohesion. The references were resolved at a very shallow level, by
linking name aliases within a document.
2. Neural Networks: [47] propose an algorithm based on neural nets and the use of third-
party data sets to tackle the problem of extractive summarization. A trained model is
built from the labels and features of each sentence of an article, which could then infer the
proper ranking of sentences in a test document. The ranking was accomplished using
RankNet [8], a pair-based neural network algorithm designed to rank a set of inputs, which
uses gradient descent for training. The similarity score between a human-written
judgment and a sentence in the training document is used as a soft label for training.
The novelty of this framework lay in the use of features that derived information from query
logs of Microsoft's news search engine 3 and Wikipedia 4 entries. The authors conjec-
ture that if a document sentence contains keywords used in the news search engine, or
entities found in Wikipedia articles, then there is a greater chance of having that sentence
in the summary.
3. Hidden Markov Models: [11] modeled the problem of extracting a sentence from a doc-
ument using a hidden Markov model (HMM). The basic motivation for using a sequential
model is to account for local dependencies between sentences. The HMM contained
2s + 1 states, alternating between s summary states and s + 1 non-summary states. The au-
thors allowed “hesitation” only in non-summary states and “skipping the next state” only in
summary states. The authors obtained the maximum-likelihood estimate for each tran-
sition probability, forming the transition matrix estimate M, whose element (i, j) is the
empirical probability of transitioning from state i to state j.
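The maximum-likelihood transition estimate can be sketched as follows; the labeled state sequences (0 = non-summary, 1 = summary) are illustrative:

```python
from collections import Counter

# Sketch of the ML transition-matrix estimate: M[i][j] is the empirical
# probability of moving from state i to state j, counted over labeled
# state sequences.

def transition_mle(sequences, states):
    counts = {i: Counter() for i in states}
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):
            counts[a][b] += 1
    M = {}
    for i in states:
        total = sum(counts[i].values())
        M[i] = {j: (counts[i][j] / total if total else 0.0) for j in states}
    return M

# Toy labeled sequences: 0 = non-summary state, 1 = summary state.
M = transition_mle([[0, 1, 0, 0, 1], [0, 0, 1, 1]], states=[0, 1])
```

Each row of M is a probability distribution over successor states, normalized by how often state i was left.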
Associated with each state i is an output function, b_i(O) = Pr(O | state i), where O is an
observed vector of features. They made a simplifying assumption that the features are
multivariate normal. The output function for each state was thus estimated by using the
training data to compute the maximum likelihood estimate of its mean and covariance
matrix.

2. Words that indicate key concepts in a document.
3. search.live.com/news
4. www.wikipedia.org
CLASSY [13], the best system at DUC 2004 and MSE 2005, also uses a hidden Markov
model for selecting sentences from each document and a pivoted QR algorithm for gen-
erating a multi-document summary.
3.2 Novelty Detection Approaches
Progressive summaries are generated at regular time intervals to update the user's knowledge
about a particular topic. Novelty detection is an inherent component of progressive summa-
rization, used to identify sentences containing relevant and new information. First Story Detection
in the TDT task allowed many researchers to work on the problem of event-level novelty detec-
tion. Since we deal with sentence-level extractive summarization, we cite here some influential
work in sentence-level novelty detection. Most of the techniques listed here were developed
during the TREC Novelty track.
[30] proposed a novelty detection approach based on the identification of sentence-level
information patterns. The approach is motivated by the intuition that information patterns in
sentences, such as combinations of query words, sentence lengths, named entities and phrases,
and other sentence patterns, may carry more important and relevant information than single
words. The proposed novelty detection approach focuses on the identification of previously
unseen query-related patterns in sentences. Specifically, a query is preprocessed and repre-
sented with patterns that include both query words and required answer types. These patterns
are used to retrieve sentences, which are then determined to be novel if it is likely that a new
answer is present.
[44] demonstrated the importance of context in novelty detection systems. The idea stems
from the fact that novelty often comes in bursts, which is not surprising since articles are
composed of some number of smaller, coherent segments. Each segment is started by some
kind of introductory passage, and that is where the authors expect to find the novel words. Novel
words are identified by comparing the current sentence's words against a table of all words
seen in the input up to that point. Subsequent passages are likely to continue the novel discussion
whether or not they contain novel words; they may contain pronominal or other
anaphoric references to the novel entity. In order to determine whether information within
a sentence has been seen in material read previously, the authors integrate information about the
context of the sentence with the novel words and named entities within the sentence, and use a
specialized learning algorithm to tune the system parameters.
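The seen-words table at the core of this idea can be sketched as follows; scoring a sentence by the fraction of previously unseen words is an illustrative simplification, and the sentence stream is hypothetical:

```python
# Sketch of the seen-words table: each sentence's novelty is the fraction
# of its words not seen in any earlier sentence of the stream.

def novelty_scores(sentences):
    seen, scores = set(), []
    for s in sentences:
        words = s.lower().split()
        new = [w for w in words if w not in seen]
        scores.append(len(new) / len(words) if words else 0.0)
        seen.update(words)
    return scores

stream = ["jackson died on thursday",
          "jackson died after cardiac arrest",
          "jackson died on thursday"]
scores = novelty_scores(stream)
```

The first sentence is fully novel, the second is partially novel, and the verbatim repeat at the end scores zero; as the surrounding text notes, anaphoric continuations of a novel discussion would be missed by word counting alone, which is why context is needed.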
In addition to straightforward counts of named entities and noun phrases in a sentence, [15]
tried several experiments, one using synonyms in addition to the words for novelty compar-
isons, and one using word sense disambiguation. They expanded all noun phrases using
WordNet [39] and used the corresponding synsets for comparisons. [49] utilized a method based on
variants of employing an information retrieval (IR) system to find relevant and novel sentences.
A sentence is treated as a query over a reference corpus, and similarity between sentences
is measured in terms of the weighting vectors of the document lists ranked by the IR system. A dy-
namic threshold-setting approach, based on the percentage of relevant sentences within a
document set, is used to decide thresholds for extracting novel sentences. [12] used their
hidden Markov model based sentence retrieval model [11] for extracting relevant sentences, and
tested pivoted QR decomposition 5 and the Maximal Marginal Relevance algorithm [9] to identify
a set of sentences containing new information.
Unlike other works on novelty detection, [2] investigated the sensitivity of novelty detec-
tion to the presence of non-relevant sentences in the documents. The authors explored the task of
the TREC Novelty track in much greater depth than was done for the TREC workshop, with
substantial focus on the problem of how novelty detection degrades as the quality of relevant in-
formation drops. They experimented with three well-known retrieval models: the vector space
model with tf-idf weighting [43], a language modeling approach with the KL divergence [25]
as the scoring function, and a two-stage language modeling approach [53]. For detecting novelty,
the authors used several measures, including simple new-word counts; a cosine distance met-
ric, where the negative of the cosine of the angle between a sentence vector and each previously
seen sentence vector determines the novelty score for that sentence; and language-model
based novelty measures with interpolated, Dirichlet, and shrinkage smoothing models. These
models differ in the language models they compare while measuring KL divergence.

5. QR decomposition of a matrix is a decomposition of the matrix into an orthogonal and an upper triangular matrix, often used to solve the linear least squares problem; it is the basis for a particular eigenvalue algorithm.
The difference between the two groups of measures is that one just counts words while the
other looks at the distribution of words. When non-relevant sentences are added, the probabil-
ity distribution of the vocabulary shifts, so that arriving sentences have more and more dissimilar
distributions, suggesting that they are novel. On the other hand, word counting approaches are
less distracted by the new words. Relevant sentences that are not novel will generally reuse vo-
cabulary from earlier relevant sentences, and will not be sidetracked by the random vocabulary
introduced by the non-relevant sentences. The authors anticipate that as the density of relevant
documents drops, the word counting measures will continue to perform best.
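The cosine-based novelty measure mentioned above can be sketched as follows; using bag-of-words vectors and expressing the score as one minus the maximum similarity to previously seen sentences (equivalent to taking the most negative cosine) is an illustrative simplification:

```python
import math
from collections import Counter

# Sketch of cosine-based novelty: a sentence is novel to the extent that
# its bag-of-words vector is dissimilar to every previously seen sentence.

def cosine(a, b):
    dot = sum(a[w] * b.get(w, 0) for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def cosine_novelty(sentence, seen):
    vec = Counter(sentence.lower().split())
    seen_vecs = [Counter(s.lower().split()) for s in seen]
    return 1.0 - max((cosine(vec, sv) for sv in seen_vecs), default=0.0)

seen = ["jackson died on thursday"]
```

An exact repeat of a seen sentence gets novelty 0, while a sentence sharing no vocabulary gets novelty 1; unlike the new-word counts, this measure responds to the whole distribution of words.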
3.3 Approaches to Progressive/Temporal Summarization
Progressive summarization shares much similarity with temporal, or time-based, processing
of news topics in summarization. Regular summarizers deal with a static set of documents,
but a progressive summarizer receives a stream of news articles, so the document set is dy-
namic. Progressive summarization is a relatively new area of research within summarization,
which gained a lot of focus through the introduction of the “Update Summarization” track at
DUC 2007. We use the term progressive summarization in lieu of update or temporal sum-
marization in this thesis, since the term 'progressive' expresses the essence of the task in a better
manner: the user is updated with the progress of events in the topic, hence “pro-
gressive summarization”. We present here some influential work in this direction from the
recent past.
[9] is the first known work combining query relevance and information novelty in the context
of retrieval and summarization. The authors coined the term “relevant novelty” and explained the
need to compute the importance of an element through a combined criterion of query relevance
and information novelty. The linear combination of independently measured
relevance and novelty of an element is referred to as its “marginal relevance”. The method
described in this work strives to maximize marginal relevance, and is hence labeled “Maximal
Marginal Relevance (MMR)”. The MMR criterion for multi-document summarization is:
MMR = arg max_{si ∈ S} [ λ · Sim1(si, Q) − (1 − λ) · max_{sj ∈ S} Sim2(si, sj) ]
where S is the set of sentences in the document cluster, Q is the information need of the user
represented as a query, Sim1 is the similarity metric measuring the relevance of a sentence to the
query, and Sim2 can be the same as Sim1 or a different metric. For intermediate values of
the parameter λ in the interval [0,1], a linear combination of both relevance and novelty is
optimized. Users wishing to sample the information space around the query, emphasizing novelty,
should set λ to a smaller value, while those wishing to focus on the most relevant sentences
should set λ closer to 1.
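Greedy selection under the MMR criterion can be sketched as follows; using word-overlap (Jaccard) similarity for both Sim1 and Sim2, and the particular sentences, query, and λ value, are illustrative assumptions:

```python
# Greedy sketch of MMR selection: at each step, pick the sentence that
# maximizes lambda * relevance - (1 - lambda) * similarity to the
# already-selected sentences.

def jaccard(a, b):
    a, b = set(a.split()), set(b.split())
    return len(a & b) / len(a | b) if a | b else 0.0

def mmr_select(sentences, query, k=2, lam=0.7):
    selected = []
    candidates = list(sentences)
    while candidates and len(selected) < k:
        best = max(candidates,
                   key=lambda s: lam * jaccard(s, query)
                   - (1 - lam) * max((jaccard(s, t) for t in selected), default=0.0))
        selected.append(best)
        candidates.remove(best)
    return selected

sents = ["jackson died on thursday",
         "jackson died on thursday morning",
         "fans mourned jackson worldwide"]
picked = mmr_select(sents, "jackson died", k=2)
```

Although the second sentence is more relevant to the query than the third, its redundancy with the first selection is penalized, so the third sentence is chosen instead.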
[1] uses a language model based approach to produce a revised summary at regular time
intervals. The goal is to model the topic and events from sentences and to identify the occurrence
of new events (novelty) within the topic (usefulness). The authors proposed different language
models for characterizing usefulness and novelty and combined them into a single measure
of interestingness.
[35] modeled summarization using information distance theory and produced summaries with minimal conditional information distance to the prior document set. Summarization is converted into an optimization problem constrained by the summary’s information content, and solved by approximating Kolmogorov complexity. [17] describes a maximum coverage model for summarization inspired by the well-known set cover problem6. Simple word bigrams valued by their document frequency are modeled as concepts. Sentences are selected into the summary such that they best cover the relevant concepts in the document set. The maximization problem is approximately solved using Integer Linear Programming (ILP). The value of a summary is computed as the number of unique concepts it contains, thus limiting redundancy implicitly and providing scope for novelty. The authors hypothesize that articles about topics that have already been in the news tend to state new information first before recapping past details. Sentence position is used as a unique feature to identify novel sentences.
6www.wikipedia.org/wiki/Set_cover_problem
[7] proposes a statistical method based on maximization-minimization of similarity measures between sentence vectors. Cross-summary sentence redundancy is minimized to limit the redundancy of the progressive summary with the previous summary, while the newness of the information in the summary is maximized. Sentences close to the topic description are chosen to sustain relevance.
In this thesis we address progressive summarization by devising multiple novelty detection techniques at various summarization stages and combining them to generate an informative summary. Unlike most of the previous work, our approach has the advantage of using more than one technique to detect novel information, integrated within the summarization framework.
Chapter 4
Supervised sentence ranking using Regression
4.1 Summarization Framework
Summarization can be viewed from different perspectives: as a decision theory problem, as a problem of classifying summary and non-summary sentences, as a data compression problem of lossy and lossless compression of sentences, or as an Information Retrieval problem of extracting relevant sentences. In this thesis we use a general model allowing these different views to be implemented as individual features of the summarization framework. We use a machine learning method (regression) to combine all these distinctive features and produce a final informative summary. In this chapter, we explain the methodology of our framework and provide details about the various features used in our experiments.
4.1.1 Stages of Framework
As the focus of this thesis is only on extractive summarization, the term summarization/summarizer implies sentence-extractive multi-document summarization. The model of our summarizer is inspired by the MEAD architecture, an elaborate publicly available platform for multi-document multilingual text summarization [41]. The flexible nature of our framework allows us to implement arbitrary algorithms in a standardized manner. Our summarizer has four major modules:
• Pre-processing:
Articles collected from the web or any publicly available corpus contain unnecessary article headers and HTML tags that provide no information about the article. Each article is represented as a document in the framework and parsed to extract the content. Standard sentence boundary identifiers and word breakers are used to split each document into sentences. Stop words are removed from sentences, and the Porter stemmer is used to derive root words by stripping suffixes. Figure 4.1 shows a sample news article from the AQUAINT corpus1.
• Scoring/Feature Extraction:
Sentences extracted during the pre-processing stage are considered the units of the summary. Each sentence is assigned scores by several scoring features, reflecting its relevance on either a positive or a negative scale. These features may include probabilistic language models, heuristics derived from the corpus, entropy based measures, statistical information about the data, and linguistic and knowledge based measures, among others. Usually more than one feature is used in scoring to attain robustness. A close look at the sample news article (figure 4.1) reveals that the important information is often conveyed in either the top or bottom parts of the article. Since all the articles in the cluster are relevant to the topic, the importance of a concept is directly proportional to its occurrence across articles. These observations are leveraged as features (DFS, SFS, SL1, SL2) and described in detail in section 4.3.
Features are pluggable components of the framework; hence each combination of features becomes a unique configuration of the summarizer. The multiple feature scores of each sentence are combined into a single rank in the Sentence Ranking stage.
• Sentence Ranking:
The rank of a sentence is directly proportional to its importance and decides its membership in the summary. Conventionally, sentence rank is computed as a weighted linear combination of feature scores. As the feature space grows, it becomes very difficult to come up with an optimal set of weights for the combination.
To overcome the cost of numerous experiments and the manual effort of finding an ideal weight combination, we use a machine learning technique to estimate the rank of a sentence. The ranking procedure is explained in detail in section 4.2.
1The AQUAINT corpus consists of newswire text data in English, drawn from three sources: the Xinhua News Service (People’s Republic of China), the New York Times News Service, and the Associated Press Worldstream News Service. It was prepared by the LDC for the AQUAINT Project and is used in official benchmark evaluations conducted by the National Institute of Standards and Technology (NIST).
Figure 4.1 Sample news article from AQUAINT news corpus
• Summary Extraction:
Summary extraction is the final stage of summarization, where a subset of ranked sentences is selected into the summary until the desired summary length is reached. Only sentences complying with several constraints, such as minimum/maximum length and minimal redundancy with the already produced summary, are selected. As a result, the summary covers a wide range of aspects of the topic. Uninformative phrases are removed from sentences using simple heuristics and a minimal set of rules. Sentences are re-ordered based on their order of occurrence in the documents to improve the readability of the summary.
A pictorial representation of all the stages in extractive summarization is presented in figure 4.2.
Figure 4.2 Stages in a Multi Document Summarizer
4.2 Feature combination using Support Vector Regression
Recent advances in the field of machine learning have been adapted to summarization in the literature, using several features to identify sentence importance. Previously, machine learning models such as neural networks [47], naive Bayes classifiers [26], hidden Markov models [11], and most recently gradient boosted decision trees [38] have been used for sentence ranking. In this work we experiment with a popular machine learning technique, regression, to predict sentence importance.
4.2.1 Motivation
Regression is a statistical technique to model a dependent variable from a set of independent variables. It is very popular in forecasting and prediction tasks and is used over a broad spectrum of areas such as finance [50], biology, and weather prediction [19], among others. It has also been used in various Information Retrieval and Information Extraction tasks [27] [54].
Regression techniques are relatively less explored than other machine learning algorithms in the context of summarization. While classification approaches classify a sentence as relevant or non-relevant, regression predicts the exact real value of sentence importance. Other popular machine learning approaches like gradient boosted decision trees and neural networks become intractable as the feature space grows. The fact that regression techniques have been shown to perform on par with them [38] encourages us to use regression for predicting sentence importance. To the best of our knowledge, [29] and [45] are the only prior works that have used regression for predicting sentence importance. Our work goes beyond [45] by proposing more powerful features that are better predictors of sentence relevance, according to the evaluation results of summaries (Chapter 6). We also extend the regression SVM to predict sentence importance for progressive summaries.
Regression using support vectors is called Support Vector Regression (SVR). In the following sections we briefly explain SVR, our sentence importance estimation, and our summary extraction algorithms.
4.2.2 Support Vector Regression (SVR)
Regression analysis refers to techniques for predicting a real valued dependent variable from one or more independent variables. We model sentence importance as the dependent variable and the vector of feature scores as the independent variables. The theory behind support vector regression is briefly outlined below.
Consider the problem of approximating the set of training data
T = {(F1, i1), (F2, i2), ..., (Fs, is)} ⊂ F × R
where F is the space of feature vectors and R is the set of real numbers. A tuple (Fs, is) represents the feature vector Fs and importance score is of sentence s. Each sample is approximated by a linear function q(f) = ⟨w, f⟩ + b, with w ∈ F, b ∈ R. The optimal regression function is given by the minimum of the functional
Φ(w, ξ) = (1/2) ‖w‖² + C Σi (ξi⁻ + ξi⁺)
where C is a pre-specified value, and ξi⁻, ξi⁺ are slack variables representing upper and lower constraints on the outputs of the system.
Like other machine learning algorithms, support vector regression has two phases, training and testing. During the training phase we compute the feature vector of each sentence along with its importance. In the testing phase, feature vectors of all sentences are generated and the corresponding sentence importance is assessed by the trained model.
4.2.2.1 Sentence Importance Estimation
The importance score (is) is not pre-defined for sentences in the training data. We estimate it using gold-standard, human-written summaries (also known as models) on that topic.
ROUGE [33] is a recall oriented metric which automatically evaluates machine generated summaries based on their overlap with models. ROUGE-2 and ROUGE-SU4 scores correlate highly with human evaluation [31]. Hence we make the safe assumption that the importance of a sentence is directly proportional to its overlap with model summaries, and estimate sentence importance as the ROUGE-2 score of that sentence. The importance of a sentence s, denoted by is, is computed as follows:
is = ( Σ_{m ∈ models} |Bigram_m ∩ Bigram_s| ) / |s|    (4.1)
|Bigram_m ∩ Bigram_s| is the number of bigrams shared by model m and sentence s. This count is normalized by the sentence length |s|. The number of models may vary depending upon the resources. A more detailed description of ROUGE is provided in chapter 6, in our discussion of the content evaluation metrics for summarization.
4.2.3 Feature Combination
Sentence scores from different features are combined to compute the final rank of a sentence. Normally, feature scores are manually weighted to calculate the rank value. With SVR, the whole process is automated in three steps:
• Sentence tuple generation: Feature values of every sentence are extracted and its importance (is) is estimated as described in Section 4.2.2.1. Each sentence s in the training data is converted into a tuple of the form (Fs, is). Details about these features are given in Section 4.3. Fs is the vector of feature values of the sentence, Fs = {f1, f2, f3}. All the sentences in the document set are projected as sample points in this feature space.
• Model building: During this phase, a linear regression model q is built over the training vectors. The parameters of the regression model are not fine-tuned on the training data, in order to attain robustness. We used the epsilon-SVR component of the LibSVM package [10] for this purpose.
• Rank prediction: The importance of a sentence in the testing dataset is predicted by the trained model q. The estimated importance value is taken as the final rank of the sentence for further processing:
is = q(Fs)
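The three steps can be sketched end to end with a toy linear epsilon-SVR. This is a minimal pure-Python stand-in for LibSVM's epsilon-SVR, trained by subgradient descent on the primal objective given above; the feature tuples and all hyperparameter values are made up for illustration.

```python
def fit_linear_svr(X, y, epsilon=0.01, C=1.0, lr=0.01, epochs=2000):
    """Tiny linear epsilon-SVR: minimizes 0.5*||w||^2 + C * sum of
    epsilon-insensitive errors, by subgradient descent."""
    d = len(X[0])
    w = [0.0] * d
    b = 0.0
    for _ in range(epochs):
        gw = list(w)          # gradient of the 0.5*||w||^2 term is w itself
        gb = 0.0
        for xi, yi in zip(X, y):
            err = sum(wj * xj for wj, xj in zip(w, xi)) + b - yi
            if err > epsilon:        # prediction above the epsilon tube
                for j in range(d):
                    gw[j] += C * xi[j]
                gb += C
            elif err < -epsilon:     # prediction below the epsilon tube
                for j in range(d):
                    gw[j] -= C * xi[j]
                gb -= C
        w = [wj - lr * gj for wj, gj in zip(w, gw)]
        b -= lr * gb
    return w, b

def predict(w, b, x):
    """Rank prediction step: i_s = q(F_s)."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b

# Toy training tuples (F_s, i_s): feature vectors and estimated importances.
X_train = [[0.9, 0.8], [0.2, 0.1], [0.6, 0.5]]
i_train = [0.85, 0.10, 0.50]
w, b = fit_linear_svr(X_train, i_train)
```

A sentence with stronger feature scores then receives a higher predicted rank than one with weak scores.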
4.3 Extraction of sentence relevancy features
We describe here the several sentence scoring features used as part of this work. Some of these features are devised by us while others are inspired by previous work and implemented as part of our experiments. Motivation for these specific features is drawn from the observations made on news articles, as explained in section 4.1.1. Since we have a machine learning algorithm at our disposal to carry out the tedious job of combining features and scoring sentences, we are able to run numerous experiments without worrying about finding ideal weight combinations.
4.3.1 Sentence position
The position of a sentence in a document, or the position of a word in a sentence, gives good clues to the importance of the sentence or word respectively. Such features are called locational features. Locational features have been consistently used to identify the salience of a sentence; they are well studied and still used in most state-of-the-art summarization systems [24] [23]. We use the location information of a sentence in two separate ways to score a sentence.
Sentence Location 1 (SL1):
Sentence position is a very old and popular feature in summarization [14]. It relies on the presence of key sentences at specific locations in the text. According to our analysis of oracle summaries (Section 4.3.8), nearly 40% of all the sentences in the oracle summaries are picked from among the first three sentences of each document. This allows us to assume that the first three sentences of a document generally contain the most informative content of that document. We propose our first feature, Sentence Location 1:
SL1(s_nd) = 1 − n/N   if n ≤ 3
SL1(s_nd) = n/N       otherwise
where SL1(s_nd) is the score of a sentence s at position n in document d and N is the total number of sentences in the document collection. SL1 scores sentences such that
SL1(s_1d) > SL1(s_2d) > SL1(s_3d) ≫ SL1(s_nd)
Sentence Location 2 (SL2):
SL1 is a corpus-sensitive feature; it works under the heuristic that the most informative content lies at the head of a document. This heuristic works well in most cases, especially in the news genre, but it might not hold for other genres of documents such as novels or books. Sentence Location 2 (SL2) is a corpus-independent feature that assigns the positional index of a sentence in the document as its feature value. The trained model will learn the optimal sentence position for the corpus based on its genre, which need not be the head sentences. Hence this feature is not inclined towards the top or bottom few sentences of a document like SL1.
SL2(s_nd) = n
where s_n is the nth sentence in document d. SL2 is a very simple feature, but it is as effective as SL1 in determining sentence relevance.
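Both positional features reduce to one-liners. A sketch, assuming n is the 1-based sentence position and N the total sentence count as defined above:

```python
def sl1(n, N):
    """SL1: high scores for the first three sentences, per the
    head-of-document heuristic; later sentences score n/N."""
    return 1 - n / N if n <= 3 else n / N

def sl2(n):
    """SL2: the raw positional index; the regression model learns
    its optimal weight for the corpus genre."""
    return n
```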
4.3.2 TF-IDF
TF-IDF is a popular information retrieval technique to measure document relevance and consequently rank documents. We use a similar technique to measure relevance at the sentence level. The term frequency (TF) of a term t_i in a document d_j is the ratio of the number of times it occurs in d_j (n_i,j) to the total number of terms in d_j:
TF_i,j = n_i,j / Σ_k n_k,j
The inverse document frequency (IDF) of a term t_i is the logarithm of the ratio of the total number of documents in the cluster |D| to the number of documents in which the term occurs:
IDF_i = log ( |D| / |{d : t_i ∈ d}| )
While TF measures the importance of a term in a particular document, IDF measures the exclusiveness/informativeness of that term. The product of TF and IDF gives an overall measure of the salience of the word. The final score of a sentence s in document d_j is the average TF-IDF value of all the terms it contains:
TF-IDF(s) = ( Σ_{i ∈ s} TF_i,j · IDF_i ) / |s|
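As a sketch, the sentence-level TF-IDF score might be computed as follows, where the sentence and document are token lists and the corpus is a list of token lists (this tokenized representation is our assumption for illustration):

```python
import math
from collections import Counter

def tfidf_sentence_score(sentence, document, corpus):
    """Average TF-IDF of a sentence's terms; `document` is the token list
    of the document containing the sentence, `corpus` the document cluster."""
    tf = Counter(document)
    total = sum(tf.values())

    def idf(term):
        df = sum(1 for d in corpus if term in d)
        return math.log(len(corpus) / df) if df else 0.0

    if not sentence:
        return 0.0
    return sum((tf[w] / total) * idf(w) for w in sentence) / len(sentence)
```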
4.3.3 Document Frequency Score (DFS):
In conventional IR the document set is a mixture of relevant and non-relevant documents; hence IDF serves as a distinguishing feature between them. IDF is not useful in summarization, since the document collection consists only of relevant documents on a particular topic. [45] devised the Document Frequency Score (DFS), which works very well in summarization. The DFS of a word is defined as the ratio of the number of documents in which it occurs to the total number of documents in the collection. The dfs of a word w is given by
dfs(w) = |{d : w ∈ d}| / |D|
where d is a document and |D| is the total number of documents in the dataset. DFS is a simple statistical feature that exploits the relatedness of every document in the collection to compute the salience of a sentence.
4.3.4 Sentence Frequency score (SFS):
The Sentence Frequency Score (SFS) is a sentence-level variant of DFS. As every document in the collection is assumed to be relevant to the topic, it is also safe to assume that the majority of the sentences from these documents are relevant. SFS is devised to capture the most relevant of the relevant sentences in the document collection. The SFS of a word is defined as the ratio of the number of sentences in the document set in which the word occurs to the total number of sentences in the document set. The sfs score of a word w is given by
sfs(w) = |{s : w ∈ s}| / |N|
where s is a sentence and |N| is the total number of sentences in the dataset. The average sentence frequency score of all the words in a sentence is taken as its feature score:
Score(s) = ( Σ_{i ∈ s} sfs(w_i) ) / |s|
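DFS and SFS differ only in the unit of counting. A sketch, assuming documents are represented as word sets and sentences as token lists:

```python
def dfs(word, docs):
    """Document frequency score: fraction of documents containing the word."""
    return sum(1 for d in docs if word in d) / len(docs)

def sfs(word, sentences):
    """Sentence frequency score: fraction of sentences containing the word."""
    return sum(1 for s in sentences if word in s) / len(sentences)

def sfs_sentence_score(sentence, sentences):
    """Feature score: average sfs over the sentence's words."""
    return sum(sfs(w, sentences) for w in sentence) / len(sentence)
```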
4.3.5 Probabilistic hyperspace analogue to language (PHAL)
A language model, or statistical language model, is a probabilistic mechanism for generating text. A Hyperspace Analogue to Language (HAL) model constructs dependencies of a word w with other words based on their co-occurrence in a window of size k. PHAL is a probabilistic extension of HAL spaces, where term co-occurrence counts are used to compute conditional probabilities. We use the PHAL model proposed by [20] as a sentence scoring feature. PHAL can be interpreted as the probability of observing a word w′ with the word w in a window of size k:
PHAL(w′|w) = HAL(w′|w) / (n(w) × k)
Assuming word independence, the relevance of a sentence S given an information need Q is computed as
P(S|Q) ≈ Π_{w_i ∈ S} P(w_i|Q)
       ≈ Π_{w_i ∈ S} ( P(w_i) / P(Q) ) Π_{q_j} PHAL(q_j|w_i)
       ≈ Π_{w_i ∈ S} P(w_i) Π_{q_j} PHAL(q_j|w_i)
4.3.6 Kullback Leibler divergence (KLD)
Kullback-Leibler divergence, or relative entropy, is a non-symmetric measure of the difference between two probability distributions. KLD is used to calculate the generic, query-independent importance of information by a contrastive analysis of the given document set D with a random document set D′. If a term has a similar probability distribution in both D and D′, the generic importance of that term is assumed to be high. The KLD of a sentence s is computed as
KLD(s) = Σ_{i=1}^{|s|} P(w_i|D) log ( P(w_i|D) / P(w_i|D′) )
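A sketch of this computation, assuming unigram probabilities for D and D′ have been precomputed as dictionaries; the small floor value for unseen terms is our simplification, not the smoothing used in the thesis:

```python
import math

def kld_sentence(sentence, p_D, p_Dprime, eps=1e-9):
    """Sum of per-word KL terms between the topic collection D and a
    contrastive random collection D' (sketch of section 4.3.6)."""
    total = 0.0
    for w in sentence:
        pd = p_D.get(w, eps)        # P(w|D), floored for unseen words
        pdp = p_Dprime.get(w, eps)  # P(w|D')
        total += pd * math.log(pd / pdp)
    return total
```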
4.3.7 Prepositional Importance (PrepImp)
In English grammar, a preposition is a part of speech that links nouns and pronouns to other phrases in a sentence. A preposition generally represents the temporal, spatial or logical relationship of its object to the rest of the sentence. Observe the role of the prepositions on, of, to, in, from in the sentences below:
The book is on the table
The President of India lives in Delhi
The Indian cricket team is traveling from Australia to New Zealand
It is very interesting to observe how prepositions implicitly capture the key elements of a sentence. The preposition on in the first sentence conveys that there is a book, a table, and some relation between them. Similarly, the other two sentences carry key information about one or more entities, implicitly conveyed through the connecting prepositions. To the best of our knowledge, the role of prepositions has never before been explored to calculate sentence importance.
As a first step in this direction, we propose using the frequency of a small set of prepositions in a sentence as its feature score. The frequency of prepositions indirectly achieves the effect of performing Named Entity Recognition (NER) on a sentence, but without any additional processing cost or POS tags. The score of a sentence s calculated by PrepImp is given as
PrepImp(s) = ( Σ_{w_i ∈ s} IsPrep(w_i) ) / |s|
The list of prepositions used for calculating sentence importance is limited to simple single-word prepositions like in, on, of, at, for, from, to, by, with, chosen after a careful observation of the data.
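The feature is a simple ratio over the preposition list given above; a sketch, with whitespace tokenization as our simplifying assumption:

```python
# The single-word preposition list named in the text above.
PREPOSITIONS = {"in", "on", "of", "at", "for", "from", "to", "by", "with"}

def prep_imp(sentence):
    """PrepImp: fraction of a sentence's tokens that are prepositions."""
    tokens = sentence.lower().split()
    if not tokens:
        return 0.0
    return sum(1 for t in tokens if t in PREPOSITIONS) / len(tokens)
```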
4.3.8 Oracle Summaries
An oracle summary is the best sentence-extractive summary that can be generated by any sentence-extractive summarization system for a particular topic. We generated sentence-extractive oracle summaries using the document collection and the human-written summaries for each topic. Each sentence is scored using equation 4.1, and this score is taken as its final rank. Summaries are extracted as described in section 4.1.1. Oracle summaries serve as the upper limit of what can be achieved through extractive methods in summarization. They are generated to study the most informative sentences in the document collection and to depict the scope for improvement in summarization.
4.4 Summary
In this chapter we described a general multi-document summarization framework, explaining its pre-processing, feature extraction, sentence ranking and summary extraction stages. Normally, sentence rank is computed from a weighted linear combination of features. Instead, we use a supervised machine learning technique, regression, to estimate sentence importance from the feature vector. We provided details about the theory and mathematical formulation of the scoring features that will be used in the experiments of chapter 6. These features include existing features such as PHAL, KLD, DFS and TF-IDF, and newly devised features such as SFS, PrepImp, SL1 and SL2. Oracle summaries are generated with the objective of finding the upper limit of what can be achieved through extractive summarization, and thus depict the scope for improvement over existing approaches.
Chapter 5
Progressive Summarization: Summarization with Novelty
Detection
Summarization, in its basic essence, is to extract the essential information from a collection of textual content and present it in a user-readable format. It is a multi-disciplinary problem with roots in Information Retrieval, Natural Language Processing and Cognitive Science. Over the years, research in summarization has led to some interesting sub-problems, such as personalized summarization, multi-document summarization, and query-focused summarization, among others.
Progressive summarization is a relatively new area within summarization, designed to aid users who have access to a rapidly flowing stream of articles on a topic but no time to look at each article. In such a situation, a person would prefer to be updated on events within the topic, and dive into details only when the reported events trigger enough interest. The focus within the summarization community has shifted towards progressive summarization since the introduction of the Update Summarization track at the Document Understanding Conference in 2007.
It is not sensible to wait for the topic to conclude before producing a summary, nor does it make sense to produce an overall summary at every time interval; after all, the user has already been informed about prior events. Hence progressive summaries are generated at regular time intervals to update the user’s knowledge of a particular news topic.
The major challenge in progressive summarization lies in distinguishing relevant and novel vs. relevant and redundant vs. non-relevant information. Detecting novel information in the source documents, given the user’s prior knowledge of the topic, is termed novelty detection. A novelty detection module in the summarization framework is very important for identifying relevant new information.
Figure 5.1 Novelty detection at different stages in a Multi Document Summarizer
In this thesis we identify the possibility of novelty detection at the feature extraction, sentence ranking and summary extraction stages of summarization, as shown in figure 5.1. We propose different techniques at each stage; the details are given in the following sections.
5.1 Feature Extraction level
In general multi-document summarization systems, word-level or sentence-level features calculate their scores by measuring relevance to the topic. In the feature extraction stage of progressive summarization, however, features should be capable of capturing sentence novelty along with relevance. In our work we devised three such features: Novelty Factor (NF), New Words (NW) and Hybrid Kullback-Leibler Information Divergence (HKLID).
Imagine a set of articles published on an evolving news topic over a time period T, with t_d being the publishing timestamp of article d. All the articles published from time 0 to time t are assumed to have been read previously, and hence form the prior knowledge, pdocs (short for previous documents). Articles published in the interval t to T, which contain new information, are considered ndocs (short for new documents):
ndocs = {d : t_d > t}
pdocs = {d : t_d ≤ t}
5.1.1 Novelty Factor (NF)
We propose a new feature, the novelty factor (NF), that primarily focuses on the progressive summarization problem. The novelty factor is inspired by DFS in generic multi-document summarization. The essence of novelty is to find information that is dominant and relevant in the new cluster of documents (ndocs), rather than information already present in the prior knowledge. Novelty is directly proportional to relevance in ndocs and inversely proportional to dominance in pdocs. The NF of a word w is calculated as
NF(w) = |nd_t| / ( |pd_t| + |D| )
nd_t = {d : w ∈ d ∧ d ∈ ndocs}
pd_t = {d : w ∈ d ∧ d ∈ pdocs}
D = {d : t_d > t}
The numerator |nd_t| is the number of documents in the new cluster that contain the word w. It is directly proportional to the relevance of the term, since all the documents in the cluster are relevant to the topic. The term |pd_t| in the denominator penalizes any word that occurs frequently in previous clusters; in other words, it elevates the novelty of a term. |D| is the total number of documents in the current cluster, which smooths the novelty factor when w does not occur in the previous clusters. The NF score of a sentence is a measure of its relevance and novelty to the topic. The score of a sentence s is the average NF value of its content words:
Score(s) = ( Σ_{w_i ∈ s} NF(w_i) ) / |s|
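A sketch of the feature, assuming ndocs and pdocs are given as lists of word sets and sentences as token lists (the example words are made up):

```python
def novelty_factor(word, ndocs, pdocs):
    """NF(w) = |nd_t| / (|pd_t| + |D|), with |D| the current cluster size."""
    nd_t = sum(1 for d in ndocs if word in d)
    pd_t = sum(1 for d in pdocs if word in d)
    return nd_t / (pd_t + len(ndocs))

def nf_sentence_score(sentence, ndocs, pdocs):
    """Average NF over the sentence's content words."""
    return sum(novelty_factor(w, ndocs, pdocs) for w in sentence) / len(sentence)
```

Note how a word appearing only in the new cluster scores higher than one already seen in pdocs, matching the intuition above.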
5.1.2 New Word Measure (NW)
The motivation behind NW comes from the TREC Novelty track, where many systems estimate the newness of a sentence as the number of new words it contains. A word that never occurred before in the document cluster is considered new, so all the words that are not present in pdocs are regarded as new. The NW score of a sentence s is given by
Score(s) = ( Σ_{w ∈ s} NW(w) ) / |s|
NW(w) = 0      if w ∈ pdocs
NW(w) = n/N    otherwise
where n is the frequency of w in ndocs and N is the total term frequency of ndocs. The normalized term frequency of w is used in calculating the feature score of a sentence. Unlike NF, NW captures only the newness of a sentence, not its relevance; it has to be used in combination with other relevance features to calculate the relevant novelty of a sentence.
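A sketch of the measure, assuming pdocs and ndocs are given as flat token lists (the example tokens are made up):

```python
from collections import Counter

def nw_scores(ndocs_tokens, pdocs_tokens):
    """NW(w): 0 for words already seen in pdocs, else the word's
    frequency in ndocs normalized by the total ndocs term count."""
    seen = set(pdocs_tokens)
    freq = Counter(ndocs_tokens)
    total = sum(freq.values())
    return {w: (0.0 if w in seen else freq[w] / total) for w in freq}

def nw_sentence_score(sentence, nw):
    """Average NW over the sentence's tokens."""
    return sum(nw.get(w, 0.0) for w in sentence) / len(sentence)
```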
5.1.3 Hybrid Kullback-Leibler Information Divergence (HKLID)
Kullback-Leibler information divergence (KL) [25] is a popular technique to measure the difference between two probability distributions. The principle behind KL is used to assess the generic importance of a sentence in standard summarizers, as explained in section 4.3.
We use an extension of KL to measure the divergence between hybrid language models (LMs) of two sentences, built over pdocs and ndocs. A hybrid language model is the combination of document and sentence language models, for better divergence calculation. The HKLID between the LMs of two sentences s_i in ndocs and s_j in pdocs is calculated as
HKLID(s_i ‖ s_j) = Σ_{w ∈ s_i} P(w|s_i) P(w|ndocs) · log ( P(w|s_i) P(w|ndocs) / ( P(w|s_j) P(w|pdocs) ) )
HKLID measures the importance of a sentence in ndocs conditioned on the sentences in pdocs: the greater the divergence between these hybrid language models, the greater the novelty of the sentence. The average HKLID between a sentence s_i in ndocs and all the sentences in pdocs is used as its novelty score. Probability distributions are smoothed using the Dirichlet principle [52].
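A sketch of the divergence, with sentences as token lists and collection unigram probabilities passed in as dictionaries. The maximum-likelihood sentence LMs and the small floor for unseen terms are our simplifications; the thesis uses Dirichlet smoothing.

```python
import math

def hklid(si, sj, p_ndocs, p_pdocs, eps=1e-9):
    """Hybrid KL divergence between a sentence si (from ndocs)
    and a sentence sj (from pdocs)."""
    lsi = {w: si.count(w) / len(si) for w in set(si)}  # ML sentence LM of si
    lsj = {w: sj.count(w) / len(sj) for w in set(sj)}  # ML sentence LM of sj
    total = 0.0
    for w in lsi:
        num = lsi[w] * p_ndocs.get(w, eps)             # P(w|si) P(w|ndocs)
        den = lsj.get(w, eps) * p_pdocs.get(w, eps)    # P(w|sj) P(w|pdocs)
        total += num * math.log(num / den)
    return total
```

Identical sentences under identical collection models diverge by zero, while a sentence made of unseen words diverges strongly, which is the behavior the feature relies on.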
5.2 Sentence Ranking Level
The sentence ranker combines a multitude of scoring features into a single rank that is directly proportional to the importance of a sentence. In progressive summarization, the ranked sentence list is improved by re-ordering sentences through two different techniques, “Redundancy Re-ranking” and “Proximity Re-ranking”. The goal of re-ordering is to promote sentences carrying new information over stale ones in the ranked list.
5.2.1 Redundancy Re-ranking
In redundancy re-ranking, the ranked set is re-ordered using the Maximal Marginal Relevance (MMR) [9] criterion. MMR computes the importance of an element through a combined criterion of query relevance and information novelty. The linear combination of independently measured relevance and novelty of an element is referred to as its “marginal relevance”; the method strives to maximize marginal relevance, hence the name.
The final rank of a sentence is computed as a weighted linear combination of the original sentence rank and the redundancy measure of that sentence:
Rank_{s_i} = µ · score_{s_i} − (1 − µ) · redundancy_score_{s_i}
where “score” is the original sentence score predicted by the regression model as described in section 4.2, and “redundancy score” is an estimate of the amount of prior information a sentence contains. µ is a balancing parameter to adjust the relevance and novelty of a sentence. In this work the redundancy score of a sentence is calculated by its Information Theoretic Similarity (ITSim) and Cosine Similarity (CoSim) to the previous sentences.
Information Theoretic Similarity (ITSim)
[34] presented an information theoretic definition of similarity and demonstrated its application in various domains. This definition of similarity does not assume a particular domain or type of problem; it is applicable as long as the domain has a probability model. Unlike other similarity measures it is not defined by a formula, but derived from a set of assumptions about similarity between two entities. The similarity between A and B is measured by the ratio between the amount of information needed to state the commonality of A and B and the information needed to fully describe what A and B are:
sim(A, B) = log P(common(A, B)) / log P(description(A, B))    (5.1)
According to information theory, entropy quantifies the amount of information carried by a message. Extending this analogy to text content, the information I(w) of a word w is calculated as
I(w) = −p(w) · log(p(w)),   p(w) = n/N
Motivated by the information theoretic definition of similarity, we extend the similarity described in equation 5.1 to two sentences s1 and s2:
ITSim(s1, s2) = 2 · Σ_{w ∈ s1 ∧ s2} I(w) / ( Σ_{w ∈ s1} I(w) + Σ_{w ∈ s2} I(w) )
The information of a sentence is calculated as the entropy of all the words it contains. The numerator is proportional to the commonality between s1 and s2; the denominator measures the description of both sentences.
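A sketch of ITSim, assuming sentences are token lists and p is a precomputed corpus unigram distribution (the toy probabilities in the test are made up):

```python
import math

def information(w, p):
    """I(w) = -p(w) * log(p(w)) for a word under distribution p."""
    pw = p[w]
    return -pw * math.log(pw)

def itsim(s1, s2, p):
    """Information-theoretic similarity between two tokenized sentences:
    twice the shared information over the total information of both."""
    common = set(s1) & set(s2)
    num = 2 * sum(information(w, p) for w in common)
    den = (sum(information(w, p) for w in set(s1)) +
           sum(information(w, p) for w in set(s2)))
    return num / den if den else 0.0
```

Identical sentences score 1 and sentences sharing no words score 0, as the ratio form requires.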
Cosine Similarity (CoSim)
Cosine similarity is a popular, long-standing technique often used to compare documents in text
mining. It measures the similarity between two n-dimensional vectors by the cosine of
the angle between them. Given two vectors A,B, the cosine similarity is represented using a
dot product and magnitude as,
Sim(A, B) = cos(θ) = (A · B) / (|A| |B|)
Sentences are represented as tf-idf vectors [43] of their constituent words in an n-dimensional
space. The term frequency of each word determines its component value, and the number of unique
words in the document collection determines the dimension n. The cosine similarity between two
sentences is measured as,
CoSim(s1, s2) = Σ_{w ∈ s1 ∧ s2} tfidf(w, s1) * tfidf(w, s2) / sqrt( Σ_{w ∈ s1} tfidf(w)² * Σ_{w ∈ s2} tfidf(w)² )
The maximum similarity value (ITSim or CoSim) of a sentence in ndocs with all sentences in pdocs
is taken as its redundancy score.
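A minimal sketch of CoSim and the redundancy-based re-ranking above might look as follows. For simplicity it assumes a single precomputed global tf-idf weight per word, and µ = 0.7 is only a placeholder, since the thesis sets the balancing parameter manually:

```python
import math
from collections import Counter

def cosim(s1, s2, tfidf):
    """Cosine similarity between two sentences in tf-idf space.
    `tfidf` maps each word to one precomputed global weight (a simplification)."""
    w1, w2 = Counter(s1.split()), Counter(s2.split())
    dot = sum(tfidf.get(w, 0.0) ** 2 for w in w1.keys() & w2.keys())
    n1 = math.sqrt(sum(tfidf.get(w, 0.0) ** 2 for w in w1))
    n2 = math.sqrt(sum(tfidf.get(w, 0.0) ** 2 for w in w2))
    return dot / (n1 * n2) if n1 and n2 else 0.0

def rerank(ndoc_sentences, scores, pdoc_sentences, tfidf, mu=0.7):
    """Rank(si) = mu * score(si) - (1 - mu) * redundancy_score(si), where the
    redundancy score is the maximum similarity to any pdocs sentence."""
    ranked = []
    for s, score in zip(ndoc_sentences, scores):
        redundancy = max((cosim(s, p, tfidf) for p in pdoc_sentences), default=0.0)
        ranked.append((mu * score - (1 - mu) * redundancy, s))
    return sorted(ranked, reverse=True)
```

Swapping `cosim` for an `itsim` implementation yields the ITSim variant of the same re-ranking.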
5.2.2 Proximity Re-ranking (ProximRank)
TREC Novelty track
NIST has created gold standard data for evaluations as part of the Novelty track at TREC
2002, 2003 and 2004, consisting of about 50 topics. Each topic has a set of relevant and irrelevant
documents, and sentences marked with relevance and novelty judgments. [46] investigated the
percentage of relevant and novel sentences in the document collection and the adjacency of
relevant sentences for the 2003 and 2004 data. Table 5.1 shows excerpts of this analysis.
             2003   2004
Relevant     0.39   0.20
Consecutive  0.91   0.70
Novelty      0.68   0.40

Table 5.1 Statistics of relevant, novel and consecutive relevant sentences in TREC 2003 and 2004
Almost 40% of the sentences were selected as relevant, and in particular 90% of the relevant
sentences were adjacent. The analysis also shows a huge disparity in the fraction of relevant
and novel sentences between 2003 and 2004. However, the authors did not explore the
importance of proximity among novel sentences.
We carried out experiments on TREC 2004 Novelty track data and found that about 75% of
novel sentences occur in pairs and approximately 61% occur in groups of three. These statistics
indicate that new information often spans a group of sentences belonging to the same context.
Hence it is intuitive to compute the rank of a sentence using the relevant novelty of its
surrounding sentences. The final rank of a sentence si after proximity re-ranking is,
Rank(si) = λ * score(si) + (1 − λ) * ( score(si−1) + score(si+1) ) / 2
where score(si) is the relevant-novelty score estimated by the regression model and λ is a
balancing parameter that is set manually.
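Proximity re-ranking reduces to a one-pass smoothing of the score list. In this sketch the boundary handling (a sentence reusing its own score for a missing neighbour) is our assumption, as the text does not specify it, and λ = 0.8 is only a placeholder:

```python
def proxim_rank(scores, lam=0.8):
    """Rank(si) = lam * score(si) + (1 - lam) * (score(si-1) + score(si+1)) / 2.
    `scores` holds relevant-novelty scores of consecutive sentences in
    document order. Boundary sentences reuse their own score for the
    missing neighbour (an assumption, not specified in the thesis)."""
    n = len(scores)
    ranks = []
    for i, s in enumerate(scores):
        prev = scores[i - 1] if i > 0 else s
        nxt = scores[i + 1] if i < n - 1 else s
        ranks.append(lam * s + (1 - lam) * (prev + nxt) / 2)
    return ranks
```

A sentence surrounded by high-scoring neighbours is thus pulled up the ranked list even if its own score is modest.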
5.3 Summary Extraction Level
Summary extraction is the final stage of summarization, where sentences from the ranked list
are selected into the summary subject to redundancy, coherence and other constraints of the
framework. At this stage, novelty can be induced into the summary by selecting only sentences
that are estimated to contain relevant novelty given the background knowledge.
Novelty Pool (NP)
A progressive summarizer assumes that the user is most concerned with finding new information
and, because of his background knowledge, is intolerant of reading information he already knows.
Since each sentence adds to the user's knowledge, and later sentences should be retrieved only
if they contain new information, novelty retrieval resembles a filtering task.
We model novelty detection as a filtering task at the summary extraction stage. Sentences that
possibly contain prior information are filtered out of the summary by creating a Novelty Pool
(NP). We introduce the notion of dominant and novel words to explain the intuition behind NP.
A word w is considered dominant if its DFS is above half of the total documents
in the cluster. Two sets of dominant words are generated, one each for pdocs and ndocs,
domndocs = {w : DFSndocs(w) > ndocs/2}
dompdocs = {w : DFSpdocs(w) > pdocs/2}
The difference of these two sets gives us the list of novelwords,
novelwords = domndocs − dompdocs
Thus we extract a set of novelwords that are both dominant and new. During summary
extraction we select the sentences having more novelwords than the average novelwords-per-
sentence ratio (npr). This set of sentences is referred to as the Novelty Pool (NP).
novelwordcount(si) = Σ_{j=1}^{|si|} isnovelword(wij)

npr = ( Σ_{i=1}^{|S|} novelwordcount(si) ) / |S|

NP = {si : novelwordcount(si) > npr}
S represents the set of sentences in ndocs, and |S| is the cardinality of set S.
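The Novelty Pool construction can be sketched as follows, representing each document as a list of sentence strings; the helper names are illustrative, not from the thesis:

```python
from collections import Counter

def document_frequency(docs):
    """DFS(w): the number of documents in a cluster that contain word w.
    Each document is a list of sentence strings."""
    df = Counter()
    for doc in docs:
        for w in {w for sent in doc for w in sent.split()}:
            df[w] += 1
    return df

def novelty_pool(pdocs, ndocs):
    """Sentences of ndocs containing more novelwords than the average
    novelwords-per-sentence ratio (npr)."""
    dom_n = {w for w, c in document_frequency(ndocs).items() if c > len(ndocs) / 2}
    dom_p = {w for w, c in document_frequency(pdocs).items() if c > len(pdocs) / 2}
    novelwords = dom_n - dom_p  # dominant in ndocs but not in pdocs
    sentences = [s for doc in ndocs for s in doc]
    counts = [sum(1 for w in s.split() if w in novelwords) for s in sentences]
    npr = sum(counts) / len(sentences)
    return [s for s, c in zip(sentences, counts) if c > npr]
```

During summary extraction, only sentences returned by `novelty_pool` would be eligible for selection.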
5.4 Summary
In this chapter we described the role and importance of novelty detection in progressive
summarization. We explained how novelty detection is integrated into the summarization frame-
work and why it is important for producing informative progressive summaries. The sections of
this chapter described the novelty detection techniques used at the feature extraction, sentence
ranking and summary extraction stages of the summarizer. We developed two new scoring features
to capture the relevancy along with the novelty of a sentence. We used content based and
proximity based re-ranking measures to improve the ranked list of sentences. Finally, at the
summary extraction stage we modeled novelty detection as a filtering task and filtered out
sentences probably containing prior information.
Chapter 6
Evaluation
Evaluation of summaries can broadly be classified into two classes. The first is extrinsic
evaluation, which measures the effect of summarization on the completion of other tasks such as
relevance assessment and text categorization. The second is intrinsic evaluation, which tests
the summarization itself, mainly assessing the informativeness and cohesiveness of the summary.
In this thesis we evaluate our progressive summarization techniques and the advantage of
supervised sentence ranking over the traditional weighted linear approach through the popular
intrinsic measure Recall-Oriented Understudy for Gisting Evaluation (ROUGE). This chapter
discusses the experimental setup and the evaluation results of our experiments and later
compares them to state-of-the-art approaches in summarization.
6.1 Dataset
We conducted all our experiments on the TAC update summarization track dataset, which serves
as an ideal testbed for evaluating progressive summaries. The update summarization scenario
described in TAC assumes that each user is an educated adult US native who is aware of current
events as they appear in the news. The user is interested in a particular news story and wants to
track it as it develops over time, so he subscribes to a news feed that sends him relevant articles
as they are submitted from various news services. However, either there is so much news that
he cannot keep up with it, or he has to leave for a while and then wants to catch up. Whenever he
checks up on the news, it bothers him that most articles keep repeating the same information;
he would like to read summaries that only talk about what’s new or different.
In this scenario, a user initially gives a topic statement (query and narrative) expressing
his information need. News articles about the story then arrive in batches over time (clusters
of articles), and the task is to write a 100-word summary for each cluster of articles that
addresses the information need of the user.
The test dataset comprises 48 topics. Each topic has a topic statement (query)
and 20 relevant documents, divided into two sets: cluster A (pdocs) and cluster
B (ndocs). Each document set has 10 documents, and all the documents in pdocs chrono-
logically precede the documents in ndocs. The documents come from the AQUAINT-2
collection of news articles. Figure 6.1 shows a sample topic statement and document set de-
scriptions. Each topic statement and its two document sets have four model summaries (gold
standard) written by professional NIST assessors. These model summaries are used by NIST
to evaluate the content of system generated summaries (peers/peer summaries).
Figure 6.1 Sample topic and narrative in TAC 2008
The task description and the structure of the data in TAC 2008 and TAC 2009 remain the
same, the difference being a clearer distinction of events across the clusters. This allowed us to
use TAC 2008 data for training the regression models and TAC 2009 data to carry out the experiments.
6.2 Evaluation Metrics
We evaluate the quality of summaries using both content and form based evaluation mea-
sures. The focus is mainly on intrinsic content based evaluations. Below we provide the details
about the evaluation metrics that are used in our experiments.
6.2.1 ROUGE
ROUGE is a recall oriented metric that automatically scores peer summaries based on pair-
wise comparisons with reference summaries. It provides several measures that count
the number of overlapping units such as n-grams, word sequences and word pairs between
peer and model summaries. The ROUGE [33] package is an openly available resource with four
major measures: ROUGE-N, ROUGE-L, ROUGE-W and ROUGE-S. Below we briefly describe
ROUGE-N and ROUGE-S, variants of which are popularly used to compare
summarization systems.
ROUGE-N is an n-gram recall between a candidate summary and a set of reference sum-
maries. ROUGE-N is computed as follows:
ROUGE-N = ( Σ_{m ∈ models} Σ_{gram_n ∈ m} Count_match(gram_n) ) / ( Σ_{m ∈ models} Σ_{gram_n ∈ m} Count(gram_n) )
where n is the length of the n-gram and Count_match(gram_n) is the maximum number of n-grams
co-occurring in the candidate summary and the model summary m. ROUGE is clearly a recall
oriented metric, since the denominator of the equation is the total number of n-grams occurring
in the model summaries. When multiple reference summaries are used, a pairwise summary-level
ROUGE-N between the candidate summary s and every model m in the set of model summaries is
computed. Then the maximum of the pairwise
summary-level ROUGE-N scores is treated as the final multiple-reference ROUGE-N score.
The jackknifing principle is used in computing the final ROUGE scores to make model summaries
comparable with peers.
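The ROUGE-N equation, in the summed multi-reference form given above, can be sketched as follows. Note that the official ROUGE toolkit additionally applies the pairwise maximum and jackknifing just described, which this illustrative version omits:

```python
from collections import Counter

def ngrams(text, n):
    """Multiset of word n-grams of a text."""
    toks = text.split()
    return Counter(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))

def rouge_n(candidate, models, n=2):
    """N-gram recall of a candidate summary against model summaries.
    Count_match clips each model n-gram's count at the candidate's count;
    the denominator is the total n-gram count of the models (recall)."""
    cand = ngrams(candidate, n)
    match = total = 0
    for m in models:
        ref = ngrams(m, n)
        match += sum(min(c, cand.get(g, 0)) for g, c in ref.items())
        total += sum(ref.values())
    return match / total if total else 0.0
```

Because only the denominator grows with the models, a longer candidate can only help, which is why ROUGE-N is a recall measure.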
ROUGE-S: Skip-bigrams are length-2 subsequences of a string, having the words in the same
order as the original sequence but with arbitrary gaps between them. Skip-bigram co-occurrence
statistics measure the overlap of skip-bigrams between a candidate summary and a set of refer-
ence summaries. For a better understanding of skip-bigrams, consider the following example:
• S1: sachin tendulkar plays cricket
• S2: sachin plays test cricket
• S3: plays cricket sachin always
• S4: plays cricket sachin tendulkar
Each sentence has 6 skip-bigrams. The first sentence has the skip-bigrams (sachin ten-
dulkar, sachin plays, sachin cricket, tendulkar plays, tendulkar cricket, plays cricket). S2 has
three matches with S1, S3 has one, while S4 has two. Given a model summary M of length m
and a peer summary S of length n, with SKIP2(M, S) the number of matching skip-bigrams
between M and S, ROUGE-S is computed as follows:
R_skip2 = SKIP2(M, S) / C(m, 2)
P_skip2 = SKIP2(M, S) / C(n, 2)
ROUGE-S = (2 * R_skip2 * P_skip2) / (R_skip2 + P_skip2)
ROUGE-SU is an improved version of ROUGE-S. One potential problem with ROUGE-S is
that it gives no credit to a candidate sentence that has no word pair co-occurring with the
models. To address this, a simple extension, ROUGE-SUn, is employed, where n is the maximum
skip distance for a bigram. ROUGE-SU counts all the bigrams counted by ROUGE-S plus all
unigrams, and hence removes the above problem.
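The skip-bigram count and the ROUGE-S F-measure can be sketched as follows, treating the skip-bigrams of a sentence as a set, which matches the distinct-word examples above:

```python
from itertools import combinations
from math import comb

def skip_bigrams(text):
    """All in-order word pairs of a sentence, with arbitrary gaps."""
    return set(combinations(text.split(), 2))

def rouge_s(model, peer):
    """Skip-bigram F-measure, with C(m,2) and C(n,2) as the
    recall and precision denominators respectively."""
    overlap = len(skip_bigrams(model) & skip_bigrams(peer))
    if not overlap:
        return 0.0
    r = overlap / comb(len(model.split()), 2)
    p = overlap / comb(len(peer.split()), 2)
    return 2 * r * p / (r + p)
```

Running it on the example sentences reproduces the counts given in the text: S1 has 6 skip-bigrams, of which S2 matches three, S3 one and S4 two.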
As a standard routine, the ROUGE-2 and ROUGE-SU4 scores of summaries are considered
yardsticks of evaluation throughout the DUC and TAC series of workshops. Hence we also
consider only these two scores in our evaluations.
6.2.2 Pyramids
The pyramid method of evaluation provides a unified way to handle semantic equivalence,
human variation, analytic granularity and other aspects of a summary at various levels
of granularity. The key assumption of pyramids, like ROUGE, is the need for multiple human
authored model summaries, which are considered a gold standard for peer summaries.
Summary Content Units (SCUs), also referred to as semantic content units, are
semantically motivated subsentential units of variable length. SCUs emerge from an-
notation of the set of model summaries for a topic. Sentences in the summaries are broken down
into clauses, each of which is an SCU in the pyramid. Each SCU has an associated weight indicat-
ing the number of model summaries in which it appeared. Repetition of information, through
changes as small as a modifier or as large as a clause in the model summaries, gives rise to a single
SCU. Peer summaries are evaluated based on the presence of SCUs and their corresponding
weights.
A key feature of a pyramid is that it quantitatively represents agreement among the human
summaries: SCUs that appear in more of the human summaries are weighted higher. Such
weighting allows differentiating important content from less important content, which is
necessary in summarization evaluation given the personal opinions of assessors while
writing summaries. Fine details about SCUs and the pyramid evaluation as a whole are described
in [3].
6.2.3 Readability and overall responsiveness
Along with the above mentioned content based evaluations, the readability of a summary is
also assessed during TAC evaluations. Readability was assessed using five linguistic quality
questions measuring qualities of the summary that do not involve comparison with a reference
summary or TAC topic. The linguistic qualities measured were Grammaticality, Non-redundancy,
Referential clarity, Focus, and Structure and coherence. Humans assessed peer summaries on
these questions and assigned a score on a five point scale, where 1 denotes the worst and 5 the
best summary.
NIST assessors assigned an overall responsiveness score to each of the automatic and human
summaries. The overall responsiveness score is an integer between 1 (very poor) and 10 (very
good) and is based on both the linguistic quality of the summary and the amount of information
in the summary that helps to satisfy the information need defined in the topic's narrative.
6.3 Evaluation of Supervised ranking
In this section we assess the supervised ranking technique, regression, discussed in chapter 4
and provide the evaluation results for various combinations of features.
6.3.1 Kernel Functions
Support Vector Machines construct a hyperplane or set of hyperplanes in a high or infinite
dimensional space, which can be used for classification, regression or other tasks. Intuitively,
a good separation is achieved by the hyperplane that has the largest distance to the nearest
training data points of any class (so-called functional margin), since in general the larger the
margin the lower the generalization error of the classifier.
Real world problems are often stated in a finite dimensional space in which the sets to be
discriminated are not linearly separable. For this reason it was proposed that the original
finite dimensional space be mapped into a much higher dimensional space, making the separation
easier there. SVM schemes use such a mapping while ensuring that dot products in the larger
space can be computed cheaply in terms of the variables in the original space, keeping the
computational load reasonable. The dot products in the larger space are defined in terms of a
kernel function. There are various kernel functions, out of which a suitable one needs to be
selected for the problem.
We experimented with four popular kernel functions, Linear, Sigmoid, Polynomial and
Radial Basis, for our regression problem. In tables 6.1 and 6.2 we present the ROUGE-2 and
ROUGE-SU4 scores of summaries generated using DFS as a single feature for pdocs and ndocs
respectively.
pdocs ROUGE-2 ROUGE SU4
Linear 0.10133 0.13839
Sigmoid 0.08208 0.12009
Polynomial 0.10133 0.13839
Radial Basis 0.10230 0.13927
Table 6.1 ROUGE-2, ROUGE SU4 scores of pdocs using different kernels
ndocs ROUGE-2 ROUGE SU4
Linear 0.02020 0.06068
Sigmoid 0.06714 0.10648
Polynomial 0.04845 0.09407
Radial Basis 0.08548 0.12680
Table 6.2 ROUGE-2, ROUGE SU4 scores of ndocs using different kernels
From the evaluation results, we observed that the radial basis function suits our problem
better than the linear, sigmoid or polynomial kernels. Hence we use the radial basis function
as our kernel in further experiments.
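The four kernels compared here can be written as plain functions over feature vectors. The hyperparameters shown (degree, gamma, coef0) are illustrative defaults, not the settings used in these experiments, which an SVR library would normally tune or expose:

```python
import math

def linear(x, y):
    """K(x, y) = x . y"""
    return sum(a * b for a, b in zip(x, y))

def polynomial(x, y, degree=3, coef0=1.0):
    """K(x, y) = (x . y + coef0)^degree"""
    return (linear(x, y) + coef0) ** degree

def rbf(x, y, gamma=0.5):
    """K(x, y) = exp(-gamma * ||x - y||^2)"""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))

def sigmoid(x, y, gamma=0.5, coef0=0.0):
    """K(x, y) = tanh(gamma * x . y + coef0)"""
    return math.tanh(gamma * linear(x, y) + coef0)
```

Each function replaces the dot product of the mapped vectors, so the regression never works in the high dimensional space explicitly.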
6.3.2 Regression vs. Weighted Linear Scoring
All the sentence scoring features described in chapter 4 are evaluated individually using
both regression and weighted linear scoring. We carried out our experiments at two levels:
first at the individual feature level and later on combinations of different features. At the
first level, the feature vector (Fs) of a sentence s has only one value. Evaluation results of
the summaries generated using regression are compared against summaries generated through
normal ranking of single features (the weight is 1, as there is only one feature). The results
are presented in tables 6.3 and 6.4 for pdocs and ndocs respectively.
                 pdocs                  ndocs
Feature          Regression  Normal    Regression  Normal
DFS 0.10230 0.09574 0.08548 0.08954
KL 0.09751 0.09499 0.08822 0.08601
PHAL 0.07736 0.09245 0.06815 0.08255
SL1 0.09402 0.09069 0.08491 0.08144
SL2 0.09142 0.03443 0.08334 0.03711
SFS 0.08768 0.08504 0.07497 0.07705
PrepImportance 0.04594 0.04683 0.05011 0.04847
TF-IDF 0.08075 0.03916 0.06779 0.03899
Table 6.3 ROUGE-2 scores of pdocs and ndocs using regression to estimate sentence importance
Analysis of Regression Evaluation:
As all the documents in a cluster are relevant to the topic, the intuition behind using DFS as a
relevancy feature worked well in generic summarization. Similarly the features SL1 and SL2, which
boost the first and last sentences of an article, are successful, as the most important information
is usually present at the top of a document. The query focused feature PHAL and the query
independent feature KL achieved good results, as expected from their past success in the DUC
shared tasks. PrepImportance, designed as a preliminary feature to exploit the usage of
prepositions in a sentence, did not fare well, which is not surprising as it considers only the
count of prepositions as a relevancy measure. It is evident from the evaluation results that
regression is at least as good as the normal ranking procedure and even better for features like
DFS, KL, SL1 and SFS. In particular, SL2 shows a huge gap between its regression model and the
normal ranking scheme, as SL2 scores a sentence by its relative position in the document; it does
not make much sense to use SL2 as an individual feature in a normal ranking procedure without a
learning model like regression. According to the results, support vector regression estimates
sentence importance better than the feature itself in most cases.
                 pdocs                  ndocs
Feature          Regression  Normal    Regression  Normal
DFS 0.13927 0.13374 0.12680 0.13039
KL 0.13875 0.13514 0.13124 0.12929
PHAL 0.11621 0.13050 0.11038 0.12929
SL1 0.12985 0.12653 0.12596 0.12222
SL2 0.12759 0.07654 0.12427 0.08256
SFS 0.13000 0.12657 0.12067 0.12034
PrepImportance 0.09203 0.07840 0.09907 0.09700
TF-IDF 0.12177 0.07968 0.11156 0.08228
Table 6.4 ROUGE-SU4 scores of pdocs and ndocs using regression to estimate sentence importance
6.3.3 Combination of features
The results of the first level of experiments promote the use of regression over the normal
ranking procedure in summarization. At the next level, we experimented with combinations of
the features used for sentence scoring. The feature vector (Fs) now has more than one value,
depending on the number of scoring features combined. Unlike weighted linear ranking,
regression allows us to combine any number of features without worrying about optimal weight
combinations. ROUGE scores for some of these combinations are provided in tables 6.5 and 6.6.
Analysis of Results:
The advantage of regression shows when two complementary features are combined to produce
a more desirable summary. For example, the combination DFS+SL1 produced better quality
summaries than either DFS or SL1 alone. The feature TF-IDF is not effective as a standalone
feature, but the combination DFS+SL1+TFIDF resulted in better summaries than DFS+SL1. It is
also to be observed that combining two successful features does not always give a better result,
as the features may not enhance the overall combination. Consider DFS and KL, both very
successful individual features, but the combination DFS+KL has not produced summaries as good
as DFS+SL1. This is because SL1 and DFS complement each other whereas DFS and KL do not.
Similarly the quality of the combination PHAL+KL drops when combined with TF-IDF
(PHAL+KL+TF-IDF). Although the feature PrepImp has not contributed much as an individual
feature, the combination of PrepImp with DFS+SL1 achieved the best results in the table.
Similar patterns are observed for both pdocs and ndocs. Although the ROUGE scores are
higher for pdocs than for ndocs, the effectiveness of the combinations is clear in both cases.
The reason for the lower ndocs scores is the lack of specific novelty detection measures, as all
the features used at this level are oriented towards generic multi document summarization.
Observing the results in tables 6.3, 6.4, 6.5 and 6.6, the maximum ROUGE-2 score increased
from 0.10230 (DFS) to 0.11041 (DFS+SL1) for pdocs and from 0.08548 (DFS) to 0.09607 (DFS+SL1)
for ndocs. Similarly the ROUGE-SU4 scores increased from 0.13927 and 0.12680 to 0.14628 and
0.13761 respectively. An approximately 8% improvement in ROUGE-2 and ROUGE-SU4 scores is
achieved for pdocs by combining features through regression. The purpose of this evaluation is
to find the best configuration of generic summarization, apply the novelty detection techniques
on top of it, and finally produce informative progressive summaries.
6.4 Evaluation of Progressive Summarization
Progressive summarization focuses on improving the summaries of ndocs given prior knowledge
in the form of pdocs. The progressive summaries are generated under the assumption that the
user has complete knowledge of the information presented in pdocs. In this section we evaluate
all the novelty detection techniques proposed in chapter 5.
We chose the combination DFS+SL1 as our baseline summarization configuration. This
configuration produced very good results for pdocs and reasonable scores for ndocs. The
combination DFS+SL1 is hereafter referred to as MultiDocSumm, shorthand for a normal multi
document summarizer. MultiDocSumm serves as a baseline to depict the effect of the proposed
novelty detection techniques.
pdocs ROUGE-2 ROUGE-SU4
DFS+SL1 0.11041 0.14628
DFS+SL2 0.10715 0.14270
DFS+KL 0.10155 0.14069
DFS+SFS 0.10494 0.14234
SFS+KL 0.09797 0.13872
SFS+SL1 0.10705 0.14497
PHAL+KL 0.10319 0.13988
PHAL+KL+DFS 0.10442 0.14145
PHAL+KL+SFS 0.09721 0.13693
PHAL+KL+PrepImp 0.10040 0.13817
PHAL+KL+TFIDF 0.09959 0.13828
DFS+SL1+PrepImp 0.11134 0.14757
DFS+SL1+KL 0.10786 0.14630
DFS+SL1+TFIDF 0.11021 0.14634
Table 6.5 ROUGE scores of pdocs for different combinations of features
Several configurations of summarizers are generated, each having one or more novelty de-
tection techniques at the scoring, ranking or summary extraction stages of summarization.
Brief descriptions of these configurations are provided below.
MultiDocSumm + Novelty Features: In this set of configurations, new scoring features
like NF, NW and HKLID are used along with the original features of MultiDocSumm to build
feature vectors.
MultiDocSumm + Re-ranking Measures: The ranked list of MultiDocSumm is reordered using
the similarity measures ITSim and CoSim. The proximity measure (ProximRank) is also used to
re-rank the original ranked list of MultiDocSumm during the sentence ranking stage. In this
set of configurations, the scoring features remain the same.
MultiDocSumm + Novelty Pool: Only sentences from the Novelty Pool (NP) are selected during
the summary extraction stage of MultiDocSumm.
ndocs ROUGE-2 ROUGE-SU4
DFS+SL1 0.09607 0.13761
DFS+SL2 0.09683 0.13675
DFS+KL 0.08954 0.13201
DFS+SFS 0.08368 0.12792
SFS+KL 0.09204 0.13429
SFS+SL1 0.09604 0.13716
PHAL+KL 0.08878 0.13019
PHAL+KL+DFS 0.08694 0.12841
PHAL+KL+SFS 0.08572 0.12587
PHAL+KL+PrepImp 0.08612 0.12696
PHAL+KL+TFIDF 0.08492 0.12582
DFS+SL1+PrepImp 0.09644 0.13867
DFS+SL1+KL 0.09464 0.13705
DFS+SL1+TFIDF 0.09616 0.13831
Table 6.6 ROUGE scores of ndocs for different combinations of features
MultiDocSumm + Novelty Features + Novelty Pool: Novelty features are used in conjunction
with the features of MultiDocSumm, and finally only sentences from the Novelty Pool are
extracted into the summary.
MultiDocSumm + Novelty Features + Re-ranking Measures: Sentences scored with novelty
features along with the original features of MultiDocSumm are re-ranked using the re-ranking
measures.
MultiDocSumm + Novelty Features + Re-ranking Measures + Novelty Pool: This configuration
combines all the proposed novelty detection techniques applied on MultiDocSumm.
Evaluation results of all these configurations in terms of ROUGE-2 and ROUGE-SU4 scores
are presented in table 6.7.
Configuration ROUGE-2 ROUGE-SU4
MultiDocSumm 0.09607 0.13761
MultiDocSumm+NF 0.09895 0.14004
MultiDocSumm+NW 0.09753 0.14045
MultiDocSumm+HKLID 0.09955 0.14023
MultiDocSumm+NF+NW 0.09885 0.14146
MultiDocSumm+NF+HKLID 0.10223 0.14266
MultiDocSumm+NW+HKLID 0.10057 0.14286
MultiDocSumm+NF+NW+HKLID 0.10102 0.14280
MultiDocSumm+ITSim 0.09461 0.13306
MultiDocSumm+CoSim 0.08338 0.12607
MultiDocSumm+ProximRank 0.09933 0.14067
MultiDocSumm+NP 0.09873 0.13977
MultiDocSumm+NF+NP 0.09875 0.14010
MultiDocSumm+NF+NP+ITSim 0.09764 0.13912
Table 6.7 ROUGE scores of different configurations with novelty detection techniques
We participated in the TAC 2009 update summarization track, considered the most reputed
summarization evaluation platform at present. Participating teams included the University of
Ottawa, Peking University, Thomson Reuters Research and EML Research, among others. A total
of 23 teams from around the world competed to produce the best update summaries for the given
test dataset. We compare our approach to the top two performing systems at TAC 2009, from the
International Computer Science Institute, Berkeley (ICSI) and Tsinghua University (THUSUM).
Below we provide brief descriptions of these two approaches.
ICSI: ICSI's approach [17] to sentence selection is based on the maximum coverage model for
summarization. The authors model a summary as the set of sentences that best covers the relevant
concepts in the document set, where concepts are simply word bigrams valued by their document
frequency. The value of a summary is the sum of the unique concept values it contains, thus
limiting redundancy implicitly. The resulting maximization problem is solved with Integer
Linear Programming (ILP). For update summarization they hypothesize that articles about
topics that have already been in the news tend to state new information first before recapping
past details. The values of concepts appearing in first sentences are upweighted according to
this inference.

System                   ROUGE-2  ROUGE-SU4  Overall Responsiveness  Avg Pyramid score
MultiDocSumm+NF+HKLID    0.10223  0.14266    4.614                   0.307
ICSI                     0.10417  0.13959    4.568                   0.290
THUSUM                   0.09608  0.13499    5.023                   0.296
Oracle Summary           0.17619  0.19877    –                       –
Model Summary            0.12436  0.16602    8.682                   0.616
Baseline                 0.05865  0.09333    3.636                   0.175

Table 6.8 Automated and manual evaluation results of TAC systems
THUSUM: The framework of THUSUM is based on the theory of conditional independence among
many objects. They propose an information distance to solve the summarization problem. A
detailed description of the system is presented in [35].
TAC 2009 also provided a baseline that returns the first 100 words of the most recent docu-
ment as the summary for a topic. The evaluation results of these systems at TAC 2009, along
with the best configuration from our experiments, are presented in table 6.8. We also present
the results of the Oracle Summaries, the best possible extractive summaries created in chapter 4,
and of one of the four human written model summaries that is considered a gold standard in the
evaluations. It is evident from the results that our progressive summarizer outperformed the
state of the art approaches in all content based evaluation metrics, including ROUGE, Pyramids
and the overall responsiveness score. Most combinations presented in table 6.7 have better
ROUGE scores than ICSI or THUSUM, showing that our novelty detection techniques are very
effective in detecting relevant novel information.
Analysis of Novelty Detection techniques:
All the configurations in table 6.7, other than the similarity based re-ranking measures,
showed significant improvement over MultiDocSumm. The best results are obtained for the
configuration MultiDocSumm+NF+HKLID, with a 6% improvement in ROUGE-2 and ROUGE-SU4
scores. The proximity based re-ranking technique enhanced the scores by approximately 3%.
The Novelty Pool technique (NP) allowed us to produce progressive summaries by selecting only
sentences with dominant novel words into the summary. The improvement in ROUGE scores is not
substantial when the novelty detection techniques at the scoring, ranking and extraction stages
are combined together: as novel sentences are already scored high through NF and HKLID, the
effect of the re-ranking and filtering techniques is not significant in the combination.
Below we provide the summaries generated by both the generic (MultiDocSumm) and progressive
(MultiDocSumm+NF+HKLID) configurations for a particular topic, "Michael Jackson's child
molestation trial", in the TAC 2009 dataset. The first cluster of documents (pdocs) contains
events about the allegations, investigations and DNA tests conducted on Michael Jackson as part
of the case. The next cluster of documents (ndocs) has articles focusing on events like the
trial, jury selection and Jackson's health issues.
Generic summary for ndocs
Jackson was rushed to the hospital after he vomited in his car as he was being driven to the
Santa Maria court, where he is on trial on charges of sexually molesting a 13-year-old boy.
Trial Judge Rodney Melville unveiled for the first time details of the charges of child molesta-
tion and conspiracy charges against pop icon Michael Jackson. Sneddon alleged that pop icon
Jackson had been in tremendous financial debt, which led him and his aides to hatch a plot to
kidnap the boy and his family and hold them against their will.
Progressive summary for ndocs
The long-awaited child molestation trial of pop superstar Michael Jackson officially got
underway Monday with the judge calling the court to order. Pop icon Michael Jackson was
Tuesday rushed to hospital suffering from the flu, his trial judge said, delaying jury selection
in his child sex trial. Well ahead of schedule, a jury was selected Wednesday for the child
molestation trial of pop star Michael Jackson. Michael Jackson health is in stable condition
but needs further care for persistent viral symptoms. Trial Judge Rodney Melville unveiled
for the first time details of the charges of child molestation
It is clear that the progressive summary informs the user more about the recent events in
the topic. While the generic summary only has information relating to Jackson's deteriorating
health, the progressive summary has information focusing on his health, jury selection and the
proceedings of the trial. It is evident that our novelty techniques are effective in finding
relevant new information for the user.
The huge gap between the oracle summaries and the best systems at TAC (in table 6.8) shows
that there is still much scope for improvement in extractive summarization. The results of the
participating teams are on par with some of the human model summaries in terms of ROUGE, but
far behind in manual evaluations like pyramid scores and overall responsiveness. Extractive
summarizers are hindered by coherence and readability issues, which affect the overall
responsiveness of the summary.
Chapter 7
Conclusions and Future directions
Text summarization is a well-studied problem spanning multiple disciplines
like cognitive science, information access and natural language processing. It has been viewed
as a decision theory problem, as a classification problem, as a (lossy/lossless) data compression
problem and as an information retrieval problem. It has been an active area of research
for the last four decades and has branched out into several areas. Progressive summarization
is a recent development in the text summarization community, much popularized after its intro-
duction at the Text Analysis Conference (TAC) in 2007. The task of progressive summarization
is to produce informative and human-readable summaries about a particular topic under the
assumption that the user has gained prior knowledge about the same topic by reading a set
of documents. The challenging part of progressive summarization is to identify information
that is both relevant and novel given the prior knowledge of the user, and then present it in the
form of a summary.
The traditional sentence ranking stage of summarization uses a weighted linear combination of
individual feature scores. As the feature space grows, it becomes more difficult to come up with
an ideal weight combination to compute the rank. In our work we use a supervised learning
algorithm, regression, to estimate sentence importance from feature vectors. This allowed
us to experiment with a wide variety of feature combinations without worrying about the optimal
weights to combine them. We used a good number of features, ranging from language model features
like PHAL and KL, to document collection statistics like DFS and SFS, and heuristic features
like SL1, SL2 and PrepImp. Experiments supported our intuition that regression estimates
sentence importance better than any single feature by itself. We carried out an extensive
analysis over all possible feature combinations and identified the most successful and stable
combination to be our baseline generic summarizer. This baseline (MultiDocSumm) is used to
depict the effect of our proposed novelty detection techniques in progressive summarization.
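The regression-based ranking described above can be sketched as follows. The thesis trains Support Vector Regression (via LIBSVM) on feature vectors; to keep this sketch dependency-free, plain least-squares linear regression stands in for SVR, and the feature values and importance targets below are invented toy numbers, not the actual PHAL/DFS/SL feature values.

```python
# Sketch of regression-based sentence ranking. Illustrative only: the
# thesis uses Support Vector Regression (LIBSVM); ordinary least squares
# stands in here, and all numbers are toy values.

def fit_linear(X, y):
    """Least-squares weights w for y ~ X.w via normal equations."""
    n = len(X[0])
    A = [[sum(x[i] * x[j] for x in X) for j in range(n)] for i in range(n)]
    b = [sum(x[i] * t for x, t in zip(X, y)) for i in range(n)]
    # Gaussian elimination with partial pivoting.
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, n):
            f = A[r][col] / A[col][col]
            for c in range(col, n):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    w = [0.0] * n
    for r in range(n - 1, -1, -1):
        w[r] = (b[r] - sum(A[r][c] * w[c] for c in range(r + 1, n))) / A[r][r]
    return w

def rank_sentences(sentences, feats, w):
    """Order sentences by the model's predicted importance, highest first."""
    scored = [(sum(wi * fi for wi, fi in zip(w, f)), s)
              for s, f in zip(sentences, feats)]
    return [s for _, s in sorted(scored, reverse=True)]

# Toy training data: each vector holds two hypothetical feature scores
# (e.g. a DF-based score and a position score); the targets play the role
# of importance labels such as ROUGE overlap with model summaries.
X_train = [[0.9, 1.0], [0.2, 0.1], [0.6, 0.5], [0.1, 0.9]]
y_train = [0.95, 0.10, 0.55, 0.40]
w = fit_linear(X_train, y_train)

sents = ["s1", "s2", "s3"]
feats = [[0.8, 0.9], [0.1, 0.2], [0.5, 0.4]]
ranking = rank_sentences(sents, feats, w)
```

The advantage claimed in the text shows up here: adding a feature only widens the vectors; the learner re-estimates the weights, so no manual weight tuning is needed.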
In this thesis, we addressed the problem of progressive summarization by devising novelty
detection techniques at various stages of extractive summarization. We treated the problem in
a unique way, by projecting the importance of having a novelty detection module in the sum-
marization framework. At the feature extraction stage, new sentence scoring
features like NF, HKLID and NW are devised to capture the novelty of a sentence along with its
relevance. Two re-ranking techniques, redundancy re-ranking and proximity re-ranking, are
also proposed in this work to reorder the list of ranked sentences, promoting novel sentences
in the ranked list. A new content-based similarity measure, Information Theoretic distance (IT-
Dist), is used along with the traditional cosine similarity measure for computing similarity between
sentences. At the summary extraction stage, a filtering strategy is adopted by
only selecting sentences from the Novelty Pool (NP) into the summary.
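The Novelty Pool filter at the summary extraction stage can be illustrated with a minimal sketch. The word-overlap novelty score and the 0.3 threshold below are assumptions of this sketch, not the thesis's exact formulation; only the overall shape — eligible sentences must clear a novelty bar against the prior documents (pdocs) — follows the description above.

```python
# Illustrative Novelty Pool (NP) filter: only sentences scoring above a
# novelty threshold against the prior-document vocabulary are eligible
# for the summary. Score and threshold are assumptions of this sketch.

def novelty_score(sentence, pdoc_vocab):
    """Fraction of a sentence's words unseen in the prior documents."""
    words = sentence.lower().split()
    return sum(1 for w in words if w not in pdoc_vocab) / len(words)

def build_novelty_pool(ranked_sentences, pdoc_vocab, threshold=0.3):
    """Keep ranked order, but drop sentences that are mostly old material."""
    return [s for s in ranked_sentences
            if novelty_score(s, pdoc_vocab) >= threshold]

pdoc_vocab = {"jackson", "trial", "charges", "molestation"}
ranked = [
    "jackson trial charges",       # every word previously seen -> dropped
    "jury selected for trial",     # mostly new -> kept
    "jackson rushed to hospital",  # mostly new -> kept
]
pool = build_novelty_pool(ranked, pdoc_vocab)
```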
By devising novelty detection techniques at various stages, we are able to combine multi-
ple novelty detection techniques, unlike most of the previous work. A detailed analysis of the
effect of detecting novelty at each of these stages is also performed, and the experimental re-
sults show that novelty is best detected at the feature extraction/scoring stage. The Novelty Pool (NP)
improved the quality of summaries by keeping probably redundant sentences out of the summary.
Proximity-based re-ranking helped us produce better progressive summaries by computing
the importance of a sentence based on the relevant novelty of its surrounding sentences. The
similarity measures used for re-ranking at the sentence ranking stage, ITSim and CoSim,
did not improve the quality of progressive summaries. Since CoSim is a word overlap mea-
sure, and novel information is often embedded within a sentence containing formerly known
information, the quality of progressive summaries declined. ITSim performs better than CoSim
because it considers the entropy of a word in similarity computations, which is a better estimate of
information.
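The contrast between CoSim and ITSim can be made concrete with a small sketch: plain cosine over word counts versus an entropy-weighted measure in the spirit of ITSim and Lin's information-theoretic similarity [34], where rare (high-information) words dominate the comparison. The corpus probabilities and the exact weighting below are illustrative assumptions, not the thesis's formula.

```python
# CoSim vs an ITSim-style measure (sketch under assumed corpus stats).
import math
from collections import Counter

def cosine(a, b):
    """Plain cosine similarity over word counts (CoSim)."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[w] * cb[w] for w in ca)
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0

def info(word, p_corpus):
    # Self-information -log p(w): rarer words carry more information.
    return -math.log(p_corpus.get(word, 1e-4))

def it_sim(a, b, p_corpus):
    """Entropy-weighted overlap in the spirit of ITSim (assumed form)."""
    shared = set(a) & set(b)
    total = (sum(info(w, p_corpus) for w in set(a)) +
             sum(info(w, p_corpus) for w in set(b)))
    return 2 * sum(info(w, p_corpus) for w in shared) / total if total else 0.0

s1 = "the trial of jackson".split()
s2 = "the verdict of melville".split()
# Toy unigram probabilities: function words frequent, content words rare.
p = {"the": 0.1, "of": 0.1, "trial": 0.01,
     "jackson": 0.001, "verdict": 0.001, "melville": 0.001}
```

Here the two sentences share only the frequent words "the" and "of": cosine rates them fairly similar (0.5), while the entropy-weighted measure discounts those low-information matches and scores the pair much lower.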
7.1 Future Directions
In this thesis we focused on progressive summarization of news topics. We tried to address
the problems that arise for a typical reader who is trying to follow a temporal topic span-
ning months, years or even longer. Although the techniques proposed here are adaptable
across other domains, it would make an interesting problem to apply progressive summarization
to online product reviews or novels/books. Progressive summaries of online product reviews
would be of great value to both customers and vendors. Similar to news topics, the vendor is
interested to know when there is a shift in the reviews of a particular product, and the user wants
to know about reviews that differ from the ones he has already read. Progressive
summarization is also applicable to summarizing chapters of novels/books.
The Novelty Factor (NF) described here is an extension of the popular Document Frequency
(DF) feature that uses the ratio of document frequencies of words in pdocs and
ndocs. In future, NF can be developed into a more sophisticated feature capturing language
models of both pdocs and ndocs. Currently we are only using the frequency of prepositions in
a sentence as its measure of importance (PrepImp), and that did not produce the effective results
we anticipated. But we strongly believe that prepositions are strong indicators of important
entities in a sentence and can be exploited in better ways in the future. Although the features
described in this work are simple, we believe that the novel treatment of the problem will
inspire a lot of new techniques at each stage of summarization.
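One plausible instantiation of NF, following the description above, is the share of a word's document frequency that comes from the new cluster (ndocs) versus the prior cluster (pdocs); the thesis's exact formula is defined in an earlier chapter, so treat this form as an assumption of the sketch.

```python
# Sketch of a Novelty Factor feature (assumed ratio form, not the
# thesis's exact formula): words frequent in ndocs but rare in pdocs
# score high; words already common in pdocs score low.

def doc_freq(word, docs):
    """Number of documents (word sets) containing the word."""
    return sum(1 for d in docs if word in d)

def novelty_factor(word, pdocs, ndocs):
    pdf = doc_freq(word, pdocs)
    ndf = doc_freq(word, ndocs)
    return ndf / (pdf + ndf) if (pdf + ndf) else 0.0

def sentence_nf(sentence, pdocs, ndocs):
    """Average NF over the sentence's words, as a sentence-level score."""
    words = sentence.split()
    return sum(novelty_factor(w, pdocs, ndocs) for w in words) / len(words)

# Toy clusters represented as sets of words per document.
pdocs = [{"jackson", "charges", "molestation"}, {"jackson", "accuser"}]
ndocs = [{"jury", "selection", "trial"}, {"jury", "verdict", "jackson"}]

old_sent = "jackson charges"  # vocabulary dominated by pdocs -> low NF
new_sent = "jury selection"   # appears only in ndocs -> high NF
```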
The similarity measures used in this work are simple content-overlap measures between two
units. As redundancy detection is a complex task, there is a need for sophisticated semantic
similarity measures that can capture the semantic relatedness between two text units. Exploiting
encyclopedic knowledge from Wikipedia or social bookmarking tags would help in computing
the semantic distance between two concepts.
The current summarization system does not produce summaries of high linguistic quality, since
no special care is taken concerning the readability of the summary. There is a lot
of scope to improve the grammatical quality and coherence of the summary through coreference
resolution, along with content quality.
The current state-of-the-art summarization systems are all extractive in nature, but the com-
munity is gradually progressing towards abstractive summarization [16]. Although complete
abstractive summarization would require deeper natural language understanding and process-
ing, a hybrid or shallow abstractive summarization can be achieved through sentence compres-
sion and textual entailment techniques. Textual entailment helps in detecting shorter versions
of a text that carry the same meaning as the original, and with it we can produce
more concise summaries. A recent development in summarization, intro-
duced at TAC 2010, is guided summarization, where the user's information need is represented
as a template of aspects instead of a query. The summary is expected to cover answers for
all the aspects, along with any other relevant information. The template of aspects may vary
depending upon the category of the topic. Guided summarization initiated the use of informa-
tion extraction techniques in summarization, which may very well lead to a shallow abstractive
summary.
Research in summarization continues to enhance diversity and information richness,
and strives to produce coherent and focused answers to the user's information need.
Related Publications
Praveen Bysani, Vijay Bharat, Vasudeva Varma. Modeling Novelty and Feature Combina-
tion Using Support Vector Regression for Update Summarization. The 7th International
Conference on Natural Language Processing (ICON 2009), India, December 2009
Praveen Bysani. Novelty Detection in the context of Progressive Summarization. At the Stu-
dent Research Workshop in the 11th annual Conference of the North American Chapter of the
Association for Computational Linguistics (NAACL-HLT 2010), Los Angeles, June 2010
Praveen Bysani, Kranthi Reddy, Vasudeva Varma, et al. IIIT Hyderabad at TAC 2009. In
Proceedings of the Text Analysis Conference (TAC 2009), Maryland, USA, November 2009
Bibliography
[1] J. Allan, R. Gupta, and V. Khandelwal. Topic models for summarizing novelty. 2001.
[2] J. Allan, C. Wade, and A. Bolivar. Retrieval and novelty detection at the sentence level.
In SIGIR ’03: Proceedings of the 26th annual international ACM SIGIR conference on
Research and development in information retrieval, pages 314–321, New York, NY, USA,
2003. ACM.
[3] A. Harnly, A. Nenkova, R. Passonneau, and O. Rambow. Automation of summary eval-
uation by the pyramid method. In Proceedings of the Conference of Recent Advances
in Natural Language Processing (RANLP), page 226, 2005.
[4] C. Aone, M. E. Okurowski, J. Gorlinsky, and B. Larsen. A trainable summarizer with
knowledge acquired from robust NLP techniques, pages 71–80. 1999.
[5] R. Barzilay and M. Elhadad. Using lexical chains for text summarization, 1997.
[6] R. Barzilay and M. Lapata. Modeling local coherence: an entity-based approach. In
ACL ’05: Proceedings of the 43rd Annual Meeting on Association for Computational
Linguistics, pages 141–148, Morristown, NJ, USA, 2005. Association for Computational
Linguistics.
[7] F. Boudin and J.-M. Torres-Moreno. A cosine maximization-minimization approach for
user-oriented multi-document update summarization. In Proceedings of Recent
Advances in Natural Language Processing (RANLP), 2007.
[8] C. Burges, T. Shaked, E. Renshaw, A. Lazier, M. Deeds, N. Hamilton, and G. Hullen-
der. Learning to rank using gradient descent. In ICML '05: Proceedings of the 22nd
international conference on Machine learning, pages 89–96, New York, NY, USA, 2005.
ACM.
[9] J. Carbonell and J. Goldstein. The use of mmr, diversity-based reranking for reordering
documents and producing summaries. In SIGIR ’98: Proceedings of the 21st annual inter-
national ACM SIGIR conference on Research and development in information retrieval,
pages 335–336, New York, NY, USA, 1998. ACM.
[10] C.-C. Chang and C.-J. Lin. LIBSVM: a library for support vector machines. 2001.
[11] J. Conroy and D. P. O'Leary. Text summarization via hidden markov models and pivoted
QR matrix decomposition. In SIGIR, 2001.
[12] J. M. Conroy. A hidden markov model for the trec novelty task. 2003.
[13] J. M. Conroy, J. Goldstein, J. D. Schlesinger, and D. P. O'Leary. Left-brain/right-brain
multi-document summarization. In Proceedings of the Document Understanding Con-
ference (DUC), 2004.
[14] H. P. Edmundson. New methods in automatic extracting. J. ACM, 16(2):264–285, 1969.
[15] D. Eichmann, Y. Zhang, S. Bradshaw, X. Y. Qiu, P. Srinivasan, and A. Kumar.
Novelty, question answering and genomics: The University of Iowa response. 2004.
[16] P.-E. Genest and G. Lapalme. Text generation for abstractive summarization. 2010.
[17] D. Gillick, B. Favre, D. Hakkani-Tur, B. Bohnet, Y. Liu, and S. Xie. The icsi/utd summa-
rization system at tac 2009. 2009.
[18] U. Hahn and I. Mani. The challenges of automatic summarization. Computer, 33(11):29–
36, 2000.
[19] C. Huang, D.-D. Liu, and J.-S. Wang. Forecast daily indices of solar activity, using support
vector regression method. In Research in Astronomy and Astrophysics, vol. 9. RAA, 2009.
[20] J. Jagarlamudi, P. Pingali, and V. Varma. A relevance-based language modeling approach
to DUC 2005. In Document Understanding Conference, 2005.
[21] H. Jing, R. Barzilay, K. McKeown, and M. Elhadad. Summarization evaluation methods:
Experiments and analysis. In AAAI Symposium on Intelligent Summarization, pages
60–68, 1998.
[22] K. S. Jones. Automatic summarising: Factors and directions. In Advances in Automatic
Text Summarization, pages 1–12. MIT Press, 1998.
[23] I. Kastner and C. Monz. Automatic single-document key fact extraction from newswire
articles. In Proceedings of the 12th Conference of the European Chapter of the ACL
(EACL 2009), pages 415–423, Athens, Greece, March 2009. Association for Computa-
tional Linguistics.
[24] R. Katragadda, P. Pingali, and V. Varma. Sentence position revisited: a robust light-weight
update summarization ’baseline’ algorithm. In CLIAWS3 ’09: Proceedings of the Third
International Workshop on Cross Lingual Information Access, pages 46–52, Morristown,
NJ, USA, 2009. Association for Computational Linguistics.
[25] S. Kullback and R. A. Leibler. On information and sufficiency. The Annals of Mathemat-
ical Statistics, pages 79–86, 1951.
[26] J. Kupiec, J. Pedersen, and F. Chen. A trainable document summarizer. In Proceedings
of ACM SIGIR '95, pages 68–73. ACM, 1995.
[27] R. R. Larson. A logistic regression approach to distributed ir. In SIGIR ’02: Proceedings
of the 25th annual international ACM SIGIR conference on Research and development in
information retrieval, pages 399–400, New York, NY, USA, 2002. ACM.
[28] D. J. Lawrie. Language models for hierarchical summarization, 2003.
[29] S. Li, Y. Ouyang, W. Wang, and B. Sun. Multi-document summarization using support
vector regression. In DUC 2007 notebook, 2007. Document Understanding Conference,
November 2007.
[30] X. Li and W. B. Croft. Novelty detection based on sentence level patterns. In CIKM ’05:
Proceedings of the 14th ACM international conference on Information and knowledge
management, pages 744–751, New York, NY, USA, 2005. ACM.
[31] C.-Y. Lin. Looking for a few good metrics: Automatic summarization evaluation
- how many samples are enough? In Proceedings of the NTCIR Workshop 4, June 2004.
[32] C.-Y. Lin and E. Hovy. Identifying topics by position, 1997.
[33] C.-Y. Lin. Rouge: A package for automatic evaluation of summaries. pages 74–81,
Barcelona, Spain, July 2004. Association for Computational Linguistics.
[34] D. Lin. An information-theoretic definition of similarity. In ICML ’98: Proceedings of the
Fifteenth International Conference on Machine Learning, pages 296–304, San Francisco,
CA, USA, 1998. Morgan Kaufmann Publishers Inc.
[35] C. Long, M. Huang, and X. Zhu. Tsinghua university at tac 2009: Summarizing multi-
documents by information distance. 2009.
[36] H. P. Luhn. The automatic creation of literature abstracts. IBM J. Res. Dev., 2(2):159–165,
1958.
[37] I. Mani. Multi-document summarization by graph search and matching. In In Proceedings
of the Fifteenth National Conference on Artificial Intelligence (AAAI-97), pages 622–628.
AAAI, 1997.
[38] D. Metzler and T. Kanungo. Machine learned sentence selection strategies for query-
biased summarization. sigir learning to rank workshop, 2008.
[39] G. A. Miller. Wordnet: a lexical database for english. Commun. ACM, 38(11):39–41,
1995.
[40] E. Pitler and A. Nenkova. Revisiting readability: A unified framework for predicting text
quality.
[41] D. Radev, T. Allison, S. Blair-Goldensohn, J. Blitzer, A. Çelebi, S. Dimitrov, E. Drabek,
A. Hakim, W. Lam, D. Liu, J. Otterbacher, H. Qi, H. Saggion, S. Teufel, A. Winkel, and
Z. Zhang. MEAD - a platform for multidocument multilingual text summarization. In
LREC 2004, 2004.
[42] D. Radev, H. Jing, and M. Budzikowska. Centroid-based summarization of multi-
ple documents: Sentence extraction, utility-based evaluation, and user studies. In
ANLP/NAACL Workshop on Summarization, pages 21–29, 2000.
[43] G. Salton and C. Buckley. Term weighting approaches in automatic text retrieval. Tech-
nical report, Ithaca, NY, USA, 1987.
[44] B. Schiffman and K. R. McKeown. Context and learning in novelty detection. In HLT ’05:
Proceedings of the conference on Human Language Technology and Empirical Methods
in Natural Language Processing, pages 716–723, Morristown, NJ, USA, 2005. Associa-
tion for Computational Linguistics.
[45] F. Schilder and R. Kondadandi. Fastsum: fast and accurate query-based multi-document
summarization. In Proceedings of the 46th Annual Meeting of the Association for Com-
putational Linguistics on Human Language Technologies. Human Language Technology
Conference, 2008.
[46] I. Soboroff and D. Harman. Novelty detection: The TREC experience. In HLT/EMNLP,
pages 105–112, 2005.
[47] K. M. Svore. Enhancing single-document summarization by combining ranknet and third-
party sources, 2007.
[48] S. Teufel and M. Moens. Summarizing scientific articles: experiments with relevance and
rhetorical status. Comput. Linguist., 28(4):409–445, 2002.
[49] M.-F. Tsai, M.-H. Hsu, and H.-H. Chen. Similarity computation in novelty detection.
2004.
[50] P. Venkataraman, S. Dulluri, and N. R. S. Raghavan. Short-term forecasting of nifty index
using support vector regression. In ICFAI Journal of Applied Finance, January 2006.
[51] M. J. Witbrock and V. O. Mittal. Ultra-summarization (poster abstract): a statistical
approach to generating highly condensed non-extractive summaries. In SIGIR ’99: Pro-
ceedings of the 22nd annual international ACM SIGIR conference on Research and de-
velopment in information retrieval, pages 315–316, New York, NY, USA, 1999. ACM.
[52] C. Zhai and J. Lafferty. A study of smoothing methods for language models applied to
information retrieval. ACM Trans. Inf. Syst., 22(2):179–214, 2004.
[53] C. Zhai and J. Lafferty. Two-stage language models for information retrieval. In Pro-
ceedings of the 25th Annual International ACM SIGIR Conference on Research and De-
velopment in Information Retrieval, Tampere, Finland, August 11–15, 2002.
[54] J. Zhang, Y. Yang, and J. Carbonell. New event detection with nearest neighbour, support
vector machines, and kernel regression. March 2003.