
Content Modeling for Automatic Document Summarization

submitted by
Magister Artium Leonhard Hennig

Dissertation approved by Faculty IV – Electrical Engineering and Computer Science – of the Technische Universität Berlin for the award of the academic degree of Doktor der Ingenieurwissenschaften (Dr.-Ing.)

Doctoral committee:
Chair: Prof. Dr. Kao
Reviewer: Prof. Dr. Albayrak
Reviewer: Prof. Dr. Stede

Date of the scientific defense: 22 November 2011

Berlin 2011

D 83


Abstract

Current search engines filter the vast amounts of information available on the Internet by retrieving a potentially large set of documents in response to a user’s query. However, the burden of finding the searched-for information within these documents stays with the user. Computational methods that progress beyond today’s document-centric information retrieval solutions are therefore essential to help users cope with the sheer amount of relevant documents and the information they contain. Automatic text summarization is such a technology, as summaries present a concise gist of much larger sources while filtering out irrelevant and redundant content. In addition, summaries can satisfy complex information needs in a personalized manner. Summarization can thus be a powerful tool to reduce the amount of information users have to process.

This dissertation develops novel algorithms for the personalized summarization of collections of thematically related news articles. Of particular interest in this scenario is the identification of the various subtopics centered around the collection’s main theme, which helps to determine important source content and reduce redundancies. However, the ambiguity of natural language and the sparsity of sentence vocabularies present problems that go beyond the capabilities of common modeling techniques. The algorithms introduced in this dissertation are especially tailored to reduce the effects of lexical variability and sparsity in order to derive more precise and robust summarization models. Exhaustive tests for different settings and various datasets show that the developed solutions produce summaries of higher quality than the current state of the art.

News articles reporting on the same event are similar not only in terms of the subtopics they address, but often also relate similar facts. Fact identification is a highly desirable, if as yet unsolved, subtask of summarization, since an automatic assessment of the semantic similarity of phrasal text spans is currently not feasible with the required precision. The latter part of this thesis is dedicated to an extensive analysis of semantic, fact-like text units in news articles and human reference summaries, and proposes a novel algorithm for the detection of text units that approximate human-annotated facts.


Zusammenfassung

Conventional search engines filter the vast amounts of data available on the Internet by mapping a user’s query to a potentially large set of documents. Finding the searched-for information within these documents, however, remains the user’s task. It is therefore essential to develop computational methods that go beyond conventional, document-centric information retrieval solutions and that support users in processing large document collections and the information they contain. Automatic text summarization is such a technology, since summaries concisely compile the key points of much larger source texts while filtering out irrelevant and redundant information. Text summarization systems can thus be a powerful tool for reducing the amount of data a user has to process.

This thesis designs and evaluates algorithms for the personalized summarization of collections of thematically related news articles. Of special interest is the identification of the subtopics that structure such a collection’s overarching main theme, since this facilitates the determination of essential content and the detection of redundancies. Existing modeling techniques do not sufficiently account for the ambiguity of natural language and the limited size of sentence vocabularies. The algorithms developed in this thesis, by contrast, exploit word context information for topic detection, and benefit from the resulting additional information when creating personalized summaries. Extensive tests in different scenarios and on different datasets show that the developed solutions deliver summaries of higher quality than existing approaches.

News articles reporting on a particular event are not only similar with respect to their subtopics, but often also contain the same facts. Detecting similar facts is a desirable, but currently unsolved subtask of summarization systems, since the semantic similarity of subsentential text spans cannot be assessed with the required precision. A further focus of this thesis is therefore an extensive analysis of fact-like subsentential text spans in news articles and reference summaries, as well as the development of an algorithm for detecting similar subsentential units.


Acknowledgments

This dissertation would not have been possible without the constant encouragement and support of many inspiring people. Above all, I would like to thank my advisors Professor Dr.-Ing. Sahin Albayrak and Professor Dr. Manfred Stede for their valuable guidance and helpful advice during the writing of this thesis, and for giving me the opportunity to pursue my research interests.

Furthermore, I have been fortunate to work with an amazing group of researchers during my time at the DAI-Labor of the Technische Universität Berlin. I would especially like to thank Dr. Ernesto William De Luca, head of the IRML group, for his enduring support and many valuable discussions. I am also deeply thankful to Andreas Lommatzsch, Sascha Narr, Danuta Ploch, Till Plumbaum, Alan Said, Thomas Strecker, Winfried Umbrath, and Robert Wetzker. Much of what we have achieved was only possible as a joint effort, as the various fruitful in-group collaborations show. I would like to thank all members of the IRML group for the many inspiring discussions and constant encouragement, and the entire team of the DAI-Labor for their support during the last years.

Finally, I would like to express my deep gratitude to the many people who proofread this dissertation: Karsten Bsufka, Ahmet Camtepe, Ernesto William De Luca, Susanne Gebhard, Martin Hecker, Benjamin Kille, Barbara Kuntze, Andreas Lommatzsch, Sascha Narr, Michael Meder, Danuta Ploch, Till Plumbaum, Alan Said, Stephan Spiegel and Thomas Strecker. Their valuable comments, questions, and ideas have helped to improve the quality and coherence of this dissertation significantly.


Publications

Many of the materials and concepts in this thesis have appeared in previous publications by the author.

• Leonhard Hennig, Ernesto William De Luca, Sahin Albayrak, Learning Summary Content Units with Topic Modeling, In: 23rd Int. Conf. on Computational Linguistics (COLING), 2010

• Leonhard Hennig, Thomas Strecker, Sascha Narr, Ernesto William De Luca, Sahin Albayrak, Identifying Sentence-Level Semantic Content Units with Topic Models, In: 21st Int. Conf. on Database and Expert Systems Applications (DEXA), 7th International Workshop on Text-based Information Retrieval, 2010

• Leonhard Hennig, Sahin Albayrak, Personalized Multi-Document Summarization using N-Gram Topic Model Fusion, In: 7th Int. Conf. on Language Resources and Evaluation (LREC), 1st Workshop on Semantic Personalized Information Management, 2010

• Leonhard Hennig, Topic-based Multi-Document Summarization with Probabilistic Latent Semantic Analysis, In: Int. Conf. on Recent Advances in Natural Language Processing (RANLP), 2009

• Leonhard Hennig, Robert Wetzker, Winfried Umbrath, An Ontology-based Approach to Text Summarization, In: Int. Conf. on Web Intelligence and Intelligent Agent Technology (WI-IAT), Workshop on Natural Language Processing and Ontology Engineering (NLPOE), 2008

• Leonhard Hennig, Thomas Strecker, Tailoring text for automatic layouting of newspaper pages, In: 19th Int. Conf. on Pattern Recognition (ICPR), 2008

• Thomas Strecker, Leonhard Hennig, Automatic Layouting of Personalized Newspaper Pages, In: Int. Conf. on Operations Research (OR), 2008


Contents

Introduction
  Motivation
  Thesis contributions and structure
  Related publications
  Out of scope

I  Automatic Text Summarization

1  An introduction to automatic text summarization
   1.1  Human professional summarization
   1.2  Automatic text summarization
   1.3  Summarization concepts
   1.4  Summarizer evaluation
   1.5  Summarization & IR
   1.6  Conclusion

2  Related work
   2.1  Classical approaches
   2.2  Extractive summarization
   2.3  Multi-document summarization
   2.4  Latent factor models
   2.5  Content models
   2.6  Subsentential content units
   2.7  Conclusion

II  Content Modeling for Multi-Document Summarization

3  Modeling subtopics with hierarchical ontologies
   3.1  An ontology of topics
   3.2  Summarizing with ontology features
   3.3  Experiments: Generic multi-document summarization
   3.4  Conclusion

4  A probabilistic approach to content modeling
   4.1  Probabilistic Latent Semantic Analysis
   4.2  Content modeling for summarization
   4.3  Experiments: Query-focused multi-document summarization
   4.4  Conclusion

5  Content modeling beyond bag-of-words
   5.1  Combining topic and language models
   5.2  Summarizing with a hybrid content model
   5.3  Experiments
   5.4  Conclusion

III  Subsentential Content Units

6  Subsentential content units in news articles
   6.1  Dataset
   6.2  Subsentential content units
   6.3  Sentence-level topic models
   6.4  Experiments
   6.5  Conclusion

7  Content units in human-written reference summaries
   7.1  Summary content units
   7.2  Topic modeling in human reference summaries
   7.3  Experiments
   7.4  Conclusion

Conclusions
  Summary of contributions
  Future research

A  Example summaries
B  Notation

List of Figures

1.1  An example news article headline and teaser
1.2  Example abstractive and extractive summaries for a news article
1.3  Topic statement for query-focused summarization
1.4  Illustrative schema of extractive single- and multi-document summarization
1.5  Interface of the Columbia Newsblaster summarization service

2.1  Example sentence graph
2.2  Graphical model representation of PLSA
2.3  Graphical model representation of LDA
2.4  Example subtopics in a news article

3.1  Illustration of a hierarchical topic tree
3.2  Algorithm for mapping sentences to ontology topics
3.3  News article with example ontology topics

4.1  DUC 2006: Rouge-2 recall curves as a function of latent topics
4.2  DUC 2007: Rouge-2 recall curves as a function of latent topics
4.3  DUC 2007: Rouge recall curves as a function of latent topics

5.1  Graphical model representation of extended PLSA
5.2  DUC 2007: Rouge-2 recall for different parameter settings
5.3  Rouge recall curves for hybrid PLSA summarizers

6.1  Example documents reporting similar facts
6.2  Example content units
6.3  Distribution of annotated gold-standard content units by type
6.4  Pairwise similarities of content units and latent topics
6.5  F1 scores on the task of content unit discovery
6.6  Precision and recall of content units
6.7  Precision and recall for different types of content units

7.1  Pairwise similarities of SCUs and latent topics
7.2  Precision, recall and fraction of Topic-SCU matches for different settings of γ
7.3  F1 and MAP for different values of T as a fraction δ of the number of SCUs
7.4  Performance of model for settings of α and β
7.5  Recall, precision and fraction of Topic-SCU matches by weight

List of Tables

1    List of author’s related publications

1.1  Types of summaries
1.2  Main challenges of automatic text summarization
1.3  Summary content units used in the Pyramid evaluation
1.4  Statistics for summarization datasets

2.1  An illustration of latent topics

3.1  Overview of baseline sentence features
3.2  Overview of ontology-derived sentence features
3.3  Precision, recall and F1 values for SVM sentence classification
3.4  Rouge scores for 200-word summaries
3.5  Rouge scores for 400-word summaries

4.1  Rouge recall scores for various summarizers on DUC 2006
4.2  Rouge recall scores for various summarizers on DUC 2007
4.3  Terms for DUC topic D0743J
4.4  Sentences for DUC topic D0743J

5.1  Rouge recall scores for various summarizers on DUC 2007

6.1  Annotated document pairs used for content unit analysis
6.2  Example matches of latent topics and content units

7.1  Example SCUs from topic D0742 of DUC 2007
7.2  Top terms of matching latent topics and SCUs

Introduction

Motivation

In today’s information society, we are faced with an ever-increasing amount of data, as the Internet and other media offer the interested user numerous opportunities to access information. Search engines such as Google¹, Bing² and Yahoo!³ allow users to filter this mass of information by mapping a list of user-specified keywords (a query) to a set of relevant documents (i.e. web sites, text and multimedia files). However, a typical search engine query returns references to thousands of documents, and it is up to the user to follow these links and to scan each document for the answers to her information need. Furthermore, the searched-for information is typically distributed across a number of different documents, and many documents contain redundant and irrelevant information. These downsides of today’s standard Information Retrieval (IR) paradigm are amplified if a user’s information need is more complex than just a keyword-based search for a set of facts. For example, users who want to follow the development of a news story as it evolves over time are interested in the story’s main points and in updates about new events. News aggregation portals like Google News⁴ and Yahoo! News⁵ overwhelm users with the sheer amount of news sites reporting about the same event, and do not provide comprehensive overviews of the main developments of a news story.

Summaries condense larger amounts of source information, while aiming to reflect and consolidate the source’s main contents [Man01]. In our daily life, summaries, such as newspaper headlines, scientific paper abstracts, movie trailers, tables and diagrams, or book reviews, are omnipresent. They help us to digest and organize large amounts of information, and to make effective decisions in less time [HM00]. Summaries exist in many forms and for different types of source information, such as text, multimedia contents, or product databases. Besides condensing information and eliminating redundancies, they can be used to aggregate and combine information from different source documents, highlight similarities and differences, and deliver highly concentrated, topic-focused digests of huge amounts of source material [Man01]. These advantages are complemented by the fact that summaries can be personalized according to a user’s information need, and thus can act as a filter for irrelevant source information. All these potential benefits of summaries have stimulated much interest in the task of automatic summarization. Furthermore, the rapid growth of electronically accessible information makes automatic summarization a very desirable technology for next-generation IR solutions.

¹ http://www.google.com
² http://www.bing.com
³ http://www.yahoo.com
⁴ http://news.google.com
⁵ http://news.yahoo.com

By condensing and aggregating information from a potentially large number of source documents, summarization can be considered a powerful technology to channel and focus the vast amounts of information users are faced with every day. From an information retrieval perspective, summarization becomes especially appealing in scenarios where users have to process a large number of documents related to a specific topic or query, such as search engine result lists, news article clusters, or user-generated content such as comments, blogs, and product reviews. This task is addressed by multi-document summarization (MDS) systems, which have become the predominant focus of recent research in automatic text summarization. The input to such systems is a set of thematically related documents, and optionally a query which specifies a user’s information need. In order to provide comprehensive and concise summaries, MDS systems have to identify similar and differing information, which then allows the determination of important content, the elimination of redundancies, and the adequate coverage of different aspects of the information contained in the source documents.

Of particular interest to the research community has been the summarization of clusters of news articles related to a common topic [Jon07]. One recurrent observation when summarizing such clusters is that the articles are structured into subtopics centered around the collection’s main theme. For example, news stories about an earthquake will generally contain information about the earthquake’s strength and location, reports of casualties, rescue efforts, aftershocks, and international help. These subtopical elements reappear in similar form in different news articles, and with different emphasis as the news story develops. The identification and utilization of such domain-specific information structures helps human summarizers to determine important content [EN98], and can serve as a broad approximation to the definition of similar and differing information. In automatic summarization, the term content models has been introduced to denote models which aim to represent such subtopical information [BL04]. Summarization methods that create content models of text, and which integrate subtopic information into a model of text passage importance and redundancy, are still rare and leave much room for improvement.

News articles reporting on the same event are similar not only in terms of the subtopics they address, but often also relate the same facts. In the field of multi-document summarization, text spans that express the same semantic content are called content units [NP04]. Content units play a major role in several summarization evaluation schemes, as the comparison of meaningful units of text, larger than words or phrases, but smaller than sentences, provides a useful granularity for determining content similarity, importance and redundancy. Content units are defined by their meaning, and are thus independent of the actual choice of words and phrases used to express them. This property, coupled with the sparsity of word information available in short text spans, makes it difficult to automatically discover similar content units in source texts as well as human reference summaries [HLZ05]. However, an automatic identification of content units could immensely lower the costs of summarizer evaluation by significantly reducing the required manual effort, as well as help to create better summaries.

Thesis contributions and structure

Part I of this thesis is dedicated to an introduction to automatic summarization, types of summaries, and an overview of summarization tasks related to different application scenarios. It describes the main challenges of automatic summarization, and introduces different summary evaluation schemes (Chapter 1). We then present an exhaustive discussion of previous work related to multi-document summarization, content models, and content unit discovery in order to emphasize the novel contributions of this thesis (Chapter 2).

Following the discussion of the thesis’ fundamentals, Part II develops novel summarization methods that especially focus on content models of text. The discovery and integration of subtopic structures of real-world news story collections helps to design an effective summarization system, and overcomes problems related to the sparsity and variability of word usage patterns in human-written news articles. The main contributions presented in this part of the thesis are the introduction of three novel summarization approaches which incorporate subtopic-focused content models of text, and which advance the state of the art in the most relevant summarization tasks:


Topic taxonomy The taxonomy-based approach presented in this work addresses the task of generic multi-document summarization. We describe a supervised summarization algorithm that makes use of novel features derived from mapping sentences to nodes of a hierarchical topic ontology. The ontology is built from the hierarchically structured topics of the Open Directory Project (ODP)⁶ category tree, and its topic nodes are augmented with lexical knowledge acquired by harvesting millions of topic-related words using search engine queries. The topics of the ontology provide a wide coverage of different domains, and are well-suited to the purpose of subtopic modeling for news article summarization (Chapter 3).

⁶ http://www.dmoz.org
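As a rough illustration of the mapping step (not the thesis’ actual algorithm, which is given in Figure 3.2), a sentence can be assigned to the ontology node whose lexicon best overlaps its terms; the two topic nodes and their word weights below are invented for the example:

```python
# Hypothetical sketch: assign a sentence to the best-matching topic node of a
# lexicalized ontology via weighted term overlap. The toy topics and weights
# stand in for the ODP-derived resource described in Chapter 3.
TOPIC_WORDS = {
    "Science/Earth_Sciences": {"earthquake": 0.9, "magnitude": 0.8, "aftershock": 0.6},
    "Society/Issues/Emergency": {"rescue": 0.9, "casualties": 0.8, "aid": 0.5},
}

def map_sentence_to_topic(sentence: str) -> tuple[str, float]:
    """Return the ontology topic whose lexicon best overlaps the sentence."""
    terms = set(sentence.lower().split())
    def overlap(topic: str) -> float:
        return sum(w for word, w in TOPIC_WORDS[topic].items() if word in terms)
    best = max(TOPIC_WORDS, key=overlap)
    return best, overlap(best)

print(map_sentence_to_topic("rescue teams reported casualties after the earthquake"))
# -> ('Society/Issues/Emergency', 1.7)
```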

PLSA Going one step further, we present a content model of news stories that is based on the well-known Probabilistic Latent Semantic Analysis (PLSA) algorithm. The utilization of PLSA avoids the manual construction of the topic ontology, and instead captures the observed content structure of the documents to be summarized. It is motivated by the observation that the sparsity and variability of word usage patterns in human-written news articles impede the identification of similar text passages when relying on a simple word vector space representation. Our model accounts for lexical variability by inferring the semantics of words based on co-occurrence information and word-distributional context, and thus does not require manually constructed lexico-semantic knowledge resources. In addition, it exploits recurrent word patterns to derive the subtopic structure of multiple texts in a domain- and language-independent, unsupervised fashion (Chapter 4).
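For concreteness, the standard PLSA EM updates over a (sentences × terms) count matrix can be sketched as follows; treating sentences as PLSA “documents”, the random initialization, and the fixed iteration count are simplifying assumptions of this sketch, not the tuned model of Chapter 4:

```python
import numpy as np

def plsa(counts: np.ndarray, k: int, iters: int = 50, seed: int = 0):
    """Fit PLSA by EM on a (sentences x terms) count matrix.

    Returns P(z|s) with shape (k, sentences) and P(w|z) with shape (terms, k).
    """
    rng = np.random.default_rng(seed)
    n_s, n_w = counts.shape
    p_z_s = rng.random((k, n_s)); p_z_s /= p_z_s.sum(axis=0)
    p_w_z = rng.random((n_w, k)); p_w_z /= p_w_z.sum(axis=0)
    for _ in range(iters):
        acc_w = np.zeros_like(p_w_z)   # accumulates n(s,w) * P(z|s,w) over s
        acc_s = np.zeros_like(p_z_s)   # accumulates n(s,w) * P(z|s,w) over w
        for s in range(n_s):
            for w in counts[s].nonzero()[0]:
                # E-step: responsibilities P(z|s,w), proportional to P(w|z) P(z|s)
                post = p_w_z[w] * p_z_s[:, s]
                post /= post.sum()
                acc_w[w] += counts[s, w] * post
                acc_s[:, s] += counts[s, w] * post
        # M-step: renormalize the expected counts
        p_w_z = acc_w / (acc_w.sum(axis=0) + 1e-12)
        p_z_s = acc_s / (acc_s.sum(axis=0) + 1e-12)
    return p_z_s, p_w_z
```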

Beyond “bag-of-words” We extend the PLSA model to investigate how probabilistic topic models of text can be merged with language models in order to relax the “bag-of-words” assumption made by standard topic models. Our novel approach to query-focused multi-document summarization combines term and bigram co-occurrence observations into a single probabilistic latent topic model. The proposed method conditions bigram observations on the same latent topic variable as term observations, and thus couples long-range word correlations with short-range word associations that are due to word ordering. Evaluation results show that the integration of a bigram language model into a standard topic model leads to a system that produces summaries of a higher quality than systems which are based on term or bigram co-occurrence observations only. Furthermore, it requires a much smaller number of latent topics for optimal summarization performance than a PLSA summarizer that is based solely on term co-occurrence observations (Chapter 5).
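The exact model is derived in Chapter 5; purely as an illustration of the idea of tying both observation types to one latent variable, the log-likelihood of such a hybrid model could take a form like the following, where the interpolation weight λ and the precise conditioning of bigrams are assumptions of this sketch rather than the thesis’ formulation:

```latex
\mathcal{L} \;=\; \sum_{d}\sum_{w} n(d,w)\,\log\!\sum_{z} P(w \mid z)\,P(z \mid d)
\;+\; \lambda \sum_{d}\sum_{(w_{i-1},\,w_i)} n(d, w_{i-1}w_i)\,\log\!\sum_{z} P(w_i \mid w_{i-1}, z)\,P(z \mid d)
```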

The final part of this thesis is dedicated to an analysis of semantic content units in news articles and human reference summaries. Since an automatic identification of such meaning-oriented, fact-like text spans is highly desirable both for automatic summarization and for summarizer evaluation, the contributions of Part III of this dissertation include the presentation of a novel algorithm for content unit discovery:

Subsentential content units We begin our investigations with an analysis of the nature of fact reporting in closely related news articles. Our observations suggest that content units reoccur in related news articles, and are often expressed with similar, but not necessarily identical word patterns. We present a categorization of different types of content units based on their lexical, semantic and structural properties. We then describe a novel, unsupervised approach to the discovery of (sub-)sentential, meaningful word patterns. Our approach addresses lexical variability on the basis of a co-occurrence model, and groups together observations with similar meaning. A comparative study of the similarity of identified word patterns and manually annotated content units suggests that many of the automatically discovered patterns closely resemble their manually created counterparts (Chapter 6).
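The model itself is developed in Chapter 6; as a rough, hypothetical stand-in for the idea, one can factorize a sentence-term matrix and read off, for each sentence, the terms that load on its dominant latent topic as a candidate word pattern (NMF, the tf-idf weighting, and the 0.1 loading threshold are illustrative choices, not the thesis’ model):

```python
# Hypothetical illustration: derive per-sentence word patterns from a latent
# factorization of the sentence-term matrix.
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

sentences = [
    "the earthquake killed at least 2000 people in western india",
    "at least 2000 died when the quake struck western india",
    "rescue workers searched the rubble for survivors",
]
vec = TfidfVectorizer()
X = vec.fit_transform(sentences)       # sentences x terms
nmf = NMF(n_components=2, random_state=0)
W = nmf.fit_transform(X)               # sentence-topic loadings
H = nmf.components_                    # topic-term loadings
terms = vec.get_feature_names_out()

for i, sent in enumerate(sentences):
    z = W[i].argmax()                  # dominant latent topic of sentence i
    words = set(sent.split())
    # candidate content-unit pattern: sentence words loading on topic z
    pattern = [t for j, t in enumerate(terms) if H[z, j] > 0.1 and t in words]
    print(f"sentence {i}: topic {z}, pattern {pattern}")
```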

Summary content units Evaluating machine-generated summaries on the basis of whether they express the same information (or facts) as a set of reference summaries is an important aspect of summarizer evaluation, since it addresses the problem of human variability in content expression. The Pyramid evaluation scheme compares machine-generated summaries to reference summaries by counting the number of shared content units. In this work, we extend our study of content units to an analysis of human summary writing and Summary Content Units (SCUs). Our intention is to find out how such an analysis can enrich our understanding of human summaries, and to learn if human summary authors use similar word patterns to express the same ideas when summarizing the same set of source documents. Our experimental results show that our model can identify with high accuracy word patterns that are good approximations of SCUs, and reveals some of the structure of human-written reference summaries (Chapter 7).
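For orientation, a simplified form of the Pyramid score can be computed as below: SCUs are weighted by the number of reference summaries expressing them, and a peer summary is scored against the best achievable weight for the same number of units (the data values are invented):

```python
def pyramid_score(matched_weights, pyramid_weights):
    """Simplified Pyramid score: total weight of the SCUs a peer summary
    expresses, divided by the maximum weight attainable with as many SCUs."""
    n = len(matched_weights)
    best = sum(sorted(pyramid_weights, reverse=True)[:n])
    return sum(matched_weights) / best if best else 0.0

# A pyramid built from 4 references: SCU weights 4, 4, 3, 2, 1, 1.
# The peer summary expresses three SCUs, with weights 4, 2 and 1.
print(pyramid_score([4, 2, 1], [4, 4, 3, 2, 1, 1]))  # 7/11 ≈ 0.64
```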


Application scenarios

There are many potential application scenarios for summarization in an Information Retrieval setting, as will be described in Chapter 1. This dissertation focuses on the summarization of text documents, and in particular on the summarization of clusters of news articles related to a common topic. Within this setting, the main tasks addressed are the identification of summary-worthy, important information, the elimination of redundancies, and the diversification of summary content. Furthermore, this dissertation will consider the task of query-focused multi-document summarization, in which a complex query describing a user’s information need represents a context for constructing a personalized summary. A short description of the two main summarization tasks considered in this dissertation is given below:

Generic multi-document summarization Given a set of news articles related to a common topic, e.g. a single event or distinct events of the same type (“earthquakes”), the task of the summarizer is to create a coherent summary that is presented as a fluent natural language text. The challenge in generic multi-document summarization is to recognize similar content, to identify the topic’s main points and to create a coherent output summary. Such summaries can serve the purposes of giving an overview of the major developments of a single news story, or of relating the most important aspects of a specific type of event.

Query-focused multi-document summarization The query-focused summarization scenario extends the generic summarization scenario by additionally considering a user who wants to satisfy a complex information need, i.e. a question that cannot be answered by simply stating a name, date, quantity, etc. The challenge in this setting is not only to identify content relevant to the information need, but also to ensure the coverage of different aspects (or subquestions) formulated in the information need. Query-focused multi-document summarization is the best-researched scenario in the context of automatic text summarization. Including this scenario therefore allows us to evaluate the presented approaches with respect to previously proposed summarizers.

Related publications

Several contributions of this dissertation have been previously published and presented at conferences and workshops. Table 1 lists these contributions and the chapters in which they appear.

Chapter 3: [HUW08]
Chapter 4: [HS08], [Hen09]
Chapter 5: [Hen09], [HA10]
Chapter 6: [HSN+10]
Chapter 7: [HSN+10], [HDLA10]

Table 1: Related publications to which the author contributed and their appearance in this dissertation.

Out of scope

There are many aspects and research topics related to automatic document summarization that, despite their importance, cannot be investigated in this dissertation:

Single-document summarization Many scenarios require the summarization of a single document instead of a set of input documents. Single-document summarization can be viewed as a simplified version of the standard multi-document summarization task [Man01], and is therefore addressed by the summarizers developed in this dissertation.

Keyphrase summaries Some summarization scenarios focus on the construction of headline summaries [BMW00], key-phrase extraction [Zha02], or on the generation of text snippets for search engine result lists. As interesting as these scenarios may be, they are not within the scope of this work, which aims for the generation of fluent, multi-sentence texts.

Multimedia summarization Summaries can be constructed from different input media types, such as text, audio, pictures and movies. Multimedia summarization is a subfield of automatic summarization that deals with input and output consisting of media types other than text [MM99, RKEA00, Fur05, MA08]. Throughout this work, we will only consider text as the sole form of input and output.

Linguistic and external knowledge This dissertation mainly studies approaches that are based on probabilistic co-occurrence models. Information stemming from the linguistic processing of texts, sentences and words, such as part-of-speech information, sentence parse trees [BM05, LMFG05], or discourse relations [Mar97b], is not considered. Furthermore, we disregard external knowledge sources, such as WordNet⁷ [Fel98], for the enrichment of our content models. The solutions presented in this dissertation can be thought of as core building blocks for successful summarization models, which can be augmented with additional external knowledge (see e.g. [BE97, MB99, FH04, WLZD08]).

Update summaries Recent summarization competitions have included the task of update summarization. Update summarization requires the summarizer to present a personalized summary under the assumption that the user is already familiar with previous developments of a news story [DO08]. Many summarization systems apply standard redundancy elimination algorithms to avoid introducing “known” content, and as such, the solutions to this task are similar to those for creating a diversified generic or query-focused summary.

Other genres Research in automatic text summarization focuses mainly on the summarization of newswire material. Reasons for this include that large news article collections were easily available from other IR evaluations [Voo03], do not require technical domain knowledge, and are of interest to many potential system users [Jon07]. Summarization competitions such as the Document Understanding Conference (DUC)⁸ and the Text Analysis Conference (TAC)⁹ have created document-summary corpora mainly of newswire material. Due to the costliness of creating reference summaries for the purpose of summarizer evaluation, and for a comparative evaluation with previously presented work, the solutions presented in this dissertation are all evaluated on these standard datasets.

Natural language generation This dissertation focuses on extractive approaches to summarization, bypassing the challenges of natural language generation. Furthermore, the presented solutions do not employ sentence simplification algorithms, as these may result in ungrammatical sentences [VSB06].

User studies User studies are probably the most accurate way to evaluate the quality of machine-generated summaries. However, in accordance with most work on summarization systems, we will evaluate summaries by comparison with human-written reference summaries. Throughout this work, we will evaluate machine-generated summaries according to their level of concept capture, i.e. the amount of content they share with reference summaries, using several automatic quality measures commonly used in summarizer evaluation [LH03]; a minimal sketch of one such measure follows below. The evaluation of linguistic quality criteria, such as coherence or grammaticality, is not within the scope of this work.

⁷ http://wordnet.princeton.edu
⁸ http://duc.nist.gov/
⁹ http://www.nist.gov/tac
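As a minimal sketch of one such measure, ROUGE-N recall [LH03] counts the reference n-grams that reappear in the machine-generated summary, clipped by their reference frequency:

```python
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def rouge_n_recall(peer: str, reference: str, n: int = 2) -> float:
    """ROUGE-N recall: matched reference n-grams / total reference n-grams."""
    peer_counts = Counter(ngrams(peer.lower().split(), n))
    ref_counts = Counter(ngrams(reference.lower().split(), n))
    matched = sum(min(c, peer_counts[g]) for g, c in ref_counts.items())
    total = sum(ref_counts.values())
    return matched / total if total else 0.0

print(rouge_n_recall("the quake killed 2000 people",
                     "the quake killed at least 2000 people"))  # -> 0.5
```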


Part I

Automatic Text Summarization


Chapter 1

An introduction to automatic text summarization

Summaries written by humans are an integral part of our daily life, and our interaction with information. Some summary types, such as headlines and reviews, help us decide which news articles or books to read, which movies or TV shows to watch, or which radio shows to listen to. Other types of summaries act as strongly compressed substitutes of the original source, such as book digests or professional abstracts of scientific papers. Some summaries are highly concentrated filters of source information, as, for example, statistics tables or diagrams, or management summaries. Common to all summaries is that they are concise, comprehensive presentations of the most important information contained in a source document or a set of source documents. Figure 1.1 illustrates the reductive power of summaries with an example headline and abstract of a New York Times news article. Readers can immediately understand the main point of the article by scanning just the headline. Furthermore, the one-paragraph summary below the headline provides additional details of the article’s contents. Summaries thus help people to cope with the ubiquitous information overload by presenting only the main points of potentially overwhelming larger amounts of source information. As a result, people can process and organize information more efficiently and make effective decisions in less time and with less effort [HM00]. Furthermore, summaries can guide people towards selecting interesting content by incorporating additional information, such as the recommendatory opinion of a book reviewer.

Figure 1.1: The image shows the headline of an example news article, together with a lead-paragraph summary of its contents. Human readers can grasp the main gist of the news article by scanning just the headline and the lead paragraph, and on this basis decide whether reading the full article is worthwhile.

Information retrieval can benefit from summarization in many areas. A common feature of web search engines are the text snippets displayed with each search result. In addition to providing a brief description of the search result’s contents, such snippets often include short text passages from the context surrounding the found query terms, and thus help users to decide which search results warrant further attention. An even higher benefit arises when considering the document-centric approach of current IR: While search engines allow users to filter large amounts of source information by mapping user queries to a set of documents, they typically still return thousands of relevant results. Many of these result documents contain the same or similar information, and most of them only partially satisfy the user’s information need. Furthermore, it is very likely that large parts of the document contents are unrelated to the information need, i.e. that the information the user is interested in is found in a few passages of the original documents. In the standard IR paradigm, the burden of filling the gap between the output granularity (documents) and the searched-for information (a set of text passages) therefore stays with the users [SAB93, BP06], which forces users to scan and read at least a few, if not potentially very many documents. These downsides are amplified if a user’s information need is more complex than just a simple search for some facts, and requires e.g. a comprehensive overview of a specific topic. Yet another example of a potential application area for summarization systems are news aggregation web sites, which collect hundreds of news articles about the same event from many different sources. However, these articles often repeat similar information, or are derived from material created by the same original newswire agency. Following a news story thus becomes a tedious endeavor, since users have to scan many articles, and will continuously encounter repeated background matter as the news story evolves.

Automatic summarization is a technology that has the potential to address some of the aforementioned deficits of current IR solutions. As the main function of a summary is to condense larger quantities of source material, summaries allow users to get an overview of the most important points without having to process all of the original sources. In addition, summaries can combine information from different sources, and act as a filter for irrelevant and redundant information. Summarization can thus be a powerful tool to reduce the amount of information users have to cope with. From an information retrieval perspective, summarization becomes especially appealing in scenarios where users are faced with a large number of documents related to a specific topic or query, such as search engine result lists, news article clusters, or user-generated product reviews. In addition, summaries condense information for the benefit of the reader and task [Man01]. This means that summarizing the same source information can result in very different, personalized summaries, based on the specific requirements of a user or the task she is performing.

This chapter is intended to make the reader familiar with the concepts and challenges of automatic text summarization. It first introduces human professional summarization in Section 1.1. In Section 1.2, we will then define the task of automatic text summarization, and discuss its main challenges. In Section 1.3, we will introduce a typology of summaries, and juxtapose the two major paradigms of abstractive and extractive summarization. Section 1.4 of this chapter is dedicated to summary evaluation, discussing its challenges and presenting the evaluation metrics used in this thesis. Finally, Section 1.5 looks at summarization from an IR perspective, and frames automatic text summarization within related fields of research.

1.1 Human professional summarization

Everyday notions of summaries include many different things, such as movie trailers, football statistics tables or book reviews. This thesis focuses solely on the summarization of written text documents. We therefore define a summary to be a brief synopsis of the essential parts of the content of one or more source texts, which is presented to the user as a coherent natural language text [Man01]. Summarization is the process of producing this condensed representation of the source content for human consumption.

Several studies have investigated human professional summarization of text documents in order to understand the processes involved and gain insights for improving automatic summarization methods [Cre96, EN98, JM99]. Professional abstractors are active in different areas, for example, in producing bibliographic databases. They typically follow a summarization process that can be described in three approximate stages [EN98, Man01]:

Document exploration. In the document exploration phase, the abstractor examines the document’s title, outline, layout and overall structure to become familiar with the document. The abstractor may be experienced in summarizing a particular type (or genre) of documents (e.g. experimental studies), and thus have some prior knowledge of the types and structure of the information contained in the document. For example, she may know that scientific articles in general first introduce and motivate a research problem, then present a solution or approach, followed by an experimental study and a discussion of its results.

Relevance assessment. The relevance assessment phase deals with the identification of source passages that may be relevant for a summary. In this step, the abstractor constructs a mental model of the document’s theme, and the theme’s structure in terms of different sub-elements, which she then uses to assess the (sub-)elements and the text passages associated with them for their relevance.

Summary production. The summary production phase consists of cutting and pasting text from the source document in order to create the summary. Since professional abstractors usually are not experts in the domain of the documents they summarize, they follow the author as closely as possible. Abstractors often edit and revise the extracted passages to conform to the intended summary structure, e.g. by lexical rewriting, sentence combination, and the deletion of redundant, vague or superfluous terms.

The process described by Endres-Niggemeyer illustrates that an abstractor who aims to attain an understanding of the contents of a source document creates a structured representation of the document’s theme and its elements. In her study, Endres-Niggemeyer comes to the conclusion that human professional summarizers often exploit prior knowledge about a document’s theme and various theme elements to determine important content. This observation is also discussed in a study on newspaper summarization [Man01]. Newspaper editors, who have to write a short front page summary of articles found elsewhere in the newspaper, often look for sentences or paragraphs that contain information corresponding to a “specialized scheme which identifies common, stereotypical situations [and characters] in a domain of interest” [Man01, p. 34]. These sentences or paragraphs are then re-used as is, or with only minor revisions. A major focus of this thesis is the identification of such domain-specific structures of news article collections, and Chapters 3–5 of this thesis will present novel solutions for their utilization in an automatic text summarization system.


1.2 Automatic text summarization

The automation of the task of summarization is very demanding, and involves a number of complex challenges. While research in automatic text summarization has a long tradition, with the earliest systems appearing in the 1950s and 1960s [Luh58, Edm69], it has only been in the last two decades that the majority of automatic summarizers have been developed. This is partly due to the advances made in related research fields (see Section 1.5), as well as to better techniques, tools and linguistic resources that are now at the disposal of summarization researchers [HM00]. In addition, different workshops and the introduction of the annual Document Understanding Conference series (DUC) [DUC07] provided a large stimulus and forum for text summarization research.¹

Two standard definitions of the task of automatic text summarization are given by Mani [Man01, p. 1] and Sparck Jones. According to the former, the goal of a summarizer is to:

“. . . take an information source, extract content from it, and present the most important content to the user in a condensed form and in a manner sensitive to the user’s or application’s needs.”

Sparck Jones provides a similar definition [Jon07], where summarization is

“. . . a reductive transformation of source text to summary text through content condensation by selection and/or generalization on what is important in the source”

In order to create a concise and comprehensive summary, an automatic summarizer must therefore successfully address the following core tasks:

1. Analyze, and potentially “understand”, the source content.

2. Determine important (salient, relevant) source content, to distinguish summary-worthy content from content that can be left out.

3. Condense the source content.

4. Create an output summary.

Note that these steps are applicable not only to the summarization of text, but also to the summarization of other media, such as spoken language or movies. We will now discuss each of these steps in more detail; a minimal end-to-end sketch of the pipeline is given below.

¹ DUC has since been replaced by the Summarization Track of the Text Analysis Conference (TAC) [TAC09] in 2008. Both conference series are organized by the American National Institute of Standards and Technology (NIST, http://www.nist.gov).
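To make the four steps concrete before discussing them individually, here is a minimal, hypothetical extractive pipeline; centroid-based tf-idf ranking with a greedy redundancy filter stands in for the far richer analysis, importance, and condensation models developed in this thesis:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def summarize(sentences, max_sents=3, redundancy=0.5):
    # 1. Analyze: represent each sentence in a tf-idf vector space.
    X = TfidfVectorizer(stop_words="english").fit_transform(sentences)
    # 2. Determine importance: score each sentence against the cluster centroid.
    centroid = np.asarray(X.mean(axis=0))
    scores = cosine_similarity(X, centroid).ravel()
    # 3. Condense: greedily select high-scoring sentences, skipping any that
    #    are too similar to sentences already chosen.
    chosen = []
    for i in scores.argsort()[::-1]:
        if len(chosen) == max_sents:
            break
        if all(cosine_similarity(X[i], X[j])[0, 0] < redundancy for j in chosen):
            chosen.append(int(i))
    # 4. Synthesize: output the selection in original document order.
    return " ".join(sentences[i] for i in sorted(chosen))
```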


Source analysis

Determining what is important in the source requires an analysis of the source’s content. The summarizer must process the source content, and construct an internal representation of it. The source representation captures the different (information) elements contained in the source, and their relation to each other. In this context, the notion of source elements may refer to words or concepts, as well as more complex elements such as representations of facts, or elements of the discourse structure of a document. The analysis of written text implies handling natural language and all its inherent complexities, including morphological, syntactic, semantic and discourse issues [Man01]. Ultimately, the goal of source analysis is to arrive at some form of understanding of the source content, which would require the successful integration of various kinds of lexical, domain and common-sense knowledge [Jon07].

Determining content importance

The next task of a summarization system is to determine which source content is important, and which information does not need to be included in the summary. From an IR perspective, this can be seen as the problem of filtering and ranking source content elements. Importance (or salience, relevance) can be defined as the weight attached to the information elements of a document. This weight can be influenced by several different factors: Some scenarios may only require the identification of the most important content in a single document [Man01]. In other scenarios, salient content is determined relative to other documents, e.g., for identifying novel information [RM98, AGK01]. Importance may also depend on the user’s or application’s needs, and thus require a personalization factor in the filtering and ranking strategies [MB97, MB98]. In addition, individual source content elements may be very similar. Since it is not useful to repeat similar information in a summary, the relevance of content elements also depends on which other elements have already been included in a (partially constructed) summary [CG98, RJB00]. Finally, as discussed in Section 1.1, prior domain and background knowledge may also play a role in determining content importance.
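The best-known formalization of this redundancy-aware ranking is the Maximal Marginal Relevance (MMR) criterion of [CG98]; a minimal sketch follows, where `sim` is any similarity function (e.g. cosine similarity of tf-idf vectors) and the setting λ = 0.7 is an arbitrary choice of this sketch:

```python
def mmr_next(candidates, query, selected, sim, lam=0.7):
    """Pick the candidate maximizing Maximal Marginal Relevance [CG98]:
    lam * sim(candidate, query) - (1 - lam) * max similarity to the
    already-selected summary content (0 if nothing is selected yet)."""
    def marginal_relevance(c):
        redundancy = max((sim(c, s) for s in selected), default=0.0)
        return lam * sim(c, query) - (1 - lam) * redundancy
    return max(candidates, key=marginal_relevance)
```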

An additional challenge of the identification of summary-worthy content is that importance is an elusive notion and hard to establish. It is well known that humans vary strongly in what they consider important in a given source text, and the same human will choose different source elements when given the same source on two separate occasions [RRS61, Man01, NPM07]. It is therefore not trivial to determine the particular contribution of various content and context properties to content importance.

Condensation

Once an internal source representation has been constructed, and the importance of elements of the source representation has been established, the summarization system needs to transform the source representation into a summary representation [Jon07]. During this transformation, the information contained in the source representation is condensed. Condensation can be achieved by selection, where the summarizer chooses a subset of the overall source content for the summary [Man01]. Selection is typically guided by the importance of source elements determined in the previous step. However, analogously to the problem of determining content importance, it is difficult to decide which particular source element to select, given a set of elements that cover the same or similar information. For example, human abstractors may consider the same information as important, but differ in the way they express this information, e.g. by using synonyms or paraphrases [NP04, NPM07]. One of the benefits of automatic summarization, advocated by Luhn [Luh58], is that an automatic system produces consistent summaries in terms of content selection and expression.

Complementary to selection, a summarization system may apply generalization and aggregation operations [HM00]. Generalization refers to replacing one or more source elements by a single, more abstract one. For example, the words “vegetables” and “fruit” may be expressed more briefly as “groceries”. This is also an example of aggregation, which refers to the merging of source elements. Generalization and aggregation can be performed at different levels of the source representation, e.g. at the morphological, syntactic or semantic level [Man01].

Finally, the condensation process often targets a specific compression ratio or a maximum summary length, which restricts the number of words, and hence the amount of information, the summary can contain. Depending on the length of the input (consider for example a short news report versus an essay-length dossier), this may mean that the summarizer must be flexible in choosing which and how much source material to include in the summary [Man01].

Synthesis

In this step, the internal summary representation is rendered back into natural language text to create an output summary. For text summarization, this usually means producing a coherent, fluent text, much like an abstract written by a human [Man01]. However, other summary forms include text snippets, headlines, and lists of keywords. The challenge of generating coherent text will be further discussed in the next section.

1.3 Summarization concepts

This section introduces summary types and basic concepts of automatic summarization. We describe the two main paradigms for tackling the problem of automatic summarization: extractive and abstractive summarization. We then introduce and discuss further categorizations of summaries that materialize from the different application scenarios.

1.3.1 Abstractive and extractive summaries

Automatic text summarizers are historically categorized as abstractive or extractive. Abstractive summarization aims to reproduce human summarizing by attempting to infer the meaning of source content during analysis, and by creating new or reformulated text during summary generation. Extractive summarization, on the other hand, focuses on the simpler strategy of selecting and concatenating source text passages to produce a summary. Figure 1.2 shows an example news article together with an abstractive and an extractive summary of its contents. Both summaries were created by human professional summarizers. The abstract combines and rephrases information from different source sentences into new text, whereas the extract consists of selected sentences from the source document.

Hurricane Gilbert, one of the strongest storms ever, slammed into the Yucatan Peninsula Wednesday and leveled thatched homes, tore off roofs, uprooted trees and cut off the Caribbean resorts of Cancun and Cozumel. Looters roamed the streets of Cancun, stealing from stores whose windows were blown away. Huge waves battered the beach resorts and thousands were evacuated. Despite the intensity of the onslaught and the ensuing heavy flooding, officials reported only two minor injuries. The storm killed 19 people in Jamaica and five in the Dominican Republic before moving west to Mexico. Prime Minister Edward Seaga of Jamaica said Wednesday the storm destroyed an estimated 100,000 of Jamaica’s 500,000 homes when it throttled the island Monday. The Jamaican Embassy reported earlier that 500,000 of the nation’s 2.3 million people were homeless. Army officials in Mexico City said about 35,000 people were evacuated from Cancun, but Cancun Mayor Jose Sanchez Zapata said about 11,000 fled. More than 120,000 people on the northeast Yucatan coast were evacuated, the Yucatan state government said. The eye of the storm passed over Cozumel and Cancun with howling winds clocked at 160 mph at about 8 a.m. EDT. By Wednesday night the National Hurricane Center downgraded it to a Category 4, but center director Bob Sheets said: “There’s no question it’ll strengthen again once it comes off the Yucatan Peninsula and gets back in open water.”

(a) Example news article

On Wednesday, Hurricane Gilbert, a category 5 storm, the strongest and deadliest type, slammed into the Yucatan Peninsula with 160 mph winds causing heavy damage to the resort areas of Cancun and Cozumel. It destroyed an estimated 100,000 of Jamaica’s 500,000 homes. More than 120,000 people on the northeast Yucatan coast were evacuated. The already record-setting storm is expected to intensify as it leaves the Yucatan and again moves over water.

(b) Abstractive summary

Hurricane Gilbert, one of the strongest storms ever, slammed into the Yucatan Peninsula Wednesday and leveled thatched homes, tore off roofs, uprooted trees and cut off the Caribbean resorts of Cancun and Cozumel. More than 120,000 people on the northeast Yucatan coast were evacuated, the Yucatan state government said. Prime Minister Edward Seaga of Jamaica said Wednesday the storm destroyed an estimated 100,000 of Jamaica’s 500,000 homes when it throttled the island Monday.

(c) Extractive summary

Figure 1.2: A news article about Hurricane Gilbert, together with an example abstractive and extractive summary of the article’s contents.

Abstraction

Abstractive summarization is motivated by the assumption that if one can grasp the meaning of a text, one can condense it more effectively, and thus create a more concise summary. Abstraction must therefore process meaning representations (often also called symbolic representations) of source and summary content. The construction of such representations requires deep linguistic analyses of source text, such as syntactic and discourse parsing, and relies on machine-readable resources that encode context and world knowledge [HM00]. Important content is determined by weighting elements of the symbolic representation of source content. The transformation of the source representation into a summary representation involves selection and reasoning operations (e.g. aggregation, generalization, or inference with respect to a user’s information need). The use of reasoning operations on meaning representations again requires context and world knowledge, such as domain-specific ontologies. Finally, the internal summary representation needs to be rendered into written text by a Natural Language Generation component [RD00].

Abstractive summarization is an appealing research goal, since it potentially offers high compression rates, the possibility of conceptual and structural condensation, and the inclusion of background material that enriches the source content [HM00]. However, abstractive summarization comes at a huge cost due to its high demands in terms of linguistic resources and technologies. Most of the necessary tools are not yet reliable or generic enough, or are costly to transfer to new knowledge domains [Jon07]. Therefore, abstractive approaches have rarely been investigated in recent summarization research.

Extraction

In contrast to abstractive summarization, extractive summarization essentially aims to reproduce selected source content [Edm69]. It focuses on the identification of the most important source material, which is extracted “as is” to create a summary [Luh58, Man01]. The main assumption of extractive summarization is that not all source passages are equally informative [RHM02], and thus identifying the more informative source passages allows for the construction of a summary. An extractive summarization system approaches the task of summarization by splitting the source text into passages, which are then weighted, filtered and ranked during the analysis phase. The basic unit of extraction is typically a sentence, since it is a prominent linguistic unit [Man01]. In addition, extracting sentences guarantees that the summary consists of grammatically well-formed text. Extracting smaller passages, such as clauses or phrases, results in fragmentary text that needs to be patched or reformulated. Choosing a larger unit of extraction, such as a paragraph, may negatively affect the compression rate and conciseness of the summary. Extractive summarization thus emphasizes the analysis phase of summarization, whereas the condensation phase is typically reduced to selecting the most highly weighted passages, which are then ordered and concatenated during the synthesis phase. Simply concatenating source passages extracted from different parts of the original text may however lead to problems of text coherence, cohesion and redundancy [HM00].
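As a concrete illustration of this weight-filter-rank scheme, the following minimal sketch scores each sentence by the source-wide frequency of its content words, in the spirit of Luhn’s early frequency-based approach. The tiny stop-word list and the scoring heuristic are purely illustrative:

```python
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "of", "to", "in", "and", "is", "was", "it"}

def content_words(text):
    return [w for w in re.findall(r"\w+", text.lower()) if w not in STOPWORDS]

def extract_summary(text, n_sentences=3):
    sentences = re.split(r"(?<=[.!?])\s+", text)
    freq = Counter(content_words(text))  # source-wide word weights
    def score(sentence):
        words = content_words(sentence)
        return sum(freq[w] for w in words) / (len(words) or 1)
    top = set(sorted(sentences, key=score, reverse=True)[:n_sentences])
    # Present the extracted sentences in their original document order.
    return " ".join(s for s in sentences if s in top)
```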

Text coherence captures the notion of the macro-level discourse structure of a text. A text is coherent if it forms an integrated whole and is not just a set of disjoint passages. Coherence is represented by relations between text segments such as elaboration, cause and explanation [BE97, Man01]. In practice, ensuring coherence is difficult, because it requires understanding the content of each passage and knowledge about the structure of discourse [RHM02]. Since extraction typically treats passages as independent, an extractive summary may be incoherent, for example by containing unresolved pronouns, or argumentative gaps [HM00]. Cohesion, on the other hand, is a device for “gluing” together different parts of a text through the use of semantically related terms (lexical cohesion), co-reference, ellipsis and conjunction [HH76]. Extractive summaries may lack cohesion, e.g. when combining sentences from different source documents.

Finally, redundancy may be introduced into a summary if the summarizer selects passages that contain similar or the same information. A summarization system must therefore compare the content of the selected passages before adding them to a summary.
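A simple way to perform such a comparison, sketched below, is to represent each candidate passage as a bag-of-words vector and to skip a passage whose cosine similarity to an already selected passage exceeds a threshold. The threshold of 0.5 is an arbitrary illustration; real systems tune it, or use more principled schemes such as Maximal Marginal Relevance [CG98]:

```python
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def select_nonredundant(ranked_sentences, limit=5, threshold=0.5):
    """Greedily take the highest-ranked sentences, skipping any that
    are too similar to a sentence already in the summary."""
    selected, vectors = [], []
    for sentence in ranked_sentences:
        vec = Counter(sentence.lower().split())
        if all(cosine(vec, v) < threshold for v in vectors):
            selected.append(sentence)
            vectors.append(vec)
        if len(selected) == limit:
            break
    return selected
```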

A vast majority of current approaches to automatic text summarization pursue extractive strategies. However, many systems apply deeper linguistic analyses and employ knowledge-based methods, both to arrive at more meaningful source representations and to produce more abstract-like summaries [HM00, RHM02, Jon07]. The distinction between extractive and non-extractive approaches is therefore not absolute, especially for different parts and components of summarization systems.

1.3.2 Indicative, informative and critical-evaluative summaries

Summaries can be classified according to the task or function they are intended for. The literature distinguishes three types of summary uses: indicative, informative, and critical-evaluative [Edm69, Cre96, Man01].

Indicative summaries provide enough content to act as a decision (or navigation) tool, but they do not give a comprehensive overview of the source content [Man01]. For example, the text snippets provided with search engine query results allow users to decide which documents to read, and facilitate browsing the collection of relevant documents. Indicative summaries thus enable an efficient screening and scanning of sources.

Informative summaries, on the other hand, can serve as a substitute for the original document. They assemble relevant information from the source, at some level of detail, in a short, concise document [HM00]. Informative summaries are what automatic summarizers have typically targeted [Jon07], and the construction of informative summaries is also the focus of this thesis.

Finally, critical-evaluative summaries incorporate a critique of the source’s contents, in addition to providing an informative gist [Man01]. A book review is a good example of this type of summary. Critical-evaluative summaries thus add expertise that is not available from the source alone [HM00]. Automatic summarizers currently cannot produce such summaries due to the infeasibility of encoding this kind of expertise.

1.3.3 Generic, personalized and update summaries

A further categorization of summaries arises from varying the summary’s focus. Traditionally, summaries were intended to be generic, i.e. designed to comprehensively reflect the main content of a source document [Jon07]. However, in some settings, such as a user searching the Internet, it may be useful to provide query-focused summaries.2 In contrast to generic summaries, query-focused summaries cover source content more selectively, and may not reflect all aspects of the source [Man01]. Often, the query (or topic statement) defines a complex information need, which may consist of a set of sub-questions. An example topic statement is shown in Figure 1.3. Query-focused summaries should ideally aim to include answers for each of the “sub-questions” of the user’s information need.

Title: Global warming
Query: Describe theories concerning the causes and effects of global warming and arguments against these theories.

Figure 1.3: An example topic statement (query) describing a user’s information need, as used in query-focused summarization. Note that the query consists of multiple sub-questions, and an automatic summarization system should therefore aim to create a summary that includes answers for all of them.

2 Query-focused summaries are often also called topic- or user-oriented summaries, depending on the specific focus of the summarization approach.

Query-focused summarization raises the additional challenge of adequately translating a user’s information need into features which indicate the query relevance of source elements. This requires a representation of the information need, and its comparison to source elements. For complex topic statements involving several (sub-)questions, a summarization system needs to ensure that each of the different sub-questions is addressed in the summary. Query-focused summarization has been the main focus of many DUC competitions [DUC07].
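One crude but illustrative way to address sub-question coverage is to iterate over the sub-questions in round-robin fashion, each time selecting the not-yet-chosen sentence with the highest word overlap with the current sub-question. Word overlap is of course only a weak proxy for query relevance; the sketch merely makes the coverage requirement concrete:

```python
def query_focused_select(sentences, subquestions, k=6):
    def overlap(sentence, question):
        return len(set(sentence.lower().split()) & set(question.lower().split()))
    selected = []
    while len(selected) < min(k, len(sentences)):
        # Cycle through the sub-questions so that each one gets covered.
        question = subquestions[len(selected) % len(subquestions)]
        candidates = [s for s in sentences if s not in selected]
        selected.append(max(candidates, key=lambda s: overlap(s, question)))
    return selected
```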

A final type of source coverage arises in the context of newswire summarization. Newswire streams continuously present novel articles that update an ongoing news story. A reader may consume a summary of a current event on one day, whereas from then on she is only interested in new developments of the story. In that case, the summarizer needs to produce update summaries, i.e. summaries which cover only novel information and do not include content that is assumed to be known to the user [AGK01, HRL07, DO08].

The summarizers developed in this dissertation are mainly evaluated using query-focused summarization datasets, but are not restricted to this type of summarization.

1.3.4 Single- and multi-document summarization

Text summarization is not restricted to the production of summaries of a single source document. A much more interesting use case is the summarization of multiple sources, which is known as multi-document summarization (MDS) [Man01, ODH07, Jon07]. Multi-document summarization addresses a range of application scenarios strongly connected with current document-centric approaches to information retrieval. For example, as mentioned in the introduction, multi-document summaries can be used for condensing search results, or for providing overviews of news article collections. The richer input in multi-document summarization requires more complex processes and representations than single-document summarization, but also allows for more interesting summaries, as summarizers can identify and highlight similarities and differences across documents [MB99].

Multi-document summarization typically operates on a collection of thematically related documents [Dan06, ODH07]. The common theme of the document collection may be the user query that resulted in a particular set of documents being retrieved by a search engine. Alternatively, as in the case of news article collections, the common theme is often a specific news event. The close relatedness of the source documents means that much of the information they contain will be redundant. For example, news articles often summarize previous reports of the same event as the news story evolves, and repeat similar background matter.

Recognizing similar information is therefore one of the main challenges of multi-document summarization [RHM02]. It is crucial for avoiding the introduction of redundant information in the summary, and for creating more diverse summaries [MB99, GMCK00]. In addition, the frequency with which a particular piece of information is encountered in the source content is a good indicator of its importance [NVM06]. The identification of similar information can thus help to find important content.

One recurrent observation in multi-document summarization, especially of news data, is that documents usually consist of several subtopics, centered around a main theme [GL01, BL04, Jon07]. For example, a news article collection about an earthquake contains information about the earthquake’s strength and location, reports of casualties, rescue efforts, aftershocks, international help, and so forth. Each document is composed of a subset of these subtopics, and many of the subtopics are repeated throughout the document collection.

(a) Single-document summarization (b) Multi-document summarization

Figure 1.4: Illustrative schema of extractive single- and multi-document summarization. (a) The image shows the working principle of an extractive single-document summarizer. Important passages from the source document are extracted and concatenated to create a summary. (b) In extractive multi-document summarization, documents may repeat the same or similar information, as indicated by the passages with identical coloring. A summarizer must recognize these similarities in order to avoid redundant content in the summary.

Subtopics can be viewed as domain-specific sub-elements of the overall text structure, i.e. news articles about an earthquake consist of a different set of subtopics than articles about a presidential election. Previous research has argued that humans rely on prior knowledge about such formulaic domain elements, as this facilitates reading comprehension and recall [BL05, Bar32]. Furthermore, as discussed in Section 1.1, human summarizers often utilize this kind of knowledge to identify important information. Modeling the subtopics of a document collection can therefore contribute to the design of more effective summarization systems, since such a model represents high-level elements of the source content. The aggregation of subtopic information from different source documents in multi-document summarization can help to create more accurate and comprehensive representations of each subtopic, such that a user interested solely in a particular aspect of the source material can get a more complete picture of it.

Finally, multi-document summarization is more challenging than single-document summarization because it introduces additional complexities in other areas as well. For example, ensuring coherence and cohesion of an output summary created by concatenating passages from different source documents involves resolving cross-document co-references, and handling differences in author style and source register [GMCK00, Jon07]. Document collections may also be broadly distributed in time, and the summarizer thus must decide if and how to present older information. Furthermore, generating a multi-document summary in general implies adopting a much higher compression rate, since the amount of source content is typically larger than for single-document summarization.

The summarization methods proposed in this thesis deal exclusively with multi-document summarization, and in particular with the summarization of topically related collections of news articles. The use of news material is, on the one hand, motivated by the existence of suitable evaluation datasets (and, conversely, by the high cost of creating novel datasets). On the other hand, the main focus of this work is to address challenges related to the identification of similar information elements, which is facilitated by choosing a domain in which many documents naturally cover the same or similar information (i.e. news events). In particular, this thesis will present novel solutions for multi-document summarization that utilize the theme and subtopic structure of news article collections (Chapters 3–5).

1.3.5 Other aspects of summarization

There are various other, less investigated aspects and challenges of automatic text summarization. The following list is intended to serve as a brief overview; for an extensive discussion see Mani [Man01] and Sparck Jones [Jon07].


Language Most studied source content is in English, but experiments have also been done with Japanese [Nom05], Chinese [CLGT00], Dutch [MD00], German [RKEA00], Arabic [DL04, COS06], and other languages. In general, source content is monolingual, with the exception of the Multilingual Summarization Evaluation workshop,3 which used both English and Arabic input. The output language is usually the same as the input language. For cross-lingual and multi-lingual summarization, however, output may be in a language different from the input [RKEA00, DM05a].

3 http://projects.ldc.upenn.edu/MSE, visited May 3rd, 2011

Genre Most of the input in summarization research is news material. Other corpora include scientific articles [TM02], legal texts [HG05, SRR08], medical texts [MJH98], email messages [CORGC04], short stories [KS10], web pages [SSZ+05] and blogs [HSL08]. There has been little research on explicit genre selection for output [May95]; typically the input genre has determined the output genre.

Length The length of the source content may also play a role; for example, summarizing a short story requires a much more effective condensation than summarizing a news article [Man01].

Document structure Document structure, such as section headings, tables, quotations, or domain-specific orderings of information, could be useful to consider during content analysis [Jon07].

Output Format Summaries typically consist of fluent text, but for some purposes fragmentary summaries, i.e. lists of words or phrases, may suffice. Text snippets are a form of summary used by search engines to provide context for the displayed search results. Other specialized forms of summary are document headlines [BMW00], key phrases [Zha02] and tag clouds.

Table 1.1 lists the discussed summary types, and Table 1.2 gives an overview of the main challenges of automatic text summarization.

Parameter            Types
Relation to source   Extractive or abstractive
Use                  Indicative, informative or critical
Coverage             Generic, query-focused or update
Units                Single- or multi-document
Output               Fragments or fluent text
Language             Mono-, multi- or cross-lingual

Table 1.1: Types of summaries

Phase           Challenge           Description
Analysis        Content analysis    Process the source text(s) to create a meaningful source representation, which requires handling the complexities of natural language text.
Analysis        Importance          Determine what is important in the source, possibly with respect to a user’s information need.
Analysis        Content similarity  Recognize repeated and similar source content, especially in multi-document summarization.
Transformation  Condensation        Decide how to condense, e.g. by selection, aggregation and generalization of source elements, to create a summary representation. A targeted compression ratio or maximum summary length may additionally restrict the amount of information allowed in the summary.
Synthesis       Text generation     Ensure that the output summary consists of well-formed, grammatical text, and does not suffer from a lack of coherence or cohesion.
Synthesis       Redundancy          Avoid the introduction of redundant information.

Table 1.2: Main challenges of automatic text summarization

1.4 Summarizer evaluation

Summarization systems have to be evaluated like any other natural language information processing system. This section discusses the main challenges of summary evaluation, and introduces the different evaluation methods used by the research community and in this dissertation.

Automatic summarization systems (and automatically constructed summaries) can be evaluated intrinsically and extrinsically [JG96]. Intrinsic evaluation tests the system in itself and with respect to its own declared objectives, and has been at the core of summarizer evaluation. Extrinsic evaluation, on the other hand, assesses a system in relation to another task.

There have been only a few extrinsic evaluations of summarization systems. For example, summaries have been used for relevance filtering of full documents in retrieval [BMR95, MB98, JBME98], or employed as an internal system module for document indexing and retrieval [SSWW99]. Other researchers have investigated summaries in the context of information search on mobile devices [BGMP01], and for report generation [MJH98, MPE+05].


The two main intrinsic evaluation concepts of summarization are text quality and concept capture. Text quality criteria measure the linguistic well-formedness of a summary, i.e. whether it consists of well-formed discourse, and is grammatically correct and coherent. The concept-capture criterion, on the other hand, relates to the notion that a summary should reflect as many key concepts of the source document as possible, i.e. that it should contain the most important pieces of information from the source document. However, both criteria are difficult to measure, as will be discussed below. The comparison of concepts found in a machine-generated summary with the concepts contained in reference (or gold-standard) summaries written by human abstractors has become the standard method of intrinsic summarizer evaluation [Man01].

Using human-written reference summaries, or any other form of human judgment in summarizer evaluation, raises the problem of human variation [RRS61, JBME98, PNMS05, Jon07]. Different possible extracts or abstracts may be equally good summaries of a source [Man01], and it is well known that agreement on which content is important enough to be included in the summary is typically not very high when comparing human-written summaries of the same source text(s) [RRS61, vHT03, NPM07]. For example, Hovy and Lin report that only 40% of the words in multiple human reference summaries overlap [LH02], and Copeck et al. observe that only 55% of the vocabulary of reference summaries occurs in the source documents [CS04].

This observation holds true even when human judges are only required to pick out representative sentences from a source document, and also when the same person is requested to summarize the same source text on two separate occasions [RRS61, SSMB97]. However, humans do tend to agree on the most important content to extract (e.g. the top 10% of the sentences) [JBME98, Mar99], whereas agreement drops for less important content in longer summaries. McKeown et al. showed that, when using only a single reference summary, the choice of reference summary has a significant impact on the scores assigned to machine-generated summaries [MBE+01]. To mitigate the effects of human variation in content selection, summary evaluations (such as the DUC competitions) typically utilize multiple human reference summaries [PNMS05, Dan06].

Furthermore, even if humans agree on some particular content as summary-worthy, they will not necessarily use the same words or phrases to express this content [vHT03, NP04, NPM07]. Different summaries may therefore contain content that is semantically, but not necessarily lexically or syntactically similar. For example, the three phrases “the hiring of Jose Ignacio Lopez, an employee of GM by VW”, “Ignacio Lopez De Arriortua, left his job at General Motor’s Opel to become Volkswagen’s . . . director” and “He left GM for VW” all express the same fact using different words and word combinations. The Pyramid method, which will be described in more detail in the next section, aims to capture these variations using so-called “summary content units”.

1.4.1 Metrics

The difficulties of summarizer evaluation were addressed in the summarization community by the introduction of the annual DUC and TAC conferences. These conferences enforced a shared road map of summary evaluation, with the goal of moving from intrinsic to extrinsic, task-oriented evaluation [Jon07]. They provided a forum for continuous, large-scale evaluations of system performance on common datasets, using both human judgments of summary quality as well as (semi-)automated methods of intrinsic summary evaluation that compare machine-generated summaries with sets of gold-standard summaries written by NIST assessors. Over the years, a variety of automatic and manual evaluation methods have figured in DUC and TAC evaluations, most prominently the automatic Rouge metric, manually evaluated linguistic quality and content criteria, and the Pyramid method. This section will first introduce Rouge, which constitutes the only practical automatic evaluation measure to quantify the level of concept capture of a summarization system. Subsequently, we will present the Pyramid method and its basic notion of summary content units, as an understanding of these concepts motivates the analyses conducted in Chapter 7 of this thesis.

Rouge

Rouge is a recall-oriented metric that measures how well a machine-generated summary overlaps with a set of human reference summaries in terms of the words they contain. It therefore addresses the concept-capture criterion introduced above, and approximates this criterion by calculating word n-gram4 co-occurrence statistics [LH03, Lin04]. Rouge is recall-oriented since it determines how many correct concepts are contained in a machine-generated summary, when compared with the set of concepts contained in the set of human-written reference summaries.5 Higher values indicate a higher summary quality.

4 A word n-gram is a contiguous sequence of n words: for example, a unigram (1-gram) is a word sequence consisting of a single word only, whereas a bigram is a sequence of two words. ‘bigram is’ and ‘is a’ are two example bigrams of the previous sentence.

5 Recall is a widely used evaluation metric in information retrieval and measures the completeness of correct results. For a detailed explanation of precision, recall and F-measure, see e.g. [MS01, BYRN99].

The Rouge metric provides several different measures. The most commonly used are Rouge-1 (unigram overlap), Rouge-2 (bigram overlap) and Rouge-SU4 (skip-bigram overlap, where word pairs can contain up to 4 intervening words). Rouge-N is computed as follows:

\[
\mathrm{ROUGE\text{-}N} = \frac{\sum_{S \in \{\mathrm{ReferenceSummaries}\}} \; \sum_{gram_n \in S} \mathrm{Count}_{match}(gram_n)}{\sum_{S \in \{\mathrm{ReferenceSummaries}\}} \; \sum_{gram_n \in S} \mathrm{Count}(gram_n)} \tag{1.1}
\]

where n denotes the length of the n-gram, and Count_match(gram_n) is the number of n-grams gram_n co-occurring in both the candidate summary, i.e. the summary to be evaluated, and the set of reference summaries. The numerator sums over all reference summaries, which gives more weight to matching n-grams occurring in multiple reference summaries. This follows the intuition that information which multiple human reference summaries agree on is more important, and should be weighted higher.
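For illustration, Equation 1.1 can be implemented in a few lines. The sketch below uses clipped counting for Count_match (an n-gram from a reference is matched at most as often as it occurs in the candidate); in practice one would use the official Rouge toolkit, which adds stemming, stop-word handling and further options:

```python
from collections import Counter

def ngram_counts(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n(candidate, references, n=2):
    cand = ngram_counts(candidate.lower().split(), n)
    matched = total = 0
    for ref in references:
        ref_counts = ngram_counts(ref.lower().split(), n)
        total += sum(ref_counts.values())
        # Clipped match count against the candidate summary.
        matched += sum(min(c, cand[g]) for g, c in ref_counts.items())
    return matched / total if total else 0.0
```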

The major advantage of Rouge is that it does not rely on human judges to annotate important content, but instead uses n-gram frequencies as an indicator of content importance. Rouge can be computed automatically and repeatedly, and has been shown to correlate well with human judgments [LH03]. However, Rouge measures are based on lexical overlap, and therefore cannot handle linguistic phenomena such as word ambiguity (polysemy) and synonymy. They also cannot capture semantic similarities and differences of text. Furthermore, Rouge-1 considers unigram observations independently, and thus makes the well-known “bag-of-words” assumption [MS01]. This assumption is relaxed for bigram and higher-order n-gram statistics, but these in turn are sensitive to changes of word choice and word order. Considering the huge costs and variability of manual summarizer evaluation, however, Rouge is de facto the only consistently repeatable and reliable method for automatic, intrinsic summarizer evaluation, and thus the most widely utilized metric to compare system performance outside of the manual evaluations performed during the DUC and TAC conferences. We utilize Rouge in this thesis for all automatic evaluations.

The Pyramid method

The Pyramid method is a manual evaluation scheme that aims to overcome some of the deficits of Rouge [NP04, PNMS05, NPM07]. The method rewards automatic summaries for conveying content that has the same meaning as content represented in a set of human reference summaries. The Pyramid method is thus not based on the lexical overlap of word n-grams, but on the semantic similarity of larger text spans. These text spans, which can be as long as a sentential clause, are called content units. Similar content units are identified on the basis of expressing the same semantic content, irrespective of the actual choice and ordering of words. The Pyramid method is hence not as sensitive to variation in human content expression as the Rouge metric.

The Pyramid method relies on human judgments to identify similar content units in sets of human reference summaries. Similar content is identified manually, which incurs another expensive annotation step in the process of summarizer evaluation. A group of content units with a shared meaning is referred to as a Summary Content Unit (SCU), and the content units are denoted as contributors of that SCU. As different contributors may vary in how precisely they specify a particular piece of information (e.g. “1993” vs. “the early 90’s”), the exact semantic precision of the SCU is left to the annotator’s judgment, and typically clarified by the label assigned to an SCU [PNMS05]. While the Pyramid approach allows for variation in the way similar content is expressed, different authors have observed that semantically similar text spans written by different human summarizers are often conveyed with a similar choice of words and word patterns [NP04, HNPR05].

SCUs are weighted by the number of human reference summaries they occur in, i.e. the number of their contributors. This approach assigns more importance to content that multiple human summarizers agree on. A Pyramid model is then created by collecting all identified summary content units, with higher-weighted SCUs at the top, and SCUs occurring only in a single reference summary at the bottom. The “pyramid” shape arises since there are typically only a few content units of maximum weight, and many SCUs of weight 1, corresponding roughly to a negative binomial distribution [PNMS05]. Two example SCUs are given in Table 1.3. SCU 18 has a weight of 3, since three reference summaries contribute to it, whereas SCU 21 has only two contributors and a weight of 2. SCU 18 aggregates contributors which share some key phrases such as “Air National Guard” and “search”, but otherwise exhibit a quite heterogeneous word usage. Contributor 3 gives details on the aircraft type and specifies a time when the first sea vessel was launched to search for the missing plane. Only contributor 1 gives information about the location of the search. In SCU 21, the first contributor contains additional information about communication with the Kennedy family, which is not expressed in the SCU label and is therefore not part of the meaning of the SCU. Both contributors contain key terms such as “officials”, “search” and “recovery”, but vary in word order and verb usage.

SCU 18: The US Coast Guard with help from the Air National Guard then began a massive search-and-rescue mission, searching waters along the presumed flight path
  Contributor 1: The US Coast Guard with help from the Air National Guard then began a massive search-and-rescue mission, searching waters along the presumed flight path
  Contributor 2: A multi-agency search and rescue mission began at 3:28 a.m., with the Coast Guard and Air National Guard participating
  Contributor 3: The first search vessel was launched at about 4:30 a.m. An Air National Guard C-130 and many Civil Air Patrol aircraft joined the search

SCU 21: Federal officials shifted the mission to search and recovery
  Contributor 1: Federal officials shifted the mission to search and recovery and communicated with the Kennedy and Bessette families
  Contributor 2: federal officials ended the search for survivors and began a search-and-recovery mission

Table 1.3: Summary content units (SCUs) used in the Pyramid evaluation [NPM07]. The table shows two example SCUs. Each SCU groups together text passages (contributors) from human reference summaries which share the same meaning, regardless of the choice of words used to express this meaning.
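In code, an SCU can be thought of as little more than a label with a set of contributors, its weight falling out of the number of reference summaries it draws on. A hypothetical minimal representation (names and example strings are ours):

```python
from dataclasses import dataclass, field

@dataclass
class SCU:
    label: str  # the annotator-assigned meaning label
    contributors: list = field(default_factory=list)  # one text span per reference summary

    @property
    def weight(self) -> int:
        # An SCU is weighted by the number of reference
        # summaries contributing to it.
        return len(self.contributors)

scu18 = SCU("The US Coast Guard ... began a massive search-and-rescue mission",
            ["contributor 1 ...", "contributor 2 ...", "contributor 3 ..."])
assert scu18.weight == 3
```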

To score an automatically generated summary using the Pyramid method, one sums the weights of the content units contained in it, and normalizes this value by the score of an ideally informative summary. An ideal summary contains as many content units as the average human reference summary, and as many highly weighted SCUs as possible. Formally, the score Max of such an ideal summary is calculated as follows:

\[
\mathrm{Max} = \sum_{i=j+1}^{n} i \cdot |T_i| + j \cdot \left( X - \sum_{i=j+1}^{n} |T_i| \right), \quad \text{where } j = \max_i \left( \sum_{t=i}^{n} |T_t| \geq X \right) \tag{1.2}
\]

In this equation, |T_i| is the number of content units of weight i, X is the average number of SCUs in the reference summaries, n is the maximum weight of any content unit, and j is equal to the index of the lowest content unit weight an optimal summary will draw from. The modified pyramid score of a peer summary is then the sum of the weights of its content units, divided by Max.

A major disadvantage of the Pyramid method is its strong dependence on human effort, both during SCU annotation and summary scoring, which severely hinders its wider application outside of the official DUC/TAC competitions. While some authors have used existing Pyramid annotations to automate the scoring of machine-generated summaries [HNPR05], there is to date no automatic approach to the identification of SCUs in human reference summaries. In this thesis, we present a novel approach that takes a step towards an automated discovery of semantically similar content units. Our analysis of reference summaries shows that the described approach identifies recurrent word patterns that are good approximations of manually annotated Summary Content Units (Chapter 7). From a summarization point of view, our method allows for a fine-grained model of inter- and intra-document content similarities (e.g. by representing the different fact-like text spans expressed in a set of related news articles, see Chapter 6). From an evaluation point of view, we believe it can help human annotators during Pyramid creation and with the evaluation of machine-generated summaries.

Text quality and responsiveness

Summarization evaluation has traditionally involved human judgments of different linguistic quality metrics. These metrics are used to assess the readability and fluency of summaries and are not based on a comparison of machine-generated summaries against reference summaries.6 Instead, machine-generated summaries are scored on a five-point scale for each of the following metrics [Dan06]:

Grammaticality A summary should not contain ungrammatical sentences or spelling errors.

Non-redundancy Information provided by the summary should not be repetitive. This includes repeated sentences or facts, or the repeated use of nouns or noun phrases (“Bill Clinton”) when a pronoun would suffice.

Referential clarity It should be easy to relate pronouns to the noun phrases they are referring to, and the role of entities or their relation to the story should be clear.


Focus The summary should have a clear focus and sentences should only contain information that is related to the rest of the summary.

Structure and Coherence The summary should be well-structured and well-organized, and should not be just a heap of information.

6 A description of the quality metrics can be found at http://www-nlpir.nist.gov/projects/duc/duc2007/quality-questions.txt, visited May 3rd, 2011.

During the DUC and TAC competitions, human judges also evaluate the content responsiveness of a summary, which measures the amount of information in the summary that actually helps to satisfy the information need expressed in the topic statement.7 The responsiveness score provides a coarse manual measure of information coverage [Dan06]. It is therefore similar to the Rouge and Pyramid metrics, but can be judged by a human without the use of reference summaries.

7 The instructions on how to evaluate this criterion can be found at http://www-nlpir.nist.gov/projects/duc/duc2007/responsiveness.assessment.instructions, visited May 3rd, 2011.

1.4.2 Datasets

All the analyses and evaluations in this work are conducted on the multi-document summarization datasets created for the DUC competitions [DUC07]. These datasets are the only larger corpus available on which new ideas and system performance can be compared against previous results and research. They were developed by the American National Institute of Standards and Technology (NIST), and typically consist of source document sets together with reference summaries written by NIST assessors. The following sections briefly introduce the datasets that have been used in this thesis for analysis, evaluation and comparison of the presented solutions to text summarization. All of them can be obtained from the DUC website.8

8 http://www-nlpir.nist.gov/projects/duc/data.html (visited May 3rd, 2011)

DUC 2002 This dataset consists of 59 news article clusters, with a total of 567 documents [OL02]. The documents in each set are related to a common topic or event, and there are on average 9.6 documents per set. The documents themselves are drawn from different newswire and newspaper sources, such as the Wall Street Journal, Associated Press, and Los Angeles Times, among others. NIST assessors chose documents for the following categories of topics:

1. A single natural disaster event with documents created within at most a 7-day window (“The eruption of Mt. Pinatubo in the Philippines”)


2. A single event of any type with documents created within at most a 7-day window (“The Clarence Thomas confirmation hearings”)

3. Multiple distinct events of the same type (no time limit), e.g. “Heart attacks”

4. Biographical (documents discussing a single person), e.g. “Margaret Thatcher”

In total, there are 15 document clusters per category. NIST assessors created abstracts and extracts of different lengths to evaluate system performance. The reference extracts for multi-document summarization are 200 and 400 words long, and consist of sentences extracted without modification from the source documents. For each cluster, there are extracts by two different assessors. NIST also provided abstracts for both single- and multi-document summarization. This dataset is the only summarization corpus available which contains extractive reference summaries, and therefore the only publicly available dataset on which supervised sentence ranking and classification methods can be trained and evaluated without the need for further manual labor. We utilize this dataset in this thesis to train a supervised Support Vector Machine classifier on the task of sentence extraction for generic multi-document summarization (see Chapter 3).

DUC 2006 For the multi-document summarization task of DUC 2006 [Dan06], NIST assessors created 50 document clusters, each consisting of 25 news articles related to a single topic. The news articles were drawn from Associated Press, New York Times, and Xinhua news agency material. For each cluster, assessors formulated a topic statement describing a user’s information need that could be answered using the selected documents. The topic statement is composed of a title and a set of questions or a multi-sentence task description (see Figure 1.3 for an example task description). Participants are asked to generate summaries of at most 250 words for each cluster:

“The task [. . . ] will model real-world complex question answering, in which a question cannot be answered by simply stating a name, date, quantity, etc. Given a topic and a set of 25 relevant documents, the task is to synthesize a fluent, well-organized 250-word summary of the documents that answers the question(s) in the topic statement.” [Dan06]

The dataset is thus geared towards query-focused multi-document summarization. NIST assessors produced four model abstracts for each document cluster.


DUC 2007 The DUC 2007 dataset for query-focused multi-document summarization is similar to the DUC 2006 set [ODH07]. There are 45 document clusters, each consisting of 25 thematically related documents. As in the DUC 2006 dataset, NIST assessors wrote four abstractive reference summaries per document cluster. Pyramid annotations are available for 23 of the 45 document clusters. We utilize the DUC 2006 and DUC 2007 datasets for the evaluation of our topic model-based approaches to multi-document summarization (Chapters 4–5), and for our analyses of sentence-level semantic content units in Chapters 6–7.

Table 1.4 lists descriptive statistics for the multi-document summarization datasets considered in this thesis.

Property / Dataset                          DUC 2002   DUC 2006   DUC 2007
Number of topics (document clusters)        59         50         45
Avg. number of documents per topic          9.6        25         25
Avg. vocabulary (excluding stop words)      1,045.6    2,059.4    1,728.2
Avg. number of sentences per topic          282.6      702.5      557.4
Number of topics with Pyramid annotations   -          20         23
Number of reference summaries per topic     2          4          4

Table 1.4: Global statistics of the summarization datasets used in this thesis.

1.4.3 Baseline approaches

This section introduces several baseline methods that are commonly used in summarizer evaluation. In summarization, and especially when summarizing news article collections such as the DUC datasets, a traditionally hard-to-beat baseline is the lead baseline. It creates a summary by simply extracting the first n sentences of a source document [BMR95]. This follows from the principle of the “inverted pyramid” in news writing, where authors put the most relevant information at the beginning of an article and provide details in later paragraphs, which allows editors to cut from the end of the text without compromising the article’s readability [RM98]. In addition, news articles typically summarize previous and novel developments of an ongoing news story in the first paragraph. A comparative review of single-document summarization systems participating in the DUC 2001–2004 challenges showed that none of the systems outperformed this baseline with statistical significance [Nen05].

In multi-document summarization, the lead baseline typically contains the first n sentences of the most recent document [Dan06]. An earlier variant, employed in the DUC 2002 evaluation, selected the first sentence of the 1st, 2nd, 3rd, . . . document from a document cluster in chronological sequence until the target summary size was reached. However, this strategy is prone to select redundant content in the case of document sets consisting of closely related documents or documents created within a very brief time span. Nenkova notes that for multi-document summarization, systems in general do outperform the first, but not the second baseline [Nen05].
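Both variants of the lead baseline are trivial to implement. A sketch of the first one, assuming (for illustration only) that the input is given as (date, sentence list) pairs:

```python
def lead_baseline(documents, max_words=100):
    """Extract leading sentences of the most recent document
    until the word budget is exhausted."""
    _, sentences = max(documents, key=lambda doc: doc[0])
    summary, used = [], 0
    for sentence in sentences:
        words = len(sentence.split())
        if used + words > max_words:
            break
        summary.append(sentence)
        used += words
    return " ".join(summary)
```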

1.5 Summarization & IR

Summarization is not only a technology to condense large quantities of information. Other benefits of summaries include:

Satisfying complex information needs Summaries satisfy complex information needs by aggregating and merging information from different sources [Man01].

Highlighting similarities and differences A summary can combine information from different sources and may highlight their similarities and differences [MB99].

Removing redundant and irrelevant content Redundant or irrelevant source content is typically excluded from a summary [GMCK00].

Identifying novel information Summaries can be tailored to provide updates of novel information with respect to a user’s previous knowledge, e.g. in a developing news story [GMCK00].

Personalizing summary content Depending on the task a user is performing, or on the requirements of the user, summarizing the same source information can emphasize different aspects of the source(s) and result in very different summaries. Summaries can also provide highly concentrated digests of huge amounts of material, which are focused on particular topics or themes [Jon07].

Adapting content to user interface requirements Summarization can adapt information for display on mobile devices, in order to better exploit the limited screen size [BGMP01, ORK06]. Automatic layout generators for electronic newspapers can also benefit from the availability of condensed versions of news articles [SH09, HS08].

Enhancing user experience Summaries can enhance user experience with news aggregation and browsing sites [Nen06]. NewsInEssence9 and Columbia’s Newsblaster10 are two examples of such systems.

9 http://www.newsinessence.com
10 http://newsblaster.cs.columbia.edu, visited Nov 18th, 2010.

Summarization can thus be an invaluable tool for a wide range of future information retrieval solutions. Figure 1.5 illustrates the use of summarization for providing an informative overview of current news events. The image displays the interface of the Columbia Newsblaster news aggregation site, showing the details of a collection of thematically related news articles. The paragraph text below the headline is an automatically generated summary of the seven news articles in the collection and combines information from different source articles. The link after each sentence indicates the source article, and allows the user to access the article’s full text. Users thus get a broad overview of the most important information contained in the different news articles. This contrasts with standard news aggregation sites like Google News that simply display the first paragraph of the most recent news article from the collection. In addition, Newsblaster’s user interface allows users to track the development of a news story over time. By clicking on the event tracking link, users can access a timeline of earlier versions of the news article collection along with corresponding summaries, which reflect the developments of the news story.

Figure 1.5: Interface of the Columbia Newsblaster news aggregation and browsing site. The image shows the details and a summary of a collection of related news articles. Users get an overview of the most important information contained in the different news articles and can decide which articles to read.

Related research fields

There are several fields of research which are either related to automatic summarization, or from which automatic summarization draws ideas and methods. This section frames summarization as a Natural Language Information Processing (NLIP) task, highlights the differences between summarization and other research areas, and points out shared characteristics and technologies. The analysis follows and extends the one given in [Man01, p. 3–4]:

Information Extraction. Information extraction (IE) is concerned with extracting factual knowledge – such as the location of an event, participating persons, and the date and time of the event – from natural language text [AI99]. Typically, a predefined domain-specific template (a table) is filled with this information, which can then be used to generate a natural language text. Condensation is not a goal of this process [Man01]. In addition, an IE system would only produce a summary for content defined in the template, ignoring all other information.

Question Answering. Question answering systems attempt to provide an answer to questions such as 'At which university does Krugman teach?' (a fact) or 'What criticisms do US senators have against the current tax system?' (a list of arguments) [DKL07]. The main difference to summarization is that question answering extracts specific bits of information from a document collection, rather than the generally important information it contains. Query-focused summarization, however, can be seen as combining answers to multiple questions. As in information extraction, the goal of question answering is not to condense documents.

Text Mining. Text mining aims to discover novel or anomalous information in large text corpora, or to recognize patterns in text data [Fel06]. In recent summarization challenges, identifying and summarizing novel information has figured as a new task [DO08].

Text compression. The goal of text compression is to condense text input for efficient storage and transmission among machines, not for human consumption. The input text is treated as a code, and the condensation process takes advantage of redundancy in the input, such as re-occurring character sequences [BCW90].

Indexing. Indexing aims to provide a representation of an input document that facilitates later retrieval [SM86], which involves the creation of an inverted index mapping terms to documents. Almost all document terms are typically listed in this index, which thus does not serve the role of condensation. However, indexing can benefit from summarization, as document summaries have been used to create more efficient indices [SJ01].

Document Retrieval. The task of document retrieval is to select a subset of the documents of a collection that are relevant to a user's information need [SM86, BYRN99]. Retrieval is concerned only with presenting fewer or more documents, not with condensing the content of the retrieved documents. However, summarization may be used for the presentation of retrieved results. One very brief form of summary, well-known to every search engine user, is the text snippet provided with each search result.

Required research fields

Automatic text summarization utilizes methods and technologies from many other research fields. Natural Language Processing (NLP) [MS01] methods such as parsing, part-of-speech tagging or stemming play an important role, as does the use of linguistic resources such as WordNet.11 The field of Information Retrieval supplies many methodologies, e.g. models for document representation, as well as evaluation metrics, that are used widely in automatic summarization. In recent years, the availability of source document-summary corpora has fostered the use of Machine Learning (ML) methods, e.g. for sentence classification and ranking [KPC95, TM97, COS06, OLL07], sentence ordering [BL04], or the learning of feature values [LH97, TM02, LMFG05]. More sophisticated text analysis is the focus of methods from the fields of Textual Entailment [DDMR09, AM10] and Machine Reading [EBC06, HRL07], which combine syntactic and semantic text processing with knowledge representations and logical inference to create richer representations of source texts.

Natural Language Generation (NLG) aims to create natural language text from an internal representation of information [RD00]. NLG has figured only very infrequently in recent summarization research due to the community's focus on extractive approaches. However, in earlier research, there have been a number of summarization approaches which employed an NLG component to create summaries from internal representations of summary content [HR86, MRK95, RKEA00].

1.6 Conclusion

This chapter provided an overview of the motivations and challenges of automatic text summarization. Summarization is a powerful technique that enables humans to efficiently digest large amounts of information, making it an invaluable tool for a wide range of future information retrieval solutions. The main challenges that have to be addressed by an automatic summarizer are the analysis of natural language text, in order to adequately represent the content of source material, and the identification of important, summary-worthy information (Section 1.2). A major goal of summarization is to create well-formed and coherent text, which in addition may need to be tailored to the specific needs of a user or task. These challenges are compounded by problems relating to the complexity and variability of natural language, and by the elusiveness of the notion of importance.

11 WordNet (http://wordnet.princeton.edu) is a lexical resource that groups English words (mostly nouns, verbs, adjectives and adverbs) into sets of synonyms, called synsets [Fel98]. The different senses of a word correspond to different synsets. Synsets are provided with a short definition, and are linked with each other by semantic relations such as antonymy, hyperonymy, and hyponymy. There exist (smaller) versions of WordNet for other languages, see e.g. [Luc08].


Two main approaches to automatic summarization have emerged in research: Abstractive summarization aims to reformulate source content in novel terms based on meaning-oriented representations of source material, whereas extractive summarization focuses on the simpler strategy of selecting and concatenating relevant text passages to create a summary (Section 1.3). In the context of multi-document summarization, one of the main additional challenges is the identification of similar information, in order to ensure that the summary covers different aspects of the source material, and does not contain redundant content. Furthermore, it has been shown that the frequency of source information is an effective indicator of content importance [NVM06].

In Section 1.4, we introduced the challenges of summary evaluation, and described the metrics and datasets used by the research community and in this dissertation. The final Section 1.5 highlighted the benefits of automatic summarization for future information retrieval solutions, and framed the task of automatic summarization with respect to related research areas.

In the next chapter of this work, we will present an exhaustive discussion of previous approaches to automatic text summarization, focusing on work related to modeling the subtopical contents of multi-document summarization datasets and of human-created multi-document summaries, in order to motivate and emphasize the novel contributions of this dissertation.


Chapter 2

Related work

Introduction

Summarization is a research field with a long tradition. The first publications appeared in the 1950's and 1960's [Luh58, Edm69], focusing on extractive strategies, while later work during the 1970's and 1980's took up trends in the field of Artificial Intelligence (AI) and aimed for abstractive summarization [Leh82, DeJ82, RH88, RJZ89]. The growing number and quality of natural language processing tools, such as robust part-of-speech taggers and syntactic parsers, as well as the availability of suitable text corpora, renewed interest in automatic summarization during the 1990's [MR95, MRK95, BK97, BE97, TM97, HL99], with a shift back to extractive strategies. These years also saw the first applications of methods from Machine Learning (ML) [KPC95, AOGL99, MB98], and new research directions like multi-document summarization [MB99, GMCK00, SNM02] and multimedia summarization [MM99, Fut99, Zec01] were being investigated.

Today, automatic summarization has become a vibrant field of research, with recent years seeing a rapid growth in publications. This growth has been fueled by the competitions conducted during the annual Document Understanding Conference (DUC) [HM01, HH02, ODH07] and its successor, the Text Analysis Conference series (TAC) [TAC09], and by the availability of summarization corpora that were created in the course of these competitions. Until recently, the attention of the research community focused on the tasks of generic and query-oriented multi-document summarization, typically of news material. However, this picture is changing rapidly, and many researchers are starting to investigate the summarization of non-news material (e.g. blogs or product reviews) [HSL08, GZH10], or address other types of summarization such as update [HRL07, SJ08] and opinion summarization [KLWC05, NHMK10, GZH10]. On the other hand, approaches which aim for abstractive summarization are still scarce, and most systems opt for extractive strategies. Nevertheless, in recent research one can observe a tendency towards more complex linguistic processing during analysis and synthesis, in order to move from simple passage extraction towards symbolic representations of source and summary content, and reformulation for output generation [Jon07].

This chapter surveys existing work in automatic text summarization. It starts with an overview of classical work in Section 2.1, which serves as a basis for our subsequent discussion of different summarization approaches. Section 2.2 outlines general strategies for extractive summarization, and introduces the application of machine learning techniques to the task of automatic summarization. Section 2.3 then presents previous approaches to generic and query-focused multi-document summarization. Section 2.4 introduces a class of unsupervised learning algorithms known as latent factor models, which constitute the algorithmic basis for many of the content modeling approaches presented in Section 2.5, and for our own work presented in Chapters 4 and 5. Subsequently, Section 2.6 presents existing publications related to subsentential content units and human variation in content expression that relate to our work in Chapters 6 and 7. The final Section 2.7 summarizes the main findings and challenges of current summarization research, and motivates the contributions of this thesis.

2.1 Classical approaches

One of the first approaches to automatic text summarization was presented by Luhn in 1958 [Luh58]. Luhn's study pioneered the idea that an automatic extract of sentences can serve in place of an abstract for summary purposes. He described an approach in which sentences are scored by their component word values, ranked by this score, and selected from the top of the ranked sentence list until some predefined score threshold is reached. Word values are determined by their frequency in the source document, and stop words1 are not considered in the computation of sentence scores. Luhn's approach is a prime example of essentially statistical approaches to automatic text summarization [Jon07].

1 Stop words (or function words) are common words like pronouns, determiners and prepositions – e.g. "the", "a", "it", and "that" – that are generally assumed not to carry significant content information. They are typically words that occur very often, and thus are of little discriminative value [Zip35].
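The following sketch is our own minimal illustration of Luhn-style frequency scoring (not Luhn's original implementation); it assumes whitespace tokenization and a toy stop word list, and returns all sentences whose score exceeds the threshold, ranked by score:

    from collections import Counter

    STOP_WORDS = {"the", "a", "an", "it", "that", "of", "in", "and", "to", "is"}

    def luhn_extract(sentences, score_threshold):
        """Score sentences by summed content-word frequencies; keep the top ones."""
        words = [w.lower() for s in sentences for w in s.split()]
        freq = Counter(w for w in words if w not in STOP_WORDS)
        scored = [(sum(freq[w.lower()] for w in s.split()
                       if w.lower() not in STOP_WORDS), s) for s in sentences]
        scored.sort(reverse=True)  # highest-scoring sentences first
        return [s for score, s in scored if score >= score_threshold]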


The summarization strategy Luhn describes is motivated by a few key assumptions: it presumes that the relative importance of words is defined by lexical frequency, and that the relative importance of a sentence can be derived as a function of the values of the words in that sentence. That word frequency correlates with the importance of the corresponding concepts is well-established [NV05, Jon07], and variants of (lexical) frequency-based features serve as a major indicator of importance in many summarization systems.

Another highly influential early work on text summarization was published by Edmundson [Edm69]. He presented a study of the abstracting behavior of humans, which allowed him to identify additional features, besides component words, that signal sentence importance. Edmundson proposed features based on cue phrases, title words and sentence location. Cue phrases are phrases which signal important or unimportant information in a text. These can, for example, be words indicating in-text summaries, such as "in conclusion", comparatives and superlatives, and on the other hand belittling expressions which hint at unimportant information. The emphasis of words appearing in the document's title is motivated by the idea that in titles and headings an author herself summarizes the main notions of a document. The location feature captures the intuition that information occurring at specific positions, such as the beginning or the end of documents and paragraphs, is more likely to carry salient information, an observation first put forward by Baxendale [Bax58].

In Edmundson’s approach, the discussed features were computed once foreach sentence. The title feature, for example, was calculated by counting thenumber of words from the document’s title that occurred in a given sentence.To determine an overall sentence score W (s), Edmundson then computed aweighted linear combination of a sentence’s feature values:

W(s) = w_1 C(s) + w_2 K(s) + w_3 L(s) + w_4 T(s)    (2.1)

where C(s), K(s), L(s), and T(s) correspond to the cue phrase, key term, location and title word scores of sentence s, and the feature weights w_i determine the relative influence of each feature. Edmundson found that the new features dominated Luhn's word frequency feature. In his study, location was the single best feature, and using cue phrases, location and title words together gave the best performance. Variations of Edmundson's original features play an important role in much of summarization research. Furthermore, the features evaluated in his study are very similar to the cues used by professional abstractors (see Section 1.1).
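As a hedged sketch of Equation 2.1 (our illustration; the feature functions below are deliberately simplified stand-ins for Edmundson's corpus-derived cue lists and trained weights):

    CUE_PHRASES = {"in conclusion", "in summary", "significantly"}  # toy cue list

    def edmundson_score(sentence, index, num_sentences, title, key_terms,
                        weights=(1.0, 1.0, 1.0, 1.0)):
        text = sentence.lower()
        tokens = set(text.split())
        c = sum(1 for cue in CUE_PHRASES if cue in text)              # C(s): cue phrases
        k = len(tokens & key_terms)                                   # K(s): key terms
        l = 1.0 if index == 0 or index == num_sentences - 1 else 0.0  # L(s): location
        t = len(tokens & set(title.lower().split()))                  # T(s): title words
        w1, w2, w3, w4 = weights
        return w1 * c + w2 * k + w3 * l + w4 * t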

Sentence significance can also be derived from a sentence's relations to other parts of the source document. In his influential study, Skorokhod'ko [Sko72] proposed to model relationships between sentences as a graph, where sentences are nodes and edges correspond to relations between sentences. Figure 2.1 shows an example sentence graph consisting of six sentence nodes and several edges. In Skorokhod'ko's approach, sentences were linked with each other if the number of words they shared was larger than some predefined threshold.

Figure 2.1: Example sentence graph. The figure shows a sentence graph consisting of six sentences originating from two different documents D1 and D2. Sentences correspond to nodes in the graph, and edges between nodes are inserted if the similarity score of two sentences exceeds a predefined threshold.

Sentences were then assumed to be more important if they had many significant links to other sentences, or if the deletion of a sentence would have caused a larger change in the graph structure. Skorokhod'ko's approach thus contrasted with earlier work by not considering sentences in isolation, but instead utilizing their relations to each other. Furthermore, such a sentence relationship graph can capture some aspects of the document's discourse structure [Jon07]. Approaches that are based on graph representations of document content have since been a major avenue of research.
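A minimal sketch of this graph construction (our illustration, assuming whitespace tokenization and an arbitrary word-overlap threshold):

    from itertools import combinations

    def build_sentence_graph(sentences, threshold=2):
        """Link two sentences if they share more than `threshold` words."""
        tokens = [set(s.lower().split()) for s in sentences]
        edges = []
        for i, j in combinations(range(len(sentences)), 2):
            if len(tokens[i] & tokens[j]) > threshold:
                edges.append((i, j))
        return edges  # a node's importance can then be estimated from its degree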

Following these early extractive approaches, summarization research took a sharp turn towards abstractive summarization during the late 1970's and 1980's. This paradigm shift was driven both by general trends in Artificial Intelligence (AI) during those years, as well as by the introduction of formal knowledge representation and inference models, such as semantic networks, frames, and scripts [Sch73, SR81]. These models provided a methodology for working with conceptual, domain knowledge representations of source content, and promised some form of text understanding, which could subsequently lead to meaning representations and transformations for summary generation. However, most of the proposed approaches were extremely restricted (e.g. to very limited domains), and typically did not generalize or scale well [Jon07].

One of the first examples of an abstractive summarizer was presented by DeJong. His summarizer, called FRUMP [DeJ82], instantiated predefined sketchy scripts – templates which defined the important events that were expected to occur in a specific situation – by scanning the source text for expressions matching the specified event types. FRUMP then produced a summary from the instantiated script by using a natural language generation module. In DeJong's approach, content importance was predetermined and represented in terms of world knowledge about what is expected to be salient in a particular situation. Source analysis was restricted to extracting information predefined in the script, and all other information was assumed to be unimportant. The main weakness of the approach lay in its brittleness, as it did not extend easily to new situations or summary purposes. Furthermore, scripts as well as recognition criteria had to be predefined by hand. A range of summarization systems similar in spirit, differing only in the parsing technologies and knowledge representations used, were proposed subsequently by various authors [RJZ89, PJ93, MR95, SL02].

In another line of work, the TOPIC system presented by Hahn and Reimer [HR86, RH88, HR99] utilized a concept hierarchy to represent domain knowledge in the area of computers and technical products. The hierarchy encoded "is-a" relations and "has-part" properties, such that for example the concept "Computer" had the sub-concepts "Workstation", "PC" and "Laptop". Summarization was then viewed as finding those parts of the concept hierarchy that were talked about in a given text. The system employed syntactic parsing to identify noun phrases which referenced instances of these concepts, and counted how frequently concepts were referred to in the text. The system's main appeal was its ability to generalize concepts on the basis of the ontology's hierarchical relations, but determining the 'right' level of generalization was found to be a problem in itself. Furthermore, like all knowledge-rich approaches, this summarizer suffered from the fact that the domain-specific concept hierarchy it used would have had to be recreated for each new domain.

As there has been little research in (purely) abstractive summarization in recent years, we will conclude our discussion of classical work with a brief overview of noteworthy abstractive summarizers, before shifting our review to the multitude of extractive systems seen in the past two decades.

McKeown and Radev [MR95, RM98] presented a symbolic summarization system which combined information extraction template representations of multiple news stories to create a summary. Template combination was based on the identification of relationships such as contradiction, change of perspective, or information addition. Saggion and Lapalme [SL02] also employed a template-based approach for summarizing technical articles, where templates represented indicative or informative content types such as the Topic of a section or Experiment. The synthesis phase of summarization was the focus of work by McKeown et al. [MRK95], who described techniques for a rule-based revision of sentences in order to incrementally pack information into linguistic constituents. Maybury [May95] also focused on summary generation, and constructed summaries from military event logs. His approach utilized statistical measures to determine event significance, but also employed aggregation and generalization operations to create more compact event expressions. Reithinger et al. [RKEA00] described a system which can summarize spoken dialogues about negotiations in the domain of travel planning.

2.2 Extractive summarization

The basic extractive strategies outlined by Luhn and Edmundson are naturally extensible in several ways. The proposed statistical measures can be computed for whatever is taken as a source passage, be it phrases [BK97, BME99], sentences [BMR95, KPC95, MB97, GKMC99] or paragraphs [MSB97, SSMB97, SSWW99]. Similarly, the measures cannot only be applied to lexical elements (words), but also to more sophisticated element types which consider the linguistic relations between words, such as concepts [MB98, MB99, SNM02], word n-grams [BL04, GF09], grammatical constituents like noun phrases [BK97, BE97], or logical forms [BME99, VBM04, HL05, TJ05, LMFG05]. Weighting schemes for different elements and element types can be used to differentiate their relative contribution towards passage importance, for example by taking into account corpus characteristics [BMR95, LH00], grammatical notions of salience [BK97], cohesion relations [BE97], or statistical association with predefined topics [LH00, Har04].

Various authors have also moved beyond the basic groups of features first described by Edmundson and investigated more elaborate and novel features that may contribute to passage significance [BMR95, LH97, MB97, Mar97a, BE97, HL99], or that characterize passages with respect to query [MB98, GKMC99, FR06, DM06] and subtopic representations [GL01, HL02, HL05, WWLL09]. The question of how to efficiently combine and weight different features led to the application of methods from machine learning [KPC95, MB98, AOGL99]. ML methods are also employed in determining optimal feature values [LH97, TM02, LMFG05, YGVS07], or for determining the content structure of source texts [GL01, BL04, NVM06]. In addition, as outlined by Skorokhod'ko's and Earl's work [Ear70, Sko72], words and sentences do not exist in isolation. The former are part of (often complex) sentence structures, the latter part of the document's discourse structure. Identifying and utilizing these structures has been a major focus of research in automatic summarization [BE97, BK97, SSMB97, Mar99, BME99, FH04, VBM04, LMFG05, WY08].

The general strategies presented in Luhn's and Edmundson's approaches are typically still employed in many current extractive summarizers. It has become common practice to preprocess the source text in order to detect sentence and word boundaries. Subsequently, various statistical and linguistic features are computed for each sentence. The feature values of each sentence are then weighted and combined to derive a final importance score for each sentence, and sentences are ranked in order of their scores. The condensation phase involves selecting the highest-ranked sentences, which are concatenated until some predetermined summary length (often specified as a number of words) is reached. Redundancy is accounted for by applying some measure of content overlap, such as Maximum Marginal Relevance (MMR) (see Section 2.3.2). The extracted sentences are typically re-arranged to appear in the same order as in the original document, which is a simple strategy for ensuring a minimal amount of coherence.

Before we proceed with our survey of summarization research, we will briefly introduce the Vector Space Model, which has been adopted by many summarization researchers as a model to represent sentences and documents, and for feature computation.

Vector Space Model The Vector Space Model (VSM) is a basic methodology proposed by IR researchers which reduces each document to a vector of suitably weighted words, w = (w_1, w_2, . . . , w_n), where w_i is the value of word i [SM86, BYRN99]. In this word space representation, each word corresponds to a different dimension, and the total number of dimensions is fixed by the size of the vocabulary. A common weighting scheme for word values is tf-idf, where each word is weighted by combining its frequency count in a document with its inverse document frequency, i.e. its frequency of occurrence in a larger corpus of documents [SM86].
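A toy sketch of tf-idf weighting (our illustration; real systems typically add smoothing or sublinear term frequency variants), assuming the document itself is part of the background corpus:

    import math
    from collections import Counter

    def tf_idf_vector(doc, corpus):
        """doc: list of tokens; corpus: list of token lists (doc included)."""
        tf = Counter(doc)
        num_docs = len(corpus)
        df = Counter(w for d in corpus for w in set(d))  # document frequencies
        return {w: f * math.log(num_docs / df[w])
                for w, f in tf.items() if df[w]}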

Many ways of measuring the similarity of two text documents are based on comparing their vector representations. Formally, a similarity function is defined as follows:

d = f(x, y),    (2.2)

where f(x, y) is a function that measures the similarity of x and y. A popular measure for computing f(x, y) is the cosine similarity. The cosine measure of two n-dimensional vectors x and y in a real-valued space is calculated as the inner product of the vectors, normalized by the product of their Euclidean lengths:

cos(x, y) = xᵀy / (∥x∥ ∥y∥)    (2.3)

This measure ranges from 1.0 for vectors pointing in the same direction, through 0.0 for vectors orthogonal to each other, to −1.0 for vectors pointing in opposite directions. Other common similarity measures include the Dice coefficient, the Jaccard coefficient, and measures of distributional similarity such as the Kullback-Leibler divergence [MS01].
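A small sketch of Equation 2.3 over bag-of-words count vectors (our illustration, assuming whitespace tokenization; tf-idf weights would be used in practice):

    import math
    from collections import Counter

    def cosine(x: Counter, y: Counter) -> float:
        """Cosine similarity of two sparse word-count vectors."""
        dot = sum(x[w] * y[w] for w in x.keys() & y.keys())
        norm = math.sqrt(sum(v * v for v in x.values())) * \
               math.sqrt(sum(v * v for v in y.values()))
        return dot / norm if norm else 0.0

    s1 = Counter("the president visited the flooded region".split())
    s2 = Counter("the region was flooded last week".split())
    print(cosine(s1, s2))  # lexical overlap yields a moderate similarity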

Despite being successfully used in many IR and NLP applications, the standard word-based VSM is limited in several ways. Since each word corresponds to a different dimension of the vector space, the similarity of two documents will be determined on the basis of their lexical overlap. This makes it difficult to measure the conceptual similarity of documents, as plain word matching has severe drawbacks due to the ambiguity of words and to differences in word usage and personal style across authors. To alleviate this problem, the representation of documents in word space is often replaced by a representation in concept space, where each concept (or term) aggregates several words based on their morphological and semantic relations. Using stemming algorithms, for example, one can reduce words to their stems by stripping inflectional and derivational affixes, and then group together words sharing the same stem [Por80, MS01]. Alternatively, one can group together synonymous or otherwise semantically related words to create vectors of high-level concepts. However, the identification of semantic relations usually requires that appropriate lexico-semantic dictionaries, such as WordNet, are made available to the application. Such resources are often expensive to create and to maintain, and may not be available for specific domains or languages.

A second drawback of the VSM arises from the fact that word vector representations are often very sparse. There are typically only very few non-zero entries in each document's vector, given the fixed dimensionality of the vector space, which is determined by the much larger vocabulary of a document collection. In text summarization, which usually needs to represent sentences and short queries instead of documents, this problem is even more pressing than in traditional document retrieval [DM06]. Word aggregation strategies, such as stemming or concept identification, may help to reduce sparsity-related problems. Another solution is offered by dimensionality reduction techniques, which will be introduced in Section 2.4.

A third problem of the word vector representation is that it considers words as independent of each other, and does not capture relations among different words in a document. Each word or concept corresponds to a separate dimension of the word space. The representation thus disregards the way words co-occur, or how they are combined to form a clause or a sentence. Co-occurrence models, which are discussed in Section 2.4, are one approach to address this problem.

2.2.1 Lexical elements and features

Many early extractive summarization approaches have chosen VSM representations for sentences and documents and have investigated the effects of different term aggregation, selection and weighting strategies. We will describe the most important strategies next.

Word frequency in a document, as utilized in Luhn's and Edmundson's approaches, captures how important a word is within a document. However, it does not capture how discriminative, or semantically focused, a word is in a collection of documents [MS01]. Words that are spread homogeneously over many documents are generally assumed to be uninformative, and may not be useful for characterizing sentence importance. Stop words are prime examples of such uninformative words. Brandow et al. [BMR95] illustrate the use of corpus statistics to account for this observation. The authors calculate tf-idf scores for each word and select a subset of words significant for the document to weight sentences.

Aone et al. [AOGL99] investigate the effects of term aggregation and feature selection on summarizer performance. The authors apply stemming to link similar word forms, and identify named entities (e.g. person, organization and location names) and name aliases (e.g. 'IBM' for 'International Business Machines') in order to count name references rather than name mentions. They also collect collocations from a large document corpus, which again influences term aggregation. Their experiments show that the different ways of identifying basic element types, and thus of aggregating different terms, can impact summarization performance.

Term aggregation using lexico-semantic resources instead of morphological processing is employed in the summarizer by Mani and Bloedorn [MB98, MB99]. The authors utilize WordNet's synonymy and hypernymy relations, and apply co-reference resolution to aggregate semantically related words. WordNet has been used as a lexical resource for term aggregation in many summarization systems, and various authors have explored the utility of the different semantic relations it encodes [BE97, HL02, SNM02, VBM04, LMFG05, Nas08, WLZD08].

The use of a larger background corpus to generate a more descriptive set of words, as introduced by Brandow et al., is also pursued in Hovy and Lin's SUMMARIST system. The authors use a corpus of news articles pre-classified into different topics to collect signature words for each topic, i.e. words that are closely associated with a predefined topic. The topics under consideration in this approach are broadly-defined news article categories, such as 'Banking' or 'Agriculture'. Together, the set of signature words for a given topic constitutes a so-called topic signature. The approach is motivated from the perspective of feature selection [RY02], aiming to narrow down a large and potentially imprecise vocabulary to a set of words highly characteristic for a given topic. In later work, the authors refined their approach by selecting topic signature terms based on the likelihood ratio of their occurrence in a collection of topic-relevant documents as compared to a corpus of non-relevant documents [LH00].

Topic signatures have turned out to be a good indicator of importance, and are used in many summarization systems [CSGO04, Har04, HL05, CSO06, HL10]. However, topic signatures are based on a document-level relevancy decision. Documents from the topic-relevant corpus which are only loosely associated with a topic (e.g. documents that have the topic as a subsidiary theme only) introduce 'noise' terms into the topic signatures. The topics in general are coarse-grained, and an identification of subtopics is only possible if an appropriately fine-grained corpus of preclassified documents exists. Terms that refer to subtopics may be included implicitly in the topic signature, but the subtopical relation is not made explicit.

2.2.2 Syntactic and discourse structures

Most of the elements and features discussed in the previous section consider elements and passages as independent. The influence of passage- or document-level structure is expressed solely through the statistical salience model for elements, e.g. by applying term aggregation and calculating frequencies within a document to estimate importance. However, passages are not simply random sequences of words, but rather structured information ordered by syntactic, semantic and discourse rules, all of which can provide valuable clues to an automatic summarizer.

The summarization system by Barzilay and Elhadad [BE97] constructs lexical chains that capture the semantic relations between words, and uses the chains to model topic progression through a text. Lexical chains are sequences of noun phrases which are linked together based on lexical cohesion relations. Repetition and WordNet's synonymy and hyponymy relations are used to identify semantically related words, and WordNet path length is used to estimate the link weight of different noun groups in a chain. Chains are scored by summing link weights, and sentences associated with high-scoring chains are selected to create a summary. However, the high degree of polysemy encoded in WordNet leads to a combinatorial growth of candidate chains and an exponential complexity of the algorithm. Barzilay and Elhadad address this issue by pruning low-scoring chains. In later work, Silber and McCoy [SM02] present an efficient algorithm that computes lexical chains in linear time.

Ono et al. [OSM94] and Marcu [Mar97a, Mar99] propose approaches which utilize macro-level discourse structures for summarization. In both approaches, the discourse structure of source texts is modeled on the basis of Rhetorical Structure Theory (RST) [MT88]. In his study, Marcu shows that there is a strong correlation between the nuclei of RST trees and what readers perceive as the most important information in a text. The summarizer uses the generated discourse tree to assign scores to tree elements based on element depth, and selects sentences that span the major rhetorical nodes of the tree to construct a summary. However, the utilized RST parser was constructed by hand, and RST parsers in general depend on domain-specific lexico-syntactic patterns, making them hard to adapt to new domains and not robust with respect to the dynamics of language [Mar97b]. More recent approaches that consider discourse structure for guiding an automatic summarizer include Thione et al. [TVdBPC04] and Bosma [Bos08].

Harabagiu and Lacatusu [Har04] extend the idea of topic signatures by hypothesizing that topics are not only characterized by terms, but also by relations between terms. These relations are determined by a syntactic analysis of sentences, which identifies verb phrase-noun phrase (VP-NP) constructions. Similar to the acquisition of topic signature terms, characteristic VP-NP relations are learned from a corpus preclassified into topics. During summarization, sentences are then scored not only by the number of topic signature words, but also by the number of relations they contain. An evaluation on the DUC 2002 multi-document summarization data set shows that this approach outperforms a model using only topic signatures.

As syntactic and semantic parsing technologies have become more robust, different researchers have studied representations based on logical form analyses of input sentences. Tucker and Sparck Jones identify predicate-argument structures, and Vanderwende et al. [VBM04] as well as Leskovec et al. [LMFG05] utilize dependency parsing to extract logical forms from sentences. In the summarization approach described by Vanderwende et al., each sentence is analyzed to construct its logical form, i.e. its dependency tree. From the tree, logical triples (nodes and the relation between them) are extracted and linked to other triples to create a graph. In this graph, predicates (verbs) and arguments (nouns) correspond to nodes, and edges are constructed from the semantic relations between nodes. Words from different sentences denoting the same concept are mapped to the same node on the basis of lexical similarity and grammatical information. The authors then determine the weight of each node by applying the well-known PageRank algorithm [BP98]. To create a summary, the system extracts sentences containing highly-weighted predicate nodes, which are assumed to correspond to events. Sentences containing highly-weighted noun nodes, on the other hand, are included in the summary to provide reference information on the referred entity.

Wang et al.’s [WLZD08] approach illustrates the combination of lexico-semantic resources like WordNet and semantic parsing technologies to calcu-late more sophisticated measures of content similarity. The authors proposeto label words with semantic roles, such as “Actor”, “Location” or “relation”,using the PropBank semantic annotation scheme [PGK05]. Pairwise sentencesimilarity is then calculated based on words occurring in the same semanticrole and having a direct semantic relation, such as synonymy or hypernymy,in WordNet. These values are used to construct a pair-wise similarity matrix,which is decomposed using matrix factorization methods [GVL96] in orderto find clusters of similar sentences (see Section 2.4.1). The most informativesentences from each cluster are selected to create the summary.

2.2.3 Machine learning

Given the range of potentially interesting features that can be used to characterize source passages, the question arises of how they can be weighted and combined, and how one can determine which features are the most useful for summarization. In addition, the contribution of features can vary for different text genres and domains, and it would be tedious to manually tune feature weights for each new setting. Machine learning methods offer a range of solutions to these questions [Mit97, Bis07].

Supervised approaches The application of ML techniques to the task of automatic summarization was pioneered by Kupiec et al. [KPC95]. In their study, the authors trained a Naïve Bayes classifier on the task of determining summary-worthy sentences, given a corpus of research articles with corresponding summaries. Kupiec et al. represented each sentence as a vector of features, and labeled each sentence as summary-worthy or not depending on whether the sentence occurred in the document's reference summary. The sentence features they considered were similar to the ones described by Edmundson. An experimental evaluation of the proposed approach showed that the learned feature combination led to a significant boost in summary quality. Encouraged by these results, a number of other researchers explored ML techniques for learning feature weights and optimal feature values [TM97, MB98, HL99, HIMM02, AG02, YGVS07]. Other researchers have used ML methods to learn sentence orderings for summary generation [BL05], or to reduce redundancy [LZX+09].
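To illustrate the setup (our sketch, not Kupiec et al.'s implementation: the features and training pairs below are toy stand-ins, and scikit-learn's BernoulliNB stands in for their original classifier):

    import numpy as np
    from sklearn.naive_bayes import BernoulliNB

    # toy training pairs: (sentence, 1 if it matched the reference summary)
    train_data = [
        ("The court announced its verdict on Monday.", 1),
        ("Reporters crowded the hallway.", 0),
        ("The ruling ends a five-year legal battle.", 1),
        ("It was raining outside.", 0),
    ]

    def features(sentence, index):
        tokens = sentence.lower().split()
        return [
            1 if index < 2 else 0,            # location: near document start
            1 if len(tokens) > 6 else 0,      # longer sentences carry more content
            1 if "verdict" in tokens else 0,  # toy stand-in for a cue/key term
        ]

    X = np.array([features(s, i) for i, (s, _) in enumerate(train_data)])
    y = np.array([label for _, label in train_data])
    model = BernoulliNB().fit(X, y)
    scores = model.predict_proba(X)[:, 1]  # P(summary-worthy | features)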

The crucial problem for supervised approaches to sentence selection, however, is the difficulty of obtaining training data [AG02]. Often, it is available only in the form of human-written abstracts. Sentences from the abstract therefore do not exactly match source sentences, making it difficult to adequately label the latter. Kupiec et al. solve this problem by allowing partial matches, while other researchers have manually labeled gold-standard sentences [TM97, RJB00, LMFG05]. Later work focused on exploiting the reference summaries provided in the DUC challenges (see Section 1.4.1), for example by estimating source sentence "oracle" scores. This approach is illustrated e.g. by Ouyang et al. [OLL07], who calculate sentence oracle scores on the basis of word n-gram likelihoods in reference summaries, and by Schilder et al. [SK08], who assign each source sentence a score of summary-worthiness based on its cosine similarity to the reference summaries. However, sentence labeling cannot be determined by lexical overlap alone, but must also take into account semantic similarities, as discussed in Section 1.4. Furthermore, a single abstract may not be sufficient given human variability in content selection, and thus the preparation of training data must handle multiple, possibly differing labels per sentence.

Summary annotations can also be used to label source sentences as summary-worthy. For example, Fuentes et al. [FAR07] exploit Pyramid annotations to label as positive instances all source sentences containing at least one Summary Content Unit. Another solution to the problem of obtaining training data is the use of semi-supervised approaches, where smaller amounts of labeled training data are combined with large amounts of unlabeled data. Amini and Gallinari [AG02] illustrate such a semi-supervised classification scheme, and show that with only 10% labeled training data, their approach performs comparably to Kupiec et al.'s fully supervised approach, and outperforms it when using more training data.

Unsupervised approaches The difficulties involved in obtaining and working with labeled training data in the context of automatic summarization have stimulated the use of unsupervised ML algorithms, which do not require such data. Among these, clustering, graph-based ranking and latent factor methods have raised the greatest interest.

Clustering algorithms partition data sets into groups of similar data, such that the resulting groups contain data that is similar to each other, but dissimilar from the data in other groups [Bis07]. They iteratively group elements of a set S on the basis of a distance function f(x, y), where x, y ∈ S and x ≠ y. A popular approach is to represent sentences as word vectors in the VSM, and to use the cosine measure to determine f(x, y). Typically, the word vectors x and y are assigned to the same cluster if their distance is smaller than some threshold ε, which has to be determined experimentally. Popular clustering algorithms include k-means [Mac67] and spectral clustering [Wei99, NJW01]. As a result, one obtains a set of clusters, where each cluster contains a subset of the sentences in S. Clustering based on word vector representations identifies similar information at the level of lexical similarity, by matching dimensions in the word space.
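A hedged sketch of such threshold-based grouping (our illustration of a simple single-pass variant; k-means or spectral clustering are the more common choices), assuming sentences are given as dense numpy vectors:

    import numpy as np

    def threshold_cluster(vectors, eps=0.5):
        """Greedily assign each vector to the first cluster whose centroid
        lies within cosine distance eps; otherwise start a new cluster."""
        clusters = []  # list of lists of vector indices
        for i, v in enumerate(vectors):
            for cluster in clusters:
                centroid = np.mean([vectors[j] for j in cluster], axis=0)
                cos = v @ centroid / (np.linalg.norm(v) * np.linalg.norm(centroid))
                if 1.0 - cos < eps:  # distance below threshold: same cluster
                    cluster.append(i)
                    break
            else:
                clusters.append([i])
        return clusters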

Sentence clustering for text summarization is illustrated by Nomoto and Matsumoto [NM01], who employ k-means clustering to find topical groups of sentences. Lacatusu et al. [LHR+06] employ k-Nearest Neighbor clustering using the cosine similarity measure to group similar sentences, and extract only the top-ranked sentence from each cluster to reduce the likelihood that redundant information is included in a summary. Wang et al. [WWLL09] incorporate sentence clusters as nodes in a hypergraph, which allows them to determine sentence relevance with respect to subtopics using standard graph-based ranking algorithms. Other researchers employ clustering as a preprocessing step; for instance, McKeown et al. [MBE+02] initially cluster sentences in order to facilitate the identification of their syntactic overlap.

Graph-based approaches to text summarization represent relations between text passages, typically sentences, in the form of a graph, and utilize features derived from the resulting graph to rank sentences. In such a graph, nodes correspond to sentences (or sentence-derived representations), and edges are based on lexical or logical relations between nodes (see Figure 2.1). Classical work by Skorokhod'ko [Sko72] illustrates sentence graphs constructed on the basis of lexical relatedness between sentences.

An influential work in this area was presented by Erkan and Radev [ER04], whose LexRank approach is based on the assumption that in a sentence graph, sentences "vote for each other" on the basis of the words they contain, and sentences that receive many votes are considered important. This idea is implemented in the PageRank algorithm [BP98]. In the proposed approach, generic summaries are constructed by selecting the sentences with the highest weight after convergence of the PageRank algorithm. The authors evaluate their summarizer on data sets of the DUC 2003 and 2004 conferences, and report results as good as or slightly better than the top-ranked participants of the respective DUC competitions. Similar results are also reported independently by Mihalcea and Tarau [MT04, Mih05] for single-document summarization. A major benefit of both approaches is that they are fully unsupervised and domain- and language-independent. However, a single global model of centrality, as computed by PageRank, is not well-suited to adequately represent the different subtopics of a document collection. Furthermore, in Erkan and Radev's approach, the edges in the sentence graph are undirected, as opposed to the original PageRank formulation, which assumes directed edges. The use of undirected edges causes weights to flow back and forth between nodes, which has been shown to lead to weak performance of the PageRank algorithm in other domains [Wet09].
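A compact power-iteration sketch of this idea (our illustration, not the authors' implementation; it assumes a symmetric, non-negative sentence similarity matrix in which every row has at least one non-zero entry, e.g. by thresholding cosine similarities and keeping self-loops):

    import numpy as np

    def lexrank(sim, damping=0.85, iters=50):
        """PageRank-style scores over a sentence similarity matrix."""
        n = sim.shape[0]
        m = sim / sim.sum(axis=1, keepdims=True)  # row-stochastic transitions
        scores = np.full(n, 1.0 / n)
        for _ in range(iters):
            scores = (1 - damping) / n + damping * (m.T @ scores)
        return scores  # a higher score means more "votes" from similar sentences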

Otterbacher et al. [OER05, OER09] extend the LexRank approach for query-focused multi-document summarization, using a combination of a sentence's similarity to a query and to other sentences to weight nodes. The authors report a competitive Rouge-2 recall score on the DUC 2006 dataset. Similar work by Wan et al. [WYX06] represents the query-similarity of sentences as a personalization vector for the PageRank algorithm. Wang et al. [WWLL09] generalize the above models to a hypergraph, where nodes are sentences and hyperedges connect nodes or arbitrary node sets. Standard edges connect sentences on the basis of lexical overlap, and hyperedges are used to model sentence-subtopic similarities, where subtopics correspond to node sets consisting of lexically related sentences. The authors report a high Rouge-2 recall score for the DUC 2006 dataset, a result that is similar to the results reported in this thesis (Chapter 4). Subtopic integration into a sentence graph is also investigated by Wan and Yang [WY08].

Latent factor models, such as Latent Semantic Analysis [DDF+90], Probabilistic Latent Semantic Analysis [Hof99b] and Latent Dirichlet Allocation [BNJ03], also work in an unsupervised fashion. Similar to clustering approaches, these models can be used to partition the set of source passages into collections of thematically related passages [GL01, SPKJ07, AR08b]. We will discuss latent factor models extensively in Sections 2.4 and 2.5.

2.3 Multi-document summarization

In multi-document summarization, where the input is typically a collection of thematically related documents, slightly richer representations are required than in single-document summarization. The summarization system must deal with the redundancy inherent in the document collection, as many documents repeat the same or similar content. Many systems opt to extract sentences iteratively, and apply a measure of content overlap between candidate sentences and the current summary. Candidate sentences which repeat content already contained in the summary are penalized, and thus receive a lower overall score. However, summarization systems can also benefit from the repetition of information, as frequently repeated content is often assumed to be important.

The summarization system must also account both for the overall content of the document collection and for the different subtopics appearing in it, to ensure the coverage of different aspects of the main theme. A characterization of the collection and its subtopics is required, and sentences are scored against both. Subtopics in turn may be weighted to distinguish their relative importance.

Finally, recent summarization competitions have addressed the task of query-focused multi-document summarization [DUC07]. For query-focused summarization, systems must adequately translate the query or topic statement into features which indicate the topic relevance of source elements, and take into account that the specified information need may be more complex than a simple question.

2.3.1 Early work

Mani and Bloedorn [MB97, MB99] present one of the first multi-document summarization systems. The authors approach the problem of highlighting similarities and differences between pairs of documents from a graph perspective. Each node of the graph corresponds to a concept, and edges are constructed on the basis of cohesion relations between concepts. Concepts are extracted from text using a variety of linguistic tools, such as a phrase identification component and a named entity recognizer, and linking relies on lexical overlap, WordNet's synonymy and hypernymy relations, and co-reference resolution. The authors use a spreading activation algorithm to find nodes related to a user query, and employ graph matching to identify similar and differing concept nodes.

The study of Barzilay et al. [BME99] aims to identify similar information from a syntactic perspective, with the goal of fusing information from different source sentences. The authors move from sentence extraction to phrase extraction by identifying and synthesizing phrasal intersections of similar sentences. They manually construct a set of syntactic and semantic paraphrasing rules, which are applied to dependency tree pairs to create intersections of similar predicate-argument structures. In later work, Barzilay and McKeown [BM05] rely on a bottom-up alignment of parse tree elements to identify phrases to be fused.

Schiffman et al. [SNM02] present a summarizer for loosely similar documents, which uses a combination of concept counting (exploiting WordNet's synonymy, hypernymy and hyponymy relations), verb specificity, and global information about words that are likely to appear in the lead sentences of news articles to determine the summary-worthiness of sentences. The summarizer is part of the fully operational Newsblaster system [MBE+02], and illustrates a range of representations and processes – including statistical and symbolic ones – that may be used in summarization.

The Mead system presented by Radev et al. [RJB00, RJST04] considers the relative importance of a sentence with respect to all articles of a document collection as a novel feature. Extending earlier work on the use of tf-idf-based relevance features by Aone et al. [AOGL99], the authors calculate for each sentence how well it represents the collection's main theme. To characterize the main theme, the approach first computes a centroid vector, i.e. a representation of the collection's statistically most important words. A sentence's "collection" feature score is then calculated as the cosine similarity of the sentence and centroid vectors. The Mead system performed well in the DUC 2001 and 2002 competitions [HM01, HH02].

2.3.2 Redundancy and frequency

To address the problem of redundancy in MDS, Carbonell and Goldstein [CG98] introduced an approach called Maximum Marginal Relevance (MMR), which combined relevance and novelty criteria in the ranking of sentences. The novelty criterion measures the degree of dissimilarity between a ranked list of candidate sentences and the sentences already selected for the summary:

MMR ≝ arg max_{S_i ∈ R\D} [ λ Sim_1(S_i, q) − (1 − λ) max_{S_j ∈ D} Sim_2(S_i, S_j) ]    (2.4)

where R is a ranked list of sentences, q is a query (or some other relevance criterion), D is the subset of sentences already selected for the summary, R\D is the set of candidate sentences not yet selected for the summary, and Sim_1 and Sim_2 are similarity measures. The novelty measure Sim_2 can be the same as Sim_1. Typically, sentences and the summary are represented as word (or concept) vectors and compared with the cosine similarity measure. λ is a tuning parameter that balances the relative influence of the similarity penalty and the relevance factor. This formulation of MMR penalizes a candidate sentence by comparison with the most similar summary sentence.


Other variants of MMR measure the similarity to centroid representations of the summary's sentences. Carbonell and Goldstein report that the application of MMR is extremely useful for multi-document summarization, but less so for single-document summarization, since a single document typically does not contain as much redundant information [GMCK00].

MMR is a greedy optimization scheme, since it adds candidate sentences iteratively and recomputes the redundancy criterion for all remaining candidate sentences in each iteration. An alternative greedy scheme for penalizing redundant content is based on discounting the weights of words already included in the summary [GL01, FH04, NV05]. However, given that in current summarization tasks a summary is typically constrained to a fixed size (in words), finding a selection of sentences that maximizes relevance and minimizes redundancy is an example of a global inference problem, related to the knapsack problem [KPP04]. Different authors have therefore proposed alternative solutions, including dynamic programming [McD07], stack decoding [YGVS07], and integer linear programming formulations [McD07, GF09]. MMR however remains the de-facto standard algorithm due to its simplicity and efficient computability. Alternatively, summarization systems can select representative sentences from clusters of related sentences, or apply matrix decomposition methods to select sentences from different dimensions of the latent space, as discussed in more detail in Section 2.5.

The redundancy of information observed in MDS document collections can also be exploited as a useful feature to identify important content. Nenkova and Vanderwende [NV05, NVM06] present an influential study that isolates the contribution of frequency information in MDS from that of other features. The authors find that the frequency of words in input documents strongly correlates with their appearance in human-written reference summaries. Words that are very frequent in the source documents are also very likely to be included in reference summaries. This observation supports the (heuristically motivated) frequency-based features that are at the core of many summarization systems. At the same time, the authors note that frequency alone does not completely explain human choices in content selection, as there are also many low-frequency input words appearing in summaries.

On the basis of their observations, Nenkova and Vanderwende propose a summarization approach that assigns words a weight equal to their probability in the input document collection, and calculates sentence importance as the average probability of the words occurring in a sentence. The system handles redundancy in an innovative fashion: instead of applying MMR, it discounts the probabilities of words that occur in sentences already included in the summary. Thus, candidate sentences that contain these words will receive a lower score, as compared to candidate sentences containing words not yet "used". The performance of this single-feature, unsupervised system on the task of generic multi-document summarization is comparable to the best systems in DUC 2004.
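A minimal sketch of this scheme, commonly referred to as SumBasic, is given below (our illustration: sentences are given as non-empty token lists, the discounting step squares a used word's probability, and tokenization and stopword handling are omitted):

```python
from collections import Counter

def sumbasic(sentences, max_sentences):
    """Score sentences by the average probability of their words; after a
    sentence is selected, square ("discount") the probability of its words
    so that redundant candidates lose their score advantage."""
    tokens = [w for s in sentences for w in s]
    prob = {w: c / len(tokens) for w, c in Counter(tokens).items()}
    selected, pool = [], set(range(len(sentences)))
    while pool and len(selected) < max_sentences:
        best = max(pool, key=lambda i: sum(prob[w] for w in sentences[i])
                                       / len(sentences[i]))
        selected.append(best)
        pool.remove(best)
        for w in sentences[best]:      # words already "used" are discounted
            prob[w] = prob[w] ** 2
    return [sentences[i] for i in selected]
```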

Nenkova's findings on frequency as a major indicator of importance are confirmed by various authors [CSO06, VSBN07, YGVS07, GF09, HV09]. A recent study performed by Gillick and Favre [GF09] utilizes word bigram frequencies instead of unigram frequencies, and implements an integer linear programming algorithm to extract the set of sentences that maximizes the sum of bigram frequencies. In their experimental evaluation, the authors show that this simple heuristic gives results comparable to state-of-the-art systems on current summarization data sets.
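The underlying optimization problem can be written as an integer linear program in the spirit of [GF09] (notation ours): with $c_i \in \{0,1\}$ indicating whether concept (bigram) $i$ of weight $w_i$ is covered, $s_j \in \{0,1\}$ whether sentence $j$ of length $l_j$ is selected, $\mathrm{Occ}_{ij} = 1$ iff sentence $j$ contains concept $i$, and $L$ the length budget:

$$\begin{aligned}
\text{maximize} \quad & \textstyle\sum_i w_i\, c_i \\
\text{subject to} \quad & \textstyle\sum_j l_j\, s_j \le L, \\
& s_j\, \mathrm{Occ}_{ij} \le c_i \quad \forall i, j, \\
& \textstyle\sum_j s_j\, \mathrm{Occ}_{ij} \ge c_i \quad \forall i, \\
& s_j,\, c_i \in \{0, 1\}.
\end{aligned}$$

The length constraint makes this a knapsack-like problem, which off-the-shelf ILP solvers handle exactly for typical summarization problem sizes.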

2.3.3 Query-focused summarization

For query-focused summarization, systems must adequately translate the query or topic statement into features which indicate the query relevance of source elements. Various strategies have been proposed, most of which compute features either based on a sentence's similarity to the query, or based on the presence of (weighted) query words in the sentence. To alleviate the problems of sparsity and term mismatches when computing such features, various researchers have proposed to adopt query expansion techniques from IR [BYRN99]. Query expansion increases the set of words associated with the query, e.g. by adding semantically related terms, in order to match additional sentences. Several researchers have incorporated lexical resources (e.g. WordNet) for expanding a query with semantically related words [HLH06, VSB06, VSBN07]. Nastase [Nas08] utilizes information derived from Wikipedia, a large-scale encyclopedic knowledge source. Other approaches have employed relevance feedback methods, e.g. [GKMC99, DM06, AU07]. However, as Vanderwende et al. discuss, query expansion with lexical resources like WordNet, or with synonyms acquired from a large web corpus, does not always help in query-focused summarization [VSB06].

Summarization systems must also take into account that a user's information need may be more complex than a simple question, an aspect of information search which is reflected in recent DUC competitions (see Section 1.4.2). Some researchers have therefore investigated strategies for decomposing the question into a set of simpler ones. For example, Harabagiu et al. [HLH06, HRL07] utilize a syntactic parser to separate conjoined phrases and to recognize embedded questions. The authors explain that question decomposition leads to more relevant and complete answers, as systems can select appropriate passages for each of the subquestions.


A popular scheme to determine the query relevance of sentences is to score them based on the number of query words they contain, which is a variant of the title words feature proposed by Edmundson. Conroy et al. [CSO06] adopt this approach, and combine a query relevance feature with topic signature-based features to compute sentence scores. The authors report that using these two simple features, together with a redundancy removal strategy based on a matrix decomposition [CO01], results in state-of-the-art summarization performance on the DUC 2005 and 2006 data sets. Goldstein et al. [GMCK00] instead compute the passage's similarity with the query using the cosine measure and weighted word vector representations. Copeck et al. [CIK+07] calculate a query-focused similarity score by linearly combining overlap measures of the passage with the title and of the passage with the longer topic statement of the document collection, using unigram and bigram overlap and assigning different weights to different parts-of-speech.

The approach of Daume and Marcu [DM06] considers query-focused summarization from a language modeling perspective [PC98]. In their approach, the weights of query words are smoothed with a word-unigram language model constructed from a set of query-relevant documents. Sentences are then ranked based on the Kullback-Leibler divergence between the query's word distribution and a sentence's word distribution. In order to incorporate query constraints into the frequency-based summarizer proposed by Nenkova and Vanderwende [NVM06], Vanderwende et al. [VSBN07] compute the probability of words contained in the query. Final word weights are then calculated as a linear combination of a word's original likelihood in the document collection and its query probability.
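As an illustration, such an interpolated word weight could be computed as follows (a sketch under our own naming; beta is a hypothetical mixing weight, and both inputs are raw word counts):

```python
def interpolated_weights(collection_counts, query_counts, beta=0.5):
    """Word weight = beta * P(w | collection) + (1 - beta) * P(w | query),
    mirroring the linear combination described above."""
    n_col = sum(collection_counts.values())
    n_qry = sum(query_counts.values())
    vocab = set(collection_counts) | set(query_counts)
    return {w: beta * collection_counts.get(w, 0) / n_col
               + (1 - beta) * query_counts.get(w, 0) / n_qry
            for w in vocab}
```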

We will now introduce three different query-focused summarization systems that have achieved state-of-the-art results, whether measured by manual inspection or through the use of automatically computed metrics, in the DUC 2006 and DUC 2007 competitions. These systems illustrate a range of very different summarization strategies, from essentially statistical models to in-depth linguistic processing.

GISTexter

The GISTexter system has evolved over the years into a complex system which combines a wide range of linguistic and statistical methods for generic and query-focused multi-document summarization. In its early implementations, sentence scores were mainly determined by calculating topic signature and topic relation-based features, which were discussed previously in this survey [HL02]. In later work, Harabagiu and Lacatusu [Har04, HL05] describe the use of statistical models along with predicate-argument structures to represent sentences. The authors utilize clustering for discovering subtopics of related sentences, and combine subtopics in a graph representation based on linguistic and content relations. In a comparative study of different source representations – moving from basic statistical ones to complex combinations of statistical and symbolic structure – the authors show that the complex subtopic representations they propose improve the quality of summaries [HL10].

Current versions of GISTexter [LHR+06, HHL07, HL10] additionally incorporate question decomposition, question answering and textual entailment technologies, and utilize a variety of knowledge sources for query-focused multi-document summarization. The system's performance is excellent, ranking first for many of the manually evaluated linguistic quality criteria and overall summary responsiveness, and among the top systems for Rouge and Pyramid scores in DUC 2006 and 2007. The approach implemented in the GISTexter system thus illustrates the value of combining implicit statistical and explicit symbolic structure [Jon07].

PYTHY

The PYTHY summarization system [VSB06, TBG+07] scores each sentence by linearly combining a wide range of mostly lexico-statistical features (e.g. bigram frequency). The system approaches the learning of sentence feature weights as a pair-wise ranking problem: given a goodness metric on sentences which asserts a set of preferences si ≻ sj, the learner seeks to assign the higher score to the "better" sentence of each pair. The authors evaluate a range of metrics derived from reference summaries of previous DUC competitions, and find that a metric based on unigram frequency in reference summaries outperforms metrics using Pyramid content units and Rouge scores. Redundancy is accounted for using a discounting strategy similar to Nenkova et al.'s approach [NVM06]. In addition, the system employs a beam search and a dynamic programming approach instead of standard greedy summary optimization schemes. The performance of this summarizer is excellent; in fact, its reported Rouge-2 recall score is, to the best of our knowledge, only surpassed by the system we present in Chapters 4–5 of this thesis. On DUC 2007 data, the system also achieves a high Rouge-2 recall score, placing 2nd out of 30 participating systems.

An additional feature of the PYTHY system is the use of a syntactic sentence simplification component, which supplies abridged versions of input sentences to the summarizer. This strategy allows the summarizer to choose from a larger set of alternative sentences. However, the authors report that syntactic simplification can produce ungrammatical sentences, which contributes to the low grammaticality and referential clarity scores of their system in the DUC 2006 competition.²

Yih et al. [YGVS07] extend the original PYTHY system by incorporating features based on word position. The weight of a word is determined by its probability of occurring in reference summaries, and is learned with a supervised logistic regression approach using a range of position and frequency features. A stack decoding algorithm then finds the set of sentences that maximizes the sum of word weights, similar to the approach of Vanderwende et al. [VSBN07]. The authors report state-of-the-art Rouge-2 recall scores on DUC 2004 data, and in particular, significant improvements over a pure frequency-based summarizer.

IIIT Hyderabad

The summarization system presented by Jagarlamudi et al. [JPV06] determines sentence importance based on a set of statistically computed features. The first feature exploits the distributional hypothesis that co-occurring words are semantically related. Word weights are estimated as joint probabilities from a term-term matrix T. Each entry Tij is calculated as:

$$p(w' \mid w) \;=\; \sum_{k=0}^{K} P(k)\, P(w' \mid w, k), \qquad (2.5)$$

where P(w′|w, k) is the relative frequency of word w′ co-occurring with word w in a sliding window of size k, and P(k) is inversely proportional to k. The relative co-occurrence frequency is summed over all windows of size k up to a predefined maximum size K. This weighting scheme gives higher probability to words co-occurring close to each other, and lower weight to words co-occurring at a larger distance. Each sentence is assigned a feature value equal to its probability under the query, following the probability ranking principle [BYRN99]:

$$P(S \mid Q) \;\approx\; \prod_{w_i \in S} P(w_i) \prod_{q_j \in Q} P(q_j \mid w_i), \qquad (2.6)$$

² Syntactic sentence compaction has been a pre- or post-processing component in a variety of summarization systems [SNM04, ZDL+05, CSO06, TBG+07, ZDLS07, YGVS07]. Although appealing because compaction allows the removal of redundant or non-relevant subsentential content, its impact on summarizer performance has not been fully determined [SNM04, DM05b, VSBN07], especially with respect to the grammaticality and coherence of the produced summary. Nevertheless, by using sentence compaction summarizers may create more space to capture important content [TBG+07].


where qj are the words contained in the query. Note that this model reduces to the standard Vector Space Model if the size of the sliding window is K = 1.
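The following sketch estimates the mixed co-occurrence probabilities of Equation 2.5 (with our implementation choices made explicit: window sizes run from 1 to K, and P(k) is set proportional to 1/k, since the text only states that it is inversely proportional to k):

```python
from collections import defaultdict

def cooccurrence_probs(tokens, K=10):
    """Estimate p(w'|w) as in Eq. 2.5: mix the relative frequencies
    P(w'|w,k) of w' occurring within distance k of w, for k = 1..K,
    with mixing weights P(k) proportional to 1/k."""
    norm = sum(1.0 / k for k in range(1, K + 1))
    p_k = {k: (1.0 / k) / norm for k in range(1, K + 1)}

    mixed = defaultdict(float)
    for k in range(1, K + 1):
        pairs = defaultdict(int)   # co-occurrence counts within distance k
        totals = defaultdict(int)  # normalizer per conditioning word w
        for i, w in enumerate(tokens):
            for j in range(max(0, i - k), min(len(tokens), i + k + 1)):
                if i != j:
                    pairs[(w, tokens[j])] += 1
                    totals[w] += 1
        for (w, w2), c in pairs.items():
            mixed[(w, w2)] += p_k[k] * c / totals[w]
    return mixed   # mixed[(w, w2)] approximates p(w2 | w)
```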

In addition to the query probability feature, the system uses two query-independent features to score sentences. The first feature is calculated as the likelihood of the sentence under a unigram language model, which is estimated on the source document collection. The second feature is given by the entropy of the sentence under the same language model. The authors evaluated the validity of their approach during the DUC 2006 competition. Their system outperformed all other summarizers in the automatic evaluations and ranked among the top systems for manual evaluations.

In later work, Pingali et al. [PKV07] introduce a feature that is based on a contrastive analysis of word probabilities in the document collection to be summarized (D), compared to the probabilities of words in a randomly chosen document set D̄:

$$\mathrm{score}_{D,\bar{D}}(S) \;=\; \frac{P(D) \prod_{w_i \in S} P(w_i \mid D)}{P(D) \prod_{w_i \in S} P(w_i \mid D) \;+\; P(\bar{D}) \prod_{w_i \in S} P(w_i \mid \bar{D})}, \qquad (2.7)$$

where S is a sentence and wi is the i-th word in sentence S. This approach is similar to the topic signature approach described previously, since it assigns a higher weight to words characteristic for the collection of relevant documents.

2.4 Latent factor models

This section introduces latent factor models, a class of unsupervised machine learning algorithms that can be used to model large collections of discrete data, such as the document collections used in information retrieval [BNJ03]. Latent factor models aim to represent such datasets in terms of a set of hidden, or latent, factors, which are assumed to be responsible for generating the observed data. Each latent factor can be seen as corresponding to a particular concept expressed in the original data. When applied to text corpora, the factors are often interpreted as "topics", and hence these models are often called topic models. The goal of latent factor algorithms is to recover the set of underlying factors from the observed data.

In natural language processing and information retrieval, the application of latent factor models is motivated by the observation that documents are typically composed of several different main concepts, or topics. For example, a scientific paper in computational linguistics may be tagged as containing 30% computer science-related content, and 70% content that deals with linguistic issues. Latent factor models assume that when writing a document, the author has in mind several ideas (or concepts, topics) she intends to express, and subsequently selects words from each topic to relate these ideas. Each latent topic is thus characterized by its own, specific vocabulary.

Topic 247            Topic 5              Topic 43               Topic 56
word       prob.     word      prob.      word         prob.     word        prob.
drugs      .069      red       .202       mind         .081      doctor      .074
drug       .060      blue      .099       thought      .066      dr.         .063
medicine   .027      green     .096       remember     .064      patient     .061
effects    .026      yellow    .073       memory       .037      hospital    .049
body       .023      white     .048       thinking     .030      care        .046
medicines  .019      color     .048       professor    .028      medical     .042
pain       .016      bright    .030       felt         .025      nurse       .031
person     .016      colors    .029       remembered   .022      patients    .029
marijuana  .014      orange    .027       thoughts     .020      doctors     .028
label      .012      brown     .027       forgotten    .020      health      .025
alcohol    .012      pink      .017       moment       .020      medicine    .017
dangerous  .011      look      .017       think        .019      nursing     .017
abuse      .009      black     .016       thing        .016      dental      .015
effect     .009      purple    .015       wonder       .014      nurses      .013
known      .008      cross     .011       forget       .012      physician   .012

Table 2.1: An illustration of four (out of 300) latent topics extracted from the TASA corpus (taken from [SG07]).

Table 2.1 shows a set of example topics, taken from [SG07], which are derived from the TASA document corpus. The TASA corpus is a collection of approximately 37,000 texts from educational materials [LFL98]. The table shows the 15 most highly weighted words for each topic; the topics relate to drug use, colors, memory and the mind, and doctor visits.

There are several strong benefits associated with the use of latent factor models to represent unstructured text data. From a linguistic point of view, latent factor models represent documents by a small set of underlying factors, and thus abstract from the observed usage of words. This approach allows such models to address several limitations of the standard vector space model related to lexical variability, data sparsity and vector space dimensionality (see Section 2.2). The projection of the original, high-dimensional word space onto a denser, low-dimensional latent space typically leads to documents having a high similarity in the latent space even if they do not share any words in the original word vector space [MS01]. In addition, latent factor models capture patterns of word usage, thereby uncovering semantic relations between words and documents, which has proven to result in more robust word processing in many IR and NLP applications. A general claim made for such models is that the various factors explicitly distinguish between different meanings and different types of word usage (polysemy), and group words with the same or similar meanings (synonymy) [LMK07]. This effect is achieved by exploiting word co-occurrence: words which co-occur in the same contexts are projected onto the same latent topic, and words that occur in different contexts are projected onto different latent topics. For example, the words "physician" and "doctor", even if never co-occurring in a single document, will tend to be quite similar in the latent space because they occur in the same contexts (e.g. with words like "patient", "hospital", "sick", "surgery", "nurse", etc.). This distributional viewpoint of word semantics is driven by the hypothesis that a word's meaning is determined by its context [Fir57]. Latent factor models do not only discover the hidden structure in document collections, but also establish inter- and intra-document links, which offers new ways to explore and understand the input data. They are particularly useful with text data, since the observed data (words) are explicitly intended to communicate a latent structure (their meaning) [GS04].

From a computational point of view, the reduced description of documents in a latent space often mitigates noise-related issues, and lowers computational effort, e.g. in document retrieval applications [BNJ03]. Furthermore, latent factor models are unsupervised machine learning algorithms, and therefore do not require the use of labeled training data. This fact makes them applicable to large collections of unstructured text data, as well as domain- and language-independent. Latent factor models assume no prior knowledge of the topics in a given set of documents, and can thus model the content structure of texts independent of any external knowledge resources. Probabilistic variants of latent factor models are in addition embedded in a rich Bayesian framework that enables the application of well-known statistical methodologies, and facilitates the integration of new information sources [Bis07, LMK07]. However, similar to the standard vector space model, latent factor models make the "bag-of-words" assumption, i.e. the assumption that the order of words in a document can be neglected, and that word occurrence observations can be considered independent of each other for computational purposes. Since word-order information contains important cues to determine content meaning, various authors have proposed extensions which aim to incorporate such kinds of information [Wal06, GSBT05, HA10].


2.4.1 Latent semantic analysis

Latent Semantic Analysis (LSA) is a technique for latent factor modeling that was first proposed by Deerwester et al. in the context of automatic document indexing and retrieval [DDF+90]. LSA is based on a singular value decomposition (SVD) of a matrix representation of a document collection. SVD identifies a linear subspace that captures most of the variance of the original word vector space, and its derived features are linear combinations of the original word features [BNJ03].

Formally, to apply LSA, a corpus of documents is represented as an m × n matrix A of m terms and n documents, where each entry Aij corresponds to the (suitably weighted, e.g. with tf-idf) value of term i in document j. The SVD of A is then defined as:

$$A = U \Sigma V^T, \qquad (2.8)$$

where the orthonormal columns of U are called the left singular vectors, Σ is a diagonal matrix of singular values sorted in descending order, and the orthonormal columns of V are called the right singular vectors.

To derive a low-dimensional latent factor model using SVD, one typically chooses a k ≪ m, n to approximate the matrix A with a rank-reduced matrix A(k):

$$A_{(k)} = U_{(k)} \Sigma_{(k)} V_{(k)}^T. \qquad (2.9)$$

This rank-k approximation uses only the first k singular vectors and singular values of the decomposition. It minimizes the Frobenius norm³ of the difference between A and A(k) [DDF+90]. The magnitude of the singular value Σii signifies the degree of importance of the i-th latent dimension. Each dimension of the subspace is assumed to correspond to a single latent factor, and the entries of U and VT correspond to the feature values of the term and document vectors, respectively, along the new, projected dimensions. Thus, the first column of U represents the first, and most important, latent factor, which is characterized by the terms i with a high value in Ui1. It has to be noted that the actual choice of the number of latent factors k is critical for the performance of the model, and is usually determined empirically [LMK07].
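In practice, the decomposition and its rank-k truncation (Equations 2.8–2.9) take only a few lines of numpy; the helper below is our own sketch:

```python
import numpy as np

def lsa(A, k):
    """Rank-k LSA model of a (weighted) term-document matrix A (Eqs. 2.8-2.9).
    Returns the truncated factors; U_k @ S_k @ Vt_k is the best rank-k
    approximation of A in the Frobenius norm."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)   # A = U diag(s) Vt
    return U[:, :k], np.diag(s[:k]), Vt[:k, :]

# term (row) and document (column) vectors then live in the k-dimensional
# latent space spanned by the leading singular vectors:
# U_k, S_k, Vt_k = lsa(A, k=100)
```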

LSA makes three main claims: semantic information can be derived from a word-document co-occurrence matrix, dimensionality reduction is an essential part of this derivation, and words and documents can be represented as points in a Euclidean space [SG07]. It has been applied successfully in a number of domains, including indexing for IR [DDF+90], document clustering and classification [Dum04], and in cognitive theories of meaning [LMK07]. However, LSA's latent factors are often difficult to interpret, as many entries of U and VT will be negative due to the orthonormal basis of the linear subspace [Hof99b, Zha02].

³ The Frobenius, or L2, norm of a matrix is a function that assigns a real-valued, strictly positive length or size to a matrix, similar to the length function of vectors.

2.4.2 Probabilistic topic models

Probabilistic topic models address the unsatisfactory statistical foundations of LSA, and express the semantic properties of words and documents in terms of probabilistic topics instead of as points in a Euclidean space [Hof99b, SG07]. They model each document as a mixture of latent topics, i.e. as a probability distribution over a fixed set of topics. Each latent topic, in turn, is represented as a multinomial distribution over words. Each word in a document is generated from a single topic, and different words in a document can be generated from different topics.

The main idea that latent topic models are based upon is that the documents of a collection may be created by a "generative" process, which specifies a simple probabilistic procedure for generating a new document: given a set of latent topics and the words associated with them, one creates a new document by first choosing a distribution over topics. Subsequently, for each word in the document, one picks a topic at random from this distribution, and randomly draws a word from this topic. The algorithms used to create latent topic models invert this process, and work backward to explain the observed data using statistical inference methods. Given the observed words in a set of documents, they find the model that is most likely to have generated the data, i.e. the probability distribution over words associated with each topic, and the distribution over topics for each document [SG07].
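The following toy sketch (ours; it replaces the usual Dirichlet draw with normalized random weights for brevity) makes this generative procedure explicit:

```python
import random

def generate_document(topics, n_words, rng=random):
    """Toy generative process of a topic model: draw a document-specific
    distribution over topics, then for every word position draw a topic
    and then a word from that topic's word distribution.
    `topics` maps a topic id to a word distribution {word: probability}."""
    # document-specific topic proportions; a Dirichlet draw in real models,
    # approximated here by normalized random weights
    weights = [rng.random() for _ in topics]
    theta = [w / sum(weights) for w in weights]
    topic_ids = list(topics)

    doc = []
    for _ in range(n_words):
        z = rng.choices(topic_ids, weights=theta)[0]      # pick a topic
        words, probs = zip(*topics[z].items())
        doc.append(rng.choices(words, weights=probs)[0])  # pick a word
    return doc
```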

Similar to the latent dimensions of LSA, each latent topic clusters semantically related words based on co-occurrence observations. The main difference to LSA is how the model parameters are estimated: whereas LSA minimizes the L2 (Frobenius) distance between A and A(k), latent topic models fit the model to the data with probabilistic maximum likelihood or Bayesian methods.

Probabilistic topic models are appealing alternatives to LSA because they are probabilistic generative models of text, and thus embedded in a larger and useful framework of statistical methodology. In particular, this framework offers methods for determining an optimal number of latent factors, and for avoiding overfitting [BGJT04, Bis07]. Furthermore, the topics are typically as interpretable as the ones shown in Table 2.1, which contrasts them with the arbitrary axes of LSA's spatial representation [SG07].


Probabilistic topic models have been successfully applied to many different tasks, such as collaborative filtering-based recommendation [Hof99a], document retrieval [Hof99b], image annotation [MGP03], news personalization [DDGR07], document modeling [BNJ03], modeling of scientific document collections [GS04, HJM08], and trend detection [WM06, HJM08]. The probabilistic framework facilitates the integration of additional knowledge sources, and a host of refinements and extensions have been proposed in recent years: incorporating word order [GSBT05, Wal06], including citation information in modeling scientific document collections [DBS07], developing hierarchical topic structures [BGJT04], time-dependent dynamic models [BL06, HJM08], or multilingual topic models [MWN+09].

Probabilistic Latent Semantic Analysis (PLSA) Probabilistic Latent Semantic Analysis (PLSA), as introduced by Hofmann [Hof99b], is a first example of a probabilistic topic model. PLSA posits that a word w and a document d are conditionally independent given a latent topic z:

$$P(d, w) = P(d)\, P(w \mid d), \quad \text{where} \qquad (2.10)$$

$$P(w \mid d) = \sum_{z \in Z} P(w \mid z)\, P(z \mid d). \qquad (2.11)$$

In these equations, P(d, w) is the joint probability of observing word w in document d, P(d) is the prior probability of document d, P(z|d) is the distribution over topics Z in a particular document d, and P(w|z) is the probability distribution over words w given topic z.

Figure 2.2 shows a graphical model representation of PLSA. In this plate notation, shaded and unshaded variables correspond to observed and unobserved (latent) variables, respectively. The arrows indicate conditional dependencies between variables. Plates (boxes) refer to repetitions of the sampling steps of the generative procedure, with the number of samples indicated by the variable in the lower right-hand corner. For PLSA, the outer plate represents a corpus of M documents, while the inner plate represents the repeated choice of topics and words within a document.

Following the maximum likelihood principle, the topics and the document-specific topic distributions are determined by the maximization of the log-likelihood function:

$$\mathcal{L} = \sum_{d \in D} \sum_{w \in W} n(d, w) \log P(d, w), \qquad (2.12)$$

where n(d, w) denotes the term frequency, i.e. the number of times word w occurs in document d.



Figure 2.2: Graphical model representation of the document-word PLSA model for N words and a corpus of M documents. In this plate notation, shaded and unshaded variables correspond to observed and unobserved (latent) variables, respectively. The arrows indicate conditional dependencies between variables. Plates (boxes) refer to repetitions of the sampling steps of the generative procedure, with the number of samples indicated by the variable in the lower right-hand corner. For PLSA, the outer plate represents documents, while the inner plate represents the repeated choice of topics and words within a document.

The standard procedure for maximizing the likelihood function in the presence of latent variables is the Expectation Maximization (EM) algorithm [DLR77]. EM is an iterative algorithm where each iteration consists of two steps: an expectation step, where the posterior probabilities of the latent topics z are computed:

$$P(z \mid d, w) = \frac{P(z)\, P(d \mid z)\, P(w \mid z)}{\sum_{z'} P(z')\, P(d \mid z')\, P(w \mid z')}, \qquad (2.13)$$

and a maximization step, where the conditional probabilities of the parameters given the posterior probabilities of the latent topics are updated:

$$P(w \mid z) = \frac{\sum_{d} n(d, w)\, P(z \mid d, w)}{\sum_{d, w'} n(d, w')\, P(z \mid d, w')}, \qquad (2.14)$$

$$P(d \mid z) = \frac{\sum_{w} n(d, w)\, P(z \mid d, w)}{\sum_{d', w} n(d', w)\, P(z \mid d', w)}, \qquad (2.15)$$

$$P(z) = \frac{\sum_{d, w} n(d, w)\, P(z \mid d, w)}{\sum_{d, w} n(d, w)}. \qquad (2.16)$$

Starting with a random initialization of the parameters, one alternates the expectation and maximization steps until arriving at a convergence point which describes a local maximum of the log-likelihood. The outputs of the algorithm are the topics, as well as the distribution over topics for each training document, i.e. the conditional probabilities P(w|z) and P(z|d). New documents, e.g. queries in document retrieval applications, are "folded in" to the trained model by performing EM iterations where the factors P(w|z) are kept fixed, and only the mixing proportions P(z|d) are adapted in each maximization step [Hof99b]. It is then possible to calculate document and word similarities by comparing the latent topic distributions P(z|w) and P(z|d).
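The procedure is compact enough to state directly; the following vectorized sketch (ours) implements Equations 2.13–2.16, materializing the full posterior tensor P(z|d,w), which is feasible only for small corpora:

```python
import numpy as np

def plsa(n_dw, n_topics, n_iter=50, seed=0):
    """EM for PLSA (Eqs. 2.13-2.16). n_dw: documents x words count matrix.
    Returns P(w|z), P(d|z) and P(z)."""
    rng = np.random.default_rng(seed)
    n_docs, n_words = n_dw.shape
    p_w_z = rng.random((n_topics, n_words))
    p_w_z /= p_w_z.sum(axis=1, keepdims=True)
    p_d_z = rng.random((n_topics, n_docs))
    p_d_z /= p_d_z.sum(axis=1, keepdims=True)
    p_z = np.full(n_topics, 1.0 / n_topics)

    for _ in range(n_iter):
        # E-step (Eq. 2.13): posterior P(z|d,w), shape (topics, docs, words)
        joint = p_z[:, None, None] * p_d_z[:, :, None] * p_w_z[:, None, :]
        posterior = joint / joint.sum(axis=0, keepdims=True)
        # M-step (Eqs. 2.14-2.16): re-estimate from expected counts n(d,w) P(z|d,w)
        expected = n_dw[None, :, :] * posterior
        p_w_z = expected.sum(axis=1)
        p_w_z /= p_w_z.sum(axis=1, keepdims=True)
        p_d_z = expected.sum(axis=2)
        p_d_z /= p_d_z.sum(axis=1, keepdims=True)
        p_z = expected.sum(axis=(1, 2))
        p_z /= p_z.sum()
    return p_w_z, p_d_z, p_z
```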

Similar to LSA, one of the major challenges in applying latent topic models is the estimation of the number of latent topics. Choosing too few topics may not fully reflect the underlying latent factors of a domain, and may result in broad topics or arbitrary combinations of different factors. A solution with too many topics will result in uninterpretable topics that pick out idiosyncratic word combinations [SG07].

Another limitation of the EM algorithm is its convergence to local maxima. In his original approach, Hofmann suggests tempered Expectation-Maximization to avoid unfavorable local extrema, and to avoid overfitting on the training data [Hof99b]. Another optimization approach, which we follow in our own work for its significantly better performance, is described by Brants et al. [BCT02]: instead of using only a single model, the authors propose to compute several different, randomly initialized models, and to average the features computed from these models.

Latent Dirichlet Allocation (LDA) A deficit of PLSA is its lack of generative modeling at the document level, which means that the model is biased towards the topic distributions of the documents in the training corpus. It is also not clear how to assign a probability distribution over topics to documents outside the training set [BNJ03]. Furthermore, PLSA does not incorporate smoothing to reduce noise and to allow for unseen data.

These issues have been addressed in a latent topic model known as Latent Dirichlet Allocation (LDA) [BNJ03]. LDA differs from PLSA by introducing Bayesian generative modeling at the level of documents and word distributions. Figure 2.3 shows the graphical model representation of LDA. In contrast to PLSA, LDA does not directly estimate P(z|d), denoted as θ in the figure, but conditions this distribution on a conjugate prior α. Each word of a document d is generated by drawing a topic zk ∈ Z from the document's topic distribution θd, and then drawing a word w from the topic's distribution over words ϕzk. ϕ in turn is also parametrized by a conjugate Dirichlet prior β. The parameters α and β can be interpreted as prior observation counts for the number of times topic zk is sampled in a document (and respectively word wi for topic zk), before having observed any actual words from that document. α and β thus act as smoothing parameters on the topic and word distributions [Bis07].

Figure 2.3: Graphical model representation of the document-word LDA model for N words, T topics, and a corpus of M documents. In contrast to PLSA, LDA first generates a document's distribution over topics θ, conditioned on a Dirichlet prior α. Each word of a document is generated by drawing a topic z from θ, and then drawing a word w from the word distribution ϕ associated with z.

The estimation of the model's parameters is again viewed as a problem of maximizing the likelihood of the data:

$$p(D \mid \alpha, \beta) = \prod_{d=1}^{M} \int p(\theta_d \mid \alpha) \left( \prod_{i=1}^{N_d} \sum_{z_k} p(z_k \mid \theta_d)\, p(w_i \mid \phi_{z_k})\, p(\phi_{z_k} \mid \beta) \right) d\theta_d, \qquad (2.17)$$

where p(D|α, β) is the likelihood of the data, M is the number of documents in the corpus, Nd is the number of words in document d, and zk is the topic assigned to word i in document d.

The main problem in applying LDA is computing the posterior distribution of the hidden variables given a document collection, which is intractable for exact inference in general [BNJ03]. However, a variety of approximate inference algorithms can be applied to LDA, including Laplace approximation [AWST09], variational expectation maximization [Bis07], and Markov chain Monte Carlo techniques such as Gibbs sampling [SG07].

2.5 Content models

The analysis of text structure, and the development of computational models of text, are of central concern in many areas of natural language processing. An important aspect of text analysis in automatic summarization is an exploration of the content structure of texts, i.e. a characterization of source documents in terms of their themes and topics. In this section, we will introduce approaches to text summarization that analyze this kind of text structure and create so-called content models of text.

[Figure 2.4 reproduces a full example news article ("As Turkey searches for quake survivors, fingers are pointed at shoddy housing", Istanbul), with passages annotated in the margin with the subtopic labels "Rescue efforts & Collapsed buildings & Casualties", "Rescue efforts", "Collapsed buildings", "Location and strength", and "Foreign help".]

Figure 2.4: The figure shows an example document from a collection of news articles about the 1999 earthquake in Turkey. The article contains several subtopics related to the main theme, with each subtopic spanning an amount of text anywhere from a single sentence up to a few paragraphs.

Content models are representations of the domain-specific content structure of documents, which distinguishes them from models that characterize text structure using domain-independent rhetorical elements or cohesion relations [MT88, BE97, Mar99]. They also differ from models that utilize knowledge about the rhetorical status of sentences, as discussed for scientific papers and legal documents in [TM02, HG05], as these models assume a genre-specific structuring of content. In particular, content models focus on representations of the (sub-)topical structure of texts, which contrasts them with simpler models of content that are based on word frequency, graphs or linguistically motivated representations, as discussed earlier. However, the notion of what exactly constitutes a subtopic is hard to pin down [Hea97]. It is generally assumed that each subtopic relates a particular type of information, and can span an amount of text up to a few paragraphs. Subtopics are often characterized by a specific word distribution, and when text shifts from one subtopic to the next, a large amount of the vocabulary changes [Hea97]. To illustrate the notion of subtopics, Figure 2.4 shows an example news article from a collection of news articles about the 1999 earthquake in Turkey. The article discusses several types of information related to the main theme, for example the earthquake's location and strength, information related to the number of casualties, rescue efforts, and foreign help.

Subtopic identification constitutes one way of finding similar and differing information in a document or document collection, and subtopic modeling has been explored by various authors [GMCK00, GL01, HL02, BL04]. Many early approaches to automatic document summarization incorporated the TextTiling algorithm [Hea97] to segment longer source texts. TextTiling relies on detecting the change of vocabulary between adjacent text segments to identify boundaries between subtopics. The algorithm represents text segments as word vectors in a vector space model, and calculates the cosine distance of word vectors to segment documents into multi-paragraph units [BMR95, BE97, BK97, SSMB99, HL02]; a simplified sketch of this idea is shown below.
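The sketch is heavily simplified and ours: real TextTiling additionally smooths the similarity curve and scores boundary candidates by their depth relative to neighbouring similarity peaks.

```python
import math
from collections import Counter

def texttiling_boundaries(blocks, depth=0.1):
    """Toy TextTiling-style segmentation: compute the cosine similarity of
    the word vectors of adjacent text blocks (lists of tokens) and place a
    subtopic boundary wherever similarity drops below the mean by `depth`."""
    def cosine(u, v):
        dot = sum(c * v[w] for w, c in u.items() if w in v)
        nu = math.sqrt(sum(c * c for c in u.values()))
        nv = math.sqrt(sum(c * c for c in v.values()))
        return dot / (nu * nv) if nu and nv else 0.0

    vecs = [Counter(b) for b in blocks]
    sims = [cosine(vecs[i], vecs[i + 1]) for i in range(len(vecs) - 1)]
    mean = sum(sims) / len(sims)
    return [i + 1 for i, s in enumerate(sims) if s < mean - depth]
```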

In the next sections, we discuss a range of approaches for content and subtopic modeling, starting with Hidden Markov Model representations, and then moving on to the application of latent factor models.

2.5.1 Hidden Markov Models

Barzilay and Lee [BL04] posit that content structures of texts can be identified by scanning for recurrent word patterns [Har82], instead of relying on notions of vocabulary change or lexical overlap in the vector space model. Their idea is motivated by previous research which shows that such structures are often expressed with the same or similar, "formulaic" word patterns [Wra02]. For example, in the news article shown in Figure 2.4, there are many words which indicate the general theme (e.g. "quake", "tremor", "magnitude"). However, each subtopic is furthermore characterized by its own specific vocabulary, e.g. the "Foreign Help" subtopic frequently contains words such as "money", "relief", "help", or "worker". The main assumption is that these word patterns reappear in similar articles, e.g. articles about the same event, or in documents about a similar event (e.g. a different earthquake).

In their approach, Barzilay and Lee propose to represent subtopics as word bigram language models, and to capture possible information orderings in a domain by utilizing a Hidden Markov Model (HMM) [Rab90]. They assume that texts from the same domain, for example a set of news articles about earthquakes, are characterized by a definable set of subtopics and by a specific ordering of these subtopics. The HMM's states and state transitions represent the subtopics and their ordering, respectively. Both the subtopic representations and the ordering relations between them are learned from unannotated documents. The authors first cluster sentences, measuring sentence similarity with the cosine metric and using word bigrams as features. The corresponding HMM has as many states as clusters, and each state's (smoothed) bigram emission probabilities are estimated from the corresponding cluster's bigram distribution. The original ordering of the sentences assigned to each cluster is used to calculate initial state transition probabilities. Finally, sentences are reclustered according to which state is most likely to have generated them, and the HMM parameters are re-estimated based on the new clustering. The re-estimation procedure is repeated until the clusterings stabilize.

Barzilay and Lee present an experimental evaluation of their approach on the task of generic single-document summarization. After learning an HMM model of the input article, the proposed algorithm uses document-summary pairs to learn the summary-likelihood of HMM states. Sentences from the source article which are assigned to the most likely subtopics are then included in the summary. Their experiments on a small corpus of 60 Associated Press newswire articles show that their approach outperforms a system similar to the one described by Kupiec et al. [KPC95].

2.5.2 Latent semantic analysis

Gong and Liu [GL01] illustrate the use of LSA in generic single-document summarization by applying SVD to a term-sentence matrix of the source document. They assume that the latent dimensions discovered by SVD correspond to the document's subtopics. To construct a summary, their method selects, for each i = 1..K, the sentence with the largest value in the i-th right singular vector, i.e. a single sentence for each subtopic. The authors argue that this minimizes redundancy, as the dimensions of the latent space are orthonormal, and guarantees that the most important subtopics are selected first, since they correspond to the latent dimensions with the highest singular values.

In their evaluations, the authors show that the LSA approach outperforms a baseline method which assigns sentence scores based on features derived from a tf-idf-weighted vector space representation. Their experiments also show that simple binary or word frequency weighting of the term-document matrix works best, whereas more complex weighting schemes (logarithmic, tf-idf) deteriorate performance. As an alternative to choosing a single sentence per topic, Steinberger et al. [SPKJ07] propose to re-weight VT by the singular values Σ, such that sentences with the greatest combined weight in B = Σ²VT are included in the summary. An experimental evaluation shows that this selection strategy outperforms the strategy employed by Gong and Liu.
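Both selection strategies are easy to state on top of the SVD; the sketch below is our reading of them (for the Steinberger et al. variant we score each sentence by the magnitude of its column in B = Σ²VT, restricted to the first k dimensions):

```python
import numpy as np

def gong_liu_summary(A, k):
    """Gong & Liu selection: for each of the first k latent dimensions, take
    the sentence with the largest value in the corresponding right singular
    vector. A is a terms x sentences matrix."""
    _, _, Vt = np.linalg.svd(A, full_matrices=False)
    return [int(np.argmax(Vt[i])) for i in range(k)]

def steinberger_summary(A, k, n):
    """Steinberger et al. variant: score each sentence by the magnitude of
    its column in B = Sigma^2 Vt (first k dimensions) and take the top n."""
    _, s, Vt = np.linalg.svd(A, full_matrices=False)
    B = (s[:k, None] ** 2) * Vt[:k]       # scale each latent dimension
    scores = np.linalg.norm(B, axis=0)    # combined weight per sentence
    return list(np.argsort(-scores)[:n])
```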

The general usefulness of subtopic modeling on the basis of LSA is also confirmed in an earlier study by Steinberger et al. [SKPSG05]. The authors propose a modification of the term-document matrix A to incorporate co-reference information. They argue that this approach increases the co-occurrence information available to LSA, and mitigates the effects of authors using name variants and pronouns to avoid lexical repetition. Co-reference information can either be incorporated by substituting all distinct co-referent words with a single 'main' term, or by viewing the co-reference chain identifiers as additional, novel terms. For this latter representation, the m × n term-document matrix A is extended by adding rows for each co-reference chain, where Aij, i > m is a binary variable that indicates if co-reference chain i occurs in sentence j. The authors conduct experiments on a corpus of Reuters newswire and popular science texts from the British National Corpus, comparing against manually constructed sentence extracts. Their evaluations show an improvement of summarizer performance across a range of performance measures when using the addition method, whereas the substitution method results in a loss of performance.

The approach presented by Zha [Zha02] combines clustering and LSA approaches. The author first clusters sentences into topical groups – corresponding to the different subtopics of a document – using a modified k-means clustering algorithm. The modification incorporates sentence priors representing locational proximity into the clustering similarity measure. For each topical cluster, the algorithm then determines sentence weights by an SVD decomposition of the cluster's term-sentence adjacency matrix. The highest-ranked sentences of the first right singular vector qualify as candidates for the summary.

Murray et al. [MRC05] and Ozsoy et al. [OCA10] illustrate modified variants of the standard LSA approach that employ novel sentence selection strategies based on the factor matrices Σ and VT. Hachey and Grover [HMR05] learn the semantic space from a large background corpus instead of the document collection to be summarized, and represent sentences in this more general semantic space. Yeh et al. [YKYM05] demonstrate a combination of LSA with a graph representation of sentence connectedness. The authors replace each sentence's word vector representation by its new vector in the latent semantic space before computing sentence similarities. Their evaluations show that this improves summarization performance as compared to graphs constructed from word vectors.

2.5.3 Probabilistic topic models

Probabilistic latent semantic analysis was first adopted by Bhandari et al. [BSIM08] for the task of generic single-document summarization. Similar to the approach described by Gong and Liu, the authors represent the input document by a term-sentence matrix. After estimating the parameters of the PLSA model, the authors compare two different sentence selection strategies. The first strategy selects sentences from the dominant subtopic of the document; the second strategy chooses sentences which best represent all the subtopics in a document. This second strategy is similar to the re-weighting approach employed by Steinberger et al. [SPKJ07].

Bhandari et al. evaluate their approach on the DUC 2002 data set with mixed results: when using the first strategy, PLSA outperforms LSA and a graph-based ranking scheme, but does not reach the performance of LexRank. Their second strategy, however, results in a considerable improvement of Rouge scores over the other approaches. This can be attributed to the fact that this strategy favors sentences that are relevant for different subtopics, whereas graph-based rankings emphasize the influence of the dominant subtopic. Therefore, PageRank seems to be more promising when focusing on selecting sentences from the dominant subtopic of a document, but PLSA is more promising for summaries that aim to cover different subtopics. The large improvement in Rouge scores suggests that human summarizers intuitively also adopt this latter strategy of creating a summary with broader coverage of the source's contents.

Arora and Ravindran [AR08b, AR08a] present a similar approach for generic multi-document summarization. They propose an alternative selection strategy that views summary creation as a generative process: after training the latent factor model on a document collection, they compute the probability of each latent factor. Their algorithm then iteratively samples a factor from this distribution, and selects the most likely sentence given this factor. The final summary thus consists of sentences from different factors, and the number of sentences from each factor is proportional to the factor's likelihood.

Experiments conducted on DUC 2002 data show that the proposed approach outperforms the best submissions of that competition in Rouge-1 recall. Unfortunately, however, the authors do not compare their approach to current multi-document summarization systems, or on more recent data sets, which would have given more substantial evidence for the validity of their approach. This criticism equally holds for the approach presented by Bhandari et al.

Query-oriented multi-document summarization is the focus of a recent study conducted by Tang et al. [TYC09]. The authors describe an approach which combines the query and the source document collection in a shared generative model. The approach highlights one of the benefits of probabilistic models, which is the simple and principled integration of different kinds of source information into a single model. In Tang et al.'s approach, the word distribution of latent factors is simultaneously guided by the query and by the documents. In addition, the query is modeled as a distribution over factors, to address the fact that complex questions may relate to different subtopics. An evaluation of the approach on the DUC 2005 and DUC 2006 data sets shows small, non-significant Rouge score improvements over generic summarization systems. However, the system performs on par with the highest-ranking participants in DUC 2005 and DUC 2006.

A similar approach for incorporating query information into a probabilistic model of text is presented by Daume and Marcu [DM06]. Daume and Marcu view query-focused summarization from a language-modeling-for-IR perspective, where sentences are ranked based on their distributional similarity to a query. In their approach, each sentence is represented as a mixture of three latent topics whose word distributions are assumed to correspond to a background English language model, a document-specific language model, and a query-specific language model, respectively. Given a typical IR corpus of documents, queries, and document relevance judgments, the model then learns, for each sentence, its distribution over latent topics (e.g. that the sentence contains 90% general English and 10% document-specific words), and the word distributions of the latent topics. The effect of the co-occurrence observations exploited by latent topic models is in this case that words co-occurring very often in the query and in sentences serve as a kind of relevance feedback to the query, and 'expand' the query's distribution over words. The word distribution of the query-related latent topic thus assigns high probabilities not only to words occurring in the query, but also to words co-occurring with query words in various sentences.

The authors evaluate their approach on the tasks of query-focused and generic summarization. For generic MDS, the relevance of sentences is determined with respect to the centroid of the document cluster. In the DUC 2005 competition, the summarizer ranked first in the manual responsiveness evaluation and among the top-ranked systems in all automatic evaluations. In contrast to the other approaches discussed in this section, the proposed method does not explicitly model the subtopic structure of documents or document collections, and is conceptually simpler. Yet, the experimental evaluation shows that latent topic models capture meaningful structures in the data. These structures, in turn, can be used to determine useful importance features for extractive summarization. The authors also note that their evaluations show that latent topic models are relatively robust to noisy data (i.e. non-relevant documents).

Haghighi and Vanderwende [HV09] take up the approach of Daume and Marcu, and model sentences as a distribution over a general background vocabulary ϕb, document collection-specific content ϕc, and document-specific content ϕd. The authors then extend the model to capture collection-wide subtopics by introducing additional latent topics ϕck. Specifically, their approach models word generation as a hierarchical process: When generating a collection-specific word, the model first decides whether to emit a general collection or subtopic-specific word, and in the latter case then decides from which specific subtopic. Each subtopic distribution ϕck models subtopics which are used in several documents and tend to appear in contiguous sets of sentences. Subtopics in a document are ordered sequentially, similar to the approach proposed by Barzilay and Lee [BL04].

After learning the LDA-style model, the algorithm creates a summary S by selecting a set of sentences that minimizes the Kullback-Leibler (KL) divergence between the summary's word distribution PS and the collection's word distribution PD:

$$S^* = \min_{S:\,\mathrm{words}(S) \le L} \mathrm{KL}(P_D \,\|\, P_S) \qquad (2.18)$$

where L is the length of the summary in words, and the KL divergence between two distributions P and Q is defined as:

$$\mathrm{KL}(P \,\|\, Q) = \sum_{w} P(w) \log \frac{P(w)}{Q(w)} \qquad (2.19)$$

The authors represent PD and PS by the learned collection-specific distribution ϕc. Query information is not considered, although the authors present formal experiments on the DUC 2007 query-focused summarization task. The evaluation shows that using the distribution ϕc for sentence selection results in Rouge-2 scores comparable to those of state-of-the-art systems, albeit only when representing each latent topic as a distribution over word bigrams instead of word unigrams. This observation confirms that input bigram frequencies are good predictors of reference summary content. The authors do not further consider the learned subtopic distributions ϕck, but note that the incorporation of latent factors for subtopics results in a more precise word distribution for ϕc, such that the model using latent factors for subtopics outperforms a model not using subtopic representations.
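The selection objective in Equations 2.18 and 2.19 is intractable to optimize exactly over all sentence subsets, so it is usually approximated greedily. The following sketch illustrates the idea under simplifying assumptions (unigram counts, add-epsilon smoothing, greedy search); it illustrates the objective and is not Haghighi and Vanderwende's implementation:

```python
import math
from collections import Counter

def kl_divergence(p_counts, q_counts, vocab, eps=1e-6):
    """KL(P||Q) = sum_w P(w) log(P(w)/Q(w)), with add-epsilon smoothing."""
    p_total = sum(p_counts.values()) + eps * len(vocab)
    q_total = sum(q_counts.values()) + eps * len(vocab)
    kl = 0.0
    for w in vocab:
        p = (p_counts[w] + eps) / p_total
        q = (q_counts[w] + eps) / q_total
        kl += p * math.log(p / q)
    return kl

def greedy_klsum(sentences, max_words=250):
    """Greedily add the sentence that most reduces KL(P_D || P_S)."""
    collection = Counter(w for s in sentences for w in s.split())
    vocab = set(collection)
    summary, summary_counts, length = [], Counter(), 0
    candidates = list(sentences)
    while candidates and length < max_words:
        best = min(candidates, key=lambda s: kl_divergence(
            collection, summary_counts + Counter(s.split()), vocab))
        summary.append(best)
        summary_counts.update(best.split())
        length += len(best.split())
        candidates.remove(best)
    return summary
```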

2.6 Subsentential content units

The identification of subtopics in a document collection results in a representation that captures similarities and differences of source content at the level of global text structure. At the other end of the spectrum, many summarization systems employ comparisons of source content at the level of lexical elements. For example, summarizers identify similar words to improve frequency estimates. Somewhere in the middle we find passage-level similarities, which are often simply viewed as a function of a passage's lexical elements and other features. For instance, sentences can be similar or dissimilar on the basis of their lexical overlap. However, such a passage-level granularity of comparison is unsatisfactory because sentences generally contain different pieces of information, some of which may be important to include in a summary, whereas others may be left out [HLZ05].

A better granularity for measuring content similarity is provided by semantic content units [NPM07]. Content units are defined as subsentential units of text, no bigger than a sentential clause, that relate a specific piece of information. They can be considered as corresponding to atomic units of meaning [NP04], e.g., “an airplane crash happened off the coast of Nova Scotia.” A content unit may be as small as a modifier of a noun phrase or as large as a clause [NPM07].

As a consequence of the above definition, a sentence can contain multiple content units, and sentences from different source documents can repeat and combine content units in different ways. The following example, taken from [RJB00], illustrates how various content units, expressed with separate sentences in one news article (1–3), are combined into a single sentence (4) in another news article:

1. “Security forces found the mass grave on Wednesday at Chbika, near Djelfa, 275 kilometers (170 miles) south of the capital.”


2. “The victims included women, children and old men.”

3. “Most of them had been decapitated and their heads thrown on a road, reported the Es Sahafa.”

4. “Police found the decapitated bodies of women, children and old men, with their heads thrown on a road near the town of Jelfa, 275 kilometers (170 miles) south of the capital Algiers.”

Content units are intended to be similar to each other on the basis of expressing the same semantic content [NP04], regardless of the actual choice of words. This implies that different content units with the same meaning may use different words, different forms of the same words (inflectional or derivational variants), synonyms, different word order, and different syntactic structure to express this meaning [HNPR05]. However, a number of authors have observed that content units often share many words and even phrases [NP04, NV05, HNPR05, PNMS05].

The importance of content units is underlined by the fact that they are the basic unit of content comparison in many summarization evaluation schemes (see Section 1.4.1). Furthermore, content units that occur very frequently in input documents are also very likely to be found in human reference summaries [NV05]. This observation emphasizes the need to identify similar content units in order to improve frequency counts.

Various authors have therefore investigated methods for the identification of similar content at the level of such units. Most research thus far has focused on the definition of content unit similarity, and content units have usually been manually annotated in text due to the difficulties involved in identifying and matching them [LH02, vHT03, Voo03, SM03, TH04]. Manual annotation, however, again introduces issues related to human variability and inter-annotator agreement, and incurs a significant effort, e.g. when annotating multiple human reference summaries as well as a set of machine-generated summaries.

A syntactically motivated approach that could lead to the automatic identification of content units is presented by Hovy et al. [HLZ05]. The authors' intention is to find a summary evaluation metric which measures content overlap with a granularity “somewhere between unigrams and sentences”. The unit size they propose is called a Basic Element (BE), which corresponds either to the head of a major syntactic constituent or a triple expressing a relation between a head-BE and a single dependent. BEs are identified by constructing sentence parse trees and then applying a rule-based “cutting” of tree elements. They are matched to each other on the basis of lexical similarity, using lemmatization, WordNet synonymy information and distributional similarity information. Typically, parsing a single sentence results in many BEs, e.g. 'Two Libyans were indicted for the Lockerbie bombing in 1991' gives 'Libyans—NIL—NIL' (a head), 'Libyans—two—NIL' (a head and modifier), 'indicted—Libyans—ARG1', and so forth. BEs are thus much smaller than the Pyramid's summary content units, and different BEs need to be combined in order to create content units [HLZ05].

The phrasal intersections constructed by Barzilay et al. [BME99, BM05] are also syntactically motivated. The authors propose to find paraphrases of the same information in a set of closely related sentences. Paraphrase identification is based on manually crafted syntactic rules, and on a bottom-up alignment of sentence parse trees using lexical matching. The result of this syntactic matching is a subsentential intersection of the source sentences. The approach focuses on identifying a maximal intersection of parse trees and relies on lexical identity of sentence constituents to match tree nodes. It requires as input a set of closely related sentences [MBE+01], and does not consider content unit similarity or combination across “different” sets of sentences.

Harnly et al. [HNPR05] are the first to address an automation of the Pyramid evaluation. Due to the difficulty of creating a Pyramid, they focus on automatically scoring new summaries against an existing Pyramid, i.e. on identifying text spans in an unannotated summary that correspond to predefined content units. The authors match and score contiguous word sequences from source sentences to SCUs using different measures of lexical overlap. Candidate text spans are created by enumerating all possible contiguous word sequences in a sentence, i.e. a sentence with words (w1, w2, w3) produces the sequence set {w1, w2, w3, w1w2, w2w3, w1w2w3}. In a next step, the algorithm selects the subset of non-overlapping spans that maximizes the overall Pyramid score of the new summary. In their experiments, the authors compare these automatically determined Pyramid scores with manual Pyramid scores for a set of 18 summaries. They find a very high Pearson correlation when using a simple unigram overlap similarity for matching text spans to SCU contributors. Unfortunately, the authors do not discuss the quality of the identified text spans, or present a comparison of these with the matched summary content units. The correlation of scores just tells us that the automated scores of different summaries exhibited the same scoring tendencies as the manual scores of the same summaries.
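The candidate enumeration step is easy to make concrete; a minimal sketch (the function name is ours):

```python
def contiguous_spans(words):
    """Enumerate all contiguous word sequences of a sentence, e.g.
    [w1, w2, w3] -> [w1], [w1 w2], [w1 w2 w3], [w2], [w2 w3], [w3]."""
    return [words[i:j]
            for i in range(len(words))
            for j in range(i + 1, len(words) + 1)]

print(contiguous_spans(["w1", "w2", "w3"]))
# [['w1'], ['w1', 'w2'], ['w1', 'w2', 'w3'], ['w2'], ['w2', 'w3'], ['w3']]
```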

The authors also report that they first implemented a method for producing candidate text spans that selected subtrees from a sentence's dependency parse tree. This was motivated by the observation that the overwhelming majority of SCU contributors chosen by humans are in a single subtree of a dependency tree. However, the authors found that this local and syntactic method did not yield contributors that are very similar to those chosen by human annotators, as it did not capture semantic similarities.

2.7 Conclusion

In this chapter, we presented a comprehensive overview and discussion of previous approaches to automatic text summarization. As abstractive summarization remains an elusive goal, the majority of summarizers opt for the simpler and more robust strategy of passage extraction. This allows systems to focus on solving problems related to the analysis of natural-language source texts, and on finding ways to characterize source content importance. Systems vary widely in their level of linguistic analysis, which ranges from plain word and sentence tokenization to the application of sophisticated semantic and discourse parsing technologies. In contrast to earlier word vector-based representations, recent years have seen a trend towards more sophisticated models of source content, such as graph-based representations, distributional content models, and representations derived from a syntactic and semantic analysis of sentences. To determine content importance, many summarization approaches rely on word or concept vectors and tf-idf-type scoring, or derive similarity-based features by comparing vector representations of sentence, document and query content. Frequency has turned out to be one of the main indicators of source content importance, even though it does not explain all the choices of human summarizers. Machine learning methods have also been widely adopted for a range of subtasks involved in summary creation, with an emphasis on the application of unsupervised algorithms.

In multi-document summarization, the identification of similar source content has been recognized as an essential step of source content analysis. At the discourse level, current MDS datasets are characterized by a set of recurrent subtopics that structures the main theme of the document collection. Summarization approaches can be improved by incorporating subtopic representations, such that importance and coverage can be determined with respect to the overall theme and each subtopic. The vast majority of previously proposed solutions view the identification of subtopics as a sentence clustering problem. Clusterings are created on the basis of word vector representations and use standard IR similarity metrics. However, these strategies do not adequately address problems related to the ambiguity of natural language and lexical variation among authors, and neglect the sparsity of sentence vocabularies. The next part of this thesis therefore proposes three novel approaches for subtopic identification and representation that help to overcome these issues. Chapter 3 introduces a generic summarizer that incorporates features computed from a wide-coverage topic ontology, and uses lexical co-occurrence knowledge drawn from a large corpus of web documents. Chapter 4 then looks at subtopics from a distributional perspective, and proposes a probabilistic content modeling approach for query-focused multi-document summarization. Finally, Chapter 5 presents a hybrid probabilistic model that relaxes the bag-of-words assumption made by standard topic models, and merges unigram and bigram observations into a unified model. We will show that the incorporation of bigrams helps to reduce sparsity, resulting in more descriptive subtopic representations. All the proposed solutions perform favorably compared to the existing state-of-the-art.

In the final part of this thesis, we turn our attention to the problem of identifying semantic content units. Content units are an integral part of many summarization evaluation schemes, but can currently only be identified by human judgment. An automatic identification of content units is therefore highly desirable, both for summary evaluation as well as for summary construction. Chapter 6 presents an initial study on a corpus of news articles to evaluate the performance of a probabilistically motivated approach on the task of content unit identification. Encouraged by the results of this study, we extend our analysis in Chapter 7 to a large corpus of human reference summaries. This allows for a direct comparison of our approach with the Summary Content Units annotated for the Pyramid evaluation method, and leads to a better understanding of the structure of human reference summaries.


Part II

Content Modeling for Multi-Document Summarization


Chapter 3

Modeling subtopics with hierarchical ontologies

In this chapter, we present an approach to generic multi-document summarization which identifies subtopics in text using a wide-coverage topic ontology. The ontology we use is built from the hierarchically structured topics of the Open Directory Project (ODP)1 category tree. The ODP category tree encodes knowledge about the relations between broad topics such as “Arts” and “Sports”, and their associated subtopics, like for example “Movies” and “Television”. It is thus an instance of a taxonomy — a hierarchically structured classification scheme organized by supertype-subtype relationships. We map sentences to topics of this ontology, and extend a baseline summarizer which uses a standard set of sentence features with novel features derived from the topic mapping.

In our approach, each topic node of the ontology tree is augmented with lexical co-occurrence knowledge drawn from a very large corpus of web documents. We will show that this helps to overcome shortcomings of the vector space model related to lexical variability and the sparsity of sentence vector representations. Moreover, we find that the summarizer incorporating ontology-derived sentence features outperforms various baseline summarizers, and produces higher-quality summaries as measured by Rouge.

Ontology-based summarization is an active area of research. The TOPIC system presented in [HR99] utilized a hierarchical concept tree to score text passages. Verma et al. [VCL07] apply knowledge drawn from UMLS, a medical ontology,2 for query expansion and term selection in a query-focused summarizer. In recent work more closely related to our scenario, Wu and Liu [WL03] study a subtopic identification problem in the domain of business news. The approach presented in this chapter, in contrast, exploits hierarchical coarse-to-fine information contained in a wide-coverage topic ontology, such that each sentence is associated with a structured set of topics and subtopics. The mapping is achieved using a levelwise classification scheme, which matches the bag-of-words representations of nodes with the word vector representations of sentences. After sentences are represented by subtrees in the ontology space, we can calculate different features and similarity measures in this space, and in addition compute relations between sentences based on graph properties of the subtrees. Moreover, while the algorithms described in [HR99, WL03] require manual intervention, our scheme is fully automatic.

1 http://www.dmoz.org
2 Unified Medical Language System, available at www.nlm.nih.gov/research/umls

The remainder of this chapter is structured as follows: The first section describes the ontology, the ontology's augmentation with lexical knowledge, and the classifier used to map sentences to the ontology (Section 3.1). Section 3.2 presents our approach to extractive summarization, which combines common sentence features with novel features derived from the ontology-based representation of sentences. Features are combined by training a Support Vector Machine (SVM) classifier on the task of identifying summary-worthy sentences. We evaluate the performance of our summarization system on the task of generic multi-document summarization (Section 3.3).

3.1 An ontology of topics

Available summarization datasets typically consist of news articles that cover a wide range of different domains and topics (see Section 1.4.2). As news articles in general are written for a broader, non-expert audience, they do not require technical or very specific domain knowledge. We therefore choose to model topics and subtopics using an ontology that addresses many different knowledge domains, and at the same time captures topical structures on a very general, non-technical level. The database of knowledge collected by the Open Directory Project is a suitable candidate for the topic structures we are interested in. Figure 3.1 shows an illustrative fragment of the ODP topic tree.

The ontology we consider is strictly hierarchical, and can be formally described as a tuple O := (T, E), where T = {t1, t2, . . . , tm} is a finite set of nodes or topics, and E = {e1, e2, . . . , en} is a finite set of edges between the nodes. Each topic t ∈ T has a unique parent corresponding to a more general topic. We define C(t) to be the set of children of node t, and p(t) the parent node of t. The ontology can therefore be more strictly defined as a taxonomy.

Figure 3.1: Illustration of a hierarchical topic tree. Edges between nodes indicate topic-subtopic relations, with broader topics positioned at the top of the hierarchy, and leaves corresponding to more specific subtopics. The example ontology is derived from the category tree of the Open Directory Project.

Following [WAB+07], we associate each node t with a representative bag-of-words B(t) = {w1, w2, . . . , wBt}, where wi is the ith word. Given a vocabulary V, which is the union of the words wi from all bags-of-words B(t), each node t can be represented as a weighted word vector v(t) = (f(w1), f(w2), . . . , f(wV)), where f(wi) is the tf-idf weight of word wi for node t. We calculate the tf-idf weight of word wi in topic t as

$$f_t(w_i) = (1 + \log n(w_i, t)) \cdot \log\big(|T| \,/\, |\{t : w_i \in t\}|\big) \qquad (3.1)$$

where |{t : wi ∈ t}| is the number of topics (nodes) containing word wi, |T| is the total number of nodes, and n(wi, t) is the frequency of word wi in topic t. The vector v(t) of a particular topic t thus captures lexical knowledge about words strongly associated with this topic.

Subsequently, in order to strengthen the hierarchical structure of the tree, we optimize all bags-of-words B(t) of the ontology. To this end, we propagate feature distributions from leaf nodes to parent nodes by recursively aggregating children feature weights to the parent [WAB+07]:

$$f'_t(w_i) = f_t(w_i) + \sum_{c \in C(t)} f_c(w_i) \qquad (3.2)$$

where fc(wi) is the weight of word wi in child category c. Starting with the nodes above the leaves, this is repeated in a bottom-up manner until the root node is reached.
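A compact sketch of this propagation step (Equation 3.2); the Node structure and sparse weight dictionaries are assumptions of the sketch, not the exact implementation:

```python
class Node:
    def __init__(self, weights=None, children=None):
        self.weights = weights or {}      # word -> tf-idf weight
        self.children = children or []

def propagate_weights(node):
    """Recursively add children's feature weights to the parent (Eq. 3.2),
    proceeding bottom-up from the leaves to the root."""
    for child in node.children:
        propagate_weights(child)
        for word, weight in child.weights.items():
            node.weights[word] = node.weights.get(word, 0.0) + weight
```

In the actual system, each parent vector is additionally re-normalized to unit length after each propagation step (see Section 3.1.1).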

In the next section, we will describe our approach for associating nodes of the ODP category tree with lexical knowledge. Subsequently, we will introduce an efficient algorithm for mapping sentences to the ontology space.

3.1.1 Populating the taxonomy

In order to associate each topic with a descriptive bag-of-words, we propose to exploit lexical knowledge derived from a very large corpus of web documents. To this end, we query the Yahoo! Search web service3 once for each topic node to retrieve a set of topic-relevant training documents. We use the topic names of the ontology to construct the query string by concatenating the label assigned to each node and the label of the parent node. For instance, for the node “Artificial Intelligence” in the “Computers” branch of the ODP category tree, the query string would be “Computers Artificial Intelligence”.

3 http://developer.yahoo.com/search/

We limit the search results to English web pages only by applying the “format” and “language” filters provided by the API. The web service returns a ranked list of the most relevant web pages and their URLs. The first N websites of the result set are downloaded and processed by removing all HTML tags in order to extract all contained terms. We then remove stop words, and stem terms with the Porter algorithm [Por80].

The initial feature vector for each node is constructed by calculating tf-idf word weights as described above, using all the terms of the retrieved training documents. We normalize each node's feature vector to the Euclidean norm of 1.0. Subsequently, feature weights are propagated from leaf nodes to parent nodes following the procedure outlined in the previous section, which ensures that the structural information given by the ontology is represented in the feature vector space. We also normalize the resulting parent feature vectors to unit length after each propagation step, as otherwise nodes with many children would be overestimated.

Ontology population is performed as an offline preparation step, and the only human supervision needed for our algorithm is the specification of the taxonomy.

3.1.2 Mapping sentences to the ontology

In the next step, we utilize the ontology model to map sentences to subtrees of the ontology space, such that each sentence is associated with a structured set of subtopics. The mapping algorithm decides which topics to assign to a given sentence. This is achieved using a simple hierarchical algorithm, which only requires a similarity measure to match the bag-of-words representation of nodes with the word vector representations of sentences. In our approach, we employ the cosine similarity measure described in Section 2.2.

Input: sentence s ∈ S to categorize, ontology O, similarity measure sim, branching parameter α
Output: corresponding subtree Os ⊂ O categorizing sentence s

/* Find the subtree of the ontology O that best categorizes a given sentence s iteratively from top to bottom */

1. Set the root node as the current node.
2. Compute the similarity of the given sentence s to all child nodes of the current node.
3. Compute the mean µ and the standard deviation σ of the resulting similarities.
4. Consider all nodes {tk} in the current set of children for which sim(s, tk) > µ + ασ, where α is a fixed parameter.
5. If {tk} = ∅, choose the most similar child as {tk}.
6. If max(sim(s, tk)) < sim(s, current node), add the current node and its ancestors as valid tags and stop.
7. For each node tk in the set {tk}: if tk is a leaf node, add tk and all ancestors to the list of valid topics; else set tk as the current node and continue with step 2.

Figure 3.2: Algorithm for mapping a sentence represented as a word feature vector to a subtree of the topic ontology (adapted from [WAB+07]). All nodes of the resulting subtree are considered valid sentence subtopics.

Before mapping sentences to nodes of the ontology, we preprocess each sentence in the same way we did for the training documents, extracting terms, removing stop words, and performing stemming. The levelwise classification follows the method presented in [WAB+07] and is described in Figure 3.2.

Starting at the root node, the algorithm computes the similarity of a sentence to all child nodes, then determines the mean µ and standard deviation σ of the resulting similarities, and selects all nodes for further exploration whose similarity to the sentence s satisfies sim(s, tk) > µ + ασ. The parameter α determines the branching behavior. Setting it to a very high value makes the algorithm choose only a single path. If the maximum similarity of a child is lower than the current node's similarity to the sentence, or if a leaf node has been reached, the algorithm stops. A sentence is therefore not necessarily classified to a leaf node, but may be assigned to an internal node. The algorithm in Figure 3.2 is robust because of its top-to-bottom iterative nature and usage of the ontology's hierarchical structure. It solves a series of classification problems at each level of the tree with increasing difficulty but in a sense of decreasing importance [WAB+07].

The output of the algorithm is a subtree Os ⊂ O for each sentence s ∈ S, which can be serialized as a weighted topic vector ts = (t1, . . . , tk). The weight of each topic ti is its similarity to the input sentence s. The algorithm assigns all nodes of the subtree Os as valid topics, thus allowing for the matching of nodes sharing common subtrees.
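A sketch of the levelwise mapping of Figure 3.2; the sparse cosine similarity and the Node structure (reused from the propagation sketch above) are illustrative assumptions:

```python
import math
import statistics

def cosine(u, v):
    """Cosine similarity of two sparse word-weight dictionaries."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def map_sentence(sentence_vec, root, alpha=1.5):
    """Levelwise classification: returns {node: similarity} for all
    nodes of the assigned subtree (valid sentence subtopics)."""
    assigned = {}
    stack = [(root, cosine(sentence_vec, root.weights))]
    while stack:
        node, node_sim = stack.pop()
        assigned[node] = node_sim        # node and its ancestors are valid topics
        if not node.children:            # leaf reached: stop on this path
            continue
        sims = [(c, cosine(sentence_vec, c.weights)) for c in node.children]
        mu = statistics.mean(s for _, s in sims)
        sigma = statistics.pstdev([s for _, s in sims])
        selected = [(c, s) for c, s in sims if s > mu + alpha * sigma]
        if not selected:                 # step 5: fall back to the best child
            selected = [max(sims, key=lambda cs: cs[1])]
        if max(s for _, s in selected) < node_sim:
            continue                     # step 6: stop at the current node
        stack.extend(selected)           # step 7: descend into selected children
    return assigned
```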

Figure 3.3 shows a set of example sentences of a newspaper article mapped to topics from the Open Directory Project category tree, along with the confidence weights assigned by the mapping algorithm. Each sentence is assigned a subtree of the topic ontology. The sentence-specific subtrees are represented as weighted topic vectors, with the weights indicating the similarity of a sentence and the corresponding topic. The enclosing document is modeled as a topic vector that aggregates the topic vectors of its sentences by summing over all weighted vectors:

$$t_d = \sum_{s \in S_d} t_s \qquad (3.3)$$

The document-level topic weights thus give an indication of the dominant topics in a document.

3.2 Summarizing with ontology features

Our approach for producing a summary consists of three steps: First, we associate sentences with a representation in the ontology space. We then calculate a vector of features for each sentence, combining well-known state-of-the-art features with novel features derived from the ontology-based representation of sentences. Subsequently, we approach the problem of sentence extraction as a supervised classification task, similar in spirit to the approaches described e.g. by [KPC95, LZX+09]. We train a Support Vector Machine (SVM) classifier on the task of identifying summary-worthy sentences, given a corpus of source documents and corresponding summaries. The classification values computed by the SVM are used to rank summary-worthy sentences. From this ranked list, the summarizer selects sentences until a predefined summary length is reached. We consider the ranking as our sole source of information for summary construction, and do not penalize sentences for redundancy.


Figure 3.3: Illustration of a news article, where each sentence is assigned a subtree of a topic ontology derived from the Open Directory Project category tree. The sentence-specific subtrees are represented as weighted topic vectors, with the weights indicating the similarity of a sentence and the corresponding topic. The enclosing document is modeled as a topic vector that aggregates the topic vectors of its sentences.


In the following subsections, we will first describe the set of features utilized to represent each sentence, and introduce in particular the novel features derived from the ontology representation of sentences. Subsequently, we will present the supervised learning scheme.

3.2.1 Sentence features

We preprocess documents to identify sentence boundaries, and represent each sentence as a vector of word values w = (w1, . . . , wm). We remove stop words and apply stemming, as outlined above, using the NLTK toolkit.4 On the basis of this representation, we calculate several common sentence features.

Following [MB98], we compute the average tf and average tf-idf scores of the weighted word vector of each sentence. The tf-based weight of a word wi in sentence s is given by

$$f_{tf}(w_i) = 1 + \log n(w_i, s) \qquad (3.4)$$

where n(wi, s) is the frequency of occurrence of word wi in sentence s. The tf-idf-based weight is computed analogously to Equation 3.1, substituting the inverse sentence frequency for the inverse topic frequency in the second part of the equation:

$$f_{tf\text{-}idf}(w_i) = (1 + \log n(w_i, s)) \cdot \log\big(|S| \,/\, |\{s : w_i \in s\}|\big) \qquad (3.5)$$

We normalize both bag-of-words vectors to unit length before computing the average weights. To represent the parent document of a sentence, we construct a weighted word vector by aggregating the weighted word vectors of its sentences, and normalize the resulting vector to unit length. We then compute the cosine similarity of the tf-idf-weighted document and each sentence vector.

Our other features are based on the structure of a document: Each sentence is assigned binary features that indicate whether the sentence occurs in the first, second or last third of the document, as well as a real-valued position score linearly scaled to [0, 1], where fpos(s) = (|S| − pos(s))/(|S| − 1). Finally, to prefer sentences of medium length, we assign binary features indicating whether a sentence is shorter or longer, respectively, than fixed thresholds lmin and lmax. We organize a sentence's features in the form of a vector, such that each sentence is associated with a vector x = (x1, x2, . . . , xk) of feature values. Table 3.1 summarizes the set of baseline sentence features.

4 http://www.nltk.org


Feature          Description
avg. tf          Average tf score of the bag-of-words ws of sentence s
avg. tf-idf      Average tf-idf score of the bag-of-words ws of sentence s
cos(ws, wd)      Cosine similarity of the tf-idf-weighted word vectors of sentence s and its parent document d
sent-in-first    1 if sentence occurs in first part of document, 0 else
sent-in-second   1 if sentence occurs in second part of document, 0 else
sent-in-third    1 if sentence occurs in third part of document, 0 else
sent-pos         Real-valued position score
sent-min         1 if length(s) > lmin, 0 else
sent-max         1 if length(s) < lmax, 0 else

Table 3.1: Overview of sentence features used in the baseline generic multi-document summarizer.
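For illustration, the features of Table 3.1 can be computed as follows. Tokenized, stemmed sentences, a precomputed idf dictionary, and 1-based sentence positions are assumptions of this sketch; the document-similarity feature cos(ws, wd) and the unit-length normalization of the weight vectors are omitted for brevity:

```python
import math

def baseline_features(words, pos, num_sents, idf, l_min=4, l_max=15):
    """Compute the baseline sentence features of Table 3.1.
    words: stemmed content words; pos: 1-based sentence position."""
    tf = {w: 1 + math.log(words.count(w)) for w in set(words)}
    tfidf = {w: v * idf.get(w, 0.0) for w, v in tf.items()}
    n = max(len(tf), 1)
    rel = (pos - 1) / num_sents          # fraction of the document preceding s
    return {
        "avg_tf": sum(tf.values()) / n,
        "avg_tfidf": sum(tfidf.values()) / n,
        "sent_in_first": 1.0 if rel < 1 / 3 else 0.0,
        "sent_in_second": 1.0 if 1 / 3 <= rel < 2 / 3 else 0.0,
        "sent_in_third": 1.0 if rel >= 2 / 3 else 0.0,
        "sent_pos": (num_sents - pos) / (num_sents - 1) if num_sents > 1 else 1.0,
        "sent_min": 1.0 if len(words) > l_min else 0.0,
        "sent_max": 1.0 if len(words) < l_max else 0.0,
    }
```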

Ontology-derived sentence features. For our ontology-based summarizer, we add a range of novel features derived from the representation of sentences in the topic ontology. After mapping sentences to the ontology using the algorithm outlined in Section 3.1.2, we arrive at a representation of sentences in the ontology space, as given by the weighted topic vector ts = (t1, . . . , tk) for each sentence s ∈ S. If a sentence is mapped to multiple topic paths in the ontology, we include all nodes from every path. For example, in Figure 3.3 the first sentence is assigned the paths News/Weather and Regional/Caribbean.

We utilize the topic vector representation to compute several different thematic and graph-based sentence features. Our first feature estimates how well the sentence represents the overall content of the document, i.e. it is a measure of the centrality of the sentence. Analogously to the previously introduced word vector cosine similarity feature, we calculate the cosine similarity cos(ts, td) of the topic vectors of sentence s and its enclosing document d. Again, all vectors are normalized to unit length before computing similarities. The entries of td with the highest weights can be interpreted as the main topics of a document, as shown in Figure 3.3. We also compute an average topic score for each sentence, similar to the average tf scores mentioned above, from the weighted topic vector ts.

Feature        Description
cos(ts, td)    Cosine similarity of weighted sentence and document topic vectors
depth(s)       Depth of the most specific node assigned by the hierarchical mapping algorithm
pathcount(s)   Number of topic paths assigned by the hierarchical mapping algorithm
avg ts         Average topic score of the bag-of-topics ts of sentence s

Table 3.2: Overview of ontology-derived sentence features used for generic multi-document summarization.

In addition, we exploit the structural information provided by the ontology to calculate sentence features. We compute the maximum depth of any tree path assigned to a sentence. This can be viewed as a measure of the specificity of a sentence: If a sentence is assigned to a leaf node of a certain depth, it is assumed to contain more specific information than a sentence that is classified to a higher-ranked internal node. Furthermore, we compute the number of tree paths assigned to a sentence, which can be interpreted as a measure of the quantity of a sentence's information content. Table 3.2 lists the features derived from the ontology.
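Given the output of the mapping algorithm, the features of Table 3.2 reduce to simple aggregations; a sketch that assumes each sentence carries its assigned topic paths and weights, and reuses the cosine() helper from the mapping sketch above:

```python
def ontology_features(topic_weights, topic_paths, doc_topic_weights):
    """topic_weights: {topic_id: similarity} for one sentence;
    topic_paths: root-to-node paths assigned by the mapping algorithm;
    doc_topic_weights: document topic vector aggregated as in Eq. 3.3."""
    return {
        "cos_topic": cosine(topic_weights, doc_topic_weights),
        "depth": max((len(p) for p in topic_paths), default=0),
        "pathcount": len(topic_paths),
        "avg_topic_score": (sum(topic_weights.values()) / len(topic_weights)
                            if topic_weights else 0.0),
    }
```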

3.2.2 Learning summary-worthy sentences

We consider the problem of sentence extraction as a classification task, given a corpus of source documents and corresponding summaries. The goal of sentence classification is to predict for each source sentence if it is summary-worthy, i.e. whether it should be included in a summary or not. Training a classifier on this task provides a principled method for selecting useful sentence features, and for choosing an optimal weight combination [KPC95].

Classification is a supervised learning method (see Section 2.2.3). We follow [HIMM02] in using a Support Vector Machine (SVM) classifier [Vap95]. Support Vector Machines are robust even when the number of features is very large, and have shown good performance for a range of classification and regression problems [Bis07]. An important property of SVMs is that the determination of the model parameters corresponds to a convex optimization problem, and so any local solution is also a global optimum. For more details on SVMs, see for example [Vap95, Bis07]. For our experiments, we use the SVM-light implementation provided by [Joa99].

In order to train the SVM classifier, we construct a training data set X from the set of sentence feature vectors {x}. We assign a positive label if the sentence is deemed summary-worthy according to a human summarizer, and a negative label if the sentence has not been selected by any human summarizer. The training data X can thus be formalized as follows:

$$X = \{(x_i, y_i) \mid x_i \in \mathbb{R}^p,\ y_i \in \{-1, 1\}\}_{i=1}^{n} \qquad (3.6)$$

where xi is the feature vector of the ith training sentence, and yi is the class label assigned to vector xi.

During training, the SVM classifier determines a separating hyperplane between positive and negative examples. During testing, the feature vectors of new sentences are classified as either positive (yi ≥ 0) or negative (yi < 0), i.e. as summary-worthy or not. The SVM implementation we employ assigns a real-valued classification score to each test feature vector, which can be used to order sentences.5

5 Although SVM classification scores are not really interpretable in terms of an absolute value, the set of scores produced by the SVM for a test set of feature vectors can be used to rank the feature vectors. A higher positive score corresponds to a larger margin to the nearest hyperplane, and may be seen as an indication that the classifier is more confident in assigning the object represented by this feature vector to the class of positive examples, whereas a score closer to 0 indicates that a particular feature vector is less well assignable to the class of positive examples.
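The system itself uses SVM-light; purely for illustration, the same training and ranking scheme can be sketched with scikit-learn, where the class_weight argument plays roughly the role of SVM-light's J parameter (these mappings are assumptions of the sketch):

```python
import numpy as np
from sklearn.svm import SVC

def train_and_rank(X_train, y_train, X_test):
    """Train an RBF-kernel SVM on labeled sentence feature vectors
    (labels in {-1, +1}) and rank test sentences by decision value."""
    clf = SVC(kernel="rbf", C=8.0, class_weight={1: 6.0, -1: 1.0})
    clf.fit(X_train, y_train)
    scores = clf.decision_function(X_test)   # signed distance to the hyperplane
    ranking = np.argsort(-scores)             # most confident positives first
    return scores, ranking
```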

3.2.3 Summary construction

To create a summary, we collect all sentences classified as summary-worthy. Sentences are then ranked according to the classification score assigned to them by the SVM classifier, and extracted until the number of words in the summary reaches a predefined threshold. We then reorder sentences according to their position in the source documents, presenting sentences from earlier documents first.
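A minimal sketch of this construction step; the tuple representation of classified sentences is an assumption:

```python
def build_summary(ranked, max_words):
    """ranked: (score, doc_id, pos_in_doc, text) tuples for sentences
    classified as summary-worthy."""
    selected, length = [], 0
    for score, doc_id, pos, text in sorted(ranked, reverse=True):
        n = len(text.split())
        if length + n > max_words:
            break
        selected.append((doc_id, pos, text))
        length += n
    selected.sort()   # restore source order: earlier documents first
    return " ".join(text for _, _, text in selected)
```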

3.3 Experiments: Generic multi-document summarization

For a quantitative performance evaluation of our summarization approach, we conduct a series of experiments. To this end, we construct a topic ontology from the Open Directory Project category tree. We prune the full category tree to a set of 1036 nodes, encompassing all of the top-level categories and their immediate children. We exclude the first-level “World” branch, as it consists mostly of non-English content. In order to describe each topic node with a representative word vector, as outlined in Section 3.1.1, we harvest several million topic-related keywords from the top N = 20 HTML documents yielded by a Yahoo! search engine query. For the mapping algorithm, we set the α-parameter controlling the number of topics assigned to a sentence to α = 1.5 [WAB+07].

We evaluate our summarization approach on the DUC 2002 multi-document summarization dataset, as this dataset contains extractive reference summaries. It is therefore possible to label source sentences as summary-worthy or not, as required for supervised learning methods. The dataset contains 59 clusters of topically related newspaper articles. We set the minimum length threshold of a sentence to lmin = 4 words, and the maximum length threshold to lmax = 15, counting only content words.

We then train an SVM classifier with a radial basis function kernel [Joa99] to classify sentences for extractive summarization, labeling all sentences from the human-created extracts as positive examples, and all other sentences as negative examples. The sentence feature vectors are scaled such that all feature values are in [0, 1]. Since there are many more negative examples than positive examples in the training data, we set the SVM parameter J, which controls the weighting of training errors on positive examples, to 6, aiming at a higher recall of human-extracted sentences. We set the parameter C, controlling the trade-off between fit to the training data and model generalization, to 8. Both parameters were minimally tuned on a small subsample of our dataset. We then train two different classifiers:

Standard This classifier uses the standard feature set listed in Table 3.1.

Standard+Ontology For this classifier, we extend the standard feature vector of each sentence with the ontology-derived features listed in Table 3.2.

We perform leave-one-out cross validation, with sentences from one of the 59 document clusters in turn constituting the test set, and sentences from the remaining 58 document clusters being used as training data. The summary is constructed from sentences of the test cluster.

Classifier evaluation measures Given a set of sentences S′ labeled as summary-worthy by the classifier, we measure recall as the fraction of truly summary-worthy sentences contained in this set compared to the full set of sentences S+ that are actually summary-worthy according to the human reference summaries:

$$R = \frac{|S' \cap S^+|}{|S^+|} \qquad (3.7)$$

Precision, on the other hand, measures the quality of S′, i.e. the fraction of actually summary-worthy sentences compared to the set of sentences classified as positive instances S′:

$$P = \frac{|S' \cap S^+|}{|S'|} \qquad (3.8)$$

For micro-averaged precision and recall values, we compute the mean precision and recall values across individual sentence classification results (i.e. S+ corresponds to the union of positive training examples across all 59 document clusters). Since we perform leave-one-out cross validation per topical cluster of documents, we can also average precision and recall values across the 59 individual results, which is known as macro-averaging. Macro-averaging equally weights the different document clusters tested during cross-validation, whereas micro-averaging equally weights the sentences, thus favoring common sentence feature vectors.

We also compute the F1-measure, which combines precision and recall into a single value:

$$F_1 = \frac{2 \cdot P \cdot R}{P + R} \qquad (3.9)$$

The subscript 1 indicates a balanced weighting of recall and precision; this measure is also known as the harmonic mean of precision and recall.
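The difference between the two averaging schemes is easy to make concrete; a sketch assuming per-cluster sets of predicted (S′) and reference (S+) sentence identifiers:

```python
def prf(pred, gold):
    """Precision, recall and F1 of predicted vs. reference sentence sets."""
    tp = len(pred & gold)
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

def macro_micro(clusters):
    """clusters: list of (predicted_ids, gold_ids) pairs, one per cluster."""
    per_cluster = [prf(pred, gold) for pred, gold in clusters]
    macro = tuple(sum(v) / len(per_cluster) for v in zip(*per_cluster))
    tp = sum(len(p & g) for p, g in clusters)       # pooled counts
    n_pred = sum(len(p) for p, _ in clusters)
    n_gold = sum(len(g) for _, g in clusters)
    micro_p = tp / n_pred if n_pred else 0.0
    micro_r = tp / n_gold if n_gold else 0.0
    micro_f = (2 * micro_p * micro_r / (micro_p + micro_r)
               if micro_p + micro_r else 0.0)
    return macro, (micro_p, micro_r, micro_f)
```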

3.3.1 Results and Discussion

SVM classifier performance

We first evaluate the effects of our novel ontology-based sentence features on the performance of the SVM classifier. Table 3.3 compares the micro- and macro-averaged precision, recall and F1 results of a classifier trained on the standard feature set with a classifier trained on an extended feature set using the ontology-derived sentence features.

We observe that adding features from our topic ontology representation of sentences improves both micro- and macro-averaged F1 scores. In particular, we find a statistically significant increase in recall, both micro- and macro-averaged, when including ontology-based features.6 Although precision is slightly lower in both cases, the increase in recall results in higher F1 scores.

6 The differences in recall are statistically significant as measured by a Wilcoxon rank sum test with α = 0.01.

Features            avg. by   Precision   Recall   F1
Standard            macro     0.294       0.304    0.299
                    micro     0.266       0.287    0.276
Standard+Ontology   macro     0.274       0.380    0.319
                    micro     0.250       0.362    0.296

Table 3.3: Micro- and macro-averaged precision, recall and F1 values of a Support Vector Machine classifier trained on the task of sentence extraction. The table compares a classifier trained on a standard set of sentence features with a classifier trained on sentence features derived from a hierarchical ontology of topics.

Summarization performance

We evaluate the summaries constructed with our approach using the well-known Rouge measure (see Section 1.4.1). Rouge metrics are recall-oriented and based on n-gram overlap. Higher values indicate a higher overlap with the content of human-created reference summaries. We report the performance of our summarizers for the widely used Rouge-1 (word overlap) and Rouge-2 (bigram overlap) measures. We use Rouge version 1.5.5, with the same parameter settings as in the official DUC evaluations, and implement jackknifing as in the official evaluation.7 All obtained results are also compared to a lead sentence summarizer (see Section 1.4.3).

7 The exact parameter settings are '-n 4 -w 1.2 -m -2 4 -u -c 95 -r 1000 -f A -p 0.5 -t 0'. See also the Rouge package documentation at http://berouge.com (visited May 3rd, 2011) and http://www-nlpir.nist.gov/projects/duc/duc2007/tasks.html (visited May 3rd, 2011) for a description of the parameters and an explanation of why to use jackknifing.

Table 3.4 compares the Rouge scores of the evaluated summarizers for summaries which are 200 words long. We observe that both Rouge-1 and Rouge-2 scores increase significantly when adding the ontology-based features, as compared to a summarization system using only the standard set of features. Both systems outperform the lead sentence baseline, but whereas the Rouge-2 scores of the standard summarizer are not significantly better than those of the lead system, the system using ontology features exhibits a considerable increase in Rouge-2 scores.

Summarizer          Rouge-1 F1   Rouge-2 F1
Lead                0.3986       0.1604
Standard            0.4386       0.1657
Standard+Ontology   0.4636       0.2040
Best DUC 2002       0.50583      0.26765

Table 3.4: Comparison of Rouge scores for 200-word summaries of different summarizers. The novel ontology-based features lead to a significant improvement of Rouge scores.

The table also shows the Rouge scores of the best participating system (Gistexter) in DUC 2002 [HL02]. Although our Rouge results are lower than those of [HL02], we note that our hierarchical classifier is not trained on previous DUC data, and requires much less human effort for preparing knowledge resources than the IE-style top-performing system (see the discussion of the Gistexter system in Section 2.3.3). Our summarizer thus trades accuracy for flexibility and generality (i.e. a transfer to other topic domains). Also, since it is trained offline, the actual mapping of sentences to nodes of the ontology is very efficient, which is useful for an online summarization system.

Summarizer          Rouge-1 F1   Rouge-2 F1
Lead                0.5103       0.2488
Standard            0.5325       0.2632
Standard+Ontology   0.5716       0.3143
Best DUC 2002       0.59064      0.34972

Table 3.5: Comparison of Rouge scores for 400-word summaries of different summarizers.

Table 3.5 shows the Rouge scores of the same summarizers for 400-word summaries. The results confirm the observations made for 200-word summaries. In particular, the improvement over the summarizer using the standard feature set is much more pronounced when considering Rouge-2, and the difference to the top-performing system is smaller than for 200-word summaries. This suggests that the ontology-based features indeed support the recall of summary-worthy sentences, and as more sentences can be included in a longer summary, the Rouge scores of the ontology-based summarizer increase relative to the baseline summarizer.


3.4 Conclusion

Recognizing similar information is an important step in constructing multi-document summaries. One recurrent observation in automatic text summarization, especially of newswire material, is the repeated occurrence of domain-specific subtopics throughout the collection of documents to be summarized. An identification of these subtopics helps to determine similar information, and can be used to guide an extractive summarization system.

This chapter presented an approach for subtopic identification that maps sentences to nodes of a hierarchical topic ontology. The topic ontology we exploit is built from the hierarchically structured topics of the Open Directory Project category tree, and thus provides a wide coverage of different domains suited to the purpose of news article summarization. In order to augment the topical nodes of the ontology with lexical knowledge, we described an automatic approach for associating each topic node with a descriptive bag-of-words. Our approach uses search engine queries to harvest millions of topic-related words, and represents structural information of the hierarchical ontology by propagating feature distributions from leaf nodes to parent nodes. We then introduced an efficient algorithm for mapping sentences to a subtree of the ontology, thus associating sentences with a set of subtopics.

Subsequently, we considered the task of generic multi-document summarization. We treat sentence extraction as a classification task of identifying summary-worthy sentences. To this end, we represent sentences as feature vectors using a standard set of features common in summarization research. We augment these feature vectors with novel features, derived from the sentences' representation in the ontology space, that capture topical as well as structural properties of the ontology.

Our experimental evaluations show that the features derived from the ontology representation significantly improve the classification accuracy, when compared to a classifier trained only on the standard set of features. In particular, we observe that the recall of summary-worthy sentences increases. This improvement in recall is mirrored in the evaluation of the constructed summaries using the Rouge measure. Our summarizer evaluations are conducted on the DUC 2002 multi-document summarization dataset, for summaries that are 200 and 400 words long, respectively. We observe that for both summary sizes, Rouge-1 and Rouge-2 scores of the summarizer exploiting ontology-derived features improve significantly, and vastly outperform a lead baseline and the summarizer using a set of standard, well-known sentence features.

The approach for subtopic identification presented in this chapter requires the manual definition of a hierarchical topic ontology. Furthermore, the lexical representation of ontology nodes relies on the bag-of-words assumption, and therefore cannot handle linguistic aspects such as synonymy and polysemy. In the next chapter, we will present an unsupervised, fully-automatic approach for subtopic identification that further reduces the amount of human effort required. In addition, the presented method exploits lexical co-occurrence information in order to better capture distributional aspects of word meaning, with strong benefits for subtopic identification and summarization.


Chapter 4

A probabilistic approach to content modeling

In the previous chapter, we approached the problem of subtopic identification by relying on an externally defined, hierarchical topic ontology. The construction of this ontology requires human effort, and an extension of the topic hierarchy to more specific, fine-grained subtopics may become prohibitively expensive. A transfer of the ontology to other languages or knowledge domains will also incur additional effort. Yet another limitation of the ontology-based approach is that the construction of the topic ontology does not take into account the observed content structure of the documents to be summarized. Instead, prior knowledge is encoded in a fixed, static structure, and during summarization, source passages can only be mapped to existing topics.

In this chapter, we present an unsupervised, fully-automatic approach for modeling the content of multi-document summarization datasets. Our approach aims to identify subtopics from the content of source documents alone by means of an analysis of recurrent word distribution patterns. As discussed in Section 2.5, such word patterns often characterize particular types of discourse, and have been used by different researchers to determine domain-specific content structures of texts. Of course, the applicability of our approach relies on the existence of identifiable word patterns. As mentioned in Chapter 1, multi-document summarization datasets typically consist of news articles which reiterate various subtopics centered around a main theme. As a news story evolves, more recent articles will summarize previous reports, and repeat similar background matter. Consequently, news article authors will tend to re-use and repeat text passages, phrases and specific word patterns. In addition, news writing is highly similar even across different newspaper organizations, as the use of domain-specific topic structures facilitates readers' comprehension and recall [Bar32, BL04].

In our approach, we investigate the utility of probabilistic topic models, such as probabilistic latent semantic analysis (PLSA, see Section 2.4), for capturing subtopics and content structures of MDS document collections. Our approach infers the semantics of words based on lexical co-occurrence information and distributional context, and does not require manually constructed lexico-semantic knowledge resources. It exploits recurrent word patterns to derive the subtopic structure of multiple texts in a domain- and language-independent, unsupervised fashion. The identification of subtopic structures helps to overcome common problems related to word ambiguity, synonymy, and the sparseness of the original word vector space when estimating the similarity of text passages. Similar benefits are accrued by mapping user queries, as employed in query-focused summarization, to the latent topic space constructed by our model. The originally small set of query words is expanded with semantically related words that are derived from the latent factors. This form of relevance feedback can considerably increase the query's similarity with relevant source sentences [DM06].

We apply our probabilistic topic modeling approach to the problem of query-focused multi-document summarization. Unlike previous work exploring the use of latent topic modeling for extractive text summarization [GL01, DM06, SPKJ07, BSIM08, HV09], our model utilizes the identified latent topics to represent sentences, queries, and documents as probability distributions. On the basis of this representation, we compute various thematic and query-focused sentence features, as well as a redundancy measure similar to Maximal Marginal Relevance, in order to estimate the summary-worthiness of sentences.

Our system differs from previous approaches in three ways: First, we investigate PLSA in the context of query-focused multi-document summarization, modeling topic distributions across documents and taking into account information redundancy. Second, we do not only pick sentences from topics with the highest likelihood in the training data as in [GL01, SPKJ07, BSIM08], but compute a sentence's score based on a linear function of query-focused and thematic features. Third, we examine how a PLSA model can be used to represent documents, sentences and queries in the context of multi-document summarization, and investigate which measures are most useful for computing similarities in the latent topic space. We evaluate our approach on the data sets of the DUC 2006 and DUC 2007 text summarization challenges, and show that the resulting summaries receive very competitive Rouge scores, when compared with those of existing state-of-the-art summarization systems.

This chapter is organized as follows: We first describe our approach for


utilizing a probabilistic latent topic model to discover semantic subtopic structures in a collection of related documents (Section 4.1). Then, we give details of our summarization system and the sentence features we compute to estimate sentence summary-worthiness in Section 4.2. In Section 4.3, we present experimental results showing that our approach leads to considerable improvements over different baseline systems, and that overall scores compare very favorably with those of existing systems on Rouge metrics.

4.1 Probabilistic Latent Semantic Analysis

Probabilistic latent semantic analysis is a latent variable model for co-occurrence data which associates an unobserved latent factor variable zk ∈ Z = {z1, . . . , zT} with each observation (d, w), where word wi ∈ W = {w1, . . . , wN} occurs in document dm ∈ D = {d1, . . . , dM}. Each word in a document is considered as a sample from a mixture model, where the mixture components are multinomial random variables that can be viewed as representations of topics. A document is represented as a list of mixing proportions for the mixing components, i.e. it is reduced to a probability distribution over a fixed set of latent factors (see Section 2.4).

Topic models are generally applied to large collections of documents, and exploit the co-occurrence of documents and words. However, given the small size of typical multi-document summarization document collections, a latent factor analysis at the level of documents will provide the algorithms with insufficient co-occurrence information to work with, resulting in latent topics which pick out idiosyncratic word combinations. Instead, it is more useful to consider co-occurrence information at the level of text passages, typically at the same granularity as that which is used during summary creation [BL04, SPKJ07]. This approach allows the model to identify similar and different text passages, and to group text passages with similar content.

In our approach, we follow the dominant paradigm of extractive summarization, and consider sentences as the passages to be extracted. Our PLSA model therefore associates the latent factor variable z with each observation (s, w) of word w in sentence s, and utilizes sentence-word co-occurrence information to identify latent topics. For the purposes of our model, each sentence is viewed as a (very short) document. Similarly, we consider a user query as a single document q. From here on, we will use P(z|s) and P(z|q) to denote topic distributions over sentences and queries respectively, which can be considered identical to the notation P(z|d) of the original PLSA model. Any other (pseudo-) documents utilized in our approach, such as the title of documents, or the centroid vector representation of a document collection,


are treated analogously, and will be introduced where needed.

4.2 Content modeling for summarization

Our approach to creating a query-focused summary of a collection of thematically related documents consists of three steps: First, we associate sentences and queries with a representation in the latent topic space by training a PLSA model on the term-sentence matrix of the input document collection. From the model, we derive a representation of sentences in the latent topic space, and estimate the mixing proportions of the user query, documents, and the collection's centroid vector. We then compute several features for each sentence on the basis of this novel representation, in order to estimate the summary-worthiness of sentences. The feature set includes features which capture how well a sentence represents the overall theme of the document collection, as well as features which model a sentence's relevance with respect to the query. We combine individual feature scores linearly into an overall sentence score in order to rank sentences. Subsequently, we iteratively select the top-ranking sentences to create a summary, and penalize candidate sentences based on their similarity to the partially constructed summary. Each of these steps is described in detail below.

4.2.1 Sentence and document representation in the latent topic space

Given a collection of thematically related documents, we perform sentence splitting for each document using the NLTK toolkit.1 Each sentence is represented as a weighted bag-of-words w = (w1, . . . , wN). During preprocessing, we remove stop words, and apply stemming using Porter's stemmer [Por80]. We discard all sentences which contain less than lmin = 5 or more than lmax = 20 content words, as these sentences are unlikely to be useful for a summary [TM97].

From the bag-of-words representation, we create a term-sentence matrix2 ATS over the union of sentences in the corpus. Each entry Ats is given by the frequency of term t in sentence s. We train the PLSA model on the term-sentence matrix ATS, using an implementation of the Expectation-Maximization (EM) algorithm described in [Hof99b] (see Section 2.4.2).

1 http://www.nltk.org

2 A term here corresponds to a stemmed content word. We utilize the notion of 'term' in its conventional sense to distinguish it from the concept of 'words', which include stop words and are not stemmed.
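As a concrete illustration of this preprocessing pipeline, the following minimal sketch builds the term-sentence matrix. It assumes NLTK with its 'punkt' and 'stopwords' resources installed; the helper names are ours, not from the thesis.

```python
# Minimal sketch of the preprocessing pipeline (assumes NLTK with the
# 'punkt' and 'stopwords' resources; helper names are ours).
from collections import Counter

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

STOP = set(stopwords.words("english"))
STEMMER = PorterStemmer()
L_MIN, L_MAX = 5, 20  # content-word length bounds from the text

def sentence_terms(documents):
    """Sentence-split, remove stop words, stem, and filter by length."""
    sentences = []
    for doc in documents:
        for sent in nltk.sent_tokenize(doc):
            terms = [STEMMER.stem(w.lower()) for w in nltk.word_tokenize(sent)
                     if w.isalpha() and w.lower() not in STOP]
            if L_MIN <= len(terms) <= L_MAX:
                sentences.append(terms)
    return sentences

def term_sentence_matrix(sentences):
    """Build A: A[t][s] is the frequency of term t in sentence s."""
    vocab = sorted({t for terms in sentences for t in terms})
    index = {t: i for i, t in enumerate(vocab)}
    A = [[0] * len(sentences) for _ in vocab]
    for s, terms in enumerate(sentences):
        for t, freq in Counter(terms).items():
            A[index[t]][s] = freq
    return vocab, A
```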


After the model has been trained, it provides a representation of the sentences as probability distributions P(z|s) over the latent topics z. This representation can be interpreted as follows: Since the source documents cover multiple topics related to a central theme, each sentence can be viewed as representing one (or more) of these topics. By applying PLSA, we arrive at a representation of sentences as vectors in the "topic space" of the document collection:

    z(s) = P(z|s) = (p(z_1|s), p(z_2|s), \dots, p(z_T|s)),    (4.1)

where p(zk|s) is the conditional probability of topic zk given the sentence s. The probability distribution P(z|s) hence tells us how many and which topics this sentence covers, and how likely the different topics are for this sentence. Typically, there will be one or at most two dominant topics for each sentence, and the remaining topics will have a negligible probability p(z|s) < ϵ, which can be safely ignored. Similar to Latent Semantic Analysis (LSA), one of the major challenges of applying latent topic models is the estimation of the number of latent topics T (see Section 2.4). We will therefore evaluate the performance of our approaches for different values of the parameter T.
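To make this interpretation concrete, a tiny sketch (the threshold value is a hypothetical choice, not from the text) that extracts the dominant topics of a sentence:

```python
EPSILON = 0.01  # hypothetical threshold for a "negligible" topic probability

def dominant_topics(p_z_given_s):
    """Return (topic, probability) pairs with p(z|s) >= EPSILON,
    most probable first; typically only one or two topics survive."""
    kept = [(k, p) for k, p in enumerate(p_z_given_s) if p >= EPSILON]
    return sorted(kept, key=lambda kp: kp[1], reverse=True)
```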

Having arrived at a representation of sentences in the latent space, there are a number of ways of exploiting this representation. The authors of [BSIM08] suggest picking the most central topics, i.e. the topics with the highest posterior probabilities P(z), and selecting sentences with the highest likelihood P(s|z), given these topics. In contrast, we interpret the distribution over topics as a lower-dimensional representation of the high-dimensional, noisy word vector space spanned by the term-sentence matrix ATS of the document collection. By using the topic space, we gain a reduced representation where similarities between sentences, or between sentences and a query, can be more reliably estimated. Furthermore, the latent space reduces the impact of 'noise' terms. This interpretation is similar to that of the LSA model. The representation of sentences and queries in the latent topic space thus allows us to apply similarity measures in this space. Furthermore, the topic space is much smaller and denser than the original term vector space. In the next section, we describe the sentence-level features that we compute using the topic-based representation of sentences and queries.

4.2.2 Computing query-focused and thematic sentence features

Since we are interested in creating a summary that covers the main topics of a document collection, and in addition satisfies a user's information


need, expressed by a query, we compute a set of sentence-level features that are intended to capture these different aspects. We utilize different types of information available in the DUC 2006 and DUC 2007 multi-document summarization data sets (see Section 1.4.2):

• tc: Title of a collection of related documents

• q: Query or topic statement

• tds: Title of document d containing sentence s

• ds: Document term vector of document d containing sentence s

• c: Document collection centroid vector

Each of these elements is first represented as a vector of words w = (w1, . . . , wm), applying the same preprocessing steps as described above for sentences. We do not perform sentence splitting if the query consists of a multi-sentence set of questions, but instead treat it as a single, long sentence. We also do not discard queries or titles based on the length of the resulting word vector, as done for sentences. Document and document collection term vectors are computed by summing the term frequencies of the corresponding sentence term vectors.

The word vectors are then transformed into probability distributions over the latent topic space by folding them into the trained model. The folding is performed by EM iterations, where the distributions P(w|z) are kept fixed, and only the mixing proportions P(z|q) are adapted in each M-step [Hof99b].
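A minimal sketch of this folding-in procedure, assuming dense NumPy arrays; the iteration count is our own choice. P(w|z) comes from the trained model and stays fixed throughout:

```python
import numpy as np

def fold_in(word_counts, P_w_given_z, n_iters=50):
    """Estimate P(z|q) for a new pseudo-document by EM folding-in.

    word_counts: length-N count vector n(q, w) (NumPy array);
    P_w_given_z: N x T matrix, kept fixed throughout."""
    N, T = P_w_given_z.shape
    P_z_given_q = np.full(T, 1.0 / T)  # uniform initialization
    for _ in range(n_iters):
        # E-step: posterior P(z|w, q) proportional to P(w|z) P(z|q)
        joint = P_w_given_z * P_z_given_q                            # N x T
        post = joint / np.maximum(joint.sum(axis=1, keepdims=True), 1e-12)
        # M-step: update the mixing proportions only
        P_z_given_q = (word_counts[:, None] * post).sum(axis=0)
        P_z_given_q /= P_z_given_q.sum()
    return P_z_given_q
```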

Given the resulting representation of elements as probability distributions, we calculate the following set of sentence features:

• ftc(s) : sim(p(z|s), p(z|tc)) - the similarity of the sentence and the document collection's title

• fq(s) : sim(p(z|s), p(z|q)) - the similarity of the sentence and the query

• ftd(s) : sim(p(z|s), p(z|tds)) - the similarity of the sentence and the title of the document it belongs to

• fds(s) : sim(p(z|s), p(z|ds)) - the similarity of the sentence and the document it belongs to

• fc(s) : sim(p(z|s), p(z|c)) - the similarity of the sentence and the collection centroid vector


A variety of similarity measures can be used to compute sentence features. In our approach, we compare three different metrics, and evaluate their effect on summary quality. A traditional and well-known similarity measure that uses the vectorial representations of the topic probability distributions (see Equation 4.1) is the cosine similarity, described in Section 2.2:

    sim_{COS}(x, y) = \frac{x^T y}{|x| \, |y|}    (4.2)

In addition, we evaluate two standard distribution divergence measures, the Kullback-Leibler (KL) divergence and the Jensen-Shannon (JS) divergence, since the representations we compare are probability distributions. The symmetric KL divergence is defined as follows:

    KL(S, Q) = D_{KL}(S \| Q) + D_{KL}(Q \| S) = \sum_{i=1}^{I} S(i) \log \frac{S(i)}{Q(i)} + \sum_{i=1}^{I} Q(i) \log \frac{Q(i)}{S(i)}    (4.3)

To use the KL divergence as a similarity measure, we scale divergence values to [0, 1] and invert by subtracting from 1, hence

    sim_{KL} = 1 - KL(S, Q)_{scaled}.    (4.4)

The Jensen-Shannon divergence is a symmetrized and smoothed version of the KL divergence, calculated as the KL divergence of S and Q with respect to the average of the two input distributions. The JS divergence based similarity simJS is then defined as:

    sim_{JS}(S, Q) = 1 - D_{JS}(S \| Q) = 1 - \left[ \frac{1}{2} D_{KL}(S \| M) + \frac{1}{2} D_{KL}(Q \| M) \right],    (4.5)

where M = \frac{1}{2}(S + Q).

Since the training of a PLSA model using the EM algorithm with random initialization converges on a local maximum of the likelihood of the observed data, different initializations will result in different locally optimal models (see Section 2.4.2). As the authors of [BCT02] have shown, the effect of random initializations can be reduced by generating several PLSA models, and then computing and averaging the features derived from the different models. We have implemented this model averaging in our approach using five iterations of training the PLSA model. The number of iterations was tuned experimentally to give a sensible trade-off between model performance and training time. We compute sentence features in each iteration, and then average the feature values before computing the final sentence score.
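The three similarity measures can be sketched as follows. The smoothing constant is an implementation choice, and for the scaled symmetric KL similarity of Equation 4.4 we assume, as one plausible reading of the scaling step, that divergences are scaled by the maximum value observed over the compared pairs:

```python
import numpy as np

def sim_cos(x, y):
    """Cosine similarity of two topic vectors (Eq. 4.2)."""
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

def kl(p, q, eps=1e-12):
    """KL divergence with additive smoothing to avoid log(0)."""
    p, q = p + eps, q + eps
    return float(np.sum(p * np.log(p / q)))

def sim_js(p, q):
    """Jensen-Shannon based similarity (Eq. 4.5)."""
    m = 0.5 * (p + q)
    return 1.0 - (0.5 * kl(p, m) + 0.5 * kl(q, m))

def sim_kl_scaled(pairs):
    """Scaled symmetric KL similarity (Eq. 4.4) over a list of (p, q) pairs;
    scaling by the maximum observed divergence is our assumption."""
    divs = [kl(p, q) + kl(q, p) for p, q in pairs]
    max_div = max(divs) or 1.0
    return [1.0 - d / max_div for d in divs]
```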


4.2.3 Sentence scoring

The system described so far assigns a vector of similarity feature values to each sentence s ∈ S, xs = (ftc(s), fq(s), ftd(s), fds(s), fc(s)). The overall score of a sentence s is calculated as a linearly weighted combination score(s) = wTxs, where w is a weight vector. For our system, we determined feature weights experimentally by initializing all weights to a default value of 1 and then tuning one feature weight at a time while keeping the others fixed.3 We performed the weight tuning on the DUC 2006 data set.

To create a summary, we rank sentences by their score, and select the highest-scoring sentences for inclusion in the summary until the predefined summary length is reached. In order to deal with redundancy, we apply a scheme similar to Maximum Marginal Relevance. To this end, we select sentences iteratively, and calculate a redundancy penalty for each remaining candidate sentence in each iteration:

    score_{mmr}(s) = \lambda \cdot score(s) - (1 - \lambda) \cdot sim(p(z|s), p(z|sum)),    (4.6)

where score(s) is scaled to [0, 1] and sim(p(z|s), p(z|sum)) is the similarity of the candidate sentence to the current, partial summary. Again, different similarity measures can be applied. In our experiments, we evaluated the same three similarity measures for the redundancy penalty as were used for computing sentence features. Since the differences in performance were negligible, we only report the results of the variant using the cosine similarity measure in this work. The weighting parameter λ is set experimentally to 0.5, weighting relevance and redundancy scores equally.
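A sketch of the greedy, redundancy-penalized selection loop, reusing the sim_cos helper from the similarity sketch above. Representing the partial summary by the average of the selected sentences' topic distributions is our illustrative choice, not prescribed by the text:

```python
import numpy as np

def build_summary(sentences, scores, topic_dists, max_words, lam=0.5):
    """Greedy MMR-style selection (Eq. 4.6): relevance scores are scaled to
    [0, 1] and candidates are penalized by similarity to the partial summary."""
    scores = np.asarray(scores, dtype=float)
    span = scores.max() - scores.min()
    scores = (scores - scores.min()) / (span if span > 0 else 1.0)
    selected, n_words = [], 0
    remaining = set(range(len(sentences)))
    while remaining and n_words < max_words:
        if selected:
            # illustrative choice: average topic distribution of the summary
            p_sum = np.mean([topic_dists[i] for i in selected], axis=0)
            mmr = {i: lam * scores[i] - (1 - lam) * sim_cos(topic_dists[i], p_sum)
                   for i in remaining}
        else:
            mmr = {i: scores[i] for i in remaining}
        best = max(mmr, key=mmr.get)
        selected.append(best)
        remaining.remove(best)
        n_words += len(sentences[best].split())
    return [sentences[i] for i in selected]
```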

4.3 Experiments: Query-focused multi-document summarization

We evaluate our approach on two data sets from recent summarization tasks, the DUC 2006 and DUC 2007 multi-document summarization datasets (see Section 1.4.2). The quality of the summaries created by our system is evaluated with the Rouge measure, using the official DUC parameter settings

3 The procedure we used for tuning feature weights, following [LYRL04], corresponds to a greedy search in the feature weight space, and may converge on a local optimum. Alternative schemes for finding (globally) optimal weights are discussed in Section 2.2.3. For example, Vanderwende et al. [VSBN07] suggest a preference learning scheme for sentences, using Rouge oracle scores to optimize the ranking and the feature weights. Hickl et al. [HRL07] discuss a hill-climbing algorithm. "Oracle" scores calculated from the probability distribution of n-grams in human summaries are also used by Conroy et al. [CSO06] and Ouyang et al. [OLL07] as a learning criterion.


and implementing jackknifing (see Section 3.3 for a list of the used parameters and their settings). Summaries are truncated to a length of 250 words by using the Rouge parameter '-l 250'. We report the performance of our summarizer for the Rouge-1 (word overlap), Rouge-2 (bigram overlap) and Rouge-SU4 (skip bigram overlap) metrics.

We implemented three variants of our PLSA-based summarizer, each one using a different similarity measure to compute sentence features: PLSA-JS (Jensen-Shannon divergence), PLSA-KL (symmetric KL divergence), PLSA-COS (cosine similarity). In addition, we implemented two baseline systems, Lead and a system based on Latent Semantic Analysis. The Lead system selects the lead sentences from the most recent news article in the document cluster as the summary.

The LSA baseline computes a rank-T singular value decomposition of the term-sentence matrix. The resulting right-singular vectors, scaled by the singular values, represent the sentences in the latent semantic space (see Section 2.4). We adopt the same approach as for the PLSA-based system to compute sentence-level features and score sentences. Sentence features are calculated using the cosine similarity measure, since the novel dimensions derived by SVD are linear combinations of the original word vector space, to which distributional divergence metrics are not applicable. We apply our greedy ranking and redundancy removal strategy to create a summary.

4.3.1 DUC 2006

In the multi-document summarization task in DUC 2006, participants are given 50 document clusters, where each cluster contains 25 news articles related to the same topic. Participants are asked to generate summaries of at most 250 words for each cluster. For each cluster, a title and a query describing a user's information need are provided. The query is usually composed of a set of questions or a multi-sentence task description.

We present the results of our system in Table 4.1. We compare our results to three state-of-the-art systems (described in Section 2.3.3): the best peer participating in DUC 2006 (IIIT Hyderabad [JPV06]), the best reported results on this data set by the PYTHY system [TBG+07], and the GisTexter system [LHR+06]. In addition, we also give the results for the LSA baseline using an SVD decomposition with T = 128 latent topics, and the Lead baseline.

In the table, system PLSA-JS uses the Jensen-Shannon divergence as the similarity measure, PLSA-KL the symmetric KL divergence and PLSA-COS the cosine similarity. The results are given for the empirically best value of the parameter T (number of latent topics) for each system variant. The


System          T    Rouge-1   Rouge-2   Rouge-SU4
PLSA-JS         192  0.43283   0.09698   0.15568
PYTHY           -    n.a.      0.096     0.147
PLSA-COS        256  0.42444   0.09588   0.15409
IIIT Hyderabad  -    0.40980   0.09505   0.15464
PLSA-KL         256  0.42956   0.09465   0.15474
LSA             128  0.42155   0.08880   0.14938
GisTexter       -    0.38751   0.08082   0.13582
Lead            -    0.30217   0.04947   0.09788

Table 4.1: Rouge recall scores for best number of latent topics T on the DUC 2006 dataset. The table compares three different variants of our PLSA-based system, which use the Jensen-Shannon divergence (PLSA-JS), the symmetric KL divergence (PLSA-KL) and the cosine similarity (PLSA-COS) respectively to calculate similarity features in the latent topic space. The best LSA model is based on a rank-T approximation with T = 128. The PLSA-JS approach produces the best summarization quality, outperforming the best state-of-the-art systems, such as PYTHY or IIIT Hyderabad.

system using the JS divergence outperforms the best state-of-the-art systems at T = 192 with a Rouge-2 score of 0.09698. However, the improvements for Rouge-2 and Rouge-SU4 are not significant at p < 0.05. Rouge-1 scores are significantly better than the results reported by IIIT Hyderabad. A comparison to the PYTHY system on Rouge-1 is not possible as the authors do not report this score for their system. All variants of our system outperform the LSA baseline on Rouge-2, suggesting that probabilistic latent topic models are a more adequate choice for modeling the content of multi-document summarization data sets than the algebraic model underlying LSA.

The most dominant features in our experiments are the sentence-query similarity fq(s) and the sentence-document similarity fds(s), which confirms previous research. On the other hand, the sentence-title similarity ftd(s) did not have a significant influence on the quality of the resulting summaries. Our experiments with other weighting schemes for the input matrix ATS, such as tf-idf, resulted in significantly lower performance, confirming the results reported by Gong and Liu [GL01].


System          T    Rouge-1   Rouge-2   Rouge-SU4
IIIT Hyderabad  -    0.44508   0.12448   0.17711
PYTHY           -    0.43245   0.12028   0.17074
PLSA-JS         128  0.45843   0.11675   0.17680
PLSA-KL         224  0.45208   0.11662   0.17306
PLSA-COS        64   0.44329   0.11222   0.16679
LSA             256  0.44891   0.11022   0.16864
GisTexter       -    0.42419   0.10810   0.16280
Lead            -    0.31250   0.06039   0.10507

Table 4.2: Rouge recall scores for best number of latent topics T on the DUC 2007 dataset. The table compares three different variants of our PLSA-based system, which use the Jensen-Shannon divergence (PLSA-JS), the symmetric KL divergence (PLSA-KL) and the cosine similarity (PLSA-COS) respectively to calculate similarity features in the latent topic space. The PLSA-JS variant outperforms the best participating system on Rouge-1, and ranks among the top systems for Rouge-2.

4.3.2 DUC 2007

The multi-document summarization task in DUC 2007 is the same as in DUC 2006, with participants asked to produce 250-word multi-document summaries for a total of 45 document clusters. The results of our system are presented in Table 4.2, together with the results of the 2007 versions of the state-of-the-art systems introduced above [PKV07, TBG+07, HRL07].

Rouge-2 and Rouge-SU4 scores of our system are slightly lower than those of the best system, but still very competitive, with the PLSA-JS variant ranking 5th for Rouge-2 and 2nd for Rouge-SU4 when compared to systems participating in DUC 2007. Again we see that all three system variants outperform the LSA baseline. We also observe that both the PLSA-JS and the PLSA-COS variant require a much smaller number of latent topics than the LSA model for comparable Rouge-2 results. The implied reduction in description length, i.e. the reduced memory footprint due to the smaller size of the resulting topic vectors, can prove to be beneficial for large-scale summarization applications.

We can also see that the PLSA-JS variant outperforms the best system on Rouge-1, and achieves almost the same score as the top-performing system for Rouge-SU4, with the differences in both cases not being significant. This suggests that the PLSA model can adequately capture the importance of in-


dividual words for Rouge-1 recall, and word co-occurrences for Rouge-SU4 skip-bigram recall. The Rouge-2 score, on the other hand, is significantly lower than that of the best system. This indicates that the PLSA model, which is trained on the co-occurrence counts of individual words, could benefit from the inclusion of bigram co-occurrence counts.

We find that the variants of our summarizer using distribution divergence measures (PLSA-JS, PLSA-KL) outperform the approach implementing the cosine similarity measure (PLSA-COS). When compared to the LSA baseline, there is no clear advantage for the PLSA-COS system, as the higher Rouge-2 score is offset by the slightly lower Rouge-1 and Rouge-SU4 scores.

4.3.3 Effect of system variations

In this section, we look at the effect of varying the number of latent topics. For all systems we find that using less than T = 32 latent topics, the model cannot cope with the complexity of the data. As shown in Figure 4.1, Rouge-2 scores of the PLSA-JS and -COS variants on DUC 2006 data improve drastically when increasing the number of latent topics to T = 32. For T > 32, both variants show smaller improvements and seem relatively robust to changes of the number of latent topics. The observed performance variations can be explained by the way latent topic models assign words to topics: Given a fixed number of latent topics to "fill", the algorithm may split up a single topic into two distinct ones, or vice versa merge different topics, which in turn affects similarity scores and the resulting sentence ranking. The scores of the KL divergence based variant are significantly lower than those of the PLSA-JS variant when using less than T = 156 latent topics, but are almost as good as those of PLSA-JS when using more topics.

Similar observations hold for the DUC 2007 data set, as can be seen from the performance curves presented in Figure 4.2. One notable difference is that the PLSA-COS variant exhibits over-fitting for the DUC 2007 data set much earlier than the other systems, with performance dropping for T > 64. We assume this to be due to the fact that the PLSA model assigns near-zero probabilities to most of the latent topics of a sentence, which in turn affects the cosine similarity measure more strongly than the distribution divergence measures, which smooth near-zero probabilities.

For both data sets the PLSA-JS variant achieves very good performance with only very few latent topics, signifying that a drastic reduction of the original vector space is possible before computing similarity features. Even with T = 192 latent topics, the dimensionality of this space is much lower than for the original term vector space.



Figure 4.1: Rouge-2 recall on DUC 2006 data as a function of the number of latent topics. The figure shows the performance of system variants using the Jensen-Shannon divergence (PLSA-JS), the symmetric KL divergence (PLSA-KL) and the cosine similarity (PLSA-COS), and of an LSA baseline. PLSA-based summarizers outperform the LSA baseline, although the PLSA-KL variant requires a large number of latent topics for a high summarization quality. The PLSA-JS variant of our system outperforms existing approaches for T = 192.

Both figures also show that the PLSA-JS approach consistently outperforms the LSA approach, as do the PLSA-COS and PLSA-KL variants for the best setting of the number of latent topics T. Although performance improvements are not significant at p < 0.05, they do indicate that the PLSA model can better capture the sparse information contained in sentences than the LSA model. We find that the LSA baseline system's Rouge-2 scores improve significantly when increasing the number of latent topics from 32 to 96. For larger T, gains in performance are less pronounced.

Figure 4.3 shows the Rouge-1 and Rouge-SU4 scores of the same set of summarizers on DUC 2007 data. The plots in general mirror the curves that can be observed for Rouge-2. One difference is that for Rouge-1, the LSA summarizer performs as well as the PLSA-KL summarizer. In addition, all summarizers except for the over-fitting PLSA-COS system show similar


Figure 4.2: Rouge-2 recall on DUC 2007 data as a function of the number of latent topics T. The results mirror the performance on DUC 2006 data, with PLSA summarizers outperforming the LSA summarizer. The PLSA-COS variant tends to over-fit already for a small number of latent topics.

performance for T > 128.

4.3.4 Do latent topics capture subtopics?

We conclude our discussion with a study of the latent topics generated by PLSA for our summarization datasets. Table 4.3 shows the most prominent terms of 10 latent topics generated by PLSA (T = 64) for the DUC 2007 document set D0743J, which is about the 1999 earthquakes in Western Turkey. We find that many topics aggregate semantically related terms rather well, and many latent topics seem to capture subtopic-like information. For example, topic 33 aggregates terms related to search and rescue efforts, topic 7 terms related to a fire in a refinery, topic 48 concerns economic consequences of the disaster, and topic 15 contains terms dealing with scientific studies of earthquakes. On the other hand, topic 10 mixes terms related to the first reports on the massive earthquake with terms pertaining to geology. Topic 56 combines terms associated with the industrial importance of the affected


Figure 4.3: Rouge-1 (a) and Rouge-SU4 (b) recall on DUC 2007 data as a function of the number of latent topics for different summarizers.


Topic  Most likely terms
10     quak kill peopl 17 injur region trigger turkey stress inhibit
33     rescu worker team search rubbl survivor build wednesday collaps foreign
56     earthquak industri turkey region heartland center damag countri part crisi
1      aftershock scale richter measur occur degre gmt regist report local
7      refineri izmit blaze fire petrol control strong compani alakoc ismail
35     izmit citi mile epicent damag istanbul quak largest center 65
48     turkish caus expert foreign earthquak dollar turkey billion econom countri
8      buri bodi diseas rubbl water relief thousand men believ huge
15     scientist quak fault struck monitor earthquak record largest west magnitud
20     thousand homeless survivor izmit worker tent work busi volunt human

Table 4.3: Stemmed terms for 10 latent topics generated by PLSA with T = 64 for DUC 2007 topic D0743J ("Earthquakes in Turkey in August 1999"). Most topics aggregate co-occurring terms, and seem to capture subtopic-like information, e.g. "rescue efforts" (topic 33), "aftershocks" (topic 1), and "refinery fire" (topic 7). Other topics mix subtopics, e.g. "industry" and "crisis management" (topic 56).

region with terms referring to crisis management. Such mixed contents indicate an insufficient number of latent topics T. A similar picture emerges when looking at the most likely sentences in each latent topic. Table 4.4 shows sentences for 5 of these 10 topics. Whereas topic 1 clusters sentences containing reports on various aftershocks of the earthquake, the sentences in topic 56 seem rather unrelated to each other. Such overlapping and mixed-content topics will generally lower the performance of our summarizer, as in this case it cannot infer correct (dis-)similarities when computing importance and redundancy features. One potential approach to mitigate this effect is to increase the number of latent topics T, in order to separate such mixed topics. However, this may result in a model over-fitted to the training data. A more viable approach is to average multiple different models, as described in Section 4.2, since independently trained models are unlikely to share the same term and sentence mixtures. We find that such averaged models tend to produce higher quality summaries.

Table 4.3 also shows that the PLSA model may map the same word to different latent topics. For example, the term "turkey" occurs in topics 10, 56 and 48, the term "earthquak" in topics 56, 48, and 15, and "izmit" in topics 7, 35 and 20. PLSA thus associates the occurrences of these words with the different contexts they appear in, which provides some evidence for the claim that latent topic models can disambiguate different meanings of a


Topic  Most likely sentences
33     We don't need any more rescue teams on site.
       Two hours later, the rescue workers finally walked away, leaving behind ...
       The number of casualties may still increase as the rescue workers are ...
       Foreign search and rescue teams managed to find at least three people ...
       More than 1,000 relief workers from 19 countries joined the frantic search ...
15     But the only similar networks of monitoring stations in the world, one in ...
       On this basis, the three scientists proposed in their article that over the ...
       Scientists said Tuesday's earthquake was along the Anatolia Fault, a ...
       A team of MIT scientists has been studying the region for a decade, but ...
       Only in the last 20 years has technology allowed scientists to detect multiple ...
56     Turks nationwide are struggling to recover from the massive, 7.4-magnitude ...
       The Prime Ministry Crisis Administration Center stated Tuesday that ...
       I think we also have to consider establishing some new residential areas ...
       Foreign engineers and architects are expected to arrive in Turkey over the ...
       This is a temporary measure, until the government organizes a program for ...
1      A 4.2-magnitude aftershock followed.
       The most powerful one of the aftershocks, with a magnitude of five degrees ...
       The aftershock occurred at 17:30 local time (14:30 GMT), but no serious ...
       Some 27 aftershocks have jolted western Turkey in the past several hours ...
       Five aftershocks, ranging between 4.3 and 4.6 on the Richter scale, rocked ...
7      He said the blaze in the refinery was brought under control ...
       Blaze in a refinery in western Turkey has been under control after the strong ...
       The experts said that the fire in the Izmit Refinery of the Turkish Petrol ...
       A few miles away from blazing refinery, rescue workers and volunteers ...
       For four days, large swaths of Izmit were forbidden to rescue workers ...

Table 4.4: Sentences for 5 latent topics generated by PLSA with T = 64 for DUC 2007 topic D0743J ("Earthquakes in Turkey in August 1999"). Most topics cluster sentences related to a single subtopic, e.g. "aftershocks" (topic 1). Other topics mix unrelated sentences, e.g. topic 56.

word on the basis of word context (Section 2.4).

4.4 Conclusion

In this chapter, we showed how probabilistic topic models can be utilized to discover subtopic structures of topically-related news article collections. We presented a novel approach to extractive summarization in order to produce query-focused summaries. The proposed method represents text passages in a latent topic space, which helps to overcome common problems related to


word ambiguity, synonymy, and the sparseness of the original word vector space when estimating the similarity of text passages. Our model infers the semantics of latent topics based on lexical co-occurrence information and distributional context, and exploits recurrent word patterns to derive the subtopic structure of multiple texts in a domain- and language-independent, unsupervised fashion. Moreover, our approach enables us to map queries, document titles, and collection centroid vectors into the same latent topic space. This consistent modeling can considerably increase the summary relevance of source sentences even when they share few or no words with these elements in the original word vector space.

We evaluated the applicability of our approach to the task of query-focused multi-document summarization on two recent summarization datasets. We performed extensive evaluations of our approach over a range of different parameter settings. The results showed that the proposed approach produces higher-quality summaries than different baseline summarization methods. Furthermore, we find that our results are among the best reported for the DUC 2006 and DUC 2007 multi-document summarization tasks, as measured by Rouge-1, Rouge-2 and Rouge-SU4 metrics. We have achieved these very competitive results using a domain- and language-independent, unsupervised method.

One recurrent observation in all our experiments is a general quality gain when using a probabilistic topic model compared to a standard system based on singular value decomposition. These quality improvements become especially apparent when we compute text passage similarities with the Jensen-Shannon divergence. We also find that the PLSA-JS and the PLSA-COS variants of our system achieve very good performance when using a much smaller number of latent topics than a comparable LSA model. The implied reduction in memory consumption may prove beneficial for large-scale summarization tasks.

The probabilistic content model presented in this chapter captured subtopical text structures in collections of thematically related documents. Our method successfully identified descriptive latent topics that allowed for better estimates of the summary-worthiness of sentences. However, the standard probabilistic topic model we utilize assumes independence between word observations. Each latent topic is represented as a "bag-of-words", and any information about word order is ignored. In the next chapter, we will try to relax the bag-of-words assumption by incorporating word order information into the probabilistic topic model. The proposed approach fuses word unigram and bigram observations into a unified model, leading to an improved descriptiveness of the latent topics.


Chapter 5

Content modeling beyond bag-of-words

Standard probabilistic topic models infer latent topic variables using the "bag-of-words" assumption, and ignore word order. Each latent topic is represented as a multinomial distribution over words, similar to a unigram language model. The bag-of-words assumption is often motivated from the perspective of computational efficiency. However, in many language modeling applications, such as speech recognition and text compression, the order in which words appear is extremely important. Considering word order can assist in topic inference, and will likely lead to more meaningful content models of text [Wal06]. For example, the phrases "the department chair couches offers" and "the chair department offers couches" are about quite different topics, but have the same unigram statistics. The integration of word order information would allow the topic model to distinguish between the two topics, since the model can assign a much higher likelihood to the word "chair" being generated by a "university" topic when observing it directly preceded by the word "department".

In contrast to topic models, language models capture term dependencies by predicting the occurrence of a word on the basis of the words directly preceding it [MS01]. These predictions are made based on conditional and marginal word n-gram statistics derived from a text corpus. A language model thus assigns a higher likelihood of occurrence to a common phrase like "the green apple" than to an uncommon phrase such as "the green strawberry". In addition, it may predict the occurrence of the word "apple" following the word sequence "the green", but assign a zero probability to a subsequent occurrence of the word "moon". While language models may use word contexts of any length, in practice most often bigram and trigram models are used for reasons of computational effort and data sparsity [MS01].


In this chapter, our goal is to improve the content model introduced in the previous chapter by incorporating word order information. Each topic is now represented by two multinomial distributions, one over words and one over bigrams (Section 5.1). We present an extension of the Expectation-Maximization algorithm in order to merge word and bigram observations when estimating the parameters of the unified model. Both types of observations are coupled by the latent topic variable, which enables the model to consider bigram evidence when making predictions about words and vice versa. During inference, our model combines information about (long-range) word co-occurrence correlations with information about word associations that are due to word ordering and word proximity. Furthermore, the inferred latent topics are also characterized by a distribution over bigrams, making it easier to distinguish between topics sharing similar word (unigram) distributions.

We evaluate the applicability of our content modeling approach on the task of query-focused multi-document summarization. Experimental results show an improved performance of our summarizer when compared to a summarizer which considers word information only when constructing a latent topic model. In addition, our method compares favorably with current state-of-the-art summarizers, while being built on a considerably simpler model that is learned in an unsupervised, domain- and language-independent fashion. We also provide evidence that our approach improves the descriptive quality of the latent topics, and substantially reduces the number of topics required to create high-quality content models for multi-document summarization.

In the remainder of this chapter, we first present our novel approach for merging language model information into a standard probabilistic topic model (Section 5.1). Subsequently, we give details of our summarization system and the sentence features we compute to estimate the summary-worthiness of sentences in Section 5.2. We evaluate the accuracy of our approach on the task of query-focused multi-document summarization in Section 5.3.

5.1 Combining topic and language models

We will motivate the proposed approach of incorporating bigram information with a simple example. Consider the two phrases given in the introduction: "the department chair couches offers" (phrase A) and "the chair department offers couches" (phrase B). PLSA assumes that documents and words are conditionally independent given the latent topic variable z. In addition, observation pairs (d, w) are assumed to be generated independently. Applying


PLSA to the two example phrases will reduce the word sequences to a set of independent observations (A, the), (A, department), . . . , (B, the), . . . , etc. In practice, it is very likely that the topic inference process, following the maximum likelihood principle, will induce that both phrases are associated with the same latent topic z, which has a word distribution heavily dominated by the words the, chair, couches, department, offers.

To address the word independence assumption made by standard latent factor models, we propose to incorporate bigram-document co-occurrence observations into a topic model. Even though one can consider a bigram simply to be a co-occurrence of two terms, and as such captured well enough by a standard topic model, our assumption is that the observed bigram patterns will reinforce the observed word patterns, leading to more descriptive latent topics. We limit our investigations in this chapter to word bigrams, but the described approach can easily be extended to incorporate higher-order n-grams.1

For our novel approach, we associate each observation (d, b) of a bigram b = (ww′), where bigram b ∈ B = {b1, . . . , bl} occurs in document d, with an unobserved class variable z, similar to the decomposition of the co-occurrence observations (d, w) of words and documents in PLSA. To illustrate this idea using our previous example, our goal here is to reduce the phrase to a set of (independent) observations (A, the department), (A, department chair), . . . , (B, the chair), (B, chair department), . . . , etc. Furthermore, we assume that the same hidden topics that are responsible for the observed term-document co-occurrences (d, w) are also the origin of the bigram-document co-occurrence observations (d, b). Hence, there is only one set of latent topics Z, but each zk ∈ Z is associated with a probability distribution over words P(w|zk) as well as with a probability distribution over bigrams P(b|zk). The plate notation of the combined model is shown in Figure 5.1.
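As a small illustration, the overlapping bigram observations can be generated from a token sequence as follows (the helper name is ours):

```python
def bigram_observations(doc_id, tokens):
    """Overlapping (document, bigram) observations for a token sequence."""
    return [(doc_id, f"{w1} {w2}") for w1, w2 in zip(tokens, tokens[1:])]

# Phrase A from the text yields (A, 'the department'), (A, 'department chair'), ...
obs = bigram_observations("A", "the department chair couches offers".split())
```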

For PLSA, the probability that a word and a document co-occur can be calculated by summing over all latent topics Z:

    P(d_j, w_i) = P(w_i|d_j) P(d_j), where    (5.1)

    P(w_i|d_j) = \sum_k P(w_i|z_k) P(z_k|d_j)    (5.2)


1 The small size of multi-document summarization news article collections makes the usefulness of higher-order n-gram observations questionable, since the resulting models will suffer from sparsity-related problems [MS01]. Furthermore, latent factor models rely on co-occurrence observations (e.g. of the same word in different documents) to reduce the dimensionality of the original vector space, and within a small dataset higher-order n-grams will rarely occur more than once.


Figure 5.1: Graphical model representation of the extended PLSA model incorporating a bigram language model for N words and a corpus of M documents. In contrast to the standard PLSA model, the extended model conditions both word and bigram observations on the latent topic variable Z.

Similarly, we can compute the probability that a bigram and a document co-occur:

    P(b_l|d_j) = \sum_k P(b_l|z_k) P(z_k|d_j).    (5.3)

Notice that the decompositions in Equation 5.2 and Equation 5.3 share the same document-specific mixing proportions P(zk|dj). This couples the conditional probabilities for terms and bigrams: each latent topic has some probability P(bl|zk) of generating bigram bl as well as some probability P(wi|zk) of generating an occurrence of term wi. The advantage of this joint modeling approach is that it integrates term and bigram information in a principled manner. This coupling allows the model to take evidence about bigram co-occurrences into account when making predictions about terms and vice versa. Furthermore, since bigrams are two-word sequences ww′, bigram observations (d, ww′) overlap as illustrated in the above example, which relaxes the bag-of-words assumption of independent observations (d, w) made by the standard PLSA model.

Following the procedure outlined in Cohn and Hofmann [CH00], we can now combine both models based on the common factor P(z|d) by maximizing the log-likelihood function


    L = \sum_j \left[ \alpha \sum_i n(d_j, w_i) \log P(w_i, d_j) + (1 - \alpha) \sum_l n(d_j, b_l) \log P(b_l, d_j) \right]    (5.4)

The α parameter (0 ≤ α ≤ 1) is a predefined weight for the influence of each model. The terms n(dj, wi) and n(dj, bl) denote the co-occurrence counts of a document with a word and a bigram, respectively. For the purposes of our approach, each document corresponds to a sentence, as discussed in the previous chapter. The unified model hence associates the latent factor variable z with observations (s, w) and (s, b) of word w and bigram b in sentence s.

Using the Expectation-Maximization (EM) algorithm we then perform maximum likelihood parameter estimation for the latent factor model. During the expectation (E) step we first calculate the posterior probabilities:

    P(z_k|w_i, d_j) = \frac{P(w_i|z_k) P(z_k|d_j)}{P(w_i|d_j)}    (5.5)

    P(z_k|b_l, d_j) = \frac{P(b_l|z_k) P(z_k|d_j)}{P(b_l|d_j)},    (5.6)

and then re-estimate parameters in the maximization (M) step as follows:

    P(w_i|z_k) = \sum_j \frac{n(w_i, d_j)}{\sum_{i'} n(w_{i'}, d_j)} P(z_k|w_i, d_j)    (5.7)

    P(b_l|z_k) = \sum_j \frac{n(b_l, d_j)}{\sum_{l'} n(b_{l'}, d_j)} P(z_k|b_l, d_j)    (5.8)

The class-conditional distributions are recomputed in the M-step as

    P(z_k|d_j) \propto \alpha \sum_i \frac{n(w_i, d_j)}{\sum_{i'} n(w_{i'}, d_j)} P(z_k|w_i, d_j) + (1 - \alpha) \sum_l \frac{n(b_l, d_j)}{\sum_{l'} n(b_{l'}, d_j)} P(z_k|b_l, d_j)    (5.9)

Based on the iterative computation of the above E and M steps, the EM algorithm monotonically increases the likelihood of the extended model on the observed data. Using the α parameter, our new model can be easily reduced to a term-document based model by setting α = 1.0.
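A compact sketch of this modified EM procedure with dense NumPy arrays. The iteration count, random initialization, and the explicit renormalization of the distributions from Equations 5.7 and 5.8 (so that each topic is a valid distribution over words and bigrams) are implementation choices of this sketch:

```python
import numpy as np

def train_extended_plsa(A, B, T, alpha, n_iters=100, seed=0):
    """EM for the unified word+bigram model (Eqs. 5.5-5.9).

    A: N x M term-sentence counts; B: L x M bigram-sentence counts."""
    rng = np.random.default_rng(seed)
    (N, M), L = A.shape, B.shape[0]
    P_w_z = rng.random((N, T)); P_w_z /= P_w_z.sum(axis=0)
    P_b_z = rng.random((L, T)); P_b_z /= P_b_z.sum(axis=0)
    P_z_d = rng.random((T, M)); P_z_d /= P_z_d.sum(axis=0)
    A_norm = A / np.maximum(A.sum(axis=0), 1)  # n(w, d) / sum_w' n(w', d)
    B_norm = B / np.maximum(B.sum(axis=0), 1)
    for _ in range(n_iters):
        # E-step (Eqs. 5.5, 5.6): topic posteriors for each observation
        post_w = P_w_z[:, :, None] * P_z_d[None, :, :]          # N x T x M
        post_w /= np.maximum(post_w.sum(axis=1, keepdims=True), 1e-12)
        post_b = P_b_z[:, :, None] * P_z_d[None, :, :]          # L x T x M
        post_b /= np.maximum(post_b.sum(axis=1, keepdims=True), 1e-12)
        # M-step (Eqs. 5.7, 5.8), renormalized to valid distributions
        P_w_z = (A_norm[:, None, :] * post_w).sum(axis=2)
        P_w_z /= P_w_z.sum(axis=0)
        P_b_z = (B_norm[:, None, :] * post_b).sum(axis=2)
        P_b_z /= P_b_z.sum(axis=0)
        # M-step (Eq. 5.9): mixing proportions blend both evidence types
        P_z_d = (alpha * (A_norm[:, None, :] * post_w).sum(axis=0)
                 + (1 - alpha) * (B_norm[:, None, :] * post_b).sum(axis=0))
        P_z_d /= P_z_d.sum(axis=0)
    return P_w_z, P_b_z, P_z_d
```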


5.2 Summarizing with a hybrid content model

Our method for creating a query-focused summary of a collection of thematically related documents is similar to the approach described in the previous chapter, and consists of three steps: First, we associate sentences and queries with a representation in the latent topic space by training a PLSA model. In contrast to the approach for content modeling presented in Chapter 4, however, we utilize not only the term-sentence matrix, but also the bigram-sentence matrix of the input document collection. We then estimate the mixing proportions of the given user query, documents, and the collection's centroid vector. On the basis of the sentence and query representations in the latent topic space, we compute several features for each sentence to estimate their summary-worthiness. We combine individual feature scores linearly into an overall sentence score, which we utilize to rank sentences. Subsequently, we iteratively select top-ranking sentences to create a summary, and penalize candidate sentences based on their similarity to the partially constructed summary to avoid the introduction of redundant content.

5.2.1 Sentence representation in the latent topic space

Given a collection of related documents, we perform sentence splitting on each document using the NLTK toolkit. Each sentence is represented as a bag-of-words w = (w1, . . . , wm). We remove stop words and apply stemming using Porter's stemmer [Por80]. From the bag-of-words representation, we create a term-sentence matrix ATS over the union of sentences in the corpus. Each entry Ats is given by the frequency of term t in sentence s. Similarly, we create a bigram-sentence matrix BBS, where each entry Bbs is given by the frequency of bigram b in sentence s. We then train the extended PLSA model on the matrices A and B, using the modified Expectation-Maximization algorithm described above.
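The bigram-sentence matrix can be built analogously to the term-sentence matrix sketched in Chapter 4. This is a minimal sketch over the preprocessed term sequences; forming bigrams after stop word removal and stemming is an assumption here, not spelled out in the text:

```python
from collections import Counter

def bigram_sentence_matrix(sentences):
    """Build B: B[b][s] is the frequency of bigram b in sentence s
    (sentences are the preprocessed term lists used for the A matrix)."""
    per_sentence = [[f"{w1} {w2}" for w1, w2 in zip(ts, ts[1:])]
                    for ts in sentences]
    vocab = sorted({b for bs in per_sentence for b in bs})
    index = {b: i for i, b in enumerate(vocab)}
    B = [[0] * len(sentences) for _ in vocab]
    for s, bs in enumerate(per_sentence):
        for b, freq in Counter(bs).items():
            B[index[b]][s] = freq
    return vocab, B
```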

After the model has been trained, it provides a representation of the sentences as probability distributions P(z|s) over the latent topics Z, and we arrive at a representation of sentences as a vector in the "topic space" of the document collection:

    z_s = P(z|s) = (p(z_1|s), p(z_2|s), \dots, p(z_T|s)),    (5.10)

where p(zk|s) is the conditional probability of topic zk given the sentence s. As before, one of the major challenges of applying latent topic models is the estimation of the number of latent topics T. We will therefore evaluate the performance of our approaches for different values of the parameter T.


5.2.2 Computing query- and topic-focused sentence features

We follow the approach described in Chapter 4 to create a summary, and estimate the summary-worthiness of sentences using a range of features that measure a sentence's relevance to a user query and its relevance with respect to the overall content of the collection of related documents. The information we consider encompasses the following set of elements, each of which can be represented as a vector of words w = (w1, . . . , wm):

• tc: Title of a collection of related documents

• q: Query or topic statement

• tds: Title of document d containing sentence s

• ds: Document term vector of document d containing sentence s

• c: Document collection centroid vector

We apply the same preprocessing steps as described above for sentences. We do not perform sentence splitting if the query consists of a multi-sentence set of questions, but instead treat it as a single, long sentence. We also do not discard queries or titles based on the length of the resulting word vector, as done for sentences. Document and document collection term vectors are computed by summing the term frequencies of the corresponding sentence term vectors.

The word vectors are then transformed into probability distributions over the latent topic space by folding them into the trained model. The folding is performed by EM iterations, where the distributions P(w|z) and P(b|z) are kept fixed, and only the mixing proportions P(z|q) are adapted in each M-step [Hof99b].

Given the resulting representation of elements as probability distributions, we calculate the following set of sentence features:

• ftc(s) : sim(p(z|s), p(z|tc)) - the similarity of the sentence and the document collection's title

• fq(s) : sim(p(z|s), p(z|q)) - the similarity of the sentence and the query

• ftd(s) : sim(p(z|s), p(z|tds)) - the similarity of the sentence and the title of the document it belongs to


• fds(s) : sim(p(z|s), p(z|ds)) - the similarity of the sentence and the document it belongs to

• fc(s) : sim(p(z|s), p(z|c)) - the similarity of the sentence and the collection centroid vector

We utilize the Jensen-Shannon (JS) divergence to calculate similarity scores, since our previous investigations showed that the JS divergence is a useful similarity metric in the context of topic-based multi-document summarization (see Chapter 4). It is defined as:

simJS(S, Q) = 1 − DJS(S||Q) = 1 − [ (1/2) DKL(S||M) + (1/2) DKL(Q||M) ],   (5.11)

where M = 1/2 (S + Q). We follow our previous approach and employ five-fold model averaging [BCT02] to overcome the effects of finding local maxima of the trained PLSA model.
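A direct implementation of Equation 5.11 is given below. The thesis does not state the logarithm base; the base-2 logarithm, which bounds the divergence in [0, 1], is an assumption of this sketch.

import numpy as np

def js_similarity(p, q, eps=1e-12):
    # sim_JS(S, Q) = 1 - D_JS(S||Q), cf. Equation 5.11.
    # p and q are probability vectors over the latent topics.
    p, q = np.asarray(p, float), np.asarray(q, float)
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(a * np.log2((a + eps) / (b + eps)))
    return 1.0 - (0.5 * kl(p, m) + 0.5 * kl(q, m))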

5.2.3 Sentence scoring

The system described so far assigns a vector of similarity feature values to each sentence s ∈ S, xs = (ftc(s), fq(s), ftd(s), fds(s), fc(s)). The overall score of a sentence s is calculated as a linearly weighted combination score(s) = wTxs, where w is a weight vector.

To create a summary, we rank sentences by their score score(s), and select the highest-scoring sentences for inclusion in the summary until the predefined summary length is reached. In order to deal with redundancy, we apply a scheme similar to Maximum Marginal Relevance. To this end, we select sentences iteratively, and calculate a redundancy penalty for each remaining candidate sentence in each iteration:

scoremmr(s) = λ · score(s) − (1 − λ) · sim(p(z|s), p(z|sum)),   (5.12)

where score(s) is scaled to [0, 1] and sim(p(z|s), p(z|sum)) is the similarity of the candidate sentence to the current, partial summary. Again, different similarity measures can be applied. In our experiments, we used the cosine similarity, which is defined as:

simCOS(p(z|s), p(z|sum)) = p(z|s)T p(z|sum) / ( |p(z|s)| |p(z|sum)| )   (5.13)


5.3 Experiments

We conduct our analysis and evaluate our approach on the DUC 2007 multi-document summarization dataset (see Section 1.4.2). We follow common practice and evaluate the quality of our summaries by measuring their overlap with human-created reference summaries according to the Rouge measure. Summaries are truncated to a length of 250 words before calculating Rouge scores. We report the performance of our summarizer on the commonly used Rouge-1, Rouge-2 (bigram overlap) and Rouge-SU4 (skip bigram) metrics, using the official parameter settings as described in Section 3.3. All obtained results are compared to the best participating systems of DUC 2007, as well as to a Lead sentence baseline system, which selects the first n sentences from the most recent news article to create a summary (see Section 1.4.3).

5.3.1 Results

Optimal parameter settings

We utilize the DUC 2006 multi-document summarization dataset to find optimal settings for the free parameter λ, and to tune the feature weight vector w. We follow the greedy procedure outlined in the previous chapter to learn feature weights. The optimal value of λ is determined by varying it in [0.0, 1.0], using a step size of 0.1. Our experiments on the DUC 2007 dataset are conducted with the best settings of λ = 0.4 and w = (6, 16, 2, 40, 1).

Figures 5.2 and 5.3 compare the Rouge-2, Rouge-1 and Rouge-SU4 scores achieved by variants of our extended summarizer for different values of α and T. We observe that for small T, the models combining term and bigram co-occurrence information outperform the models based only on term co-occurrence (α = 1.0) or only on bigram co-occurrence (α = 0.0). As we increase T, the extended models tend to over-fit, leading to lower summarization performance. This suggests that the information obtained from combining term and bigram observations allows for more descriptive latent topics, but utilizing a higher number of latent topics T tends to dilute these observations, leading to topics which pick out idiosyncratic word combinations [SG07]. The performance of the term-based model increases consistently for T ≤ 256, reaching a maximum Rouge-2 recall of 0.11776, before also over-fitting. The over-fitting is more clearly visible for all summarizers in Figure 5.3 (a).

Figure 5.2: Rouge-2 recall of the summarizer using a hybrid content model on DUC 2007 data as a function of the number of latent topics T, for different settings of the parameter α. For T < 64, the summarizers combining term and bigram co-occurrence information outperform the models based on term co-occurrence only (α = 1.0) or bigram co-occurrence only (α = 0.0). All combined summarizers tend to over-fit faster than the term-based summarizer.

The most interesting observation shown in Figure 5.2 is that adding bigram-sentence co-occurrence observations to a standard PLSA model can substantially improve Rouge-2 scores, and significantly reduces the number of latent topics T required for a good model. All combined models outperform the term and bigram baseline models on Rouge-2 for T ≤ 32 latent topics. The effect is less pronounced for Rouge-SU4 scores, but still recognizable (see Figure 5.3 (b)). The experimentally optimal value of α = 0.6 weights term and bigram co-occurrences almost equally. For lower values of α, i.e. models where bigram observations contribute more prominently during parameter estimation, the summarization performance of the model decreases substantially. This is mirrored in the Rouge-1 and Rouge-SU4 scores of systems with α ≤ 0.4, which are consistently lower than for the models which emphasize term co-occurrence observations (Figure 5.3). A possible explanation for this behavior is that these metrics benefit less from the incorporation of bigrams than Rouge-2, since they measure unigram overlap and long-range word correlations respectively. In contrast, Rouge-2 explicitly measures bigram overlap, which is reflected in the higher Rouge-2 scores of the combined models.


Figure 5.3: Rouge-1 (a) and Rouge-SU4 (b) recall of the summarizer using a hybrid content model on DUC 2007 data as a function of the number of latent topics, and for different settings of the parameter α.



The plots also indicate that term-sentence co-occurrence observations are more important for a good model than bigram-sentence co-occurrence observations. Further evidence for this assumption is given by the fact that the term-based model (α = 1.0) consistently outperforms the bigram-based model (α = 0.0), indicating that bigram co-occurrence information alone captures fewer of the topical relations that exist within a document collection. The most likely reason for this is that most bigrams occur only once, and there is therefore less overlap between different sentences than for term co-occurrence observations.

Comparison to state-of-the-art methods

Table 5.1 compares Rouge recall scores of our system to different state-of-the-art systems and two baseline systems. The state-of-the-art systems IIIT Hyderabad [PKV07], PYTHY [TBG+07] and GisTexter [HRL07] are described in Section 2.3.3. Our first baseline system PLSA-JS uses term co-occurrence observations only (α = 1.0), and thus implements the approach presented in the previous chapter. The second baseline is a Lead sentence baseline.

In the table, our novel system PLSA-F combines term and bigram co-occurrences into a single model, based on the best setting of the parameter α = 0.6. We see that the PLSA-F system outperforms the standard PLSA-JS approach on Rouge-1, Rouge-2 and Rouge-SU4 scores. However, the improvements are not significant at p < 0.05. On the other hand, the PLSA-F method achieves its best score using only T = 32 latent classes, compared to T = 256 for the PLSA-JS system. This suggests that the information supplied by the bigram co-occurrence observations reinforces the term co-occurrence observations, such that the model can better represent the different latent topics contained in the document cluster.

Our combined approach outperforms both state-of-the-art systems on Rouge-1 recall, and is not significantly worse on Rouge-SU4 recall. For Rouge-2, our system's performance is only slightly lower than the 95%-confidence interval of the top system's performance (0.11961–0.12925). The results of our system are also comparable to the topic modeling approach of Haghighi and Vanderwende [HV09], who report a Rouge-2 score of 0.118 for a model based on bigram distributions, but are significantly better than the 0.097 they report for a unigram-based model. We find that all systems achieve considerably better results than the Lead sentence baseline system.


System            Rouge-1   Rouge-2   Rouge-SU4
IIIT Hyderabad    0.44508   0.12448   0.17711
PYTHY             0.43245   0.12028   0.17593
PLSA-F (T=32)     0.45400   0.11951   0.17573
PLSA-JS (T=256)   0.44885   0.11774   0.17552
GisTexter         0.42419   0.10810   0.16280
Lead              0.31250   0.06039   0.10507

Table 5.1: DUC-07: Rouge recall scores for the best number of latent topics T. The PLSA-JS system uses term co-occurrences only (α = 1.0), the PLSA-F system combines term and bigram co-occurrence information, with α = 0.6. Our novel PLSA-F approach outperforms the best participating system (IIIT Hyderabad) on Rouge-1.

5.4 Conclusion

In this chapter, we investigated how probabilistic topic models can be combined with language models in order to relax the "bag-of-words" assumption of standard topic models. We introduced a novel approach for query-focused multi-document summarization that combines term and bigram co-occurrence observations into a single probabilistic latent topic model. The proposed method conditions bigram observations on the same latent topic variable as term observations, and thus couples "long-range" word correlations with short-range word associations that are due to word ordering.

We evaluated the applicability of our approach to the task of query-focused multi-document summarization on a recent summarization dataset, and performed extensive evaluations over a range of different parameter settings. Our results show that the integration of a bigram language model into a standard topic model leads to a system that produces higher-quality summaries than systems based on term or bigram co-occurrence observations alone. Furthermore, we find that our summarizer competes favorably with existing state-of-the-art systems. Our results are among the best reported on the DUC-2007 multi-document summarization tasks for Rouge-1, Rouge-2 and Rouge-SU4 scores. We have achieved these excellent results with a system that utilizes a considerably simpler model than previous topic modeling approaches to multi-document summarization.

One recurrent observation in our experiments is that the combined system requires a much smaller number of latent topics for optimal summarization performance than a PLSA summarizer based on term co-occurrence observations only. This is especially apparent in the high scores our approach achieves for the Rouge-2 metric, which directly reflects the usefulness of incorporating bigram observations. However, good models still rely to a large part on term observations, which are much more frequent than bigram observations and thus do not suffer from sparsity-related problems.


Part III

Subsentential Content Units


Chapter 6

Subsentential content units in news articles

The identification of similar content is one of the main challenges of automatic multi-document summarization. In the previous chapters of this thesis, we have developed several approaches for discovering and modeling subtopics in a collection of thematically related documents. Subtopics capture major "themes" that are addressed in documents of a given domain, and thus correspond to a very coarse-grained segmentation of documents: typically, each subtopic can span an amount of text up to a few paragraphs.

News articles reporting on the same event, however, are similar not only in terms of the subtopics they address, but often also relate the same facts. Figure 6.1 shows two example paragraphs from different news articles which report the disappearance of a plane piloted by John F. Kennedy Jr. Both paragraphs relate the same, or very similar, information: Two U.S. agencies are searching for the missing plane, the plane was carrying John F. Kennedy Jr., who is the son of the 35th U.S. president, and the search is being conducted off the coast of Long Island. In multi-document summarization, such factual pieces of information are often denoted content units [NP04]. Content units, which we introduced in Section 2.6, are defined as text spans that express a particular piece of information, such as a fact, and are primarily characterized by their meaning. A consequence of this definition is that content units which relate the same information may differ in their choice of words.

Summarization systems can benefit from an identification of content units in many ways. The comparison of units of text larger than words or phrases, but smaller than sentences, provides a much more useful granularity for determining content similarity than the comparison of words, word senses, or complete sentences. In addition, the identification of information units on the basis of their meaning offers a way to match similar content regardless of the actual choice of words used to express this meaning.


Doc 1: The U.S. Coast Guard and the Air National Guard are conducting a massive search off the coast of Long Island, N.Y. for a small plane carrying John F. Kennedy Jr., son of the 35th U.S. President, U.S. media reported Saturday. The search began Saturday morning in an area covering some 1,000 square miles, presumably the flight path of Kennedy's plane, searchers said.

Doc 2: A small plane carrying John F. Kennedy Jr., son of the former U.S. president, was reported missing early Saturday, and a search was under way off the coast of New York's Long Island, official sources said. The U.S. Coast Guard confirmed it was searching for the plane with help from the Air National Guard. The search was being conducted in water off eastern tip of Long Island, along the presumed flight path of Kennedy's plane.

Figure 6.1: The figure shows two paragraphs from different news articles which report the same facts concerning the disappearance of a plane piloted by John F. Kennedy Jr. Facts are combined in different ways into the sentences, and are expressed with very similar wording.

This latter benefit has been recognized early in summarizer evaluation, where content units and similar notions have played a major role in many different evaluation schemes [vHT03, NP04, PNMS05, HLZ05]. The goal of these evaluation approaches is to overcome the problem of human variability in content selection and expression, and to reward machine-generated summaries if they relate the same facts as summaries written by humans, even if they utilize different words and phrases to do so. Furthermore, modeling each sentence as a set of content units can help to overcome problems arising in models that utilize word vector representations, such as handling ambiguous and synonymous words. As discussed in Section 2.6, content units can currently be identified only by human annotators, and an automatic discovery of content units remains a major open challenge in the context of automatic summarization and summarizer evaluation.

In this chapter, we present an unsupervised approach to the identification of subsentential word patterns that are similar to content units. We simplify the problem of automatically identifying content units – which is beyond the scope of this work – to the problem of finding distinctive word patterns that are repeated across sentences.


We desire to learn, for example, that the content unit "JFK Jr. was the son of the 35th U.S. President" is described by a word distribution favoring words such as "Kennedy, U.S., Jr., son, president". We conjecture that the learnt word distributions are distinctive enough to group subsentential text spans from different sentences with the same or similar meaning, and to distinguish between text spans with different meanings. Consequently, we can represent each sentence as a distribution over a specific set of text spans. This problem formulation naturally lends itself to an application of latent topic modeling algorithms. The question we thus intend to answer in this chapter is whether latent topics discovered in related sentences are similar to manually annotated content units.

In order to verify the validity of our idea, we conduct an analysis of a set of closely related pairs of news articles, chosen from the MDS task of the DUC 2007 data set. In this data set we identify four different types of content units and annotate a gold-standard set of content units in each pair of documents. We then develop a probabilistic topic modeling approach that aims to automatically determine this set of content units. Our model infers the semantics of words based on their distributional context, and derives mappings between different sentences on the basis of recurrent word patterns. In our evaluation, we provide evidence that the learnt word distributions are very similar to the word distributions of gold-standard content units. Furthermore, our model infers sentence associations that correspond closely with those that we have manually identified, which enables us to group sentences expressing the same or similar facts.

The remainder of this chapter is structured as follows: After a brief introduction of the utilized dataset (Section 6.1) we introduce the notion of content units in more detail, and describe the different types of content units that we want our model to learn (Section 6.2). We then present our automatic approach to content unit discovery in Section 6.3, and evaluate its accuracy in learning content unit-like word distributions and sentence associations in Section 6.4.

6.1 Dataset

We perform our analysis on pairs of news articles which report the same news event, are very similar in terms of word choice, and which were written around the same date. This allows us to detect useful patterns that are not lost in the noise of too varied information found in a larger set of news articles, and reduces the number of content units to be discovered. In addition, such closely related news articles will be more likely to report the same facts and events.


DUC ID   Document A          Document B          Annotator ID
D0706    APW19980911.0093    APW19980911.0350    1,2
D0710    APW19981108.0225    APW19981108.0643    1,2,3
D0714    NYT20000503.0009    NYT20000503.0432    1
D0718    APW19990630.0028    APW19990701.0333    1,2,3
D0721    APW19991026.0010    APW19991026.0133    1,3
D0724    APW19991026.0184    APW19991027.0174    1,2,3
D0727    NYT19991003.1076    NYT19991006.0325    1,3
D0730    APW19980625.0941    XIE19980626.0339    1,3
D0734    NYT19980921.0071    XIE19980922.0236    1,2,3
D0742    XIE19990718.0011    XIE19990718.0013    1,2,3
D0743    XIE19990821.0129    XIE19990823.0008    1,2,3

Table 6.1: Document pairs used for content unit analysis, and identifiers of the annotators who defined content units.

We therefore conduct our analysis on documents drawn from the DUC 2007 multi-document summarization dataset. The dataset consists of 45 clusters, and each cluster contains 25 thematically related news articles. To find suitable documents, we first represent each document in a given cluster as a tf-idf-weighted word vector, and compute the cosine similarity sim(di, dj) for all document pairs di, dj ∈ D in the cluster.1 We then choose the most similar document pair di, dj of each cluster if 0.7 ≤ sim(di, dj) ≤ 0.85. The upper bound on the similarity helps to avoid choosing documents which are simply copies or minor revisions of each other, while the lower bound ensures that the two documents are sufficiently similar, and are likely to share the same facts. Using this procedure, we found suitable document pairs for 11 of the 45 DUC 2007 document clusters. On average, each of these document pairs contained 34.3 sentences and had a vocabulary of 169 words, which occurred a total of 393 times. Table 6.1 lists the DUC topic and document identifiers used in our analysis, and provides information about which annotators defined content units for the respective document pairs.

1 We removed stop words and performed stemming as described in previous chapters of this thesis.
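The pair selection procedure can be sketched as follows, using scikit-learn's tf-idf vectorizer as a stand-in for the preprocessing described in the footnote; the thresholds follow the text, while the helper name is illustrative.

from itertools import combinations
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def select_document_pair(cluster_docs, lo=0.7, hi=0.85):
    # Return the most similar pair (d_i, d_j) with lo <= sim <= hi,
    # or None if the cluster contains no such pair.
    X = TfidfVectorizer().fit_transform(cluster_docs)
    S = cosine_similarity(X)
    best, best_sim = None, -1.0
    for i, j in combinations(range(len(cluster_docs)), 2):
        if lo <= S[i, j] <= hi and S[i, j] > best_sim:
            best, best_sim = (i, j), S[i, j]
    return best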


6.2 Subsentential content units

The goal of our approach is to discover recurrent word patterns which approximate content units. In this section we will introduce the types of content units that we annotated in our dataset.

6.2.1 Types of content units

We adopt the notion of content units introduced by Nenkova et al. [NP04, NPM07] (see Section 2.6) and assume that a content unit is identified by noting information that is repeated across source documents. Following the procedure outlined by Nenkova and Passonneau [NP04], we determine content units by first identifying similar sentences, and then proceeding with a finer grained inspection of these sentences to identify more tightly related subparts. However, we limit the identification of related subparts to the level of sentential clauses and do not consider more fine-grained similarities, such as repeated words or noun phrases, as separate content units. Thus, the content units we describe next roughly correspond to "relations" between sentences, and can be viewed as expressing the degree of similarity of sentence pairs.

By the above definition, each sentence can contain one or more content units. Some sentences may express the same information as another sentence, and some sentences may combine information from two or more distinct sentences. The words in many sentences will reflect the main theme of the document, and hence there will be some words that are very common across sentences. Consider the example given in Figure 6.2: Words like "Kennedy" and "U.S." reappear throughout the sentences, whereas words like "son" or "flight" appear only once, and are associated with a particular fact. In this example, we can identify four different content units. The different content units are bracketed and numbered, and similar content units express approximately the same information, but differ in word choice and ordering.

The analysis of our dataset resulted in the definition of the following types of content units:

Copy

Sentences that are verbatim copies of another sentence constitute the first type of content unit. Sentence copies are typically easy to match to each other, but are not necessarily trivial to distinguish from other content units, as their word distribution may overlap with that of other sentences.


D1S1: [The U.S. Coast Guard and the Air National Guard are conducting a massive search off the coast of Long Island, N.Y.]1 [for a small plane carrying John F. Kennedy Jr., son of the 35th U.S. President,]2 [U.S. media reported Saturday.]3

D1S2: [The search began Saturday morning in an area covering some 1,000 square miles, presumably the flight path of Kennedy's plane,]4 searchers said.

D2S1: [A small plane carrying John F. Kennedy Jr., son of the former U.S. president,]2 [was reported missing early Saturday,]3 and [a search was under way off the coast of New York's Long Island,]1 official sources said.

D2S2: [The U.S. Coast Guard confirmed it was searching for the plane with help from the Air National Guard.]1

D2S3: [The search was being conducted in water off eastern tip of Long Island,]1 [along the presumed flight path of Kennedy's plane.]4

Figure 6.2: The figure shows manually annotated content units in two example paragraphs extracted from different articles. Content units are bracketed and numbered, and can be interpreted as roughly corresponding to facts.

Similar

Sentences that express the same content, but with different word usage or word ordering, are considered as the second type of content unit. The following pair of sentences gives an example of utilizing synonyms and word re-ordering (the underlined parts denote word sequences which are identical in the two sentences):

(a) The Supreme Court struck down as unconstitutional a law giving the president a line-item veto which lets him cancel specific items in tax and spending measures.
(b) The U.S. Supreme Court Thursday struck down as unconstitutional the line-item veto law that lets the U.S. president strike out specific items in tax and spending measures.

Clause

The Clause type of content units corresponds to clauses that are repeated as parts of another sentence, or as a full sentence, with similar or identical word usage.


Clause units are therefore similar to Copy or Similar content units, with the difference that there is no one-to-one correspondence between two sentences. Instead, at least one of the two sentences must also contain an additional, different content unit.

(a) Germany, Azerbaijan, Greece, France, the Netherlands, Kazakhstan, Ukraine and Russia have been participating in the fight against the blaze that threatened to engulf the entire field of 30 storage tanks containing 1 million tons of crude oil.
(b) However, he said the strong fire had destroyed seven storage tanks and damaged two other ones in the refinery which held 30 storage tanks containing 1 million tons of crude oil.

In this example, the underlined text constitutes a Clause content unit, and the remainders of the two sentences make up two further content units. In the example given in Figure 6.2, all annotated content units correspond to the Clause type.

Unique

Finally, sentences that express unique information that is not repeated in any other sentence form the last type of content unit. In contrast to the annotation procedure described by Nenkova et al. [NPM07], we do not split into clauses sentences which relate information that appears in only one document. The reason for this is that, as explained above, we limit our approach to the task of finding subsentential word patterns that are repeated across sentences, and the word patterns of unique sentences are by definition not found anywhere else.2

6.2.2 Annotating gold-standard content units

Three different annotators marked up content units in the set of 11 document pairs. The annotators were given the definitions of the content unit types investigated in this study, and instructed to follow the procedure for identifying similar content units outlined above. Six document pairs were processed by all human annotators, and one document pair by a single annotator only (see Table 6.1). The task of each annotator was to scan each document pair for the content unit types defined above, and to write down for each identified content unit the ids of the sentences it occurs in.

2 Note that for the same reason we do not split copied or similar sentences into distinct clauses. We only split sentences if a specific clause is repeated as part of a different sentence together with a clause not contained in the original sentence.


Figure 6.3: Distribution of annotated gold-standard content units per type for all document pairs, averaged over annotators.

In addition, the annotator specified the type of content unit she selected. If a sentence contained multiple content units, the annotator created several distinct lists of sentence ids. Figure 6.3 shows the distribution of content units per type for all document pairs, averaged over annotators. We see that for some document pairs, such as for D0718 and D0721, the majority of content units are of type Copy or Similar. On the other hand, some document pairs do not share any or only very few similar or copied sentences, such as for example D0727 and D0742. On average, annotators identified 24.2 distinct content units per document pair, which corresponds to approximately 0.7 content units per sentence.3

In order to be able to evaluate our approach on these gold-standard annotations, we transform the output of the annotation process into a matrix Θ̂ of content unit-sentence assignments for each annotator and document pair, where each entry Θ̂ij = 1 if content unit i occurs in sentence j.

3 This number is lower than 1 because we counted content units of type Copy, Similar and Clause only once for each set of sentences they were expressed in.


In addition, we estimate the word distribution p̂(w|ẑi) for each content unit ẑi, which corresponds to row θ̂i of Θ̂. After removing stop words from the sentences associated with the content unit and performing stemming, we calculate the maximum likelihood estimate of p̂(w|ẑi):

p̂(wk|ẑi) = ( Σj n(wk, sj) Θ̂ij ) / ( Σj Σk n(wk, sj) Θ̂ij )   (6.1)

In the above equation, n(wk, sj) corresponds to the frequency of word wk in sentence sj. p̂(w|ẑi) is therefore equal to the frequency of a word in the sentences associated with a particular content unit, normalized by the total number of words in those sentences. We denote the resulting matrix of word distributions per content unit as Φ̂, such that Φ̂ki specifies the conditional probability of word k given content unit i. Note that this approach will not lead to completely adequate word distributions for topics of type Clause, since all words – and not only the words of the clause – will be counted for the purpose of estimating the content unit's word distribution. We leave for future work the manual annotation of clause and phrase boundaries, such that only the proper subset of words is used in this estimation step.
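In matrix form, Equation 6.1 amounts to one matrix product followed by a column normalization, as in the following sketch (counts is the word-sentence frequency matrix n(w, s), gold the binary annotation matrix Θ̂; the function name is illustrative):

import numpy as np

def unit_word_distributions(counts, gold):
    # counts: (words x sentences); gold: (units x sentences), binary.
    unit_counts = np.asarray(counts, float) @ np.asarray(gold, float).T
    totals = unit_counts.sum(axis=0, keepdims=True)
    return unit_counts / np.maximum(totals, 1e-12)  # column i holds p̂(w|ẑ_i)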

There are two problematic issues when comparing the content unit annotations produced by different annotators. First, annotators do not necessarily agree on the number of content units. In addition, they may disagree on which text spans constitute a content unit, leading to partial matches. Due to these difficulties, many standard inter-annotator agreement metrics cannot be applied [PNMS05]. We therefore measure inter-annotator agreement as the fraction of fully matching content units, i.e. where both annotators agree on the type of the content unit and which sentences it occurs in. Averaged over annotator pairs, the inter-annotator agreement for this metric is 0.69. This means that annotators fully agreed on approximately two thirds of the content units. The mean pairwise Pearson correlation of annotators on the number of topics for each document pair is 0.97, suggesting that annotators also agreed very strongly on the "amount" of information contained in the document pairs.

6.3 Sentence-level topic models

In this section, we introduce our approach to automatically finding (sub-)sentential word patterns which are similar to content units. Our idea is to use a latent topic model to solve this task. In this model, each sentence is represented as a mixture of latent topics, and each topic corresponds to a distribution over words. The goal of our approach is to find out whether these word distributions correspond to the word distributions of manually annotated content units, and if the latent topics occur in the same sentences as similar content units. We thus evaluate the correspondence of latent topics and content units by looking at word distributions and sentence associations. We base our approach on a number of assumptions:



• Closely related news articles report the same facts. We gave evidence supporting this assumption in the previous section of this chapter.

• Content units are expressed with a similar, but not necessarily identical choice of words. Various authors have given evidence that human authors, when expressing the same information, may vary word choice (e.g. use of synonyms or paraphrases) and word ordering, but in the end there is only a fixed set of alternatives that can be used without changing the information content [Luh58, BL04, HNPR05].

• Content units may be repeated in different sentences, and may be combined in different ways into sentences. This assumption arises as a consequence of the definition that content units are at most as large as a sentential clause.

6.3.1 Inference for sentence-level topic models

A topic model is a generative latent variable model that associates each latent topic with a distribution over words. Each document is represented as a mixture of topics. In our approach we utilize the Latent Dirichlet Allocation model introduced by [BNJ03] (see Section 2.4.2). In this model, each document is generated by first choosing a distribution over topics θd, parametrized by a conjugate Dirichlet prior α. Subsequently, each word of this document is generated by drawing a topic zk from θd, and then drawing a word wi from topic zk's distribution over words ϕk. ϕ is parametrized by a conjugate Dirichlet prior β.

To estimate the posterior distribution of the hidden variables Φ and Θ given a collection of documents and a number of topics T, a variety of approximate inference algorithms can be applied [Bis07]. We choose to employ Gibbs sampling, a Markov chain Monte Carlo technique, using the implementation of [GS04]. After the model has been trained, it specifies in the matrix Φ the probability p(w|z) of words given topics, and in the matrix Θ the probability p(z|d) of topics given documents. p(w|z) thus indicates which words are important in a topic, and p(z|d) tells us which topics are dominant in a document.
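For concreteness, a compact collapsed Gibbs sampler for this model is sketched below. The thesis uses the implementation of [GS04], so this standalone version is only meant to show the update that is iterated; documents here are sentences, encoded as lists of word ids, and the default priors and iteration count follow the experimental settings reported later in this chapter.

import numpy as np

def gibbs_lda(docs, V, T, alpha=0.01, beta=0.01, iters=2000, seed=0):
    rng = np.random.default_rng(seed)
    D = len(docs)
    nzw = np.zeros((T, V))  # topic-word counts
    ndz = np.zeros((D, T))  # document-topic counts
    nz = np.zeros(T)        # tokens per topic
    assign = []
    for d, doc in enumerate(docs):  # random initialization
        zs = rng.integers(T, size=len(doc))
        assign.append(zs)
        for w, z in zip(doc, zs):
            nzw[z, w] += 1; ndz[d, z] += 1; nz[z] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                z = assign[d][i]  # remove the current assignment ...
                nzw[z, w] -= 1; ndz[d, z] -= 1; nz[z] -= 1
                # ... and resample z from its full conditional
                p = (nzw[:, w] + beta) / (nz + V * beta) * (ndz[d] + alpha)
                z = rng.choice(T, p=p / p.sum())
                assign[d][i] = z
                nzw[z, w] += 1; ndz[d, z] += 1; nz[z] += 1
    phi = (nzw + beta) / (nzw.sum(1, keepdims=True) + V * beta)      # p(w|z)
    theta = (ndz + alpha) / (ndz.sum(1, keepdims=True) + T * alpha)  # p(z|d)
    return phi, theta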


6.3.2 Matching content units and latent topics

We can estimate the similarity of latent topics and manually annotated content units by comparing their word distributions and their sentence associations. The comparison of word distribution similarity is straightforward, and achieved by calculating the pairwise Jensen-Shannon (JS) divergence of the distributions P(w|z) = Φ and P̂(w|ẑ) = Φ̂ for each latent topic zi and each content unit ẑj. The Jensen-Shannon divergence of two distributions is defined as:

DJS(P||Q) = (1/2) DKL(P||M) + (1/2) DKL(Q||M),   (6.2)

where M = 1/2 (P + Q), and DKL(P||Q) is the Kullback-Leibler divergence of two distributions P and Q. As a result, we can identify matching pairs (zi, ẑj) that have a low JS divergence. Each latent topic zi is matched to a single content unit ẑj and vice versa, using a greedy approach. The approach first finds the pair (zi, ẑj) with the lowest overall JS divergence, and then iteratively selects, from the remaining latent topics and content units, the pair with the next-lowest JS divergence. This procedure is repeated until all content units are matched. Using this mapping, we can now compare the sentence associations of our model with the sentence associations of content units determined by the human annotators. The comparison is performed on the basis of the matrices Θ and Θ̂. For each match (zi, ẑj), we select and compare the corresponding vectors θi and θ̂j.
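The greedy matching step can be written down directly. In the sketch below, div holds the pairwise JS divergences between latent topics (rows) and content units (columns); the helper name is illustrative.

import numpy as np

def greedy_match(div):
    # Repeatedly pick the globally lowest-divergence pair and retire
    # its row and column, until every content unit is matched.
    div = np.array(div, float)
    pairs = []
    for _ in range(min(div.shape)):
        i, j = np.unravel_index(np.argmin(div), div.shape)
        pairs.append((int(i), int(j)))
        div[i, :] = np.inf
        div[:, j] = np.inf
    return pairs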

6.4 Experiments

Evaluation metrics

We evaluate the accuracy of the sentence associations learnt by our model by calculating the precision and recall of the learnt topic-sentence distribution matrix Θ with respect to the gold-standard content unit annotations Θ̂, for each document pair and annotator. Since Θ̂ is a binary matrix, we binarize Θ to give Θ′:

Θ′ij = 1 if Θij ≥ ϵ, and Θ′ij = 0 otherwise.   (6.3)

Recall is defined as the fraction of correct sentences identified by the latent topic with respect to the overall number of correct sentences associated with the content unit:

Rθ′i = |θ′i ∩ θ̂j| / |θ̂j|   (6.4)


Precision, on the other hand, measures the quality of θ′i, i.e. the fraction of correctly associated sentences divided by the number of all associated sentences:

Pθ′i = |θ′i ∩ θ̂j| / |θ′i|   (6.5)

We also compute the F1-measure, which combines precision and recall into a single value:

F1 = 2 · P · R / (P + R)   (6.6)

The subscript of 1 indicates a balanced weighting of recall and precision; this measure is also known as the harmonic mean of precision and recall. We can now compute precision, recall and F1 scores for each latent topic. Averaged over all assignments (i.e. topics, document pairs and annotators), these measures give us an indication of how well the latent topic model captured the content units we are interested in.
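Equations 6.3-6.6 translate into a few lines of code. In the sketch below, theta_row holds the learnt probabilities p(zi|s) over sentences for one topic, and gold_row the binary annotations of the matched content unit; the default threshold follows the setting of ϵ reported below.

import numpy as np

def precision_recall_f1(theta_row, gold_row, eps=0.1):
    pred = np.asarray(theta_row, float) >= eps          # Equation 6.3
    gold = np.asarray(gold_row).astype(bool)
    tp = (pred & gold).sum()
    recall = tp / gold.sum() if gold.any() else 0.0     # Equation 6.4
    precision = tp / pred.sum() if pred.any() else 0.0  # Equation 6.5
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)               # Equation 6.6
    return precision, recall, f1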

Parameter settings

We conduct our experiments on the dataset described in Section 6.1. The input for the latent topic model inference algorithm is a matrix A of co-occurrence observations, which in our case correspond to observations of words in sentences. We create this matrix for each document pair by preprocessing sentences as described above, and by setting each entry Aij to the frequency of word i in sentence j. In our dataset, the majority of these frequencies is 1.

Since we want to learn a topic model with a structure that reflects the type of topics defined in Section 6.2, the topic distribution for each sentence should be peaked toward a single or only very few topics. This is ensured by setting the priors α and β to low values, which enforces a bias toward sparsity and leads to more peaked distributions [SG07]. A low value of β also favors more fine-grained topics [GS04]. In our experiments, we set α = 0.01 and β = 0.01, which were determined experimentally on the D0742 document pair. The parameter T, the number of latent topics to learn, is set to the number of manually annotated topics. We then run the Gibbs sampling algorithm on A for 2000 iterations, and collect a single sample from the resulting posterior distribution over topic assignments for words.4 From this sample, we compute the conditional distributions P(w|z) and P(z|d). We repeat this procedure for each document pair and annotator.

4 After the so-called burn-in period, the topic assignments of individual words stabilize. Taking multiple samples may be beneficial, e.g. to compute averaged statistics. However, topics are not uniquely identifiable across samples, as the model is unaffected by permutations of the indices of the topics [GS04]. We therefore avoid this complication and utilize only a single sample, as suggested by other authors [GS04, SG07].



We set the evaluation parameter ϵ, the threshold for binarizing Θ, to 0.1. Since the topic modeling algorithm learns very peaked distributions, i.e. distributions where one or only a few topics are assigned most of the probability mass, the actual value of this threshold does not have a large impact on the resulting matrix and subsequent evaluation results.5

Similarity of word distributions

Figure 6.4 shows the pairwise similarities of the word distributions of content units and latent topics for four of the 11 document pairs. Each cell in the plot displays the JS divergence of the word distribution of a latent topic (column) compared to the word distribution of a content unit (row). Lower values (darker cells) correspond to a lower JS divergence, and thus a higher similarity of the respective word distributions. On the diagonal, the best-matching pairs (zi, ẑj) are ordered by increasing JS divergence, as determined by the greedy matching algorithm described above.

The plots show a clear correspondence between many latent topics and the matched content units, indicated by the strongly pronounced main diagonal. On this diagonal, the JS divergences are much lower than those measured for off-diagonal topic-unit pairs. The striking correspondence between latent topics and content units observed in these plots leads to two main conclusions: Our model discovers many useful subsentential word patterns which are clearly separable, and the same properties hold for the words used to express content units. In addition, many latent topics have a single content unit counterpart (and vice versa). The mappings are identified regardless of the type of content unit, and can deal with lexical variability and differences in word order. Our findings thus suggest that word distributions characteristic for content units can be identified by an automatic approach that is based on a co-occurrence analysis of words and sentences.

There are a few other observations that can be made from Figure 6.4. First, we can see that sometimes there are multiple cells with low JS divergence in a single row or column (e.g. in row 4 of document pair D0706). These entries indicate that in some cases, our model has created multiple latent topics with similar word distributions, which are all quite similar to the word distribution of a particular content unit. This is an effect of the random initializations of the topic modeling algorithm. Furthermore, we find that as we move along the main diagonal towards the lower right, the similarity of the pairs (zi, ẑj) decreases, and the matches become arbitrary. Thus, for document pair D0706, we find that only the first 10-15 matching pairs are truly useful for our further evaluation. Our model thus cannot identify as many latent topics as there are content units. Further evidence for this observation comes from the fact that during our experiments, we noticed that the Gibbs sampler did not always use all the latent topics available. Instead, some topics had a uniform distribution over words, i.e. no words were assigned to these topics during the sampling process.

5 Except, of course, if ϵ is set to a very high value such as ϵ > 0.5.


(a) D0706   (b) D0710   (c) D0724   (d) D0734

Figure 6.4: Pairwise Jensen-Shannon divergence of word distributions of manually annotated content units and latent topics. Matching topics are ordered by increasing divergence along the diagonal, using a simple greedy algorithm. The examples show a clear correspondence of latent topics to the gold-standard content units.


Table 6.2 shows the most likely terms for some example topic-content unit matches. The first topic captures the fact that different countries helped to fight the blaze that threatened to engulf the entire field of 30 storage tanks, the second lists words related to the fact that the storage tanks contained one million tons of crude oil.


Top Terms   Topic 5   Content Unit 2   Topic 8   Content Unit 4
            blaze     30               oil       storag
            engulf    azerbaijan       crude     tank
            entir     blaze            tank      1
            field     engulf           ton       30
            fight     entir            storag    crude

Table 6.2: Example matches of latent topics and content units.

Figure 6.5: F1 scores of sentence associations discovered by a latent topic model, compared to gold-standard content units. Scores are shown per document pair, and averaged over topic-content unit matches and annotators. The mean F1 score across all document pairs is 0.86.



Figure 6.6: Precision (a) and Recall (b) of sentence associations discovered by a latent topic model, compared to gold-standard content units.

Similarity of sentence associations

Figure 6.5 shows the F1 scores of correctly identified sentence-topic associations for each of the 11 document pairs. The reported values are averaged over matched topic-content unit pairs, and over the annotators of this document pair. We see that the sentence associations discovered by the latent topic model in most cases correspond quite well to those of manually annotated content units. The mean F1 score is 0.86. Precision is consistently higher than recall for all document pairs, with the average precision being 0.89, and average recall 0.83 (Figure 6.6). Topic models for document pairs that contain many Clause or Unique content units seem to be more difficult to learn. This is indicated by the relatively low F1 scores for document pairs D0714 and D0727.

These results suggest that our probabilistic topic model is quite accurate in detecting the correct sentence associations. In combination with the excellent word distribution similarity results, our results provide convincing evidence that latent topic models can successfully discover sentence- and clause-level topics that are similar to manually annotated content units. Our model allows us to determine which sentences a given latent topic – and therefore a corresponding content unit – occurs in, and which words are characteristic for this particular latent topic, regardless of word-level variability. Note, however, that our approach does not correspond to a truly semantic identification of content units: Our model relies on the bag-of-words assumption, and thus is not aware of the meaning of a clause. This assumption also hinders correctly matching content units, for instance when considering a clause that negates a previous statement. In addition, the probabilistic nature of our approach makes it difficult to identify exact segmentations of content unit boundaries, a problem that may more easily be solved by applying syntactic and semantic parsing algorithms.

An evaluation of the performance of our model with respect to the different types of content units confirms our intuition that content units of type Clause are the most difficult to identify. Figure 6.7 shows precision and recall values for different types of content units, and different settings of the parameter γ. The error bars indicate the standard deviation of the results, which are averaged across document pairs and annotators. γ is a threshold on the JS divergence of pairs (zi, ẑj) which enables us to evaluate the performance of our approach for matching pairs of different quality. If γ = 0.1, we only consider high-quality matches with a very low JS divergence; when γ = 1, we consider all matches. Thus, with higher γ, matches of lower quality lead to lower average precision and recall scores.

We can see that the precision and recall scores of topics corresponding to copied or similar content units are very high. Latent topics for content units of type Unique are also identified with very high precision, but with lower recall. The recall curve of content unit type Clause is similar to that of Unique, but precision is lower than for all other content unit types. There are two main reasons which can explain this observation: First, our model must deal with extraneous 'noise' words in the enclosing sentences, i.e. it must distinguish co-occurrence observations for all those words that do not actually belong to the content unit. Second, the word distributions we estimated for these content units are not adequate, as explained in Section 6.2. The greedy matching process intuitively prefers matching clearly defined topic-word distributions, and the noise introduced by the extra words may well dilute the distributions of this kind of content unit too much in order for them to be matched correctly. A better modeling of the gold-standard word distributions is therefore necessary to show the real performance of this content unit type, e.g. by a segmentation of sentences into clauses.

6.5 Conclusion

This chapter investigated the nature of fact reporting in closely related news articles. Our studies of content units – meaning-oriented text spans that are similar to facts – verified that content units often re-occur in different news articles. In addition, they are often expressed with similar, but not necessarily identical word patterns. Content units occur in different sentence contexts, and may be combined differently into full sentences.


Figure 6.7: Precision and recall for different types of content units, and for different settings of parameter γ. Only topic-content unit matches with a JS divergence ≤ γ are considered when computing precision and recall. The error bars show the standard deviation of the scores, which are averaged over matches, annotators and document pairs.

Representing sentences as content unit distributions rather than as bags of words is highly desirable, as it diminishes the effects of lexical variability and allows for meaning-oriented comparisons of content.

We presented a novel, unsupervised approach that maps (sub-)sentential, recurrent word patterns to meaningful latent topics. Our approach addresses lexical variability on the basis of a co-occurrence model, and groups together observations with similar meaning. We evaluated our method on a dataset of 11 news article pairs, for which we manually annotated a set of gold-standard content units. Our evaluations showed that many of the automatically discovered latent topics closely resemble gold-standard content units. In particular, we observed a striking correspondence between the word distributions of latent topics and content units. The results suggest that our model discovers many useful word patterns which are clearly separable and in addition have a valid counterpart in the word distributions of manually annotated content units. These observations are confirmed when analyzing the sentence distribution of latent topics, which our model predicts with very high accuracy for different types of content units. Our studies thus suggest that topic models, with their shallow statistical approach to semantics, can successfully be utilized to identify sentence-level latent topics which are similar to content units.

Our approach has many interesting applications. For example, it can be seen as a step toward the automated acquisition of Summary Content Units used in the Pyramid summarization evaluation method, a task that we study in the next chapter.


Chapter 7

Content units in human-written reference summaries

In the field of multi-document summarization (MDS), the Pyramid method has become an important approach for evaluating machine-generated summaries [NP04, PNMS05, NPM07]. The method rewards automatic summaries for conveying content that has the same meaning as content represented in a set of human model summaries. This approach contrasts the Pyramid method with other evaluation methods such as Rouge that measure word n-gram overlap. While a meaning-oriented evaluation of content addresses the problem of human variability in content expression, it suffers from the fact that it currently cannot be automated. Instead, human annotators identify content with the same meaning by inspecting similar sentences in model and machine-generated summaries, which adds yet another level of human effort (on top of creating model summaries) to the task of summary evaluation. The automatic identification of semantically similar content therefore remains a major challenge both in summary evaluation [HLZ05] and summary generation.

The basis of the Pyramid methods’ scoring scheme is an aggregation ofclause-length text spans with the same meaning into Summary Content Units(SCU, see Section 1.4.1). The identification of similar content by humanannotators allows for variation in how this content is worded. Differencesin content expression can manifest themselves for example in the choice ofwords, different word order, or in the use of paraphrases. In fact, similarcontent is determined solely by the judgment of the annotator, and thusindependent of which words are used, or how many [PNMS05]. However,various authors have observed that semantically similar text spans writtenby different human summarizers are often expressed with a similar choiceof words [NP04, HNPR05]. A related observation was made by Luhn, who


argued that although an author “can vary word choices to express the same notion (e.g. by the use of synonyms) as she advances her argument, in the end there is only a fixed set of legitimate alternatives at her disposal, if she does not want her writing to become imprecise.” [Luh58]

In this chapter, we apply the approach introduced in the previous chapter to the task of automatically identifying semantically similar text spans in human model summaries. In particular, our intention is to find out how such an analysis can enrich our understanding of human summaries. We want to learn whether human summary authors use similar word patterns to express the same ideas when summarizing the contents of a set of thematically related source documents. Previous studies have shown that the agreement of human summarizers on individual words is high if these words are very frequent in the source documents [NV05]. In the work presented in this chapter, we analyze whether human summarizers not only agree in their choice of words, but also in their choice of word patterns. Our observations suggest that this is an important step for distinguishing text spans with different meaning, since high-frequency words closely related to the main theme of a document cluster often occur in many different Summary Content Units (Section 7.1). Our contributions are as follows:

• We train a probabilistic topic model on the term-sentence matrix of human model summaries used in the DUC 2007 Pyramid evaluation. We analyze the resulting model to evaluate whether a topic model captures useful structures of these summaries.

• Given the model, we compare the automatically identified topics with Summary Content Units (SCUs) on the basis of their word distributions. We discover a clear correspondence between topics and SCUs, which suggests that many automatically identified topics are good approximations of manually annotated SCUs (Section 7.2).

• We analyze the distribution of topics over summary sentences in Section 7.3, and compare the topic-sentence associations computed by our model with the SCU-sentence associations given by the Pyramid annotation. Our results suggest that the topic model finds many SCU-like topics, and associates a given topic with the same summary sentences in which a human annotator identifies the corresponding SCU.

The automatic identification of latent topics that approximate SCUs has clear practical applications: the latent topics can be used as a candidate set of SCUs for human annotators to facilitate the process of SCU creation. Given a set of latent topics that correspond to SCUs, the learnt topics can also be identified in machine-generated summaries using standard statistical inference techniques [AWST09], which would speed up the process of summary scoring.

The remainder of this chapter is organized as follows: After introducing Summary Content Units in more detail and giving some illustrative examples of SCUs in Section 7.1, we present our approach to Summary Content Unit discovery in sets of human model summaries (Section 7.2). We then evaluate the accuracy of our approach in learning SCU-like word distributions and sentence associations in Section 7.3.

7.1 Summary content units

We start our investigations with a brief introduction of the Summary Content Unit creation process, and an analysis of how content is expressed in typical Summary Content Units. For a more in-depth description of the Pyramid summary evaluation method, we refer the reader to Chapter 1.

A Pyramid is a model predicting the distribution of information content in summaries, as reflected in the summaries humans write [PNMS05, NPM07]. Similar information content is identified by inspection of similar sentences, and parts of these, in different human model summaries. Typically, the text spans which express the same semantic content are not longer than a clause. An SCU consists of a collection of text spans with the same meaning (contributors) and a defining label specified by the annotator.

Each SCU is weighted by the number of human model summaries it occurs in (i.e. the number of contributors). The Pyramid metric assumes that an SCU with a high number of contributors is more informative than an SCU with few contributors. An optimal summary, in terms of content selection, is obtained by maximizing the sum of SCU weights, given a maximum number of SCUs that can be included for a predefined summary length [NP04].
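For reference, this selection criterion underlies the Pyramid score; a sketch following [NP04], with notation introduced here (see Chapter 1 for the full definition): let tier T_i contain the SCUs of weight i, let n be the highest weight, let D_i be the number of SCUs of weight i expressed in a summary, and let X be the number of SCUs an optimal summary of the given length can express. Then

\[
D = \sum_{i=1}^{n} i \cdot D_i, \qquad
\mathrm{Max} = \sum_{i=j+1}^{n} i\,|T_i| + j\Big(X - \sum_{i=j+1}^{n} |T_i|\Big), \qquad
\mathcal{P} = \frac{D}{\mathrm{Max}},
\]

where j = \max \{ t : \sum_{i=t}^{n} |T_i| \ge X \} is the lowest tier an optimally informative summary must draw from; the Pyramid score \mathcal{P} normalizes the observed content weight D by the weight of an optimal content selection.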

Table 7.1 shows a set of example Summary Content Units from the Pyramid of DUC topic D0742. SCU 18 has a weight of 3, i.e. three model summaries contribute to it, SCU 21 has a weight of 2, and SCU 29 has a weight of 4. We see that the contributors of all SCUs vary in their choice of words and word order. Nevertheless, there are always at least some words (or phrases) that are found in every contributor of a given SCU. SCU 29, for example, combines words related to the burial with information about the type and location of the ceremony, but contributors vary in the level of detail, the number of words, and the phrases used (e.g. “given up to the waters” vs. “buried at sea”). SCU 18 aggregates contributors which share some key terms and phrases, such as “Air National Guard” and “search”, but otherwise exhibit a quite heterogeneous word usage.


SCU 18 (label): The US Coast Guard with help from the Air National Guard then began a massive search-and-rescue mission, searching waters along the presumed flight path

Contributor 1: The US Coast Guard with help from the Air National Guard then began a massive search-and-rescue mission, searching waters along the presumed flight path

Contributor 2: A multi-agency search and rescue mission began at 3:28 a.m., with the Coast Guard and Air National Guard participating

Contributor 3: The first search vessel was launched at about 4:30 am. An Air National Guard C-130 and many Civil Air Patrol aircraft joined the search

SCU 21 (label): Federal officials shifted the mission to search and recovery

Contributor 1: Federal officials shifted the mission to search and recovery and communicated the Kennedy and Bessette families

Contributor 2: federal officials ended the search for survivors and began a search-and-recovery mission

SCU 29 (label): Kennedy family members buried the ashes of the three at sea in a Navy ceremony

Contributor 1: An at sea burial of all three was conducted Friday aboard the destroyer USS Brisco in view of the Jacquelin Kennedy Onassis shore-front estate

Contributor 2: Kennedy family members decided to bury the ashes of the three at sea in a Navy ceremony

Contributor 3: The ashes of all three victims were buried at sea in a closely guarded ceremony aboard a Navy destroyer

Contributor 4: In a private, closely guarded ceremony aboard a US Navy destroyer, the remains of Kennedy, his wife and sister-in-law were given up to the waters of the Atlantic Ocean

Table 7.1: Example SCUs from topic D0742 of DUC 2007. Each SCU is shown with its defining label and its contributors.

Contributor 3 of SCU 18 gives details on the aircraft type, and specifies a time when the first sea vessel was launched to search for the missing plane. Only contributor 1 gives information about the location of the search. In SCU 21, the first contributor contains additional information about communication with the Kennedy family, which is not expressed in the SCU label and therefore not part of the meaning of the SCU. Both contributors contain key terms such as “officials”, “search” and “recovery”, but vary in word order and verb usage. Our examples suggest that contributors written by different human summarizers are often expressed with a similar choice of words or even phrases. However, contributors can vary in using different forms of the same words (inflectional or derivational variants), different word order, different syntactic structures, and even paraphrases [HNPR05, NPM07].

We also find that high-frequency words closely related to the main theme of a document cluster can occur in many different Summary Content Units. For instance, words like “Kennedy”, “search”, “mission”, “water” or “family” re-appear throughout the example SCUs given in Table 7.1, and in other SCUs of this summary set. In some cases, there are subtle differences in word meaning; for example, the term “Kennedy” is used to denote John F. Kennedy Jr., Senator Edward Kennedy or the Kennedy family in different summary sentences. The precise meaning of these words in a given SCU arises from taking into account the words and phrases the ambiguous terms co-occur with.

7.2 Topic modeling in human reference summaries

Can a topic model reveal some of the structure of human model summaries, and learn topics that are approximations of manually annotated SCUs? To answer these questions, we train a topic model on sets of human model summaries, and compare the automatically learnt latent topics with manually annotated Summary Content Units.

7.2.1 Probabilistic topic models

Our approach for discovering semantically similar text spans makes use of a statistical method known as topic modeling. As described in the previous chapter, we use the Latent Dirichlet Allocation (LDA) model introduced by Blei et al. [BNJ03] for our analysis. In this model, each document is generated by first choosing a distribution over topics θ_d, parametrized by a conjugate Dirichlet prior α. Subsequently, each word of this document is generated by drawing a topic z_k from θ_d, and then drawing a word w_i from topic z_k's distribution over words φ_k. We follow Griffiths et al. [GS04] and place a conjugate Dirichlet prior β over φ_k as well.

For T topics, the matrix Φ specifies the probability distribution P(w|z) of words given topics, and Θ specifies the probability distribution P(z|d) of topics given documents. P(w|z) indicates which words are important in a topic, and P(z|d) tells us which topics are dominant in a document. We employ Gibbs sampling [GS04] to estimate the posterior distribution over z (the assignment of word tokens to topics), given the observed words w of the document set. From this estimate we can approximate the distributions in Φ and Θ.
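For reference, a sketch of the standard collapsed Gibbs update and count-based estimators from [GS04], in count notation introduced here: n^{(w)}_{k} is the number of times word w is assigned to topic k, n^{(d)}_{k} the number of tokens of document d assigned to topic k, a dot marginalizes an index, the subscript -i excludes the current token, and W is the vocabulary size:

\[
P(z_i = k \mid \mathbf{z}_{-i}, \mathbf{w}) \propto
\frac{n^{(w_i)}_{-i,k} + \beta}{n^{(\cdot)}_{-i,k} + W\beta} \cdot
\frac{n^{(d_i)}_{-i,k} + \alpha}{n^{(d_i)}_{-i,\cdot} + T\alpha},
\qquad
\hat{\phi}_{kw} = \frac{n^{(w)}_{k} + \beta}{n^{(\cdot)}_{k} + W\beta},
\qquad
\hat{\theta}_{dk} = \frac{n^{(d)}_{k} + \alpha}{n^{(d)}_{\cdot} + T\alpha}.
\]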

7.2.2 Inference of model parameters

Since we are interested in modeling topics for sentences, we treat each sentence as a document. We construct a matrix A of term-sentence co-occurrence observations for a set of human model summaries M = {m_1, . . . , m_l}. Each entry A_ij corresponds to the frequency of word i in sentence j, where j ranges over the union of the sentences contained in M. As before, we preprocess terms by stemming them and removing a standard list of stop words with the NLTK toolkit.
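A minimal sketch of this preprocessing and matrix construction (helper names are hypothetical, and NLTK's "punkt" and "stopwords" data packages are assumed to be available):

from collections import Counter

import numpy as np
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

STEMMER = PorterStemmer()
STOPWORDS = set(stopwords.words("english"))

def preprocess(sentence):
    """Tokenize, lowercase, stem, and drop stop words and non-words."""
    return [STEMMER.stem(w.lower()) for w in word_tokenize(sentence)
            if w.isalpha() and w.lower() not in STOPWORDS]

def term_sentence_matrix(sentences):
    """Build the term-sentence count matrix A (terms x sentences)."""
    processed = [preprocess(s) for s in sentences]
    vocab = sorted({t for tokens in processed for t in tokens})
    index = {t: i for i, t in enumerate(vocab)}
    A = np.zeros((len(vocab), len(sentences)), dtype=int)
    for j, tokens in enumerate(processed):
        for term, count in Counter(tokens).items():
            A[index[term], j] = count  # A_ij = frequency of term i in sentence j
    return A, index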

We run the Gibbs sampling algorithm on A, setting the parameter T, the number of latent topics to learn, equal to the number of SCUs contained in the Pyramid of the summary set. We use this particular value for T since we want to learn a topic model with a structure that reflects the SCUs and the distribution of SCUs of the corresponding Pyramid. [1]

[1] For an unannotated set of summaries, determining an optimal value for T is a Bayesian model selection problem [KR95].

The topic distribution for each sentence should be peaked toward a single or only very few topics. To ensure that the topic-specific word distributions P(w|z) as well as the sentence-specific topic distributions P(z|d) behave as intended, we set the Dirichlet priors α = 0.01 and β = 0.01. This enforces a bias toward sparsity and favors more fine-grained topics [GS04]. We run the Gibbs sampler for 2000 iterations, and collect a single sample from the resulting posterior distribution over topic assignments for words. From this sample we compute the conditional distributions p(w|z) and p(z|d).
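The following collapsed Gibbs sampler is an illustrative sketch of this training setup (sentences as documents, symmetric priors α = β = 0.01, 2000 iterations, a single final sample); it implements the update equation sketched in Section 7.2.1, but is not the original experimental code. Here docs is assumed to be a list of token-id lists produced by the preprocessing above, and V is the vocabulary size.

import numpy as np

def gibbs_lda(docs, V, T, alpha=0.01, beta=0.01, iters=2000, seed=0):
    """Collapsed Gibbs sampling for LDA; returns (phi, theta) estimates."""
    rng = np.random.default_rng(seed)
    n_kw = np.zeros((T, V))           # topic-word counts
    n_dk = np.zeros((len(docs), T))   # sentence-topic counts
    n_k = np.zeros(T)                 # total tokens assigned to each topic
    z = []                            # per-token topic assignments
    for d, doc in enumerate(docs):    # random initialization
        zd = rng.integers(T, size=len(doc))
        z.append(zd)
        for w, k in zip(doc, zd):
            n_kw[k, w] += 1; n_dk[d, k] += 1; n_k[k] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]           # remove the token's current assignment
                n_kw[k, w] -= 1; n_dk[d, k] -= 1; n_k[k] -= 1
                # Unnormalized conditional P(z_i = k | rest); the document
                # length term is constant in k and can be dropped.
                p = (n_kw[:, w] + beta) / (n_k + V * beta) * (n_dk[d] + alpha)
                k = rng.choice(T, p=p / p.sum())
                z[d][i] = k
                n_kw[k, w] += 1; n_dk[d, k] += 1; n_k[k] += 1
    phi = (n_kw + beta) / (n_k[:, None] + V * beta)                    # p(w|z)
    theta = (n_dk + alpha) / (n_dk.sum(1, keepdims=True) + T * alpha)  # p(z|d)
    return phi, theta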

During our experiments, we observed that the Gibbs sampler did not always use all the topics available. Instead, some topics had a uniform distribution over words, i.e. no words were assigned to these topics during the sampling process. We assume that this effect is also due to the relatively low prior α = 0.01 we use in our experiments. We explore the consequences of varying the LDA priors and T in Section 7.3. This observation indicates that the topic model cannot learn as many distinct topics from a given set of summaries as there are SCUs in the Pyramid of these summaries. On average, 24.4% (σ = 17.4) of the sampled topics had a uniform word distribution, but the fraction of such topics varied. For some summary sets, it was very low (D0701, D0706 with 0%), whereas for others it was very high (D0704, D0728 with 52%). Both of the latter summary sets contain many SCUs with very similar labels and often only a single contributor, e.g. in DUC topic D0704 about “Amnesty International criticism”:

• SCU 120: AI criticism frequently involves genocide

• SCU 114: AI criticism frequently involves intimidation

• SCU 115: AI criticism frequently involves police violence

• SCU 112: AI criticism frequently involves political prisoners

The different SCUs are derived from summary sentences that contain enumerations: “AI criticism frequently involves political prisoners, torture, intimidation, police violence, the death penalty, no alternative service for conscientious objectors, and interference with the judiciary.” A co-occurrence-based model like LDA cannot distinguish between the enumerated phrases, since the model's granularity is determined largely by the granularity of the text spans used (sentences in our case). Thus, our model treats these phrases as semantically similar since they co-occur in the same sentence. Splitting sentences into their clause components, and thereby creating multiple clauses with nearly identical wording, will not solve this problem, as the large vocabulary overlap will likely cause the model to group such clauses into a single latent topic.

7.2.3 Word and sentence distributions of SCUs

In order to evaluate the quality of the learnt latent topics, we compare their word distributions to the word distributions of SCUs. This allows us to analyze whether the topics capture similar word patterns as SCUs. We approximate the distribution over words p(w|s_l) for each SCU s_l as the relative frequency of word w_i in the bag-of-words constructed from the texts of s_l's label and contributors. We denote the resulting matrix of word distributions for a set of SCUs as Φ̂ (see also Chapter 6); the hat distinguishes it from the matrix Φ of latent topic word distributions.
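A sketch of this approximation, reusing the hypothetical preprocess() helper from above:

import numpy as np

def scu_word_distribution(label, contributors, index):
    """Approximate p(w | SCU) as the relative term frequency in the
    bag-of-words built from the SCU label and all contributor texts."""
    counts = np.zeros(len(index))
    for text in [label] + contributors:
        for token in preprocess(text):  # stemming + stop word removal
            if token in index:          # ignore out-of-vocabulary terms
                counts[index[token]] += 1
    return counts / counts.sum()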

In addition, we can compare the topic-sentence associations computed by the model to the SCU-sentence associations given by the Pyramid annotation. If the probability of a given topic is high in those sentences which contribute to a particular SCU, this would suggest that the topic model can automatically learn topics which not only have a word distribution similar to that of a specific SCU, but also a similar distribution over contributing sentences.


SCU contributors are annotated as a set of contiguous sequences of words within a single sentence. In the DUC 2007 data, there are only a few cases where a contributor spans more than one sentence. The DUCView annotation tool [2] stores the start and end character positions of the phrases marked as contributors of an SCU. We can utilize this information to determine which sentences an SCU is associated with. We store the associations in a matrix Θ̂, where Θ̂_ij = 1 if SCU i is associated with sentence j. Sentences may contain multiple SCUs, and SCUs are associated with as many sentences as their number of contributors.

[2] http://www1.cs.columbia.edu/~becky/DUC2006/2006-pyramid-guidelines.html, visited May 3rd, 2011
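A sketch of this offset-to-sentence mapping (the variable names are hypothetical: sentence_spans holds the (start, end) character offsets of each summary sentence, and scu_spans[i] the contributor spans recorded for SCU i):

import numpy as np

def scu_sentence_matrix(scu_spans, sentence_spans):
    """Derive the binary SCU-sentence association matrix Θ̂."""
    theta_hat = np.zeros((len(scu_spans), len(sentence_spans)), dtype=int)
    for i, contributors in enumerate(scu_spans):
        for c_start, c_end in contributors:
            for j, (s_start, s_end) in enumerate(sentence_spans):
                # Associate SCU i with every sentence its contributor
                # overlaps (contributors rarely cross sentence boundaries).
                if c_start < s_end and c_end > s_start:
                    theta_hat[i, j] = 1
    return theta_hat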

7.3 Experiments

We conduct our experiments on the 23 document clusters of the DUC 2007 dataset that were used in the Pyramid evaluation. [3] There are 4 human model summaries available for each of these document clusters. On average, the summary sets contain 52.4 sentences, with a vocabulary of 260.5 terms, which occur a total of 549.7 times. The Pyramids of these summary sets consist of 68.8 SCUs on average. The number of SCUs per SCU weight follows a Zipfian distribution, i.e. there are typically very few SCUs of weight 4, and very many SCUs of weight 1 (see also Passonneau et al. [PNMS05]).

[3] http://www-nlpir.nist.gov/projects/duc/data.html, visited May 3rd, 2011

7.3.1 Similarity of word distributions

Figure 7.1 shows the pairwise similarities of the word distributions of Summary Content Units and latent topics for several different DUC 2007 summary sets. Each cell in the plot displays the Jensen-Shannon (JS) divergence of the word distribution of a latent topic (row) compared to that of an SCU (column). Lower JS divergences correspond to darker cells, and to a higher similarity of the compared word distributions. We apply the procedure outlined in Section 6.3 to match latent topics to SCUs, and order the best-matching pairs on the diagonal by increasing JS divergence.
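A sketch of this matching step, assuming phi holds the latent topic word distributions and phi_hat the SCU word distributions (one row per unit); the greedy pairing mirrors the simple greedy algorithm mentioned in the caption of Figure 7.1:

import numpy as np

def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence of two discrete distributions."""
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(a * np.log((a + eps) / (b + eps)))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def greedy_match(phi, phi_hat):
    """Greedily pair topics and SCUs by increasing JS divergence."""
    d = np.array([[js_divergence(t, s) for s in phi_hat] for t in phi])
    matches = []
    for _ in range(min(d.shape)):
        i, j = np.unravel_index(np.argmin(d), d.shape)
        matches.append((i, j, d[i, j]))  # (topic id, SCU id, divergence)
        d[i, :] = np.inf                 # each topic and SCU is used once
        d[:, j] = np.inf
    return matches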

Figure 7.1: Pairwise Jensen-Shannon divergence of the word distributions of LDA topics and Summary Content Units (SCUs), for six DUC 2007 Pyramids. Topic-SCU matches are ordered by increasing divergence along the diagonal, using a simple greedy algorithm. The examples suggest that many of the automatically identified LDA topics correspond to manually annotated SCUs. Panels: (a) DUC topic D0705, (b) D0706, (c) D0707, (d) D0721, (e) D0729, (f) D0743.

The plots show a clear correspondence between many latent topics and matched Summary Content Units, confirming on a new dataset the results presented in the previous chapter. As before, the diagonal is clearly distinguishable from the surrounding cells, and shows a much higher similarity for the matched topic-SCU pairs than for random topic-SCU pairings. We find that a large percentage of automatically learnt latent topics have similar distributions over words as the corresponding Summary Content Units. The high quality (i.e. low JS divergence) of the matched topic-SCU pairs suggests that human summarizers not only agree on individual words, but also agree on word patterns when summarizing a set of source documents. These word patterns are in many cases distinctive enough to be separated by our model, and to be matched to their SCU counterparts. Furthermore, good latent topic-SCU matches arise regardless of the SCU's weight, as shown in Table 7.2. The table shows the most likely terms for the best-matching topic-SCU pairs for summary set D0742. For each of these matches, the top terms are almost identical.

Among the top matches of a model constructed from the summary set of DUC topic D0742, we find several SCUs that occur in multiple different human summaries. The contributors of these example SCUs are often full sentences, but in some cases also subsentential clauses. Many of the contributors differ significantly in their choice and number of words (e.g. SCU 29, SCU 33, see also Table 7.1), and can yet be grouped together by the latent topic model. From the table we also see that the model can to some extent distinguish between different senses of a word (e.g. the name ‘Kennedy’ denoting both John F. Kennedy Jr. and Senator Edward Kennedy).

Topic / SCU     Top terms
Topic 17        pilot kennedi condit conduc dark disorient earth fli haze
SCU 31 (w=1)    pilot condit conduc dark disorient earth fli haze kennedi
Topic 5         analysi control corkscrew descent fall feet graveyard indic lost
SCU 32 (w=1)    analysi control corkscrew descent fall feet graveyard indic lost
Topic 9         bodi diver entomb floor found fuselag kennedi navi ocean
SCU 25 (w=1)    bodi diver entomb floor found fuselag kennedi navi ocean
Topic 8         kennedi edward recoveri son wit bodi jr navi patrick ship
SCU 33 (w=3)    kennedi edward recoveri son wit jr patrick navi sen ship
Topic 24        bodi wednesday wreckag aircraft found locat
SCU 19 (w=3)    wednesday wreckag found bodi aircraft locat
Topic 35        ceremoni navi aboard ash buri close destroy guard sea kennedi
SCU 29 (w=4)    ceremoni kennedi navi sea aboard ash buri destroy close famili

Table 7.2: Top terms of the best-matching LDA topics and SCUs for summary set D0742. The first column specifies the ids of the matched topic-SCU pair, and the weight (w) of the SCU.

When comparing the dataset used in our current analysis with the one analyzed in the previous chapter, we find that the average number of content units is almost three times as high (68.8 versus 24.2 in the dataset of Chapter 6). Consequently, the average number of content units per sentence is much higher (1.3 compared to 0.7), which suggests that there is a considerable number of subsentential content units in the summaries dataset. From the plots in Figure 7.1 we see that our model overcomes this increased difficulty, and still acquires many useful latent topics. However, the diagonals tend to “fade out” earlier than those displayed in the previous chapter, indicating that the fraction of learnable topics is lower for the summaries dataset.

7.3.2 Similarity of sentence associations

We determine the quality of the sentence associations of latent topics learnt by our model using different standard evaluation measures, such as Mean Average Precision (MAP), precision and recall. In particular, we compare the topic-sentence distributions Θ with the SCU-sentence associations Θ̂.


To calculate precision and recall, we follow the procedure outlined in the previous chapter: We first binarize Θ to give Θ′ by setting all entries Θ′_ij = 1 if Θ_ij > ϵ, and 0 otherwise. Θ′_ij is therefore equal to 1 if topic i has a high probability in sentence j. We set ϵ = 0.1 in our experiments. Since the LDA algorithm learns very peaked distributions, the actual value of this threshold does not have a large impact on the resulting binary matrix and the subsequent evaluation results. We evaluated a range of settings for ϵ in [0.001, 0.5], all with similar performance. We can now evaluate whether a given topic occurs in the same sentences as the corresponding SCU (recall), and whether it occurs in no other sentences (precision).
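A sketch of this binarization and per-pair scoring (array layout as above: one row per topic or SCU, one column per sentence):

import numpy as np

def pair_precision_recall(theta, theta_hat, topic_i, scu_k, eps=0.1):
    """Precision/recall of one matched topic-SCU pair's sentence sets."""
    pred = theta[topic_i] > eps           # sentences the topic occurs in
    gold = theta_hat[scu_k].astype(bool)  # sentences the SCU occurs in
    tp = np.sum(pred & gold)              # shared sentences
    precision = tp / max(pred.sum(), 1)
    recall = tp / max(gold.sum(), 1)
    return precision, recall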

Mean Average Precision, on the other hand, is a rank-based measure, which avoids the need for introducing a threshold to binarize Θ [BYRN99]. In our case, it is determined as the mean of the average precision scores obtained for the topic associations of a single sentence. The average precision of the topics of sentence s is calculated as the mean of the precision values for each relevant latent topic z_k ∈ Z_R, when ranking the list of topics for a given sentence s by their probability p(z_k|s):

\[
AP(s) = \frac{1}{|Z_R|} \sum_{k=1}^{|Z_R|} \mathrm{Prec}(z_k). \qquad (7.1)
\]

MAP is then defined as:

\[
\mathrm{MAP} = \frac{1}{|S|} \sum_{j=1}^{|S|} AP(s_j), \qquad (7.2)
\]

where S is the set of sentences (i.e. the union of sentences over all summary sets) and s_j ∈ S. For each sentence, we create a ranked list of topics according to the matrix Θ. This gives high ranks to topics which have a high probability in sentence s. A higher MAP score indicates that the topics with a high probability in a given sentence correspond to the SCUs associated with this sentence.
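A sketch of Equations 7.1 and 7.2 in code, where relevant[j] is a (hypothetical) set holding the ids of the topics whose matched SCUs are annotated in sentence j:

import numpy as np

def mean_average_precision(theta, relevant):
    """MAP over sentences; theta is topics x sentences with p(z|s)."""
    ap_scores = []
    for j in range(theta.shape[1]):
        if not relevant[j]:                # skip sentences without SCUs
            continue
        ranked = np.argsort(-theta[:, j])  # most probable topic first
        hits, precisions = 0, []
        for rank, k in enumerate(ranked, start=1):
            if k in relevant[j]:
                hits += 1
                precisions.append(hits / rank)  # precision at this rank
        ap_scores.append(np.mean(precisions))   # AP(s), Equation 7.1
    return float(np.mean(ap_scores))            # MAP, Equation 7.2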

For our evaluation, we compute precision and recall for each matched topic-SCU pair with JS(Φ_j, Φ̂_k) ≤ γ. Averaged over topic-SCU pairs, these measures give us an indication of how well the topic model approximates the set of SCUs. The parameter γ allows us to tune the performance of the model with respect to the quality and number of topic-SCU matches. Setting γ to a low value will consider only topic-SCU matches with a low JS divergence, which generally results in higher precision and recall. Increasing γ will include topic-SCU matches with a larger JS divergence, which will in general introduce some noise and result in lower precision and recall scores.


Figure 7.2: Precision, recall, and the fraction of LDA topics matched to SCUs for different settings of parameter γ, averaged over all summary sets with Pyramid annotations from DUC 2007. Error bars show the standard deviation. Only topic-SCU matches with JS(Φ_j, Φ̂_k) ≤ γ are considered when computing precision and recall. Both are very high, suggesting that the model identifies topics that are very similar to SCUs.

Figure 7.2 shows the precision and recall curves for different values of the parameter γ, averaged over topic-SCU pairs and summary sets. The plots show that both the precision and the recall of the discovered topic-sentence associations are quite high, suggesting that the model automatically identifies topics which are very similar to manually annotated SCUs. With higher γ, precision and recall scores decrease, as more, but less well matching, topic-SCU pairs are evaluated. The word distributions of these additional topic-SCU pairs are increasingly dissimilar, and hence the sentences associated with a latent topic overlap less with the sentences associated with the paired SCU. The figure also shows the fraction of topic matches that are considered in the evaluation of precision and recall. There is a clear trade-off between performance and the number of matches retrieved. However, many of the topic-SCU matches (≈ 50%) have a JS divergence ≤ 0.4, suggesting that the word distributions of many LDA topics are very similar to SCU word distributions.

Since we observed that the Gibbs sampler does not always utilize the full set of topics, we repeat our experiments to evaluate how the performance of the model changes when varying the LDA priors and the parameter T. Figure 7.3 shows F1 and MAP scores for different values of the parameter δ, where T = δ · |SCU|. For example, a value of 0.6 means that for each summary set, T was set to 60% of the number of SCUs in the corresponding Pyramid. We see that the MAP score increases quickly, and reaches a plateau for δ ≥ 0.3. The F1 score increases more slowly, and levels out for δ ≥ 0.6. The model's performance is thus relatively robust with respect to δ. This observation can be helpful when training models for new summary sets without an existing Pyramid, for which T must be treated as a parameter to be optimized.

Figure 7.3: F1 measure and Mean Average Precision (MAP) for different settings of the number of latent topics T, expressed as a fraction δ of the number of SCUs in the corresponding Pyramid (γ = 0.5).

Figure 7.4 shows performance curves for various settings of the LDA priors α and β. We find that for 0.01 ≤ α ≤ 0.05, F1 and MAP scores are consistently high, whereas for other settings, performance decreases significantly. Similarly, β ≥ 0.05 results in lower F1 and MAP scores. The fraction of uniform topics decreases with higher α, e.g. for α = 0.1 it is close to zero. In contrast, higher settings of β increase the fraction of uniform topics.

Figure 7.4: Performance of the model (a, c) and smoothing effects (b, d) for different settings of α and β, averaged over all summary sets and topic-SCU pairs. Error bars show the standard deviation. Panels: (a) α vs. performance, (b) α vs. fraction of uniform topics, (c) β vs. performance, (d) β vs. fraction of uniform topics.

Finally, Figure 7.5 shows the precision and recall curves for SCUs of different weights, and for different settings of parameter γ. Results are again averaged over all summary sets. In Figure 7.5(a), we see that the recall of topic-sentence associations is very similar for all SCUs, with SCUs of higher weight exhibiting a slightly better recall. However, as Figure 7.5(b) shows, the average precision of SCUs with lower weight is much higher. Intuitively, this may be expected, as SCUs of higher weight tend to have a larger vocabulary due to the higher number of contributors. This results in a larger word overlap with non-relevant sentences. Figure 7.5(c) shows the fraction of topic-SCU matches identified. The fraction is computed with respect to the total number of SCUs of that weight in a given Pyramid, so a value of 0.3 for w = 1 means that on average 30% of the SCUs of weight 1 were matched. The figure shows that there is almost no difference in retrieval for SCUs of different weights. A slightly larger fraction of SCUs with weight 1 seems to be retrieved when considering only matches with very low JS divergence, which suggests that the word distributions of the corresponding topics are more clear-cut.

Figure 7.5: Recall (a) and precision (b) of sentence associations, and fraction of topic-SCU matches (c) for SCUs of different weights and settings of parameter γ, averaged over all summary sets. Recall is similar for SCUs of all weights, whereas SCUs with a lower weight have a higher average precision. Error bars show the standard deviation. Panels: (a) recall by SCU weight, (b) precision by SCU weight, (c) fraction of topics by SCU weight.

7.4 Conclusion

In this chapter, we investigated the nature of human-written summaries. Our studies revealed that different human authors often utilize similar word patterns to express content with the same meaning when summarizing a set of thematically related source documents. We found that this agreement on word patterns allows our approach to group text spans that have the same (or a similar) meaning, and helps to distinguish between text spans that have a different meaning, but share high-frequency terms closely related to the main theme of the document cluster.

We applied our approach to content unit discovery to a dataset consisting of human reference summaries. To evaluate the quality of the latent topics, we compared them with Summary Content Units (SCUs) that were manually annotated in the reference summaries for the DUC 2007 Pyramid evaluation. Our experimental results suggest that a latent topic model can identify with high accuracy word patterns that correspond to these gold-standard Summary Content Units. This correspondence is expressed both in similar word distributions, which enable us to create pair-wise matches of latent topics and SCUs, and in similar sentence associations. The identified patterns are clearly separable into distinct topics, and allow for lexical and syntactic variation in how semantically similar content is worded. Precision and recall of the learnt sentence-topic associations are very high when compared to the SCU-sentence associations, indicating that many of the automatically acquired topics are good approximations of SCUs. Our results suggest that a topic model can be used to learn a candidate set of SCUs to facilitate the process of Pyramid creation.

Our model finds good latent topics for SCUs of any weight, which suggests that the model captures with high accuracy word patterns that re-occur in different human-written summaries. Furthermore, our experiments provide some evidence that latent topic models, with their shallow approach to semantics, can successfully discover recurrent and varied (sub-)sentential word patterns, and reveal some of the structure of human model summaries. However, while our model does discover a significant fraction of manually annotated Summary Content Units, it clearly leaves room for future work, for example by incorporating linguistic knowledge into the process of content unit identification.


Conclusions

This dissertation aimed to develop novel summarization solutions for condensing collections of topically related news stories. The methods presented in this thesis were motivated by two major observations:

1. Topically related news stories consist of subtopics centered around the collection's main theme.

2. Different news stories, as well as human reference summaries of news story collections, often express the same facts.

The first part of this thesis presented novel multi-document summarization solutions that explore subtopical content models incorporating shallow notions of semantics based on co-occurrence information. Extensive evaluations of the presented summarization algorithms emphasized the positive effect of content model-based sentence representations on the descriptiveness and robustness of the developed models, since latent topic representations, rather than bag-of-words representations, diminished the effect of lexical variability. The obtained results also supported the intuition that summarization systems create more accurate and diversified summaries when respecting the subtopical structure of news article collections. Experiments on several datasets and for different summarization tasks, such as generic and query-focused multi-document summarization, showed that the developed methods produce summaries comparable to or better than the state-of-the-art.

In the second part of this dissertation, we presented an analysis of human fact writing in news articles and reference summaries, and developed novel modeling approaches to the discovery of subsentential content units. Our findings suggest that human (summary) authors not only agree on important words, but also often agree on word patterns when expressing the same information. Experiments on several datasets showed that the developed models identify word patterns that closely resemble manually annotated, semantic content units.


Summary of contributions

This dissertation has advanced the state-of-the-art in multi-document summarization and content modeling of news articles and human reference summaries through a number of contributions.

First, we introduced the reader to summarization and presented the main challenges of automatic text summarization in Chapter 1. The chapter further described the basic notions of summarization, and introduced the challenges involved in summary evaluation. It also highlighted the potential of automatic summarization for future information retrieval solutions.

Chapter 2 then presented an exhaustive discussion of the state-of-the-art in automatic text summarization. The chapter also presented the first comprehensive academic analysis of content modeling approaches within the context of automatic summarization.

In Chapter 3, we then proposed a novel approach to generic multi-document summarization that aims to represent sentences in an ontology space by mapping them to nodes of a hierarchical topic ontology. The ontology was built from the hierarchically structured topics of the Open Directory Project category tree, and its topic nodes were augmented with lexical knowledge acquired by harvesting millions of topic-related words using search engine queries. Experiments showed that the novel sentence features derived from this topic ontology significantly improved the Rouge-1 and Rouge-2 scores of the summarizer.

Chapter 4 considered the task of query-focused multi-document summarization, and introduced a novel probabilistic content model for representing sentences of related news articles in a common latent topic space. The latent topic representation of our model helped to overcome common problems related to word ambiguity, synonymy, and the sparseness of the original word vector space when estimating the similarity of text passages and queries. In addition, the model was trained in an unsupervised fashion, and is therefore domain- and language-independent. Evaluations on two recent summarization datasets showed that this summarizer produced better summaries than different baseline methods and the state-of-the-art.

Chapter 5 then proposed an extension of the standard PLSA model to investigate how probabilistic topic models of text can be merged with language models in order to relax the “bag-of-words” assumption. Our novel approach to query-focused multi-document summarization combined term and bigram co-occurrence observations into a single probabilistic latent topic model. Experimental results showed that the integration of a bigram language model into a standard topic model leads to a system that produces higher-quality summaries than systems based on term or bigram co-occurrence observations alone. The results underline the usefulness of probabilistic content models for multi-document summarization.

Chapter 6 investigated word patterns that express similar facts in closely related news articles. The chapter introduced a classification scheme of subsentential content units in order to categorize and distinguish different types of content units. Based on this analysis, we then proposed a probabilistic, unsupervised model that aimed to discover content unit-like subsentential word patterns and addressed the variability of human writing. A comparative study of the similarity of identified word patterns and manually annotated content units showed that many of the automatically discovered patterns closely resemble their manually created counterparts.

Encouraged by these results, Chapter 7 took a step forward and analyzed the nature of human summary writing. Our analyses revealed that human summary authors often agree on word patterns when expressing the same source information. The chapter then presented an evaluation of our probabilistic model on the task of Summary Content Unit discovery. The experimental results confirmed the analyses conducted in the previous chapter, and suggest that latent topic models can identify with high accuracy word patterns that correspond to manually annotated Summary Content Units. This correspondence was expressed both in similar word distributions, which enabled us to create pair-wise matches of latent topics and SCUs, and in similar sentence associations. The identified patterns were clearly separable into distinct topics, and allowed for lexical and syntactic variation in how semantically similar content is worded. Our studies thus suggest that topic models, with their shallow statistical approach to semantics, can successfully be utilized to identify sentence-level latent topics which are similar to Summary Content Units.

Future research

There are many potential directions for future research in the area of automatic document summarization.

Extending probabilistic content models

Creating content models from bag-of-words representations of documents and sentences is a strong simplification. Instead, sentences and documents provide information that goes well beyond the scope of this model. The core models proposed in this dissertation could be extended to include the following types of information:


Entity recognition. News articles revolve around events, and the people, organizations and locations involved in these events. Entity recognition methods could help to map words to concepts, which could in turn be linked to external knowledge resources such as Wikipedia [4] or DBPedia [5]. These resources provide meta-data about entities which can likely be exploited to increase the quality of content models.

Lexico-semantic resources. Many authors have shown that the incorporation of external lexico-semantic resources, such as WordNet, may help to map words to semantic concepts and to reduce problems related to word ambiguity and synonymy.

Linguistic features. Finally, the bag-of-words assumption made in this work could be relaxed by incorporating other linguistic sources of information. For example, part-of-speech tags, parse tree information, or even discourse information (e.g. co-reference resolution) can all help in the design of content models, and in turn allow for higher-quality summaries.

Temporal distribution. The news story clusters used in the DUC and TAC datasets are significantly distributed over time. Subtopics that have just occurred and are treated in-depth in early articles might merit only a background summary paragraph in later news stories. The articles of some of the news story clusters range over a period of years, with more recent articles barely touching upon aspects of the event that were deemed highly important early on. Similarly, the vocabulary of subtopics will change over time as new facts become known, and early assumptions and hypotheses are verified or discarded. The quality of the content models could therefore benefit from this additional information.

[4] http://www.wikipedia.org
[5] http://www.dbpedia.org

In addition, several aspects of the probabilistic model itself could be examined in further research. On the one hand, it is necessary to better investigate the problem of choosing the right dimensionality of the latent topic space for a particular news story cluster, a problem which is known as model selection. The simplification made in this work of choosing a single, fixed number of latent topics for different types of news clusters, regardless of their event domain, impedes the model's ability to form coherent latent topics, as some subtopics may need to be split up into different latent topics or merged into a single latent topic. On the other hand, the effects of the model's parameter settings, and its convergence on local optima – which holds true even in more sophisticated latent topic models such as Latent Dirichlet Allocation – also require further empirical investigation.

Semantic content unit discovery

Content units are primarily defined as units of meaning. The restriction of being at most clause-length, although stated as such in the procedural description for annotators, is rarely respected in existing Pyramids. Some content units extend across sentence boundaries, while annotators sometimes split sentences, e.g. those containing enumerated phrases, into a multitude of distinct content units. In order to capture such phenomena, it may be worthwhile to investigate different linguistic methods. Sentence segmentation and parsing may help to pre-process sentences into clauses, which could then be analyzed with our model. Semantic parsing technologies offer a means to identify predicate-argument structures in sentences and clauses. Both strategies may significantly reduce the amount of word “noise” introduced into the model, and allow for a better distinction of different content units. However, for a true identification of shared (clause) meaning, many different linguistic and statistical methods will need to be combined.

Another avenue of research would be the comparison of Pyramid evaluation results from the DUC datasets with results obtained using our automated SCU learning approach. To this end, content units can be identified in machine-generated summaries using standard probabilistic inference techniques, given the trained model. Subsequently, we could compare Pyramid rankings with rankings obtained with the automatic model, and analyze correlations between the two, which, if successful, could lead to strong benefits for automatic summarization evaluation tasks.

Summarization & IR

Another area of future research is to examine the usefulness of summaries in various information retrieval settings. For instance, some authors have suggested that summarization can enhance a user's experience with news aggregation sites such as Google News [Nen06]. There are different (scientifically oriented) websites which already offer this functionality, e.g. NewsInEssence [6] and Columbia's Newsblaster [7]. Such services can be used to conduct evaluations of perceived summary usefulness and quality, and to further understand the relationships between user information needs and information overload.

[6] http://www.newsinessence.com
[7] http://newsblaster.cs.columbia.edu

Moreover, the exact nature of a user's information needs depends on a number of context factors, such as cognitive load, time constraints, task requirements, and prior knowledge, and may change over time. For example, users may want to receive a quick update on news events during their working hours, but may be interested in a rather thorough synopsis of an important news story in the evening. Context-aware summarization may well help to adapt summaries better to user information needs, and could be a key challenge for future personalized summarization solutions.


Appendix A

Example summaries

Example summaries created by the summarizers developed in this thesis.

Summary for DUC 2002 topic d079 (Summarizer from Ch. 3)

Hurricane Gilbert swept toward Jamaica yesterday with 100-mile-an-hour winds, and officials issued warnings to residents on the southern coasts of the Dominican Republic, Haiti and Cuba. Gilbert, an “extremely dangerous hurricane” and one of the strongest storms in history, roared toward Mexico's Yucatan Peninsula Tuesday with 175 mph winds after battering the Dominican Republic, Jamaica and the tiny Cayman Islands. Hurricane Gilbert weakened to a tropical storm as it blustered northwestward today but it threw tornadoes and sheets of rain at thousands of shuttered evacuees along the Texas-Mexico border.

Hurricane Gilbert swept toward the Dominican Republic Sunday, and the Civil Defense alerted its heavily populated south coast to prepare for high winds, heavy rains and high seas. The most intense hurricane on record surged toward Texas today after battering the Yucatan Peninsula with 160 mph winds, leveling slums, pummeling posh resorts and forcing tens of thousands to flee. Mark Zimmer, a meterologist at the National Hurricane Center, reported an Air Force reconnaissance plane measured the barometric pressure at Gilbert's center at 26.13 inches at 5:52 p.m. EDT on Tuesday. Forecasters said the hurricane was gaining strength as it passed over the ocean and would dump heavy rain on the Dominican Republic and Haiti as it moved south of Hispaniola, the Caribbean island they share, and headed west.

Reference summary for DUC 2002 topic d079

Gilbert, an “extremely dangerous hurricane” and one of the strongest storms in history, roared toward Mexico's Yucatan Peninsula Tuesday with 175 mph winds after battering the Dominican Republic, Jamaica and the tiny Cayman Islands. With the winds of Hurricane Gilbert clocked at 175 miles per hour, U.S. weather officials called Gilbert the most intense hurricane ever recorded in the Western Hemisphere. Gilbert is one of only three Category 5 storms in the hemisphere since weather officials began keeping detailed records. Hurricane Gilbert battered the resorts of the Yucatan Peninsula today with 160 mph winds and torrential rains.

The most intense hurricane on record surged toward Texas today after battering the Yucatan Peninsula with 160 mph winds, leveling slums, pummeling posh resorts and forcing tens of thousands to flee. The storm, spawned Saturday southeast of Puerto Rico, appeared to have hit Jamaica the hardest. Hurricane Gilbert weakened to a tropical storm as it blustered northwestward today but it threw tornadoes and sheets of rain at thousands of shuttered evacuees along the Texas-Mexico border. Rains from Hurricane Gilbert sent a river raging over its banks in Monterrey, sweeping at least 10 policemen to their deaths and overturning buses loaded with evacuees from Matamoros, police said today.

Summary for DUC 2006 topic D0647 (Summarizer from Ch. 4)

The Supreme Court on Wednesday refused to issue an order to keep Elian Gonzalez in the United States, clearing the way for the 6-year-old boy to return home to Cuba. As both sides in the Elian Gonzales custody case await a federal appeals court ruling, the mayor of Miami flies to Washington to meet with Attorney General Janet Reno. Reno has backed the INS determination that the boy be returned to his father in Cuba, and has said that Florida state courts have no say in the federal matter. The U.S. Immigration and Naturalization Service (INS) ruled that Elian belongs to his father in Cuba, but Elian's Miami relatives insist the boy stay with them in the United States.

Demonstrators sang and prayed outside the home where Elian Gonzales is staying as all sides in the custody battle waited for a federal appeals court ruling that could lead to the boy's reunion with his father. About 11 p.m., officials from the powerful Cuban American National Foundation announced that the Miami relatives would travel to Washington Wednesday with Elian for a meeting with the boy's father. As Gonzalez was speaking, his relatives in Miami appeared in a county Family Court, with Elian's great-uncle, Lazaro Gonzales, asking for temporary custody of the boy. Elian has remained in the Washington area since he was taken from Miami as the case made its way through the courts.

Reference summary for DUC 2006 topic D0647

On November 25, 1999, six-year-old Elian Gonzalez was rescued from the Atlantic after the boat in which he and his mother and others were fleeing Cuba for the US capsized and his mother drowned. His great-uncle in Miami was granted temporary custody of Elian and sought asylum for him. Fidel Castro and Elian's father Juan Miguel Gonzalez demanded Elian's return to Cuba. The Clinton administration weighed honoring the dead mother's wishes in bringing Elian to freedom and placating the Cuban exile community, or bending to Cuban pressure and strengthening Cuban-American relations. Vietnam demanded Elian be returned to Cuba. Based on US and international law, the INS ruled that Elian belonged with his father and should be returned to Cuba.

The Miami relatives insisted Elian didn't want to return and appealed to Attorney General Janet Reno. Reno upheld the father's right to custody and ignored Florida court interference, stating immigration was a federal matter. The relatives challenged the INS ruling, then appealed their suit's dismissal. Reno met with the relatives and ordered them to surrender Elian. They refused, and on April 22 federal agents seized Elian in a pre-dawn raid and reunited him with his father who had come to Washington, DC to bring Elian home. Riots and strikes broke out in Miami. The 11th US Circuit Court of Appeals unanimously ruled for the government and Elian's father, later reaffirming its decision. The Supreme Court rejected an appeal by the Miami relatives. Elian returned to Cuba on June 28, 2000.

Summary for DUC 2007 topic D0713 (Summarizer from Ch. 4)

UNITED NATIONS (AP) Pakistan's prime minister said Wednesday his country would unilaterally adhere to the nuclear test ban treaty, but called on international pressure to force rival India to do the same. In reaction to Pakistan's nuclear tests, Egypt Thursday underscored the need to make the Middle East a nuclear-free region and urged Israel to join the Nuclear Non-Proliferation Treaty (NPT). A statement issued by the Turkish Foreign Ministry Friday said that the nuclear tests conducted by both India and Pakistan were of concern for regional and global stability and security. The resolution said that the tests not only violated the Nuclear Non-Proliferation Treaty (NPT) but also threatened peace and stability in South Asia. It is the first press conference given by the prime minister since India conducted a series of nuclear tests on May 11 and 13.

Annan called on both the Indian and Pakistani governments to freeze their nuclear weapons development programs and sign the Comprehensive Test Ban Treaty (CTBT) and a no-first-use pledge with each other. Iran has called on Pakistan and India to cease nuclear tests and rivalry and join the comprehensive test ban treaty. Both India and Pakistan carried out nuclear tests last May, drawing international sanctions and calls to sign the test ban treaty. India has long spurned both the Comprehensive Test Ban Treaty and the Nuclear Non-Proliferation Treaty, saying they are discriminatory because they do not force Britain, China, France, Russia and the United States to give up their nuclear arsenals.

Reference summary for DUC 2007 topic D0713

Pakistan and India have failed to join the 1970 Nuclear Non-Proliferation Treaty (NNPT), blaming each other and the five nuclear powers Britain, China, France, Russia and the United States for unfair and discriminatory behavior. In accordance with the NNPT, only those five nations may maintain nuclear arsenals. In May 1998, Pakistan conducted six underground nuclear explosions in response to five tests by India. The tests were condemned worldwide and sanctions were imposed on both countries by the United States, Britain, Japan and other industrialized nations. Russia did not join in the imposition of sanctions and signed a deal with India for construction of a nuclear power station in Kudankulam. Following India's nuclear tests, Pakistani President Nawaz Sharif said that the balance of power in the region had been “violently tilted.”

India and Pakistan have gone to war three times since 1947, twice over the disputed territory of Kashmir. India claimed that its nuclear tests were necessary because of threats by China over disputed territory. Egypt called for the Middle East to be a nuclear-free region and urged Israel to join the NNPT. Turkey and Iran called on Pakistan and India to cease nuclear testing and join the Comprehensive Test Ban Treaty (CTBT). Saudi Arabia called on India and Pakistan to show restraint; Crown Prince Abdullah recently visited a uranium enrichment plant and missile factory in Pakistan. Japan has urged India and Pakistan to join the CTBT and has also offered both countries to conduct bilateral talks in Tokyo.

Summary for DUC 2007 topic D0720 (Summarizer from Ch. 5)

JERUSALEM, April 28 (Xinhua) – Israeli Prime Minister Benjamin Netanyahu has warned Palestinian leader Yasser Arafat not to declare an independent Palestinian State on May 4 this year. The Oslo Accords, signed by his predecessor, late Yitzhak Rabin, and Palestinian leader Yasser Arafat prepared the ground for the Israeli-Palestinian peace process on the principle of “land-for-peace.” Besides, Netanyahu said he did not intend to travel to the birth place of the Oslo Accords since he had no part in the historic peace agreements. According to the Oslo accords, a permanent peace agreement should be signed between the Palestinians and Israel by the date. Under the Oslo accords, Israel should transfer all remaining West Bank territories to Palestinian control during the third redeployment. Peres, who is also a Knesset (Parliament) member, contributed a great deal in the peace process, particularly in formulating the Oslo accords.

Both Palestinian National Authority Chairman Yasser Arafat and former Israeli Prime Minister Shimon Peres are planning to attend and scheduled to speak at the meeting to be held in Oslo next Monday. Following months of secret talks held in Oslo, the Israeli and Palestinian negotiators reached agreement on August 20 of 1993 on principles of Palestinian self-rule, widely-known later as the Oslo Accords which started the Israeli-Palestinian peace process. Israel is still at loggerheads with the Palestinians over its withdrawal from the West Bank, although the two sides should have moved forward to final-status talks two years ago according to the Oslo Accords. Egypt is the chief Arab mediator between Israel and the Palestinians when the two sides held negotiations on the Oslo accords.

Reference summary for DUC 2007 topic D0720

Following months of secret talks held in Oslo, Norway, Israeli and Palestinian negotiators reached agreement on the Oslo Accords, which started the Israeli-Palestinian peace process. The accords called for Palestinian self-rule in most of Gaza and the West Bank and for the withdrawal of Israeli forces from these territories. Palestinian rule would last for a five year period during which a permanent arrangement would be negotiated. Tough issues, as Jerusalem, were left for final status talks. The peace process went smoothly in the first two years, but both sides had doubts. Palestinians doubted the accords would do anything to improve their plight and Israelis doubted that giving land for peace would provide security. Under the accords, Israel should have completed its pullout from the West Bank town of Hebron in March 1996, but a series of suicide bombing attacks in February delayed the pullout.

The government of Prime Minister Benjamin Netanyahu took office in 1996. Netanyahu repeatedly attacked the Oslo Accords. He replaced the “land for peace” principle with “peace with security”. Israel began building more settlements in the occupied territories, delayed troop redeployments in the West Bank, and tightened control over Jerusalem. These policies, together with Hamas terrorist attacks, derailed the peace process and stalemated peace talks. President Yasser Arafat pledged that at the end of the peace process he would declare Palestinian independence in 1999 even without a final peace agreement with Israel. He warned that violence in the region would result if there was no peace agreement.


Summary for DUC 2007 topic D0730 (Summarizer from Ch. 5)

In a 6-3 decision, the high court ruled the Line Item Veto Act violated the constitution’s separation of powers between Congress, which approves legislation, and the president, who either signs it into law or vetoes it. President Bill Clinton said he was “disappointed” by the ruling, but expressed confidence the line-item law eventually will be upheld by the Supreme Court. The court ruled that such a specialized veto can be authorized only through a constitutional amendment. The law was challenged last year by a group of mostly Democratic senators, but the Supreme Court dismissed the challenge, saying President Bill Clinton had not yet used the selective veto and therefore the group had no standing to bring suit. Justice John Paul Stevens wrote the majority opinion for a court divided 6-to-3 in the line item veto case (Clinton vs. City of New York).

That process requires the president to approve or reject legislation in full, whereas the Line Item Veto Act impermissibly allowed him to strike legislation piecemeal, the justices added. The case the high court decided, Clinton vs. New York City, marked the second time the justices considered the constitutionality of the Line Item Veto Act, which went into effect in January 1997. Although Congress presumably anticipated that the President might cancel some of the items in the Balanced Budget Act and in the Taxpayer Relief Act, Congress cannot alter the procedures set out in Article I, §7, without amending the Constitution. What the Line Item Veto Act does instead, authorizing the President to “cancel” an item of spending, is technically different. The President’s action it authorizes in fact is not a line item veto and thus does not offend Art. I.

Reference summary for DUC 2007 topic D0730

In the words of President Clinton the line item veto “is very important in helping to preserve the integrity of federal spending”. The line item veto has been sought by presidents since Grant and was popularized by Reagan. It was part of the Republican “Contract with America” led by Speaker Newt Gingrich that enacted it. The line item veto allows the president to veto particular items in spending bills and certain limited tax provisions passed by Congress. Previously the president could only veto entire bills. Bill Clinton is the only president to have had line item veto authority. He has said that it should be used sparingly. He used it 163 times, mostly to delete items from the military construction bill.

The line item veto was challenged by a group of mostly Democratic senators but was dismissed by the Supreme Court. However, another challenge led by New York Mayor Giuliani and Idaho farmers resulted in a federal judge declaring the line item veto unconstitutional. The Justice Department appealed that decision to the Supreme Court. The Supreme Court rejected the line item veto as a departure from the basic constitutional requirement that presidents accept or reject bills in their entirety. The Court found that the line item veto violates the “presentment clause” of Article I that establishes the process by which a bill becomes law. The Court vote was 6-3 with Justice Stevens writing for the majority.


Notation

Throughout this work, we make use of some mathematical notation. Following conventional practice, bold capital letters denote matrices ($\mathbf{A}$), and bold lower case letters vectors ($\mathbf{x}$). A superscript $T$ denotes the transpose of a matrix or vector ($\mathbf{A}^T$). Sets and set elements are represented by non-bold letters, e.g. $T$ and $t$. Capital letter subscripts of matrices denote the sets corresponding to each mode, such that $\mathbf{A}_{TS}$ represents a $|T|$-by-$|S|$ matrix with rows and columns built from the sets $T$ and $S$ respectively. Lower case letter subscripts represent only one entry of the corresponding set, e.g. $A_{ts}$ is the entry of $\mathbf{A}_{TS}$ in row $t$ and column $s$, and $x_i$ is the $i$-th element of vector $\mathbf{x}$. We use subscripts to indicate the context of a vector, such that $\theta_d$ may also denote the vector $\boldsymbol{\theta}$ of document $d$. Statistical notation follows the standard conventions.

The following symbols and functions are used in this PhD thesis:

$N_d$                Number of words in document $d$
$M$                  Number of documents in a collection or corpus
$D$                  Corpus of documents
$T$                  The number of latent topics
$L$                  Length of summary in words
$n(d, w)$            The frequency of word $w$ in document $d$
$\mathbf{A}_{TS}$    A $|T|$-by-$|S|$ matrix with rows and columns built from the sets $T$ (terms) and $S$ (sentences)
$A_{ts}$             The entry of $\mathbf{A}_{TS}$ in row $t$ and column $s$
$\mathbf{x}^T\mathbf{y}$   Dot product of vectors $\mathbf{x}$ and $\mathbf{y}$
$\cos(\mathbf{x}, \mathbf{y})$   The cosine of the angle between $\mathbf{x}$ and $\mathbf{y}$
$\|\mathbf{x}\|$     Vector norm
$\alpha, \beta$      Dirichlet priors of the LDA model
$\Theta$             Topic-document probability distributions of the LDA model
$\Phi$               Word-topic probability distributions of the LDA model
$\theta_d$           Topic distribution of document $d$
$\phi_{z_k}$         Word distribution of latent topic $z_k$
$P(A|B)$             The probability of $A$ conditional on $B$
$\log a$             The logarithm of $a$
$\mathcal{L}$        Log likelihood of a dataset
$\mathrm{KL}(P\|Q)$  Kullback-Leibler (KL) divergence
$\mathrm{JS}(P\|Q)$  Jensen-Shannon (JS) divergence
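
To make the vector and distribution measures above concrete, the following minimal sketch computes the cosine similarity and the KL and JS divergences exactly as listed in this table. It assumes Python with NumPy, which the thesis does not prescribe; the function names, the term-sentence matrix and the topic distributions below are purely illustrative, not values from this work.

import numpy as np

def cosine(x, y):
    # cos(x, y): dot product x^T y divided by the product of the vector norms
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))

def kl_divergence(p, q, eps=1e-12):
    # KL(P||Q) = sum_i P_i * log(P_i / Q_i); eps guards against zero entries
    p = np.asarray(p, dtype=float) / np.sum(p)
    q = np.asarray(q, dtype=float) / np.sum(q)
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

def js_divergence(p, q):
    # JS(P||Q) = 0.5 * KL(P||M) + 0.5 * KL(Q||M), with the mixture M = (P + Q) / 2
    p = np.asarray(p, dtype=float) / np.sum(p)
    q = np.asarray(q, dtype=float) / np.sum(q)
    m = 0.5 * (p + q)
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

# Hypothetical topic distributions theta_d of two documents (illustrative values)
theta_d1 = np.array([0.7, 0.2, 0.1])
theta_d2 = np.array([0.1, 0.3, 0.6])
print(cosine(theta_d1, theta_d2))         # angle-based similarity of the vectors
print(js_divergence(theta_d1, theta_d2))  # symmetric divergence, bounded by log 2

# Hypothetical term-sentence matrix A_TS with |T| = 3 terms and |S| = 2 sentences;
# A[t, s] is the frequency of term t in sentence s.
A = np.array([[1.0, 0.0], [2.0, 1.0], [0.0, 3.0]])
print(cosine(A[:, 0], A[:, 1]))           # similarity of two sentence columns

Unlike KL, the JS divergence is symmetric and bounded (by $\log 2$ with natural logarithms), which makes it convenient for comparing topic or word distributions of documents and sentences.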

Bibliography

[AG02] Massih-Reza Amini and Patrick Gallinari. The use of unla-beled data to improve supervised learning for text summariza-tion. In Proc. of the 25th annual international ACM SIGIRconference on Research and development in information re-trieval, SIGIR ’02, pages 105–112, New York, NY, USA, 2002.ACM.

[AGK01] James Allan, Rahul Gupta, and Vikas Khandelwal. Temporalsummaries of new topics. In Proc. of the 24th annual interna-tional ACM SIGIR conference on Research and developmentin information retrieval, SIGIR ’01, pages 10–18, New York,NY, USA, 2001. ACM.

[AI99] Douglas Appelt and David Israel. Introduction to informationextraction technology. In A Tutorial Prepared for IJCAI-99,SRI International, 1999.

[AM10] Ion Androutsopoulos and Prodromos Malakasiotis. A surveyof paraphrasing and textual entailment methods. J. Artif. Int.Res., 38:135–187, May 2010.

[AOGL99] Chinatsu Aone, Mary Ellen Okurowski, James Gorlinsky, andBjornar Larsen. A trainable summarizer with knowledge ac-quired from robust nlp techniques. In I. Mani and M. T.Maybury, editors, Advances in Automatic Text Summariza-tion, pages 71–80. MIT Press, Cambridge, MA, 1999.

[AR08a] Rachit Arora and Balaraman Ravindran. Latent dirichlet allo-cation and singular value decomposition based multi-documentsummarization. In Proc. of the 2008 Eighth IEEE Interna-tional Conference on Data Mining, pages 713–718. IEEE Com-puter Society, 2008.

195

Page 212: Content Modeling for Automatic Document Summarization · Content Modeling for Automatic Document Summarization ... 6.4 Pairwise similarities of content units and latent ... 7.3 F1

BIBLIOGRAPHY

[AR08b] Rachit Arora and Balaraman Ravindran. Latent dirichlet al-location based multi-document summarization. In AND ’08:Proc. of the second workshop on Analytics for noisy unstruc-tured text data, pages 91–97, New York, NY, USA, 2008. ACM.

[AU07] M. R Amini and N. Usunier. A contextual query expansionapproach by term clustering for robust text summarization.In Proc. of the Document Understanding Conference (DUC),volume 7, 2007.

[AWST09] Arthur Asuncion, Max Welling, Padhraic Smyth, andYee Whye Teh. On smoothing and inference for topic mod-els. In UAI ’09: Proc. of the Twenty-Fifth Conference onUncertainty in Artificial Intelligence, pages 27–34, Arlington,Virginia, United States, 2009. AUAI Press.

[Bar32] F. C. Bartlett. Remembering: a study in experimental andsocial psychology. Cambridge University Press, 1932.

[Bax58] P. B. Baxendale. Man-made index for technical literature:an experiment. IBM Journal of Research & Development,2(4):354–361, 1958.

[BCT02] Thorsten Brants, Francine Chen, and Ioannis Tsochantaridis.Topic-based document segmentation with probabilistic latentsemantic analysis. In Proc. of the eleventh international confer-ence on Information and knowledge management, CIKM ’02,pages 211–218, New York, NY, USA, 2002. ACM.

[BCW90] Timothy C. Bell, John G. Cleary, and Ian H. Witten. Text com-pression. Prentice-Hall, Inc., Upper Saddle River, NJ, USA,1990.

[BE97] Regina Barzilay and Michael Elhadad. Using lexical chainsfor text summarization. In Proc. of the ACL Workshop onIntelligent Scalable Text Summarization, pages 10–17. ACL,1997.

[BGJT04] David M. Blei, Thomas L. Griffiths, Michael L. Jordan, andJoshua B. Tenenbaum. Hierarchical topic models and thenested chinese restaurant process. In Advances in Neural Infor-mation Processing Systems 16: Proc. of the 2003 Conference,page 17, 2004.

196

Page 213: Content Modeling for Automatic Document Summarization · Content Modeling for Automatic Document Summarization ... 6.4 Pairwise similarities of content units and latent ... 7.3 F1

BIBLIOGRAPHY

[BGMP01] Orkut Buyukkokten, Hector Garcia-Molina, and AndreasPaepcke. Seeing the whole in parts: text summarization forweb browsing on handheld devices. In WWW ’01: Proc. ofthe 10th international conference on World Wide Web, pages652–662, New York, NY, USA, 2001. ACM.

[Bis07] C. M. Bishop. Pattern Recognition and Machine Learning.Springer, 2007.

[BK97] B. Boguraev and C. Kennedy. Salience-based content charac-terisation of text documents. In Proc. of the ACL’97/EACL’97Workshop on Intelligent Scalable Text Summarization, pages2–9, 1997.

[BL04] Regina Barzilay and Lillian Lee. Catching the drift: Prob-abilistic content models, with applications to generation andsummarization. In Proc. of the Annual Conference of the NorthAmerican Chapter of the Association for Computational Lin-guistics (HLT-NAACL’04), pages 113–120, 2004.

[BL05] Regina Barzilay and Mirella Lapata. Modeling local coherence:an entity-based approach. In ACL ’05: Proc. of the 43rd An-nual Meeting on Association for Computational Linguistics,pages 141–148, Morristown, NJ, USA, 2005. Association forComputational Linguistics.

[BL06] David M. Blei and John D. Lafferty. Dynamic topic models. InProc. of the 23rd international conference on Machine learn-ing, ICML ’06, pages 113–120, New York, NY, USA, 2006.ACM.

[BM05] Regina Barzilay and Kathleen R. McKeown. Sentence fusionfor multidocument news summarization. Comput. Linguist.,31:297–328, 2005.

[BME99] Regina Barzilay, Kathleen R. McKeown, and Michael Elhadad.Information fusion in the context of multi-document summa-rization. In Proc. of the 37th annual meeting of the Associationfor Computational Linguistics on Computational Linguistics,pages 550–557, Morristown, NJ, USA, 1999. Association forComputational Linguistics.

197

Page 214: Content Modeling for Automatic Document Summarization · Content Modeling for Automatic Document Summarization ... 6.4 Pairwise similarities of content units and latent ... 7.3 F1

BIBLIOGRAPHY

[BMR95] Ronald Brandow, Karl Mitze, and Lisa F. Rau. Automaticcondensation of electronic publications by sentence selection.Inf. Process. Manage., 31(5):675–685, 1995.

[BMW00] Michele Banko, Vibhu O. Mittal, and Michael J. Witbrock.Headline generation based on statistical translation. In ACL’00: Proc. of the 38th Annual Meeting on Association for Com-putational Linguistics, pages 318–325, Morristown, NJ, USA,2000. Association for Computational Linguistics.

[BNJ03] David M. Blei, Andrew Y. Ng, and Michael I. Jordan. Latentdirichlet allocation. J. of Machine Learning Research, 3:993–1022, 2003.

[Bos08] W. E. Bosma. Discourse Oriented Summarization. PhD thesis,University of Twente, Enschede, the Netherlands, 2008.

[BP98] Sergey Brin and Lawrence Page. The anatomy of a large-scale hypertextual web search engine. In WWW7: Proc. of theseventh international conference on World Wide Web 7, pages107–117, Amsterdam, The Netherlands, 1998. Elsevier SciencePublishers B. V.

[BP06] R. Bunescu and M. Pasca. Using encyclopedic knowledge fornamed entity disambiguation. In Proc. of the 11th Conferenceof the European Chapter of the Association for ComputationalLinguistics (EACL), volume 6, pages 9–16, 2006.

[BSIM08] Harendra Bhandari, Masashi Shimbo, Takahiko Ito, and YujiMatsumoto. Generic text summarization using probabilisticlatent semantic indexing. In Proc. of the Third Int. J. Conf.on Natural Language Processing (IJCNLP 2008), 2008.

[BYRN99] Ricardo A. Baeza-Yates and Berthier Ribeiro-Neto. ModernInformation Retrieval. Addison-Wesley Longman PublishingCo., Inc., Boston, MA, USA, 1999.

[CG98] Jaime Carbonell and Jade Goldstein. The use of MMR,diversity-based reranking for reordering documents and pro-ducing summaries. In SIGIR ’98: Proc. of the 21st annualint. ACM SIGIR conf. on Research and development in infor-mation retrieval, pages 335–336, New York, NY, USA, 1998.ACM.

198

Page 215: Content Modeling for Automatic Document Summarization · Content Modeling for Automatic Document Summarization ... 6.4 Pairwise similarities of content units and latent ... 7.3 F1

BIBLIOGRAPHY

[CH00] David Cohn and Thomas Hofmann. The missing link - a proba-bilistic model of document content and hypertext connectivity.In Neural Information Processing Systems 13, pages 430–436,2000.

[CIK+07] Terry Copeck, Diana Inkpen, Anna Kazantseva, AlistairKennedy, Darren Kipp, and Stan Szpakowicz. Catch whatyou can. In Proc. of the Document Understanding Conference2007, 2007.

[CLGT00] W. K. Chan, T. B. Y. Lai, W. J. Gao, and B. K. T’sou. Miningdiscourse markers for chinese textual summarization. In Proc.of the Workshop on Automatic Summarization, pages 11–20,2000.

[CO01] John M. Conroy and Dianne P. O’Leary. Text summarizationvia hidden markov models and pivoted QR matrix decomposi-tion. Technical Report CS-TR-4221, University of Maryland,May 2001.

[CORGC04] Simon Corston-Oliver, Eric Ringger, Michael Gamon, andRichard Campbell. Task-focused summarization of email. InMarie-Francine Moens and Stan Szpakowicz, editors, TextSummarization Branches Out: Proc. of the ACL-04 Workshop,pages 43–50, Barcelona, Spain, 2004. Association for Compu-tational Linguistics.

[COS06] J. M Conroy, D. P O’Leary, and J. D Schlesinger. CLASSYarabic and english Multi-Document summarization. In Multi-Lingual Summarization Evaluation 2006, 2006.

[Cre96] E.T. Cremmins. The Art of Abstracting. Information Re-sources Press, Arlington, Virginia, 1996.

[CS04] Terry Copeck and Stan Szpakowicz. Vocabulary usage innewswire summaries. In Marie-Francine Moens and Stan Sz-pakowicz, editors, Text Summarization Branches Out: Proc. ofthe ACL-04 Workshop, pages 19–26, Barcelona, Spain, 2004.Association for Computational Linguistics.

[CSGO04] John M. Conroy, Judith T. Schlesinger, Jade Goldstein, andDianne P. O’Leary. Left-Brain/Right-Brain Multi-Documentsummarization. In Proc. of the Document Understanding Con-ference 2004, 2004.

199

Page 216: Content Modeling for Automatic Document Summarization · Content Modeling for Automatic Document Summarization ... 6.4 Pairwise similarities of content units and latent ... 7.3 F1

BIBLIOGRAPHY

[CSO06] John M. Conroy, Judith D. Schlesinger, and Dianne P.O’Leary. Topic-focused multi-document summarization usingan approximate oracle score. In Proc. of the COLING/ACL onMain conference poster sessions, pages 152–159, Morristown,NJ, USA, 2006. Association for Computational Linguistics.

[Dan06] H. T. Dang. Overview of DUC 2006. In Proc. of the DocumentUnderstanding Conference (DUC 2006), 2006.

[DBS07] Laura Dietz, Steffen Bickel, and Tobias Scheffer. Unsuper-vised prediction of citation influences. In Proc. of the 24th in-ternational conference on Machine learning, ICML ’07, pages233–240, New York, NY, USA, 2007. ACM.

[DDF+90] Scott Deerwester, Susan T. Dumais, George W. Furnas,Thomas K. Landauer, and Richard Harshman. Indexing bylatent semantic analysis. J. of the American Society for Infor-mation Science, 41:391–407, 1990.

[DDGR07] Abhinandan S. Das, Mayur Datar, Ashutosh Garg, and ShyamRajaram. Google news personalization: scalable online collab-orative filtering. In Proc. of the 16th international conferenceon World Wide Web, WWW ’07, pages 271–280, New York,NY, USA, 2007. ACM.

[DDMR09] Ido Dagan, Bill Dolan, Bernardo Magnini, and Dan Roth.Recognizing textual entailment: Rational, evaluation and ap-proaches. Natural Language Engineering, 15(Special Issue04):i–xvii, 2009.

[DeJ82] G. F. DeJong. An overview of the FRUMP system. InW. Lehnert and M. Ringle, editors, Strategies for Natural Lan-guage Processing. Lawrence Erlbaum and Associates, Hillsdale,NJ, 1982.

[DKL07] Hoa Trang Dang, Diane Kelly, and Jimmy J. Lin. Overview ofthe TREC 2007 Question Answering Track. In TREC, 2007.

[DL04] F. S. Douzidia and G. Lapalme. Lakhas, an arabic summaris-ing system. In Proc. of the Document Understanding Confer-ence, DUC 2004, pages 128–135, 2004.

200

Page 217: Content Modeling for Automatic Document Summarization · Content Modeling for Automatic Document Summarization ... 6.4 Pairwise similarities of content units and latent ... 7.3 F1

BIBLIOGRAPHY

[DLR77] A. Dempster, N. Laird, and D. Rubin. Maximum likelihoodfrom incomplete data via the em algorithm. J. Royal Statist.Soc., B 39:1–38, 1977.

[DM05a] Hal Daume III and Daniel Marcu. Bayesian multi-documentsummarization at MSE. In Proc. of the Workshop on Multilin-gual Summarization Evaluation (MSE), Ann Arbor, MI, June29 2005.

[DM05b] Hal Daume III and Daniel Marcu. Bayesian summarization atDUC and a suggestion for extrinsic evaluation. In Proc. of theDocument Understanding Conference (DUC) 2005, 2005.

[DM06] Hal Daume III and Daniel Marcu. Bayesian query-focusedsummarization. In Proc. Int. Conf. on Computational Lin-guistics (ACL), pages 305–312, Morristown, NJ, USA, 2006.Association for Computational Linguistics.

[DO08] Hoa Trang Dang and Karolina Owczarzak. Overview of the tac2008 update summarization task. In Proc. of the Text AnalysisConference (TAC) 2008, 2008.

[DUC07] DUC. Proc. of the document understanding conferences 2001-2007. http://duc.nist.gov, 2007.

[Dum04] S. T. Dumais. Latent semantic analysis. Annual Review of In-formation Science and Technology (ARIST), 38:189–230, 2004.

[Ear70] Lois L. Earl. Experiments in automatic extracting and index-ing. Information Storage and Retrieval, 6(4):313–330, October1970.

[EBC06] Oren Etzioni, Michele Banko, and Michael J. Cafarella. Ma-chine reading. In AAAI’06: Proc. of the 21st national confer-ence on Artificial intelligence, pages 1517–1519. AAAI Press,2006.

[Edm69] H.P. Edmundson. New methods in automatic abstracting. J.of the Association for Computing Machinery, 1969.

[EN98] B. Endres-Niggemeyer. Summarising information. Springer,Berlin, 1998.

201

Page 218: Content Modeling for Automatic Document Summarization · Content Modeling for Automatic Document Summarization ... 6.4 Pairwise similarities of content units and latent ... 7.3 F1

BIBLIOGRAPHY

[ER04] G. Erkan and D. Radev. Lexrank: graph-based centrality assalience in text summarisation. J. of Artificial IntelligenceResearch, 22:457–479, 2004.

[FAR07] Maria Fuentes, Enrique Alfonseca, and Horacio Rodrıguez.Support vector machines for query-focused summarizationtrained and evaluated on pyramid data. In Proc. of the 45thAnnual Meeting of the ACL on Interactive Poster and Demon-stration Sessions, ACL ’07, pages 57–60, Morristown, NJ,USA, 2007. Association for Computational Linguistics.

[Fel98] Christiane Fellbaum. WordNet: An Electronic LexicalDatabase. MIT Press, 1998.

[Fel06] Ronen Feldman. Text Mining Handbook: Advanced Approachesin Analyzing Unstructured Data. Cambridge University Press,New York, NY, USA, 2006.

[FH04] Elena Filatova and Vasileios Hatzivassiloglou. Event-based ex-tractive summarization. In Marie-Francine Moens and Stan Sz-pakowicz, editors, Text Summarization Branches Out: Proc. ofthe ACL-04 Workshop, pages 104–111, Barcelona, Spain, 2004.Association for Computational Linguistics.

[Fir57] John Rupert Firth. Papers in Linguistics 1934-1951. Long-mans, London, 1957.

[FR06] Seeger Fisher and Brian Roark. Query-focused summarizationby supervised sentence ranking and skewed word distributions.In Proc. of the Document Understanding Conference (DUC2006), 2006.

[Fur05] S. Furui. Spontaneous speech recognition and summarization.In Proc. of the Second Baltic Conference on Human LanguageTechnologies, pages 39–50, 2005.

[Fut99] R. P. Futrelle. Summarization of diagrams in documents. InI. Mani and M. T. Maybury, editors, Advances in AutomaticText Summarization, pages 403–421. MIT Press, Cambridge,MA, 1999.

[GF09] Dan Gillick and Benoit Favre. A scalable global model forsummarization. In ILP ’09: Proc. of the Workshop on IntegerLinear Programming for Natural Language Processing, pages

202

Page 219: Content Modeling for Automatic Document Summarization · Content Modeling for Automatic Document Summarization ... 6.4 Pairwise similarities of content units and latent ... 7.3 F1

BIBLIOGRAPHY

10–18, Morristown, NJ, USA, 2009. Association for Computa-tional Linguistics.

[GKMC99] Jade Goldstein, Mark Kantrowitz, Vibhu Mittal, and JaimeCarbonell. Summarizing text documents: sentence selectionand evaluation metrics. In SIGIR ’99: Proc. of the 22nd an-nual international ACM SIGIR conference on Research and de-velopment in information retrieval, pages 121–128, New York,NY, USA, 1999. ACM.

[GL01] Yihong Gong and Xin Liu. Generic text summarization usingrelevance measure and latent semantic analysis. In SIGIR ’01:Proc. of the 24th annual int. ACM SIGIR conf. on Researchand development in information retrieval, pages 19–25, NewYork, NY, USA, 2001. ACM.

[GMCK00] Jade Goldstein, Vibhu Mittal, Jaime Carbonell, and MarkKantrowitz. Multi-document summarization by sentence ex-traction. In NAACL-ANLP 2000 Workshop on Automaticsummarization, pages 40–48, Morristown, NJ, USA, 2000. As-sociation for Computational Linguistics.

[GS04] T. L. Griffiths and M. Steyvers. Finding scientific topics. Proc.of the National Academy of Sciences, 101(Suppl. 1):5228–5235,2004.

[GSBT05] T. L Griffiths, M. Steyvers, D. M Blei, and J. B Tenenbaum.Integrating topics and syntax. In Advances in neural informa-tion processing systems, volume 17, pages 537–544, 2005.

[GVL96] Gene H. Golub and Charles F. Van Loan. Matrix Computa-tions. The Johns Hopkins University Press, October 1996.

[GZH10] Kavita Ganesan, ChengXiang Zhai, and Jiawei Han. Opinosis:A graph based approach to abstractive summarization ofhighly redundant opinions. In Proceedings of the 23rd Interna-tional Conference on Computational Linguistics (Coling 2010),pages 340–348, Beijing, China, August 2010.

[HA10] Leonhard Hennig and Sahin Albayrak. Personalized multi-document summarization using n-gram topic model fusion. InNicoletta Calzolari (Conference Chair), Khalid Choukri, BenteMaegaard, Joseph Mariani, Jan Odijk, Stelios Piperidis, Mike

203

Page 220: Content Modeling for Automatic Document Summarization · Content Modeling for Automatic Document Summarization ... 6.4 Pairwise similarities of content units and latent ... 7.3 F1

BIBLIOGRAPHY

Rosner, and Daniel Tapias, editors, Proc. of LREC ’10, 1stWorkshop on Semantic Personalized Information Management(SPIM 2010), pages 28–34, Valletta, Malta, 2010. EuropeanLanguage Resources Association (ELRA).

[Har82] Z. Harris. Discourse and sublanguage. In R. Kittredgeand J. Lehrberger, editors, Sublanguage: Studies of Languagein Restricted Semantic Domains, pages 231–236. Walter deGruyter, Berlin; New York, 1982.

[Har04] Sanda Harabagiu. Incremental topic representations. In COL-ING ’04: Proc. of the 20th international conference on Com-putational Linguistics, pages 583–589, Morristown, NJ, USA,2004. Association for Computational Linguistics.

[HDLA10] Leonhard Hennig, Ernesto William De Luca, and Sahin Al-bayrak. Learning summary content units with topic model-ing. In Proc. of the 23rd International Conference on Compu-tational Linguistics (COLING 2010), pages 391–399, Beijing,China, August 2010. Coling 2010 Organizing Committee.

[Hea97] Marti A. Hearst. Texttiling: segmenting text into multi-paragraph subtopic passages. Comput. Linguist., 23(1):33–64,1997.

[Hen09] Leonhard Hennig. Topic-based multi-document summariza-tion with probabilistic latent semantic analysis. In Proc. ofthe International Conference RANLP-2009, pages 144–149,Borovets, Bulgaria, September 2009. Association for Compu-tational Linguistics.

[HG05] Ben Hachey and Claire Grover. Sentence extraction for legaltext summarisation. In IJCAI’05: Proc. of the 19th interna-tional joint conference on Artificial intelligence, pages 1686–1687, San Francisco, CA, USA, 2005. Morgan Kaufmann Pub-lishers Inc.

[HH76] M. Halliday and R. Hasan. Cohesion in English. Longman,London, 1976.

[HH02] Udo Hahn and Donna Harman, editors. Proc. of the DocumentUnderstanding Conference (DUC-02), Philadelphia, 2002.

204

Page 221: Content Modeling for Automatic Document Summarization · Content Modeling for Automatic Document Summarization ... 6.4 Pairwise similarities of content units and latent ... 7.3 F1

BIBLIOGRAPHY

[HHL07] Sanda Harabagiu, Andrew Hickl, and Finley Lacatusu. Satis-fying information needs with multi-document summaries. Inf.Process. Manage., 43:1619–1642, November 2007.

[HIMM02] Tsutomu Hirao, Hideki Isozaki, Eisaku Maeda, and Yuji Mat-sumoto. Extracting important sentences with support vectormachines. In Proc. of the 19th international conference onComputational linguistics - Volume 1, pages 1–7, Morristown,NJ, USA, 2002. Association for Computational Linguistics.

[HJM08] David Hall, Daniel Jurafsky, and Christopher D. Manning.Studying the history of ideas using topic models. In EMNLP’08: Proc. of the Conference on Empirical Methods in NaturalLanguage Processing, pages 363–371, Morristown, NJ, USA,2008. Association for Computational Linguistics.

[HL99] E. Hovy and C-Y. Lin. Automated text summarization insummarist. In I. Mani and M. Maybury, editors, Advancesin Automatic Text Summarization, pages 81–94. MIT Press,Cambridge, MA, 1999.

[HL02] S. Harabagiu and F. Lacatusu. Generating single and multi-document summaries with gistexter. In Proc. of the DocumentUnderstanding Conference (DUC 2002), pages 30–38, 2002.

[HL05] Sanda Harabagiu and Finley Lacatusu. Topic themes for multi-document summarization. In SIGIR ’05: Proc. of the 28thannual int. ACM SIGIR conf. on Research and developmentin information retrieval, pages 202–209, New York, NY, USA,2005. ACM Press.

[HL10] Sanda Harabagiu and Finley Lacatusu. Using topic themesfor multi-document summarization. ACM Trans. Inf. Syst.,28:13:1–13:47, July 2010.

[HLH06] Sanda Harabagiu, Finley Lacatusu, and Andrew Hickl. An-swering complex questions with random walk models. In SI-GIR ’06: Proc. of the 29th annual international ACM SIGIRconference on Research and development in information re-trieval, pages 220–227, New York, NY, USA, 2006. ACM.

[HLZ05] Eduard Hovy, Chin-Yew Lin, and Liang Zhou. EvaluatingDUC 2005 using basic elements. In Proc. of the Fifth Doc-

205

Page 222: Content Modeling for Automatic Document Summarization · Content Modeling for Automatic Document Summarization ... 6.4 Pairwise similarities of content units and latent ... 7.3 F1

BIBLIOGRAPHY

ument Understanding Conference (DUC), Vancouver, BritishColumbia, Canada, 2005.

[HM00] Udo Hahn and Inderjeet Mani. The challenges of automaticsummarization. Computer, 33(11):29–36, 2000.

[HM01] Donna Harman and Daniel Marcu, editors. Proc. of the Doc-ument Understanding Conference (DUC-01), New Orleans,2001.

[HMR05] B. Hachey, G. Murray, and D. Reitter. The EMBRA systemat DUC 2005: Query-oriented multi-document summarizationwith a very large latent semantic space. In Proc. of the Docu-ment Understanding Conference (DUC-2005), 2005.

[HNPR05] A. Harnly, A. Nenkova, R. Passonneau, and O. Rambow. Au-tomation of summary evaluation by the Pyramid method. InProc. of the Conference on Recent Advances in Natural Lan-guage Processing 2005, 2005.

[Hof99a] Thomas Hofmann. Probabilistic latent semantic analysis. InProc. of Uncertainty in Artificial Intelligence (UAI’99), 1999.

[Hof99b] Thomas Hofmann. Probabilistic latent semantic indexing. InSIGIR ’99: Proc. of the 22nd annual international ACM SI-GIR conference on Research and development in informationretrieval, pages 50–57, New York, NY, USA, 1999. ACM.

[HR86] Udo Hahn and Ulrich Reimer. Semantic parsing and summa-rizing of technical texts in the topic system. In R. Kuhlen,editor, Informationslinguistik. Theoretische, experimentelle,curriculare und prognostische Aspekte einer informationswis-senschaftlichen Teildisziplin, pages 153–193. M. Niemeyer,Tubingen, 1986.

[HR99] Udo Hahn and Ulrich Reimer. Knowledge-based text summa-rization: Salience and generalization operators for knowledgebase abstraction. In Advances in Automatic Text Summariza-tion, pages 215—232. MIT Press, Cambridge, MA, 1999.

[HRL07] Andrew Hickl, Kirk Roberts, and Finley Lacatusu. LCC’s GIS-Texter at DUC 2007: Machine reading for update summariza-tion. In Proc. of the 2007 Document Understanding Conference(DUC 2007), 2007.

206

Page 223: Content Modeling for Automatic Document Summarization · Content Modeling for Automatic Document Summarization ... 6.4 Pairwise similarities of content units and latent ... 7.3 F1

BIBLIOGRAPHY

[HS08] Leonhard Hennig and Thomas Strecker. Tailoring text for au-tomatic layouting of newspaper pages. In 19th InternationalConference on Pattern Recognition (ICPR 2008), pages 1–4,2008.

[HSL08] Meishan Hu, Aixin Sun, and Ee-Peng Lim. Comments-oriented document summarization: understanding documentswith readers’ feedback. In Proc. of the 31st annual interna-tional ACM SIGIR conference on Research and developmentin information retrieval, pages 291–298, Singapore, Singapore,2008. ACM.

[HSN+10] Leonhard Hennig, Thomas Strecker, Sascha Narr, ErnestoWilliam De Luca, and Sahin Albayrak. Identifying sentence-level semantic content units with topic models. Databaseand Expert Systems Applications, International Workshop on,0:59–63, 2010.

[HUW08] Leonhard Hennig, Winfried Umbrath, and Robert Wetzker.An ontology-based approach to text summarization. InIEEE/WIC/ACM International Conference on Web Intelli-gence and Intelligent Agent Technology, 2008 (WI-IAT ’08)0,volume 3, pages 291–294, 2008.

[HV09] Aria Haghighi and Lucy Vanderwende. Exploring contentmodels for multi-document summarization. In NAACL ’09:Proc. of Human Language Technologies: The 2009 AnnualConference of the North American Chapter of the Associa-tion for Computational Linguistics, pages 362–370, Morris-town, NJ, USA, 2009. Association for Computational Linguis-tics.

[JBME98] Hongyan Jing, Regina Barzilay, Kathleen McKeown, andMichael Elhadad. Summarization evaluation methods: Exper-iments and analysis. Intelligent Text Summarization. Papersfrom the 1998 AAAI Spring Symposium. Technical Report SS-98-06, pages 60–68, 1998.

[JG96] K. Sparck Jones and J.R. Galliers. Evaluating natural languageprocessing systems: An analysis and review. Springer, Berlin,1996.

207

Page 224: Content Modeling for Automatic Document Summarization · Content Modeling for Automatic Document Summarization ... 6.4 Pairwise similarities of content units and latent ... 7.3 F1

BIBLIOGRAPHY

[JM99] Hongyan Jing and Kathleen R. McKeown. The decompositionof human-written summary sentences. In SIGIR ’99: Proc.of the 22nd annual international ACM SIGIR conference onResearch and development in information retrieval, pages 129–136, New York, NY, USA, 1999. ACM.

[Joa99] Thorsten Joachims. Making large-scale support vector machinelearning practical. In B. Scholkopf, C. Burges, and A. Smola,editors, Advances in kernel methods: support vector learning,pages 169–184. MIT Press, Cambridge, MA, USA, 1999.

[Jon07] Karen Sparck Jones. Automatic summarising: The state ofthe art. Inf. Process. Manage., 43(6):1449–1481, 2007.

[JPV06] J. Jagarlamudi, P. Pingali, and V. Varma. Query indepen-dent sentence scoring approach to duc 2006. In Proc. of theDocument Understanding Conference (DUC 2006), 2006.

[KLWC05] Lun-Wei Ku, Li-Ying Lee, Tung-Ho Wu, and Hsin-Hsi Chen.Major topic detection and its application to opinion summa-rization. In Proceedings of the 28th annual international ACMSIGIR conference on Research and development in informationretrieval, pages 627–628, Salvador, Brazil, 2005. ACM.

[KPC95] Julian Kupiec, Jan Pedersen, and Francine Chen. A trainabledocument summarizer. In SIGIR ’95: Proc. of the 18th annualinternational ACM SIGIR conference on Research and devel-opment in information retrieval, pages 68–73, New York, NY,USA, 1995. ACM.

[KPP04] Hans Kellerer, Ulrich Pferschy, and David Pisinger. KnapsackProblems. Springer, 2004.

[KR95] R. E. Kass and A. E. Raftery. Bayes factors. Journal of theAmerican Statistical Association, 90:773–795, 1995.

[KS10] A. Kazantseva and S. Szpakowicz. Summarizing short stories.Computational Linguistics, 36(1):71–109, 2010.

[Leh82] W. Lehnert. Plot units: A narrative summarization strategy.In W. Lehnert and M. Ringle, editors, Strategies for Natu-ral Language Processing. Lawrence Erlbaum and Associates,Hillsdale, NJ, 1982.

208

Page 225: Content Modeling for Automatic Document Summarization · Content Modeling for Automatic Document Summarization ... 6.4 Pairwise similarities of content units and latent ... 7.3 F1

BIBLIOGRAPHY

[LFL98] T. K. Landauer, P. W. Foltz, and D. Laham. Introductionto latent semantic analysis. Discourse Processes, 25:259–284,1998.

[LH97] Chin-Yew Lin and Eduard Hovy. Identifying topics by position.In Proc. of the fifth conference on Applied natural language pro-cessing, pages 283–290, San Francisco, CA, USA, 1997. MorganKaufmann Publishers Inc.

[LH00] Chin-Yew Lin and Eduard Hovy. The automated acquisitionof topic signatures for text summarization. In Proc. of ACL,pages 495–501, Morristown, NJ, USA, 2000. Association forComputational Linguistics.

[LH02] Chin-Yew Lin and Eduard Hovy. Manual and automatic evalu-ation of summaries. In Proc. of the ACL-02 Workshop on Au-tomatic Summarization, pages 45–51, Morristown, NJ, USA,2002. Association for Computational Linguistics.

[LH03] Chin-Yew Lin and Eduard Hovy. Automatic evaluation of sum-maries using N-gram co-occurrence statistics. In NAACL ’03:Proc. of the 2003 Conference of the North American Chapter ofthe Association for Computational Linguistics on Human Lan-guage Technology, pages 71–78, Morristown, NJ, USA, 2003.Association for Computational Linguistics.

[LHR+06] Finley Lacatusu, Andrew Hickl, Kirk Roberts, Ying Shi,Jeremy Bensley, Bryan Rink, Patrick Wang, and Lara Tay-lor. LCC’s GISTexter at DUC 2006: Multi-Strategy Multi-Document summarization. In Proc. of the 2006 documentunderstanding conference (DUC 2006) at HLT/NAACL 2006,2006.

[Lin04] Chin-Yew Lin. Rouge: A package for automatic evaluationof summaries. In Marie-Francine Moens and Stan Szpakow-icz, editors, Text Summarization Branches Out: Proc. of theACL-04 Workshop, pages 74–81, Barcelona, Spain, 2004. As-sociation for Computational Linguistics.

[LMFG05] Jure Leskovec, Natasa Milic-Frayling, and Marko Grobelnik.Impact of linguistic analysis on the semantic graph coverageand learning of document extracts. In AAAI’05: Proc. of the

209

Page 226: Content Modeling for Automatic Document Summarization · Content Modeling for Automatic Document Summarization ... 6.4 Pairwise similarities of content units and latent ... 7.3 F1

BIBLIOGRAPHY

20th national conference on Artificial intelligence, pages 1069–1074. AAAI Press, 2005.

[LMK07] T. Landauer, S. Dennis McNamara, and W. Kintsch, editors.Latent Semantic Analysis: A Road to Meaning. Laurence Erl-baum, 2007.

[Luc08] Ernesto William De Luca. Semantic Support in MultilingualText Retrieval. PhD thesis, Otto-von-Guericke-UniversitatMagdeburg, 2008.

[Luh58] H. P. Luhn. The automatic creation of literature abstracts.IBM J. Res. Dev., 2(2):159–165, 1958.

[LYRL04] David D. Lewis, Yiming Yang, Tony G. Rose, and Fan Li.Rcv1: A new benchmark collection for text categorization re-search. J. Mach. Learn. Res., 5:361–397, 2004.

[LZX+09] Liangda Li, Ke Zhou, Gui-Rong Xue, Hongyuan Zha, and YongYu. Enhancing diversity, coverage and balance for summariza-tion through structure learning. In Proc. of the 18th interna-tional conference on World wide web, WWW ’09, pages 71–80,New York, NY, USA, 2009. ACM.

[MA08] Arthur G. Money and Harry Agius. Video summarisation: Aconceptual framework and survey of the state of the art. J.Vis. Comun. Image Represent., 19(2):121–143, 2008.

[Mac67] J. B. MacQueen. Some methods for classification and analysisof multivariate observations. In L. M. Le Cam and J. Ney-man, editors, Proc. of the Fifth Berkeley Symposium on Math-ematical Statistics and Probability, volume 1, pages 281–297.University of California Press, 1967.

[Man01] Inderjeet Mani. Automatic summarization. John BenjaminsPublishing Company, 2001.

[Mar97a] D. Marcu. From discourse structures to text summaries. InProc. of the ACL Workshop on Intelligent Scalable Text Sum-marization, pages 82–88, 1997.

[Mar97b] D. Marcu. The rhetorical parsing of natural language texts.In Proc. of the 35th Annual Meeting of the Association forComputational Linguistics, pages 96–103, 1997.

210

Page 227: Content Modeling for Automatic Document Summarization · Content Modeling for Automatic Document Summarization ... 6.4 Pairwise similarities of content units and latent ... 7.3 F1

BIBLIOGRAPHY

[Mar99] Daniel Marcu. Discourse trees are good indicators of impor-tance in text. In Advances in Automatic Text Summarization,pages 123–136. The MIT Press, 1999.

[May95] Mark T. Maybury. Generating summaries from event data.Inf. Process. Manage., 31(5):735–751, 1995.

[MB97] I. Mani and E. Bloedorn. Multi-document summarization bygraph search and matching. In Proc. of AAAI 1997, pages622–628, Menlo Park, California, 1997. AAAI Press.

[MB98] Inderjeet Mani and Eric Bloedorn. Machine learning of genericand user-focused summarization. In AAAI ’98/IAAI ’98:Proc. of the fifteenth national/tenth conference on Artificialintelligence/Innovative applications of artificial intelligence,pages 820–826, Menlo Park, CA, USA, 1998. American As-sociation for Artificial Intelligence.

[MB99] Inderjeet Mani and Eric Bloedorn. Summarizing similaritiesand differences among related documents. Inf. Retr., 1(1-2):35–67, 1999.

[MBE+01] K.R. McKeown, R. Barzilay, D. Evans, V. Hatzivassiloglou,M. Yen Kan, B. Schiffman, and S. Teufel. Columbia multi-document summarisation: Approach and evaluation. In Proc.of the Document Understanding Conference (DUC 2001),2001.

[MBE+02] Kathleen R. McKeown, Regina Barzilay, David Evans,Vasileios Hatzivassiloglou, Judith L. Klavans, Ani Nenkova,Carl Sable, Barry Schiffman, and Sergey Sigelman. Track-ing and summarizing news on a daily basis with columbia’snewsblaster. In Proc. of the second international conferenceon Human Language Technology Research, pages 280–285, SanFrancisco, CA, USA, 2002. Morgan Kaufmann Publishers Inc.

[McD07] Ryan McDonald. A study of global inference algorithms inmulti-document summarization. In ECIR’07: Proc. of the 29thEuropean conference on IR research, pages 557–564, Berlin,Heidelberg, 2007. Springer-Verlag.

[MD00] M. F. Moens and J. Dumortier. Use of a text grammar forgenerating highlight abstracts of magazine articles. Journal ofDocumentation, pages 520–539, 2000.

211

Page 228: Content Modeling for Automatic Document Summarization · Content Modeling for Automatic Document Summarization ... 6.4 Pairwise similarities of content units and latent ... 7.3 F1

BIBLIOGRAPHY

[MGP03] Florent Monay and Daniel Gatica-Perez. On image auto-annotation with latent space models. In Proc. of the eleventhACM international conference on Multimedia, MULTIMEDIA’03, pages 275–278, New York, NY, USA, 2003. ACM.

[Mih05] Rada Mihalcea. Language independent extractive summariza-tion. In ACL ’05: Proc. of the ACL 2005 on Interactiveposter and demonstration sessions, pages 49–52, Morristown,NJ, USA, 2005. Association for Computational Linguistics.

[Mit97] Tom Mitchell. Machine Learning. Mcgraw-Hill, London, 1997.

[MJH98] K. McKeown, D. Jordan, and V. Hatzivassiloglou. Generat-ing patient-specific summaries of online literature. In Proc. ofAAAI-98, pages 34–43, 1998.

[MM99] A. Merlino and M. Maybury. An empirical study of the optimalpresentation of multimedia summaries of broadcast news. InI. Mani and M. T. Maybury, editors, Advances in AutomaticText Summarization, pages 391–402. MIT Press, Cambridge,MA, 1999.

[MPE+05] Kathleen McKeown, Rebecca J. Passonneau, David K. Elson,Ani Nenkova, and Julia Hirschberg. Do summaries help? InSIGIR ’05: Proc. of the 28th annual international ACM SI-GIR conference on Research and development in informationretrieval, pages 210–217, New York, NY, USA, 2005. ACM.

[MR95] Kathleen McKeown and Dragomir R. Radev. Generating sum-maries of multiple news articles. In SIGIR ’95: Proc. of the18th annual international ACM SIGIR conference on Researchand development in information retrieval, pages 74–82, NewYork, NY, USA, 1995. ACM.

[MRC05] G. Murray, S. Renals, and J. Carletta. Extractive summariza-tion of meeting recordings. In Ninth European Conference onSpeech Communication and Technology, 2005.

[MRK95] Kathleen McKeown, Jacques Robin, and Karen Kukich. Gen-erating concise natural language summaries. Inf. Process.Manage., 31(5):703–733, 1995.

[MS01] Christopher D. Manning and Hinrich Schutze. Foundations ofStatistical Natural Language Processing. MIT Press, 2001.

212

Page 229: Content Modeling for Automatic Document Summarization · Content Modeling for Automatic Document Summarization ... 6.4 Pairwise similarities of content units and latent ... 7.3 F1

BIBLIOGRAPHY

[MSB97] Mandar Mitra, Amit Singhal, and Chris Buckley. Automatictext summarization by paragraph extraction. In Proc. of theACL’97/EACL’97 Workshop on Intelligent Scalable Text Sum-marization, pages 39–46, 1997.

[MT88] W. C. Mann and S. A. Thompson. Rhetorical structure the-ory: Towards a functional theory of text organization. Text,8(3):243–281, 1988.

[MT04] R. Mihalcea and P. Tarau. TextRank - bringing order intotexts. In Proc. of the Conference on Empirical Methods in Nat-ural Language Processing (EMNLP 2004), volume 4, page 6,Barcelona, Spain, 2004.

[MWN+09] David Mimno, Hanna M. Wallach, Jason Naradowsky,David A. Smith, and Andrew McCallum. Polylingual topicmodels. In Proc. of the 2009 Conference on Empirical Meth-ods in Natural Language Processing: Volume 2 - Volume 2,EMNLP ’09, pages 880–889, Morristown, NJ, USA, 2009. As-sociation for Computational Linguistics.

[Nas08] Vivi Nastase. Topic-driven multi-document summarizationwith encyclopedic knowledge and spreading activation. InProc. of the Conference on Empirical Methods in Natural Lan-guage Processing, pages 763–772, Honolulu, Hawaii, 2008. As-sociation for Computational Linguistics.

[Nen05] Ani Nenkova. Automatic text summarization of newswire:lessons learned from the document understanding conference.In Proc. of the 20th national conference on Artificial intelli-gence - Volume 3, pages 1436–1441. AAAI Press, 2005.

[Nen06] Ani Nenkova. Understanding the process of multi-documentsummarization: content selection, rewriting and evaluation.PhD thesis, Columbia University, New York, NY, USA, 2006.

[NHMK10] Hitoshi Nishikawa, Takaaki Hasegawa, Yoshihiro Matsuo, andGenichiro Kikui. Opinion summarization with integer linearprogramming formulation for sentence extraction and ordering.In Coling 2010: Posters, pages 910–918, Beijing, China, 2010.

[NJW01] A.Y. Ng, M.I. Jordan, and Y. Weiss. On spectral clustering:analysis and an algorithm. In T.G. Dietterich, S. Becker, and

213

Page 230: Content Modeling for Automatic Document Summarization · Content Modeling for Automatic Document Summarization ... 6.4 Pairwise similarities of content units and latent ... 7.3 F1

BIBLIOGRAPHY

Z. Ghahramani, editors, Advances in Neural Information Pro-cessing Systems, volume 14, Cambridge, MA, 2001. MIT Press.

[NM01] Tadashi Nomoto and Yuji Matsumoto. A new approach tounsupervised text summarization. In Proc. of the 24th annualinternational ACM SIGIR conference on Research and devel-opment in information retrieval, SIGIR ’01, pages 26–34, NewYork, NY, USA, 2001. ACM.

[Nom05] Tadashi Nomoto. Bayesian learning in text summarization. InHLT ’05: Proc. of the conference on Human Language Tech-nology and Empirical Methods in Natural Language Process-ing, pages 249–256, Morristown, NJ, USA, 2005. Associationfor Computational Linguistics.

[NP04] Ani Nenkova and Rebecca Passonneau. Evaluating Con-tent Selection in Summarization: The Pyramid Method. InDaniel Marcu Susan Dumais and Salim Roukos, editors, HLT-NAACL 2004: Main Proc., pages 145–152, Boston, Mas-sachusetts, USA, 2004. Association for Computational Linguis-tics.

[NPM07] Ani Nenkova, Rebecca Passonneau, and Kathleen McKeown.The Pyramid Method: Incorporating human content selectionvariation in summarization evaluation. ACM Trans. SpeechLang. Process., 4(2):4, 2007.

[NV05] Ani Nenkova and Lucy Vanderwende. The impact of frequencyon summarization. Technical Report MSR-TR-2005-101, Mi-crosoft Research, 2005.

[NVM06] Ani Nenkova, Lucy Vanderwende, and Kathleen McKeown. Acompositional context sensitive multi-document summarizer:exploring the factors that influence summarization. In SIGIR’06: Proc. of the 29th annual int. ACM SIGIR conf. on Re-search and development in information retrieval, pages 573–580, New York, NY, USA, 2006. ACM.

[OCA10] Makbule Ozsoy, Ilyas Cicekli, and Ferda Alpaslan. Textsummarization of turkish texts using latent semantic analy-sis. In Proc. of the 23rd International Conference on Com-putational Linguistics (Coling 2010), pages 869–876, Beijing,China, 2010.

214

Page 231: Content Modeling for Automatic Document Summarization · Content Modeling for Automatic Document Summarization ... 6.4 Pairwise similarities of content units and latent ... 7.3 F1

BIBLIOGRAPHY

[ODH07] Paul Over, Hoa Dang, and Donna Harman. DUC in context.Inf. Process. Manage., 43(6):1506–1520, 2007.

[OER05] Jahna Otterbacher, Gunes Erkan, and Dragomir R. Radev. Us-ing random walks for question-focused sentence retrieval. InHLT ’05: Proc. of the conference on Human Language Tech-nology and Empirical Methods in Natural Language Process-ing, pages 915–922, Morristown, NJ, USA, 2005. Associationfor Computational Linguistics.

[OER09] Jahna Otterbacher, Gunes Erkan, and Dragomir R. Radev.Biased lexrank: Passage retrieval using random walks withquestion-based priors. Inf. Process. Manage., 45:42–54, Jan-uary 2009.

[OL02] P. Over and W. Liggett. Introduction to DUC: An intrin-sic evaluation of generic news text summarization systems.In Proc. of the Document Understanding Conference (DUC2002), 2002.

[OLL07] You Ouyang, Sujian Li, and Wenjie Li. Developing learn-ing strategies for topic-based summarization. In Proc. of thesixteenth ACM conference on Conference on information andknowledge management, CIKM ’07, pages 79–86, New York,NY, USA, 2007. ACM.

[ORK06] Jahna Otterbacher, Dragomir Radev, and Omer Kareem.News to go: hierarchical text summarization for mobile de-vices. In SIGIR ’06: Proc. of the 29th annual internationalACM SIGIR conference on Research and development in in-formation retrieval, pages 589–596, New York, NY, USA, 2006.ACM.

[OSM94] Kenji Ono, Kazuo Sumita, and Seiji Miike. Abstract genera-tion based on rhetorical structure extraction. In Proc. of the15th conference on Computational linguistics, pages 344–348,Morristown, NJ, USA, 1994. Association for ComputationalLinguistics.

[PC98] Jay M. Ponte and W. Bruce Croft. A language modeling ap-proach to information retrieval. In SIGIR ’98: Proc. of the21st annual international ACM SIGIR conference on Research

215

Page 232: Content Modeling for Automatic Document Summarization · Content Modeling for Automatic Document Summarization ... 6.4 Pairwise similarities of content units and latent ... 7.3 F1

BIBLIOGRAPHY

and development in information retrieval, pages 275–281, NewYork, NY, USA, 1998. ACM.

[PGK05] Martha Palmer, Daniel Gildea, and Paul Kingsbury. Theproposition bank: An annotated corpus of semantic roles.Comput. Linguist., 31:71–106, March 2005.

[PJ93] Chris D. Paice and Paul A. Jones. The identification of impor-tant concepts in highly structured technical papers. In SIGIR’93: Proc. of the 16th annual international ACM SIGIR con-ference on Research and development in information retrieval,pages 69–78, New York, NY, USA, 1993. ACM.

[PKV07] Prasad Pingali, Rahul K, and Vasudeva Varma. IIIT Hyder-abad at DUC 2007. In Proc. of the Document UnderstandingConference (DUC 2007), 2007.

[PNMS05] R. J Passonneau, A. Nenkova, K. McKeown, and S. Sigel-man. Applying the Pyramid method in DUC 2005. In Proc. ofthe Document Understanding Conference (DUC’05), volume 5,2005.

[Por80] M. F. Porter. An algorithm for suffix stripping. Program,14(3):130–137, 1980.

[Rab90] Lawrence R. Rabiner. A tutorial on hidden markov modelsand selected applications in speech recognition. In Alex Waibeland Kai-Fu Lee, editors, Readings in speech recognition, pages267–296. Morgan Kaufmann Publishers Inc., San Francisco,CA, USA, 1990.

[RD00] Ehud Reiter and Robert Dale. Building natural language gen-eration systems. Cambridge University Press, New York, NY,USA, 2000.

[RH88] U. Reimer and U. Hahn. Text condensation as knowledge baseabstraction. In Proc. of the 4th Conference on Artifical Intel-ligence Applications, pages 338–344, 1988.

[RHM02] Dragomir R. Radev, Eduard Hovy, and Kathleen McKeown.Introduction to the special issue on summarization. Comput.Linguist., 28(4):399–408, 2002.

216

Page 233: Content Modeling for Automatic Document Summarization · Content Modeling for Automatic Document Summarization ... 6.4 Pairwise similarities of content units and latent ... 7.3 F1

BIBLIOGRAPHY

[RJB00] Dragomir R. Radev, Hongyan Jing, and MalgorzataBudzikowska. Centroid-based summarization of multiple docu-ments: sentence extraction, utility-based evaluation, and userstudies. In NAACL-ANLP 2000 Workshop on Automatic sum-marization, pages 21–30, Morristown, NJ, USA, 2000. Associ-ation for Computational Linguistics.

[RJST04] Dragomir R. Radev, Hongyan Jing, Malgorzata Stys, andDaniel Tam. Centroid-based summarization of multiple docu-ments. Inf. Process. Manage., 40:919–938, 2004.

[RJZ89] L. F. Rau, P. S. Jacobs, and U. Zernik. Information extractionand text summarization using linguistic knowledge acquisition.Inf. Process. Manage., 25(4):419–428, 1989.

[RKEA00] Norbert Reithinger, Michael Kipp, Ralf Engel, and JanAlexandersson. Summarizing multilingual spoken negotiationdialogues. In ACL ’00: Proc. of the 38th Annual Meetingon Association for Computational Linguistics, pages 310–317,Morristown, NJ, USA, 2000. Association for ComputationalLinguistics.

[RM98] Dragomir R. Radev and Kathleen R. McKeown. Generat-ing natural language summaries from multiple on-line sources.Comput. Linguist., 24(3):470–500, 1998.

[RRS61] G. J. Rath, A. Resnick, and T. R. Savage. The formation of ab-stracts by the selection of sentences. part I. sentence selectionby men and machines. American Documentation, 12(2):139–141, 1961.

[RY02] Monica Rogati and Yiming Yang. High-performing featureselection for text classification. In CIKM ’02: Proc. of theeleventh international conference on Information and knowl-edge management, pages 659–661, New York, NY, USA, 2002.ACM Press.

[SAB93] Gerard Salton, J. Allan, and Chris Buckley. Approaches topassage retrieval in full text information systems. In SIGIR’93: Proc. of the 16th annual international ACM SIGIR con-ference on Research and development in information retrieval,pages 49–58, New York, NY, USA, 1993. ACM.

[Sch73] Roger C. Schank. Identification of conceptualizations underlying natural language. In R. C. Schank and K. Colby, editors, Computer Models of Thought and Language. W. H. Freeman, San Francisco, 1973.

[SG07] Mark Steyvers and Tom Griffiths. Probabilistic topic models. In T. Landauer, D. McNamara, S. Dennis, and W. Kintsch, editors, Latent Semantic Analysis: A Road to Meaning. Lawrence Erlbaum, 2007.

[SH09] Thomas Strecker and Leonhard Hennig. Automatic layouting of personalized newspaper pages. In Operations Research Proceedings 2008, pages 469–474, Berlin, Heidelberg, September 2009. Springer-Verlag.

[SJ01] Tetsuya Sakai and Karen Sparck Jones. Generic summaries for indexing in information retrieval. In SIGIR ’01: Proc. of the 24th annual international ACM SIGIR conference on Research and development in information retrieval, pages 190–198, New York, NY, USA, 2001. ACM Press.

[SJ08] J. Steinberger and K. Ježek. Sutler: Update summarizer based on latent topics. In Proc. of the Text Analysis Conference, 2008.

[SK08] Frank Schilder and Ravikumar Kondadadi. FastSum: fast and accurate query-based multi-document summarization. In Proc. of the 46th Annual Meeting of the Association for Computational Linguistics on Human Language Technologies: Short Papers, HLT ’08, pages 205–208, Morristown, NJ, USA, 2008. Association for Computational Linguistics.

[Sko72] E. F. Skorokhod’ko. Adaptive method of automatic abstracting and indexing. In Information Processing 71: Proc. of the IFIP Congress 71, pages 1179–1182, 1972.

[SKPSG05] Josef Steinberger, Mijail A. Kabadjov, Massimo Poesio, and Olivia Sanchez-Graillet. Improving LSA-based summarization with anaphora resolution. In Proc. of HLT-EMNLP ’05, pages 1–8, Morristown, NJ, USA, 2005. Association for Computational Linguistics.

[SL02] Horacio Saggion and Guy Lapalme. Generating indicative-informative summaries with SumUM. Comput. Linguist., 28(4):497–526, 2002.

[SM86] Gerard Salton and Michael J. McGill. Introduction to Modern Information Retrieval. McGraw-Hill, Inc., New York, NY, USA, 1986.

[SM02] H. Gregory Silber and Kathleen F. McCoy. Efficiently computed lexical chains as an intermediate representation for automatic text summarization. Comput. Linguist., 28(4):487–496, 2002.

[SM03] Radu Soricut and Daniel Marcu. Sentence level discourse parsing using syntactic and lexical information. In NAACL ’03: Proc. of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, pages 149–156, Morristown, NJ, USA, 2003. Association for Computational Linguistics.

[SNM02] Barry Schiffman, Ani Nenkova, and Kathleen McKeown. Experiments in multidocument summarization. In Proc. of the second international conference on Human Language Technology Research, pages 52–58, San Diego, California, 2002. Morgan Kaufmann Publishers Inc.

[SNM04] Advaith Siddharthan, Ani Nenkova, and Kathleen McKeown. Syntactic simplification for improving content selection in multi-document summarization. In COLING ’04: Proc. of the 20th international conference on Computational Linguistics, page 896, Morristown, NJ, USA, 2004. Association for Computational Linguistics.

[SPKJ07] Josef Steinberger, Massimo Poesio, Mijail A. Kabadjov, and Karel Ježek. Two uses of anaphora resolution in summarization. Inf. Process. Manage., 43:1663–1680, November 2007.

[SR81] Roger C. Schank and Christopher K. Riesbeck. Inside Computer Understanding: Five Programs plus Miniatures. Lawrence Erlbaum, Hillsdale, NJ, 1981.

[SRR08] M. Saravanan, B. Ravindran, and S. Raman. Automatic identification of rhetorical roles using conditional random fields for legal document summarization. In Proc. of the Third International Joint Conference on Natural Language Processing, pages 481–488, 2008.

[SSMB97] G. Salton, A. Singhal, M. Mitra, and C. Buckley. Automatic text structuring and summarization. Inf. Process. Manage., 1997.

[SSMB99] G. Salton, A. Singhal, M. Mitra, and C. Buckley. Automatic text structuring and summarization. In I. Mani and M. T. Maybury, editors, Advances in Automatic Text Summarization, pages 341–355. MIT Press, Cambridge, MA, 1999.

[SSWW99] T. Strzalkowski, G. Stein, J. Wang, and B. Wise. A robust practical text summarizer. In I. Mani and M. T. Maybury, editors, Advances in Automatic Text Summarization, pages 137–154. MIT Press, Cambridge, MA, 1999.

[SSZ+05] Jian-Tao Sun, Dou Shen, Hua-Jun Zeng, Qiang Yang, Yuchang Lu, and Zheng Chen. Web-page summarization using clickthrough data. In SIGIR ’05: Proc. of the 28th annual international ACM SIGIR conference on Research and development in information retrieval, pages 194–201, New York, NY, USA, 2005. ACM.

[TAC09] TAC. Proc. of the Text Analysis Conference 2008–2009. http://www.nist.gov/tac, 2009.

[TBG+07] Kristina Toutanova, Chris Brockett, Michael Gamon, Jagadeesh Jagarlamudi, Hisami Suzuki, and Lucy Vanderwende. The PYTHY summarization system: Microsoft Research at DUC 2007. In Proc. of the Document Understanding Conference (DUC 2007), 2007.

[TH04] S. Teufel and H. van Halteren. Evaluating information content by factoid analysis: human annotation and stability. In Proc. of EMNLP, 2004.

[TJ05] R. I. Tucker and K. Sparck Jones. Between shallow and deep: An experiment in automatic summarising. Technical Report 632, Computer Laboratory, University of Cambridge, Cambridge, England, 2005.

[TM97] S. Teufel and M. Moens. Sentence extraction as a classification task. In ACL/EACL-97 Workshop on Intelligent and Scalable Text Summarization, pages 58–65, 1997.

[TM02] Simone Teufel and Marc Moens. Summarizing scientific articles: experiments with relevance and rhetorical status. Comput. Linguist., 28(4):409–445, 2002.

[TVdBPC04] Gian Lorenzo Thione, Martin Van den Berg, Livia Polanyi, and Chris Culy. Hybrid text summarization: Combining external relevance measures with structural analysis. In Marie-Francine Moens and Stan Szpakowicz, editors, Text Summarization Branches Out: Proc. of the ACL-04 Workshop, pages 51–55, Barcelona, Spain, 2004. Association for Computational Linguistics.

[TYC09] J. Tang, L. Yao, and D. Chen. Multi-topic based query-oriented summarization. In Proc. of SDM ’09, 2009.

[Vap95] Vladimir N. Vapnik. The Nature of Statistical Learning Theory. Springer-Verlag New York, Inc., New York, NY, USA, 1995.

[VBM04] L. Vanderwende, M. Banko, and A. Menezes. Event-centric summary generation. In Proc. of the Document Understanding Conference (DUC 2004), pages 76–81, 2004.

[VCL07] Rakesh Verma, Ping Chen, and Wei Lu. A semantic free-text summarization system using ontology knowledge. In Proc. of the Document Understanding Conference (DUC 2007), 2007.

[vHT03] Hans van Halteren and Simone Teufel. Examining the consensus between human summaries: initial experiments with factoid analysis. In Proc. of the HLT-NAACL 03 Text Summarization Workshop, pages 57–64, Morristown, NJ, USA, 2003. Association for Computational Linguistics.

[Voo03] E. M. Voorhees. Overview of the TREC 2003 question answering track. In Proc. of the Twelfth Text Retrieval Conference (TREC 2003), pages 54–68, 2003.

[VSB06] Lucy Vanderwende, Hisami Suzuki, and Chris Brockett. Microsoft Research at DUC 2006: Task-focused summarization with sentence simplification and lexical expansion. In Proc. of the Document Understanding Conference (DUC 2006), 2006.

[VSBN07] Lucy Vanderwende, Hisami Suzuki, Chris Brockett, and Ani Nenkova. Beyond SumBasic: Task-focused summarization with sentence simplification and lexical expansion. Inf. Process. Manage., 43(6):1606–1618, 2007.

[WAB+07] Robert Wetzker, Tansu Alpcan, Christian Bauckhage, Winfried Umbrath, and Sahin Albayrak. An unsupervised hierarchical method for automated document categorization. In Proc. of the IEEE/WIC/ACM Web Intelligence 2007. IEEE Computer Society Press, 2007.

[Wal06] Hanna M. Wallach. Topic modeling: beyond bag-of-words. In ICML ’06: Proc. of the 23rd international conference on Machine learning, pages 977–984, New York, NY, USA, 2006. ACM.

[Wei99] Y. Weiss. Segmentation using eigenvectors: A unifying view. In International Conference on Computer Vision, 1999.

[Wet09] Robert Wetzker. Graph-Based Recommendation in Broad Folksonomies. PhD thesis, Technische Universität Berlin, 2009.

[WL03] Chia-Wei Wu and Chao-Lin Liu. Ontology-based text summarization for business news articles. In Computers and Their Applications 2003, pages 389–392, 2003.

[WLZD08] Dingding Wang, Tao Li, Shenghuo Zhu, and Chris Ding. Multi-document summarization via sentence-level semantic analysis and symmetric matrix factorization. In Proc. of the 31st annual international ACM SIGIR conference on Research and development in information retrieval, SIGIR ’08, pages 307–314, New York, NY, USA, 2008. ACM.

[WM06] Xuerui Wang and Andrew McCallum. Topics over time: a non-Markov continuous-time model of topical trends. In Proc. of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, KDD ’06, pages 424–433, New York, NY, USA, 2006. ACM.

[Wra02] A. Wray. Formulaic Language and the Lexicon. Cambridge University Press, Cambridge, UK, 2002.

[WWLL09] Wei Wang, Furu Wei, Wenjie Li, and Sujian Li. HyperSum: hypergraph based semi-supervised sentence ranking for query-oriented summarization. In Proc. of the 18th ACM conference on Information and knowledge management, CIKM ’09, pages 1855–1858, New York, NY, USA, 2009. ACM.

[WY08] Xiaojun Wan and Jianwu Yang. Multi-document summarization using cluster-based link analysis. In Proc. of the 31st annual international ACM SIGIR conference on Research and development in information retrieval, SIGIR ’08, pages 299–306, New York, NY, USA, 2008. ACM.

[WYX06] Xiaojun Wan, Jianwu Yang, and Jianguo Xiao. Using cross-document random walks for topic-focused multi-document summarization. In Proc. of the 2006 IEEE/WIC/ACM International Conference on Web Intelligence, WI ’06, pages 1012–1018, Washington, DC, USA, 2006. IEEE Computer Society.

[YGVS07] Wen-tau Yih, Joshua Goodman, Lucy Vanderwende, and Hisami Suzuki. Multi-document summarization by maximizing informative content-words. In IJCAI ’07: Proc. of the 20th international joint conference on Artificial Intelligence, pages 1776–1782, San Francisco, CA, USA, 2007. Morgan Kaufmann Publishers Inc.

[YKYM05] Jen-Yuan Yeh, Hao-Ren Ke, Wei-Pang Yang, and I-Heng Meng. Text summarization using a trainable summarizer and latent semantic analysis. Inf. Process. Manage., 41(1):75–95, 2005.

[ZDL+05] D. Zajic, B. Dorr, J. Lin, C. Monz, and R. Schwartz. A sentence-trimming approach to multi-document summarization. In Proc. of the Document Understanding Conference (DUC 2005), 2005.

[ZDLS07] David Zajic, Bonnie J. Dorr, Jimmy Lin, and Richard Schwartz. Multi-candidate reduction: Sentence compression as a tool for document summarization tasks. Inf. Process. Manage., 43(6):1549–1570, 2007.

[Zec01] Klaus Zechner. Automatic generation of concise summaries of spoken dialogues in unrestricted domains. In SIGIR ’01: Proc. of the 24th annual international ACM SIGIR conference on Research and development in information retrieval, pages 199–207, New York, NY, USA, 2001. ACM.

[Zha02] Hongyuan Zha. Generic summarization and keyphrase extraction using mutual reinforcement principle and sentence clustering. In SIGIR ’02: Proc. of the 25th annual international ACM SIGIR conference on Research and development in information retrieval, pages 113–120, New York, NY, USA, 2002. ACM.

[Zip35] George K. Zipf. The Psychobiology of Language. Houghton-Mifflin, New York, NY, 1935.
