Processing of large document collections
Part 3
Text summarization
”Process of distilling the most important information from a source to produce an abridged version for a particular user or task”
Text summarization
Many everyday uses:
headlines (from around the world)
outlines (notes for students)
minutes (of a meeting)
reviews (of books, movies)
...
Architecture of a text summarization system
Input:
a single document or multiple documents
text, images, audio, video
database
Architecture of a text summarization system
Output:
extract or abstract
compression rate: ratio of summary length to source length
connected text or fragmentary
generic or user-focused/domain-specific
indicative or informative
Architecture of a text summarization system
Three phases:
analyzing the input text
transforming it into a summary representation
synthesizing an appropriate output form
Condensation operations
Selection of more salient or non-redundant information
aggregation of information (e.g. from different parts of the source, or of different linguistic descriptions)
generalization of specific information with more general, abstract information
The level of processing
surface level
entity level
discourse level
Surface-level approaches
Tend to represent information in terms of shallow features
the features are then selectively combined together to yield a salience function used to extract information
Surface level
Shallow features:
thematic features: presence of statistically salient terms, based on term frequency statistics
location: position in text, position in paragraph, section depth, particular sections
background: presence of terms from the title or headings in the text, or from the user's query
Surface level
Cue words and phrases:
"in summary", "our investigation"
emphasizers like "important", "in particular"
domain-specific bonus (+) and stigma (-) terms
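As a sketch of how such shallow features might be combined into a salience function, consider the following; the feature weights and the cue-word sets are illustrative assumptions, not values from the literature:

```python
# Illustrative single-word cue terms; real systems use curated phrase lists.
BONUS = {"important", "significant"}
STIGMA = {"hardly", "impossible"}

def salience(words, position, n_sentences, term_freq):
    """Combine thematic, location, and cue features into one score.
    All weights (0.5, 0.3, 0.2) are arbitrary illustrative choices."""
    # thematic feature: total corpus frequency of the sentence's terms
    thematic = sum(term_freq.get(w, 0) for w in words)
    # location feature: earlier sentences score higher
    location = 1.0 - position / max(n_sentences - 1, 1)
    # cue feature: +1 per bonus term, -1 per stigma term
    cue = sum(w in BONUS for w in words) - sum(w in STIGMA for w in words)
    return 0.5 * thematic + 0.3 * location + 0.2 * cue
```

Sentences are then ranked by this score, and the highest-scoring ones are extracted.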
Entity-level approaches
Build an internal representation for text
modeling text entities and their relationships
tend to represent patterns of connectivity in the text to help determine what is salient
Relationships between entities
similarity (e.g. vocabulary overlap)
proximity (distance between text units)
co-occurrence (words related based on their occurring in common contexts)
thesaural relationships among words (synonymy, hypernymy, part-of relations)
co-reference (of referring expressions such as noun phrases)
Relationships between entities
Logical relationships (agreement, contradiction, entailment, consistency)
syntactic relations (based on parse trees)
meaning representation-based relations (e.g. based on predicate-argument relations)
Discourse-level approaches
Model the global structure of the text and its relation to communicative goals
structure can include:
format of the document (e.g. hypertext markup)
threads of topics as they are revealed in the text
rhetorical structure of the text, such as argumentation or narrative structure
Classical approaches
Luhn '58, Edmundson '69
Luhn’s method
Filter terms in the document using a stoplist
Terms are normalized by aggregating together orthographically similar terms
Frequencies of aggregated terms are calculated and non-frequent terms are removed
Luhn’s method
Sentences are weighted using the resulting set of ”significant” terms and a term density measure
each sentence is divided into segments bracketed by significant terms not more than 4 non-significant terms apart
Luhn’s method
each segment is scored by taking the square of the number of bracketed significant terms divided by the total number of bracketed terms
the score of the highest scoring segment is taken as the sentence score
the highest scoring sentences are chosen for the summary
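The segment scoring above can be sketched as follows; this is a minimal reimplementation of the description in the slides, with the gap of 4 non-significant terms as the default:

```python
def luhn_score(words, significant, gap=4):
    """Luhn '58 sentence score: find segments bracketed by significant
    terms at most `gap` non-significant terms apart; a segment scores
    (number of significant terms)^2 / (total terms in the segment),
    and the sentence score is the best segment score."""
    best = 0.0
    start = last = None   # bounds of the current segment
    count = 0             # significant terms in the current segment
    for i, word in enumerate(words):
        if word in significant:
            if last is not None and i - last - 1 > gap:
                start, count = i, 0   # gap too large: start a new segment
            elif start is None:
                start = i
            count += 1
            last = i
            best = max(best, count * count / (last - start + 1))
    return best
```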
Edmundson’s method
Extends earlier work to look at three features in addition to word frequencies:
cue phrases (e.g. "significant", "impossible", "hardly")
title and heading words
location
Edmundson’s method
Programs to weight sentences based on each of the four methods separately
programs were evaluated by comparison against manually created extracts
corpus-based methodology: training set and test set; in the training phase, weights were manually readjusted
Edmundson’s method
Results:
the three additional features dominated word frequency measures
the combination of cue-title-location was the best, with location being the best individual feature
keywords alone was the worst
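Edmundson's final scorer is a linear combination of the four feature values; a sketch with hypothetical weights (his actual weights were readjusted by hand on the training set):

```python
def edmundson_score(cue, key, title, location, w=(1.0, 0.5, 1.0, 1.5)):
    """Weighted sum of cue (C), key (K), title (T), and location (L)
    feature values; the weight vector w is an illustrative assumption,
    with location weighted highest per the reported results."""
    return w[0] * cue + w[1] * key + w[2] * title + w[3] * location
```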
Fundamental issues
What are the most powerful but also most general features to exploit for summarization?
How do we combine these features?
How can we evaluate how well we are doing?
Corpus-based approaches
In the classical methods, various features (thematic features, title, location, cue phrase) were used to determine the salience of information for summarization
an obvious issue: determine the relative contribution of different features to any given text summarization task
Corpus-based approaches
Contribution is dependent on the text genre, e.g. location:
in newspaper stories, the leading text often contains a summary
in TV news, a preview segment may contain a summary of the news to come
in scientific text: an author-written abstract
Corpus-based approaches
The importance of different text features for any given summarization problem can be determined by counting the occurrences of such features in text corpora
in particular, analysis of human-generated summaries, along with their full-text sources, can be used to learn rules for summarization
Corpus-based approaches
One could use a corpus to model particular components, without using a completely trainable approach
e.g. a corpus can be used to compute weights (TFIDF)
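For example, TFIDF weights can be estimated from a corpus as follows; this is a standard textbook formulation, and the exact variant used in any given summarizer may differ:

```python
import math

def tfidf(term, doc, corpus):
    """tf-idf of `term` in `doc` (a token list) relative to `corpus`
    (a list of token lists): term frequency times inverse document
    frequency."""
    tf = doc.count(term) / len(doc)
    df = sum(term in d for d in corpus)          # document frequency
    idf = math.log(len(corpus) / df) if df else 0.0
    return tf * idf
```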
Corpus-based approaches
Challenges:
creating a suitable text corpus, designing an annotation scheme
ensuring a suitable set of summaries is available
may already be available: scientific papers
if not: author, professional abstractor, judge
KPC method
Kupiec, Pedersen, Chen (1995): A Trainable Document Summarizer
a learning method using a corpus of abstracts written by professional human abstractors (Engineering Information Co.)
naïve Bayesian classification method is used
KPC method: features
Sentence-length cut-off feature:
given a threshold (e.g. 5 words), the feature is true for all sentences longer than the threshold, and false otherwise
Fixed-phrase feature:
true for sentences that contain any of 26 indicator phrases (e.g. "this letter…", "In conclusion…"), or that follow a section head that contains specific keywords (e.g. "results", "conclusion")
KPC method: features
Paragraph feature:
sentences in the first 10 paragraphs and the last 5 paragraphs in a document get a higher value
within paragraphs, paragraph-initial, paragraph-final, and paragraph-medial sentences are distinguished
KPC method: features
Thematic word feature:
a small number of thematic words (the most frequent content words) are selected
each sentence is scored as a function of the frequency of the thematic words
the highest scoring sentences are selected
binary feature: true for a sentence if the sentence is present in the set of highest scoring sentences
KPC method: features
Uppercase word feature:
proper names and explanatory text for acronyms are usually important
the feature is computed like the thematic word feature
an uppercase thematic word is not sentence-initial, begins with a capital letter, and must occur several times
the first occurrence is scored twice as much as later occurrences
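Three of these features can be sketched for a single sentence as follows; this is a simplified illustration: only two of the 26 indicator phrases are listed, and the thematic feature is reduced to a membership test rather than the full highest-scoring-sentences computation:

```python
INDICATOR_PHRASES = ("this letter", "in conclusion")  # 2 of the 26

def kpc_features(sentence, thematic_words, threshold=5):
    """Binary sentence-length, fixed-phrase, and thematic features."""
    lowered = sentence.lower()
    words = lowered.split()
    return {
        "length": len(words) > threshold,
        "fixed_phrase": any(p in lowered for p in INDICATOR_PHRASES),
        "thematic": any(w.strip(".,") in thematic_words for w in words),
    }
```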
KPC method: classifier
For each sentence, we compute the probability it will be included in a summary S given the k features Fj, j=1…k
the probability can be expressed using Bayes' rule:

P(s ∈ S | F1, …, Fk) = P(F1, …, Fk | s ∈ S) · P(s ∈ S) / P(F1, …, Fk)
KPC method: classifier
Assuming statistical independence of the features:

P(s ∈ S | F1, …, Fk) = [ Π_j P(Fj | s ∈ S) ] · P(s ∈ S) / Π_j P(Fj)

P(s ∈ S) is a constant, and P(Fj | s ∈ S) and P(Fj) can be estimated directly from the training set by counting occurrences
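The classifier amounts to a few lines of code; here is a sketch, where the feature probabilities would come from counting occurrences in the training set (the numbers in the test are made up):

```python
def posterior(observed, p_given_s, p_marginal, p_s):
    """P(s in S | F1..Fk) under the independence assumption:
    multiply per-feature likelihoods, using 1 - p when a binary
    feature is observed false."""
    num, den = p_s, 1.0
    for f, ps, pm in zip(observed, p_given_s, p_marginal):
        num *= ps if f else (1.0 - ps)
        den *= pm if f else (1.0 - pm)
    return num / den
```

Sentences are ranked by this posterior, and the top-scoring ones are extracted.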
KPC method: corpus
Corpus is acquired from Engineering Information Co, which provides abstracts of technical articles to online information services
articles do not have author-written abstracts
abstracts were created by professional abstractors
KPC method: corpus
188 document/summary pairs sampled from 21 publications in the scientific/technical domain
summaries are mainly indicative, average length is 3 sentences
average number of sentences in the original documents is 86
author, address, and bibliography were removed
KPC method: sentence matching
The abstracts from the human abstractors are not extracts but inspired by the original sentences
the automatic summarization task here: extract sentences that the human abstractor might have chosen to prepare the summary text (with minor modifications…)
KPC method: sentence matching
For training, a correspondence between the manual summary sentences and sentences in the original document needs to be obtained
matching can be done in several ways
KPC method: sentence matching
matching can be done in several ways:
a direct sentence match: the same sentence is found in both
a direct join: 2 or more original sentences were used to form a summary sentence
a summary sentence can be 'unmatchable'
a summary sentence (single or joined) can be 'incomplete'
KPC method: sentence matching
Matching was done in two passes:
first, the best one-to-one sentence matches were found automatically (79%)
second, these matches were used as a starting point for the manual assignment of correspondences
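The automatic first pass can be approximated with word overlap; the exact matching criterion is not given in the slides, so the overlap measure below is an assumption:

```python
def best_matches(summary_sents, doc_sents):
    """For each summary sentence, return the index of the document
    sentence with the largest word overlap (first-pass candidates
    for one-to-one matching)."""
    matches = []
    for s in summary_sents:
        sw = set(s.lower().split())
        scores = [len(sw & set(d.lower().split())) for d in doc_sents]
        matches.append(scores.index(max(scores)))
    return matches
```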
KPC method: evaluation
Cross-validation strategy for evaluation:
documents from a given journal were selected for testing one at a time; all other document/summary pairs were used for training
unmatchable and incomplete sentences were excluded
total of 498 unique sentences
KPC method: evaluation
Two ways of evaluation:
the fraction of manual summary sentences that were faithfully reproduced by the summarizer program (the summarizer produced the same number of sentences as were in the corresponding manual summary) -> 35%
83% is the highest possible value, since unmatchable and incomplete sentences were excluded
KPC method: evaluation
The fraction of the matchable sentences that were correctly identified by the summarizer -> 42%
the effect of different features was also studied
best combination (44%): paragraph, fixed-phrase, sentence-length
baseline: selecting sentences from the beginning of the document (result: 24%)
if 25% of the original sentences are selected: 84%
Discourse-based approaches
Discourse structure appears to play an important role in the strategies used by human abstractors and in the structure of their abstracts
an abstract is not just a collection of sentences, but has an internal structure
-> the abstract should be coherent and should represent some of the argumentation used in the source
Discourse models
Cohesion: relations between words or referring expressions, which determine how tightly connected the text is
anaphora, ellipsis, synonymy, hypernymy (dog is-a-kind-of animal)
Coherence: overall structure of a multi-sentence text in terms of macro-level relations between sentences (e.g. "although" -> contrast)
Boguraev, Kennedy (BG)
Goal: identify those phrasal units across the entire span of the document that best function as representative highlights of the document’s content
these phrasal units are called topic stamps
a set of topic stamps is called capsule overview
BG
A capsule overview:
not a set/sequence of sentences
a semi-formal (normalised) representation of the document, derived after a process of data reduction over the original text
not always very readable, but still represents the flow of the narrative
can be combined with surrounding information to produce a more coherent presentation
BG
Primary consideration: methods should apply to any document type and source (domain independence)
also: efficient and scalable technology
shallow syntactic analysis, no comprehensive parsing engine needed
BG
Based on findings on technical terms:
technical terms have linguistic properties that can be used to find terms automatically in different domains quite reliably
technical terms seem to be topical
task of content characterization: identifying phrasal units that have lexico-syntactic properties similar to technical terms, and discourse properties that signify their status as most prominent
BG: terms as content indicators
Problems: undergeneration, overgeneration, differentiation
Undergeneration
a set of phrases should contain an exhaustive description of all the entities that are discussed in the text
the set of technical terms has to be extended to include also expressions with pronouns etc.
Overgeneration
already the set of technical terms can be large
extensions make the information overload even worse
solution: phrases that refer to one participant in the discourse are combined with referential links
Differentiation
The same list of terms may be used to describe two documents, even if they, e.g., focus on different subtopics
it is necessary to differentiate term sets not only according to their membership, but also according to the relative representativeness of the terms they contain
Term sets and coreference classes
Phrases are extracted using a phrasal grammar (e.g. a noun with modifiers)
expressions with pronouns and incomplete expressions are also extracted
using a (Lingsoft) tagger that provides information about the part of speech, number, gender, and grammatical function of tokens in a text
solves the undergeneration problem
Term sets and coreference classes
The phrase set has to be reduced to solve the problem of overgeneration
-> a smaller set of expressions that uniquely identify the objects referred to in the text
application of anaphora resolution: e.g. to which noun does the pronoun 'he' refer?
Resolving coreferences
Procedure:
moving through the text sentence by sentence and analysing the nominal expressions in each sentence from left to right
either an expression is identified as a new participant in the discourse, or it is taken to refer to a previously mentioned referent
Resolving coreferences
Coreference is determined by a 3-step procedure:
a set of candidates is collected: all nominals within a local segment of discourse
some candidates are eliminated due to morphological mismatches or syntactic restrictions
the remaining candidates are ranked according to their relative salience in the discourse
Salience factors
sent(term) = 100 iff term is in the current sentence
cntx(term) = 50 iff term is in the current discourse segment
subj(term) = 80 iff term is a subject
acc(term) = 50 iff term is a direct object
dat(term) = 40 iff term is an indirect object
...
Local salience of a candidate
The local salience of a candidate is the sum of the values of the salience factors
the most salient candidate is selected as the antecedent for the anaphor
if the coreference link cannot be established to some other expression, the nominal is taken to introduce a new referent
-> coreferent classes
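A minimal sketch of this step, using the factor values listed above; the factor set here is truncated, and real systems include more factors plus decay over sentence distance:

```python
# Salience-factor values from the slides (truncated list).
FACTORS = {"sent": 100, "cntx": 50, "subj": 80, "acc": 50, "dat": 40}

def local_salience(factors):
    """Sum the values of the salience factors that hold for a candidate."""
    return sum(FACTORS[f] for f in factors)

def pick_antecedent(candidates):
    """candidates: (nominal, set-of-factor-names) pairs; the most
    salient candidate is chosen as the antecedent."""
    return max(candidates, key=lambda c: local_salience(c[1]))[0]
```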
Topic stamps
In order to further reduce the referent set, some additional structure has to be imposed
the term set is ranked according to the salience of its members
relative prominence or importance in the discourse of the entities to which they refer
objects in the centre of discussion have a high degree of salience
Saliency
Measured like local salience in coreference resolution, but tries to measure the importance of unique referents in the discourse
Priest is charged with Pope attack
A Spanish priest was charged here today with attempting to murder the Pope. Juan Fernandez Krohn, aged 32, was arrested after a man armed with a bayonet approached the Pope while he was saying prayers at Fatima on Wednesday night.
According to the police, Fernandez told the investigators today that he trained for the past six months for the assault. He was alleged to have claimed the Pope ’looked furious’ on hearing the priest’s criticism of his handling of the church’s affairs. If found guilty, the Spaniard faces a prison sentence of 15-20 years.
Saliency
’priest’ is the primary element
eight references to the same actor in the body of the story
these references occur in important syntactic positions: 5 are subjects of main clauses, 2 are subjects of embedded clauses, 1 is a possessive
’Pope attack’ is also important
’Pope’ occurs 5 times, but not in such important positions (2 are direct objects)
Discourse segments
If the intention is to use very concise descriptions of one or two salient phrases, i.e. topic stamps, longer texts have to be broken down into smaller segments
topically coherent, contiguous segments can be found by using a lexical similarity measure
assumption: the distribution of words used changes when the topic changes
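A TextTiling-style sketch of this idea: compare the vocabularies of adjacent windows of sentences, and treat low-similarity points as candidate topic boundaries. The window size and the Jaccard measure are illustrative choices, not the specific measure used by BG:

```python
def cohesion_scores(sentences, window=2):
    """Jaccard similarity between the vocabularies of adjacent
    windows of sentences; dips suggest topic boundaries."""
    def vocab(sents):
        return {w for s in sents for w in s.lower().split()}
    scores = []
    for i in range(window, len(sentences) - window + 1):
        left = vocab(sentences[i - window:i])
        right = vocab(sentences[i:i + window])
        scores.append(len(left & right) / max(len(left | right), 1))
    return scores
```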
BG: Summarization process
linguistic analysis
discourse segmentation
extended phrase analysis
anaphora resolution
calculation of discourse salience
topic stamp identification
capsule overview
Knowledge-rich approaches
Structured information can be used as the starting point for summarization
structured information: e.g. data and knowledge bases, which may have been produced by processing input text
the summarizer does not have to address the linguistic complexities and variability of the input, but the structure of the input text is not available either
Knowledge-rich approaches
There is a need for measures of salience and relevance that are dependent on the knowledge source
addressing coherence, cohesion, and fluency becomes the entire responsibility of the generator
STREAK, PLANDOC
McKeown, Robin, Kukich (1995): Generating concise natural language summaries
goal: folding information from multiple facts into a single sentence using concise linguistic constructions
STREAK
Produces summaries of basketball games
first creates a draft of essential facts
then uses revision rules constrained by the draft wording to add in additional facts as the text allows
STREAK
Input:
a set of box scores for a basketball game
historical information (from a database)
task: summarize the highlights of the game, underscoring their significance in the light of previous games
output: a short summary: a few sentences
STREAK
The box score input is represented as a conceptual network that expresses relations between what were the columns and rows of the table
essential facts: the game result, its location, date and at least one final game statistic (the most remarkable statistic of a winning team player)
STREAK
Essential facts can be obtained directly from the box-score
in addition, other potential facts:
other notable game statistics of individual players - from the box score
game result streaks (Utah recorded its fourth straight win) - historical
extremum performances such as maximums or minimums - historical
STREAK
Essential facts are always included
potential facts are included if there is space
the decision on the potential facts to be included could be based on the possibility of combining the facts with the essential information in cohesive and stylistically successful ways
STREAK
Given facts:
Karl Malone scored 39 points.
Karl Malone's 39 point performance is equal to his season high
a single sentence is produced:
Karl Malone tied his season high with 39 points
PLANDOC
Produces summaries of telephone network planning activity
uses discourse planning, looking ahead in its text plan to group together facts which can be expressed concisely using conjunction and deleting repetitions
PLANDOC
The system must produce a report documenting how an engineer, using a sophisticated software planning system, investigated what new technology is needed in a telephone route to meet demand
PLANDOC
Input: a trace of user interaction with the planning
system software PLANoutput:
1-2 page report, including a paragraph summary of PLAN’s solution, a summary of refinements than an engineer made to the system solution, and a closing paragraph summarizing the engineer’s final proposition
Summary generation
Summaries must convey maximal information in a minimal amount of space
requires the use of complex sentence structures:
multiple modifiers of a noun or a verb
conjunction ('and')
ellipsis (deletion of repetitions)
selection of words that convey multiple aspects of the information