The PYTHY Summarization System: Microsoft Research at DUC 2007

Slide transcript. Kristina Toutanova, Chris Brockett, Michael Gamon, Jagadeesh Jagarlamudi, Hisami Suzuki, and Lucy Vanderwende. Microsoft Research, April 26, 2007.
The PYTHY Summarization System: Microsoft Research at DUC 2007
Kristina Toutanova, Chris Brockett, Michael Gamon, Jagadeesh Jagarlamudi,
Hisami Suzuki, and Lucy Vanderwende
Microsoft Research
April 26, 2007
DUC Main Task Results
• Automatic Evaluations (30 participants)
• Human Evaluations
• Did pretty well on both measures
Automatic evaluations:

| Criterion | Rank | Score |
| --- | --- | --- |
| ROUGE-2 | 2 | 0.12028 |
| ROUGE-SU4 | 3 | 0.17074 |

Human evaluations:

| Criterion | Rank |
| --- | --- |
| Pyramid | 1= |
| Content | 5= |
Overview of PYTHY
• Linear sentence ranking model
• Learns to rank sentences based on:
  • ROUGE scores against model summaries
  • Semantic Content Unit (SCU) weights of sentences selected by past peers
• Considers simplified sentences alongside original sentences
Score(s) = Σ_{k=1..K} w_k · f_k(s)
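The linear model above scores each candidate sentence as a weighted sum of its feature values. A minimal sketch of that computation; the feature names and weights here are illustrative, not PYTHY's actual inventory:

```python
def score(sentence_features, weights):
    """Linear sentence score: sum over features of weight * feature value."""
    return sum(weights.get(name, 0.0) * value
               for name, value in sentence_features.items())

# Illustrative features for one candidate sentence (hypothetical names).
features = {"cluster_freq": 0.4, "position_first": 1.0, "length_ok": 1.0}
weights = {"cluster_freq": 2.0, "position_first": 0.5, "length_ok": 0.3}
```

Any feature absent from the weight vector simply contributes nothing, so new features can be added without changing the scorer.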
[Diagram: PYTHY training pipeline. Docs → Sentences and Simplified Sentences → Feature inventory; Targets (ROUGE Oracle, Pyramid/SCU, ROUGE × 2) → Ranking/Training → Model]
[Diagram: PYTHY testing pipeline. Docs → Sentences and Simplified Sentences → Feature inventory → Model → Search with Dynamic Scoring → Summary]
Sentence Simplification
• Extension of the simplification method from DUC06
• Provides sentence alternatives rather than deterministically simplifying a sentence
• Uses syntax-based heuristic rules
• Simplified sentences evaluated alongside originals
• In DUC 2007:
  • Average new candidates generated: 1.38 per sentence
  • Simplified sentences generated for 61% of all sentences
  • Simplified sentences in final output: 60%
Sentence-Level Features
• SumFocus features: SumBasic (Nenkova et al. 2006) + task focus
  • cluster frequency and topic frequency
  • only these were used in MSR's DUC06 system
• Other content-word unigrams: headline frequency
• Sentence length features (binary)
• Sentence position features (real-valued and binary)
• N-grams (bigrams, skip bigrams, multiword phrases)
• All tokens (topic and cluster frequency)
• Simplified sentences (binary, and ratio of relative length)
• Inverse document frequency (idf)
Pairwise Ranking
• Define preferences for sentence pairs
  • defined using human summaries and SCU weights
• Log-linear ranking objective used in training
• Maximize the probability of choosing the better sentence from each pair of comparable sentences
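The log-linear pairwise objective can be sketched as follows: the model assigns probability σ(w · (f(better) − f(worse))) to preferring the better sentence, and training increases the log of that probability. Feature vectors and the learning rate here are toy values, not the system's actual training code:

```python
import math

def pair_prob(w, f_better, f_worse):
    """P(better ≻ worse) under a log-linear ranking model:
    sigmoid of the weighted feature-difference margin."""
    margin = sum(wi * (a - b) for wi, a, b in zip(w, f_better, f_worse))
    return 1.0 / (1.0 + math.exp(-margin))

def sgd_step(w, f_better, f_worse, lr=0.1):
    """One gradient-ascent step on log P(better ≻ worse)."""
    p = pair_prob(w, f_better, f_worse)
    return [wi + lr * (1.0 - p) * (a - b)
            for wi, a, b in zip(w, f_better, f_worse)]

# One toy preference pair: the first sentence scores higher on both features.
w = [0.0, 0.0]
w = sgd_step(w, [1.0, 0.5], [0.2, 0.1])
```

After the step the model prefers the better sentence of the pair with probability above one half, which is the direction the objective pushes every comparable pair.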
[Dekel et al. 03], [Burges et al. 05]
ROUGE Oracle Metric
• Find an oracle extractive summary
  • the summary with the highest average ROUGE-2 and ROUGE-SU4 scores
• All sentences in the oracle are considered “better” than any sentence not in the oracle
• An approximate greedy search is used to find the oracle summary
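The greedy approximation grows the extract one sentence at a time, always adding the sentence that most improves the score. A self-contained sketch, using a toy bigram-recall stand-in for the real ROUGE-2/ROUGE-SU4 average (function names are mine):

```python
def bigrams(tokens):
    """Set of adjacent word pairs in a token list."""
    return set(zip(tokens, tokens[1:]))

def rouge2_recall(tokens, ref_tokens):
    """Toy stand-in for ROUGE-2: fraction of reference bigrams covered."""
    ref = bigrams(ref_tokens)
    return len(bigrams(tokens) & ref) / len(ref) if ref else 0.0

def greedy_oracle(sentences, ref_tokens, max_sents=2):
    """Greedily add the sentence that most improves the summary score."""
    chosen = []
    for _ in range(max_sents):
        current = rouge2_recall([t for s in chosen for t in s], ref_tokens)
        best = None
        for s in sentences:
            if s in chosen:
                continue
            gain = rouge2_recall([t for c in chosen + [s] for t in c],
                                 ref_tokens)
            if gain > current:
                best, current = s, gain
        if best is None:  # no sentence improves the score further
            break
        chosen.append(best)
    return chosen
```

Greedy selection is only an approximation: it can miss the true best extract when two sentences help jointly but not individually, which is why the slide calls the search approximate.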
Pyramid-Derived Metric
• University of Ottawa SCU-annotated corpus (Copeck et al. 06)
• Some sentences in the 05 & 06 document collections are:
  • known to contain certain SCUs
  • known not to contain any SCUs
• Sentence score is the sum of the weights of all its SCUs
  • for un-annotated sentences, the score is undefined
• A sentence pair s1 > s2 is constructed for training iff w(s1) > w(s2)
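The pair construction above can be sketched directly: sum the SCU weights per sentence, skip sentences whose score is undefined, and emit a preference pair wherever one score strictly exceeds another. SCU names and weights below are hypothetical:

```python
def scu_score(sentence_scus, scu_weights):
    """Sum of weights of all SCUs a sentence contains.
    Returns None for un-annotated sentences (score undefined)."""
    if sentence_scus is None:
        return None
    return sum(scu_weights[scu] for scu in sentence_scus)

def training_pairs(sentences, scu_weights):
    """Emit (better, worse) pairs: s1 > s2 iff w(s1) > w(s2),
    skipping sentences with undefined scores."""
    scored = [(s, scu_score(scus, scu_weights)) for s, scus in sentences]
    scored = [(s, w) for s, w in scored if w is not None]
    return [(a, b) for a, wa in scored for b, wb in scored if wa > wb]

# Toy corpus: s3 is un-annotated, s4 is known to contain no SCUs.
pairs = training_pairs(
    [("s1", ["A"]), ("s2", ["B"]), ("s3", None), ("s4", [])],
    {"A": 3, "B": 1})
```

Note the asymmetry the slide implies: a sentence annotated with no SCUs gets score 0 and participates in pairs, while an un-annotated sentence is excluded entirely.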
Model Frequency Metrics
• Based on unigram and skip-bigram frequency
• Computed for content words only
• Sentence s_i is “better” than s_j if:
w(s_i) > w(s_j), where w(s) = Σ_k p̂_models(c_k) over the content words c_k of s
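A sketch of the model-frequency weight for the unigram case: estimate a word distribution from the human (model) summaries, then score a sentence by summing the estimated probability of each of its content words. Function names and the toy summaries are mine:

```python
from collections import Counter

def model_unigram_probs(model_summaries):
    """Empirical unigram distribution over the human (model) summaries,
    given as lists of content-word tokens."""
    counts = Counter(tok for summ in model_summaries for tok in summ)
    total = sum(counts.values())
    return {tok: c / total for tok, c in counts.items()}

def sentence_weight(content_words, p_models):
    """w(s) = sum over content words of their model-summary probability.
    Words never seen in the models contribute zero."""
    return sum(p_models.get(w, 0.0) for w in content_words)

# Toy model summaries: "a" appears twice, "b" and "c" once each.
probs = model_unigram_probs([["a", "b"], ["a", "c"]])
```

The skip-bigram variant the slide mentions would follow the same shape with word pairs in place of unigrams.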
Combining Multiple Metrics

• From the ROUGE oracle: all sentences in the oracle summary are better than other sentences
• From SCU annotations: sentences with higher average SCU weights are better
• From model frequency: sentences with words occurring in the model summaries are better
• Combined loss: add the losses according to all metrics

D_1 = {(i, j) : s_i ≻ s_j}  (ROUGE oracle)
D_2 = {(i, j) : s_i ≻ s_j}  (SCU annotations)
D_3 = {(i, j) : s_i ≻ s_j}  (model frequency)

L = L(D_1) + L(D_2) + L(D_3)
[Diagram: PYTHY testing pipeline. Docs → Sentences and Simplified Sentences → Feature inventory → Model → Search with Dynamic Scoring → Summary]
Dynamic Sentence Scoring
• Eliminate redundancy by re-weighting
• Similar to SumBasic (Nenkova et al. 2006): re-weighting given previously selected sentences
• Discounts features that decompose into word-frequency estimates
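A minimal sketch of the SumBasic-style discount, assuming the squared-probability update of Nenkova et al. (2006): once a word appears in a selected sentence, its probability is squared so that sentences repeating it score lower on the next round.

```python
def reweight(word_probs, selected_sentence):
    """SumBasic-style discount: square the probability of every word
    that occurred in the just-selected sentence; leave others unchanged."""
    seen = set(selected_sentence)
    return {w: (p * p if w in seen else p) for w, p in word_probs.items()}

# Toy distribution; selecting a sentence containing "a" discounts only "a".
probs = {"a": 0.5, "b": 0.2}
updated = reweight(probs, ["a", "a"])
```

Because probabilities are below one, squaring always shrinks them, which is exactly the redundancy penalty the slide describes.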
Search
• The search constructs partial summaries and scores them:
• The score of a summary does not decompose into an independent sum of sentence scores
  • global dependencies make exact search hard
• Used multiple beams, one for each length of partial summary [McDonald 2007]
Score(s_1, s_2, …, s_n) = Σ_{i=1..n} score(s_i | s_1, …, s_{i−1})
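The multiple-beam idea above can be sketched as follows: because each sentence's contribution depends on what was already selected, the search keeps a separate beam of top partial summaries for every summary length and extends each one. The scoring function here is a toy non-decomposable score of my own, not the system's:

```python
def beam_search(sentences, score_fn, max_len=3, beam_width=2):
    """Beam search over partial summaries, with one beam per summary
    length, since scores do not decompose into per-sentence sums."""
    beams = {0: [((), 0.0)]}
    for length in range(max_len):
        candidates = []
        for partial, _ in beams[length]:
            for s in sentences:
                if s not in partial:
                    summary = partial + (s,)
                    candidates.append((summary, score_fn(summary)))
        if not candidates:
            break
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams[length + 1] = candidates[:beam_width]
    # Return the best-scoring summary of any length.
    return max((c for b in beams.values() for c in b), key=lambda c: c[1])[0]

def toy_score(summary):
    """Illustrative non-decomposable score: word values minus a
    redundancy-style penalty when "a" and "c" co-occur."""
    values = {"a": 3.0, "b": 2.0, "c": 1.0}
    penalty = 2.5 if "a" in summary and "c" in summary else 0.0
    return sum(values[w] for w in summary) - penalty
```

With this toy score the best summary is {a, b}: adding "c" to a summary containing "a" costs more than "c" contributes, which a per-sentence greedy score could not express.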
Impact of Sentence Simplification
| System | R-2 (no simplified) | R-SU4 (no simplified) | R-2 (simplified) | R-SU4 (simplified) |
| --- | --- | --- | --- | --- |
| SumFocus | 0.078 | 0.132 | 0.078 | 0.134 |
| PYTHY | 0.089 | 0.140 | 0.096 | 0.147 |

Trained on 05 data, tested on 06 data.
Evaluating the Metrics
| Criterion | Num Pairs | Train Acc | R-2 (content only) | R-SU4 (content only) | R-2 (all words) | R-SU4 (all words) |
| --- | --- | --- | --- | --- | --- | --- |
| Oracle | 941K | 93.1 | 0.076 | 0.107 | 0.093 | 0.143 |
| SCUs | 430K | 62.0 | 0.078 | 0.108 | 0.086 | 0.134 |
| Model Freq. | 6.3M | 96.9 | 0.076 | 0.106 | 0.096 | 0.147 |
| All | 7.7M | 94.2 | 0.076 | 0.107 | 0.096 | 0.147 |

Trained on 05 data, tested on 06 data; includes simplified sentences.
Update Summarization Pilot

• SVM novelty classifier trained on the TREC 02 & 03 novelty track data

| System | ROUGE-2 | ROUGE-SU4 |
| --- | --- | --- |
| PYTHY + Novelty (1) | 0.07135 | 0.11164 |
| PYTHY + Novelty (.5) | 0.07879 | 0.12929 |
| PYTHY + Novelty (.1) | 0.08721 | 0.12958 |
| PYTHY | 0.08686 | 0.12876 |
| SumFocus | 0.07002 | 0.11033 |

Score(s_i | PrevS) = Score_Pythy(s_i | PrevS) · Pr(novel(s_i) | BG)
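One plausible reading of this combination, sketched below, multiplies the PYTHY score by the novelty classifier's probability raised to a weight `lam`; I am assuming the parenthesized values in the table (1, .5, .1) act as that weight, which the slide does not state explicitly:

```python
def update_score(pythy_score, novelty_prob, lam):
    """Hypothetical combination of the PYTHY sentence score with the
    novelty classifier's probability; lam (assumed to be the
    parenthesized 1 / .5 / .1 in the table) controls how strongly
    novelty discounts the score."""
    return pythy_score * (novelty_prob ** lam)
```

Under this reading, a smaller `lam` weakens the novelty discount, consistent with the table, where PYTHY + Novelty (.1) scores closest to plain PYTHY.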
Summary and Future Work

• Summary
  • Combination of different target metrics for training
  • Many sentence features
  • Pair-wise ranking function
  • Dynamic scoring
• Future work
  • Boost robustness: sensitive to cluster properties (e.g., size)
  • Improve grammatical quality of simplified sentences
  • Reconcile novelty and (ir)relevance
  • Learn features over whole summaries rather than individual sentences
Thank You