© Copyright 2010
Jeremy G. Kahn
Parse decoration of the word sequence in the speech-to-text machine-translation pipeline
Jeremy G. Kahn
A dissertation submitted in partial fulfillment of the requirements for the degree of
Doctor of Philosophy
University of Washington
2010
Program Authorized to Offer Degree: Linguistics
University of Washington
Graduate School
This is to certify that I have examined this copy of a doctoral dissertation by
Jeremy G. Kahn
and have found that it is complete and satisfactory in all respects, and that any and all revisions required by the final examining committee have been made.
Chair of the Supervisory Committee:
Mari Ostendorf
Reading Committee:
Mari Ostendorf
Paul Aoki
Emily M. Bender
Fei Xia
Date:
In presenting this dissertation in partial fulfillment of the requirements for the doctoral degree at the University of Washington, I agree that the Library shall make its copies freely available for inspection. I further agree that extensive copying of this dissertation is allowable only for scholarly purposes, consistent with “fair use” as prescribed in the U.S. Copyright Law. Requests for copying or reproduction of this dissertation may be referred to Proquest Information and Learning, 300 North Zeeb Road, Ann Arbor, MI 48106-1346, 1-800-521-0600, to whom the author has granted “the right to reproduce and sell (a) copies of the manuscript in microform and/or (b) printed copies of the manuscript made from microform.”
Signature
Date
University of Washington
Abstract
Parse decoration of the word sequence in the speech-to-text machine-translation pipeline
Jeremy G. Kahn
Chair of the Supervisory Committee:
Professor Mari Ostendorf
Electrical Engineering & Linguistics
Parsing, or the extraction of syntactic structure from text, is appealing to natural language processing (NLP) engineers and researchers. Parsing provides an opportunity to consider information about word sequence and relatedness beyond simple adjacency. This dissertation uses automatically-derived syntactic structure (parse decoration) to improve the performance and evaluation of large-scale NLP systems that have (in general) used only word-sequence level measures to quantify success. In particular, this work focuses on parse structure in the context of large-vocabulary automatic speech recognition (ASR) and statistical machine translation (SMT) in English and (in translation) Mandarin Chinese. The research here explores three characteristics of statistical syntactic parsing: dependency structure, constituent structure, and parse uncertainty — making use of the parser’s ability to generate an M-best list of parse hypotheses.
Parse structure predictions are applied to ASR to improve word-error rate over a baseline non-syntactic (sequence-only) language model (achieving 6–13% of possible error reduction). Critical to this success is the joint reranking of an N×M-best list of N ASR hypothesis transcripts and M-best parse hypotheses (for each transcript). Jointly reranking the N×M lists is also demonstrated to be useful in choosing a high-quality parse from these transcriptions.
In SMT, this work demonstrates expected dependency pair match (EDPM), a new mechanism for evaluating the quality of SMT translation hypotheses by comparing them to reference translations. EDPM, which makes direct use of parse dependency structure in its measurement, is demonstrated to correlate better with human measurements of translation quality than the competitor (and widely-used) evaluation metrics BLEU4 and translation edit rate.
Finally, this work explores how syntactic constituents may predict or improve the behavior of unsupervised word-aligners, a core component of SMT systems, over a collection of Chinese-English parallel text with reference alignment labels. Statistical word-alignment is improved over several machine-generated alignments by exploiting the coherence of certain parse constituent structures to identify source-language regions where a high-recall aligner may be trusted.
These diverse results across ASR and SMT together point to the utility of incorporating parse information into large-scale (and generally word-sequence oriented) NLP systems, and demonstrate several approaches for doing so.
TABLE OF CONTENTS
List of Figures
List of Tables

Chapter 1: Introduction
  1.1 Evaluating the word sequence
  1.2 Using parse information within automatic language processing
  1.3 Overview of this work

Chapter 2: Background
  2.1 Statistical parsing
  2.2 Reranking n-best lists
  2.3 Automatic speech recognition
  2.4 Statistical machine translation
  2.5 Summary

Chapter 3: Parsing Speech
  3.1 Background
  3.2 Architecture
  3.3 Corpus and experimental setup
  3.4 Results
  3.5 Discussion

Chapter 4: Using grammatical structure to evaluate machine translation
  4.1 Background
  4.2 Approach: the DPM family of metrics
  4.3 Implementation of the DPM family
  4.4 Selecting EDPM with human judgements of fluency & adequacy
  4.5 Correlating EDPM with HTER
  4.6 Combining syntax with edit and semantic knowledge sources
  4.7 Discussion

Chapter 5: Measuring coherence in word alignments for automatic statistical machine translation
  5.1 Background
  5.2 Coherence on bitext spans
  5.3 Corpus
  5.4 Analyzing span coherence among automatic word alignments
  5.5 Selecting whole candidates with a reranker
  5.6 Creating hybrid candidates by merging alignments
  5.7 Discussion

Chapter 6: Conclusion
  6.1 Summary of key contributions
  6.2 Future directions for these applications
  6.3 Future challenges for parsing as a decoration on the word sequence

Bibliography
LIST OF FIGURES
2.1 A lexicalized phrase structure and the corresponding constituent and dependency trees
2.2 The models that contribute to ASR
2.3 Word alignment between e and f
2.4 The models that make up statistical machine translation systems
3.1 A SParseval example
3.2 System architecture at test time
3.3 n-best resegmentation using confusion networks
3.4 Oracle parse performance contours for different numbers of parses M and recognition hypotheses N on reference segmentations
3.5 SParseval performance for different feature and optimization conditions as a function of the size of the N-best list
4.1 Example dependency trees and their dlh decompositions
4.2 The dl and lh decompositions of the hypothesis tree in figure 4.1
4.3 An example headed constituent tree and the labeled dependency tree derived from it
4.4 Pearson’s r for various feature tunings, with 95% confidence intervals. EDPM, BLEU and TER correlations are provided for comparison
5.1 A Chinese sentence and its translation, with reference alignments and alignments generated by unioned GIZA++
5.2 Examples of the four coherence classes
5.3 Decision trees for VP and IP spans
5.4 An example incoherent CP-over-IP
5.5 An example of a clause-modifying adverb appearing inside a verb chain
5.6 An example of English ellipsis where Chinese repeats a word
5.7 Example of an NP-guided union
LIST OF TABLES
1.1 Two ASR hypotheses with the same WER
1.2 Word-sequences not considered to match by naïve word-sequence evaluation
3.1 Reranker feature descriptions
3.2 Switchboard data partitions
3.3 Segmentation conditions
3.4 Baseline and oracle WER reranking performance from N = 50 word sequence hypotheses and 1-best parse
3.5 Oracle SParseval (WER) reranking performance from N = 50 word sequence hypotheses and M = 1, 10, or 50 parses
3.6 Reranker feature combinations
3.7 WER on the evaluation set for different sentence segmentations and feature sets
3.8 Word error rate results comparing γ
3.9 Results under different segmentation conditions when optimizing for SParseval objective
4.1 Per-segment correlation with human fluency/adequacy judgements of different combination methods and decompositions
4.2 Per-segment correlation with human fluency/adequacy judgements of baselines and different decompositions. N = 1 parses used
4.3 Considering γ and N
4.4 Corpus statistics for the GALE 2.5 translation corpus
4.5 Per-document correlations of EDPM and others to HTER
4.6 Per-sentence, length-weighted correlations of EDPM and others to HTER, by genre and by source language
5.1 Four mutually exclusive coherence classes for a span s and its projected range s′
5.2 GALE Mandarin-English manually-aligned parallel corpora
5.3 The Mandarin-English parallel corpora used for alignment training
5.4 Alignment error rate, precision, and recall for automatic aligners
5.5 Coherence statistics over the spans delimited by comma classes
5.6 Coherence statistics over the spans delimited by certain syntactic non-terminals
5.7 Some reasons for IP incoherence
5.8 Reranking the candidates produced by a committee of aligners
5.9 Reranking the candidates produced by giza.union.NBEST
5.10 AER, precision and recall for the bg-precise alignment
5.11 AER, precision and recall over the entire test corpus, using various XP-strategies to determine trusted spans
ACKNOWLEDGMENTS
My advisor, Mari Ostendorf, has been a reliable source of support, encouragement, and
ideas through the process of this work. An amazingly busy and productive engineering
professor, she welcomed me into the Signal, Speech and Language Interpretation (SSLI) laboratory when I was looking only for summer employment — on the condition that I
remain with her for at least another year. It was a good bargain: Mari’s empirical, skeptical,
practical approach to research has served as a model and inspiration, and I am proud
every time I notice myself saying something Mari would have suggested. SSLI’s home in
Electrical Engineering (in a different college, let alone department, from Linguistics) has
been a valuable source of perspective: working in the lab (and with the electrical engineers
and computer scientists there) gives me the unusual privilege of being the “language guy”
among the engineers and the “engineering guy” among the linguists.
My committee of readers was delightfully representative of the intersection between linguistics and computers. Paul Aoki represented practical translation and the use of computers for language teaching — and provided unstinting positive regard for me and my work. Emily Bender opened doors for me by opening a master’s program in computational linguistics at the University of Washington just as I began, creating entire cohorts of professional NLP people just across Stevens Way. Fei Xia’s perspectives on Chinese parsing and on statistical machine translation were welcome on every single revision.
Among my colleagues at SSLI, I would like to acknowledge Becky Bates, who adopted me
as a “big sister” from my first day there, for her clear-eyed, mindful approach to engineering
education and her grounded, open approach to the full experience of the world, even for
those of us who — through practice or predisposition — spend a lot of time in our head
and in the world of words. Dustin Hillard and Kevin Duh shared their enthusiasm and
excitement for engineering and machine learning in application to language problems. Lee
Damon kept the entire lab infrastructure running in the face of thousands of submitted jobs,
many of which were mine. Bin and Wei tolerated both my questions about Chinese and
my eagerly verbose explanations of some of the crookeder corners of the English language.
Alex, Brian, Julie and Amittai were always game for engaging in a discussion about tactics
and strategies for natural-language engineering graduate students, and I am pleased to leave
my role as SSLI morale officer in their hands.
Across the road in Padelford, my colleagues and teachers in the Linguistics department
have also been a pleasure. Beyond my committee members named already, I had the
pleasure of guidance and welcome from Julia Herschensohn, the departmental chair, whose
enthusiasm for an interdisciplinary computational linguist like me spared me a number of
administrative ordeals, some of which I’ll probably never know about (and I am grateful
to Julia for that). Richard Wright and Alicia Beckford-Wassink were happy to let me be
an “engineering guy” in a room full of empirical linguists. Fellow students Bill, David, and
Scott reminded me from the very beginning that having spent time in industry does not
disqualify one from still studying linguistics. Lesley, Darren, Julia, Amy, and Laurie remind
me whenever I see them (which is often online rather than in person!) that linguistics can
be fun, whichever corner of it you live in.
Over the last two years, I have had the privilege of being hosted at the Speech Technology
and Research (STAR) laboratory at SRI International in Menlo Park, California. I began
my study there as part of the DARPA GALE project, on which SSLI and the STAR lab
collaborated. STAR director Kristin Precoda graciously allowed me to use office space
and come to lab meetings, even after that project ended, while I finished my dissertation.
Dimitra, Wen, Fazil, Jing, Murat, Luciana, Martin, Colleen and Harry, support staff Allan
and Debra, and fellow SSLI alumni Arindam Mandal and Xin Lei also hosted and oriented
me during my time at SRI. All of them have been pleasant hosts and supportive colleagues.
I am doubly grateful that they tolerated my poor attempts at playing Colleen’s guitar in
the break room.
I have had fruitful and enjoyable collaborations with students and faculty beyond UW
and SRI in my time in the UW graduate program: I am pleased to have explored interesting
computational linguistics research with Matt Lease, Brian Roark, Mark Johnson (who was
also my first syntax professor!), Mary Harper, and Matt Snover, among many others. I re-
ceived support and software guidance from John DeNero, Chris Dyer and Eugene Charniak,
again, among others. I am indebted to them all.
About a year before completing this dissertation, I began part-time work at Wordnik. I
am grateful to Erin McKean for offering me employment thinking about words and language
even while I finished this dissertation, and for allowing me to work less than half-time while
I finished up the thesis. This was offered with far less grumbling than I deserved. I am
lucky, too, to have intelligent, funny, talented co-workers there: Tony, John, Robert, Angela,
Russ, Kumanan, Krishna and Mark continue to be a pleasure to work with and work for.
Of course, I had little chance to complete this work without support from an amazing
troupe of supportive friends in many locations. Matt, Shannon, Kristina, Maryam, Lauren
Neil, Ben, Trey, Rosie, and others have held out from the wild world of the Internet. In
San Francisco, I am happy to have found community with Nancy, Heather, Jen, Susanna
and Derek, all holding on for Wisdom and for my success. Jim and Fiona, William and Jo,
Eldan and Melinda, Chris and Miriam, Alex and Kirk, Johns L and A, and many others
support me with love and wisdom from Seattle. Finally, I am lucky to have been supported
all along the way by my parents, Mickey and Henry; by my brother Daniel, and, most of
all, by my wife Dorothy Lemoult, whom I met in Seattle in my second year of the program.
Since the day we met, Dorothy has seen me as a better person than even I believed myself
to be; to be the object of that kind of fierce love is the best way to be alive.
I have received funding for my work from the University of Washington, the National
Science Foundation, SRI International, and the Defense Advanced Research Projects Agency (DARPA).
Finally, a framing comment: I was supported in the process of creating this dissertation
by a community that will undoubtedly be under-represented by any attempt to list everyone,
especially this one. To all of you I’ve overlooked or omitted, please forgive me.
DEDICATION
For the pursuit of a life of love, play, and inquiry;
For my partner, my ally, my friend, my lover;
For what we have already and for what we make together;
For Dorothy.
Chapter 1
INTRODUCTION
Parsing, or extracting syntactic structure from text, is an appealing process to linguists studying the grammatical properties of natural language: parsing is an application of syntactic theory. For non-linguists, including many natural-language engineers, it is not necessarily of immediate practical use. Engineers and other users of language technology have generally found word sequences (as in writing) to be a more tractable input and output, and traditional evaluation measures for their tasks have not considered any linguistic structure beyond the word sequence in their design.
While some natural language applications have embraced parsing at their core (e.g. information extraction, which generally begins from parsed sentence structures), this dissertation applies parsers to two other domains: automatic speech recognition (ASR) and statistical machine translation (SMT). In evaluation, both of these natural-language processing tasks traditionally use measurements that evaluate using only matches of words or adjacent sequences of words (N-grams) against a reference (human-generated) output. In ASR, parsing features and scores have been explored for improved modeling of word sequences, but these approaches have not been widely adopted. Similarly, although a few SMT systems use a parse tree in parts of decoding, parse structures are also not widely adopted in SMT. For example, statistical word-alignment, a core internal technology for SMT, generally uses no parse information to hypothesize links between source- and target-language words.
This dissertation explores the incorporation of parsing into representations of language for natural language processing, particularly for components that have traditionally considered only the word sequence as input and output. This work takes two related approaches: exploring new opportunities to bring the information provided by a parser to bear within the traditional (syntactically-uninformed) approaches to these natural-language tasks, and exploring the construction of new, parser-informed automatic evaluation measures to guide the behavior of these systems in directions that lead to qualitative improvements in results, as judged by human assessors.
1.1 Evaluating the word sequence
This work focuses on two natural-language processing applications: speech recognition and
machine translation. The output of speech recognition is a word sequence transcription
hypothesis; the output of a machine translation system is a word sequence translation
hypothesis. In each case, the usual approach to evaluation is to compare the transcript (or
translation) hypothesis to a reference transcript (or translation).
Using the undecorated word-sequence as an interface among natural language systems may sometimes introduce surprising behaviors in evaluation. A word sequence is a very shallow representation of the linguistic structure of language. This representation is almost[1] completely devoid of theoretical baggage: no theoretical training is required for language users (or machines) to count over words and compare them for identity.
Speech transcript quality, for example, is ordinarily measured by word error rate
(WER), which is defined over hypothesis transcript h and reference transcript r as:
    WER(h; r) = [insertions(h; r) + deletions(h; r) + substitutions(h; r)] / length(r)        (1.1)
where insertion, deletion and substitution error counts are calculated through a Levenshtein
alignment between reference and hypothesis that minimizes the total number of errors.
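To make the Levenshtein-based definition in equation (1.1) concrete, here is a minimal sketch of a WER computation in Python; the function name and word-list interface are illustrative, not taken from any particular scoring toolkit:

```python
def wer(hyp, ref):
    """Word error rate: minimum-edit (Levenshtein) alignment over words,
    divided by reference length, as in equation (1.1)."""
    # d[i][j] = minimum edits needed to turn ref[:i] into hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                                # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                                # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + sub)   # substitution (or match)
    return d[len(ref)][len(hyp)] / len(ref)

ref = "people used to arrange their whole schedules around those".split()
hyp_b = "people used to arrange their whole schedule and those".split()
print(round(wer(hyp_b, ref), 2))  # → 0.22
```

Applied to hypothesis (b) of table 1.1, the two substitutions (schedule for schedules, and for around) against the nine-word reference give 2/9 ≈ 0.22, matching the table.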
Automated methods like WER facilitate the optimization and evaluation of natural
language processing technology, because they can report the quality of a hypothesis without
human intervention, given only a previously-generated reference. For these optimization
and evaluation processes, though, the automatic measures should ideally be consistent with
human judgements of quality.
Word-sequence evaluation measures, however, do not always match human judgements
about quality. For example, they rarely have any notion of centrality: no aspect of WER
[1] Chinese and several other written languages do not separate words in text, but there is still high agreement among literate speakers about the character-sequence. Character sequences, rather than word sequences, are thus usually used for evaluation in Chinese speech recognition.
Table 1.1: Two ASR hypotheses with the same WER.

    Hypothesis                                                          WER
    Reference: People used to arrange their whole schedules around those  —
    (a) people easter arrange their whole schedules around those         0.22
    (b) people used to arrange their whole schedule and those            0.22
Table 1.2: Word-sequences not considered to match by naïve word-sequence evaluation
The man saw the cat. The cat was seen by the man.
The diplomat left the room. The diplomat went out of the room.
He quickly left. He left quickly.
He warmed the soup. He heated the soup.
optimize optimise
don’t do not
because cuz
captures the intuition that some words are more important to the sentence than others.
Table 1.1 considers two hypotheses that are projected to the same distance (WER = 0.22)
by the WER metric. In table 1.1, hypothesis (a) and hypothesis (b) have equal WER, but
(a)’s substitution is on a more central sequence (the main verb used to), while (b)’s word
errors are on a grammatical affix (schedule instead of schedules) and an adjunct adverbial
(around those). One indicator of the centrality of used to is that (b)’s substitution causes
little adjustment to the overall structure of the sentence, where (a)’s substitution leaves (a)
with no workable parse structure other than a fragment.
Conversely, table 1.2 presents some example word sequences that a human evaluator might reasonably consider equivalent (for some evaluation tasks), and which a naïve word-sequence evaluation would score as different. To capture any of these matches, the evaluation must be able to find a projection of the word sequences such that they may be found equivalent. The last two pairs in table 1.2 are usually handled by normalization tools, but the others are usually ignored: with the exception of contractions, case normalization and sometimes spelling normalization, most evaluations consider only exact matches over sections of the word-sequence, and treat all words as equally important.
Evaluation measures like WER (or extensions using N-grams) use only surface word
identity and word adjacency in their measurements. These measures incorporate neither a
notion of centrality nor argument structure, but individual words’ roles in the meaning of a
sentence are determined by their relationship to other, not necessarily adjacent words. It is
the central contention of this work that extending our measurements and evaluations of the
word sequence to include a deeper representation of linguistic structure provides benefits to
both linguistic and engineering approaches to natural language.
1.2 Using parse information within automatic language processing
The core theme of this work is the use of automatically-derived parse structure to improve
the performance and evaluation of language-processing systems that have generally used
only word-sequence level measures.
Parse decorations on the word sequence can provide benefits to these systems in these
two ways:
• parse decoration offers a new source of structural information within the models that
go into these systems, providing features from which the models may derive more
powerful hypothesis-choice criteria, and
• parse decoration enables new target measures, for use in system tuning and/or eval-
uation of the overall performance of a system.
Both of these techniques are used in this dissertation in ASR and SMT applications. For
ASR systems, this work explores using parse structure for optimization towards both WER
and SParseval (an evaluation measure for parses of speech transcription hypotheses). For
SMT systems, this work explores using parse structure towards providing an evaluation
measure that correlates better with human judgement and towards the optimization of an
internal target (word-alignment).
Parse structure is not observable in transcripts or other easily-derived training data (outside of the relatively small domain of treebanks), which is one reason that parse-information has not been widely adopted into some of these systems. Parser accuracy, especially on genres that do not match the parser’s training data, may not be very good. This work adopts the approach that a parser’s own confidence estimates may be used to avoid egregious blunders, by using expectations (confidence-weighted averages) over parser predictions. A common thread among the research directions presented here is thus the use of more than one parse-decoration hypothesis to provide structural information about the word sequence.
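As an illustration of that idea, the following sketch computes a confidence-weighted average of an arbitrary feature over an M-best list. The (log-score, parse) interface and the softmax normalization here are illustrative assumptions, not the interface of any specific parser:

```python
import math

def expected_feature(mbest, feature):
    """Expectation of `feature` over an M-best list of (log_score, parse)
    pairs, weighting each hypothesis by its normalized parser confidence."""
    top = max(score for score, _ in mbest)
    weights = [math.exp(score - top) for score, _ in mbest]  # shift to avoid underflow
    z = sum(weights)
    return sum(w * feature(parse)
               for w, (_, parse) in zip(weights, mbest)) / z

# Toy M-best list of two hypothetical parses; the feature asks whether the
# parse is the first hypothesis (stubbed here as string identity).
mbest = [(-1.0, "parse-A"), (-2.0, "parse-B")]
print(expected_feature(mbest, lambda p: 1.0 if p == "parse-A" else 0.0))
# ≈ 0.731: the higher-confidence parse dominates, but does not decide alone
```

The point of the expectation is exactly this soft behavior: a feature asserted by the 1-best parse but contradicted by close runners-up is discounted rather than trusted outright.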
Previous work on applying grammatical structure to ASR systems has focused on either parsing a single hypothesis transcript (the parsing task) or using a single hypothesis parse to select a transcript (the language-modeling task). By exploring the joint optimization of parse and transcript hypotheses (chapter 3), this work demonstrates the utility of each to the other. It frames the parse-decoration as a source of structural features of the hypotheses, to be used in reranking hypotheses. In this approach, WER-optimization is improved by including information from multiple parse hypotheses, and parse-metric optimization is improved by comparing multiple parse hypotheses over multiple transcript hypotheses. Because many NLP tasks explicitly use parsing or chunking, or have verb-dependent processing, the parse metric is often a better choice for word transcription associated with NLP tasks.
After considering parsing as an ASR objective, we turn to incorporation of parse decoration towards SMT tasks, beginning by considering SMT evaluation (chapter 4). SMT evaluation measures have traditionally used only word-sequence information (e.g., measuring the precision of n-grams against a reference translation). This work explores the use of parsing dependency structure to provide a syntactically-sensitive evaluation measure of the translation hypotheses. Parse structure, here, is represented as an expectation over dependency structure (using the multiple-parse hypotheses approach suggested above), and this work demonstrates that evaluations informed by parse-structure correlate more closely with human judgements of translation quality than the traditional (word-sequence based) metrics.
Previous work on applying parsers to SMT has focused mostly on parsing for reordering source language text or within decoders. A third limb of the work presented here (chapter 5) explores the use of parsers in improving translation word-alignment (an internal component of SMT). In this approach, parse-decoration is treated as labels on source-language spans, and this information is applied to selecting better machine translation word-alignments, an SMT task that generally uses only word-sequence information. In this work, we explore the coherence properties of the parse-annotated spans, finding some span-classes that tend to be coherent, in the sense that a contiguous sequence of source language words is not broken up in translation. This syntactic coherence is used to guide the combination of a precision-oriented and a recall-oriented automatic alignment.
By applying parse decoration to word sequences, this work offers several pieces of evidence for new directions in language-processing work. Word sequences are not always the best way to evaluate the performance of natural language processing systems; grammatical structure (from parsing) is in fact a useful source of information to these other natural-language processing systems, even when used as a component in evaluation (in machine translation). As part of those results, this work offers new reasons to use and improve work in syntactic parsers.
1.3 Overview of this work
The dissertation’s structure is as follows: Chapter 2 covers the shared background material:
statistical parsing, and schematic overviews of the operation of ASR and SMT systems.
To accommodate the diversity of corpora and applications, some discussion of background
material and related work is deferred to the appropriate chapter, rather than covering all
background materials in chapter 2. Chapters 3–5 present the prior work, new methods and
experimental results of each of the three applications explored in this thesis.
Chapter 3 applies parsing to automatic speech recognition on English conversational
speech, and shows that information derived from parse structure offers improvements on
WER. In addition, when the ASR/parsing pipeline is directed to target a parse-quality
measure designed for speech transcripts, not only does the pipeline perform better on that
measure but it selects qualitatively different word sequences, reflecting the effect of parse
structure (and its evaluation) on speech recognition.
Chapter 4 proposes a new evaluation measure, Expected Dependency Pair Match (EDPM),
for machine translation evaluation. EDPM is a measure of parse-structure similarity between
hypothesis and reference translations. Experiments in this chapter correlating EDPM
with human and human-derived judgments of translation quality show that EDPM surpasses
popular word-sequence-based evaluation measures and is competitive with other newly-
proposed metrics that rely on external knowledge sources.
Chapter 5 focuses on Chinese-English parallel-text word alignment, an internal com-
ponent of machine translation that also traditionally ignores structural information. This
chapter applies parsing to the Chinese side of the parallel text, and introduces translation
coherence, which is a property of a source span and an alignment. The work in this chap-
ter explores the utility of coherence in selecting good alignments, examines where those
coherence measures break down, and shows that parse structure information is useful in
selecting regions where two alignment candidates may be combined to improve alignment
recall without hurting alignment precision.
Chapter 6 concludes with a summary of the key contributions of this thesis, which include
both application advances and new understanding of general methods for leveraging parse
decorations. It further suggests future directions of research, in which parse-decoration may
be applied in new ways to machine-translation, speech recognition, and evaluation methods.
Chapter 2
BACKGROUND
This chapter provides an overview of the natural-language processing technologies that
this dissertation rests upon. The next section (2.1) provides some background on statistical
syntactic parsing and describes the statistical syntactic parsers in use in this work. The
subsequent section (2.2) explains the framework for n-best list reranking used in several
parts of this work. The following sections (2.3 and 2.4) describe the general framework of
the two applications (speech recognition and statistical machine translation, respectively)
to which this work applies those rerankers and parsers.
2.1 Statistical parsing
Statistical parsing serves as the method of word sequence decoration for all of the research
proposed in this work. This section reviews the key decorations available from a statistical
parser, considers the strengths and weaknesses of the probabilistic context-free grammar
(PCFG) paradigm, and discusses the training and evaluation of such parsers.
2.1.1 Constituent and dependency parse trees
The parse decorations on word sequences used here include both dependency structure
and hierarchical spans over word sequences. Hierarchical spans are known as constituent
structures; in these trees, span labels nest to form a hierarchy (a tree) of constituent spans;
these spans are labeled with the phrase class (e.g. np or vp) that describes their content. The
entire segment is labeled with a root span, which is usually coterminous with a single s
spanning the sentence.
A dependency structure, by contrast, labels each word with a single dependency link to
its “head”, with a label representing the nature of the dependency. One word (usually the
main verb of the sentence) is dependent on a notional root node; all the other words in
the sentence depend on other words in the sentence.
These two representations of grammatical structure may be reconciled in a lexicalized
phrase representation, which marks one child subspan as the head child of each span. If head
information is ignored, this representation is equivalent to the span label representation. The
head word of each phrase-constituent φ is recursively defined as the head word of φ’s head
child or, if φ contains only one word, that word. A constituent structure is lexicalized
when each constituent is additionally annotated with its head word; one may read either
constituent spans or dependency structures off of these lexicalized constituent structures.
Figure 2.1 shows a lexicalized constituent structure and the dependency tree and constituent
tree that may be derived from it.
The arc labels on the dependency structure shown in figure 2.1 are derived by extraction
from the headed phrase structure by concatenating two labels A/B: A is the lowest con-
stituent dominating both the dependent and the headword, and B the highest constituent
dominating the dependent but not the headword. This approach to arc-labeling works well for a language like
English (or Chinese) with relatively fixed word order.
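The A/B label extraction just described can be sketched as follows. This is an illustrative reconstruction, not the dissertation's actual code: the flat `spans` encoding and the tie-breaking (earlier-listed, i.e. higher, spans win) are assumptions. The example below uses the sentence from figure 2.1.

```python
# Hedged sketch: deriving A/B dependency arc labels from a headed
# constituent tree, where A is the lowest constituent dominating both
# dependent and head word, and B the highest constituent dominating
# the dependent but not the head word.

def arc_label(spans, dep, head):
    """spans: list of (label, start, end) constituents, end exclusive,
    listed top-down. dep, head: word indices. Returns "A/B"."""
    covering_both = [(label, start, end) for label, start, end in spans
                     if start <= dep < end and start <= head < end]
    # A: the lowest (smallest) constituent dominating both words
    a_label, _, _ = min(covering_both, key=lambda s: s[2] - s[1])
    covering_dep_only = [(label, start, end) for label, start, end in spans
                         if start <= dep < end and not (start <= head < end)]
    # B: the highest (largest) constituent dominating only the dependent
    b_label, _, _ = max(covering_dep_only, key=lambda s: s[2] - s[1])
    return f"{a_label}/{b_label}"
```

For "I was personally acquainted with the people", the arc from "I" (index 0) to its head "was" (index 1) gets the label s/np, matching figure 2.1.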
2.1.2 Generating parse hypotheses
To provide parse-decoration, we desire a parser/decorator which generates n-best lists of
parse hypotheses over input sentences.1 Such an n-best list may be useful in reranking the
parse hypotheses (see section 2.2 below) or other applications which benefit from access to
the confidence of the parser. We require that the parsers used to generate these n-best lists
are adaptable to new domains, robust, and probabilistic. Retrainable parsers are desired
because the domain over which this work predicts parses varies widely with the task: parse
structures over speech, as in chapter 3, are qualitatively different than parse structures over
edited text (e.g. the news-text translation in chapter 5). Robustness, the reliable generation
of predictions for any input word sequence, is desirable because the parser must distinguish
among machine-generated word sequences (the output of ASR and SMT), which are not
1 Packed parse forests, the combined representation of the parser search space used by e.g. Huang [2008], represent a speedy and sometimes elegant alternative to n-best lists, but are constrained by the forest-packing to use only those features that may be computed locally in the tree. This work uses n-best lists instead for their easy combination and for freedom from the tree-locality constraint.
[Figure 2.1 here: a lexicalized constituent tree for "I was personally acquainted with the people", together with the derived constituent tree and dependency arcs (root/s, s/np, vp/adjp, adjp/rb, adjp/pp, pp/np, np/dt).]
Figure 2.1: A lexicalized phrase structure and the corresponding constituent and depen-
dency trees. Dashed arrows indicate the upward propagation of head words to head phrases.
The lexicalized constituent tree encodes both the constituency tree and the dependency rela-
tions. The dependency tree may be understood as the link to the headword of the governing
constituent.
always well-formed either due to recognizer errors or speaker differences. Since we use the
parser to predict fine-grained information to make decisions about the word sequences, the
ability to generate parse structure over all (or nearly all) the candidate inputs is important.
Probabilistic scoring is required not only to predict the order of the n-best list, but to
compute the relative contribution of each parse hypothesis to the n-best list. All else being
equal, preferred parsers are also fast.
While unification grammars, e.g. head-driven phrase structure grammar [HPSG, e.g.
Pollard and Sag, 1994] and lexical functional grammar [LFG, e.g. Bresnan, 2001] produce
complex and linguistically-informed parse structures that also may be interpreted as headed
phrase grammars, existing grammars in these formalisms are not fit to a training set, nor
do they have complete coverage (for out-of-domain or ill-formed word sequences,
they often produce no structure at all). Most problematic for the research explored here
is that state-of-the-art unification grammars [e.g., Flickinger, 2002, Cahill et al., 2004] do
not provide parse n-best lists with the probability of each parse in the list, which is used
in some of our work for taking expectations over parse alternatives.
Instead of a unification grammar like the ones above, this work uses statistical proba-
bilistic context-free grammar (PCFG) parsers. These sorts of parsers (e.g. Collins [2003],
Charniak [2000], and Petrov and Klein [2007]) use lexical and span-label information from
a training set of hand-labeled trees known as a treebank, e.g. the Penn Treebank of English
[Marcus et al., 1993], and construct syntactic structures on new sentences (in the same
language) consistent with the grammar inferred from these training sentences. Because
they are probabilistic, these parsers may return not only a “best” parse analysis according
to the model, but also a list of n analyses reflecting the n-best parse structures that the
parser (and its grammar) assigns to the input sentence. Each carries a probabilistic weight
p(t, w) of the likelihood of a tree t with leaves w. The PCFG estimation makes the context-
free assumption: that the probability of generating the tree is composed of a combination
of probability estimates from tree-local decisions. By constraining the model to use only
tree-local decisions, PCFG models may use dynamic-programming techniques to efficiently
search a very large space of possible tree structures.
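The context-free assumption can be made concrete with a small sketch: the probability of a tree is the product of the probabilities of its local rule applications. The tree encoding and rule table below are illustrative assumptions, not any particular parser's internals.

```python
# Hedged sketch of the PCFG context-free assumption: the (log)
# probability of a tree is the sum of log probabilities of tree-local
# rule applications, independently across subtrees.
import math

def log_prob(tree, rule_logprobs):
    """tree: (label, children) where children is a list of subtrees or,
    for a preterminal, a word string.
    rule_logprobs: {(parent, (child_labels...)): log p}."""
    label, children = tree
    if isinstance(children, str):          # preterminal -> word
        return rule_logprobs[(label, (children,))]
    rhs = tuple(child[0] for child in children)
    total = rule_logprobs[(label, rhs)]    # one tree-local decision
    for child in children:                 # independence across subtrees
        total += log_prob(child, rule_logprobs)
    return total
```

This locality is exactly what lets chart-style dynamic programming share subtree scores across the many trees that contain them.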
2.1.3 Treebanks for the PCFG-derived parser
Parsers of this nature are constrained by the availability and structure of treebanks (from
which to learn a grammar). The Penn treebank [Marcus et al., 1993], for example, encodes
span labels over a collection of edited English text (mostly the Wall Street Journal); the
availability of this labeled set has enabled the development of the statistically-trained parsers
for English mentioned above. Recent work constructing treebanks in languages other than
English, e.g. in Chinese [Xue et al., 2002], and in domains other than edited English text
[e.g., Switchboard telephone speech: Godfrey et al., 1992] has made these parsers much
more broadly accessible for use in applications with broader focus than parsing itself. In
particular, Huang and Harper [2009] have built a parser tuned for certain genres of Mandarin
Chinese. Certain aspects of this research depend on the power of this parser to handle
Mandarin news text, despite the relative lack of data (compared to English).
Though these parsers do not explicitly include head structure in their output (to match
the treebanks on which they were trained), all of the state-of-the-art PCFG parsers in-
fer head structure internally, most using Magerman [1995] style context-free headfinding
rules. Recovering the head structure from their output (also using Magerman [1995] style
headfinding) is fast and deterministic, and allows for an easy conversion, when dependency
structure is called for, from treebank-style span trees to headed span trees and thence to
dependency structure.
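A minimal sketch of this style of head finding follows; the toy rule table and the scan order (by priority, then position) are illustrative assumptions, not the published Magerman rules.

```python
# Hedged sketch of Magerman [1995]-style head finding: each parent label
# has a search direction and a priority list of child labels; the first
# match becomes the head child. The rule table here is a tiny toy.

HEAD_RULES = {
    "s":  ("left",  ["vp", "s", "np"]),
    "vp": ("left",  ["vbd", "vbn", "vb", "vp"]),
    "np": ("right", ["nn", "nns", "np", "prp"]),
}

def head_child_index(label, child_labels):
    direction, priorities = HEAD_RULES.get(label, ("left", []))
    positions = list(range(len(child_labels)))
    order = positions if direction == "left" else list(reversed(positions))
    for wanted in priorities:           # scan by priority, then position
        for i in order:
            if child_labels[i] == wanted:
                return i
    return order[0]                     # fallback: first child in direction
```

Because the rules are deterministic table lookups, recovering heads (and thence dependencies) from a treebank-style span tree is fast.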
2.1.4 Intrinsic evaluation of statistical parsing
Statistical parsers are usually evaluated by comparing the hypothesized parse thyp to a
reference parse tref . The standard test uses parseval [Black et al., 1991], an F-measure
over span-precision and span-recall, which was developed for comparison to the Wall Street
Journal treebank [Marcus et al., 1993].
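As a sketch, parseval-style scoring reduces each tree to its multiset of labeled spans and computes an F-measure over them; the span representation below is assumed for illustration.

```python
# Hedged sketch of the parseval F-measure: precision and recall over
# labeled constituent spans, combined into F1.
from collections import Counter

def parseval_f1(hyp_spans, ref_spans):
    """hyp_spans, ref_spans: lists of (label, start, end) constituents."""
    hyp, ref = Counter(hyp_spans), Counter(ref_spans)
    matched = sum((hyp & ref).values())   # multiset intersection
    if matched == 0:
        return 0.0
    precision = matched / sum(hyp.values())
    recall = matched / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```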
The parseval technique assumes that the hypothesis proposed shares the same word
sequence; that is, parseval is only well-defined when whyp = wref and the basic (sen-
tence) segmentation agrees. If the division of those word sequences into segments differs,
parseval is not well-defined. Kahn et al. [2004] addressed this on reference transcripts of
conversational speech by concatenating all reference segment transcriptions from a single
conversation side and computing an F -measure based on error counts at the level of the
conversation side.
In speech applications, however, it is not reasonable to assume that the reference tran-
script is available to the parser, so scoring must compare (thyp, whyp) to (tref , wref ) instead.
In this situation, when comparing parses over hypothesized speech sequences, parse qual-
ity may instead be measured using SParseval [Roark et al., 2006], which computes an
F-measure over syntactic dependency pairs derived from the trees.
Research on parse quality over transcription hypotheses, however, has been very limited.
It has largely been restricted to parsing only the ASR engine’s best hypothesis, e.g., Harper
et al. [2005], which sought to improve the automatic segmentation of ASR transcripts into
utterance-level segments. Approaches like this one that use only one ASR hypothesis ignore
the potential of more parse-compatible alternative transcription hypotheses available from
the ASR engine. Further discussion of the SParseval measure over speech is included in
the background section of chapter 3, which explores parsing speech and speech transcripts.
2.2 Reranking n-best lists
As discussed above, PCFG parsers make strong assumptions about locality in order to
efficiently explore the very large space of possible trees. However, these independence as-
sumptions also prevent the use of feature extraction that crosses that locality boundary. For
example, the relative proportion of noun phrases to verb phrases may be a useful discrimi-
nator among good and bad trees, but this statistic is not computable within the context-free
locality assumptions that go into the parser itself.
An approach to dealing with this challenge is to first generate an n-best list of top-
ranking candidate hypotheses, and then apply discriminative reranking [Collins, 2000]
to re-score the set of candidates (incorporating the original scores as one of the features).
The features available to n-best reranking need not obey the locality assumptions that were
used in generating the candidate list in the first place: rather, the features may be holistic
because they are computed exhaustively (against every member of the n-best list) since n
is much smaller than the original search space. Collins and Koo [2005] and Charniak and
Johnson [2005] use this approach to achieve roughly 13% improvements in parseval F
performance on parsing Wall Street Journal text.
2.2.1 Reranking as a general tool
Reranking is of general use, and has been applied elsewhere before being applied to parsing.
In ASR, for example, it was applied to transcription n-best lists to lower word error rate
long before its use in parsing [e.g., Kannan et al., 1992], and Shen et al. [2004] introduce
the use of discriminative reranking in SMT work. Discriminative reranking is a form of
discriminative learning, which seeks to minimize the cost function of the top hypotheses.
Unlike generative models, which learn their parameters from counting occurrences in train-
ing data, n-best rerankers must be trained on hypotheses with explicit evaluation metrics
attached. Reranking has one important extension from the general case of discriminative
learning: in reranking, the ranker must learn which features separate the optimal candidates
from the suboptimal ones by comparing elements only within an n-best list, rather than
pooling all positive and negative examples to seek a margin. One way to do this (discussed
below in section 2.2.2) is to divide a candidate pool into ranks and attempt to separate
each rank from the other ranks. In parsing, for example, it is the relative difference in
(e.g.) prepositional phrase count among candidate parses that is used in reranking, not the
absolute count; candidate parse trees must be compared to other candidates derived from
the same n-best list. The generative component produces overly optimistic n-best lists over
its training data, so in order to provide reranker training with realistic n-best lists from
which to learn weights, the reranker needs to be trained using candidate parses from a data
set that is independent of both the generative component’s training and the evaluation test
set. Because of the limited amount of hand-annotated training and evaluation data, it is
not always preferable to sequester a separate training partition just for this model. Instead,
one may adopt the round-robin procedure described in Collins and Koo [2005]: build N
leave-n-out generative models, each trained on (N − 1)/N of the partitioned training set, and run
each on the subset that it has not been trained on. The resulting candidate sets are passed
to the feature-extraction component and the resulting vectors (and their objective function
values) are used to train the reranker models.
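The round-robin procedure can be sketched as follows, with `train_model` and `decode` as hypothetical stand-ins for the actual generative parser.

```python
# Hedged sketch of the Collins and Koo [2005] round-robin procedure:
# partition the training data into N folds, train a generative model on
# each leave-one-out subset, and decode the held-out fold with it, so
# every training item gets realistic (non-optimistic) candidates.

def round_robin_candidates(data, n_folds, train_model, decode):
    folds = [data[i::n_folds] for i in range(n_folds)]
    candidates = []
    for i, held_out in enumerate(folds):
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        model = train_model(train)      # trained on (N-1)/N of the data
        for item in held_out:
            candidates.append((item, decode(model, item)))
    return candidates
```

Each item is decoded only by a model that never saw it in training, which is what keeps the reranker's training candidates realistic.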
As already indicated, n-best reranking need not be applied only to parsing. As the
following sections will show, it is useful in other complex natural-language processing tasks,
where the n-best list generator (the generative stage) must obey strong independence as-
sumptions for the sake of efficiency, but the final result may be re-evaluated with new
features (classes of information) applied to the discriminative stage. In addition to using
reranking to improve parse quality, this work also uses n-best reranking as a framework for
applying syntactic information to other tasks.
2.2.2 Reranker strategy used in this work
Within this work, n-best list reranking is treated as a rank-learning margin problem: within
each segment, the task is to separate the best candidate from the other candidate hypotheses.
We adopt the svm-rank toolkit [Joachims, 2006] as our reranking tool. To prepare data for
training this toolkit, the approach adopted here selects the oracle best score on the objective
function φ∗ from the n-best list and converts the objective function into an objective loss
with regard to the oracle for all hypotheses $t_i$, e.g., $\phi_l(t_i) = \lvert \phi_p^{*} - \phi_p(t_i) \rvert$ for the parseval
objective $\phi_p$. To interpret $\phi_l$ as a rank function, we assign ranks to training candidates that
focus on those distinctions near the optimal candidate, as follows:

$$\operatorname{rank}(t_i) = \begin{cases} 1 & : \; \phi_l(t_i) \le \varepsilon \\ 2 & : \; \varepsilon < \phi_l(t_i) \le 2\varepsilon \\ 3 & : \; 2\varepsilon < \phi_l(t_i) \end{cases} \qquad (2.1)$$
where ε is a small value tuned empirically so that ranks 1 and 2 have a small proportion
of the total number of members in the candidate set. Since svm-rank uses all pairwise
comparisons between candidates of different rank, and ranks 1 and 2 have very few members,
this approach reduces the number of comparisons from quadratic in |C| to linear in |C|
(where |C| represents the number of candidates in the set), while still focusing the margin
costs towards the best candidates.
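A minimal sketch of this rank assignment, assuming higher objective scores are better (the data layout is illustrative):

```python
# Hedged sketch of the rank binning in equation 2.1: candidates are
# binned by objective loss relative to the oracle-best score, so that
# pairwise rank comparisons concentrate on near-optimal candidates.

def assign_ranks(scores, eps):
    """scores: objective values (higher is better) for one n-best list."""
    best = max(scores)                  # oracle-best objective score
    ranks = []
    for s in scores:
        loss = abs(best - s)            # objective loss vs. the oracle
        if loss <= eps:
            ranks.append(1)
        elif loss <= 2 * eps:
            ranks.append(2)
        else:
            ranks.append(3)
    return ranks
```

Because ranks 1 and 2 are kept small by tuning eps, the number of cross-rank pairs the ranker must consider grows only linearly in the list size.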
Figure 2.2: The models that contribute to ASR.
2.3 Automatic speech recognition
Automatic speech recognition (ASR) is the process of automatically creating a transcript
word sequence w1 . . . wn from recorded speech waveform α. The literature in this discipline
is enormous, and the survey here skims only the surface, to orient the reader to the basic
models in play in state-of-the-art systems in ASR and to provide context for the contribu-
tions of this work.
2.3.1 A schematic summary of ASR
Speech recognition systems are constructed from multiple models. As illustrated in fig-
ure 2.2, the usual expression of these models (e.g. in the SRI large-vocabulary speech recog-
nizer [Stolcke et al., 2006]) is as a combination of multiple generative models, which operate
together to score possible hypotheses that are pruned down to a list of the top n word
sequence hypotheses. In large vocabulary systems, the resulting list is typically re-scored
by discriminative components that reorder that list.
Among the generative models, acoustic models pam(α|φ) provide a score of acoustic
features of speech α (typically cepstral vectors) given pronunciation φ; pronunciation models
provide a score ppm(φ|w) of pronunciation-representation φ given word w; and language
models (LMs, e.g. Stolcke [2002]; see Goodman [2001]) give a score plm(w1, · · · , wn) of the
word sequence w1, · · · , wn. In decoding, all three of the models described above operate
on a relatively small local window: pam(·) uses phone-level contexts, ppm(·) uses the word
in isolation or with its immediate neighbors, and plm(·) most often uses n-gram Markov
assumptions, computing word sequence likelihoods from only the most-recent n− 1 words.
The most typical value for n is three, also known as a “trigram” model, and n rarely exceeds
four or five, due to the exponential growth in storage costs with n.
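The trigram Markov assumption can be sketched with a maximum-likelihood estimate; real systems add smoothing (e.g. Kneser-Ney), and the count tables here are toy assumptions.

```python
# Hedged sketch of an n-gram (trigram) language model score: the Markov
# assumption conditions each word on only the previous n-1 = 2 words.
# Unsmoothed maximum-likelihood estimates, for illustration only.
import math

def trigram_logprob(words, trigram_counts, bigram_counts):
    padded = ["<s>", "<s>"] + list(words) + ["</s>"]
    total = 0.0
    for i in range(2, len(padded)):
        history = (padded[i - 2], padded[i - 1])
        # p(w_i | w_{i-2}, w_{i-1}) = count(trigram) / count(bigram)
        total += math.log(trigram_counts[history + (padded[i],)]
                          / bigram_counts[history])
    return total
```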
The rescoring component F (α, φ,w1, · · · , wn), by contrast, may use all of the above
scores and also extracts additional features of an utterance- or sentence-length hypothesis
from any of the values mentioned above for use in re-ordering the n-best list. Even with-
out the feature-extraction F (·), the rescoring component may change the relative weight
of the contribution of the upstream models, but F (·) is often used to extract long-distance
(non-local) features that would be expensive or impossible to extract in the local-context
decoding that the other models provide. An exhaustive survey of prior work using rerank-
ing to capture non-local information in ASR is impractical, but the sorts of long-distance
information exploited include topic information, as in Iyer et al. [1994] or more recently
Naptali et al. [2010], or trigger information [Singh-Miller and Collins, 2007]. These model
long-distance effects from as far away as other sentences (or speakers!) in the same dis-
course, not with a syntactic model but with various approaches that cue the activation of
a different vocabulary subset. Another application of reranking operates by adjusting the
output of the generative model to focus on the specific error measure, as in e.g. Roark et al.
[2007]. Further discussion of the use of syntactic information in language-model rescoring
may be found in section 2.3.3.
2.3.2 Evaluation of ASR
Evaluation — and optimization — of speech recognition and its components are carried out
with word error rate (WER), a measure that treats words (or characters) equally, regardless
of their potential impact on a downstream application, as discussed in section 1.1; for
example, function words are given equal weight with content words. One exception is that
filled-pauses are, in some evaluations, e.g. GALE [DARPA, 2008], optionally inserted or
deleted without cost when evaluating speech.
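WER itself is the word-level Levenshtein distance (substitutions, insertions, and deletions) normalized by reference length; a minimal sketch:

```python
# Hedged sketch of word error rate (WER): minimum edit distance between
# hypothesis and reference word sequences, divided by reference length.

def wer(hyp, ref):
    # d[i][j]: edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                      # all deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                      # all insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / len(ref)
```

Note that every word counts equally: a dropped function word and a dropped content word each cost exactly one error.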
A few larger projects that include ASR as a component have suggested extrinsic evalua-
tion methods: in dialog systems, for example, ASR performance is evaluated along with the
other components with a measure of action accuracy (e.g. in Walker et al. [1997] and Lamel
et al. [2000]). In the 2005 Summer Workshop on Parsing Speech [Harper et al., 2005], speech
recognition was evaluated in the extrinsic context of a downstream parser, but only a sin-
gle transcription hypothesis was used. Al-Onaizan and Mangu [2007] explored adjustments
to ASR hypothesis selection in an ASR-to-MT pipeline to allow relatively more insertions
(keeping the WER constant), but found that this made little difference in automatically-
evaluated MT performance.
As an alternative to evaluating ASR with WER or evaluating it directly in the context
of a downstream task, one may instead choose to optimize the ASR towards an improved
form of some intermediate representation (neither the immediate word sequence nor a fully-
extrinsic representation). Hillard et al. [2008], for example, experimented with selecting for
high-SParseval Chinese character-sequences for a downstream Chinese-to-English SMT
system (instead of selecting low character error rate (CER) hypotheses). In follow-up work,
Hillard [2008] found improvement on the automatic SMT measures for unstructured (broad-
cast conversation) genres of speech, though not for structured speech (broadcast news).
Additionally, they found that SParseval measurements of source-language transcription
were better correlated with human assessment of MT performance in the target language
than CER measurements. Intrinsic measures for ASR, however, are almost entirely limited
to WER or its simpler alternative for Chinese, CER.
Chapter 3, which uses parse decoration to rerank ASR transcription hypotheses, evalu-
ates ASR with WER and also with the SParseval parse-quality measure.
2.3.3 Parsing in ASR
Efforts to include parsing information in ASR systems have used the parser as an extra in-
formation source for selecting word sequences in speech recognition. This section highlights
a few parser-based language-models and reranking models that have been used in ASR,
demonstrating improvements in both perplexity and WER over n-gram LM baselines.
The structured language model [Chelba and Jelinek, 2000] is a shift-reduce parser that
conditions probabilities on preceding headwords. When interpolated with an n-gram model,
it achieved small improvements in WER on read speech from the Wall Street Journal corpus
and on conversational telephone speech from the Switchboard [Godfrey et al., 1992] corpus.
The top-down PCFG parser used by Roark [2001] achieved somewhat larger improvements
over a trigram on the same set (though the baseline it was compared to was worse than the
baseline in Chelba and Jelinek [2000]). Charniak [2001] implemented a top-down PCFG
parser that conditions probabilities on the labels and lexical heads of a constituent and
its parent. In this model, the probability of a parse is modeled as the product of the
conditional distributions of various structural factors. In contrast to both the models in
Chelba and Jelinek [2000] and Roark [2001], most of these factors are conditioned on the
identity of at most one other lexical item in the tree. This relative reliance on structure
(over lexical identity) makes this model distinctly un-trigram-like. This model gets a lower
perplexity than both the Structured Language Model and Roark’s model on Wall Street
Journal treebank text.
While the details of the parsing algorithms and probability models of the above models
vary, all are fundamentally some kind of PCFG. A non-CFG syntactic language model that
has been used for speech recognition is the SuperARV model, or “almost parsing language
model” [Wang and Harper, 2002], which calculates the joint probability of a string of words
and their corresponding super abstract role values. These values are tags containing part of
speech, semantic and syntactic information. The SuperARV got better perplexity and WER
results than both a baseline trigram and the Chelba and Jelinek [2000] and Roark [2001]
language models, for a variety of read Wall Street Journal corpora. It also out-performed a
state-of-the-art 4-gram interpolated word- and class-based language model on the DARPA
RT-02 conversational telephone speech evaluation data [Wang et al., 2004].
Filimonov and Harper [2009] introduce a generalization and extension of the SuperARV
tagging model in a joint language modeling framework that uses very large sets of “tags”;
when the tag set includes automatically-induced syntactic information, the model is
competitive with SuperARV performance on both perplexity and WER measures while
requiring less complex linguistic knowledge.
One challenge for combining parsing with ASR is that parsing is ordinarily performed
over well-formed, complete sentences, while automatic segmentation of ASR is difficult,
especially in conversational speech (where even a correct segmentation may not be a syn-
tactically well-formed sentence). Parse models of language do not perform as well on poorly-
segmented text [Kahn et al., 2004, Kahn, 2005, Harper et al., 2005]. In chapter 3, this work
goes into more depth regarding the impact of different methods of automatic segmentation
on the utility of parse decorations and success of parsing.
2.4 Statistical machine translation
Statistical machine translation (SMT) is the process of automatically creating a target-
language word sequence e1 . . . eE from a source-language word sequence f1 . . . fF . There are
non-statistical approaches to this task, e.g. the LOGON Norwegian-English MT project
[Lønning et al., 2004], but these are not the subject of this research. This section offers
an overview of the state-of-the-art in statistical machine translation, identifying the core
models and techniques that are used, the mechanisms for automatic evaluation, and where
syntactic structures are already in use.
2.4.1 A schematic summary of SMT
In SMT based on the IBM models [Brown et al., 1990] and their successors, candidate
translations are understood to be made up of the source words f , the target words e, and also
the alignment a between source and target words. The contributing components are broken
down in a noisy-channel model: a language model plm(e) scores the quality of the target
word sequence; a reordering model prm(a|e) assigns a penalty for the “reordering” performed
[Figure 2.3 here: a word alignment a linking the source sentence f = “Je n’ aime pas fromage bleu .” to the target sentence e = “I don’t like blue cheese .”]
Figure 2.3: Word alignment between e and f . Each alignment link in a represents a corre-
spondence between one word in e and one word in f . There is no guarantee that e and f
are the same length (i.e., that E = F ).
by the alignment, and the translation model ptm(f |e, a) provides a score for pairing source-
language word (or word-group) f with target-language word (or word-groups) e according
to alignment a. This approach formulates the translation decoding process as a search over
words (e) and alignments (a), which is typically approximated as:
$$\arg\max_{e}\, p(e \mid a, f) \approx \arg\max_{e}\; p_{tm}(f \mid e, a)\; p_{rm}(a \mid e)\; p_{lm}(e) \qquad (2.2)$$
Most current approaches to decoding do not actually use this generative model, but instead
a weighted combination of multiple ptm(·) translation models including both ptm(f |e, a) and
ptm−1(e|f, a), which lack the well-formed noisy-channel generative structure of equation 2.2
above but seem to work better in practice [Och, 2003].
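A sketch of this weighted combination: the decoder scores each candidate by a weighted sum of log model scores and keeps the best. The feature names below are illustrative assumptions, not a particular decoder's configuration.

```python
# Hedged sketch of the weighted (log-linear) model combination used in
# practice instead of the pure noisy-channel product of equation 2.2:
# pick the candidate maximizing a weighted sum of log model scores,
# with weights tuned as in Och [2003].

def best_candidate(candidates, weights):
    """candidates: list of dicts of log-scores, e.g.
    {"tm": log p(f|e,a), "tm_inv": log p(e|f,a), "lm": log p(e)}."""
    def score(cand):
        return sum(weights[name] * logp for name, logp in cand.items())
    return max(candidates, key=score)
```

With all weights equal to one and only the three noisy-channel features, this reduces to equation 2.2; the extra features and tuned weights are what depart from the generative story.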
Training the ptm(·) and prm(·) models requires many parallel sentences with alignments
between source and target words, of the form suggested in figure 2.3. Alignments, like parse
structure, are rarely annotated over large amounts of parallel text. The approach offered
by the IBM models and their descendants is to bootstrap alignment and translation models
from bitexts (corpora of parallel sentences). In general, the training of these alignment
and translation models is iterated in a bootstrap process. This bootstrap process, as implemented in popular word-alignment tools [e.g. GIZA++: Och and Ney, 2003], begins
with simple, tractable models for prm(·) and ptm(·) and, as the models improve, trains more
sophisticated reordering and translation models. Later models are initialized from the alignments hypothesized by earlier iterations. The language model plm(e) does not participate in
this phase of the training: in a bitext, predicting plm(e) is not helpful; language models are
usually trained separately, using monolingual text. As a byproduct of the parameter-search
to improve these models, the GIZA++ toolkit produces a best alignment linking each word
in e to words in f .
Other tools exist for generating alignments (such as the Berkeley aligner [DeNero and
Klein, 2007]) and there is substantial discussion over how to evaluate and improve the
quality of these alignments. Review of this discussion is passed over here; we will return to
this literature in chapter 5.
Typical independence assumptions in the word-alignment models constrain them to word
sequence and adjacency, applying a penalty for moving words into a different order in
translation. These models for reordering penalties are usually very simple, and do not
incorporate any notion of parse decoration — instead, they assign monotonically-increasing
penalties for moving words in translation. For example, Vogel et al. [1996] uses a hidden
Markov model (HMM, derived from only sequence information) to assign a prm(·) reordering
model. Language-models in translation are also generally sequence-driven: ASR’s basic n-
gram language-modeling approach serves as an excellent baseline to model plm(e) in MT
work. Early stages in the training bootstrapping sometimes ignore even word sequence
information: GIZA++’s “Model 1” treats prm(·) as uniform and ptm(·) as independent of
adjacency information (dependent only on the alignment links themselves).
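A monotonically-increasing movement penalty of the kind described above can be sketched as a linear distortion cost; the functional form and constant here are illustrative, since actual systems tune this weight:

```python
def distortion_log_penalty(prev_end, next_start, alpha=-1.0):
    """Distance-based reordering penalty, alpha * |jump|, in log space.

    Monotone translation (next_start == prev_end + 1) incurs no penalty;
    larger jumps in either direction are penalized linearly."""
    return alpha * abs(next_start - prev_end - 1)

# A monotone step costs nothing; a five-word jump is heavily penalized.
monotone = distortion_log_penalty(prev_end=3, next_start=4)
long_jump = distortion_log_penalty(prev_end=3, next_start=9)
```

Under such a model, the long-distance reorderings needed for Chinese-English or Arabic-English translation are systematically disfavored regardless of their syntactic plausibility.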
For language-pairs like French-English, where word-order is largely similar, the local-
movement penalties of these simple prm(·) models usefully constrain the search space of
possible translations to those without large re-ordering: the language- and translation-model
scores will correctly handle any necessary small, local reorderings. For other language-pairs
(e.g., Chinese-English or Arabic-English), though, long-distance re-orderings are necessary,
and these models must assign a small penalty to long-distance movement, which leads to
an explosion in the search space (and a corresponding loss in translation quality).
Having bootstrapped from bitext to word-based alignments, many SMT systems (e.g.
Pharaoh [Koehn et al., 2003] and its open-source successor Moses [Koehn et al., 2007]) take
the bootstrapping farther by automatically extracting a “phrase table” from the aligned text.

Figure 2.4: The models that make up statistical machine translation systems

These “phrase-based”2 systems treat aligned chunks of adjacent words as a sort of
translation memory (the “phrase table”) which incorporates local reordering and context-
aware translations into the translation model. Entries in the phrase table are discovered
from the aligned bitext by heuristic selection of observed aligned spans. For some “phrase-
based” systems, such as Hiero [Chiang, 2005], the span-discovery (and decoding) may even
allow nested “phrases” with discontinuities.
Statistical machine translation systems thus, like ASR, use multiple models which con-
tribute together to generate (or “decode”) a scored list of possible hypotheses, as suggested
in the top half of figure 2.4. The “phrase table” incorporates some aspects of alignment
and translation models, but even when phrase tables are quite sophisticated, choosing and
assembling these phrases at run-time usually requires additional translation and alignment
models, even if only to assign appropriate penalties to the assembly of phrase-table entries.
2“Phrase-based” SMT systems use the term “phrase” to refer to a sequence of adjacent words; these do not have any guarantee of relating to a syntactic or otherwise linguistically-recognizable phrase. Xia and McCord [2004] use “chunk-based” to refer to these systems, but this expression has not been widely adopted. This work uses “phrase-based” for consistency with the literature (which describes “phrase-based” SMT and “phrase tables”), despite the infelicity of the expression.

The n-best list generated by the decoder is typically re-scored using a discriminative re-ranking component, as outlined in figure 2.4, that takes into account the language-model,
translation-model, alignment-model and phrase-table scores already mentioned, and may
also incorporate additional features F (a, e, f) that are difficult to include in the decoding
process that generates the original n translation hypotheses. The re-ranking component
relies on an automatic measure of translation quality which is computable without human
intervention for a given hypothesis translation and one or more reference translations.
2.4.2 Evaluation measures for MT
The development of reliable automatic measures for optimization has changed the field of
statistical machine translation, by allowing the discriminative training of rescoring and re-
weighting models, such as minimum error rate training [MERT: Och, 2003], and by providing
a shared measure for success.
In MT, evaluation is a complex process, in large part because two (human) translators
asked to perform the same translation task may quite ordinarily produce very different re-
sulting strings. The challenge of accounting for allowable variability is not shared with ASR;
in ASR, two human transcribers will usually agree on most of the transcription. Instead of a
string match to a reference translation, human-assessed measures of translation quality are
traditionally broken into separate scales of fluency and adequacy to assess system quality
(whether translations are performed by human or machine) [King, 1996, LDC, 2005]. Of
course, fluency and adequacy judgements cannot be performed without a human evalua-
tor.3 Comparing system translations to reference translations allows monolingual assessors,
which reduces the cost by increasing the available pool of assessors. In many evaluations,
automatic measures compare automatic translations to these reference translations; these
automatic measures have the virtue of removing annotator variability from the evaluation
and further reducing the labor costs of assessing the system translations. For optimization
purposes (such as the MERT models and discriminative re-ranking described above), a mea-
sure that operates without human intervention is required, because the rescoring models
3One might think that fluency and adequacy judgements require a bilingual evaluator as well, but for evaluating MT quality, a monolingual (in the target language) evaluator can compare machine and reference translations of the same text to report these judgements.
operate over hundreds (or thousands!) of sample translations of the same sentence.
The two most popular automatic metrics are BLEU [Papineni et al., 2002], a measure of n-gram precision, and the TER [Snover et al., 2006] edit distance. BLEU remains the most widely-reported measure of translation quality against a reference translation (or set of reference translations). BLEU is a geometric mean of precisions over varying n-gram lengths:
BLEUn(h; r) = ( ∏i=1..n πi(h; r) )^(1/n) · BP(h, r)   (2.3)
where πi(h; r) reflects the precision of the i-grams in hypothesis h with respect to reference
r, and the term BP(h, r) is a “brevity penalty” to discourage the production of extremely
short (low-recall, high-precision) translations:
BP(h, r) = exp(1 − |r|/|h|) if |h| < |r|;  1 if |h| ≥ |r|
Most results are reported with BLEU4.
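Equation 2.3 can be sketched as follows. This simplified version handles a single reference and clips repeated n-gram matches, but omits the multi-reference matching and smoothing found in standard BLEU implementations, so it is illustrative rather than a faithful reimplementation:

```python
import math
from collections import Counter

def ngram_precision(hyp, ref, n):
    """Clipped fraction of hypothesis n-grams also present in the reference."""
    hyp_ngrams = Counter(tuple(hyp[k:k+n]) for k in range(len(hyp) - n + 1))
    ref_ngrams = Counter(tuple(ref[k:k+n]) for k in range(len(ref) - n + 1))
    matched = sum(min(c, ref_ngrams[g]) for g, c in hyp_ngrams.items())
    total = sum(hyp_ngrams.values())
    return matched / total if total else 0.0

def bleu(hyp, ref, n=4):
    """Geometric mean of 1..n-gram precisions times the brevity penalty."""
    precisions = [ngram_precision(hyp, ref, i) for i in range(1, n + 1)]
    if min(precisions) == 0.0:
        return 0.0  # the geometric mean collapses if any precision is zero
    bp = math.exp(1 - len(ref) / len(hyp)) if len(hyp) < len(ref) else 1.0
    return math.exp(sum(math.log(p) for p in precisions) / n) * bp
```

A hypothesis identical to its reference scores 1.0; one sharing no unigrams scores 0.0.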
Translation Edit Rate (TER) is an error measure like WER, which measures the operations required to transform hypothesis h into reference r:

TER(h; r) = [insertions(h; r) + deletions(h; r) + substitutions(h; r) + shifts(h; r)] / length(r)   (2.4)
where insertions, deletions and substitutions count one per word, while shift operations
move any adjacent sequence of words from one position in h to another. Insertion, deletion,
substitution and shift error counts are calculated through an alignment between reference
and hypothesis that heuristically minimizes the total number of operations needed.
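Setting aside the shift operation (whose presence is why the standard tool must search heuristically rather than exactly), the remaining operations in equation 2.4 reduce to a Levenshtein edit rate, which can be sketched as:

```python
def edit_rate(hyp, ref):
    """Insertions + deletions + substitutions divided by reference length.

    This is WER-style Levenshtein distance; true TER additionally allows
    block shifts, which the standard tool finds heuristically."""
    d = [[0] * (len(ref) + 1) for _ in range(len(hyp) + 1)]
    for i in range(len(hyp) + 1):
        d[i][0] = i
    for j in range(len(ref) + 1):
        d[0][j] = j
    for i in range(1, len(hyp) + 1):
        for j in range(1, len(ref) + 1):
            sub = 0 if hyp[i - 1] == ref[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,      # deletion from hypothesis
                          d[i][j - 1] + 1,      # insertion
                          d[i - 1][j - 1] + sub)  # match or substitution
    return d[len(hyp)][len(ref)] / len(ref)
```

Like TER itself, this quantity can exceed 1.0 when the hypothesis requires more edits than the reference has words.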
When working with multiple references, BLEU4 is defined so that its n-grams may match
those in any of the references, allowing translation variability across the multiple references,
but TER’s approach to multiple references is just to return the minimum edit ratio over
the set of references, which is less forgiving to the candidate translation.
Like word error rate for ASR, the BLEU and TER metrics use no syntactic or argument-
structure modeling to determine which words matter more: all words are treated equally.
In TER, substituting or shifting a single word incurs the same cost regardless of where the
substitution or shift happens; in BLEU, all hypothesis n-grams contribute equally to the
score of the sentence. Because of the emphasis on these automatic measures, innovations in MT have often been assessed by their effects on these measures directly, sometimes to the point of reporting only one of these entirely automatic measures.
Some have expressed skepticism about the focus on the BLEU and TER automatic measures, on theoretical [Callison-Burch, 2006] and empirical [Charniak et al., 2003] grounds, in
that they do not always accurately track translation quality as judged by a human annota-
tor, and they may not even reliably separate professional from machine translations [Culy
and Riehemann, 2003]. Other automatic MT measures have been proposed, some of which
use parse decorations. Chapter 4 describes some of these alternatives in more detail.
An ideal automatic measure would correlate well with human judgements of translation
quality. However, judgements of fluency and adequacy themselves are highly variable across
annotators. Rather than correlate with these measurements, one may instead examine the
correlation with a different human-derived measure of translation quality: Snover et al.
[2006] propose Human-targeted Translation Edit Rate (HTER), a measurement of the work
performed by a human editor to correct the translation until it is equivalent to the reference
translation. They show that a single HTER score is very well-correlated to fluency/adequacy
judgements, and has lower variance: they find that a single HTER score is more predictive of
a held-out fluency/adequacy judgement than a single fluency/adequacy judgement. HTER
still requires human intervention, but, probably because of its consistency in evaluation,
it has been adopted as the evaluation standard for the DARPA GALE project [DARPA,
2008].
2.4.3 Parsing in MT
Early applications of syntactic structure to SMT were explored as an
alternative to the phrase-table approach. Yamada and Knight [2001] and Gildea [2003] in-
corporate operations on a treebank-trained target-language parse tree to represent p(f |a, e)
and p(a|e), but have no “phrase” component; Charniak et al. [2003] apply grammatical
structure to the p(e) language-model component. These approaches met with only moderate success.
Rather than building a syntactic model into the decoder or language model, others proposed automatically learned [Xia and McCord, 2004, Costa-jussa and Fonollosa, 2006] and manually coded [Collins et al., 2005a, Popovic and Ney, 2006] transformations on source-language
trees, to reorder source sentences from f to f ′ before training or decoding (translation
models are trained on bitexts with f ′ and e). Zhang et al. [2007] extend this approach
by inserting an explicit source-to-source “pre-reordering” model pr0(f′|f) to provide lattice
input alternatives to the main translation.
The phrase-table models described in section 2.4.1 capture some local syntactic struc-
ture — even when the phrases are simply reliably-adjacent word-sequences — by virtue
of recording actually-observed n-grams in the source- and target-language sequences, but
these models offer additional power when they are made syntactically aware. Syntactically-
aware decoders are united with the phrase-table approach in such approaches as the ISI
systems [Galley et al., 2004, 2006, Marcu et al., 2006], the systems built by Zollmann et al.
[2007], and recently the Joshua open-source project [Li et al., 2009]. Each of these builds
syntactic trees over the target side of the bitext in training and learns phrase-table entries
with syntactically-labeled spans. Conversely, Quirk et al. [2005] and Xiong et al. [2007]
construct phrase-table entries using source-language dependency structure, while Liu et al.
[2006a] apply a similar technique using constituent structure instead of dependencies.
Rather than pursue these phrase-table based decoder models directly, chapter 5 of this
work explores mechanisms to use parsers to improve the word-to-word alignments that are
the material from which the phrases are learned.
2.5 Summary
This chapter has provided an overview of four key technologies for the remainder of this
work: statistical parsing, n-best list reranking, automatic speech recognition, and statistical
machine translation. Special attention is paid to the interaction of parsers with speech
recognition, the evaluation of speech recognition and machine translation, and the existing
roles of syntactic structure in statistical machine translation. The next three chapters
use parsers (and rerankers) in various combinations on conversational speech recognition
(chapter 3), machine translation evaluation (chapter 4), and on improving word alignment quality for machine translation (chapter 5). Further details on work more directly related to this thesis are provided in each chapter.
Chapter 3
PARSING SPEECH
Parse-decoration on the word sequence has a strong potential for application in the
domain of automatic speech recognition (ASR). Extracting syntactic structure from speech
is more challenging than ASR or parsing alone, because the combination of these two stages
introduces the potential for cascading error, and most parsing systems assume that the leaves
(words) of the syntactic tree are fixed. This chapter1 applies parse structure as an additional
knowledge source, even when the evaluation targets do not include parse structure explicitly.
It also considers the benefits to parsing of considering alternative speech transcripts (when
the evaluation targets are parse measures themselves).
We thus consider recognition and parsing as a joint reranking problem, with uncertainty
(in the form of multiple hypotheses) in both the recognizer and parser components. In this
joint problem, there are two possible targets: word sequence quality, measured by word
error rate (WER), and parse quality, measured over speech transcripts by SParseval. For
both these targets, sentence boundary concerns have largely been ignored in prior work:
speech recognition research has generally assumed that sentence boundaries do not have
a major impact, since the placement of segment boundaries in a string does not affect
WER on that string. Parsing research, on the other hand, has generally assumed that
sentence boundaries are given (usually by punctuation), since most parsing research has
been on text. Spoken language, unlike written language, does not have explicit markers for
sentence and paragraph breaks; i.e., punctuation is not verbalized. Sentence boundaries in
spoken corpora must therefore be automatically recognized, introducing another source of
difficulty for the joint recognition-and-parsing problem, regardless of the target: sentence
segmentation.
1The work presented in this chapter is included in a paper that has been accepted to Computer Speech and Language.
Although there has been a substantial amount of research on speech recognition, seg-
mentation of spoken language, and parsing (as described in the next section), there has
been little work exploring automation of all three together. Most research has incorporated
only one or two of these areas, typically treating recognition and parsing as separable pro-
cesses. In this chapter, we combine recognition and parsing using discriminative reranking:
selecting optimal word sequences from the N -best word sequences generated from a speech
recognizer given cues from M parses for each, and selecting optimal parse structure from the
N ×M -best parse structures associated with these word sequences. At the same time, we
explore the impact of automatic segmentation. We ask the following inter-related questions:
• In the task of extracting parse structure from conversational speech, how much can
we improve performance by exploiting the uncertainty of the speech recognizer?
• In the word recognition task, does a discriminative syntactic language model benefit
from incorporating parse uncertainty in parse feature extraction?
• How does segmentation affect the usefulness of parse information for improving speech
recognition, and what is its impact on parsing accuracy, given alternative word se-
quences and alternative parse hypotheses?
Section 3.1 discusses the relevant background for this research integrating speech segmen-
tation, parsing, and speech recognition. Section 3.2 outlines the experimental framework in
which this chapter explores those questions, while section 3.3 describes the corpus and the
configuration of the various components of this system. Section 3.4 describes the results of
those experiments, and section 3.5 discusses these results in the context of the dissertation
as a whole.
3.1 Background
Our approach to parsing conversational speech builds on several active research areas in
speech and natural language processing. This section extends the review from chapter 2 to
highlight the prior work most related to the work in this chapter.
3.1.1 Parsing on speech and its evaluation
As discussed in section 2.1.4, most parsing research has been conducted with the parseval metric [Black et al., 1991], which was initially developed for parse measurement on text.
It was used in initial studies of speech based on reference transcripts (without considering
speech recognizer errors). The grammatical structures of speech are different from those of
text: for example, Charniak and Johnson [2001] demonstrated the usefulness (as measured
by parseval) of explicit modeling of edit regions in parsing transcripts of conversational
speech.
Unfortunately, parseval is not well-suited to evaluating parses of automatically-recognized
speech. In particular, when the words (leaves) are different between reference and hypoth-
esized trees (as will be the case when there are recognition errors), it is difficult to say
whether a particular span is included in both, and the parseval measure is not well de-
fined. Roark et al. [2006] introduce alternative scoring methods to address this problem
with SParseval, a parse evaluation toolkit. The SParseval method used here takes into
account dependency relationships among words instead of spans. Specifically, CFG trees
are converted into dependency trees using a head-finding algorithm and head percolation of
the words at the leaves. Each dependency tree is treated as a bag of triples 〈d, r, h〉 where
d is the dependent word, r is a symbol describing the relation, and h is the dominating
lexical headword (central content word in the phrase). Arc-labels r are determined from
the highest constituent label in the dependent and the lowest constituent label dominating
the dependent and the head. SParseval describes the overlap between the “gold” and
hypothesized bags-of-triples in terms of precision, recall and F measure.
Overall, SParseval allows a principled incorporation of both word accuracy and accu-
racy of parse relationships. Since every triple (the dependency-pair and its link label, as
in figure 3.1) involves two words, this measure depends heavily on word accuracy, but in a
more complex way than word error rate, the standard speech recognition evaluation met-
ric. Figure 3.1 demonstrates a number of properties of the SParseval measure. Although
both (b) and (c) have the same word error (one substitution each), they have very different
precision and recall behavior. As the figure suggests, the SParseval measure over-weights
[Figure 3.1 trees not reproduced here. Reference triples (a), for “I really think so”: (I, S/NP, think), (really, VP/AdvP, think), (think, <s>/S, <s>), (so, VP/AdvP, think). Hypothesis (b), “I really think yeah”, replaces (so, VP/AdvP, think) with (yeah, S/DM, think): Precision = 3/4, Recall = 3/4, Word Error Rate = 1/4. Hypothesis (c), “I really sink so”, substitutes sink for think throughout, so no triples match: Precision = 0/4, Recall = 0/4, Word Error Rate = 1/4.]

Figure 3.1: A SParseval example that includes a reference tree (a) and two hypothesized trees (b,c) with alternative word sequences. Each tree lists the dependency triples that it contains; bold triples in the hypothesized trees indicate triples that overlap with the reference tree. Although all have the same parse structure, tree (c) is penalized more heavily (no triples right) because it gets the head word think wrong.
“key” words, making SParseval a joint measure of word sequence and parse quality. All
words appear exactly once in the left (dependent) side of the triple, but only the heads of
phrases appear on the right. Thus, those words that are the lexical heads of many other
words (such as think in the figure) are multiply-weighted by this measure. Head words are
multiply weighted because getting head words wrong impacts not only the triples where
that head word is dependent on some other token, but also the triples where some other
word depends on that head word. Non-head words are not involved in so many triples.
In this work, we use SParseval as our measure of parse quality for parses produced over
speech recognition transcription hypotheses.
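The bag-of-triples comparison underlying SParseval can be sketched directly; this is a minimal version (the actual toolkit also performs head-finding and head percolation), using the reference triples and hypothesis (c) from figure 3.1:

```python
from collections import Counter

def sparseval_prf(gold_triples, hyp_triples):
    """Precision, recall, F over bags of (dependent, relation, head) triples."""
    gold, hyp = Counter(gold_triples), Counter(hyp_triples)
    overlap = sum((gold & hyp).values())  # multiset intersection
    p = overlap / sum(hyp.values()) if hyp else 0.0
    r = overlap / sum(gold.values()) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

# Reference and hypothesis (c) from figure 3.1: the head-word error on
# "think"/"sink" breaks every triple, giving precision = recall = 0.
ref = [("I", "S/NP", "think"), ("really", "VP/AdvP", "think"),
       ("think", "<s>/S", "<s>"), ("so", "VP/AdvP", "think")]
hyp_c = [("I", "S/NP", "sink"), ("really", "VP/AdvP", "sink"),
         ("sink", "<s>/S", "<s>"), ("so", "VP/AdvP", "sink")]
p, r, f = sparseval_prf(ref, hyp_c)
```

This makes concrete the head-word multiplication discussed above: a single substitution on a head word can zero out every triple in the bag.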
3.1.2 Speech segmentation
Speech transcripts offer another challenge for parsing, whether used as an evaluation or
a knowledge source. We showed that parser performance (as measured by an adapted
parseval) degrades significantly when using automatically-detected (rather than reference)
sentence boundaries [Kahn et al., 2004, Kahn, 2005], even when the speech transcripts are
entirely accurate.
Extending the same lines of research, Harper et al. [2005] used SParseval to assess the
impact of automatic segmentation on parse quality, but using automatic word transcrip-
tions as well. Their work focuses on selecting segmentations from a fixed word sequence and
providing a top-choice parse for each of those segments. As we [Kahn et al., 2004] previ-
ously found on reference transcripts, they show a negative impact of segmentation error on
ASR hypothesis transcripts, and further show that optimizing for minimum segmentation
error does not lead to the best parsing performance. Parser performance, rather, benefits
more from higher segment-boundary recall (i.e., shorter segments). They do not, however,
consider alternative speech recognition hypotheses, which is an important focus of the work
in this chapter.
Though choosing a different segmentation does not affect the WER measure (as it would for SParseval), choosing alternate segmentations affects ASR language modeling, even in the absence of a parsing language model, because even n-gram language models
assume that segment-boundary conditions are matched between LM training and test: Stolcke [1997] demonstrated that adjusting pause-based ASR n-best lists to take into account
segment boundaries matched to language model training data gave reductions in word error
rate.
3.1.3 Parse features in reranking
Section 2.3.3 discussed general approaches to using parsing as a language model, including
parsing language-models like Chelba and Jelinek [2000] and Roark [2001]. Reranking, as
discussed in section 2.2, is applied to parsers [Collins and Koo, 2005] but also to language-
modeling for ASR, with [e.g., Collins et al., 2005b] and without [Roark et al., 2007] parse
features.
Collins et al. [2005b] perform discriminative reranking using features of the parse structure
extracted from a single-best parse of the English ASR hypothesis. Arisoy et al. [2010] used
a similar strategy for Turkish language modeling. In both cases, the objective was the
minimization of WER. Harper et al. [2005] and others, as mentioned above, use reranking
with the parsing objective over automatic speech transcripts. However, neither the language-modeling work using syntax nor the parsing work using automatic speech
transcripts considers the variable hypotheses of both the speech recognizer and the parser in
a reranking context. Using both variables together is the approach pursued in this chapter.
3.2 Architecture
The system for handling conversational speech presented in this chapter is illustrated
schematically in figure 3.2 and involves the following steps:
1. a speech recognizer, which generates speech recognition lattices with associated
probabilities from an audio segment (here, a conversation side);
2. a segmenter which detects sentence-like segment boundaries E, given the top word
hypothesis from the recognizer and prosodic features from the audio;
Figure 3.2: System architecture at test time.
3. a resegmenter which applies the segment boundaries E to confusion networks de-
rived from the lattices and generates an N -best word hypothesis cohort W s for each
segment s, made up of word sequences wi with associated recognizer posteriors pw(wi)
for each of the N sequences wi ∈W s;
4. a parser component which generates an M -best list of parses ti,j , j = 1, . . . ,M ,
for each wi ∈ W s, along with confidences pp(ti,j , wi) for each parse over each word
sequence (all the ti,j for a given segment s make up the parse cohort T s)
5. a feature extractor which extracts a vector of descriptive features fi,j over each
member of the parse structure cohort which together make up the feature cohort F s;
and
6. a reranker component which selects an optimal vector of features (and thus a preferred candidate) from the cohort and effectively chooses an optimal 〈w, t〉, which maximizes performance with respect to some objective function on the selected candidate and the reference word transcripts and parse-tree.

Figure 3.3: n-best resegmentation using confusion networks
In the remainder of this section we describe the components created for this joint-problem
architecture: the resegmenter (step 3), the features chosen in the feature extractor (step 5),
and the re-ranker itself (step 6). We describe the details of each component’s configuration
in section 3.3.2.
3.2.1 Resegmentation
This chapter compares multiple segmentations of the word stream, including the ASR-
standard pause-based segmentation, reference sentence boundaries, and two cases of auto-
matically detected sentence-like units. Since the recognizer output is based on pause-based
segmentation, a resegmenter (step 3) is needed to generate N-best hypotheses for the al-
ternative segmentations, taking recognizer word lattices and a hypothesized segmentation
as input. The resegmentation strategy is depicted in Figure 3.3. First, the lattices from
step 1 are converted into confusion networks, a compact version of lattices which consist of a sequence of word slots, where each slot contains a list of word hypotheses
with associated posterior probabilities [Mangu et al., 2000]. Because the slots are linearly
ordered, they can be cut and rejoined at any inter-slot boundary. All the confusion net-
works for a single conversation side are concatenated. Speaker diarization (the relationship
between this conversation side and the transcription of the interlocutor) is not varied. The
concatenated confusion network is then cut at locations corresponding to the hypothesized
segment boundaries, producing a segmented confusion network. Each candidate segmenta-
tion produces a different re-cut confusion network.
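The cut-and-rejoin operation described above amounts to slicing the concatenated slot sequence at the hypothesized boundary indices. A minimal sketch (the slots here are opaque placeholders; real confusion-network slots carry word/posterior lists):

```python
def resegment(slots, boundaries):
    """Cut a concatenated confusion network (a list of slots) at the given
    inter-slot boundary indices, returning one slot-list per segment."""
    cuts = [0] + sorted(boundaries) + [len(slots)]
    return [slots[a:b] for a, b in zip(cuts, cuts[1:])]

# Six slots cut after positions 2 and 4 yield three segments.
slots = ["s0", "s1", "s2", "s3", "s4", "s5"]
segments = resegment(slots, [2, 4])
```

Because only the boundary indices change between segmentation conditions, the same concatenated network can be re-cut cheaply for each candidate segmentation.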
These re-cut confusion networks are used to generate W s, an N -best list of transcription
hypotheses, for each hypothesized segment s from the target segmentation. Each transcrip-
tion wi of W s has a recognizer confidence pr(wi), calculated as
pr(wi) = ∏k=1..len(wi) pr(wik)   (3.1)
where pr(wik) is the confusion network confidence of the word selected for wi from the k-th
slot in the confusion net. This posterior probability pr(wik) is derived from the recognizer’s
forward-backward decoding where the acoustic model, language model, and posterior scaling
weights are tuned to minimize WER on a development set.
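Equation 3.1 is a product of per-slot posteriors; the sketch below accumulates it in log space to avoid numerical underflow on long hypotheses (the log-space accumulation is an implementation choice, not prescribed by the chapter):

```python
import math

def hypothesis_confidence(slot_posteriors):
    """p_r(w_i) = product over slots of the chosen word's posterior
    (equation 3.1), accumulated in log space for numerical stability."""
    return math.exp(sum(math.log(p) for p in slot_posteriors))

# Illustrative three-slot hypothesis with slot posteriors 0.9, 0.8, 0.95.
conf = hypothesis_confidence([0.9, 0.8, 0.95])
```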
3.2.2 Feature extraction
After creating the parse cohort T s from the word-sequence cohort W s, each member of the
parse cohort is a word-sequence hypothesis wi with a parse tree ti,j projected over it, along
with two confidences: the ASR system posterior pr(wi) and the parse posterior pp(ti,j , wi).
The feature-extraction step (step 5) extracts additional features and organizes all of these
features into a vector fi,j to pass to the reranker. The feature extraction is organized to
allow us to vary fi,j to include different subsets of those extracted.
In this subsection, we present three classes of features extracted from our joint recognizer-
parser architecture: per-word-sequence features, generated directly from the output of
the recognizer and resegmenter and shared by all parse candidates associated with a tran-
scription hypothesis wi; per-parse features, generated from the output of the parser,
Table 3.1: Reranker feature descriptions for parse ti,j of word sequence wi

  Feature        Description                               Feature Class
  pr(wi)         Recognizer probability                    per-word-sequence features
  Ci             Word count of wi                          per-word-sequence features
  Bi             True if wi is empty                       per-word-sequence features
  pp(ti,j, wi)   Parse probability                         per-parse features
  ψ(ti,j)        Non-local syntactic features              per-parse features
  pplm(wi)       Parser language model                     aggregated parse features
  E[ψi]          Non-local syntactic feature expectations  aggregated parse features
which are different for each parse hypothesis ti,j ; and aggregated-parse features, con-
structed from the parse candidates but which aggregate across all ti,j that belong to the
same wi. The features are listed in Table 3.1. All of the probability features p(·) are
presented to the reranker in logarithmic form (values −∞ to 0).
Per-word-sequence features
Two recognizer outputs are read directly from the N -best lists produced in step 3 and
reflect non-parse information. The first is the recognizer language-model score pr(wi), which
is calculated from the resegmenter’s confusion networks as described in equation 3.1. A
second recognizer feature is the number of words Ci in the word hypothesis, which allows
the reranker to explicitly model sequence length. Lastly, an empty-hypothesis indicator Bi
(where Bi = 1 when Ci = 0) allows the reranker to learn a score to counterbalance for lack
of a useful parse score. (It is possible that a segment will have some hypothesized word
sequences wi that have valid words and some that contain only noise, silence or laughter,
i.e., an empty hypothesis, which would have no meaningful parse.)
Per-parse features
Each parse ti,j has an associated lexicalized-PCFG probability pp(ti,j , wi) returned by the
parser. For the parse quality objective, our system needs to compare parses generated
from different word hypotheses. The joint probability p(t, w) contains information about
the word sequence (the marginal parsing language model probability p(w) = ∑t p(t, w))
and the parse for that word sequence p(t|w). For the two objectives of parsing and word
transcriptions, it is useful to factor these. Parse-specific features are described here, and in
the next section we consider features that are aggregated over the M-best parses.
For parsing, we compute the probabilities
pp(ti,j | wi) = pp(ti,j, wi) / ∑k=1..M pp(ti,k, wi)   (3.2)
that represent the proportion of the M -best parser probability mass for sequence wi assigned
to tree ti,j .
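The renormalization in equation 3.2 over one word sequence's M-best parses can be sketched as follows (the joint scores shown are illustrative values, not from the system):

```python
def conditional_parse_probs(joint_probs):
    """Normalize the M-best joint scores p_p(t_{i,j}, w_i) for one word
    sequence into conditional probabilities p_p(t_{i,j} | w_i), eq. 3.2."""
    z = sum(joint_probs)
    return [p / z for p in joint_probs]

# Three hypothetical joint parse scores for one word sequence.
cond = conditional_parse_probs([0.06, 0.03, 0.01])
```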
The score pp(·) described above models the parser’s confidence in the quality of the entire
parse. Following the parse-reranking schemes sketched in section 3.1.3, we also extract
non-local parse features: a vector of integer counts ψ(ti,j) extracted from parse ti,j and
reflecting various aspects of the parse topology, using the feature-extractor from Charniak
and Johnson [2005]. These features are non-local in the sense that they make reference to
topology outside the usual context-free condition. For example, one element of this vector
might count, in a given parse, the number of VPs of length 5 and headed by the word think.
Further examples of the sorts of components in ψ(·) may be found in Charniak and Johnson
[2005]. Because these features are often counts of the configurations of specific words or non-
terminal labels, ψ(ti,j) is a very high-dimensional vector, which is pruned at training time
for the sake of computational tractability. For each segmentation condition, we construct a
different definition of ψ, keeping only those features whose values vary between candidates
for more than k segments of the training corpus.
When Ci = 0, we assign exactly one dummy tree [S [-NONE- null]] to the empty word
sequence, set pp(ti,1, wi) to a value very close to zero, and derive ψ(ti,1) from the dummy
tree using the same feature extractor. The conditional probability pp(ti,1|wi) is set to
unity, since there is only one (dummy) parse available.
Aggregated-parse features
For the WER objective, the details of specific parses are not of interest, but rather their ex-
pected behavior given the distribution over possible trees {p(t|w)}. We calculate the “parser
language model” feature pplm(wi) by summing the probabilities of all parse candidates for
wi:
\[
p_{\mathrm{plm}}(w_i) = \sum_{k=1}^{M} p_p(t_{i,k}, w_i). \tag{3.3}
\]
We also aggregate our non-local syntactic feature vectors ψ(ti,j) across the multiple parses
ti,j associated with a single word sequence wi by taking the (possibly flattened) expectation
over the conditional parse probabilities:
\[
E[\psi_i] = \sum_{j=1}^{M} p_p(t_{i,j} \mid w_i)\, \psi(t_{i,j}) \tag{3.4}
\]
We further investigated flattening the parse probabilities (i.e., replacing pp(ti,k, wi) with
pp(ti,k, wi)^γ for 0 < γ ≤ 1) under the hypothesis that they were “over-confident”; this
flattening proves useful in chapter 4 (also published as Kahn et al. [2009]).
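Equations 3.3 and 3.4, with the optional γ flattening, can be sketched as follows. The sparse-dictionary representation of ψ and the choice to apply flattening only to the expectation weights are illustrative assumptions, not a description of the actual implementation:

```python
from collections import defaultdict

def aggregate_parse_features(joint_probs, feature_vectors, gamma=1.0):
    """Aggregate one word sequence's M-best parses.

    joint_probs[j]     -- p_p(t_{i,j}, w_i)
    feature_vectors[j] -- sparse count vector psi(t_{i,j}) as a dict
    gamma              -- flattening exponent, 0 < gamma <= 1

    Returns the parser-LM score p_plm(w_i) (equation 3.3) and the expectation
    E[psi_i] under the flattened conditional parse probabilities (equation 3.4).
    """
    p_plm = sum(joint_probs)
    flattened = [p ** gamma for p in joint_probs]
    norm = sum(flattened)
    expected = defaultdict(float)
    for p, psi in zip(flattened, feature_vectors):
        for feat, count in psi.items():
            expected[feat] += (p / norm) * count
    return p_plm, dict(expected)
```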
3.2.3 Reranker
The reranker (step 6) takes as input the feature vector fi,j for each candidate and applies
a discriminative model θ to sort the cohort candidates. θ is learned by pairing feature
vectors fi,j with a value assigned by an external objective function φ, and finding a θ
that optimizes the cumulative objective function of the top-ranked hypotheses over all the
training segments s. In this work, we consider two alternative objective functions: word
error rate (WER), for targeting the word sequence (φw(wi)), and SParseval for evaluating
parse structure (φp(ti,j)). For the SParseval objective, the optimization problem is given
by:
\[
\hat{\theta} = \arg\max_{\theta} \sum_{s} \phi_p\!\left( \arg\min_{t^s_{i,j}} \theta \cdot f^s_{i,j} \right) \tag{3.5}
\]
A similar equation results for the word error rate objective, but the minimization is only
over word hypotheses wi.
For training the re-ranker component of our system, we need segment-level scores, since
we apply the re-ranking per-segment. Ideally, scoring operates on the concatenated result
for a whole conversation side, to avoid artifacts of mapping word sequences associated
with hypothesized segments that differ from the reference segments. However, in training,
where scoring is needed for all M × N hypotheses, it is prohibitively complex to score
all combinations at the conversation-side level. We therefore approximate per-segment
SParseval scores at training time by computing precision and recall against all those
reference dependency pairs whose child dependent aligns within the segment boundaries
available to the parses being ranked.
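This training-time approximation can be sketched as below; the (head, dependent, dependent-position) triple representation is a simplification chosen for illustration and does not reproduce the SParseval toolkit's alignment machinery:

```python
from collections import Counter

def segment_dependency_f(hyp_deps, ref_deps, seg_start, seg_end):
    """Approximate per-segment SParseval F: precision and recall of a
    candidate's dependency pairs against the reference dependencies whose
    dependent word aligns within [seg_start, seg_end)."""
    in_segment = [d for d in ref_deps if seg_start <= d[2] < seg_end]
    matched = sum((Counter(hyp_deps) & Counter(in_segment)).values())
    precision = matched / len(hyp_deps) if hyp_deps else 0.0
    recall = matched / len(in_segment) if in_segment else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```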
We treat reranker training as a rank-learning margin problem, using the svm-rank
toolkit [Joachims, 2006] as described in section 2.2.2.
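svm-rank consumes training data in the SVMlight format, in which candidates sharing a qid form one ranking cohort and larger targets indicate preferred candidates. Below is a hypothetical sketch of serializing cohorts of (objective value, sparse feature vector) pairs; the target scaling actually used in this work is not specified here:

```python
def write_svmrank_file(path, cohorts):
    """Write reranker training data for svm_rank_learn [Joachims, 2006]:
    one line per candidate, '<target> qid:<q> <feat>:<val> ...', with the
    external objective value phi used directly as the target."""
    with open(path, "w") as out:
        for qid, candidates in enumerate(cohorts, start=1):
            for objective, features in candidates:
                feats = " ".join(
                    f"{j}:{v:g}" for j, v in sorted(features.items()) if v != 0
                )
                out.write(f"{objective:g} qid:{qid} {feats}\n")
```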
3.3 Corpus and experimental setup
These experiments used the Switchboard corpus [Godfrey et al., 1992], a collection of English
conversational speech. The Switchboard corpus consists of five-minute telephone conversa-
tions between strangers on a randomly-assigned topic. The audio is recorded on different
channels for each speaker. The data from a single channel is referred to as a “conversation
side.”
All the data used in this experiment was taken from a subset of Switchboard conversa-
tions which the Linguistic Data Consortium (LDC) has annotated with parse trees. These
experiments use the original LDC transcriptions of the Switchboard audio rather than the
subsequent Mississippi State transcriptions [ISIP, 1997], because hand-annotated reference
parses only exist for the former. The data also has manual annotations of disfluencies and
sentence-like units [Meteer et al., 1995], labeled with reference to the audio. Because the
treebanking effort used only transcripts (no audio), there are occasional differences in the
definition of a “sentence”; because the audio-based annotation was likely to be more faithful
to the speaker’s intent, and because the automatic segmenter was trained from data anno-
tated with a related LDC convention [Strassel, 2003], we used the audio-based definition of
sentence-like units (referred to henceforth as SUs).
The Switchboard parses were preprocessed for use in this system following methods
Table 3.2: Switchboard data partitions

Partition   Sides    Words
Train       1042    654271
Dev          116     76189
Eval         128     58494
described in Kahn [2005], which are summarized here. Various aspects of the syntactic an-
notation beyond the scope of this task—for example, empty categories—were removed. The
parses were also resegmented to match the SU segments, with some additional rule-based
changes performed to make these annotations more closely match the LDC SU conventions.
In the resegmented trees, constituents spanning manually-annotated segment boundaries
were discarded, and multiple trees within a single manually annotated segment were sub-
sumed beneath a top-level SUGROUP constituent. To match the speech recognizer output,
punctuation is removed, and contractions are retokenized (e.g., can + n’t ⇒ can’t).
The corpus was partitioned into training, development and evaluation sets whose sizes
are shown in Table 3.2. Results are reported on the evaluation set; the development set was
used during debugging and for exploring new feature-sets for f , but no results from it are
reported here.
3.3.1 Evaluation measures
Word recognition performance is evaluated using word-error rate measurements generated
by the NIST sclite scoring tool [NIST, 2005] with the words in the reference parses taken
as the reference transcription. Because we want to compare performance across different
segmentations, WER is calculated on a per-conversation side basis, concatenating all the
top-ranked word sequence hypotheses in a given conversation side together. When com-
paring the statistical significance of different results between configurations, the Wilcoxon
Signed Rank test provided by sclite is used.
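The WER computation itself is a Levenshtein alignment over the concatenated conversation side. The sketch below approximates what sclite reports, omitting sclite's normalization and alignment refinements:

```python
def word_error_rate(ref, hyp):
    """WER = (substitutions + insertions + deletions) / |ref|, computed by
    dynamic programming over two token lists (one concatenated side each)."""
    d = list(range(len(hyp) + 1))  # edit-distance row against the hypothesis
    for i, r in enumerate(ref, start=1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, start=1):
            cur = min(prev + (r != h),  # substitution (free if words match)
                      d[j] + 1,         # deletion of a reference word
                      d[j - 1] + 1)     # insertion of a hypothesis word
            prev, d[j] = d[j], cur
    return d[len(hyp)] / len(ref)
```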
For parse-quality evaluation, we use the SParseval toolkit [Roark et al., 2006], again
calculated on a per-conversation side basis, concatenating all the top-ranked parse hypothe-
ses in a given conversation. We use the setting that invokes Charniak’s implementation
of the head-finding algorithm and consider performance over both closed- and open-class
words. When comparing the statistical significance of SParseval results, we use a per-
segment randomization [Yeh, 2000].
3.3.2 Component configurations
Speech recognizer
The recognizer is the SRI Decipher conversational speech recognition system [Stolcke et al.,
2006], a state-of-the-art large-vocabulary speech recognizer that uses various acoustic and
language models to perform multiple recognition and adaptation passes. The full system has
multiple front-ends, each of which produces n-best lists containing up to 2000 word sequence
hypotheses per audio segment, which are then combined into a single set of word sequence
hypotheses using a confusion network. This system has a WER of 18.6% on the standard
NIST RT-04 evaluation test set.
Human-annotated reference parses are required for all the data involved in these exper-
iments. Unfortunately, because they are difficult to create, reference parses are in short
supply, and all the Switchboard conversations used in the evaluation of this system are
already part of the training data for the SRI recognizer. Although it represents only a very
small part of the training data (Switchboard is only a small part of the corpus, and the
data here are restricted to the hand-parsed fraction of Switchboard), there is the danger
that this will lead to unrealistically good recognizer performance. This work compensates
for this potential danger by using a less powerful version of the full recognizer, which has
fewer stages of rescoring and adaptation than the full system and a WER of 20.2% on the
RT-04 test set. On our evaluation set from Switchboard, this system has a 22.9% WER.
Segmenter
Our automatic segmenter [Liu et al., 2006b] frames the sentence-segmentation problem as a
binary classification problem in which each boundary between words can be labeled as either
a sentence boundary or not. Given a word sequence and prosodic features, it estimates the
posterior probability of a boundary after each word. The particular version of the system
used here is based on the hidden-event model (HEM) from Stolcke and Shriberg [1996],
with features that include n-gram probabilities, part of speech, and automatically-induced
semantic classes, and combines the lexical and prosodic information sources. The HEM is
an HMM with a higher-order Markov process on the state sequence (the word-boundary
label pair) and observation probabilities given by the prosodic information using bagging
decision trees. Segment boundaries were hypothesized for all word boundaries where the
posterior probability of a sentence boundary was above a certain threshold.
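The decision rule is a simple threshold on the per-boundary posteriors; a sketch (names hypothetical):

```python
def hypothesize_boundaries(posteriors, threshold=0.5):
    """Hypothesize an SU boundary after word position i wherever the
    posterior probability of a boundary exceeds the threshold; 0.5 minimizes
    slot error rate, while a lower threshold (0.35) over-segments."""
    return [i for i, p in enumerate(posteriors) if p > threshold]
```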
We explore four segmentation conditions in our experiments:
Pause-based segmentation uses the recognizer’s “automatic” segments, which are deter-
mined based on speech/non-speech detection by the recognizer (i.e., pause detection);
it serves as a baseline.
Min-SER segmentation is based on the automatic system using a posterior threshold of
0.5, which minimizes the word-level slot error rate (SER).
Over-segmented segmentation is based on the automatic system using a posterior thresh-
old of 0.35, which is that suggested by Harper et al. [2005] for obtaining better parse
quality.
Reference segmentation is mapped to the hypothesized word sequence by performing a
dynamic-programming alignment between the confusion networks and the reference;
it provides an oracle upper bound.
Table 3.3 summarizes the segmentation conditions, including the performance (measured
as SU boundary F and SER), the number of segments and the average segment length
in words for each segmentation condition on the evaluation set. Note that the automatic
segmentation with the lower threshold results in more boundaries, so that the average
“sentence” length is shorter and recall is favored over precision.
Table 3.3: Segmentation conditions. F and SER report the SU boundary performance over
the evaluation section of the corpus.

Condition        Threshold    F       SER     Train segs   Eval segs   Avg. length
Pause-based      NA           0.62    0.61    54943        5693        10.3
Min-SER          0.5          0.77    0.45    86681        8417         6.9
Over-segmented   0.35         0.78    0.46    96627        9369         6.2
Reference        NA           (1.00)  (0.00)  91254        8779         6.7
Resegmenter
Given the confusion network representation of the speech recognition output, the main
task of resegmentation is generating N -best lists given a new segmentation condition for
the confusion networks. For a given segment, the lattice-tool program from the SRI
Language Modeling Toolkit [Stolcke, 2002] is used to find paths through the confusion
network ranked in order of probability, so the N most probable paths are emitted as an
N -best list 〈w1 . . . wN 〉, where each wi is a sequence of words. For these experiments, the
N -best lists are limited to at most N = 50 word sequence hypotheses.
Parser
Our system uses an updated release of the Charniak generative parser [Charniak, 2001] (the
first stage of the November 2009 updated release of [Charniak and Johnson, 2005], without
the discriminative second-stage component) to do the M -best parse-list (and parse-score)
generation. As in Kahn [2005], we do not implement a separate “edit detection” stage
but treat edits as part of the syntactic structure. The parser is trained on the entire
training set’s reference parses; no parse trees from other sources are included in the training
set. We generate M = 10 parses for each word sequence hypothesis, based on analyses
(presented later) that showed little benefit from additional parses and much more benefit
from increasing the number of sentence hypotheses. If the parser generates fewer than M
hypotheses, we take as many as are available. For the full system, we train a single parser
on the entire training set; for providing training cohorts to the reranker, the parser is trained
on round-robin subsets of the training set, as discussed in section 3.3.2.
Feature extractor
The extraction of non-local syntactic feature ψ(ti,j) uses the software and feature definitions
from Charniak and Johnson [2005]. For tractability, we prune the set of features to those
with non-zero (and non-uniform) values within a single segment’s hypothesis set for more
than 2000 segments, which is approximately 2% of the total number of training segments
(as in the parse-reranking experiments in Kahn et al. [2005]). Pruning is done separately
for each segmentation of the training set, yielding about 40,000 non-local syntactic features
under most segmentation conditions.2
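The pruning criterion can be sketched as follows, with each training segment's candidate hypotheses represented as sparse feature dictionaries (an illustrative simplification of the actual feature extractor's representation):

```python
from collections import defaultdict

def prune_features(cohorts, min_segments=2000):
    """Keep only features whose values vary among the candidates of more
    than `min_segments` training segments; `cohorts` is a list of segments,
    each a list of sparse feature dicts (one per candidate)."""
    varies_in = defaultdict(int)
    for candidates in cohorts:
        all_feats = set().union(*(c.keys() for c in candidates))
        for feat in all_feats:
            if len({c.get(feat, 0) for c in candidates}) > 1:
                varies_in[feat] += 1
    return {f for f, n in varies_in.items() if n > min_segments}
```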
The aggregate parse features pplm(wi) and E[ψi] are calculated by sums across the M
parses generated for each wi. We assume that this approximation (instructing the parser to
return no parses after the M -th) has no important impact on the value of these features.
Reranker
As discussed in section 2.2, the reranker component of our system is the svm-rank tool from
Joachims [2006]. The reranker needs to be trained using candidate parses from a data set
that is independent of the parser training and the evaluation test set. Because of the limited
amount of hand-annotated parse tree data, we did not want to create a separate training
partition just for this model. Instead, we adopt the round-robin procedure described in
Collins and Koo [2005]: we build 10 leave-n-out parser models, each trained on 9/10 of the
training set, and run each on the tenth that it has not been exposed to. The resulting parse
candidate sets are passed to the feature-extraction component and the resulting vectors
(and their objective function values) are used to train the reranker models.
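The round-robin partitioning can be sketched as follows (partitioning by list slices is an illustrative choice; the actual fold assignment is not specified here):

```python
def round_robin_folds(items, n_folds=10):
    """Leave-n-out folds as in Collins and Koo [2005]: fold k's parser is
    trained on all data except slice k and parses only slice k, so no
    candidate is produced by a model that saw it during training."""
    folds = []
    for k in range(n_folds):
        held_out = items[k::n_folds]
        training = [x for i, x in enumerate(items) if i % n_folds != k]
        folds.append((training, held_out))
    return folds
```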
2Our non-local syntactic feature set is thus slightly different for each segmentation, since the number and content of the set of segments vary among segmentations. The pause-based segmentation, with substantially longer segments, selects about 28,000 features under this pruning condition; others have about 40,000.
To avoid memory constraints, we assign each segment to one of 10 separate bins and
train 10 svm-rank models.3 For each experimental combination of segmentation and features
in fi,j , we re-train all 10 rerankers. At evaluation time, the cohort candidates are ranked
by all 10 models and their scores are averaged. The parse (or word-sequence) of the top-
ranked candidate is taken to be the system’s hypothesis for a given segment, and evaluated
according to either the WER or SParseval objective.
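Evaluation-time combination of the 10 bin-specific models can be sketched as below, with each svm-rank model reduced, for illustration only, to a linear weight vector θ over the candidate feature vector fi,j:

```python
def rerank_cohort(candidates, models):
    """Rank one cohort of candidate feature vectors by averaging the linear
    scores theta . f assigned by each of the trained models; return the
    index of the top-ranked candidate."""
    def avg_score(f):
        total = sum(sum(w * x for w, x in zip(theta, f)) for theta in models)
        return total / len(models)
    return max(range(len(candidates)), key=lambda j: avg_score(candidates[j]))
```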
3.4 Results
This section describes the results of experiments designed to assess the potential for per-
formance improvement associated with increasing the number of word-sequence vs. parse
candidates, as well as the actual gains achieved by reranking under both WER and SParse-
val objectives and different segmentation conditions. We also include a qualitative analysis
of improvements.
3.4.1 Baseline and Oracle Results
To provide a baseline, we sequentially apply the recognizer, segmenter, and parser, choosing
the top scoring word-sequence and then the top parse choice. We establish upper bounds
for each objective by selecting the candidate from the M × N parse-and-word-sequence
cohort that scores the best on each objective function. The results of these experiments are
reported in tables 3.4 (optimizing for WER with M = 1 and different N) and 3.5 (optimizing
for SParseval with N = 50 and different M). The number in parentheses corresponds to
the mismatched condition — picking a candidate based on one criterion and scoring it with
another. Both sets of results show that improving one objective leads to improvements in
the other, since word errors are incorporated into the SParseval score.
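The oracle selection is a straightforward best-of-cohort search; a sketch (the candidate representation is hypothetical):

```python
def oracle_candidate(cohort, objective, lower_is_better=False):
    """Upper-bound ('oracle') selection: from the M x N cohort of
    (word_sequence, parse) candidates, pick the one scoring best on the
    given objective (e.g. lowest WER, or highest SParseval F)."""
    best = min if lower_is_better else max
    return best(cohort, key=objective)
```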
Table 3.4 shows that the N -best cohorts contain a potential WER error reduction of
32%. Larger gains are possible for the shorter-segment segmentation conditions, due to the
increase in the number of available alternatives when generating N -best lists from more
3Each candidate set is generated by a single leave-n-out parser (populated by conversation-side), but eachsvm-rank bin (populated by segments, not by conversation sides) includes some cohorts from each of theleave-n-out tenths.
Table 3.4: Baseline (1-best serial processing) and oracle WER reranking performance from
N = 50 word sequence hypotheses and 1-best parse. Parenthesized values indicate (unop-
timized) SParseval scores of the selected hypothesis.
Segmenter    Serial Baseline    1×N WER Oracle
             WER (SParseval)    WER (SParseval)
Pause        23.7 (68.2)        17.6 (70.7)
Min-SER      23.7 (70.7)        16.7 (73.7)
Over-seg     23.7 (70.9)        16.2 (74.1)
Reference    23.7 (72.5)        16.2 (77.0)
(and shorter) confusion networks.
Table 3.5 shows that there is a potential 39% reduction in parse error (1 − F ) between
the serial baseline (F = 72.5) and the joint M ×N optimization (F = 83.2) with the oracle
segmentation. The potential benefit is smaller for the pause-based segmentation (F = 68.2
vs. 75.2), both in terms of the relative improvement (22%) and the absolute F score. The
possible benefit of automatic segmentation falls between these ranges, with slightly better
results for the over-segmented case. We observe smaller gains in going from M = 10 to
M = 50 parses (and no gains in the automatic segmentation cases), so only M = 10 parses
are used in subsequent experiments, to reduce memory requirements in training.
We can also compare the benefit of increasing N vs. M . Figure 3.4 illustrates the
trade-off for reference segmentations, showing that there is a bigger benefit from increasing
N than M . However, a comparison of the results in the two tables shows that there is a
significant gain in SParseval parse performance from increasing both of N×M : if only M
is increased (from 1× 1 to 1× 50), the potential benefit is 25% error reduction. If increased
to 10× 50, possible reduction is 36% (39% for 50× 50).
Table 3.5: Oracle SParseval (WER) reranking performance from N = 50 word sequence
hypotheses and M = 1, 10, or 50 parses. Parenthesized values indicate (unoptimized) WER
of the selected hypotheses.
                 Parse Oracle (N = 50)
             M = 1             M = 10            M = 50
Segments     (WER) SParseval   (WER) SParseval   (WER) SParseval
Pause        (20.8) 72.7       (20.3) 74.4       (20.0) 75.2
Min-SER      (20.3) 75.8       (19.7) 78.0       (19.7) 78.0
Over-seg     (20.0) 76.2       (19.3) 78.5       (19.3) 78.5
Reference    (19.1) 79.4       (18.3) 82.3       (18.1) 83.2
[Contour plot over N (0–50, horizontal) and M (0–50, vertical); SParseval oracle performance contours range from 0.765 to 0.825.]

Figure 3.4: Oracle parse performance contours for different numbers of parses M and recog-
nition hypotheses N on reference segmentations.
Table 3.6: Reranker feature combinations. All feature sets additionally contain the
per-word-sequence features pr(wi), Ci, and Bi.
Feature Set Additional features Per
ASR (No additional features) word sequence
ParseP pp(ti,j , wi) parse
ParseLM pplm(wi) word sequence
ParseP+NLSF pp(ti,j , wi), ψ(ti,j) parse
ParseLM+E[NLSF] pplm(wi), E[ψi] word sequence
3.4.2 Optimizing for WER
We also investigate whether providing multiple M -best parses to the reranker augments
the parsing knowledge source when optimizing for WER (compared to using only one parse
annotation, or to using no parse annotation at all). To examine this, we explore different
alternatives for creating the feature-vector representation fi,j of a word-sequence candidate,
as summarized in Table 3.6. All experiments include recognizer confidences pr(wi), word
count Ci, empty-hypothesis flag Bi, and parser posteriors pp(ti,j, wi) in the feature vector.
Table 3.6 shows all the feature combinations investigated with the feature names used here.
Table 3.7 shows the WER results of all the segmentation conditions and feature sets,
which can be compared to the baseline serial result of 23.7%. Reranking with the ASR
features alone does not improve performance, since there is little that the reranker can
learn (acoustic and language model scores are combined in the process of generating N -best
lists from confusion networks). The WER performance is worse than baseline on the
Min-SER and Ref segmentations, possibly because these segments are longer than those in
the Over-seg condition, making word-length differences a less useful feature. Other results in
table 3.7 confirm that non-local syntactic features ψ(ti,j) (NLSF here) are useful for word
recognition, confirming the results from Collins et al. [2005b]. In addition, there are some
new findings. First, SU segmentation impacts the utility of the parser for word transcription
(as well as for parsing). There is no benefit to using the parse probabilities alone except
Table 3.7: WER on the evaluation set for different sentence segmentations and feature sets.
Baseline WER for all segmentations is 23.7%.
Segmentation
Features Pause-based Min-SER Over-seg Ref seg
ASR (M = 1) 23.6 24.2 23.7 24.1
ParseP (M = 1) 23.6 23.7 23.7 23.1
ParseLM 23.7 23.7 23.7 23.1
ParseP+NLSF (M = 1) 23.3 23.4 23.4 22.8
ParseLM+E[NLSF] 23.3 23.3 23.4 22.7
Oracle-WER 17.6 16.7 16.2 16.2
in the case of reference segmentation,4 and the benefit of parse features is greater with the
reference segmentation than with the automatic segmentations (22.7% vs. 23.3% WER).
Second, the use of more than one parse with parse posteriors does not lead to significant
performance gains for any feature set.
While there is not a significant benefit from using M = 10 for the parse probability plus
features, it does give the best result and we will use this in comparisons to the M = 10
SParseval optimization. For all segmentations, the ParseLM+E[NLSF] features provide
a significant reduction (p < 0.001 using the Wilcoxon test) in WER from the baseline,
but only 4–6% of the possible improvement within the N-best cohort is obtained with the
automatic segmentation. When using reference segmentation, reranking with any of the
feature sets provides significant (p < 0.001) WER reductions compared to baseline.
Table 3.8 explores the effect of lowering the parse-flattening γ below 1.0 for those WER-
optimized models that use more than one parse candidate (γ has no effect on expectation
weighting when there is only one parse candidate). The results with γ = 0.5 or γ = 0.1 are
not significantly different from those with γ = 1.0, and systems trained with γ ≠ 1 are in
general slightly worse. In all further experiments, γ is set to the
4Since the n-gram language model is trained on much more data than the parser, it may be difficult for the parsing language model to provide added benefit.
Table 3.8: Word error rate results for different sentence segmentations and feature sets,
comparing γ parse-flattening for WER optimization when M = 10. The baseline WER for
all segmentations is 23.7%.
Segmentation
Features γ Pause-based Min-SER Over-seg Ref seg
ParseLM γ = 0.1 23.7 23.7 23.7 23.4
ParseLM γ = 0.5 23.6 23.7 23.7 23.2
ParseLM γ = 1.0 23.6 23.7 23.7 23.1
ParseLM+E[NLSF] γ = 0.1 23.3 23.4 23.4 22.9
ParseLM+E[NLSF] γ = 0.5 23.3 23.4 23.4 22.7
ParseLM+E[NLSF] γ = 1.0 23.3 23.3 23.4 22.7
default (1.0).
3.4.3 Optimizing for SParseval
When optimizing for SParseval, we train and evaluate with the feature set that includes
parse-specific features: pr(wi), Ci, Bi, pplm(wi), pp(ti,j , wi), and ψ(ti,j). Table 3.9 summa-
rizes the results for the different segmentation conditions in comparison to the serial baseline
and M × N -best oracle result. WER numbers, reported in parentheses, are the WER of
the leaves of the selected parse. As expected from prior work [Kahn et al., 2004, Harper
et al., 2005], we find an impact on parsing from the segmentation. The best results for all
feature sets are obtained with reference segmentations, and the over-segmented threshold
in automatic segmentation is slightly better than the min-SER case. For reference
segmentations, higher parse scores correspond to lower WER, but for the automatic
segmentations this is not always the case. For all segmentations, optimizing with the
ParseP feature set is better than the baseline (p < 0.01 using per-segment randomization
[Yeh, 2000]).
The non-local syntactic features did not lead to improved parse performance over the
Table 3.9: Results under different segmentation conditions when optimizing for SParseval
objective; the associated WER results are reported in parentheses.
Segmentation
Features Pause-based Min-SER Over-seg Ref seg
Baseline (23.7) 68.2 (23.7) 70.7 (23.7) 70.9 (23.7) 72.5
ParseP (24.1) 68.8 (24.0) 71.1 (24.0) 71.3 (23.2) 73.4
ParseP+NLSF (24.3) 69.1 (25.5) 70.4 (25.8) 70.4 (23.5) 73.1
oracle (20.3) 74.4 (19.7) 78.0 (19.3) 78.5 (18.3) 82.3
parse probability alone, and in some cases hurt performance, which seems to contradict
prior results in parse reranking. However, as shown in Figure 3.5, there is an improvement
due to use of features for the case where there is only N = 1 recognition hypothesis, but
that improvement is small compared to gains from increasing N . Figure 3.5 also shows
that optimizing for WER with non-local syntactic features actually leads to better parsing
performance than when optimizing directly for parse performance. We conjecture that this
result is due to overtraining the reranker when the feature dimensionality is high and the
training samples are biased to have many poorly-scoring candidates. The parsing problem
involves many more candidates to rank than WER (300 vs. 30 on average) because parse-
reranking has M ×N candidates while transcription-reranking has at most N candidates.
Since the pool of M × N is much larger, it contains more poorly-ranking candidates and
thus the learning may be dominated by the many pairwise cases involving poor-quality
candidates.
3.4.4 Qualitative observations
We examined the recognizer outputs for the WER optimization with the ParseLM and
expected NLSF features to understand the types of improvements resulting from using a
parse-based language-model for re-ranking. Under this WER optimization on reference
segmentation, of the 8,726 segments in the test set, 985 had WER improvements and 462
Figure 3.5: SParseval performance for different feature and optimization conditions as a
function of the size of the N-best list.
had WER degradations. We examined a sample (about 100 each) of these improvements
and degradations. Some improvements (about 15%) are simple determiner recoveries, e.g.
“a” in “have a happy thanksgiving.” Other examples involve short main verbs (also a bit
more than 15%; above 20% if contractions are included), as in:
some evenings are [or] worse than others
that is [as] a pretty big change
used [nice] to live in colorado
well they don’t [—] always
where the corrected word is in boldface and the incorrect word (substitution) output by the
baseline recognizer is italicized and in brackets.
More significant from a language processing perspective are the corrections involving
pronouns (about 5%), which would impact coreference and entity analysis. The parsing LM
recovers lost pronouns and eliminates incorrectly recognized ones, particularly in contrac-
tions with short main verbs, as in the following examples:
she was there [they’re] like all winter semester
they’re [there] going to school
we’re [where] the old folks now
(Contraction corrections like these are not included in the count for short main verb cor-
rections.) Further improvements are found in complementizers and prepositions (about 5%
each), while only about 10% of the improvements changed content words. The remaining
45% of improvements are miscellaneous.
Another pronoun example illustrates how the parse features can overcome the bias of
frequent n-grams in conversational speech:
Improved: they *** really get uh into it
Baseline: yeah yeah really get uh into it
Reference: they uh really get uh into it
with substitution errors in italics and deletions indicated by “***.” (The bigram “yeah
yeah” is very frequent in the Switchboard corpus.)
Of the segments that suffered WER degradation under ParseLM+E[NLSF] WER op-
timization, a little more than 15% were errors on a word involved in a repetition or self-
correction, e.g. the omission of the boldface the in:
. . . that’s not the not the way that the society is going
Another 7-10% of these candidates that had WER degradation were more grammatically
plausible than the reference transcription, e.g. the substitution of a determiner a for an
unusually-placed pronoun (probably a correction):
Reference: but i lot of times i don’t remember the names
Optimized: but a lot of times i do not remember the names
Most importantly, these last two classes of WER degradation do not have an impact on the
meaning of the sentence. The remaining roughly 75% of the WER-degraded segments are
difficult to characterize, but these errors, too, mostly involve function words.
Most of these types of corrections are observed whether the optimization is for WER or
SParseval. Many cases where they give different results have a higher or equal WER for
the SParseval-optimized case, but the result is arguably better, as in:
WER obj.: i know that people *** to arrange their whole schedules . . . (1 error)
SParseval obj.: i know that people used to arrange their whole schedule . . . (1 error)
Baseline: i know that people easter arrange their whole schedules . . . (2 errors)
Reference: i know that people used to arrange their whole schedules . . .
We compared the WER-optimized output to the SParseval-optimized output, and found
about 100 segments where the SParseval-optimized candidate had better SParseval but
worse WER than the WER-optimized candidate. Of these cases, about 15% seem to be
cases where the SParseval-optimized output is more grammatically plausible than the
reference, e.g.:
Reference: i’ve i’ve probably talked maybe to five people
SParseval opt.: i’ve i’ve probably talked to maybe just five people
WER opt: i’ve i’ve probably talked maybe just five people
Reference: now it’s like you know tough and dirty team
SParseval opt.: now it’s like you know a tough and dirty team
WER opt.: now it’s like you know tough and dirty team
Note that it is important that the parser is trained on conversational speech in order to make
useful predictions on conversational phenomena such as the hedging “like, you know” and
the prescriptively proscribed double-adverb “maybe just”. The remaining improvements in
this analysis may be categorized as a variety of other cases.
3.5 Discussion
In this chapter, we have presented a discriminative framework for jointly modeling speech
recognition and parsing, with which we improve both word sequence quality (as measured
by WER) and parse quality (as measured by SParseval). We confirm and extend previous
work in using parse structure for language-modeling [Collins et al., 2005b] and in parsing
conversational speech [Kahn et al., 2004, Kahn, 2005, Harper et al., 2005].
Experiments using this framework provide some answers to the questions posed at the
beginning of the chapter. First, we find that parsing performance can be improved sub-
stantially by incorporating parser uncertainty via N -best list rescoring, particularly with
high quality sentence segmentation, although the automatic reranking systems achieve only
a small fraction of the potential gain. Further, allowing for word uncertainty is much more
important than considering parse alternatives. In optimizing for WER, however, no signif-
icant gains are obtained from modeling parse uncertainty in a statistical parser, either in a
language model or in non-local syntactic features. Of course, these findings may depend on
the particular parser used. Finally, we find that sentence segmentation quality is important
for parse information to have a significant impact on speech recognition WER, and that
a good segmentation can increase the potential gains in parsing from considering multiple
word-sequence hypotheses. A conclusion of these findings is that improvements to auto-
matic segmentation algorithms would substantially extend the utility of parsers in speech
processing.
One surprising result was that non-local syntactic features in reranking were of more
benefit to speech recognition than to parsing and, in fact, sometimes hurt parsing perfor-
mance. We conjecture that this result is due to the fact that the joint parsing problem
involves many more poor candidate pairs among reranker training samples, which seems to
be problematic for the learner when the features are high-dimensional. It may be that other
types of rerankers are better suited to handling such problems.
Chapter 4
USING GRAMMATICAL STRUCTURE TO EVALUATE MACHINE TRANSLATION
This chapter1 explores a different use of grammatical structure prediction: its use in
predicting the quality of machine translation. As suggested in chapters 1 and 2, a key
challenge in automatic machine translation evaluation is to account for allowable variability,
since two equally good translations may be quite different in surface form. This is especially
challenging when the evaluation measures used consider only the word-sequence.
We motivate the use of dependencies for SMT evaluation with two example machine
translations (and a human-translated reference):
Ref: Authorities have also closed southern Basra’s airport and seaport.
S1: The authorities also closed the airport and seaport in the southern port of Basra.
S2: Authorities closed the airport and the port of.
(4.1)
A human evaluator judged the system 1 result (S1) as equivalent to the reference, but
indicated that the system 2 (S2) result had problematic errors. BLEU4 (a popular automatic
metric for SMT) gives S1 and S2 similar scores (0.199 vs. 0.203). TER (another popular
metric) prefers S2 (with an error of 0.7 vs. 0.9 for S1), since a deletion requires fewer edits
than rephrasing. EDPM (the new metric described later in this chapter) provides a score for
S1 (0.414) that is preferred to S2 (0.356), reflecting EDPM’s ability to match dependency
structure. The two phrases “southern Basra’s airport and seaport” and “the airport and
seaport in the southern port of Basra” have more similar dependency structure than word
order.
The next section (4.1) reviews some relevant research in the evaluation of machine trans-
lation. In section 4.2, this chapter describes a family of dependency pair match (DPM) au-
tomatic machine-translation metrics, and section 4.3 describes the infrastructure and tools
1Matthew Snover provided invaluable assistance in a version of this work, which has been published as Kahn et al. [2009].
used to implement that family. Sections 4.4 and 4.5 explore two ways to compare members
of this family with human judgements. Section 4.6 explores the potential to adapt the
EDPM component measures by combining them with another state-of-the-art MT metric’s
use of synonym tables and other word-sequence and sub-word features. Section 4.7, finally,
discusses the broader implications and future directions for these findings.
4.1 Background
Currently, the most popular approaches for automatic MT evaluation are BLEU [Papineni
et al., 2002], based on n-gram precision, and Translation Edit Rate (TER), an edit distance
[Snover et al., 2006]. These measures can only account for variability when given multiple
translations, and studies have shown that they may not accurately track translation quality
[Charniak et al., 2003, Callison-Burch, 2006]. Both BLEU and TER are word-sequence
measures: they use exclusively features of the word-sequence and no knowledge of language
similarity or structure beyond that sequence.
Some alternative measures have proposed using external knowledge sources to explore
mappings within the words themselves, such as synonym tables and morphological stem-
ming, e.g. METEOR [Banerjee and Lavie, 2005] and the ATEC measure [Wong and Kit,
2009]. TER Plus (TERp) [Snover et al., 2009], which is an extension of the previously-
mentioned TER, also incorporates synonym sets and stemming, along with automatically-
derived paraphrase tables. Still other systems attempt to map language similarity measures
into a high-level semantic entailment abstraction, e.g. [Pado et al., 2009].
By contrast, this chapter’s research proposes a technique for comparing syntactic decom-
positions of the reference and hypothesis translations. Other metrics modeling syntactically-
local (rather than string-local) word-sequences include tree-local n-gram precision in various
configurations of constituency and dependency trees [Liu and Gildea, 2005] and the d and
d var measures proposed by Owczarzak et al. [2007a,b], which compare relational tuples
derived from a lexical functional grammar (LFG) over reference and hypothesis transla-
tions.2
2 Owczarzak et al. [2007a] extend their previous line of research [Owczarzak et al., 2007b] by variably-weighting dependencies and by including synonym matching, two directions not pursued here. Hence,
Any syntactic-dependency-oriented measure requires a system for proposing dependency
structure over the reference and hypothesis translations. Liu and Gildea [2005] use a PCFG
parser with deterministic head-finding, while Owczarzak et al. [2007a] extract the seman-
tic dependency relations from an LFG parser [Cahill et al., 2004]. This chapter’s work
extends the dependency-scoring strategies of Owczarzak et al. [2007a], which reported sub-
stantial improvement in correlation with human judgement relative to BLEU and TER,
by using a publicly-available probabilistic context-free grammar (PCFG) parser and deter-
ministic head-finding rules, rather than an LFG parser. In addition, this chapter considers
alternative syntactic decompositions and alternative mechanisms for computing score com-
binations. Finally, the work presented here explores combination of syntax with synonym-
and paraphrase-matching scoring metrics.
Evaluation of automatic MT measures requires correlation with MT evaluation mea-
sures performed by human beings. Some [Banerjee and Lavie, 2005, Liu and Gildea, 2005,
Owczarzak et al., 2007a] compare the measure to human judgements of fluency and ade-
quacy. Other work [e.g. Snover et al., 2006] compares measures’ correlation with human-
targeted TER (HTER), an edit-distance to a human-revised reference. The metrics de-
veloped here are evaluated in terms of their correlation against both fluency/adequacy
judgement and against HTER scores.
4.2 Approach: the DPM family of metrics
The specific family of dependency pair match (DPM) measures described here combines
precision and recall scores of various decompositions of a syntactic dependency tree. Rather
than comparing string sequences, as BLEU does with its n-gram precision, this approach
defers to a parser for an indication of the relevant word tuples associated with meaning — in
these implementations, the head on which that word depends. Each sentence (both reference
and hypothesis) is converted to a labeled syntactic dependency tree and then relations from
each tree are extracted and compared. These measures may be seen as generalizations of
the earlier paper is cited in comparisons. Section 4.6 includes synonym matching, but over data which are not directly comparable with either Owczarzak paper and using an entirely different mechanism for combination.
Reference tree: The red cat ate 〈root〉 (arcs: det, mod, subj, root)
Hypothesis tree: The cat stumbled 〈root〉 (arcs: det, subj, root)
Reference dlh list: 〈the, det→, cat〉; 〈red, mod→, cat〉; 〈cat, subj→, ate〉; 〈ate, root→, 〈root〉〉
Hypothesis dlh list: 〈the, det→, cat〉; 〈cat, subj→, stumbled〉; 〈stumbled, root→, 〈root〉〉
Figure 4.1: Example dependency trees and their dlh decompositions.
dl decomposition: 〈the, det→〉; 〈cat, subj→〉; 〈stumbled, root→〉
lh decomposition: 〈det→, cat〉; 〈subj→, stumbled〉; 〈root→, 〈root〉〉
Figure 4.2: The dl and lh decompositions of the hypothesis tree in figure 4.1.
the dependency-pair F measures found in Owczarzak et al. [2007b].
The particular relations that are extracted from the dependency tree are referred to
here as decompositions. Figure 4.1 illustrates the dependency-link-head decomposition of a
toy dependency tree into a list of 〈d, l, h〉 tuples. Some members of the DPM family may
apply more than one decomposition; other good examples are the dl decomposition, which
generates a bag of dependent words with outbound links, and the lh decomposition, which
generates a bag of inbound link labels, with the head word for each included. Figure 4.2
shows the dl and lh decompositions for the same hypothesis tree.
The decompositions explored in various configurations in this chapter include:
dlh 〈Dependent, arc Label,Head〉 – full triple
dl 〈Dependent, arc Label〉 – marks how the word fits into its syntactic context
lh 〈arc Label,Head〉 – implicitly marks how key the word is to the sentence
dh 〈Dependent,Head〉 – drops syntactic-role information.
1g, 2g – simple surface unigram and bigram counts
Various members of the family may choose to include more than one of these decomposi-
tions.3
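As a minimal sketch (with a hypothetical arc-list representation of a dependency tree, not the parser's actual data structure), each decomposition can be computed as a bag of tuples:

```python
# Sketch of the DPM decompositions, assuming a dependency tree is given as
# a list of (dependent, arc-label, head) arcs, one arc per word.
from collections import Counter

def decompose(arcs, kind):
    """Return the bag (multiset) of tuples for one decomposition."""
    # Field indices into each (dependent, label, head) arc; per footnote 3,
    # a d decomposition would be equivalent to 1g.
    fields = {"dlh": (0, 1, 2), "dl": (0, 1), "lh": (1, 2),
              "dh": (0, 2), "1g": (0,)}[kind]
    return Counter(tuple(arc[i] for i in fields) for arc in arcs)

# Hypothesis tree of figure 4.1: "The cat stumbled"
hyp = [("the", "det", "cat"),
       ("cat", "subj", "stumbled"),
       ("stumbled", "root", "<root>")]
```

For example, `decompose(hyp, "dl")` yields the bag of 〈dependent, label〉 pairs shown in figure 4.2.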
It is worth noting here that the dlh and lh decompositions (but not the dl decomposition)
“overweight” the headwords, in that there are n elements in the resulting bag, but if a word
has no dependents it is found in the resulting bag exactly one time (in the dlh case) or
not at all (in the lh case). Conversely, syntactically “key” words, those on which many
other words in the tree depend, are included multiple times in the decomposition (once for
each inbound link). This “overweighting” effectively allows the grammatical structure of
the sentence to indicate which words are more important to translate correctly, e.g. “Basra”
in example (4.1), or head verbs (which participate in multiple dependencies).
A statistical parser provides confidences associated with parses in a probabilistically-
weighted N -best list, which we use to compute expected (probability-weighted) counts for
each decomposition in both reference and hypothesized translations. By using expected
counts, we may count partial matches in computing precision and recall. This approach
addresses both the potential for parser error and for syntactic ambiguity in the translations
(both reference and hypothesis).
When multiple decomposition types are used together, we may combine these subscores
in a variety of ways. Here, we experiment with using two variations of a harmonic mean:
computing precision and recall over all decompositions as a group (giving a single precision
and recall number) vs. computing precision and recall separately for each decomposition.
We distinguish between these using the notation in (4.2) and (4.3):
F[dl, lh] = µh(Prec(dl ∪ lh), Recall(dl ∪ lh)) (4.2)
µPR[dl, lh] = µh(Prec(dl), Recall(dl), Prec(lh), Recall(lh)) (4.3)
where µh represents a harmonic mean. (Note that when there is only one decomposition,
3No d decomposition is included: this would be equivalent to a 1g decomposition. An h decomposition might capture the syntactic weighting without the syntactic role that lh captures, but we find that lh has the same effect.
as in F [dlh], F [·] ≡ µPR[·].) Dependency-based SParseval [Roark et al., 2006] and the
d approach from Owczarzak et al. [2007a] may each be understood as F [dlh] (although
SParseval focuses on the accuracy of the parse, and Owczarzak et al. use a different
mechanism for generating trees for decomposition). The latter’s d var method may be
understood as something close to F [dl, lh]. BLEU4 is effectively µP (1g . . . 4g) with the
addition of a brevity penalty. Both the combination methods F and µPR are “naive” in that
they treat each component score as equivalent. When we introduce syntactic/paraphrasing
features in section 4.6, we will consider a weighted combination.
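The contrast between (4.2) and (4.3) can be sketched as follows, assuming each decomposition has already been computed as a Counter bag (as in the toy figure 4.1 trees); the bag construction here is illustrative, not the dissertation's implementation:

```python
# Sketch of the two combination methods: pooled F (4.2) vs. per-decomposition
# harmonic mean muPR (4.3).
from collections import Counter
from statistics import harmonic_mean

def prec_recall(ref_bag, hyp_bag):
    """Precision and recall of a hypothesis bag against a reference bag."""
    overlap = sum((ref_bag & hyp_bag).values())  # multiset intersection
    return overlap / sum(hyp_bag.values()), overlap / sum(ref_bag.values())

def F(ref_bags, hyp_bags):
    # (4.2): pool all decompositions into one bag, then take a single
    # precision/recall pair and their harmonic mean.
    p, r = prec_recall(sum(ref_bags, Counter()), sum(hyp_bags, Counter()))
    return harmonic_mean([p, r])

def muPR(ref_bags, hyp_bags):
    # (4.3): harmonic mean over the per-decomposition precisions and recalls.
    scores = []
    for ref, hyp in zip(ref_bags, hyp_bags):
        scores.extend(prec_recall(ref, hyp))
    return harmonic_mean(scores)

# dl and lh bags for the figure 4.1 reference and hypothesis trees
ref_dl = Counter([("the", "det"), ("red", "mod"), ("cat", "subj"), ("ate", "root")])
hyp_dl = Counter([("the", "det"), ("cat", "subj"), ("stumbled", "root")])
ref_lh = Counter([("det", "cat"), ("mod", "cat"), ("subj", "ate"), ("root", "<root>")])
hyp_lh = Counter([("det", "cat"), ("subj", "stumbled"), ("root", "<root>")])
```

On this symmetric toy example the two combinations happen to coincide; on real data, pooling (F) and per-decomposition averaging (µPR) generally differ.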
4.3 Implementation of the DPM family
The entire family of DPM measures may be implemented with any parser that generates
a dependency graph (a single labeled arc for each word, pointing to its head-word). Prior
work [Owczarzak et al., 2007a] on related measures has used an LFG parser [Cahill et al.,
2004] or an unlabelled dependency tree [Liu and Gildea, 2005].
In this work, we use a state-of-the-art PCFG (the first stage of Charniak and Johnson
[2005]) and context-free head-finding rules [Magerman, 1995] to generate an N -best list of
dependency trees for each hypothesis and reference translation. We use the parser’s default
(English) Wall Street Journal training parameters. Head-finding uses the Charniak parser’s
rules, with three modifications to make the semantic (rather than syntactic) relations more
dominant in the dependency tree: prepositional and complementizer phrases choose nom-
inal and verbal heads respectively (rather than functional heads) and auxiliary verbs are
dependents of main verbs (rather than the converse). These changes capture the idea that
main verbs are more important for adequacy in translation, as illustrated by the functional
equivalence of “have also closed” vs. “also closed” in the introductory example.
Having constructed the dependency tree, we label the arc between dependent d and
its head h as A/B when A is the lowest constituent-label headed by h and dominating d
and B is the highest constituent label headed by d. For illustrations, in figure 4.3, the
s node is the lowest node headed by stumbled that dominates cat, and the np node is
the highest constituent label headed by cat, so the arc linking cat to stumbled is labelled
s/np. This strategy is very similar to one adopted in the reference implementation of
Headed constituent tree: (root/stumbled (s/stumbled (np/cat (dt/the the) (nn/cat cat)) (vp/stumbled (vbd/stumbled stumbled))))
Derived labeled dependency tree: The cat stumbled 〈root〉, with arc labels np/dt (the → cat), s/np (cat → stumbled), root/s (stumbled → 〈root〉)
Figure 4.3: An example headed constituent tree and the labeled dependency tree derived
from it.
labelled-dependency SParseval [Roark et al., 2006], and may be considered as a shallow
approximation of the rich semantics generated by LFG parsers [Cahill et al., 2004]. The
A/B labels are not as descriptive as the LFG semantics, but they have a similar resolution
in English (with its relatively fixed word order), e.g. the s/np arc label usually represents
a subject dependent of a sentential verb.
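The A/B labelling rule can be sketched on a toy headed tree; the nested-tuple representation and string-based word identity below are illustrative assumptions, not the parser's actual data structures:

```python
# Sketch of the A/B arc-labelling rule: A is the lowest constituent headed
# by the head word that dominates the dependent; B is the highest
# constituent headed by the dependent. Tree nodes are (label, headword,
# children) tuples; leaves have an empty children list.
def nodes(tree, depth=0):
    """Yield (depth, node) for every node in the tree."""
    yield depth, tree
    for child in tree[2]:
        yield from nodes(child, depth + 1)

def words_of(tree):
    """The set of words dominated by a node (toy: words are unique strings)."""
    label, head, children = tree
    if not children:
        return {head}
    return set().union(*(words_of(c) for c in children))

def arc_label(tree, dep, head):
    # A: deepest node headed by `head` whose span contains `dep`.
    a = max((d, n[0]) for d, n in nodes(tree)
            if n[1] == head and dep in words_of(n))[1]
    # B: shallowest node headed by `dep`.
    b = min((d, n[0]) for d, n in nodes(tree) if n[1] == dep)[1]
    return f"{a}/{b}"

# The tree from figure 4.3: "The cat stumbled"
tree = ("root", "stumbled",
        [("s", "stumbled",
          [("np", "cat", [("dt", "the", []), ("nn", "cat", [])]),
           ("vp", "stumbled", [("vbd", "stumbled", [])])])])
```

On this tree, the arc linking cat to stumbled receives the label s/np, as in the running example.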
For the cases where we have N -best parse hypotheses, we use the associated parse prob-
abilities (or confidences) to compute expected counts. The sentence will then be represented
with more tuples, corresponding to alternative analyses. For example, if the N -best parses
include two different roles for dependent “Basra”, then two different dl tuples are included,
each with the weighted count that is the sum of the confidences of all parses having the
respective role.4
The parse confidence p is normalized so that the N -best confidences sum to one. Because
the parser is overconfident, we explore a flattened estimate: p̂(k) = p(k)^γ / Σi p(i)^γ, where k and i index the parses and γ is a free parameter.
4 The use of expectations with N -best parses is different from d 50 and d 50 pm in Owczarzak et al. [2007a], in that the latter uses the best-matching pair of trees rather than an aggregate over the tree sets, and they do not use parse confidences.
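The expected-count computation, with the γ flattening described above, might be sketched as follows; the N-best data layout is an assumption for illustration:

```python
# Sketch of expected decomposition counts over an N-best parse list with
# gamma-flattened confidences. Each parse is assumed to be a
# (probability, arcs) pair, with arcs as (dependent, label, head) triples.
from collections import Counter

def expected_counts(nbest, kind, gamma=0.25):
    fields = {"dlh": (0, 1, 2), "dl": (0, 1), "lh": (1, 2), "dh": (0, 2)}[kind]
    powered = [p ** gamma for p, _ in nbest]  # p(k)^gamma
    z = sum(powered)                          # normalizer over the N-best list
    counts = Counter()
    for w, (_, arcs) in zip(powered, nbest):
        for arc in arcs:
            # Each tuple accumulates the flattened confidence of every
            # parse that proposes it.
            counts[tuple(arc[i] for i in fields)] += w / z
    return counts
```

With gamma = 0 the parses contribute uniformly; with gamma = 1 the (normalized) parser confidences are used unchanged, matching the two endpoints explored in section 4.4.3.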
4.4 Selecting EDPM with human judgements of fluency & adequacy
We explore various configurations of the DPM by assessing the results against a corpus
of human judgements of fluency and adequacy, specifically the LDC Multiple Translation
Chinese corpus parts 2 [LDC, 2003] and 4 [LDC, 2006], which are composed of English
translations (by machine and human translators) of written (and edited) Chinese newswire
articles. For each article in these corpora, multiple human evaluators provided judgements
of fluency and adequacy for each sentence (assigned on a five-point scale), with each judge-
ment using a different human judge and a different reference translation. For a rough5
comparison with Owczarzak et al. [2007a], we treat each judgement as a separate segment,
which yields 16,815 tuples of 〈hypothesis, reference, fluency, adequacy〉. We compute per-
segment correlations.6 The baselines for comparison are case-sensitive BLEU (4-grams, with
add-one smoothing) and TER.
The specific dimensions of DPM explored include:
Decompositions. We compute precision and recall of several different decompositions:
d,dl,dlh increasing n-grams, directed up through the tree, as inspired by BLEU4 and
Liu and Gildea [2005].
dl,lh partial decomposition, to match d var
dlh all labeled dependency link pairs, as suggested by SParseval and d
1g,2g surface unigrams and bigrams only
Parser variations. When using more than one parse, we explore:
Size of N-best list. 1 (adopting only the best parse) or 50 (as in Owczarzak et al.
[2007a])
5Our segment count differs slightly from Owczarzak et al. [2007a] for the same corpus: 16,807 vs. 16,815. As a result, the baseline per-segment correlations differ slightly (BLEU4 is higher here, while TER here is lower), but the trends in gains over those baselines are very similar.
6The use of the same hypothesis translations in multiple comparisons in the Multiple Translation Corpusmeans that scored segments are not strictly independent, but for methodological comparison with priorwork, this strategy is preserved.
Table 4.1: Per-segment correlation with human fluency/adequacy judgements of different
combination methods and decompositions.
metric r
BLEU4 0.218
F [1g, 2g, dl, lh] 0.237
µPR[1g, 2g, dl, lh] 0.217
F [1g, 2g] 0.227
µPR[1g, 2g] 0.215
F [1g, dl, dlh] 0.227
F [dl, lh] 0.226
µPR[dl, lh] 0.208
Parse confidence. The distribution flattening parameter is varied from γ = 0 (uni-
form distribution) to γ = 1 (no flattening).
Score combination. Global F vs. component harmonic mean µPR.
4.4.1 Choosing a combination method: F vs. µPR
In table 4.1, we compare combination methods for a variety of decompositions. These
results demonstrate that F consistently outperforms µPR as well as the BLEU4 baseline
(see table 4.2). µPR measures are never better than BLEU; µPR combinations are thus not
considered further in this work.
4.4.2 Choosing a set of decompositions
Considering only the 1-best parse, we compare DPM with different decompositions to the
baseline measures. Table 4.2 shows that all decompositions except [dlh] have a better
per-segment correlation with the fluency/adequacy scores than TER or BLEU4. Includ-
ing progressively larger chunks of the dependency graph with F [1g, dl, dlh], inspired by the
Table 4.2: Per-segment correlation with human fluency/adequacy judgements of baselines
and different decompositions. N = 1 parses used.
metric |r|
F [1g, 2g, dl, lh] 0.237
F [1g, 2g] 0.227
F [dl, lh] 0.226
BLEU4 0.218
F [dlh] 0.185
TER 0.173
BLEUk idea of progressively larger n-grams, did not give an improvement over [dl, lh]. De-
pendencies [dl, lh] and string-local n-grams [1g, 2g] give similar results, but the combination
of all four decompositions [1g, 2g, dl, lh] gives further improvement in correlation over their
use in isolation. The results also confirm, with a PCFG, what Owczarzak et al. [2007a]
found with an LFG parser: that partial-dependency matches are better correlated with hu-
man judgements than full-dependency links. We speculate that this improvement is because
partial-dependency matches are more forgiving: they allow the system to detect that a word
is used in the proper context without requiring its syntactic neighbors to also be translated
in the same way.
4.4.3 Choosing a parse-flattening γ
Since the parser in our implementation provides a confidence in each parse, we explore the
use of that confidence with the γ free parameter and N = 50 parses. Table 4.3 explores
various “flattenings” (values of γ) of the parse confidence in the F [·] measure. γ = 1 is
not always the best, suggesting that the parse probabilities p(tree|words) are overconfident.
The differences are small, but the trends are consistent across all the decompositions tested
here. We find that γ = 0.25 is generally the best flattening of the parse confidence for
the variants of this measure that we have tested: it is nearest the maximum r for both
Table 4.3: Considering values of γ,N = 50 (and one N = 1 case) for two different sub-graph
lists (dl, lh and 1g, 2g, dl, lh).
γ F [1g, 2g, dl, lh] F [dl, lh]
1 0.239 0.232
0.75 0.240 0.233
0.5 0.240 0.234
0.25 0.240 0.234
0 0.239 0.234
[N = 1] 0.237 0.226
decompositions in table 4.3, though rounding hides the exact maxima.
Table 4.3 also shows the effect of using N -best parses for different decompositions. The
N = 50 cases are uniformly better than N = 1. While not all of these differences are
significant, there is a consistent trend of correlation r improving with 50 vs. 1 parse.
In summary, exploring a number of variants of the DPM metric against an average
fluency/adequacy judgement leads to a best-case of:
EDPM = F [1g, 2g, dl, lh], N = 50, γ = 0.25
We use this configuration in experiments assessing correlations with HTER.
4.5 Correlating EDPM with HTER
In this section, we compare the EDPM metric selected in the previous section to baseline
metrics in terms of document- and segment-level correlation with HTER scores using the
GALE 2.5 translation corpus [LDC, 2008]. The corpus includes system translations into
English from three SMT research sites, all of which use system combination to integrate re-
sults from several systems, some phrase-based and some that use syntax on either the source
or target side. No system provided system-generated parses; the EDPM measure’s parse
structures are generated entirely at evaluation time. The source data includes Arabic and
Chinese in four genres: bc (broadcast conversation), bn (broadcast news), nw (newswire),
Table 4.4: Corpus statistics for the GALE 2.5 translation corpus.
Arabic Chinese Total
doc sent doc sent doc sent
bc 59 750 56 1061 115 1811
bn 63 666 63 620 126 1286
nw 68 494 70 440 138 934
wb 69 683 68 588 137 1271
Total 259 2593 257 2709 516 5302
and wb (web text), with corpus sizes shown in table 4.4. This data may thus be broken
down in several ways: as one large corpus, by language into two corpora (one derived
from Arabic and one from Chinese), by genre into four corpora, or by language×genre
into eight subcorpora. The corpus includes one English reference translation [LDC, 2008] for
each sentence and a system translation for each of the three systems. Additionally, each
of the system translations has a corresponding “human-targeted” reference aligned at the
sentence level, so we may compute the HTER score at both the sentence and document
level.
HTER and automatic scores all degrade, on average, for more difficult sentences. Since
there are multiple system translations in this corpus, it is possible to roughly factor out this
source of variability by correlating mean-normalized scores,7 m̄(ti) = m(ti) − (1/I) Σj=1..I m(tj),
where m can be HTER, TER, BLEU4 or EDPM, and ti represents the i-th translation of
segment t. Mean-removal ensures that the reported correlations are among differences in
the translations rather than among differences in the underlying segments.
7Previous work [Kahn et al., 2008] reported HTER correlations against pairwise differences among translations derived from the same source to factor out sentence difficulty, but this violates independence assumptions used in the Pearson’s r tests.
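The mean-removal step itself is simple; a minimal sketch, assuming the scores for the I translations of one source segment are collected in a dict keyed by system:

```python
# Sketch of mean removal across the I system translations of one segment:
# subtract the per-segment mean so correlations reflect differences among
# translations rather than differences in segment difficulty.
def mean_removed(scores):
    mean = sum(scores.values()) / len(scores)
    return {system: s - mean for system, s in scores.items()}
```

After mean removal the scores for each segment sum to zero by construction, so any remaining variation across systems is attributable to the translations themselves.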
Table 4.5: Per-document correlations of EDPM and others to HTER, by genre and by
source language. Bold numbers are within 95% significance of the best per column; italics
indicate that the sign of the r value has less than 95% confidence (that is, the value r = 0
falls within the 95% confidence interval).
r vs. HTER bc bn nw wb all Arabic all Chinese all
TER 0.59 0.35 0.47 0.17 0.54 0.32 0.44
−BLEU4 0.42 0.32 0.46 0.27 0.42 0.33 0.37
−EDPM 0.69 0.39 0.47 0.27 0.60 0.39 0.50
Table 4.6: Per-sentence, length-weighted correlations of EDPM and others to HTER, by
genre and by source language. Bold numbers indicate significance as above.
r vs. HTER bc bn nw wb all Arabic all Chinese all
TER 0.44 0.29 0.33 0.25 0.44 0.25 0.36
−BLEU4 0.31 0.24 0.29 0.25 0.31 0.24 0.28
−EDPM 0.46 0.31 0.34 0.30 0.44 0.30 0.37
4.5.1 Per-document correlation with HTER
Table 4.5 shows per-document Pearson’s r between −EDPM and HTER, as well as the
TER and −BLEU4 baselines’ Pearson’s r with HTER. (We correlate with negative BLEU4
and EDPM to keep the sign of a good correlation positive.) EDPM has the best correlation
overall, as well as in each of the subcorpora created by dividing by genre or by source
language. In structured data (bn and nw), these differences are not significant, but in the
unstructured domains (wb and bc), EDPM is always significantly better than at least one
of the comparison baselines.
4.5.2 Per-sentence correlation with HTER
Table 4.6 presents per-sentence (rather than per-document) correlations based on scores,
weighted by sentence length in order to get a per-word measure of correlation which reduces
variance across sentences. (Even with length weighting, the r values have smaller magnitude
due to the higher variability at the sentence level.) EDPM again has the largest correlation
in each category, but TER has r values within 95% confidence of EDPM scores on nearly
every breakdown.
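One plausible reading of the length weighting is to treat sentence lengths as observation weights in Pearson's r; the following is a sketch under that assumption, not a formula quoted from the text:

```python
# Sketch of a length-weighted Pearson correlation: each sentence pair
# (x_i, y_i) is weighted by its length w_i, yielding a per-word measure.
def weighted_pearson(x, y, w):
    sw = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / sw  # weighted means
    my = sum(wi * yi for wi, yi in zip(w, y)) / sw
    cov = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
    vx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
    vy = sum(wi * (yi - my) ** 2 for wi, yi in zip(w, y))
    return cov / (vx * vy) ** 0.5
```

With uniform weights this reduces to the ordinary Pearson correlation.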
4.6 Combining syntax with edit and semantic knowledge sources
While the results in the previous section show that EDPM is as good or better than base-
line measures TER and BLEU4, the correlation is still low. This result is consistent with
intuitions derived from the example in section 4.2, where the EDPM score is much less
than 1 for the good translation. For that reason, we investigated combining the alternative
wording features (synonymy and paraphrase) of TERp [Snover et al., 2009] with the EDPM
syntactic features.
The TERp tools take an entirely different approach from EDPM. Rather than intro-
duce grammatical structure, the TERp (“TER plus”) model extracts counts of multiple
classes of edit operations and linearly combines the costs of those operations. These op-
erations extend the TER operations (insert, delete, substitute, and shift) to include also
“substitute-stem”, “substitute-synonym” and “substitute-paraphrase” operations that rely
on external knowledge sources (stemmers, synonym tables, and paraphrase tables respec-
tively). TERp’s approach thus exploits a knowledge source that is relatively well-separated
from the grammatical-structure information provided by EDPM.
To determine the relative cost of each class of edit operation, TERp provides an optimizer
for weighting multiple simple subscores. The TERp optimizer performs a hill-climbing
search, with randomized restarts, to maximize the correlation of a linear combination of the
subscores with a set of human judgements. Within the TERp framework, the subscores are
the counts of the various edit types, normalized for the length of the reference, where the
counts are determined after aligning the MT output to the reference using default (uniform)
edit costs.
The experiments here use the TERp optimizer but extend the set of subscores by includ-
ing the syntactic and n-gram overlap features (modified to reflect false and missed detection
rates for the TERp format rather than precision and recall). The subscores explored include:
E : the 8 fully syntactic subscores from the DPM family, including false/miss error rates
for the expected values of dl, lh, dlh, and dh decompositions.
N : the 4 n-gram subscores from the DPM family; specifically, error rates for the 1g and
2g decompositions.
T : the 11 subscores from TERp, which include matches, insertions, deletions, substitu-
tions, shifts, synonym and stem matches, and four paraphrase edit scores.
For these experiments, we again use the GALE 2.5 data, but with 2-fold cross-validation
in order to have independent tuning and test data. Documents are partitioned randomly,
such that each subset has the same document distribution across source-language and genre.
As in section 4.5.2, the objective is length-normalized per-sentence correlation with HTER,
using mean-removed scores as before. In figure 4.4, we plot the Pearson’s r (with 95%
confidence interval) for the results on the two test sets combined, after linearly normalizing
the predicted scores to account for magnitude differences in the learned weight vectors. The
baseline scores, which involve no tuning, are not normalized.
The left side of figure 4.4 shows that TER and EDPM are significantly more correlated
with HTER than BLEU when measured in this dataset, which is consistent with the overall
results of the previous section. It is also worth noting that the N+E combination is not
equivalent to EDPM (though it has the same decompositions of the syntactic tree), but
EDPM’s combination strategy yields a more robust r correlation with HTER. The N+E
combination outperforms E alone (i.e. it is helpful to use both n-gram and dependency
overlap) but gives lower performance than EDPM because of the particular combination
technique. Both findings are consistent with the fluency/adequacy experiments in sec-
tion 4.4. The TERp features (T in figure 4.4), which account for synonym/paraphrase
Figure 4.4: Pearson’s r for various feature tunings, with 95% confidence intervals. EDPM,
BLEU and TER correlations are provided for comparison.
differences, have much higher correlation with HTER than the syntactic E+N subscores.
However, a significant additional improvement is obtained by adding syntactic features to
TERp (T+E). Adding the n-gram features to TERp (T+N) gives almost as much improve-
ment, probably because most dependencies are local. There is no further gain from using
all three subscore types.
4.7 Discussion
In summary, this chapter introduces the DPM family of dependency pair match measures.
Through a corpus of human fluency and adequacy judgements, we select EDPM, a member
of that family with promising predictive power. We find that EDPM is superior to BLEU4
and TER in terms of correlation with human fluency/adequacy judgements and as a per-
document and per-sentence predictor of mean-normalized HTER. We also experiment with
including syntactic (EDPM-style) features and synonym/paraphrase features in a TERp-
style linear combination, and find that the combination improves correlation with HTER
over either method alone. EDPM’s approach is shown to be useful even beyond TERp’s
own state-of-the-art use of external knowledge sources.
One difference with respect to the work of Owczarzak et al. [2007a] is the use of a PCFG
vs. an LFG parser. The PCFG has the advantage that it is publicly available and easily
adaptable to new domains. However, the performance varies depending on the amount of
labeled data for the domain, which raises the question of how sensitive EDPM and related
measures are to parser quality.
A limitation of this method for MT system tuning is the computational cost of parsing
compared to word-based measures such as BLEU or TER. Parsing every sentence with the
full-blown PCFG parser, as done here, is hundreds of times slower than these simple n-gram
methods. Two alternative low-cost use scenarios are late-pass evaluation, for choosing
between different system architectures, and system diagnostics, comparing the relative
quality of these component scores with those of an alternative configuration.
Chapter 5
MEASURING COHERENCE IN WORD ALIGNMENTS FOR AUTOMATIC STATISTICAL MACHINE TRANSLATION
Syntactic trees (of the type described in section 2.1) fundamentally capture two kinds
of information: dependency and span. Chapters 3 and 4 primarily use dependency links
in their evaluation (from word to word within the same sentence). This chapter, by con-
trast, explores the utility of span information in natural language processing, specifically in
the analysis of automatically-generated word-alignments in statistical machine translation
bitexts.
Statistical machine translation (introduced and briefly sketched in section 2.4) uses word-
to-word alignment as a core component in its model training, perhaps most critically as a
source of aligned bitexts for the construction of the phrase table. For the creation of
the phrase tables, a key concern is that bitext alignments of low quality will induce poor
phrase tables. For example, a single stray alignment link can greatly reduce the number
of useful phrases that may be extracted, as in figure 5.1. In hierarchical or syntactic sta-
tistical MT systems, too, incorrect alignments may lead to lower-quality phrasal structure;
higher-quality alignments offer more opportunities for any of these systems to learn correct
translations by example.
The machine alignments in figure 5.1, for example, prevent the alignment of the noun
phrase “唯一遗憾的” to “the only regret”. They do still allow larger clusters to be mutually
aligned (e.g. “唯一 遗憾 的 是” with “the only regret was in the”) and a few of the smaller
alignments are still possible (e.g., “唯一” may still be aligned straightforwardly to “only”)
but the extra alignment links in the lower alignment force the Chinese span NP1 to be
incoherent: its projection in the English side of the lower alignment surrounds the projection
of words (e.g. 是) that do not belong to NP1.
This chapter makes explicit this mechanism for describing the coherence of a monolingual
唯一 遗憾 的 是 单杠 。
The only regret was in the horizontal bar .
NP1
唯一 遗憾 的 是 单杠 。
The only regret was in the horizontal bar .
Figure 5.1: A Chinese sentence (about the 2008 Olympic Games) and its translation, with
reference alignments (above) and alignments generated by unioned GIZA++ (below). Bold
dashed links in the lower link-set indicate alignments that force NP1 to be incoherent.
span in an aligned bitext, and explores the coherence of syntactically-motivated spans over
alignments generated by human and machine. Further exploration uses this measure of
coherence to choose among alignment candidates derived from multiple machine alignments,
and a following approach uses coherent regions to assemble a new, improved alignment from
two automatic alignments.
Section 5.1 describes the relevant background for this chapter. Section 5.2 outlines the
notion of coherence used here, and describes how it is computed on a given span. Sec-
tion 5.3 outlines the preparation of data for the explorations performed here: the corpora of
Chinese-English bitexts and manual alignments, and the construction of several automatic
alignments for comparison with these coherence metrics. Section 5.4 examines the per-
formance of the various alignment systems in terms of alignment quality (against manual
alignments) and the coherence of certain linguistically-motivated categories, and demon-
strates that the coherence measures correspond to the alignment quality of those systems.
In section 5.5, we explore using the coherence measures to select a better alignment from
a pool of alignment candidates, and section 5.6 explores the creation of hybrid alignments
by combining members from the varied system-alignments assembled here. Section 5.7
discusses the implications (linguistic and practical) of these findings.
5.1 Background
Word alignments, as discussed in section 2.4.1, are an important part of the preparation of
a parallel corpus for the training of statistical machine translation engines. A wide variety
of statistical systems build their models from aligned parallel corpora, whether to extract
word-by-word translation parameters, as in the IBM models [Brown et al., 1990], “phrase”
tables as in Moses [Koehn et al., 2007], or more syntactically-involved systems such as the
Galley et al. [2006] syntactic translation models. As a tool for building and evaluating these
aligned parallel corpora, Och and Ney [2003] proposed an alignment evaluation scheme
“alignment error rate” (AER), in the hope that an intrinsic measure of evaluating align-
ments could shorten the development cycle for new statistical machine translation systems
(eliminating the need to try the entire pipeline).
AER is based on an F -measure over reference alignments (“sure”, S) and proposed (A)
alignment links:
AER(S, A) = 1 − (2 × |S ∩ A|) / (|S| + |A|)          (5.1)
This formulation1 measures individual links rather than groups of links. A variety of other
systems have explored using supervised learning over manually-aligned corpora to improve
AER, with some success, including Ayan et al. [2005a,b], Lacoste-Julien et al. [2006] and
Moore et al. [2006], who mostly focused on improving the AER over English-French parallel
corpora.
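Equation 5.1 is simple to compute over link sets; the following sketch (with links represented as (English-index, foreign-index) pairs) illustrates the simplified, sure-links-only formulation:

```python
def aer(sure, proposed):
    """Alignment error rate: the simplified, 'sure'-links-only form of eq. 5.1.

    Both arguments are iterables of (e_index, f_index) link pairs; the score
    is 1 minus the F-measure of the proposed links A against the sure
    reference links S.
    """
    sure, proposed = set(sure), set(proposed)
    if not sure and not proposed:
        return 0.0  # two empty alignments agree perfectly
    return 1.0 - 2.0 * len(sure & proposed) / (len(sure) + len(proposed))
```

A perfect match scores 0.0 and fully disjoint link sets score 1.0; lower is better.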
Other metrics exist, e.g. CPER [Ayan and Dorr, 2006], which measures the F of possible
aligned phrases for inclusion in the phrase table, but in pilot experiments we found that per-
sentence oracle CPER was not consistent with an improvement in global CPER performance.
Fraser and Marcu [2006, 2007] find that optimizing alignments towards an AER variant
1The original formulation of AER was defined with both “sure” and “possible” reference alignment links. No reference data available for this task uses “possible” alignment links, so only the simplified version is presented here.
that weights recall more heavily than precision improves BLEU performance on
the language pairs they explored (English-Romanian, English-Arabic, and English-French).
However, Fossum et al. [2008] find that they can improve a syntactically-oriented statistical
machine translation engine by improving precision; their work focuses on deleting individual
links from existing GIZA++ alignments, using syntactic features derived directly from the
syntactic translation engine for Chinese-English and Arabic-English translation pairs.
Another approach to improving alignments with grammatical structure is to do simulta-
neous parsing and use the parse information to (tightly or loosely) constrain the alignment,
as in Lin and Cherry [2003], Cherry [2008], Haghighi et al. [2009] and Burkett et al. [2010],
who constrain parsers of one (or both) languages to engage in the parallel alignment pro-
cess. Rather than combine parse or span constraint information into a machine translation
or alignment decoder, this chapter explores span coherence measures (with spans derived
from a syntactic parser of Chinese) to select from multiple machine translation alignment
candidates over a corpus of manually-labeled Chinese-English alignments. Since evidence
for preferring an alignment error measure that over-weights precision or recall seems to be
ambiguous (and possibly dependent on the choice of translation engine), we retain AER as
the measure of alignment quality, and we explore the coherence measures’ ability to help
select alignments to reduce AER.
5.2 Coherence on bitext spans
We define a span s to be any region of adjacent words fi · · · fk on one side (here the source
language) of a bitext. Given a set of links a of the form 〈em, fn〉, we define the projection
of a span to be all nodes e such that a link exists between e and some element within s. We
further define the projected range s′ of the span s to be:
s′ = e_m · · · e_M, where m = min{i : e_i ∈ proj(s)} and M = max{i : e_i ∈ proj(s)}
and we define the reprojection of the span s to be the projected range of s′ (identifying a
range of nodes in the same sequence as s).
We may thus describe a span s as coherent when the reprojection of s is entirely
within s. However, we find it useful to categorize spans into four categories, characterized
Table 5.1: Four mutually exclusive coherence classes for a span s and its projected range s′
coherent The reprojection of s is entirely within s
null No link includes any term in fi . . . fk
subcoherent s is not coherent, but s′ is coherent
incoherent Neither s nor s′ is coherent.
[Figure 5.2 diagram: source words fi . . . fi+4 aligned to target words ej . . . ej+4, with example spans s0, s1, s2, s3 and projected ranges s′1, s′2.]
Figure 5.2: Examples of the four coherence classes. s1 is coherent (because it is its own
reprojection); s0 is null; s2 is incoherent (because its reprojection is s1 rather than a subset
of s2); and s3 is subcoherent (because its projection span s′1 is coherent).
in table 5.1. Figure 5.2 also includes examples of each of the coherence classes. While the
coherent, incoherent, and null coherence classes are fairly easily explained, subcoherent
spans are worth a brief digression: these spans often appear in alignments of two corre-
sponding phrases with non-compositional meanings. Such phrases often form a complete
bipartite subgraph, in that every source word in the phrase is linked to every target word
in the phrase. Any span that includes less than the entire phrase (on one side or the other)
will be subcoherent.
Unlike AER, coherence is not a measure against the reference alignment; it is instead
a measure of a particular span’s behavior in an alignment. It is not necessarily a sign of a
high-quality alignment, but section 5.4 explores how coherence corresponds with AER over
a pool of automatic alignment candidates.
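The four classes of table 5.1 follow mechanically from the projection and reprojection definitions above. A minimal sketch, treating an alignment as a set of (e, f) index pairs and a span as an inclusive index range on the source (f) side; this is an illustrative reconstruction of the definitions, not the implementation used in these experiments:

```python
def project(span, links, side=0):
    """Indices on the opposite side linked to any position within `span`.
    Links are (e, f) index pairs; side 0 means `span` ranges over f."""
    here, there = (1, 0) if side == 0 else (0, 1)
    return {link[there] for link in links if span[0] <= link[here] <= span[1]}

def projected_range(span, links, side=0):
    """Smallest contiguous range on the opposite side covering proj(span)."""
    proj = project(span, links, side)
    return (min(proj), max(proj)) if proj else None

def coherence_class(span, links):
    """Classify a source-side span into the four classes of table 5.1."""
    s_prime = projected_range(span, links, side=0)    # s' on the e side
    if s_prime is None:
        return "null"         # no link touches any term of the span
    reproj = projected_range(s_prime, links, side=1)  # reprojection of s
    if span[0] <= reproj[0] and reproj[1] <= span[1]:
        return "coherent"     # reprojection lies entirely within s
    back = projected_range(reproj, links, side=0)     # reprojection of s'
    if s_prime[0] <= back[0] and back[1] <= s_prime[1]:
        return "subcoherent"  # s is not coherent, but s' is
    return "incoherent"
```

Note that the subcoherent test applies the same reprojection criterion to s′ on the target side, mirroring the definition in the text.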
Table 5.2: GALE Mandarin-English manually-aligned parallel corpora used for alignment
evaluation and learning. Numbers here reflect the size of the newswire data available from
each corpus. Note that Phase 5 parts 1 and 2 (LDC2010E05 and LDC2010E13) had no
newswire data included.
LDC-ID    Name                                             sentences   English words   Chinese words
2009E54   Chinese Word Alignment Pilot                           290           8,818           6,329
2009E83   Phase 4 Chinese Alignment Tagging Part 1             2,092          76,487          55,145
2009E89   Phase 4 DevTest Chinese Word Alignment Tagging       2,829         101,484          73,794
2010E37   Phase 5 Chinese Parallel Word Alignment and
          Tagging Part 3                                         962          33,537          22,018
Total                                                          6,173         220,327         157,286
5.3 Corpus
These experiments focus on the alignment of Chinese-English parallel corpora. They make
use of both unaligned (sentence-aligned but not word-aligned) corpora and manually-aligned
corpora (aligned by both sentence and word). The key corpora for the experiments in this
chapter are the manual alignments generated by the GALE [DARPA, 2008] project. These
alignments take text and transcripts of spoken Mandarin Chinese and translations of both
into English and provide manual annotation of alignment links between the English and
Chinese words. Since Chinese word segmentation is not given by the text, the manual
alignments link English words to Chinese characters, even when more than one Chinese
character is required to form a word. Table 5.2 lists the sets of manual corpora used to
evaluate the aligners (and to train the rerankers). Chinese word counts are listed using
the number of words provided by automatic segmentation, and the alignment links (which
were manually aligned to individual characters) are collapsed to link the English words to
segmented Chinese words (rather than characters). The experiments in this chapter use only
the newswire segments of these corpora, so (although other genres of text and transcript
are available) only those numbers and sizes are reported here.
5.3.1 Corpus preparation
The analyses in this chapter are based on a comparison among these manual alignments and
those generated by automatic systems. The most popular automatic aligners, GIZA++ [Och
and Ney, 2003] and the Berkeley Aligner [DeNero and Klein, 2007], are unsupervised but
require training on very large bodies of parallel text; here we perform similar training to avoid
overly pessimistic automatic alignment results. Table 5.3 lists the component corpora used
to train the unsupervised aligners. As in table 5.2, the Chinese word count reflects the
number of word tokens returned by automatic segmentation.
State-of-the-art SMT systems for Chinese-to-English translation do word segmentation
and text normalization (the replacement, for example, of numbers and dates by $number
and $date tokens) before providing parallel text to the unsupervised aligner. In order to
provide automatic alignments for the corpora in table 5.2, all the corpora (both aligned
and unaligned, though alignments were discarded at this stage) were passed through the
Stanford word segmenter [Chang et al., 2008] and the SRI/UW GALE text normalization
system (on the Chinese side) and the RWTH text normalization system (on the English
side). Three aligners were trained on the resulting segmented and normalized parallel text:
• the Berkeley aligner [DeNero and Klein, 2007], which we refer to hereafter as berkeley,
which uses a symmetric alignment strategy;
• the GIZA++ aligner [Och and Ney, 2003], projecting from source-to-target (f -e),
which we refer to as giza.f-e; and
• the GIZA++ aligner, projecting from target-to-source (e-f), referred to as giza.e-f.
For each of the giza trainings, we further generate multiple additional alignment candidates:
the giza.e-f.NBEST and giza.f-e.NBEST lists retrieve the N = 10 best alignments from
each of the GIZA++ trainings. The berkeley system does not support N -best generation.
The parallel corpora from table 5.3 are then discarded: their role is only to improve
the unsupervised aligners trained above. Over the parallel text corpora in table 5.2, all of
Table 5.3: The Mandarin-English parallel corpora used for alignment training
Name (ID)                                            sentences   English words   Chinese words
ADSO Translation Lexicon 179,284 265,705 267,466
Chinese English News Magazine Parallel Text 269,479 9,233,773 8,826,377
Chinese English Parallel Text Project Syndicate 45,767 1,069,021 1,129,198
Chinese English Translation Lexicon (v3.0) 81,521 135,261 93,073
Chinese News Translation Text Part 1 10,264 314,377 279,512
Chinese Treebank English Parallel Corpus 4,064 123,825 92,996
CU Web Data (Oct 07) 34,811 883,886 894,809
FBIS Multilingual Texts 123,950 4,037,811 3,011,172
Found Parallel Text 180,222 5,345,040 4,713,169
GALE Phase 1 Chinese Blog Parallel Text 8,620 185,637 166,508
GALE Phase 2r1 Translations 14,768 347,480 286,093
GALE Phase 2r2 Translations 4,794 128,111 104,330
GALE Phase 2r3 Translations 21,360 387,458 322,524
GALE Phase 3 OSC Alignment (v1.0.FOUO) 4,915 183,812 134,975
GALE Phase 3r1 Translations 40,503 643,745 595,411
GALE Phase 3r2 Translations 5,786 177,357 149,250
GALE Y1 Interim Release Translations 20,926 446,367 398,043
GALE Y1Q1 Translations 6,618 147,574 128,740
GALE Y1Q2 FBIS NVTC Parallel Text (v2.0) 404,368 14,729,700 12,070,648
GALE Y1 Q2 Translations (v2.0) 9,382 194,171 172,106
GALE Y1 Q3 Translations 11,879 283,354 247,961
GALE Y1 Q4 Translations 30,496 572,210 506,563
Hong Kong Parallel Text 699,665 16,154,447 14,650,516
MITRE 1997 Mandarin Broadcast News Speech Translations (HUB4NE)   19,672   414,762   365,157
UMD CMU Wikipedia translation 77,162 181,592 145,069
Xinhua Chinese English Parallel News Text (v1β) 103,415 3,455,994 3,411,085
Total 2,413,691 60,042,470 53,162,751
the resulting alignments (including the reference alignments) were reconciled with the pre-
normalized text (using dynamic programming to synchronize the English side and retrieving
the original text from the SRI/UW GALE text-normalization system), but the Chinese word
segmentation was retained (for compatibility with later parsing).
Finally, we generate still more alignment candidates by performing union and intersec-
tion on the giza candidates and their corresponding N -best lists:
• The giza.union alignment is the union of giza.e-f and giza.f-e.
• Correspondingly, the giza.intersect alignment is the intersection of giza.e-f and
giza.f-e.
• The giza.union.NBEST and giza.intersect.NBEST alignments are alignments that
choose alignments from the giza.e-f.NBEST and giza.f-e.NBEST lists and union (or
intersect) them. In these experiments, the ranks of the e-f and f-e elements in these
unions or intersections are constrained to sum to no more than N + 1 = 11 (e-f2
union f-e9 is acceptable because 2 + 9 = 11 ≤ 11, but e-f5 union f-e7 is not included
because 5 + 7 = 12 > 11).
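The rank-sum constraint keeps the combined candidate pool from growing quadratically. A small sketch of the constrained union (the intersect variant replaces `|` with `&`); the N-best list inputs here are hypothetical:

```python
from itertools import product

def constrained_unions(ef_nbest, fe_nbest, n=10):
    """Union alignment candidates from two N-best lists, keeping only
    pairs whose 1-based ranks sum to at most N + 1.

    Each list element is an iterable of (e, f) link pairs; e.g. rank 2
    with rank 9 qualifies (2 + 9 = 11), but rank 5 with rank 7 does not.
    """
    candidates = []
    for i, j in product(range(len(ef_nbest)), range(len(fe_nbest))):
        if (i + 1) + (j + 1) <= n + 1:  # ranks are 1-based
            candidates.append(set(ef_nbest[i]) | set(fe_nbest[j]))
    return candidates
```

With two 10-best lists and N = 10, this admits 55 of the 100 possible pairings.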
5.3.2 Alignment performance of automatic aligners
Table 5.4 reports the AER as computed against the manual labeling of the corpus for the
five automatic alignments (excepting the N -best lists). It also includes the precision, recall,
and link density (in proportion to reference ldc link density). The Berkeley aligner has the
best AER, and its nearest competitor (the giza.union alignment) has the best alignment
recall. As one might expect, the giza.intersect alignments have the highest precision,
but this high precision comes at a high recall (and AER) cost. This table also includes
the per-sentence AER oracle alignment, which reflects the AER (as well as precision and
recall) of choosing, for each segment, the single alignment from the pool of five with the
best AER.2 The Berkeley system, we may note, seems to be strongly precision-heavy, while
2The per-sentence oracle is not necessarily the best possible overall AER from this candidate pool, since under some circumstances (especially when the precision/recall proportions are very imbalanced) one may minimize sentence-level AER and increase global AER.
Table 5.4: Alignment error rate, precision, and recall for automatic aligners. Link density
is in percentage of ldc links.
System AER Precision Recall Link density (%)
berkeley 32.87 84.21 55.81 68.52
giza.e-f 36.46 70.14 58.08 88.22
giza.f-e 40.31 76.78 48.82 67.06
giza.intersect 42.37 96.78 41.04 32.90
giza.union 35.42 63.34 65.87 122.38
(per-sentence) oracle 30.12 80.01 62.02 75.78
its competitor giza.union is more balanced, but stronger on recall.
As we might expect, the giza.intersect and giza.union alignments have the lowest
and highest link density respectively. Precision seems to rise as link density drops (again,
not unexpectedly), but berkeley is more precise than either of the directional giza systems
while having a higher link density. Even the oracle selection has a lower link density than
100%, because most of the candidates from which the per-sentence oracle chooses are lower-density
than the reference ldc alignments.
5.4 Analyzing span coherence among automatic word alignments
This section poses two questions:
• what kinds of spans are reliably coherent in reference alignments?
• what varieties of coherent spans are not captured well by current alignment algo-
rithms?
We examine the coherence of reference alignments to answer the first question, and compare
those coherences to those generated by the unsupervised automatic alignment systems. We
explore both syntactic and orthographic (not explicitly syntactic) techniques for identifying
spans over the Chinese source sentences.
Table 5.5: Coherence statistics over the spans delimited by comma classes

                               Alignment (% of spans)
span (counts)      coherence   ldc (ref)   giza.union   berkeley   giza.intersect
comma (16,932)     yes              77.9         23.4       58.9             83.1
                   no               19.2         60.6       35.3             13.7
                   sub               2.8         16.0        5.3              0.0
                   null              0.1          0.0        0.5              3.2
tadpole (14,682)   yes              83.5         22.8       62.1             88.0
                   no               14.0         59.5       32.6             10.0
                   sub               2.3         17.7        5.0              0.0
                   null              0.1          0.0        0.3              2.0
5.4.1 Orthographic spans
The first class of spans we consider consists of spans that may be extracted from orthographic
“segment” choices, namely, spans that are delimited by commas on the Chinese side of the
bitext. This delimitation is made more complicated by a property of Chinese orthography:
in many Chinese texts, a special3 “enumeration comma” is used to delimit items in a list.
If this standard were used uniformly, it would actually be a useful distinction, but the
enumeration comma is only used sometimes: on many occasions, when an enumeration
comma would be correct, Chinese writers or typesetters will use U+002C COMMA, which
we dub a “tadpole” comma to distinguish it from the conjoined class that includes the
enumeration comma. Nevertheless, the converse error (using enumeration commas when
tadpole commas are appropriate) does not seem to occur in the corpus, so there is still
information available in using only the tadpole commas as delimiters.
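Extracting the two delimitation schemes amounts to splitting on different comma sets. A sketch, following the text's convention that the "tadpole" comma is U+002C and the enumeration comma is U+3001 (the function name is illustrative):

```python
ENUM_COMMA = "\u3001"     # 、 the enumeration comma (顿号)
TADPOLE_COMMA = "\u002C"  # , the "tadpole" comma

def comma_spans(chars, include_enum=False):
    """Split a Chinese character sequence into comma-delimited spans.

    Returns inclusive (start, end) index ranges between delimiters; with
    include_enum=False only the tadpole comma delimits, which corresponds
    to the tadpole-span scheme of table 5.5.
    """
    delims = {TADPOLE_COMMA} | ({ENUM_COMMA} if include_enum else set())
    spans, start = [], 0
    for i, ch in enumerate(chars):
        if ch in delims:
            if i > start:
                spans.append((start, i - 1))
            start = i + 1
    if start < len(chars):
        spans.append((start, len(chars) - 1))
    return spans
```

Excluding the enumeration comma merges list-internal material into one span, exactly the distinction the tadpole scheme exploits.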
3Unicode uses code U+3001 for this symbol (、) and unfortunately dubs this character IDEOGRAPHIC COMMA, which is a misnomer: its Chinese name is “顿号”, which may be glossed as “pause symbol”.
We explore dividing sentences into orthographic regions using the orthographic dividers
of comma-delimited spans and “tadpole”-delimited spans. These delimiters — when
present — divide the sentence into non-overlapping regions. Table 5.5 shows the distribu-
tion of coherence values for comma-delimited spans and for tadpole-delimited spans over
the manually- and automatically-generated alignments in table 5.2. Sentences without a
comma-delimiter are omitted from the counts here; otherwise the proportion of coherent
spans would be (trivially) higher in all alignments (since a span covering the entire sentence
will always be coherent). From the differences in the reference (ldc) alignments, we can see that using
tadpole spans instead of commas improves the proportion of coherent spans to 83.5% and
reduces the proportion of non-coherent spans; this result speaks to the utility of excluding
the enumeration comma from use as a delimiter.
Also, we may observe that the berkeley alignments, which have the best AER of the
three automatic systems compared here, also consistently have intermediate values (be-
tween the low-recall giza.intersect system and the low-precision giza.union) in all four
of the coherence classes. Among these orthographic spans, the high-precision, low-recall
giza.intersect system performs the closest to ldc manually-annotated alignments, but
overpredicts both coherence and null-links, probably because its link density is too low
overall.
By comparing the coherence measures over these orthographic spans, we find supporting
evidence that the berkeley alignments are the best (because they are the most
similar to the reference). It is difficult to say whether the orthographic (comma- and
tadpole-delimited) spans are useful constraints on alignment regions for evaluating alignment
accuracy, however: the giza.union and giza.intersect results confirm that their
over-linking and under-linking (respectively) cross even these comma delimiters.
orthographic delimitation represents a mixture of different linguistic phenomena, so we turn
instead to grammatical span exploration.
5.4.2 Syntactic spans
To explore the information available from the parser, we parse each of the source sentences in
the aligned corpus with a parser [Harper and Huang, 2009] tuned to produce Penn Chinese
Table 5.6: Coherence statistics over the spans delimited by certain syntactic non-terminals

                           Alignment (% of spans)
span (count)   coherence   ldc    giza.union   berkeley   giza.intersect
NP (59,635)    yes         72.4         44.5       74.9             81.0
               no          16.4         44.6       17.0              4.1
               sub         10.0         10.9        3.9              0.0
               null         1.2          0.0        4.2             14.9
VP (37,167)    yes         64.4         26.3       64.0             78.5
               no          20.7         58.1       28.1              9.6
               sub         14.4         15.7        5.3              0.0
               null         0.6          0.0        2.6             11.9
IP (14,738)    yes         65.7         22.7       59.8             84.2
               no          17.9         60.6       33.9             10.4
               sub         16.2         16.7        5.5              0.0
               null         0.2          0.0        0.7              5.4
Treebank [Xue et al., 2002] parse trees. The Chinese word segmentation from the alignment
steps in section 5.3 is retained.
Table 5.6 shows the same systems as the previous section, but using instead the spans
labeled4 by the parser as NP, VP, or IP (noun, verb, or inflectional phrase; IP is the Chinese
Treebank equivalent to a sentence or small clause). We choose these categories because they
are three core non-terminal categories of the treebank, each with a strong and relatively
theory-agnostic linguistic basis. Furthermore, these three categories together make up more
than 70% of the non-terminals in the parse trees produced by the automatic parsers used
in these experiments. It is reasonable to expect that most of these phrases are coherent in
the reference alignment, and indeed they are (72.4% coherent NPs, 64.4% coherent VPs,
4Spans that cover the entire sentence are not included in these counts; by definition such spans are always coherent, but this is not informative.
and 65.7% coherent IPs).
Again, we may observe in table 5.6 that the berkeley system’s coherence is interme-
diate between the giza.union and giza.intersect systems’ coherence values. For these
syntactic spans, the berkeley alignments are much closer to the human labels than the
giza.intersect, which substantially overpredicts coherence of these smaller units. How-
ever, berkeley alignments overpredict incoherent spans on VPs and IPs, and giza.union
also overpredicts incoherence on NPs. Together, these results suggest that the union align-
ments are too link-dense, the intersect alignments too sparse, and the berkeley align-
ments just about right — although berkeley seems to still make syntactically-unaware
errors, inducing incoherent spans.
It is interesting to note that the giza.intersect results are actually over-coherent,
due to their low link density, but that alignment also has a worse AER (due to its low recall).
Accordingly, high coherence is not necessarily neatly correlated with improvements to AER.
5.4.3 Syntactic cues to coherence
We find in the previous two subsections that (for example) though roughly 83.5% of tadpole
spans are coherent under the ldc (reference) alignments, only about 65% of IP and VP
spans are coherent in the reference alignments. These proportions are low enough, for the
syntactic classes, to suggest inquiry into what characteristics indicate that a span of given
syntactic label XP is likely to be coherent. To explore this question, we build a binary
decision tree using the WEKA toolkit [Hall et al., 2009] over each collection of XP spans
(where XP ∈ {NP,VP, IP}), where the decision tree is binary over the following syntactic
features from that XP ’s structure:
• whether that span is also a tadpole-delimited span,
• the syntactic tags of that XP ’s syntactic children,
• the syntactic tags of that XP ’s syntactic parents, and
• the length of the XP in question.
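These features reduce to a small per-constituent feature dictionary before being handed to the decision-tree learner. A sketch over a hypothetical tree-node interface (the `Node` class and attribute names here are illustrative, not WEKA's API):

```python
class Node:
    """Minimal constituent: label, inclusive word span, children, parent."""
    def __init__(self, label, span, children=(), parent=None):
        self.label, self.span = label, span
        self.children, self.parent = list(children), parent

def span_features(xp, tadpole_spans):
    """Extract the four syntactic feature groups listed above for an XP.

    `tadpole_spans` is the set of (start, end) tadpole-delimited spans of
    the same sentence, used for the binary tadpole-span feature.
    """
    feats = {
        "is_tadpole_span": xp.span in tadpole_spans,
        "length": xp.span[1] - xp.span[0] + 1,
    }
    for child in xp.children:
        feats["child=" + child.label] = True   # tags of syntactic children
    if xp.parent is not None:
        feats["parent=" + xp.parent.label] = True  # tag of syntactic parent
    return feats
```

Such dictionaries map directly onto binary/numeric attributes in a WEKA-style feature file.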
(a) VP tree: VP (64% coherent)
    NT-parent = CP (2,291): 29.7% coherent
    NT-parent ≠ CP (34,876): 66.6% coherent

(b) IP tree: IP (65.7% coherent)
    NT-parent = CP (3,362): 29.9% coherent
    NT-parent ≠ CP (11,376): 75.6% coherent
Figure 5.3: Decision trees for VP and IP spans. The decision tree did not find error-reducing
distinctions among the NP spans.
These features were chosen as a reasonable characterization of the syntactically-local in-
formation (roughly parallel to the information provided on arc-links in the SParseval
measure in chapter 3). Although some IP spans (unsurprisingly) cover the entire sentence,
spans over the entire sentence are not included in this analysis (whole-sentence spans are
always coherent, by definition). Figure 5.3 shows the top forks of the VP and IP decision
trees (the single decision offering the greatest error reduction). We may observe an inter-
esting commonality: in both IPs and VPs, the majority of spans with a parent label of
CP (“complementizer phrase”) are not coherent in the reference alignment. Anecdotally,
the CP-over-IP construction seems to occur incoherently in the bitext when the CP-over-IP
marks a construction that is divided in English, e.g. the example in figure 5.4. This kind
of construction, which may be expressed in English as a pre-and-post-modified NP (“[np
[np the largest X] in the world]”) or a left-modified NP (“[np the world’s largest X]”) is
likely to be incoherent, despite having a uniform analysis as CP-over-IP in Chinese (“[cp
[ip 世界 最大 X]]”). It is also worth observing that the IP node in this example has a
unary expansion to a VP predicate (with two parts), and so accounts for some of the same
Thailand be world most big comp rice export+nation .
泰国 是 世界 最 大 的 稻米 出口国 。
Thailand is the largest rice exporter in the world .
[The figure marks span cp1 and the embedded span ip1 over the Chinese sentence.]
Figure 5.4: An example incoherent CP-over-IP. Note that ip1’s reprojection to English is
actually larger than ip1 itself (since it includes “rice exporter” within its English projection
span) and larger than cp1 in which it is embedded. Had the English translation chosen the
phrasing “the world’s largest rice exporter”, ip1 would be coherent.
incoherent CP-over-VP spans in figure 5.3(a) as well.
It is also of note (though not visible in figure 5.3) that the tadpole feature was available
to the decision tree but was never selected, even when the tree was allowed to ramify further,
suggesting that this orthographic information is not useful in determining the coherence of
these syntactic spans.
From table 5.6 and figure 5.3, we may see that the base rate of coherence for NPs, VPs
and IPs (at least, those not immediate children of CPs) is at about 70% for each, with IPs
being particularly promising — at 76% coherence — but relatively rare.
5.4.4 A qualitative survey of incoherent IP spans
Inspired by the example in figure 5.4, we extracted 50 spans labeled IP by the parser that
were incoherent according to the ldc reference alignment. The categories suggested here
were selected to attempt to characterize the reasons that these parser-indicated spans are
incoherent. The most common category of differences (about 30%) were alignments where
a clause-modifying adverb was used in English somewhere other than the left or right edge
of the clause (and where in Chinese, the clause-modifier lives outside the IP). One common
scenario among those examined here was a clause-external Chinese adverb that is aligned
Table 5.7: Some reasons for IP incoherence

Reason                                                                 n
Sentential adverb between subject and main verb in English            14
IPs in conjunction: English-language ellipsis; Chinese repeated word   8
Two-part predicates in Chinese pre- and post-modify noun in English    6
Punctuation differences (periods inside quotes)                        3
Other translation divergences                                         10
Parsing attachment errors introducing incoherence                      9
with an English adverb after the first (finite) verb in an English clause, as in figure 5.5.
Nearly 20% of the incoherent spans were incoherent because of parsing attachment errors,
usually because a Chinese adverb was attached low within an adjacent small IP when it
should have been considered a sentential modifier. Improving parser performance on the
correct attachment of clausal adverbs would be valuable here.
Another key challenge is found when Chinese and English disagree on whether all com-
ponents of a conjunct need to be repeated. Although Chinese omits pronouns in many
circumstances, it was the English conjunction of subject-less VPs that introduced incoher-
ence. About 16% of the incoherent IP spans are attributable to ellipsis in the English side
(alternatively, a choice to repeat a term in Chinese which is left out in English), e.g. as
in figure 5.6. The remaining categories of incoherent IP include Chinese two-part CP/IP
predicates that pre- and post-modify a noun in English (about 12%, as in figure 5.4) and a
small variety of others.
5.5 Selecting whole candidates with a reranker
In previous sections, we see evidence that the coherence of certain categories corresponds to
alignment quality, and that it is — at least in principle — possible to select an alignment
with better AER from the pool of candidates: the oracle scores in table 5.4 demonstrate
[IP [NP 该指数 “this index”]
    [VP [ADVP [AD 也有 “sometimes”]]
        [VP [LB 被 “by”]
            [IP [NP 投资者 “investors”]
                [VP 用作投资指南 “use as investment-guide”]]]]]
“This index has sometimes been used by investors as an investment guide.”
Figure 5.5: An example of clause-modifying adverb (也有) appearing inside a verb chain.
Note that ldc alignment links link the lowest VP to English has, been, and used, so that
the projection of the lower IP contains the projection of the upper ADVP and LB spans.
that a better AER is possible.
As in chapter 3, we establish a reranking paradigm, where alignment candidates are
converted into feature vectors by a feature-extraction step, and (in training) each candidate's
optimization target (in this case, AER) is converted to a rank for training an svmrank
learner over the training candidates. The learner ranks the candidates in the pool, and we
report AER over the learner’s choice of top-ranked candidates. Because of the relatively
small amount of labeled data, we report results here over ten-fold cross-validation.
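The conversion from per-candidate AER to training ranks might be sketched as follows (the function name and rank convention are illustrative assumptions; an svmrank-style learner then consumes these ranks):

```python
def aer_to_ranks(candidates):
    """Convert per-candidate AER scores into rank labels for an
    svm-rank-style learner: lower AER means a better (higher) rank.

    candidates: list of (candidate_id, aer) pairs for one sentence's pool.
    Returns {candidate_id: rank}, where the lowest-AER candidate gets
    the largest rank value and ties share a rank.
    """
    distinct = sorted({a for _, a in candidates})
    # Best (lowest) AER maps to rank len(distinct), worst to rank 1.
    rank_of = {a: len(distinct) - i for i, a in enumerate(distinct)}
    return {cid: rank_of[a] for cid, a in candidates}

pool = [("berkeley", 0.32), ("giza.union", 0.35), ("giza.e-f", 0.35)]
print(aer_to_ranks(pool))  # {'berkeley': 2, 'giza.union': 1, 'giza.e-f': 1}
```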
This arrangement allows variation in two key experimental variables:
Candidate pool : We may include candidates from all of the aligners available, or only
certain subsets. We define two pools of interest:
• experts is the “committee of experts” made up of all of the direct outputs of
the automatic aligners (berkeley, giza.e-f, giza.f-e, giza.intersect, and
giza.union)
(TOP (IP (NP (NR 俄罗斯 "Russia"))
         (IP (NP 1997年估计 "1997-estimated")
             (VP (VV 增长 "growth")
                 (QP (CD 百分之零点五 "0.5%"))))
         (PU ，)
         (IP (NP 1998年预计 "1998-estimated")
             (VP (VV 增长 "growth")
                 (QP (CD 百分之一点五 "1.5%"))))
         (PU 。)))
"Russia's estimated growth is 0.5% for 1997, and 1.5% for 1998."
Figure 5.6: An example of English ellipsis where Chinese repeats a word (增长, "growth").
The English translation has only one "growth". To link both 增长 nodes to the English
"growth" requires at least one of the lower IP nodes to be incoherent.
• giza.union.NBEST is the pool generated by performing the union operation on
the members of the giza.e-f.NBEST and giza.f-e.NBEST lists.
Feature selection : We may use any of a variety of features to rerank members of the
candidate pool. We define the following features:
• voting features include a binary feature for each expert system (e.g., berkeley
or giza.union); if a candidate is generated by that system, this feature will be
true; otherwise false.
• span-X features represent four features: the proportion of spans of type X that
are coherent (span-X-yes), subcoherent (span-X-sub), null-coherent (span-X-null),
or incoherent (span-X-no). For example, we may use span-NP features, which
provide features describing the coherence (or non-coherence) of the noun-phrases
in the sentence.
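The span-X proportion features for one sentence might be computed as in the following sketch; the coherence classification itself is defined earlier in the chapter, so here it is passed in as a stub predicate (a stand-in, not the dissertation's implementation):

```python
from collections import Counter

def span_features(spans, classify):
    """Compute the four span-X proportion features for one sentence.

    spans: the source spans of type X (e.g. all NP spans).
    classify: function mapping a span to one of
      "yes" (coherent), "sub" (subcoherent),
      "null" (null-coherent), or "no" (incoherent);
      the coherence test itself is defined earlier in the chapter.
    Returns e.g. {"span-X-yes": 0.5, "span-X-sub": 0.0, ...}.
    """
    counts = Counter(classify(s) for s in spans)
    total = len(spans) or 1  # avoid dividing by zero on empty span sets
    return {f"span-X-{k}": counts.get(k, 0) / total
            for k in ("yes", "sub", "null", "no")}

# Toy example: three NP spans, two coherent and one incoherent.
labels = {(0, 2): "yes", (3, 5): "yes", (6, 9): "no"}
feats = span_features(list(labels), labels.get)
print(feats["span-X-yes"])  # two of the three spans are coherent
```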
5.5.1 Selecting candidates from expert committee
We conjecture that a reranking approach would help to select better alignments from a pool
of alignment candidates generated by a diverse set of aligners. In this experiment, the
alignment candidates berkeley, giza.e-f, giza.f-e, giza.intersect, and giza.union
are included in the pool to be reranked. For reranker features, we consider the span-comma,
span-tadpole, and span-IP features. Because of the IP coherence patterns gleaned from
figure 5.3(b), we further include span-nonCP-IP, which includes features of those IP spans
that are not the direct children of CP constituents. Finally, we include an experiment using
span-NT features, which are the set of features including span-X for all non-terminal
symbols used in the Chinese treebank (span-IP, span-DNP, span-QP, etc.). For all learners,
we include the voter feature to allow the reranker to include a learned estimate of the
quality of each committee member. As a baseline, we include a voter-only feature set,
which learns, as one might expect, to always select the committee member with the best
overall AER.
Table 5.8 shows the results of reranking the members of the committee of quality experts.
Table 5.8: Reranking the candidates produced by a committee of aligners.
Identity AER Precision Recall
isolated
berkeley 32.87 84.21 55.81
giza.e-f 36.46 70.14 58.08
giza.f-e 40.31 76.78 48.82
giza.union 35.42 63.34 65.87
giza.intersect 42.37 96.78 41.04
voter only (∼berkeley) 32.87 84.21 55.81
voter &
span-tadpole 32.95 83.49 56.02
span-comma 33.13 82.00 56.46
span-IP 33.10 82.69 56.18
span-nonCP-IP 33.02 82.81 56.23
span-NT 34.09 76.65 57.80
(per-sentence) oracle 30.12 80.01 62.02
Relative to the span-IP features, the span-nonCP-IP features have a larger improvement in
recall with a smaller loss to precision, which suggests that using spans which are generally
expected to be coherent may be helpful in this kind of reranking. However, we also observe
that none of the rerankers (in the lower half of the table) actually reduce AER from the
baseline (voter only) systems: instead, they boost recall, at varying costs to precision.
We also observe that the more features are involved, the larger the effect on recall, with
span-NT having the largest impact. For those cases with the same number of features (e.g.
span-IP and span-tadpole), the features with more spans in the data generally have larger
impact. In retrospect, this is unsurprising: the next best systems, beyond berkeley, are
giza.union and giza.e-f, which each have lower precision and higher recall: thus, when
the reranker chooses an alternative, it is usually choosing one of those two, improving recall
(and hurting precision). We even see this in the oracle: though its AER is superior to the
berkeley system, its precision is lower.
Table 5.9: Reranking the candidates produced by giza.union.NBEST.
Identity AER Precision Recall
nbranks only (∼giza.union) 35.42 63.34 65.87
nbranks &
span-tadpole 35.41 63.35 65.87
span-comma 35.41 63.36 65.88
span-IP 35.42 63.34 65.87
span-NT 35.38 63.39 65.89
giza.union.NBEST (per-sentence) oracle 32.44 66.58 68.56
5.5.2 Selecting candidates from N -best lists
To avoid the problem suggested by the previous experiments, in which second-best candidates
have very different precision and recall, we construct a separate experiment
with only the alignments generated by giza.union.NBEST, which (as described in
section 5.3) include two new rank features (dubbed nbranks): the rank of the giza.e-f
member of the union and the rank of the giza.f-e member of the union.
Table 5.9 shows none of the precision-recall imbalance present in the experiments in
table 5.8. However, the coherence features do not seem to make much difference, only
nudging both precision and recall (non-significantly) higher.
5.5.3 Reranker analysis
In sections 5.5.1 and 5.5.2, we explored using a reranker to select candidates from a pool of
generated candidates. System combination at the level of whole candidate selection, as in
these experiments, works best when the systems under combination have similar operating
performance (similar quality) while also being diverse (making different kinds of errors).
In this perspective, the analyses here suggest that the committee of experts (section 5.5.1)
performed well in generating diversity, but the single best member of the committee (the
berkeley system) so outperformed its fellows that alternates only rarely improved the over-
all AER. Conversely, the experiments in section 5.5.2 selected from the giza.union N -best
Table 5.10: AER, precision and recall for the bg-precise alignment
System AER Precision Recall
berkeley 32.87 84.21 55.81
giza.intersect 42.37 96.78 41.04
bg-precise (berkeley ∪ giza.intersect) 32.38 83.91 56.62
lists, where the criterion of similar quality was met, but the candidates were insufficiently
diverse. We conjecture that the lack of improvement in AER from reranking is due to these
problems. However, it may be that the coherence features are not sufficiently powerful to
distinguish the candidates without incorporating lexical or other cues, since the berkeley
aligner coherence percentages of the different phrases are not so different from the ldc
percentages.
5.6 Creating hybrid candidates by merging alignments
Whole-candidate selection from the previous section suggests that the available candidates
are insufficiently diverse (when chosen from the N -best lists) and too dissimilar in perfor-
mance (when chosen from the committee of expert systems). As an alternative strategy, we
may perform partial-candidate selection, by constructing hybrid candidates, guided by the
syntactic strategies suggested here.
The analysis in section 5.3.2 shows that the berkeley and giza.intersect systems
are very high precision, but both have relatively low recall. By contrast, giza.union has
the best recall, but its precision suffers. We cast the problem of sub-sentence alignment
combination as a problem of improving the recall of a high-precision alignment. As a first
baseline, we may combine (union) berkeley and giza.intersect, the two high-precision
alignments from table 5.4, into a new precision alignment bg-precise, shown in table 5.10.
The bg-precise alignment has a lower precision than either of its component high-precision
alignments, but yields the best AER thus far, because of improvements to recall. This simple
combination, in fact, yields an AER better (although not significantly better) than the per-
sentence oracle AER from the giza.union.NBEST selection, providing further evidence that
the N -best lists are insufficiently diverse for reranking as they are.
5.6.1 Using “trusted spans” to merge high-precision and high-recall alignments
Although bg-precise improves recall to some degree, we would like to improve recall fur-
ther. The giza.union alignments have substantially better recall than any of the precision
alignments, so we adopt the strategy of merging only certain alignment links from the
giza.union alignment into the bg-precise alignments.
We introduce the notion of a “trusted span” on the source text, and define the guided
union over a high-precision alignment, a high-recall alignment, and a set of trusted spans:
all links from the recall alignment that originate within one of the trusted spans, unioned
with all links from the precision alignment. This combination heuristic changes the problem
of combining alignments to the problem of identifying “trusted spans” from the source text.
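The guided union can be stated compactly; the following sketch assumes word-index links and half-open source-side spans (both representational assumptions):

```python
def guided_union(precision_links, recall_links, trusted_spans):
    """Guided union of a high-precision and a high-recall alignment.

    Links are (src_idx, tgt_idx) pairs; trusted_spans are half-open
    (start, end) spans over the source words. Keep every precision
    link, and add each recall link whose source word falls inside
    some trusted span.
    """
    merged = set(precision_links)
    for (src, tgt) in recall_links:
        if any(start <= src < end for (start, end) in trusted_spans):
            merged.add((src, tgt))
    return merged

P = {(0, 0), (2, 3)}
R = {(0, 0), (1, 1), (2, 3), (4, 5)}
print(sorted(guided_union(P, R, [(0, 2)])))  # [(0, 0), (1, 1), (2, 3)]
```

Only the recall link (1, 1) originates inside the trusted span, so it is copied over while (4, 5) is not.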
5.6.2 Defining the syntactic “trusted span”
We may thus use the syntactic coherence analysis from section 5.4 to describe “trusted
spans” to be used in the guided union operation, and evaluate the resulting guided union
alignment according to the same AER metrics we have used throughout this chapter.
We extract syntactic trusted spans in a bottom-up recursion from the syntactic tree,
defining trusted XP spans with the following heuristic: an XP span s is trusted for the
process of the guided union between a precision-oriented alignment P and a recall-oriented
alignment R when
• s is coherent in P ,
• s is coherent in R,
• all XP spans contained within s are also trusted.
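The bottom-up recursion implied by this heuristic can be sketched as follows, with the two coherence tests passed in as predicates; the tree encoding and names are illustrative assumptions:

```python
def trusted_spans(node, coherent_in_P, coherent_in_R, label="NP"):
    """Collect trusted XP spans bottom-up from a parse tree.

    node: a (label, span, children) triple, where span is a
    (start, end) pair over source words.
    coherent_in_P / coherent_in_R: predicates on spans, standing in
    for the coherence tests against the precision-oriented and
    recall-oriented alignments.
    Returns (spans, ok): the trusted XP spans found under node, and
    whether every XP span under node (including node itself, if it
    is an XP) is trusted.
    """
    lab, span, children = node
    spans, all_ok = [], True
    for child in children:
        child_spans, child_ok = trusted_spans(
            child, coherent_in_P, coherent_in_R, label)
        spans.extend(child_spans)
        all_ok &= child_ok
    if lab == label:
        ok = all_ok and coherent_in_P(span) and coherent_in_R(span)
        if ok:
            spans.append(span)
        return spans, ok
    return spans, all_ok

# Toy tree: an NP over two child NPs, one of which is incoherent,
# so only the coherent child (not the parent) is trusted.
tree = ("NP", (0, 4), [("NP", (0, 2), []), ("NP", (2, 4), [])])
coh = lambda s: s != (2, 4)
print(trusted_spans(tree, coh, coh))  # ([(0, 2)], False)
```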
These spans define a guided union P ∪_XP R between precision-oriented alignment P and
recall-oriented alignment R. Thus we may define, for example, the NP-trusted guided union
(a) Precision alignment; (b) resulting alignment, in which dashed links are new. Both panels align:
中国 高新 技术 开发区 酝酿 于 八十年代 初 。
(China high+new tech. develop+zone prepare p 80+decade early)
"The new high level technology development zones of China were brewed in the early 1980 's ."
with nested np spans, and a maximal np-max span, marked on the Chinese side. [Figure: the alignment link diagram is not recoverable from the extraction.]
Figure 5.7: Example of an np-guided union. The precision alignment (a) and the recall
alignment (b) both agree that each np span is coherent (and all np sub-spans are coherent).
np-max may thus be used as a trusted span, allowing us to copy the heavy dashed links
from the recall alignment into the union.
Table 5.11: AER, precision and recall over the entire test corpus, using various XP -
strategies to determine trusted spans
System AER Precision Recall
Recall system = R giza.union 35.42 63.34 65.87
Precision system1 = P1 giza.intersect 42.37 96.78 41.04
P1 ∪_XP R, where XP =
IP 41.23 92.71 43.02
VP 41.16 93.07 43.01
IP or VP 41.05 92.91 43.17
NP 39.56 93.82 44.57
NP or VP 39.23 93.00 45.13
Precision system2 = P2 bg-precise 32.38 83.91 56.62
P2 ∪_XP R, where XP =
IP 32.29 82.61 57.36
VP 32.17 82.77 57.47
IP or VP 32.15 82.73 57.50
NP 31.61 83.16 58.07
NP or VP 31.51 82.90 58.35
(P ∪_NP R): the maximal NPs that are coherent in both P and R such that all descendant
NPs are also coherent.
Figure 5.7 illustrates an NP-guided union P ∪_NP R, in which we can see, at least anecdotally,
that it is reasonable to expect this syntactic mechanism for selecting trustworthy
links to be helpful in extending a high-precision alignment: it improves recall without
hurting precision much.
Table 5.11 shows the results of using this guided union heuristic to generate new align-
ment candidates, using two different alignments (giza.intersect and bg-precise) for the
P role and giza.union as the R role. We see the same trends for each choice of P alignment:
using P ∪_NP R has the smallest reduction in precision. It also has the second-largest
improvement in recall, with the best performance going to the union-guide that uses both
VP and NP spans to form the trusted spans. By contrast, while using P ∪_VP R or P ∪_IP R for
guided unions both generate a reduction in AER, this reduction is small (and using both
VP and IP together seems to make little improvement, probably because VP and IP spans
are often nested).
However, guided union approaches are not sufficiently powerful to overcome the ex-
tremely low link-density of the giza.intersect alignment — precision and recall trade
off nearly four percentage points but the corresponding improvement (due to moving the F
measure towards balance) is not sufficient to bring AER below the giza.union AER. When
the precision-oriented alignment starts out more balanced (as in bg-precise), the guided union
can drive AER to new minima: using NP and VP spans to guide the trusted-span selection
produces the best overall AER of 31.51%.
5.7 Discussion
In this chapter, we have presented a new formalism for quantifying the degree to which a
bitext alignment retains the coherence of certain spans on the source language. We evaluate
the coherence behavior of some orthographically- and syntactically-derived classes of Chi-
nese spans on a manually-aligned corpus of Chinese-English parallel text, and we identify
certain classes (motivated by orthography and syntax) of Chinese span that have consistent
coherence behavior. We argue that this coherence behavior may be useful for training
improved statistical machine translation systems, specifically by improving the
statistical machine-translation alignments on which such systems are trained.
To improve alignments, this chapter explored the potential for alignment system combi-
nation, following at first the approach of choosing candidates from a committee of experts or
from the N -best lists generated by the GIZA++ toolkit (using a reranking toolkit). These
initial experiments found that the needs of system combination (systems of rough parity,
with usefully different kinds of errors) were not met, and we turned to sub-segment system
combination. In this approach, we define a syntax-guided alignment hybridization between
a high-precision and a high-recall alignment, and show that the resulting alignments, hy-
bridized with guidance from syntactic structure, have a better performance in AER than
the best alignments produced by the component expert systems.
These results, taken together, suggest that source-side grammatical structure and co-
herence can be a useful cue to quality alignment links in producing good alignments for the
training of statistical machine translation engines.
Chapter 6
CONCLUSION
The work in this dissertation is motivated by the hypothesis that linguistic dependency
and span parse structure is an informative and powerful representation of human language,
so much so that accounting for parse structure will be useful even to those applications where
only a word sequence is produced, e.g. a speech transcript or a text translation. In support
of this hypothesis, this work has presented three ways that parse structure (as provided
by statistical syntactic parsers) may be engaged with these large sequence-oriented natural
language processing systems. This chapter summarizes the work presented here (section 6.1)
and suggests directions for further study of these research areas individually (section 6.2)
and for parsing as a general-purpose parse decoration tool for these and related applications
(section 6.3).
6.1 Summary of key contributions
Chapter 3 demonstrated that it is possible to improve the performance of a speech recog-
nizer and a parser by rescoring the two systems jointly: the speech-recognizer’s output is
improved (in terms of WER) by exploiting information from parse structure, and the parse
structure resulting from parsing the speech recognizer’s output may be improved by con-
sidering alternative transcript hypotheses while evaluating resulting parses. This research
also found that the utility of the parse structure was strongly dependent on the quality of
the speech segmentation: parse structure was much more valuable in the context of high-
quality segment boundaries than in the context of using default speech recognizer segment
boundaries. In addition, we present a qualitative analysis of the use of parse structure
in selecting transcription hypotheses, finding improvement, for example, in the prediction
of pronouns and the main sentential verb, which would be critical for use in subsequent
linguistic applications.
Chapter 4 applies parse structure to a rather different domain: the evaluation of statistical
machine translation. Like speech recognition, SMT evaluation is dominated by
sequence-focused models, but this work introduces an application of syntax to SMT
evaluation: a parse-decomposition measure for comparing translation hypotheses to reference
translations. This measure correlates better with human judgement of translation quality
than BLEU4 and TER, two popular sequence-based metrics. We further explore combining
the new technique with other cutting-edge SMT evaluation techniques like synonym and
paraphrase tables and show that the combined techniques (syntax and synonymy) perform
better than either alone, although the gains are not strictly additive.
Chapters 3 and 4 both explore the utility of considering (probability-weighted) alterna-
tive parse hypotheses when using parse structure in their tasks. In the parsing-speech tasks
of chapter 3, using this extra information in joint reranking with recognizer N -best tran-
scription hypotheses shows trends in the direction of improvement (but not to significance,
possibly because the number of parameters exceeded the reranker’s ability to exploit them),
but chapter 4’s machine translation evaluation showed that including additional parse hy-
potheses clearly improved EDPM’s ability to predict the quality of machine translations.
While the parsing-speech work (chapter 3) uses both span information and dependency
information from the parses in comparing parse information for reranking, the SMT eval-
uation work in chapter 4 focuses on dependency (even to the point of putting it in the
name of the Expected Dependency Pair Match metric). By contrast, the work in chap-
ter 5 focuses on a use of constituent structure for an internal component of SMT: word-
alignment. Chapter 5’s research conducts an analysis which demonstrates the tendencies
of a particular class of spans (e.g., those motivated by syntactic constituency) to hold to-
gether in a given alignment. It explores the use of this constituent measure to select an
alignment hypothesis from a pool of alignment candidates. Although the reranking ap-
proach has limited success (because the available candidates are too dissimilar in overall
quality), the coherence measure illuminates some characteristics of quality alignments that
further work on word alignment might pursue. Chapter 5 also discovered a technique for
using these characteristically-coherent spans as a guidance framework for alignment com-
bination, through a guided union of a precision-oriented alignment and a recall-oriented
alignment. Syntactic coherence (using this guided-union approach) was demonstrated to be
useful in improving the AER of the alignments by this technique. The effects of syntactic
constituent coherence are probably even stronger than indicated by these results, since a
qualitative analysis in that chapter identified that a sizable minority of incoherent IP spans
were incoherent due to parse-decoration error.
6.2 Future directions for these applications
Chapters 3, 4, and 5 offer three different approaches to using parse structure to improve
natural language processing. The results presented here suggest future work in applying
parsers to each of these areas of research.
6.2.1 Adaptations for speech processing with parsers
In the domain of parsing speech (chapter 3), it would be valuable to explore the impact
of additional parse-reranking features, especially those more directly focused on speech.
The features extracted in this work were a re-purposing of the feature extraction used for
reranking parses over text; it might be valuable to include features that are more directly
targeting the challenges of speech transcription. For example, explicit edit or disfluency
modeling, as in Johnson and Charniak [2004], or prosodic features, as in Kahn et al. [2005],
might be useful in further improving the reranking available here. Alternatively, including
parse structure from parsers using other structural paradigms (e.g. the English Resource
Grammar [Flickinger, 2002]) would provide another valuable knowledge source (further
discussed in section 6.3). Along similar lines, expanding the joint modeling of speech
transcription and parsing to include sentence segmentation (as in Harper et al. [2005]) might
be valuable, especially because the evidence presented here points so strongly towards the
need for improved segmentation.
6.2.2 Extension of EDPM similarities to other tasks
In extending EDPM, it would be interesting to consider whether these techniques could be
shared with other tasks that require a sentence similarity measure. EDPM substantially
outperformed BLEU4, an n-gram precision measure, on correlation with human evaluation
of machine-translation quality. In the summarization domain, ROUGE [Lin, 2004] uses n-
grams to serve as an automatic evaluation of summary quality; EDPM’s generalization of
this approach to use expected parse-structure is worth exploring in summarization as well.
Even within machine translation, EDPM’s notion of sentence similarity may be useful
in other ways, for example, in computing distances between training and test data in graph-
based learning approaches for MT (e.g. Alexandrescu and Kirchhoff [2009]).
6.2.3 Extending syntactic coherence
The coherence measures reported in chapter 5 suggest that one may be able to parse source
text alone to identify regions that are translated in relative isolation from one another.
However, coherence of those spans by itself does not indicate that the alignment quality
is good: a key factor is in the relative density (the proportion of links to words), since
high-density alignments seem to under-predict coherence and low-density alignments to
over-predict it. We suggest exploring a revised reranking, including link density as a feature
alongside (possibly weighting) coherence.
Furthermore, the guided-union work showed a welcome success in improving the recall
without greatly damaging precision by using source language (Chinese) parsing. As an
extension, parse structure from the target language (English) could also be used to iden-
tify regions where alignment unions are worth including. Parsing the target side would
require a different parser, trained on the target language, which would identify target spans
(rather than source-side spans) to trust in guiding alignment union. Since the analysis in
section 5.4.4 indicated that some of the incoherent regions could be explained by English
constructions, this approach might be particularly fruitful.
Further work to integrate the notion of span coherence into machine translation align-
ments would be valuable: identifying that a span is likely to be coherent in translation
should offer a criterion for augmenting the search space pruning strategy for good transla-
tions. However, it would be wise to do further analysis of what regions are coherent before
undertaking the substantial effort of incorporating coherence into a translation or alignment
decoder. Such an analysis might incorporate a lexicalized study of coherence extending the
syntactic-span study done in section 5.4.2.
Beyond improving AER, both the reranking-alignments and the guided-union techniques
may show further improvement in alignment quality (or demonstrate a need for adjust-
ment) when dealing with alternative measures of alignment quality. One computationally-
expensive technique would be to use direct translation evaluation: to evaluate the alignment-
selection by training the entire translation model from the generated alignment and evaluate
with a translation quality measure on a held-out set of texts.
6.3 Future challenges for parsing as a decoration on the word sequence
Each of the applications described here was developed with a PCFG-based parser which
produces a simple span-labeling output. The parsers used here were all trained on in-domain
data, with state-of-the-art PCFG engines. As a direction of future work, it is worthwhile
to explore which of these constraints is necessary and which may be improved by trying
alternatives.
6.3.1 Sensitivity to parser
On any of the three applications presented here, one could explore varying the parser.
Alternative PCFG-based systems may present a different variation (their N -best lists, for
example, may be richer than the high-precision systems used here). However, one could go
farther and explore non-PCFG parsers. Any parser that can generate a dependency tree
could be used for EDPM, and any parser that can generate spans with labels could be used
in the coherence experiments. The reranking experiments over speech require that the parse
trees generated be compatible with the feature-extraction, but if one is willing to adjust the
feature extraction as well, any parser could be used there too. One direction of approach
might be to generate dependency structures directly for EDPM, e.g. by using dependency
parser strategies like the ones described in Kubler et al. [2009].
Alternatively, the English Resource Grammar [Flickinger, 2002] produces a rich parse
structure that may be projected to span or dependency structure; recent work (e.g. Miyao
and Tsujii [2008]) has suggested that it may even generate probabilistically weighted tree
structures. Integrating this knowledge-driven parser into these experimental frameworks
(as a replacement or supplemental parse-structure knowledge source) would be a valuable
exploration of the relative merits of these parsers.
6.3.2 Sensitivity of these measures to parser domain
We expect that the training domain is relevant to a parser’s utility for these applications
in ASR and SMT. In the limit, if the parser is trained on the wrong language, most of the
information it offers to these measurement and reranking techniques will be lost. However,
it is not clear how closely dialect, genre, and register must be matched: is it workable to
use a newswire-trained parser in EDPM when comparing translations of a more informal
genre (e.g. weblogs or conversational speech)? For some applications, genre may not have
an impact on the useful performance of the parser, and for others it may have a substantial
one: it would be a useful contribution to explore whether the benefits are retained when
parser domains mismatch.
6.3.3 Sensitivity of these measures to parser speed and quality
Parse structure decoration is shown here to be a valuable supplement to large word-sequence-
based NLP projects: this work offers a variety of opportunities for further work exploring
new ways in which parse structure decoration may benefit large NLP projects.
In evaluation, an obstacle to the wider adoption of dependency-based scoring functions
such as EDPM (for MT) and SParseval (for ASR) is a concern for scoring speed. Systems
that use error-driven training or tuning require fast scoring in order to do multiple runs of
parameter estimation for different system configurations. Using a parser is much slower than
scoring based on word and word n-gram matches alone. This objection invites exploration
regarding the robustness of dependency-based scoring algorithms when a faster (though
presumably lower-quality) parser is used rather than the Charniak and Johnson [2005]
system; perhaps (rather than the PCFG-inspired system used here) a direct-to-dependency
parser (e.g. the Kubler et al. [2009] parser) would capture enough similar information at
high enough quality to offer the same performance in SMT evaluation.
Speed, of course, would be of benefit to any application of parsing: the use of syntac-
tic coherence as a feature of word-alignment in machine translation would also be more
appealing if the benefits were present with a much faster parser.
Parser error, of course, can be a serious problem, as the qualitative study of Chinese
coherence analyses indicated. A different way of approaching sensitivity to parser quality
would be to create an array of parsers of known variation in quality (perhaps by using
reduced training sets) and exploring the relative merit of each in the tasks presented here.
In general, the experiments presented in this work suggest that parsers provide a useful
knowledge source for natural language processing tasks in several areas. Improving the
parser would, one expects, make that knowledge source more valuable, although it may
be that the environments (e.g., the candidates to be reranked) are not sufficiently diverse
for that additional knowledge to be valuable. In either case, this work stands as a call to
continue exploration for both parsers and natural language processing tasks in which to
apply those parsers.
BIBLIOGRAPHY
Y. Al-Onaizan and L. Mangu. Arabic ASR and MT integration for GALE. In Proc. ICASSP,
volume 4, pages 1285–1288, Apr. 2007.
A. Alexandrescu and K. Kirchhoff. Graph-based learning for statistical machine translation.
In Proc. HLT/NAACL, pages 119–127, 2009.
E. Arisoy, M. Saraclar, B. Roark, and I. Shafran. Syntactic and sub-lexical features for
Turkish discriminative language models. In Proc. ICASSP, pages 5538–5541, Mar. 2010.
N. F. Ayan and B. J. Dorr. Going beyond AER: An extensive analysis of word alignments
and their impact on MT. In Proc. ACL, pages 9–16, July 2006.
N. F. Ayan, B. J. Dorr, and C. Monz. NeurAlign: Combining word alignments using neural
networks. In Proc. HLT/EMNLP, pages 65–72, Oct. 2005a.
N. F. Ayan, B. J. Dorr, and C. Monz. Alignment link projection using transformation-based
learning. In Proc. HLT/EMNLP, pages 185–192, Oct. 2005b.
S. Banerjee and A. Lavie. METEOR: An automatic metric for MT evaluation with improved
correlation with human judgments. In Proc. ACL Workshop on Intrinsic and Extrinsic
Evaluation Measures for MT and/or Summarization, pages 65–72, 2005.
E. Black, S. Abney, D. Flickinger, C. Gdaniec, R. Grishman, P. Harrison, D. Hindle, R. In-
gria, F. Jelinek, J. Klavans, M. Liberman, M. Marcus, S. Roukos, B. Santorini, and
T. Strzalkowski. A procedure for quantitatively comparing syntactic coverage of English
grammars. In Proc. 4th DARPA Speech & Natural Lang. Workshop, pages 306–311, 1991.
J. Bresnan. Lexical-functional syntax. Number 16 in Blackwell textbooks in linguistics.
Blackwell, Malden, Mass., 2001.
P. F. Brown, J. Cocke, S. Della Pietra, V. J. Della Pietra, F. Jelinek, J. D. Lafferty, R. L.
Mercer, and P. S. Roossin. A statistical approach to machine translation. Computational
Linguistics, 16(2):79–85, 1990.
D. Burkett, J. Blitzer, and D. Klein. Joint parsing and alignment with weakly synchronized
grammars. In Proc. HLT, pages 127–135, June 2010.
A. Cahill, M. Burke, R. O’Donovan, J. van Genabith, and A. Way. Long-distance depen-
dency resolution in automatically acquired wide-coverage PCFG-based LFG approxima-
tions. In Proc. ACL, pages 319–326, 2004.
C. Callison-Burch. Re-evaluating the role of BLEU in machine translation research. In
Proc. EACL, pages 249–256, 2006.
P.-C. Chang, M. Galley, and C. D. Manning. Optimizing Chinese word segmentation for
machine translation performance. In Proceedings of the Third Workshop on Statistical
Machine Translation, pages 224–232, June 2008.
E. Charniak. A maximum-entropy-inspired parser. In Proc. NAACL, pages 132–139, 2000.
E. Charniak. Immediate-head parsing for language models. In Proc. ACL, pages 116–123,
2001.
E. Charniak and M. Johnson. Edit detection and parsing for transcribed speech. In Proc.
NAACL, pages 118–126, 2001.
E. Charniak and M. Johnson. Coarse-to-fine n-best parsing and MaxEnt discriminative
reranking. In Proc. ACL, pages 173–180, June 2005. A revised version was downloaded
November 2009 from ftp://ftp.cs.brown.edu/pub/nlparser/.
E. Charniak, K. Knight, and K. Yamada. Syntax-based language models for statistical
machine translation. In MT Summit IX. Intl. Assoc. for Machine Translation., 2003.
C. Chelba and F. Jelinek. Structured language modeling. Computer Speech and Language,
14(4):283–332, October 2000.
C. Cherry. Cohesive phrase-based decoding for statistical machine translation. In Proc.
ACL, pages 72–80, June 2008.
D. Chiang. A hierarchical phrase-based model for statistical machine translation. In Proc.
ACL, pages 263–270, June 2005.
M. Collins. Discriminative reranking for natural language parsing. In Proc. ICML, pages
175–182, 2000.
M. Collins. Head-driven statistical models for natural language parsing. Computational
Linguistics, 29(4):589–638, 2003.
M. Collins and T. Koo. Discriminative reranking for natural language parsing. Computa-
tional Linguistics, 31(1):25–69, 2005.
M. Collins, P. Koehn, and I. Kučerová. Clause restructuring for statistical machine translation. In Proc. ACL, pages 531–540, June 2005a.
M. Collins, B. Roark, and M. Saraclar. Discriminative syntactic language modeling for
speech recognition. In Proc. ACL, pages 507–514, June 2005b.
M. R. Costa-jussà and J. A. R. Fonollosa. Statistical machine reordering. In Proc. EMNLP,
pages 70–76, July 2006.
C. Culy and S. Z. Riehemann. The limits of n-gram translation evaluation metrics. In
Proceedings of MT Summit IX, 2003.
DARPA. Global Autonomous Language Exploitation (GALE). Mission, http://www.
darpa.mil/ipto/programs/gale/gale.asp, 2008.
J. DeNero and D. Klein. Tailoring word alignments to syntactic machine translation. In
Proc. ACL, pages 17–24, June 2007.
D. Filimonov and M. Harper. A joint language model with fine-grain syntactic tags. In
Proc. EMNLP, pages 1114–1123, Aug. 2009.
D. Flickinger. On building a more efficient grammar by exploiting types. In S. Oepen,
D. Flickinger, J. Tsujii, and H. Uszkoreit, editors, Collaborative Language Engineering,
chapter 1. CSLI Publications, 2002.
V. Fossum, K. Knight, and S. Abney. Using syntax to improve word alignment precision for
syntax-based machine translation. In Proceedings of the Third Workshop on Statistical
Machine Translation, pages 44–52, June 2008.
A. Fraser and D. Marcu. Semi-supervised training for statistical word alignment. In Proc.
ACL, pages 769–776, July 2006.
A. Fraser and D. Marcu. Measuring word alignment quality for statistical machine transla-
tion. Computational Linguistics, 33(3):293–303, Sept. 2007.
M. Galley, M. Hopkins, K. Knight, and D. Marcu. What’s in a translation rule? In S. Dumais, D. Marcu, and S. Roukos, editors, Proc. HLT/NAACL, pages 273–280, May 2004.
M. Galley, J. Graehl, K. Knight, D. Marcu, S. DeNeefe, W. Wang, and I. Thayer. Scalable in-
ference and training of context-rich syntactic translation models. In Proc. COLING/ACL,
pages 961–968, July 2006.
D. Gildea. Loosely tree-based alignment for machine translation. In Proc. ACL, pages
80–87, July 2003.
J. J. Godfrey, E. C. Holliman, and J. McDaniel. SWITCHBOARD: Telephone speech corpus
for research and development. In Proc. ICASSP, volume I, pages 517–520, 1992.
J. T. Goodman. A bit of progress in language modeling. Computer Speech and Language, 15(4):403–434, Oct. 2001.
A. Haghighi, J. Blitzer, J. DeNero, and D. Klein. Better word alignments with supervised
ITG models. In Proc. ACL, pages 923–931, Aug. 2009.
M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. H. Witten. The WEKA
data mining software: an update. SIGKDD Explorations Newsletter., 11:10–18, Nov.
2009.
M. Harper and Z. Huang. Chinese Statistical Parsing, chapter in press. DARPA, 2009.
M. Harper, B. Dorr, J. Hale, B. Roark, I. Shafran, M. Lease, Y. Liu, M. Snover, L. Yung,
A. Krasnyanskaya, and R. Stewart. Parsing and spoken structural event detection. Tech-
nical report, Johns Hopkins Summer Workshop Final Report, 2005.
D. Hillard. Automatic Sentence Structure Annotation for Spoken Language Processing. PhD
thesis, University of Washington, 2008.
D. Hillard, M.-Y. Hwang, M. Harper, and M. Ostendorf. Parsing-based objective functions
for speech recognition in translation applications. In Proc. ICASSP, 2008.
L. Huang. Forest reranking: Discriminative parsing with non-local features. In Proc. HLT,
pages 586–594, June 2008.
Z. Huang and M. Harper. Self-training PCFG grammars with latent annotations across
languages. In Proc. EMNLP, pages 832–841, Aug. 2009.
ISIP. Mississippi State transcriptions of SWITCHBOARD, 1997. URL http://www.isip.
msstate.edu/projects/switchboard/.
R. Iyer, M. Ostendorf, and J. R. Rohlicek. Language modeling with sentence-level mixtures.
In Proc. HLT, pages 82–87, 1994.
T. Joachims. Training linear SVMs in linear time. In Proceedings of the ACM Conference
on Knowledge Discovery and Data Mining (KDD), 2006.
M. Johnson and E. Charniak. A tag-based noisy-channel model of speech repairs. In Proc.
ACL, pages 33–39, 2004.
J. G. Kahn. Moving beyond the lexical layer in parsing conversational speech. Master’s
thesis, University of Washington, 2005.
J. G. Kahn, M. Ostendorf, and C. Chelba. Parsing conversational speech using enhanced
segmentation. In Proc. HLT/NAACL, pages 125–128, 2004.
J. G. Kahn, M. Lease, E. Charniak, M. Johnson, and M. Ostendorf. Effective use of prosody
in parsing conversational speech. In Proc. HLT/EMNLP, pages 233–240, 2005.
J. G. Kahn, B. Roark, and M. Ostendorf. Automatic syntactic MT evaluation with expected
dependency pair match. In MetricsMATR: NIST Metrics for Machine Translation Chal-
lenge. NIST, 2008.
J. G. Kahn, M. Snover, and M. Ostendorf. Expected dependency pair match: predicting
translation quality with expected syntactic structure. Machine Translation, 23(2–3):169–
179, 2009.
A. Kannan, M. Ostendorf, and J. R. Rohlicek. Weight estimation for n-best rescoring. In
Proc. of the DARPA workshop on speech and natural language, pages 455–456, Feb. 1992.
M. King. Evaluating natural language processing systems. Communications of the ACM,
39(1):73–79, 1996.
P. Koehn, F. J. Och, and D. Marcu. Statistical phrase-based translation. In Proc.
HLT/NAACL, pages 48–54, May–June 2003.
P. Koehn, H. Hoang, A. Birch, C. Callison-Burch, M. Federico, N. Bertoldi, B. Cowan,
W. Shen, C. Moran, R. Zens, C. Dyer, O. Bojar, A. Constantin, and E. Herbst. Moses:
Open source toolkit for statistical machine translation. In Proc. ACL, pages 177–180,
June 2007.
S. Kübler, R. McDonald, and J. Nivre. Dependency parsing. Synthesis Lectures on Human
Language Technologies, 2(1):1–127, 2009.
S. Lacoste-Julien, B. Taskar, D. Klein, and M. I. Jordan. Word alignment via quadratic
assignment. In Proc. HLT/NAACL, pages 112–119, June 2006.
L. Lamel, W. Minker, and P. Paroubek. Towards best practice in the development and
evaluation of speech recognition components of a spoken language dialog system. Natural
Language Engineering, 6(3&4):305–322, 2000.
LDC. Multiple translation Chinese corpus, part 2, 2003. Catalog number LDC2003T17.
LDC. Linguistic data annotation specification: Assessment of fluency and adequacy in
translations. http://projects.ldc.upenn.edu/TIDES/Translation/TransAssess04.
pdf, Jan. 2005.
LDC. Multiple translation Chinese corpus, part 4, 2006. Catalog number LDC2006T04.
LDC. GALE phase 2 + retest evaluation references, 2008. Catalog number LDC2008E11.
Z. Li, C. Callison-Burch, C. Dyer, S. Khudanpur, L. Schwartz, W. Thornton, J. Weese, and
O. Zaidan. Joshua: An open source toolkit for parsing-based machine translation. In
Proceedings of the Fourth Workshop on Statistical Machine Translation, pages 135–139,
Mar. 2009.
C.-Y. Lin. ROUGE: A package for automatic evaluation of summaries. In Text Summa-
rization Branches Out: Proc. ACL-04 Workshop, pages 74–81, July 2004.
D. Lin and C. Cherry. Word alignment with cohesion constraint. In Proc. NAACL, pages
49–51, 2003.
D. Liu and D. Gildea. Syntactic features for evaluation of machine translation. In Proc.
ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for MT and/or Summa-
rization, pages 25–32, June 2005.
Y. Liu, Q. Liu, and S. Lin. Tree-to-string alignment template for statistical machine trans-
lation. In Proc. COLING/ACL, pages 609–616, July 2006a.
Y. Liu, E. Shriberg, A. Stolcke, D. Hillard, M. Ostendorf, and M. Harper. Enriching speech recognition with sentence boundaries and disfluencies. IEEE Transactions on Audio, Speech, and Language Processing, 14(5):1526–1540, 2006b.
J. T. Lønning, S. Oepen, D. Beermann, L. Hellan, J. Carroll, H. Dyvik, D. Flickinger, J. B. Johannessen, P. Meurer, T. Nordgård, V. Rosén, and E. Velldal. LOGON. A Norwegian MT effort. In Proc. Recent Advances in Scandinavian Machine Translation, 2004.
D. M. Magerman. Statistical decision-tree models for parsing. In Proc. ACL, pages 276–283,
1995.
L. Mangu, E. Brill, and A. Stolcke. Finding consensus in speech recognition: word error minimization and other applications of confusion networks. Computer Speech and Language, 14(4):373–400, 2000.
D. Marcu, W. Wang, A. Echihabi, and K. Knight. SPMT: Statistical machine translation
with syntactified target language phrases. In Proc. EMNLP, pages 44–52, July 2006.
M. P. Marcus, B. Santorini, and M. A. Marcinkiewicz. Building a large annotated corpus of English: the Penn Treebank. Computational Linguistics, 19(2):313–330, 1993.
M. Meteer, A. Taylor, R. MacIntyre, and R. Iyer. Dysfluency annotation stylebook for the
switchboard corpus. Technical report, Linguistic Data Consortium (LDC), 1995.
Y. Miyao and J.-i. Tsujii. Feature forest models for probabilistic HPSG parsing. Computa-
tional Linguistics, 34(1):35–80, 2008.
R. C. Moore, W.-t. Yih, and A. Bode. Improved discriminative bilingual word alignment.
In Proc. ACL, pages 513–520, July 2006.
W. Naptali, M. Tsuchiya, and S. Nakagawa. Topic-dependent language model with voting
on noun history. ACM Transactions on Asian Language Information Processing (TALIP),
9(2):1–31, 2010.
NIST. NIST speech recognition scoring toolkit (SCTK). Technical report, NIST, 2005.
URL http://www.nist.gov/speech/tools/.
F. J. Och. Minimum error rate training in statistical machine translation. In Proc. ACL,
pages 160–167, July 2003.
F. J. Och and H. Ney. A systematic comparison of various statistical alignment models.
Computational Linguistics, 29(1):19–51, 2003.
K. Owczarzak, J. van Genabith, and A. Way. Evaluating machine translation with LFG
dependencies. Machine Translation, 21(2):95–119, June 2007a.
K. Owczarzak, J. van Genabith, and A. Way. Labelled dependencies in machine translation
evaluation. In Proceedings of the Second Workshop on Statistical Machine Translation,
pages 104–111, June 2007b.
S. Pado, D. Cer, M. Galley, D. Jurafsky, and C. Manning. Measuring machine transla-
tion quality as semantic equivalence: A metric based on entailment features. Machine
Translation, 23:181–193, 2009.
K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu. BLEU: a method for automatic evaluation
of machine translation. In Proc. ACL, pages 311–318, 2002.
S. Petrov and D. Klein. Improved inference for unlexicalized parsing. In Proc. HLT, pages
404–411, Apr. 2007.
C. J. Pollard and I. A. Sag. Head-driven phrase structure grammar. Studies in contemporary
linguistics. Stanford: CSLI, 1994.
M. Popović and H. Ney. POS-based word reorderings for statistical machine translation. In
Proc. LREC, pages 1278–1283, May 2006.
C. Quirk, A. Menezes, and C. Cherry. Dependency treelet translation: syntactically in-
formed phrasal SMT. In Proc. ACL, pages 271–279, 2005.
B. Roark. Probabilistic top-down parsing and language modeling. Computational Linguis-
tics, 27(2):249–276, June 2001.
B. Roark, M. Harper, E. Charniak, B. Dorr, M. Johnson, J. G. Kahn, Y. Liu, M. Ostendorf,
J. Hale, A. Krasnyanskaya, M. Lease, I. Shafran, M. Snover, R. Stewart, and L. Yung.
SParseval: Evaluation metrics for parsing speech. In Proc. LREC, 2006.
B. Roark, M. Saraclar, and M. Collins. Discriminative n-gram language modeling. Computer
Speech and Language, 21(2):373–392, Apr. 2007.
L. Shen, A. Sarkar, and F. J. Och. Discriminative reranking for machine translation. In
Proc. HLT/NAACL, pages 177–184, May 2004.
N. Singh-Miller and M. Collins. Trigger-based language modeling using a loss-sensitive
perceptron algorithm. In Proc. ICASSP, volume 4, pages 25–28, 2007.
M. Snover, B. Dorr, R. Schwartz, L. Micciulla, and J. Makhoul. A study of translation edit
rate with targeted human annotation. In Proc. AMTA, 2006.
M. Snover, N. Madnani, B. Dorr, and R. Schwartz. Fluency, adequacy, or HTER? Exploring
different human judgments with a tunable MT metric. In Proceedings of the Workshop
on Statistical Machine Translation at EACL, Mar. 2009.
A. Stolcke. Modeling linguistic segment and turn-boundaries for n-best rescoring of spon-
taneous speech. In Proc. Eurospeech, volume 5, pages 2779–2782, 1997.
A. Stolcke. SRILM – an extensible language modeling toolkit. In Proc. ICSLP, pages
901–904, 2002.
A. Stolcke and E. Shriberg. Automatic linguistic segmentation of conversational speech. In
Proc. ICSLP, pages 1005–1008, 1996.
A. Stolcke, B. Chen, H. Franco, V. R. R. Gadde, M. Graciarena, M.-Y. Hwang, K. Kirch-
hoff, A. Mandal, N. Morgan, X. Lei, T. Ng, M. Ostendorf, K. Sonmez, A. Venkataraman,
D. Vergyri, W. Wang, J. Zheng, and Q. Zhu. Recent innovations in speech-to-text transcription at SRI-ICSI-UW. IEEE Transactions on Audio, Speech, and Language Processing, 14(5):1729–1744, Sept. 2006.
S. Strassel. Simple Metadata Annotation Specification V5.0. Linguistic Data Consortium,
2003. URL http://www.nist.gov/speech/tests/rt/rt2003/fall/docs/SimpleMDE_
V5.0.pdf.
S. Vogel, H. Ney, and C. Tillmann. HMM-based word alignment in statistical translation.
In Proc. COLING, pages 836–841, Copenhagen, Denmark, 1996.
M. A. Walker, D. J. Litman, C. A. Kamm, and A. Abella. Evaluating interactive dialogue
systems: extending component evaluation to integrated system evaluation. In Interactive
Spoken Dialog Systems on Bringing Speech and NLP Together in Real Applications, pages
1–8, 1997.
W. Wang and M. P. Harper. The SuperARV language model: Investigating the effectiveness
of tightly integrating multiple knowledge sources. In Proc. EMNLP, pages 238–247, July
2002.
W. Wang, A. Stolcke, and M. P. Harper. The use of a linguistically motivated language
model in conversational speech recognition. In Proc. ICASSP, volume 1, pages 261–264,
2004.
B. Wong and C. Kit. ATEC: automatic evaluation of machine translation via word choice
and word order. Machine Translation, 23:141–155, 2009.
F. Xia and M. McCord. Improving a statistical MT system with automatically learned
rewrite patterns. In Proc. COLING, pages 508–514, 2004.
D. Xiong, Q. Liu, and S. Lin. A dependency treelet string correspondence model for statis-
tical machine translation. In Proceedings of the Second Workshop on Statistical Machine
Translation, pages 40–47, June 2007.
N. Xue, F.-D. Chiou, and M. Palmer. Building a large-scale annotated Chinese corpus. In
Proc. COLING, 2002.
K. Yamada and K. Knight. A syntax-based statistical translation model. In Proc. ACL,
pages 523–530, July 2001.
A. Yeh. More accurate tests for the statistical significance of result differences. In Proc.
COLING, volume 2, pages 947–953, 2000.
Y. Zhang, R. Zens, and H. Ney. Chunk-level reordering of source language sentences
with automatically learned rules for statistical machine translation. In Proc. NAACL-
HLT/AMTA Workshop on Syntax and Structure in Statistical Translation, pages 1–8,
April 2007.
A. Zollmann, A. Venugopal, M. Paulik, and S. Vogel. The syntax augmented MT (SAMT)
system at the shared task for the 2007 ACL workshop on statistical machine translation.
In Proceedings of the Second Workshop on Statistical Machine Translation, pages 216–219,
June 2007.
VITA
Jeremy Gillmor Kahn was born in Atlanta, Georgia and has proceeded widdershins
around the continental United States: Providence, Rhode Island, where he received his AB
in Linguistics from Brown University; Ithaca, New York, where he discovered a career in
speech synthesis; Redmond and Seattle, Washington, where that career extended to include
speech recognition. He entered the University of Washington in Linguistics in 2003, receiving
an MA and (now) a Ph.D.
His counter-clockwise trajectory continues; Jeremy is employed by Wordnik, a Bay Area
computational lexicography company. He has a job where they pay him to think about
words and numbers and how they fit together.
Jeremy lives in San Francisco, California with his wife Dorothy, a dramatherapist. The
two of them spend a lot of time talking about what it means to say what you mean and
what it says to mean what you say.