7/28/2019 0685081 2A739 Liu y Structural Event Detection for Rich Transcription of s
STRUCTURAL EVENT DETECTION FOR RICH TRANSCRIPTION OF
SPEECH
A Thesis
Submitted to the Faculty
of
Purdue University
by
Yang Liu
In Partial Fulfillment of the
Requirements for the Degree
of
Doctor of Philosophy
December 2004
To my parents and my husband.
ACKNOWLEDGMENTS
I started this research on structural event detection at Purdue University and
continued it at ICSI, where I have been for the past two and a half years. ICSI
has provided a wonderful environment for me to enrich my research on speech and
language processing.
I gratefully acknowledge my major advisor Mary Harper for both academic and
moral support over the past few years. Even while I have been away from campus,
she has been in constant touch via email and phone supporting my research. I have
benefited from her insightful guidance and discussion, as well as her encouragement.
She has given me the intellectual freedom to do research in spoken language pro-
cessing and has provided lots of advice. She has taught me how to be a researcher
through all these years, while ploughing through the many paper drafts she has
revised.
I would like to thank Elizabeth Shriberg and Andreas Stolcke for giving me the
opportunity to continue my research at ICSI. I thank them for their valuable sugges-
tions and comments when I encountered difficulties in my research. I have learned
from them how to look at a problem both from a scientific and engineering point of
view. Special thanks to Elizabeth Shriberg for teaching me about linguistics as
well as for providing academic advice over the past two years.
I thank my other Ph.D. committee members: Leah Jamieson and Jack Gandour
at Purdue University. They have been very generous with their time and supportive
of my research topic. I have benefited from discussions with Leah Jamieson about
speech processing in my first two years of study at Purdue University.
Many people at ICSI also deserve acknowledgment. On the academic front, Bar-
bara Peskin shared her vision of the entire structural event detection project, and
at the same time was always willing to spend her time working out details. Nelson
Morgan as the director of ICSI, has created an excellent environment that nurtures
research and learning. Chuck Wooters and James Fung deserve a special thanks for
generating speaker diarization results. Jeremy Ang, Kofi Boakye, Barry Chen, Dave
Gelbart, Dan Gillick, Andy Hatch, Yan Huang, Adam Janin, Nikki Mirghafori, and
Qifeng Zhu are helpful office mates and neighbors at ICSI. They have made my time
at ICSI more enjoyable.
There are so many other people who have contributed to my research. Luciana
Ferrer at SRI has helped much with prosodic feature extraction. I thank Mari
Ostendorf and Dustin Hillard at the University of Washington for their collaboration
on the structural event detection work. Wen Wang, who finished her Ph.D. at
Purdue University and is at SRI now, has been so patient with all my questions
regarding language models. I am glad that I had the chance to work together with
Lei Chen at Purdue University using a multimodal corpus for sentence boundary
detection. Nitesh Chawla at CIBC has been a wonderful source for answers to
my machine learning questions. Thanks also to Andrew McCallum at the University
of Massachusetts and Fernando Pereira at the University of Pennsylvania for their
support and advice on the CRF model. I also thank Julia Hirschberg and Yoav Freund
at Columbia University for their assistance with the boosting algorithm.
Most of all, I thank my family for their support of my education. I would not be
able to reach the end of this journey without consistent support and encouragement
from my husband. His belief in me has made this thesis possible. The love from my
parents and sister has also supported me during the difficult times in my graduate
life.
TABLE OF CONTENTS
Page
LIST OF TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . x
LIST OF FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xvi
ABSTRACT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xviii
1 INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Scope of the Thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.2.1 Structural Event Detection Tasks . . . . . . . . . . . . . . . . 5
1.2.2 Our Approach to the Problem . . . . . . . . . . . . . . . . . . 6
2 RELATED WORK . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.1 Sentence Boundary Detection . . . . . . . . . . . . . . . . . . . . . . 10
2.1.1 Text-based Processing for Sentence Boundary Detection . . . 11
2.1.2 Combining Textual and Prosodic Information for Sentence Boundary Detection . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.1.3 Summary of Past Research on Sentence Boundary Detection . 17
2.2 Edit Disfluency Processing . . . . . . . . . . . . . . . . . . . . . . . . 18
2.2.1 Production and Properties of Disfluencies . . . . . . . . . . . . 18
2.2.2 Past Research on Automatic Disfluency Detection . . . . . . . 24
2.2.3 Summary of Past Research on Disfluencies . . . . . . . . . . . 33
2.3 Filler Word Processing . . . . . . . . . . . . . . . . . . . . . . . . . . 34
2.3.1 Production and Perception of Fillers . . . . . . . . . . . . . . 34
2.3.2 Past Research on Filler Word Processing . . . . . . . . . . . . 37
2.4 Chapter Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3 DATA RESOURCES AND TASKS . . . . . . . . . . . . . . . . . . . . . . 40
3.1 Structural Speech Events Types . . . . . . . . . . . . . . . . . . . . . 40
3.1.1 Sentence-like Units (SUs) . . . . . . . . . . . . . . . . . . . . 41
3.1.2 Fillers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.1.3 Edit Disfluencies . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.2 Structural Event Detection Task Description . . . . . . . . . . . . . . 45
3.2.1 Task Description . . . . . . . . . . . . . . . . . . . . . . . . . 45
3.2.2 Performance Measures . . . . . . . . . . . . . . . . . . . . . . 47
3.3 Corpora . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
4 THE HMM APPROACH TO STRUCTURAL EVENT DETECTION . . . 54
4.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
4.2 Feature Types . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
4.2.1 Prosodic Features . . . . . . . . . . . . . . . . . . . . . . . . . 55
4.2.2 Textual Features . . . . . . . . . . . . . . . . . . . . . . . . . 60
4.3 The Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
4.3.1 The Prosody Model . . . . . . . . . . . . . . . . . . . . . . . . 60
4.3.2 The Language Model (LM) . . . . . . . . . . . . . . . . . . . 62
4.4 Model Combination . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
4.5 Chapter Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
5 HMM BASELINE PERFORMANCE . . . . . . . . . . . . . . . . . . . . . 68
5.1 System Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
5.1.1 Choice of Classes . . . . . . . . . . . . . . . . . . . . . . . . . 68
5.1.2 Training Procedures . . . . . . . . . . . . . . . . . . . . . . . 70
5.1.3 Testing Procedures . . . . . . . . . . . . . . . . . . . . . . . . 73
5.2 Baseline System Performance . . . . . . . . . . . . . . . . . . . . . . 76
5.2.1 Task 1: SU Detection . . . . . . . . . . . . . . . . . . . . . . . 76
5.2.2 Task 2: Filler Word Detection . . . . . . . . . . . . . . . . . . 82
5.2.3 Task 3: Edit Word and IP Detection . . . . . . . . . . . . . . 84
5.2.4 Summary for All the Tasks . . . . . . . . . . . . . . . . . . . . 92
5.3 Chapter Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
6 INCORPORATING TEXTUAL KNOWLEDGE SOURCES INTO THE HMM SYSTEM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
6.1 Review of Related Language Model Techniques . . . . . . . . . . . . 97
6.2 Various Knowledge Sources . . . . . . . . . . . . . . . . . . . . . . . 100
6.2.1 Word-LM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
6.2.2 Automatically Induced Classes (AIC) . . . . . . . . . . . . . . 101
6.2.3 Part-of-speech (POS) Tags . . . . . . . . . . . . . . . . . . . . 102
6.2.4 Syntactic Chunk Tags . . . . . . . . . . . . . . . . . . . . . . 104
6.2.5 Word LMs from Additional Corpora . . . . . . . . . . . . . . 107
6.3 Integration Methods for the LMs in an HMM . . . . . . . . . . . . . 108
6.4 Experiments on SU Detection Task . . . . . . . . . . . . . . . . . . . 110
6.4.1 CTS SU Task . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
6.4.2 BN SU Task . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
6.5 Chapter Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
7 PROSODY MODEL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
7.1 Addressing the Imbalanced Data Set Problem . . . . . . . . . . . . . 118
7.1.1 The Imbalanced Class Distribution Problem . . . . . . . . . . 118
7.1.2 Approaches to Address the Problem . . . . . . . . . . . . . . . 120
7.2 Pilot Study for SU Detection . . . . . . . . . . . . . . . . . . . . . . . 123
7.2.1 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . 123
7.2.2 Sampling Results . . . . . . . . . . . . . . . . . . . . . . . . . 125
7.2.3 Bagging Results . . . . . . . . . . . . . . . . . . . . . . . . . . 129
7.3 Sampling and Bagging Across SU and IP Tasks . . . . . . . . . . . . 132
7.3.1 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . 132
7.3.2 Results Across SU and IP Tasks . . . . . . . . . . . . . . . . . 133
7.4 Evaluation on the Full NIST SU Task . . . . . . . . . . . . . . . . . . 136
7.4.1 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . 136
7.4.2 Results on the NIST SU Task . . . . . . . . . . . . . . . . . . 138
7.5 Chapter Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
7.5.1 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
7.5.2 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
8 APPROACHES TO COMBINE KNOWLEDGE SOURCES . . . . . . . . 143
8.1 Knowledge Sources . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144
8.2 A Review of the HMM for SU Detection . . . . . . . . . . . . . . . . 145
8.3 The Maxent Posterior Probability Model for SU Detection . . . . . . 149
8.3.1 Description of the Maxent Model . . . . . . . . . . . . . . . . 150
8.3.2 Features Used . . . . . . . . . . . . . . . . . . . . . . . . . . . 152
8.3.3 Comparisons of the Maxent and HMM Approaches . . . . . . 155
8.3.4 Results and Discussion for the Maxent SU Model . . . . . . . 156
8.4 The Conditional Random Field (CRF) Model for SU Detection . . . . 164
8.4.1 Description of the CRF Model . . . . . . . . . . . . . . . . . . 165
8.4.2 Comparisons of CRF and Other Models . . . . . . . . . . . . 166
8.4.3 Results and Discussion for the CRF SU Model . . . . . . . . . 167
8.5 Chapter Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . 171
9 SYSTEM FOR RT-04 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 173
9.1 RT-04 Tasks and Data . . . . . . . . . . . . . . . . . . . . . . . . . . 173
9.2 System Performance for SU Boundary Detection . . . . . . . . . . . . 174
9.3 SU/SU-Subtype Detection . . . . . . . . . . . . . . . . . . . . . . . . 176
9.4 Edit Word Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . 178
9.4.1 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 178
9.4.2 Edit Detection Results . . . . . . . . . . . . . . . . . . . . . . 182
9.5 Chapter Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . 183
10 RELATED EFFORTS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 185
10.1 Factors Impacting Performance . . . . . . . . . . . . . . . . . . . . . 185
10.1.1 Word Error Rates (WER) . . . . . . . . . . . . . . . . . . . . 185
10.1.2 Speaker Label for SU Detection . . . . . . . . . . . . . . . . . 187
10.2 Word Fragment Detection . . . . . . . . . . . . . . . . . . . . . . . . 190
10.2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 190
10.2.2 Acoustic and Prosodic Features . . . . . . . . . . . . . . . . . 193
10.2.3 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . 195
10.3 Chapter Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . 198
11 FINAL REMARKS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 200
11.1 Impact on Other Research Efforts . . . . . . . . . . . . . . . . . . . . 200
11.1.1 Using Structural Event Information for Word Recognition . . 200
11.1.2 SU Detection in a Multi-modal Corpus . . . . . . . . . . . . . 202
11.1.3 Dialog Act Detection in Meeting Corpus . . . . . . . . . . . . 204
11.2 Summary of Experiments . . . . . . . . . . . . . . . . . . . . . . . . . 206
11.3 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 208
11.4 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 209
LIST OF REFERENCES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 212
APPENDICES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 224
Appendix A: ADT Boosting For SU and IP Detection . . . . . . . . . . . . 224
A.1 ADT Boosting Description . . . . . . . . . . . . . . . . . . . . 224
A.2 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . 226
A.3 ADT Boosting Summary . . . . . . . . . . . . . . . . . . . . . 227
Appendix B: Prosodic Features . . . . . . . . . . . . . . . . . . . . . . . . 227
VITA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 230
LIST OF TABLES
Table Page
1.1 Symbols used for the structural events in the example of annotated transcriptions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.1 A summary of some important prior studies on sentence boundary detection. Column two is the task chosen for each investigation: boundary means the sentence boundary detection task, compared to its subtype or punctuation detection; column three describes the model or the information sources used by each investigation; column four is the corpus on which the experiments were conducted; column five represents whether the experiments were performed on human transcriptions (Ref) or recognition results (ASR). Note that CTS (i.e., conversational telephone speech) is used in the corpus column for those experiments that were conducted on the Switchboard corpus. Even though no textual information is used in this automatic detection model, the Ref condition is used in that study for its evaluation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.2 A summary of some important prior studies on disfluency detection. Column two is the task for each investigation; column three describes the model or information sources used by each investigation; column four is the corpus on which the experiments were conducted; column five represents whether the experiments were performed on human transcriptions (Ref) or recognition results (ASR). In Core [52], preliminary repair information is provided, and the parser further corrects them. . . . . . . . . 35
3.1 Structural events annotated by LDC and investigated in this thesis. Note that the subtype of an edit disfluency is not annotated by LDC, nor is the correction in an edit disfluency. . . . . . . . . . . . . . . . . . . . . . . . 41
3.2 Information on the CTS and BN corpora, including the data set sizes, the percentage of the different types of structural events in the training set, and the word error rate (WER) of the speech recognizer on the test set. . 52
4.1 Examples of cue words that are highly representative of some structural event types. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
4.2 Examples of the prosodic features used for the SU detection problem thatappear in the decision tree shown in Figure 4.3. . . . . . . . . . . . . . . 63
5.1 CTS SU detection results using the NIST SU error rate (%) and the boundary-based CER (% in parentheses) on human transcriptions (REF) and recognition output (STT), for the LM and the prosody model individually, and in combination. The baseline error rate, assuming there is no SU boundary at each word boundary, is 100% for the NIST SU error rate and 15.7% for CER. . . . . . . . . . . . . . . . . . . . . . . . . . . 77
5.2 Deletion and insertion error rates (NIST SU error rate in %) for the CTS REF condition, using the LM and the prosody model alone and in their combination. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
5.3 Feature usage (%) for SU detection on CTS. . . . . . . . . . . . . . . . 79
5.4 BN SU detection results using the NIST SU error rate (%) and the CER (% in parentheses) using the prosody model, the LM, and their combination. Results are shown for both REF and STT conditions. The baseline error rate is 100% for the NIST SU error rate and 7.2% for CER. . . . . 79
5.5 Deletion and insertion error rates (NIST SU error rate in %) for the BN REF condition, using the LM and the prosody model alone and in their combination. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
5.6 Feature usage (%) for SU detection on BN. . . . . . . . . . . . . . . . . 81
5.7 Results for CTS filler word (including FP and DM) detection, FP detection, and DM boundary detection using NIST error rate (%) and CER (% in parentheses) for the prosody model, LM, and their combination. Results are for both the REF and STT conditions. The baseline CER is 8.3% for filler word detection, 3.6% for FP detection, and 2.8% for DM boundary detection. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
5.8 Feature usage (%) for the FP and DM detection tasks in CTS. . . . . . 85
5.9 CTS edit word and IP detection results using NIST error rate (%) and CER (% in parentheses) for the prosody model, the LM, and their combination. Results are for the REF and STT conditions. The baseline CER is 8.3% for edit word detection, and 4.8% for edit IP detection. . . . . . 91
5.10 Feature usage (%) for IP detection on CTS corpus. . . . . . . . . . . . 92
5.11 System performance (NIST error rate in %) for all the structural event detection tasks on CTS and BN test sets. Results are presented for both the REF and STT conditions. . . . . . . . . . . . . . . . . . . . . . . . 93
6.1 Two examples of automatically induced classes for the CTS SU detection task, depicting member words and each word's probability given the class. 103
6.2 The POS and chunk tags for a sentence from the BN corpus, "the top selling car of nineteen ninety-seven was announced today and the winner is toyota camry". . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
6.3 SU detection results (NIST error rate in %) for human transcriptions of CTS data using various LMs, alone and in combination with the prosody model. The deletion (DEL), insertion (INS), and total error rates are reported. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
6.4 SU detection results (NIST error rate in %) for human transcriptions of the BN data using various LMs, alone and in combination with the prosody model. The deletion (DEL), insertion (INS), and total error rates are reported. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
7.1 Description of the data set used in the pilot study for the CTS SU detection task. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
7.2 SU detection results (CER in % and F-measure) for different sampling approaches in the pilot study of the CTS corpus, using the prosody model alone and in combination with the LM. The CER of the LM alone on the test set is 5.02%. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
7.3 Recall and precision results for the sampling methods in the pilot study of CTS SU detection. Using the LM alone yields a recall of 74.6% and a precision of 84.9%. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
7.4 CTS SU detection results (CER in % and F-measure) with bagging applied to a randomly downsampled data set (DS), an ensemble of downsampled training sets, and the original training set. The results for the training conditions without bagging are also shown for comparison. . . . . . 130
7.5 Description of the data sets used for the SU and IP detection tasks. The data set used in the pilot study is shown in the second column, which is a subset of the data set used in this investigation (large set denoted in the table). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
7.6 IP and SU detection results in CER (%). DS denotes downsampled. Chance performance is 4.36% on the original test set for IP, and 13.64% for SU. The CER using the LM alone is 2.34% on the IP task, and 5.27% on the SU task. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134
7.7 SU detection results (NIST error rate in %) for both the CTS and BNcorpora, on the REF and STT conditions. . . . . . . . . . . . . . . . . . 139
8.1 SU detection results (NIST error rate in %) for different state configurations using the trigram LM alone on the CTS reference condition. The insertion (INS), deletion (DEL), and total error rates are shown. . . . . 148
8.2 SU detection results (NIST error rate in %) using the Maxent and the HMM approaches individually and in combination on BN and CTS, on reference transcriptions (REF) and recognition output (STT). . . . . . . 157
8.3 Deletion, insertion, and total error rates (NIST error rate in %) of the HMM and Maxent approaches on reference transcriptions of BN and CTS. 158
8.4 SU detection results (NIST error rate in %) using different knowledge sources on BN and CTS, evaluated on the reference transcription. . . . . 159
8.5 Comparison of using the posterior probabilities from the prosody model as binary features versus continuous valued features in the Maxent approach for SU detection in the CTS reference transcription condition. . . . . . . 160
8.6 Some of the N-gram features with the highest IG weights for the CTS SU detection task. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162
8.7 Notation for a 2 × 2 contingency table used in Chi-square statistics. . . 163
8.8 SU detection results (NIST error rate in %) using different feature selection metrics and different pruning thresholds (number of the preserved features), for the CTS REF condition. . . . . . . . . . . . . . . . . . . . 164
8.9 SU detection results (NIST error rate in %) using the HMM, Maxent, and CRF approaches individually and in combination on BN and CTS, on reference transcriptions (REF) and recognition output (STT). The combination of the three approaches is obtained via a majority vote. . . 169
8.10 CTS SU detection results (NIST error rate in %) using the HMM, Maxent, and CRF individually, using different knowledge sources. Note that the all-features condition uses all the knowledge sources described in Section 8.3.2. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 170
8.11 BN SU detection results (NIST error rate in %) using the HMM, Maxent, and CRF individually, using different knowledge sources. . . . . . . . . . 171
9.1 Data description for CTS and BN used in the RT-04 NIST evaluation. BN training data is the combined RT-03 and RT-04 data. CTS contains only the RT-04 training data. . . . . . . . . . . . . . . . . . . . . . . . . 174
9.2 SU boundary detection results (NIST SU error rate in %) on the RT-04 evaluation data. The combination is the majority vote of the Maxent, CRF, and the improved HMM approaches. DS denotes a downsampled training set. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 175
9.3 Percentage of SU subtypes for CTS and BN. . . . . . . . . . . . . . . . 177
9.4 SU/SU-subtype detection results (%) on RT-04 CTS evaluation data. Results are reported using the NIST SU boundary error rate, substitution error rate, and the subtype classification error rate (CER). . . . . . . . . 177
9.5 SU subtype detection results (in a confusion matrix) on the CTS human transcription condition. Each cell shows the count and percentage (%) of a reference subtype (row) that is hypothesized as the subtype shown in the column. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 178
9.6 States and transitions used by the CRF for edit word and edit IP detection. The class tags are: the beginning of an edit (B-E) and the inside of an edit (I-E), each of which has a possible IP associated with it (B-E+IP or I-E+IP), and outside of an edit (O). . . . . . . . . . . . . . . . . . . 181
9.7 Results (NIST error rate in %) for edit word and IP detection, using the
HMM, Maxent, and CRF approaches on the reference and recognitionoutput conditions of CTS data. . . . . . . . . . . . . . . . . . . . . . . . 182
9.8 Results (NIST error rate in %) for edit word and IP detection, using theHMM and Maxent approaches. . . . . . . . . . . . . . . . . . . . . . . . 183
10.1 SU and edit word detection results (NIST error rate in %) for CTS and BN, on REF and various STT conditions using the RT-04 data. For SU detection, results are reported for the SU boundary detection error. STT-1 and STT-2 are two different STT outputs, and the WER (%) for them is shown in the table. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 186
10.2 Comparisons of different ways to derive speaker labels on the RT-04 test set for the BN SU boundary detection task. Results are shown using the NIST error rate (%) for the HMM on the reference transcription condition. 189
10.3 Word fragment detection results (in a confusion matrix) on the downsampled data of the Switchboard corpus. . . . . . . . . . . . . . . . . . . . . 196
10.4 Feature usage (%) for word fragment detection using the Switchboarddata. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 197
11.1 WER (%) when SU information is fed back to re-segment and re-recognize speech, compared to the baseline using the acoustic segments, evaluating on half of the RT-03 BN data. . . . . . . . . . . . . . . . . . . . . . . . 202
11.2 SU detection results (NIST error rate in %) on the Wombat data. Note that the combined result is not shown when using textual information only, in order to make the results parallel to those in Chapter 8 (Table 8.2 and Table 8.4). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 203
11.3 DA boundary detection results (NIST error rate in %) on ICSI Meeting data. Results are for the reference transcriptions (REF) and STT output, using the pause decision tree (pause DT) model, the hidden event LM, and the HMM combination of them. . . . . . . . . . . . . . . . . . . . . 205
11.4 DA subtype classification accuracy (%) using the reference DA boundaries of the ICSI Meeting corpus, using the human transcriptions and recognition output. Two conditions are used: word-based features only, and the combined word-based features and the binned posterior probabilities from the decision tree (DT). Chance performance is obtained when the majority type (statement) is hypothesized for each DA. . . . . . . . 206
A.1 SU and IP detection results (classification error rate in %) using the ADT learning algorithm and bagging. Training and testing were conducted using a downsampled training and testing set. Chance performance is 50%. 226
LIST OF FIGURES
Figure Page
1.1 A flow diagram for the automatic structural event detection task. . . . . 7
3.1 Examples of transcriptions for CTS and BN, respectively. SU boundaries are not shown in the examples. . . . . . . . . . . . . . . . . . . . . . . . 51
4.1 The waveform, pitch and energy contours, word alignment, and SU boundaries for the utterance "um no I hadn't heard of that". . . . . . . . . . . 56
4.2 The raw and stylized F0 contours for the utterance "um no I hadn't heard of that". . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
4.3 An example of a decision tree for SU detection. Each line represents a node in the tree, with the associated question regarding one particular prosodic feature, the class distribution, and the most likely class among the examples going through this node (S stands for SU boundary, and 0 for non-SU boundary). The indentation represents the level of the decision tree. Some of the features used in this tree are described in Table 4.2. . . 62
5.1 Data preparation for model training. . . . . . . . . . . . . . . . . . . . . 71
5.2 System flow diagram of the testing procedure. . . . . . . . . . . . . . . . 74
5.3 System diagram for edit word and IP detection. . . . . . . . . . . . . . . 86
5.4 Valid state transitions for repetitions of up to 3 words. The X and Y axes represent the position in the reparandum and repetition regions, respectively, with events denoted as ORIG- and REP-. In ORIG-n, n means the position of a word in the reparandum; in REP-m.n, m is the total number of repeated words and n represents the position of the event in the repeat region. Optional filler words are allowed after the IP in the transition. . 88
5.5 A rule-based method for determining the reparandum region after IPs are hypothesized. SU hypotheses are used in the rules. . . . . . . . . . . 90
6.1 Integration methods for the various LMs and the prosody model. . . . . 111
7.1 The bagging algorithm. T is 50 in our experiments. In each bag, the class distribution is the same as in the original data S. . . . . . . . . . . 122
7.2 ROC curves and their AUCs for the decision trees trained from different sampling approaches and the original training set. . . . . . . . . . . . . 128
7.3 ROC curves and their AUCs for the decision trees when bagging is used on the downsampled training set (bag-ds), the ensemble of downsampled training sets (bag-ensemble), and the original training set (bag-original). 131
7.4 ROC curves for IP and SU detection using the prosody model alone onthe CTS corpus. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137
8.1 The graphical model for the SU detection problem. Only one word-event pair is depicted in each state, but in a model based on N-grams the previous N-1 tokens would condition the transition to the next state. O are observations consisting of words W and prosodic features F, and E are structural events. . . . 146
8.2 The graphical model for the POS tagging problem. POS tags are the hidden states in this problem. S are POS tags, and W are words. . . . 148
8.3 The graphical representation of a CRF for the sentence boundary detection problem. E represents the state tags (i.e., SU boundary or not), while W and F are word and prosodic features respectively. O are observations consisting of W and F. . . . 165
8.4 The graphical model representations of the HMM, CMM, and CRF approaches. O are observations, and S are events (or tags). . . . 168
10.1 An illustration of how speaker change is obtained for the CTS data. An arrow represents a speaker change after that segment. . . . 188
10.2 The pruned decision tree used to detect word fragments. The decision is made in the leaf nodes; however, in the figure the decision for an internal node in the tree is also shown. . . . 198
11.1 Using SU information for re-recognition in BN. . . . . . . . . . . . . . . 201
A.1 An example of an alternating decision tree (ADT). . . . . . . . . . . . . 225
ABSTRACT
Liu, Yang. Ph.D., Purdue University, December, 2004. Structural Event Detection for Rich Transcription of Speech. Major Professor: Mary P. Harper.
Although speech recognition technology has significantly improved during the
past few decades, current speech recognition systems output only a stream of words
without providing other useful structural information that could aid a human reader
and downstream language processing modules. This thesis research focuses on the automatic detection of several helpful structural events in speech, including sentence
boundaries, type of utterance, filled pauses, discourse markers, and edit disfluencies.
The systems evaluated combine prosodic cues and textual information sources in a
variety of ways to support automatic detection of these structural events. Exper-
iments were conducted across corpora (conversational speech and broadcast news
speech) and with different transcription quality (human transcriptions versus recognition output).

The imbalanced data problem is investigated for training the decision tree prosody
model component of our system because structural events are much less frequent than
non-events. A variety of sampling approaches and bagging are used to address this
imbalance. Significant performance improvements are obtained via bagging. Some
of the sampling methods are useful depending on the performance metrics used.
Sentence boundary detection and disfluency detection tasks are impacted differently
by sampling, bagging, and boosting, suggesting the inherent differences between the two tasks.
A variety of methods for combining knowledge sources are examined: a hidden
Markov model (HMM), the maximum entropy (Maxent) model, and the conditional
random field (CRF). The Maxent and CRF approaches are discriminatively trained
to model the posterior probabilities and thus correlate with the performance measures. They also support the use of more correlated features and so enable the
combination of a variety of textual information sources. The HMM and CRF both
model sequence information, unlike the Maxent which explicitly models local information. A model that combines these three approaches is superior to any method
alone.
Interactions with other research efforts suggest that the methods developed in
this thesis generalize well to other corpora (e.g., a multimodal corpus, a multiparty
meeting corpus) and to similar tasks (e.g., a gestural model, dialog act segmentation
and classification).
1. INTRODUCTION
1.1 Motivation
Speech recognition technology has improved significantly during the past few
decades; for tasks involving read or pre-planned speech, recognition accuracy is often
greater than 90%. However, the word-level transcription accuracy for spontaneous
conversational speech falls far short of this level, generally lower than 80%. The
acoustic properties of spontaneous conversational speech are quite challenging to model due to phenomena such as coarticulation, word fragments, and filled pauses.
Additionally, disfluencies and ungrammatical utterances pose serious problems for
language models (LMs). These factors combine to affect the performance of speech
recognizers on spontaneous speech. The following is an excerpt of a transcription of
spontaneous conversational speech. Both the human transcription and the recognition output are shown in the example below. The presence of a word fragment in the example is represented by a - after the partial word. Word recognition errors in the recognition output are marked with a strikethrough, and the corresponding correct
words are shown in bold face inside curly parentheses (corresponding to deletion or
substitution errors).
Human Transcription:
but uh i'm i i i think that you know i mean we always uh i mean i've i've
had a a lot of good experiences with uh with many many people especially
where they've had uh extended family and i and an- i i kind of see that
that you know perhaps you know we may need to like get close to the
family environment and and get down to the values of you know i mean
uh it's money seems to be too big of an issue wi- with with with with with
what's going on today
Recognition Output:
but um that that {uh i'm i i} i think that you know we {i mean} we
always uh i mean i've i've had it there {a} a lot of good experiences with
the {uh} with many many people especially with have {where they've}
had extended family night and i and {an- i} i kind of see that that you
know perhaps you know we may need to like you're {get} close to the
family environment and in {and} get down to the values of you know i
mean no and {uh it's} money seems to be too big of an issue we would
{wi- with with with} with with really was we would what's going on
today
As can be seen from the recognition output example, current automatic speech
recognition (ASR) systems simply output a stream of words. Structural information (such as the location of punctuation, disfluencies, and speaker turns) is missing,
making it difficult for a human to read or for downstream automatic processors to
deal with. As shown in the example above, even the human transcription, which
contains no word errors, is still hard to read due to the absence of punctuation and
the presence of speech disfluencies and filler words.
The transcriptions can be marked with different types of structural information
to enhance readability or ease downstream processing. In this thesis, the following
types of structural events are considered:
• Sentence boundaries: A sentence ends with ./ for a statement, .../ for an incomplete statement, and ?/ for a question in the marked-up transcription examples in this thesis.
• Filler words: These include filled pauses (e.g., uh and um) and discourse marker words (such as you know, well). The tokens < and > are used to mark the extent of these filler words.
• Edit disfluencies: Disfluencies are highly prevalent in conversational speech. In this thesis, the term edit disfluency is used for the disfluencies1 with the following structure (see Chapter 3 for more details):

(reparandum) * <editing term> correction
The edited portion of a disfluency (i.e., the reparandum) is marked in examples
with parentheses ( and ). For example, in "a a lot" in the human transcription shown above, the first "a" is the reparandum, so it should be marked with
parentheses. The interruption point (IP) inside the edit disfluency is marked
by *. The editing term, which follows the IP and precedes the corrections, is
optional. The edit disfluency structure is embedded in utterances and so may
be preceded and followed by words that are not part of the edit disfluency.
These types of structural information will be described in more detail in Chapter 3. Below is the annotation of our human transcription example.2 All the words
that interrupt the fluency of speech are shown in bold face in this example. Table 1.1
summarizes the meanings of the symbols used in the annotated transcriptions.
but uh (i'm * i * i think that you know i mean i've) * i've
had (a) * a lot of good experiences (with) * uh with (many) * many
people especially where they've had uh extended family ./
(and i * and) * an- (i) * i kind of see (that) * that you know perhaps
you know we may need to like get close to the family environment
(and) and get down to the values of you know i mean .../
(uh it's) * money seems to be too big of an issue (wi- * with * with
* with * with) * with what's going on today ./
The transcriptions containing this structural information are called rich transcriptions because they contain much richer information than a simple stream of
1 These disfluencies are also called speech repairs in the literature.
2 The human transcription is used here to illustrate the importance of structural information in order to factor out the effect of speech recognition errors.
Table 1.1
Symbols used for the structural events in the example of annotated transcriptions.
Symbol Meaning
./ or .../ sentence boundaries (complete or incomplete)
< > filler words
( ) reparandum in an edit disfluency
* interruption point in an edit disfluency
words. Given this structural information (either human annotated or automatically
generated), human transcriptions or recognition output can be cleaned up for improved readability. For example, if the disfluencies and fillers are removed from the previous transcription and each sentence is presented with the appropriate punctuation, the cleaned-up transcription would be as follows:
But I've had a lot of good experiences with many people especially where
they've had extended family. I kind of see that perhaps we may need to
get close to the family environment and get down to the values of... Money
seems to be too big of an issue with what's going on today.
Clearly this cleaned-up transcription is more readable, is easier to understand, and
is more appropriate for subsequent language processing modules.
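The cleanup step just described can be sketched in a few lines. This is a toy illustration rather than the thesis's implementation; it assumes a simplified inline annotation (reparanda in parentheses, fillers delimited by < and >, and ./, .../, ?/ as boundary tokens), which only roughly matches the conventions of Table 1.1:

```python
import re

def clean_transcript(annotated):
    """Remove edit disfluencies and fillers from an annotated transcript.

    Assumes (simplified) conventions: reparanda in parentheses,
    interruption points marked '*', fillers in <...>, and './',
    '.../', '?/' as sentence-boundary tokens.
    """
    text = re.sub(r"\([^()]*\)", " ", annotated)   # drop reparanda
    text = re.sub(r"<[^<>]*>", " ", text)          # drop filler words
    text = text.replace("*", " ")                  # drop interruption points
    # map boundary tokens to conventional punctuation ('.../'  before './')
    text = text.replace(".../", "...").replace("./", ".").replace("?/", "?")
    return re.sub(r"\s+", " ", text).strip()

print(clean_transcript(
    "but <uh> (i'm * i * i think that <you know> i've) * i've had (a) * a lot ./"
))  # → but i've had a lot .
```

A real cleanup system would, of course, operate on automatically detected events rather than gold annotation, and would also restore capitalization.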
There has been a growing interest recently in the study of the impact of structural
events. Jones et al. [1] have conducted experiments showing that cleaned-up transcriptions improve human readability compared to the original transcription. Other
recent research has considered whether automatically generated sentence information can play a role in parsing. Gregory et al. [2] have found that using sentence-internal prosodic cues degrades parsing performance; however, the method used for
automatically generating sentence-internal annotations was not state-of-the-art. On
the other hand, Kahn et al. [3] have achieved significant error reductions in parsing performance when using sentence boundary information from a state-of-the-art
automatic detection system.
1.2 Scope of the Thesis
1.2.1 Structural Event Detection Tasks
Automatic structural event detection is a crucial step for improving the readability of speech recognition output and for making spontaneous speech understanding
systems possible. The goal of this thesis is to enrich the recognition output with
multiple levels of structural information, including sentence boundaries, filled pause
and discourse marker words, and edit disfluencies. We will construct and evaluate
algorithms that automatically detect such structural event types.
Note that the problem of sentence boundary detection differs from its analog in
text processing, which is sometimes called sentence splitting or sentence boundary
detection. The goal of the sentence splitting task is to identify sentence boundaries
in written text where punctuation is available; hence, the problem is effectively
reduced to deciding which symbols that potentially denote sentence boundaries (i.e.,
. ! ?) actually do. The sentence splitting problem is not deterministic since these
punctuation symbols do not always occur at the end of sentences. For example, in
"I watch C. N. N.", only the final period denotes the end of a sentence. In the
sentence boundary detection task using speech, no punctuation is available, yet the
availability of speech provides additional useful information.
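The text-side task can be illustrated with a deliberately naive splitter. The abbreviation list and the rule below are hypothetical; real systems such as the maximum entropy approach of [21] learn these cues from data:

```python
import re

# A toy abbreviation list; real systems learn such cues from corpora.
ABBREVIATIONS = {"c.", "n.", "mr.", "dr.", "u.s."}

def split_sentences(text):
    """Naive splitter: '.', '!' or '?' ends a sentence unless the token
    looks like an abbreviation (e.g. the 'C.' and 'N.' in 'I watch C. N. N.')."""
    tokens = text.split()
    sentences, current = [], []
    for i, tok in enumerate(tokens):
        current.append(tok)
        if re.search(r"[.!?]$", tok):
            is_abbrev = tok.lower() in ABBREVIATIONS
            last_token = i == len(tokens) - 1
            if last_token or not is_abbrev:
                sentences.append(" ".join(current))
                current = []
    if current:
        sentences.append(" ".join(current))
    return sentences

print(split_sentences("I watch C. N. N."))        # → ['I watch C. N. N.']
print(split_sentences("It is on now. I like it."))  # → ['It is on now.', 'I like it.']
```

The point of the sketch is what it cannot do: with no punctuation at all, as in speech transcripts, the rule has nothing to disambiguate, which is why the speech task needs additional knowledge sources.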
We will investigate structural event detection across corpora, on both broadcast
news and conversational telephone speech. Broadcast news comprises read speech,
formal interviews, man-on-the-street interviews, and some spontaneous speech, although not usually conversational. In contrast, conversational telephone speech is
spontaneous, and much of it is quite informal. Broadcast news usually has fewer
edit disfluencies than spontaneous conversational speech, and many of these may
be caused by reading errors. Our algorithms will be evaluated on both the human
transcriptions and recognition output to investigate the effect of incorrect words in
ASR output on system performance.
1.2.2 Our Approach to the Problem
The framework of most current speech recognition systems is to find the most
likely word sequence given the speech signal. Because the hidden structure of the
utterance (sentence boundaries and disfluencies) does not have an explicit acoustic
signal,3 it is hard to integrate the problem of structural event detection with word
recognition in current speech recognition systems. Therefore, we will address this
problem by using a post-processing approach that generates the structural information after the recognition results are available. Several knowledge sources will be
employed, involving both textual information and prosodic cues to reduce ambiguity
inherent in one knowledge source. Figure 1.1 shows a diagram of our approach, the
final output of which is a rich transcription or cleaned-up transcription. As the figure
shows, prosodic information is obtained from a combination of the speech signal and
recognition output, which is used to provide word and phone alignments.
In our investigations, textual information is obtained from the word strings in
the transcriptions generated either by a human transcriber or by the ASR system.
This type of information is no doubt very important. In many cases, people have
no problem inferring appropriate structural events from word transcriptions. Some
textual cues are quite useful for automatic identification of structural events, for
example, words like "I" often start a new sentence, and a repeated or revised word
string often signals disfluencies. In addition, the syntactic and semantic information
derived from the words provides valuable cues for structural event detection.
3 There are some implicit prosodic cues at the boundary points, which will be described in Chapter 5.
[Figure 1.1: flow diagram — the speech signal passes through ASR to produce a transcription; textual features are extracted from the processed transcription and prosodic features from the signal plus the ASR alignment; both feed the structural event detection systems, whose structural event output yields the rich or cleaned-up transcription.]
Fig. 1.1. A flow diagram for the automatic structural event detection task.
In some cases, the use of textual information alone may not completely disambiguate structural events. The following example is extracted from the broadcast
news data:
Anne what are the chances we'll hear uh something of substance again
from the President prior to the vote ?/
And that's a possible next step ?/
A purely textual model would not be able to determine whether the second sentence
is a statement or a question. However, the rising tone in the speech signal would
enable the listener to determine that a question is intended.
In the face of high word error rates, word level information may be unreliable and
possibly misleading. In such a case, the lexical, syntactic, and semantic patterns used
for detecting sentence boundaries and disfluencies will be less reliable due to the word
errors. The following example compares ASR output with a human transcription of
the speech:
ASR output:
It's been a while for the good for the tackle that stuff
Human transcription:
It's been a while since I've uh uh since I've tackled that stuff
It will be difficult, if not impossible, for a word-based language model to identify the
repetition or the existing disfluencies using this ASR output.
Prosody, the rhythm and melody of speech, is important for automating
rich transcription. Past research results [4-14] suggest that speakers use prosody to
impose structure on both spontaneous and read speech. Examples of such prosodic
indicators include pause duration, change in pitch range and amplitude, global pitch
declination, melody and boundary tone distribution, vowel duration lengthening, and
speaking rate variation. Since these features provide information complementary to
the word sequence, they provide an additional potentially valuable source of information for structural event detection. Additionally, since they may be more robust
than textual features to word errors, they may provide a more reliable knowledge
source.
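As a toy illustration of how a few of these prosodic indicators might be computed, consider a hypothetical word-level forced alignment (this is not the feature set actually used in this thesis, which is described in Chapter 5):

```python
from dataclasses import dataclass

@dataclass
class AlignedWord:
    word: str
    start: float   # seconds
    end: float

def boundary_features(words, i):
    """Toy prosodic-style features at the boundary after words[i].

    Real systems also use F0 (pitch range, pitch resets), energy, and
    phone durations normalized by speaker and phone statistics.
    """
    w = words[i]
    pause = (words[i + 1].start - w.end) if i + 1 < len(words) else 0.0
    duration = w.end - w.start
    # crude pre-boundary lengthening proxy: duration per character
    rate = duration / max(len(w.word), 1)
    return {"pause_dur": round(pause, 3),
            "word_dur": round(duration, 3),
            "dur_per_char": round(rate, 3)}

words = [AlignedWord("today", 1.00, 1.62), AlignedWord("anne", 2.12, 2.40)]
print(boundary_features(words, 0))
```

Such feature vectors, one per interword boundary, are what a prosody model classifies; the long pause and lengthened word duration here are exactly the kinds of cues that signal a sentence boundary.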
Textual and prosodic knowledge sources have been exploited in previous research [12, 13, 15-18], and their combination has proven to be beneficial to the performance of structural event detection. This thesis builds upon this prior work that
combined these knowledge sources using a hidden Markov model (HMM) approach.
We will focus on developing a richer feature set for these knowledge sources, building
more effective models to capture such information, and integrating various knowledge
sources for structural event detection by using different modeling approaches.
The investigations in this thesis should help to answer several questions with
respect to the automatic detection of structural events: What knowledge sources
are helpful? What is the best modeling approach for combining different knowledge
sources? How is the model performance affected by various factors such as corpora,
transcriptions, and event types?
2. RELATED WORK
In the past decade, a substantial amount of research has been conducted in the areas
of detecting intonational and linguistic boundaries in conversational speech, as well as
in detecting and correcting speech disfluencies. In this chapter, we introduce research
related to the automatic detection of different structural events, namely, sentence
boundaries, edit disfluencies, and filler words. For each type, related research is
categorized based on what knowledge sources have been used. Additionally, for
completeness, studies from linguistics or psychology are discussed where appropriate.
2.1 Sentence Boundary Detection
For speech recognition, sentences are usually defined by acoustic segment
boundaries that correspond to long stretches of silence or a change of conversational turn.1 In contrast, linguistic segment boundaries mark a unit that represents
a complete idea but may not necessarily represent a grammatical sentence nor begin
or end with a long silence or turn change. Experiments by Meteer and Iyer in [19]
suggest that language model perplexity can be reduced by working with linguistic
segments rather than acoustic segments. Our goal is to automatically find such
linguistic sentence-like units.
Some of the previous research has focused on detecting major sentence boundaries;2 others have investigated detecting subtypes of sentences (e.g., questions, statements). Prior research related to sentence and subtype detection can be divided
1 The definition of turn varies in the literature. In this thesis, a turn is a portion of speech uttered by a single speaker and bounded by silence from that speaker. See http://secure.ldc.upenn.edu/intranet/Annotation/MDE/guidelines/2004/control_floor.shtml for more details.
2 The definition of sentence varies across these past research efforts. The term used in this thesis will be defined in Chapter 3.
into two categories based on the knowledge sources employed: a text-based approach
or an approach using textual and acoustic information. The text-based approach uses
only textual information; hence, it is suitable for both transcribed speech and writ-
ten text. Text-based methods may not be able to resolve some ambiguities using
information found in text, as in the example in Section 1.2.2, for which a question
type is detected based on the rising tone. A combination approach uses both the
acoustic cues and textual information. In most cases, it is difficult to compare the
results of prior research since they often differ on the corpora used for training and
testing, as well as in the information used by their systems.
2.1.1 Text-based Processing for Sentence Boundary Detection
As mentioned in Chapter 1, the sentence boundary detection problem in written
text aims to disambiguate punctuation marks with the goal of identifying sentence
boundaries. Palmer and Hearst [20] developed an efficient automatic sentence boundary labeling algorithm, which uses the part-of-speech (POS) probabilities of the
tokens surrounding a punctuation mark as input to a feed-forward neural network
to obtain the role of the punctuation mark. Because sentence boundaries were not
available to their part-of-speech tagger, they used the prior probabilities of all parts
of speech for a word. They tested their system on a portion of the Wall Street Journal (WSJ) corpus. Their experiments found that a context of six surrounding tokens
and a hidden layer with two units yielded the best accuracy on the test set. When
training and testing were conducted using texts in lower-case-only format, the network was able to disambiguate 96.2% of the boundaries. Other approaches have also
been used to investigate this problem, for example, Reynar and Ratnaparkhi [21]
used a maximum entropy algorithm, and Schmid [22] employed an unsupervised
learning method. Walker et al. [23] compared three different methods for sentence
boundary detection as a preprocessing step in machine translation. They showed
that the maximum entropy method [21] outperforms the other two systems, i.e.,
the direct model and the rule-based system. They also argued that high recall is
more important for the application of machine translation: fragmenting sentences is
better than combining two sentences. This insight might be useful if we are going
to use our structural event detection results in the downstream language processing
modules, among which machine translation is one. The sentence boundary problem
in text processing is different from that in speech processing in that punctuation
information is available in text (although it is not deterministic). However, some
knowledge obtained from such a task is useful to our automatic sentence boundary
detection in speech, such as the lexical cues that are most effective for determining
the role of punctuation.
An automatic punctuation system, called Cyberpunc, which is based only on lexical information, was developed by Beeferman et al. [24]. They counted the occurrence of each punctuation mark in the 42 million tokens of the WSJ corpus and
reported that about 10.5% of the tokens in that corpus were punctuation, mostly
commas (4.658%) and periods (4.174%). Cyberpunc generates only commas, assuming that sentence boundaries are provided or pre-determined. They extended
a language model to account for punctuation by explicitly including commas in an
N-gram LM and allowing commas to occur at interword boundaries. Commas were added to the testing word strings by finding the best hypothesis using a Viterbi
algorithm. They evaluated this method for generating commas on 2,317 reference
sentences of the Penn Treebank WSJ corpus that were stripped of punctuation marks.
They obtained a recall rate of 66% and precision of 76% for this comma generation
task. The goal of this research differs from sentence boundary detection in speech
because the task is to find commas assuming that the major sentence boundaries are
known. Beeferman et al. [24] claimed that a punctuation-aware language model canbe applied to rescore speech recognition lattices in general, but they did not evaluate
this.
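The hidden-event idea behind this kind of model can be sketched as follows. The bigram probabilities below are invented for illustration, and the decoder is a greedy per-boundary approximation of the Viterbi search used in the original system:

```python
import math

# Toy bigram log-probabilities over words plus the pseudo-word ",".
# (Hypothetical numbers; a real system estimates them from punctuated text.)
LOGP = {
    ("well", ","): math.log(0.30), ("well", "i"): math.log(0.05),
    (",", "i"): math.log(0.20), ("i", "think"): math.log(0.10),
}
DEFAULT = math.log(1e-4)  # back-off for unseen bigrams

def lp(a, b):
    return LOGP.get((a, b), DEFAULT)

def insert_commas(words):
    """At each interword boundary, keep a comma if routing the bigram
    through "," scores higher than the direct word-to-word bigram.
    (A greedy stand-in for a full Viterbi search over the lattice.)"""
    out = [words[0]]
    for prev, nxt in zip(words, words[1:]):
        if lp(prev, ",") + lp(",", nxt) > lp(prev, nxt):
            out.append(",")
        out.append(nxt)
    return " ".join(out)

print(insert_commas(["well", "i", "think"]))  # → well , i think
```

Treating punctuation as ordinary tokens in the LM is exactly the hidden-event formulation that reappears, with boundary events instead of commas, in the HMM approach of later chapters.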
Stevenson and Gaizauskas [25] also conducted experiments on identifying sentence boundaries in transcriptions of the WSJ corpus using a memory-based learning (MBL) algorithm. For each word boundary, they obtained a feature vector of
13 elements from the word and its neighboring words, including the probability of
the word starting or ending a sentence, their POS tags, and so on. The precision
and recall of their approach were around 35% when case information of the word was
removed. The results were much improved when case information was provided to
their sentence boundary detection system. Clearly, case information is important for
this method, suggesting that it may not extend well to ASR outputs, which do not
capture case information and often contain incorrect words.
2.1.2 Combining Textual and Prosodic Information for Sentence Boundary Detection
Some past research has been conducted on combining prosodic information and
textual information to find sentence boundaries and their subtypes in speech. It
is known that there is a strong correspondence between discourse structure and
prosodic information. A comparison between syntactic and prosodic phrasing was
presented by Fach [26]. In that study, the syntactic structure was generated by Abney's chunk parser [27] and prosodic structure was given by ToBI label files [28]. This work showed that at least 65% of the syntactic boundaries were prosodic boundaries
in read speech.
Chen [29] proposed a method combining speech recognition with punctuation
generation based on acoustic and lexical information using a business letter corpus.
Punctuation marks were treated as words in the dictionary, with acoustic baseforms
of silence, breath, and other non-speech sounds, and her language model was modified to include punctuation. Chen found that 75.6% of all pauses correspond to
punctuation marks, and that only 6.5% of the punctuation marks do not correspond
to pauses. This finding suggests that pauses are closely related to punctuation in
read speech. Chen conducted a speech recognition and automatic punctuation experiment on a business letter with 330 words, read aloud by 3 speakers. For different
testing conditions, Chen reported a result of about 70-80% accuracy on punctuation placement, but lower accuracy on correct identification of punctuation types.
How this result will apply to conversational speech or a larger corpus is unknown.
A sentence boundary recognizer using textual information and pause duration was developed by Gotoh and Renals [15]. In their work, for each interword boundary, a
decision is made about whether there is a sentence boundary or not. Their algorithm
finds the sequence of sentence boundary classes using speech recognition output by
combining probabilities from a language model and a pause duration model. They
conducted sentence boundary experiments on 16 hours of Broadcast News corpus
using acoustic and duration models trained on 300 hours of acoustic data and using
a language model trained on 9 million words. The word error rate (WER) for their test set was 26.3%. They obtained a recall rate of about 62% and a precision rate
of 80% for sentence boundary detection. Their study found that a pause duration
model when used alone performs more accurately than using an N-gram language
model for sentence boundary detection. This is possibly because the language
model suffers a lot from the word errors in the recognition output. They found that
the result is improved by combining these two information sources.
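One simple way to combine two such knowledge sources per boundary is a log-linear interpolation of the two models' probabilities. This is an illustrative sketch, not necessarily the combination rule of [15], and the weight would be tuned on held-out data:

```python
import math

def boundary_posterior(p_lm, p_pause, weight=0.5):
    """Combine a language-model boundary probability with a pause-model
    probability by weighted log-linear interpolation, then renormalize
    the boundary / non-boundary hypotheses into a posterior."""
    log_b = weight * math.log(p_lm) + (1 - weight) * math.log(p_pause)
    log_nb = weight * math.log(1 - p_lm) + (1 - weight) * math.log(1 - p_pause)
    m = max(log_b, log_nb)                 # subtract max for numeric safety
    num = math.exp(log_b - m)
    return num / (num + math.exp(log_nb - m))

# A long pause (high p_pause) reinforces a weak lexical cue:
print(round(boundary_posterior(p_lm=0.4, p_pause=0.9), 3))
```

The sketch shows the qualitative behavior reported above: a pause model can push an ambiguous lexical decision over the boundary threshold, and vice versa.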
Shriberg, Stolcke, and their colleagues have built a general HMM framework for combining lexical and prosodic cues for tagging speech with various kinds of hidden
structural information, including sentence boundaries, disfluencies, topic boundaries,
dialogue acts, emotion, and so on [12, 30-33]. Experimental results have shown that
the combination of the prosody model and language models generally performs better
than using each knowledge source alone.
In [12], Shriberg et al. directly compared two corpora (Switchboard and Broadcast News) on the task of sentence segmentation. Experiments were conducted on both human transcriptions and speech recognition outputs to compare the degradation of the prosody model and LM in the face of ASR errors. They extracted
prosodic features such as pause, phone and rhyme duration, and F0 features, as well
as other non-prosodic features such as turn change and gender. The features were
used as inputs to a decision tree model, which predicted the appropriate segment
boundary type at each interword boundary. They investigated the performance of
the prosody model, a statistical LM that captures lexical correlations with segment
boundaries, and a combination of the two models. On Broadcast News, the prosodic
model alone performed as well as (or even better than) the word-based statistical LM,
for both human transcriptions and recognized words. They found that the prosody
model often degraded less in the face of recognition errors. Furthermore, for all tasks
and corpora, they obtained a significant improvement over the word-only models by
combining models. Analysis of the decision trees revealed that the prosody model
captures language-independent boundary indicators, such as pre-boundary lengthening, boundary tones, and pitch resets. In addition, feature usage was found to
be corpus dependent. While pause features were heavily used in both corpora, they
found that duration cues dominated in Switchboard conversational speech, whereas pitch was a more informative feature in Broadcast News.
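The core of such a decision tree prosody model is a search for impurity-reducing splits over prosodic features. A minimal sketch, finding only a single CART-style split on toy pause/duration data (real trees recurse and use far richer feature sets):

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def best_stump(X, y):
    """Find the single (feature, threshold) split minimizing weighted Gini;
    a one-node stand-in for the CART trees used as prosody models."""
    best = None
    for f in range(len(X[0])):
        for t in sorted({row[f] for row in X}):
            left = [y[i] for i, row in enumerate(X) if row[f] <= t]
            right = [y[i] for i, row in enumerate(X) if row[f] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if best is None or score < best[0]:
                best = (score, f, t)
    return best[1], best[2]

# toy data: [pause_dur, word_dur]; label 1 = sentence (SU) boundary
X = [[0.05, 0.20], [0.02, 0.25], [0.60, 0.40], [0.90, 0.35], [0.01, 0.15]]
y = [0, 0, 1, 1, 0]
feature, threshold = best_stump(X, y)
print(feature, threshold)  # → 0 0.05  (split on pause duration)
```

On this toy data the learner picks pause duration as the most informative feature, mirroring the finding above that pause features are heavily used in both corpora.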
Kim and Woodland [16] also combined prosodic and lexical information in a
system designed to identify full stops, question marks, and commas in Broadcast
News. Their approach is similar to the one used by Shriberg et al. [12]. A prosodic
decision tree was tested alone and in combination with a language model, with some
improvements reported through the use of the combined model.
Christensen et al. [17] investigated two different approaches to automatically
identify punctuation using the Broadcast News corpus. A finite state approach combining a linguistic model with a prosody model significantly reduced the detection
error rate and increased the related precision and recall measures, especially when
using pause duration. They also showed how prosodic features like pause duration
increased detection accuracy for full stops but had very little impact for detecting
the other types of punctuation marks. The second approach used a multi-layer
perceptron (MLP) to model the prosodic features. This approach provides insight into
the relationship between the individual prosodic features and the various punctua-
tion marks. The results confirmed that pause duration features are the most useful
features for finding full stops.
Huang and Zweig [34] developed a maximum entropy based method to add punctuation
(period, comma, and question mark) into transcriptions for the Switchboard
corpus. Features used in their models involve the neighboring words, the tags
(punctuation marks) associated with the previous words, and pause features. They
evaluated this approach on both the reference transcription and speech recognition output.
Performance was measured using precision, recall, and F-measure. Results showed
that performance varies for the different punctuation marks, and adding the bigram
type of features (features about the previous and the current position, or the current
and the next position) improves F-measure by about 4% over unigram information
alone. They noticed that adding pause information yields only a small gain, in contrast
to the results reported for Broadcast News speech (such as [16]). This could be
attributed to the different data sets, or to a suboptimal use of pause information
in this maximum entropy approach. They observed also that a comma is hard to
distinguish from no-punctuation, and that question mark is often confusable with a
period. This approach provides a good framework for designing additional features.
The maximum entropy approach will be investigated further in Chapter 8.
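As a rough, hypothetical illustration of this kind of maxent tagger, the sketch below uses scikit-learn's LogisticRegression (a multinomial maximum entropy model) over neighboring-word and pause features in the spirit of [34]; all training pairs are invented, not Switchboard data:

```python
# Sketch: maxent-style punctuation tagging at each word boundary.  Features
# are the previous/next words plus a binary "long pause" indicator; the
# training examples are synthetic.
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

train = [
    ({"prev": "you",   "next": "i",      "long_pause": 1}, "period"),
    ({"prev": "okay",  "next": "so",     "long_pause": 1}, "period"),
    ({"prev": "know",  "next": "what",   "long_pause": 0}, "none"),
    ({"prev": "the",   "next": "thing",  "long_pause": 0}, "none"),
    ({"prev": "right", "next": "well",   "long_pause": 1}, "period"),
    ({"prev": "of",    "next": "course", "long_pause": 0}, "none"),
]
feats, labels = zip(*train)

vec = DictVectorizer()          # one-hot encodes the sparse features
X = vec.fit_transform(feats)

maxent = LogisticRegression(max_iter=1000)
maxent.fit(X, labels)

# An unseen boundary with a long pause and a sentence-initial-looking next word.
test = vec.transform([{"prev": "yeah", "next": "so", "long_pause": 1}])
print(maxent.predict(test)[0])
```

One attraction of the maxent framework, noted above, is exactly this ease of adding new overlapping features to the dictionary without changing the model.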
In the 2003 NIST sentence boundary detection evaluation, all the systems used
both prosodic and textual features for sentence boundary detection [35]. The ap-
proaches used are similar to the HMM approach used in [12]. For example, one
system estimated the likelihood of three classes: complete sentence, incomplete sen-
tence, and non-sentence. They used 48 acoustic-prosodic features estimated for each
word boundary, including pause, speaking rate, energy, and pitch features. These
prosodic features were used to train a 2-layer neural network. A linguistic subsystem
used a trigram LM which has sentence tokens inserted between words. The com-
bined decoder used the likelihood of the sentence classes from the acoustic-prosodic
subsystem and the likelihood from the linguistic system, along with a Viterbi al-
gorithm to find the class hypothesis at each word boundary. In another system, a
decision tree was used to predict 4 classes: complete sentence, incomplete sentence,
interruption point in edit disfluencies, or non-event boundary. The prosodic features
provided to the decision tree are similar to the ones described in [12]. In addition,
the posterior probability from the LMs was included as a feature in the decision
tree. These two systems were further combined using a 2-layer neural network which
uses the minimum square error back-propagation algorithm to hypothesize a binary
score at each word boundary. These systems were evaluated on both the Conversa-
tional Telephone Speech (CTS) and Broadcast News speech (BN), using both human
transcriptions and speech recognition output.
There is also some work that relies on only the prosodic information for finding
the sentence units. Wang and Narayanan [36] developed a method that used only the
prosodic features (mostly pitch features) in a multi-pass approach. They did not use
any word or phone alignment and thus avoided using a speech recognizer. They fit the
pitch contour with two linear folds and searched for major breaks in the pitch contour.
Then, in the second pass, sentence boundaries were detected based on some pre-defined
rules and statistics. They evaluated this algorithm using a subset of the Switchboard
corpus, and obtained a false alarm rate of 17.9% and a miss rate of 7.1%. This result
is encouraging since only pitch information is used. However, in conversational
speech, pitch may not be a very effective feature for sentence boundary detection.
Clearly, we would expect that adding additional prosodic and textual information
may yield further improvement.
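The two-linear-fold idea can be sketched as a least-squares search for the break point at which two straight lines best fit the pitch contour. The contour values and the exhaustive search below are an illustrative simplification, not the actual multi-pass algorithm of [36]:

```python
# Sketch: pick the break point that lets two straight lines fit the pitch
# contour with minimal total squared error.  The contour is synthetic.

def line_fit_sse(xs, ys):
    """Least-squares line fit; returns the sum of squared residuals."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx if sxx else 0.0
    inter = my - slope * mx
    return sum((y - (slope * x + inter)) ** 2 for x, y in zip(xs, ys))

def best_break(pitch):
    """Try each interior break point; return the one minimizing total SSE."""
    xs = list(range(len(pitch)))
    return min(range(2, len(pitch) - 2),
               key=lambda k: line_fit_sse(xs[:k], pitch[:k])
                             + line_fit_sse(xs[k:], pitch[k:]))

# A falling contour followed by a pitch reset (jump back up) at index 5,
# the kind of major break associated with a sentence boundary.
contour = [220, 210, 200, 190, 180, 240, 230, 220, 210, 200]
print(best_break(contour))
```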
2.1.3 Summary of Past Research on Sentence Boundary Detection
Finding sentence-like units and their subtypes can make transcriptions more read-
able, while also aiding downstream language processing modules, which typically
expect sentence-like segments. Previous work has shown that lexical cues are a
valuable knowledge source for determining punctuation roles and detecting sentence
boundaries, and that prosody provides additional important information for spo-
ken language processing. Useful prosodic features include pause, word lengthening,
and pitch patterns. Past experiments also show that detecting sentence boundaries
is relatively easier than reliably determining sentence subtypes or sentence-internal
breaks (e.g., commas). The poor performance of sentence-internal structure detec-
tion also affects downstream processing, such as parsing [2]. Table 2.1 summarizes
important attributes of much of the previous research. Most of these studies make use of textual
information, either by using a statistical LM or employing other machine learning
strategies. The value of adding more syntactic information to the task of sentence
detection is an open question. The approaches listed in the first five rows are simi-
lar to the approach taken in this thesis, since textual and prosodic information are
combined for sentence boundary detection.
2.2 Edit Disfluency Processing
Disfluencies have been investigated using a variety of approaches. Linguists and
psychologists have considered disfluencies largely from a production and perception
standpoint; whereas, computational linguists have been more concerned with recog-
nizing disfluencies and thus improving machine recognition of spontaneous speech.
Although the latter is our main focus, we believe that a better understanding of the
underlying theory of disfluency production and its effect on listeners' comprehension
can help to construct a better model for the automatic detection of disfluencies;
therefore, we will briefly discuss some studies in psychology and linguistics.
2.2.1 Production and Properties of Disfluencies
Disfluency Production
Disfluencies are very common in spontaneous speech. When speakers cannot
formulate an entire utterance at once or when they change their minds about what
they are saying, they may suspend their speech and introduce a pause or filler before
Table 2.1
A summary of some important prior studies on sentence boundary detection. Column
two is the task chosen for each investigation: boundary means the sentence boundary
detection task, compared to its subtype or punctuation detection; column three
describes the model or the information sources used by each investigation; column
four is the corpus on which the experiments were conducted; column five represents
whether the experiments were performed on human transcriptions (Ref) or recognition
results (ASR). Note that CTS (i.e., conversational telephone speech) is used in the
corpus column for those experiments that were conducted on the Switchboard corpus.
Even though no textual information is used in the pitch-only automatic detection
model, the Ref condition is used in that study for its evaluation.

Investigation               Classification Task     Model                     Corpus             Ref or ASR
Shriberg et al. [12]        boundary                prosody, word-LM          CTS, BN            Ref, ASR
Gotoh, Renals [15]          boundary                pause, word-LM            BN                 ASR
Kim, Woodland [16]          punctuation             prosody, word-LM          BN                 Ref
Huang, Zweig [34]           punctuation             Maxent (word, pause)      CTS                Ref, ASR
NIST eval systems [35]      boundary                prosody, word-LM          CTS, BN            Ref, ASR
Beeferman [24]              commas given boundary   word-LM                   WSJ                Ref
Stevenson, Gaizauskas [25]  boundary                MBL (word, POS)           WSJ                Ref
Chen [29]                   punctuation             punctuation token with    a business letter  ASR
                                                    acoustic information
Wang, Narayanan [36]        boundary                pitch                     CTS                Ref
continuing, or add, delete, or replace words they have already produced. Spontaneous
speech is systematically shaped by the problems speakers encounter while planning
an utterance, accessing lexical items, and articulating a speech plan. Speech errors
and disfluencies produced by normal speakers have been studied for decades to learn
about linguistic production and the cognitive processes of speech planning [37–39].
Disfluency has been used as evidence for cognitive load in speech planning. Ovi-
att [40] and Shriberg [41] have shown in different types of task-oriented conversations
that long utterances have a higher disfluency rate than short ones. This effect may
be related to the planning load of the utterance, i.e., speakers have more difficulty
planning longer utterances, while making task-oriented plans at the same time. An-
other observation is that disfluencies occur more frequently at the beginning of an
utterance when the utterance is at an early planning stage, providing evidence of
the impact of utterance planning on disfluencies.
Clark and Wasow [42] studied the phenomenon of repeated words in spontaneous
speech. In their work, repeats are divided into four stages: initial commitment,
suspension of speech, hiatus, and restart of the constituent. These four stages cor-
respond to the four components (i.e., reparandum, interruption, editing term, and
correction) that have been laid out in Chapter 1 for all edit disfluencies. They pro-
posed a commit-and-restore model of repeated words, as well as three hypotheses to
account for the repeats, namely, the complexity hypothesis, the continuity hypoth-
esis, and the commitment hypothesis. They hypothesize that the more complex a
constituent, the more likely speakers are to suspend it after an initial commitment
to it (i.e., complexity hypothesis), and that speakers prefer to produce constituents
with a continuous delivery (i.e., continuity hypothesis), and that speakers make a
preliminary commitment to constituents, expecting to suspend them afterward (i.e.,
commitment hypothesis). They analyzed repeated articles and pronouns in two large
corpora, the Switchboard corpus and the London-Lund corpus,3 and found strong
empirical evidence to support the proposed commit-and-restore model, along with
3See [42] for a description of the corpus.
evidence for all three hypotheses. They noticed that speakers are more likely to make
a premature commitment, and then immediately suspend it when the constituent
becomes more complex, and that it is more likely that speakers restart a constituent
the more that their suspension disrupts the utterance. One example is the frequent
occurrence of function words in repeats. It has long been recognized for English
that function words are repeated far more often than content words. When speakers
want to make an initial commitment to a constituent, the word they most commonly
use is a function word. Overall, Clark and Wasow [42] found that function
words were repeated more than ten times as often as content words, 25.2 versus 2.4
per thousand in the Switchboard corpus. This more frequent occurrence of function
words in repeats is explained by the three hypotheses they proposed.
Knowing the types of words that speakers tend to repeat (or revise) is helpful
for building a better model of spontaneous speech. For example, when speakers
repair a content word, they often return to a major constituent boundary, such as
"on Friday, I mean, on Monday". Such an observation is beneficial for defining
disfluency patterns and can aid in automatically identifying them.
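As a toy illustration of how such surface patterns can be encoded, the sketch below uses regular expressions to flag exact word repeats and the "on X, I mean, on Y" repair template from the example above; real disfluency detectors are far richer than this:

```python
import re

# Sketch: simple surface patterns for two disfluency types discussed above.
# The repair template is a hypothetical, hand-written pattern.

REPEAT = re.compile(r"\b(\w+)(?:\s+\1)+\b", re.IGNORECASE)   # e.g. "the the"
REPAIR = re.compile(r"\bon (\w+),? i mean,? on (\w+)\b", re.IGNORECASE)

def find_repeats(text):
    """Return the words involved in exact adjacent repetitions."""
    return [m.group(1) for m in REPEAT.finditer(text)]

utt = "i i think the the meeting is on friday, i mean, on monday"
print(find_repeats(utt))

m = REPAIR.search(utt)
print(m.group(1), "->", m.group(2))
```

Note how the repeats involve function words ("i", "the"), consistent with the counts reported by Clark and Wasow [42], while the content-word repair returns to the constituent boundary "on".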
Effect on Listeners
It is also valuable to understand how human listeners cope with disfluent input.
Studies by Lickley [43] and by Lickley and Bard [5] have shown that listeners generally miss
the disfluencies or incorrectly report on the occurrence of disfluencies, suggesting that
disfluencies may have been filtered out for utterance comprehension. Psycholinguists
believe that disfluencies play specific roles in our communication, such as sending sig-
nals to the listener to do things like pay more attention, help the speaker find a word,
or be patient while the speaker gathers his or her thoughts. Disfluencies provide in-
formation that enables people in a conversation to better coordinate interaction and
manage turn-taking [41].
Brennan [44] investigated how comprehension is affected when listeners hear dis-
fluent speech. In her experiments, listeners followed fluent and disfluent instructions
for selection of an object in a graphical display. She found that listeners make fewer
errors when hearing less misleading information before the interruption points of
disfluencies. She also observed that mid-word interruptions are better signals than
between-word interruptions that a word was produced in error and that the speaker
intends to replace it. This supports Levelt's hypothesis [38] that by interrupting
a word, a speaker signals to the addressee that that word is an error. If a word is
completed, the speaker intends the listeners to interpret it as correctly delivered.
Brennan also found in her experiments that there is information in disfluencies that
partially compensates for any disruption that listeners meet while processing disflu-
ent speech.
Fox Tree [45] studied how naturally occurring speech disfluencies affect listeners'
comprehension. She observed that disfluencies do not always have a negative effect
on comprehension. For example, repetitions do not hinder the listeners, because they
can help listeners to recover information missing in the first occurrence of words that
are repeated. However, it does take longer to identify words when there is a false
start. When false starts begin utterances, listeners may abort the false starts with nocost to comprehension. But, if false starts are in the middle of utterances, listeners
have to figure out where the false start begins, what to abort, and where to attach
the restarted information. This process slows down comprehension.
Disfluency Rates
A conservative estimate (excluding silent hesitations) for the rate of disfluencies4
in spontaneous speech is approximately 6 words per 100 words [45]. There are a
variety of