
STRUCTURAL EVENT DETECTION FOR RICH TRANSCRIPTION OF SPEECH

    A Thesis

    Submitted to the Faculty

    of

    Purdue University

    by

    Yang Liu

    In Partial Fulfillment of the

    Requirements for the Degree

    of

    Doctor of Philosophy

    December 2004


    To my parents and my husband.


    ACKNOWLEDGMENTS

I started this research on structural event detection at Purdue University and continued it at ICSI, where I have been for the past two and a half years. ICSI has provided a wonderful environment for me to enrich my research on speech and language processing.

I gratefully acknowledge my major advisor, Mary Harper, for both academic and moral support over the past few years. Even while I have been away from campus, she has been in constant touch via email and phone, supporting my research. I have benefited from her insightful guidance and discussion, as well as her encouragement. She has given me the intellectual freedom to do research in spoken language processing and has provided lots of advice. She has taught me how to be a researcher through all these years, while ploughing through the many paper drafts she has revised.

I would like to thank Elizabeth Shriberg and Andreas Stolcke for giving me the opportunity to continue my research at ICSI. I thank them for their valuable suggestions and comments when I encountered difficulties in my research. I have learned from them how to look at a problem from both a scientific and an engineering point of view. Special thanks to Elizabeth Shriberg for teaching me about linguistics as well as for providing academic advice over the past two years.

I thank my other Ph.D. committee members: Leah Jamieson and Jack Gandour at Purdue University. They have been very generous with their time and supportive of my research topic. I have benefited from discussions with Leah Jamieson about speech processing in my first two years of study at Purdue University.

Many people at ICSI also deserve acknowledgment. On the academic front, Barbara Peskin shared her vision of the entire structural event detection project, and at the same time was always willing to spend her time working out details. Nelson Morgan, as the director of ICSI, has created an excellent environment that nurtures research and learning. Chuck Wooters and James Fung deserve special thanks for generating speaker diarization results. Jeremy Ang, Kofi Boakye, Barry Chen, Dave Gelbart, Dan Gillick, Andy Hatch, Yan Huang, Adam Janin, Nikki Mirghafori, and Qifeng Zhu have been helpful office mates and neighbors at ICSI. They have made my time at ICSI more enjoyable.

There are so many other people who have contributed to my research. Luciana Ferrer at SRI has helped much with prosodic feature extraction. I thank Mari Ostendorf and Dustin Hillard at the University of Washington for their collaboration on the structural event detection work. Wen Wang, who finished her Ph.D. at Purdue University and is at SRI now, has been so patient with all my questions regarding language models. I am glad that I had the chance to work together with Lei Chen at Purdue University using a multimodal corpus for sentence boundary detection. Nitesh Chawla at CIBC has been a wonderful source for answers to my machine learning questions. Thanks also to Andrew McCallum at the University of Massachusetts and Fernando Pereira at the University of Pennsylvania for their support and advice on the CRF model. I also thank Julia Hirschberg and Yoav Freund at Columbia University for their assistance with the boosting algorithm.

Most of all, I thank my family for their support of my education. I would not be able to reach the end of this journey without consistent support and encouragement from my husband. His belief in me has made this thesis possible. The love from my parents and sister has also supported me during the difficult times in my graduate life.


    TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
ABSTRACT
1 INTRODUCTION
   1.1 Motivation
   1.2 Scope of the Thesis
       1.2.1 Structural Event Detection Tasks
       1.2.2 Our Approach to the Problem
2 RELATED WORK
   2.1 Sentence Boundary Detection
       2.1.1 Text-based Processing for Sentence Boundary Detection
       2.1.2 Combining Textual and Prosodic Information for Sentence Boundary Detection
       2.1.3 Summary of Past Research on Sentence Boundary Detection
   2.2 Edit Disfluency Processing
       2.2.1 Production and Properties of Disfluencies
       2.2.2 Past Research on Automatic Disfluency Detection
       2.2.3 Summary of Past Research on Disfluencies
   2.3 Filler Word Processing
       2.3.1 Production and Perception of Fillers
       2.3.2 Past Research on Filler Word Processing
   2.4 Chapter Summary
3 DATA RESOURCES AND TASKS
   3.1 Structural Speech Event Types
       3.1.1 Sentence-like Units (SUs)
       3.1.2 Fillers
       3.1.3 Edit Disfluencies
   3.2 Structural Event Detection Task Description
       3.2.1 Task Description
       3.2.2 Performance Measures
   3.3 Corpora
4 THE HMM APPROACH TO STRUCTURAL EVENT DETECTION
   4.1 Overview
   4.2 Feature Types
       4.2.1 Prosodic Features
       4.2.2 Textual Features
   4.3 The Models
       4.3.1 The Prosody Model
       4.3.2 The Language Model (LM)
   4.4 Model Combination
   4.5 Chapter Summary
5 HMM BASELINE PERFORMANCE
   5.1 System Description
       5.1.1 Choice of Classes
       5.1.2 Training Procedures
       5.1.3 Testing Procedures
   5.2 Baseline System Performance
       5.2.1 Task 1: SU Detection
       5.2.2 Task 2: Filler Word Detection
       5.2.3 Task 3: Edit Word and IP Detection
       5.2.4 Summary for All the Tasks
   5.3 Chapter Summary
6 INCORPORATING TEXTUAL KNOWLEDGE SOURCES INTO THE HMM SYSTEM
   6.1 Review of Related Language Model Techniques
   6.2 Various Knowledge Sources
       6.2.1 Word-LM
       6.2.2 Automatically Induced Classes (AIC)
       6.2.3 Part-of-speech (POS) Tags
       6.2.4 Syntactic Chunk Tags
       6.2.5 Word LMs from Additional Corpora
   6.3 Integration Methods for the LMs in an HMM
   6.4 Experiments on SU Detection Task
       6.4.1 CTS SU Task
       6.4.2 BN SU Task
   6.5 Chapter Summary
7 PROSODY MODEL
   7.1 Addressing the Imbalanced Data Set Problem
       7.1.1 The Imbalanced Class Distribution Problem
       7.1.2 Approaches to Address the Problem
   7.2 Pilot Study for SU Detection
       7.2.1 Experimental Setup
       7.2.2 Sampling Results
       7.2.3 Bagging Results
   7.3 Sampling and Bagging Across SU and IP Tasks
       7.3.1 Experimental Setup
       7.3.2 Results Across SU and IP Tasks
   7.4 Evaluation on the Full NIST SU Task
       7.4.1 Experimental Setup
       7.4.2 Results on the NIST SU Task
   7.5 Chapter Summary
       7.5.1 Summary
       7.5.2 Discussion
8 APPROACHES TO COMBINE KNOWLEDGE SOURCES
   8.1 Knowledge Sources
   8.2 A Review of the HMM for SU Detection
   8.3 The Maxent Posterior Probability Model for SU Detection
       8.3.1 Description of the Maxent Model
       8.3.2 Features Used
       8.3.3 Comparisons of the Maxent and HMM Approaches
       8.3.4 Results and Discussion for the Maxent SU Model
   8.4 The Conditional Random Field (CRF) Model for SU Detection
       8.4.1 Description of the CRF Model
       8.4.2 Comparisons of CRF and Other Models
       8.4.3 Results and Discussion for the CRF SU Model
   8.5 Chapter Summary
9 SYSTEM FOR RT-04
   9.1 RT-04 Tasks and Data
   9.2 System Performance for SU Boundary Detection
   9.3 SU/SU-Subtype Detection
   9.4 Edit Word Detection
       9.4.1 Methods
       9.4.2 Edit Detection Results
   9.5 Chapter Summary
10 RELATED EFFORTS
   10.1 Factors Impacting Performance
       10.1.1 Word Error Rates (WER)
       10.1.2 Speaker Label for SU Detection
   10.2 Word Fragment Detection
       10.2.1 Introduction
       10.2.2 Acoustic and Prosodic Features
       10.2.3 Experiments
   10.3 Chapter Summary
11 FINAL REMARKS
   11.1 Impact on Other Research Efforts
       11.1.1 Using Structural Event Information for Word Recognition
       11.1.2 SU Detection in a Multi-modal Corpus
       11.1.3 Dialog Act Detection in Meeting Corpus
   11.2 Summary of Experiments
   11.3 Contributions
   11.4 Future Work
LIST OF REFERENCES
APPENDICES
   Appendix A: ADT Boosting For SU and IP Detection
       A.1 ADT Boosting Description
       A.2 Experimental Results
       A.3 ADT Boosting Summary
   Appendix B: Prosodic Features
VITA


    LIST OF TABLES

1.1 Symbols used for the structural events in the example of annotated transcriptions.

2.1 A summary of some important prior studies on sentence boundary detection. Column two is the task chosen for each investigation: "boundary" means the sentence boundary detection task, as opposed to its subtype or punctuation detection; column three describes the model or the information sources used by each investigation; column four is the corpus on which the experiments were conducted; column five represents whether the experiments were performed on human transcriptions (Ref) or recognition results (ASR). Note that CTS (i.e., conversational telephone speech) is used in the corpus column for those experiments that were conducted on the Switchboard corpus. Even though no textual information is used in this automatic detection model, the Ref condition is used in that study for its evaluation.

2.2 A summary of some important prior studies on disfluency detection. Column two is the task for each investigation; column three describes the model or information sources used by each investigation; column four is the corpus on which the experiments were conducted; column five represents whether the experiments were performed on human transcriptions (Ref) or recognition results (ASR). In Core [52], preliminary repair information is provided, and the parser further corrects them.

3.1 Structural events annotated by LDC and investigated in this thesis. Note that the subtype of an edit disfluency is not annotated by LDC, nor is the correction in an edit disfluency.

3.2 Information on the CTS and BN corpora, including the data set sizes, the percentage of the different types of structural events in the training set, and the word error rate (WER) of the speech recognizer on the test set.

4.1 Examples of cue words that are highly representative of some structural event types.

4.2 Examples of the prosodic features used for the SU detection problem that appear in the decision tree shown in Figure 4.3.

5.1 CTS SU detection results using the NIST SU error rate (%) and the boundary-based CER (% in parentheses) on human transcriptions (REF) and recognition output (STT), for the LM and the prosody model individually and in combination. The baseline error rate, assuming there is no SU boundary at each word boundary, is 100% for the NIST SU error rate and 15.7% for CER.

5.2 Deletion and insertion error rates (NIST SU error rate in %) for the CTS REF condition, using the LM and the prosody model alone and in their combination.

5.3 Feature usage (%) for SU detection on CTS.

5.4 BN SU detection results using the NIST SU error rate (%) and the CER (% in parentheses) using the prosody model, the LM, and their combination. Results are shown for both REF and STT conditions. The baseline error rate is 100% for the NIST SU error rate and 7.2% for CER.

5.5 Deletion and insertion error rates (NIST SU error rate in %) for the BN REF condition, using the LM and the prosody model alone and in their combination.

5.6 Feature usage (%) for SU detection on BN.

5.7 Results for CTS filler word (including FP and DM) detection, FP detection, and DM boundary detection using the NIST error rate (%) and CER (% in parentheses) for the prosody model, the LM, and their combination. Results are for both the REF and STT conditions. The baseline CER is 8.3% for filler word detection, 3.6% for FP detection, and 2.8% for DM boundary detection.

5.8 Feature usage (%) for the FP and DM detection tasks in CTS.

5.9 CTS edit word and IP detection results using the NIST error rate (%) and CER (% in parentheses) for the prosody model, the LM, and their combination. Results are for the REF and STT conditions. The baseline CER is 8.3% for edit word detection and 4.8% for edit IP detection.

5.10 Feature usage (%) for IP detection on the CTS corpus.

5.11 System performance (NIST error rate in %) for all the structural event detection tasks on the CTS and BN test sets. Results are presented for both the REF and STT conditions.

6.1 Two examples of automatically induced classes for the CTS SU detection task, depicting member words and each word's probability given the class.

6.2 The POS and chunk tags for a sentence from the BN corpus: "the top selling car of nineteen ninety-seven was announced today and the winner is toyota camry".

6.3 SU detection results (NIST error rate in %) for human transcriptions of CTS data using various LMs, alone and in combination with the prosody model. The deletion (DEL), insertion (INS), and total error rates are reported.

6.4 SU detection results (NIST error rate in %) for human transcriptions of the BN data using various LMs, alone and in combination with the prosody model. The deletion (DEL), insertion (INS), and total error rates are reported.

7.1 Description of the data set used in the pilot study for the CTS SU detection task.

7.2 SU detection results (CER in % and F-measure) for different sampling approaches in the pilot study of the CTS corpus, using the prosody model alone and in combination with the LM. The CER of the LM alone on the test set is 5.02%.

7.3 Recall and precision results for the sampling methods in the pilot study of CTS SU detection. Using the LM alone yields a recall of 74.6% and a precision of 84.9%.

7.4 CTS SU detection results (CER in % and F-measure) with bagging applied to a randomly downsampled data set (DS), an ensemble of downsampled training sets, and the original training set. The results for the training conditions without bagging are also shown for comparison.

7.5 Description of the data sets used for the SU and IP detection tasks. The data set used in the pilot study is shown in the second column; it is a subset of the data set used in this investigation (denoted "large set" in the table).

7.6 IP and SU detection results in CER (%). DS denotes downsampled. Chance performance is 4.36% on the original test set for IP and 13.64% for SU. The CER using the LM alone is 2.34% on the IP task and 5.27% on the SU task.

7.7 SU detection results (NIST error rate in %) for both the CTS and BN corpora, on the REF and STT conditions.

8.1 SU detection results (NIST error rate in %) for different state configurations using the trigram LM alone on the CTS reference condition. The insertion (INS), deletion (DEL), and total error rates are shown.

8.2 SU detection results (NIST error rate in %) using the Maxent and the HMM approaches individually and in combination on BN and CTS, on reference transcriptions (REF) and recognition output (STT).

8.3 Deletion, insertion, and total error rates (NIST error rate in %) of the HMM and Maxent approaches on reference transcriptions of BN and CTS.

8.4 SU detection results (NIST error rate in %) using different knowledge sources on BN and CTS, evaluated on the reference transcriptions.

8.5 Comparison of using the posterior probabilities from the prosody model as binary features versus continuous-valued features in the Maxent approach for SU detection in the CTS reference transcription condition.

8.6 Some of the N-gram features with the highest IG weights for the CTS SU detection task.

8.7 Notation for a 2×2 contingency table used in Chi-square statistics.

8.8 SU detection results (NIST error rate in %) using different feature selection metrics and different pruning thresholds (number of preserved features), for the CTS REF condition.

8.9 SU detection results (NIST error rate in %) using the HMM, Maxent, and CRF approaches individually and in combination on BN and CTS, on reference transcriptions (REF) and recognition output (STT). The combination of the three approaches is obtained via a majority vote.

8.10 CTS SU detection results (NIST error rate in %) using the HMM, Maxent, and CRF individually, using different knowledge sources. Note that the "all features" condition uses all the knowledge sources described in Section 8.3.2.

8.11 BN SU detection results (NIST error rate in %) using the HMM, Maxent, and CRF individually, using different knowledge sources.

9.1 Data description for CTS and BN used in the RT-04 NIST evaluation. BN training data is the combined RT-03 and RT-04 data; CTS contains only the RT-04 training data.

9.2 SU boundary detection results (NIST SU error rate in %) on the RT-04 evaluation data. The combination is the majority vote of the Maxent, CRF, and improved HMM approaches. DS denotes a downsampled training set.

9.3 Percentage of SU subtypes for CTS and BN.

9.4 SU/SU-subtype detection results (%) on the RT-04 CTS evaluation data. Results are reported using the NIST SU boundary error rate, substitution error rate, and the subtype classification error rate (CER).

9.5 SU subtype detection results (confusion matrix) for the CTS human transcription condition. Each cell shows the count and percentage (%) of a reference subtype (row) that is hypothesized as the subtype shown in the column.

9.6 States and transitions used by the CRF for edit word and edit IP detection. The class tags are: the beginning of an edit (B-E) and the inside of an edit (I-E), each of which has a possible IP associated with it (B-E+IP or I-E+IP), and outside of an edit (O).

9.7 Results (NIST error rate in %) for edit word and IP detection, using the HMM, Maxent, and CRF approaches on the reference and recognition output conditions of the CTS data.

9.8 Results (NIST error rate in %) for edit word and IP detection, using the HMM and Maxent approaches.

10.1 SU and edit word detection results (NIST error rate in %) for CTS and BN, on REF and various STT conditions using the RT-04 data. For SU detection, results are reported for the SU boundary detection error. STT-1 and STT-2 are two different STT outputs; their WERs (%) are shown in the table.

10.2 Comparison of different ways to derive speaker labels on the RT-04 test set for the BN SU boundary detection task. Results are shown using the NIST error rate (%) for the HMM on the reference transcription condition.

10.3 Word fragment detection results (confusion matrix) on the downsampled data of the Switchboard corpus.

10.4 Feature usage (%) for word fragment detection using the Switchboard data.

11.1 WER (%) when SU information is fed back to re-segment and re-recognize speech, compared to the baseline using the acoustic segments, evaluated on half of the RT-03 BN data.

11.2 SU detection results (NIST error rate in %) on the Wombat data. Note that the combined result is not shown when using textual information only, in order to keep the results parallel to those in Chapter 8 (Tables 8.2 and 8.4).

11.3 DA boundary detection results (NIST error rate in %) on the ICSI Meeting data. Results are for the reference transcriptions (REF) and STT output, using the pause decision tree (pause DT) model, the hidden event LM, and the HMM combination of them.

11.4 DA subtype classification accuracy (%) using the reference DA boundaries of the ICSI Meeting corpus, for the human transcriptions and recognition output. Two conditions are used: word-based features only, and the combined word-based features and binned posterior probabilities from the decision tree (DT). Chance performance is obtained when the majority type (statement) is hypothesized for each DA.

A.1 SU and IP detection results (classification error rate in %) using the ADT learning algorithm and bagging. Training and testing were conducted using a downsampled training and testing set. Chance performance is 50%.


    LIST OF FIGURES

1.1 A flow diagram for the automatic structural event detection task.

3.1 Examples of transcriptions for CTS and BN, respectively. SU boundaries are not shown in the examples.

4.1 The waveform, pitch and energy contours, word alignment, and SU boundaries for the utterance "um no I hadn't heard of that".

4.2 The raw and stylized F0 contours for the utterance "um no I hadn't heard of that".

4.3 An example of a decision tree for SU detection. Each line represents a node in the tree, with the associated question regarding one particular prosodic feature, the class distribution, and the most likely class among the examples going through this node (S stands for SU boundary, and 0 for non-SU boundary). The indentation represents the level of the decision tree. Some of the features used in this tree are described in Table 4.2.

5.1 Data preparation for model training.

5.2 System flow diagram of the testing procedure.

5.3 System diagram for edit word and IP detection.

5.4 Valid state transitions for repetitions of up to 3 words. The X and Y axes represent the position in the reparandum and repetition regions respectively, with events denoted as ORIG- and REP-. In ORIG-n, n means the position of a word in the reparandum; in REP-m.n, m is the total number of repeated words and n represents the position of the event in the repeat region. Optional filler words are allowed after the IP in the transition.

5.5 A rule-based method for determining the reparandum region after IPs are hypothesized. SU hypotheses are used in the rules.

6.1 Integration methods for the various LMs and the prosody model.

7.1 The bagging algorithm. T is 50 in our experiments. In each bag, the class distribution is the same as in the original data S.

7.2 ROC curves and their AUCs for the decision trees trained from different sampling approaches and the original training set.

7.3 ROC curves and their AUCs for the decision trees when bagging is used on the downsampled training set (bag-ds), the ensemble of downsampled training sets (bag-ensemble), and the original training set (bag-original).

7.4 ROC curves for IP and SU detection using the prosody model alone on the CTS corpus.

8.1 The graphical model for the SU detection problem. Only one word-event pair is depicted in each state, but in a model based on N-grams the previous N-1 tokens would condition the transition to the next state. O are observations consisting of words W and prosodic features F, and E are structural events.

8.2 The graphical model for the POS tagging problem. POS tags are the hidden states in this problem. S are POS tags, and W are words.

8.3 The graphical representation of a CRF for the sentence boundary detection problem. E represents the state tags (i.e., SU boundary or not), while W and F are word and prosodic features respectively. O are observations consisting of W and F.

8.4 The graphical model representations of the HMM, CMM, and CRF approaches. O are observations, and S are events (or tags).

10.1 An illustration of how speaker change is obtained for the CTS data. An arrow represents a speaker change after that segment.

10.2 The pruned decision tree used to detect word fragments. The decision is made in the leaf nodes; however, in the figure the decision for an internal node in the tree is also shown.

11.1 Using SU information for re-recognition in BN.

A.1 An example of an alternating decision tree (ADT).


    ABSTRACT

Liu, Yang. Ph.D., Purdue University, December 2004. Structural Event Detection for Rich Transcription of Speech. Major Professor: Mary P. Harper.

Although speech recognition technology has significantly improved during the past few decades, current speech recognition systems output only a stream of words without providing other useful structural information that could aid a human reader and downstream language processing modules. This thesis research focuses on the automatic detection of several helpful structural events in speech, including sentence boundaries, type of utterance, filled pauses, discourse markers, and edit disfluencies. The systems evaluated combine prosodic cues and textual information sources in a variety of ways to support automatic detection of these structural events. Experiments were conducted across corpora (conversational speech and broadcast news speech) and with different transcription quality (human transcriptions versus recognition output).

The imbalanced data problem is investigated for training the decision tree prosody model component of our system because structural events are much less frequent than non-events. A variety of sampling approaches and bagging are used to address this imbalance. Significant performance improvements are obtained via bagging. Some of the sampling methods are useful depending on the performance metrics used. Sentence boundary detection and disfluency detection tasks are impacted differently by sampling, bagging, and boosting, suggesting inherent differences between the two tasks.

A variety of methods for combining knowledge sources are examined: a hidden Markov model (HMM), the maximum entropy (Maxent) model, and the conditional random field (CRF). The Maxent and CRF approaches are discriminatively trained to model the posterior probabilities and thus correlate with the performance measures. They also support the use of more correlated features and so enable the combination of a variety of textual information sources. The HMM and CRF both model sequence information, unlike the Maxent, which explicitly models local information. A model that combines these three approaches is superior to any method alone.

Interactions with other research efforts suggest that the methods developed in this thesis generalize well to other corpora (e.g., a multimodal corpus, a multiparty meeting corpus) and to similar tasks (e.g., a gestural model, dialog act segmentation and classification).


    1. INTRODUCTION

    1.1 Motivation

Speech recognition technology has improved significantly during the past few decades; for tasks involving read or pre-planned speech, recognition accuracy is often greater than 90%. However, the word-level transcription accuracy for spontaneous conversational speech falls far short of this level, generally lower than 80%. The acoustic properties of spontaneous conversational speech are quite challenging to model due to phenomena such as coarticulation, word fragments, and filled pauses. Additionally, disfluencies and ungrammatical utterances pose serious problems for language models (LMs). These factors combine to affect the performance of speech recognizers on spontaneous speech. The following is an excerpt of a transcription of spontaneous conversational speech. Both the human transcription and the recognition output are shown in the example below. The presence of a word fragment is represented by a "-" after the partial word. In the recognition output, misrecognized words are followed by the corresponding correct words inside curly braces (corresponding to deletion or substitution errors).

Human Transcription:

but uh i'm i i i think that you know i mean we always uh i mean i've i've had a a lot of good experiences with uh with many many people especially where they've had uh extended family and i and an- i i kind of see that that you know perhaps you know we may need to like get close to the family environment and and get down to the values of you know i mean uh it's money seems to be too big of an issue wi- with with with with with what's going on today


    Recognition Output:

but um that that {uh i'm i i} i think that you know we {i mean} we always uh i mean i've i've had it there {a} a lot of good experiences with the {uh} with many many people especially with have {where they've} had extended family night and i and {an- i} i kind of see that that you know perhaps you know we may need to like you're {get} close to the family environment and in {and} get down to the values of you know i mean no and {uh it's} money seems to be too big of an issue we would {wi- with with with} with with really was we would what's going on today

As can be seen from the recognition output example, current automatic speech recognition (ASR) systems simply output a stream of words. Structural information (such as the location of punctuation, disfluencies, and speaker turns) is missing, making the output difficult for a human to read and for downstream automatic processors to deal with. As shown in the example above, even the human transcription, which contains no word errors, is still hard to read due to the absence of punctuation and the presence of speech disfluencies and filler words.

The transcriptions can be marked with different types of structural information to enhance readability or ease downstream processing. In this thesis, the following types of structural events are considered:

Sentence boundaries: A sentence ends with "./" for a statement, ".../" for an incomplete statement, and "?/" for a question in the marked-up transcription examples in this thesis.

Filler words: These include filled pauses (e.g., "uh" and "um") and discourse marker words (such as "you know" and "well"). The tokens "<" and ">" are used to mark the extent of these filler words.

Edit disfluencies: Disfluencies are highly prevalent in conversational speech. In this thesis, the term edit disfluency is used for the disfluencies1 with the following structure (see Chapter 3 for more details):

(reparandum) * <editing term> correction

The edited portion of a disfluency (i.e., the reparandum) is marked in examples with parentheses "(" and ")". For example, in "a a lot" in the human transcription shown above, the first "a" is the reparandum, so it is marked with parentheses. The interruption point (IP) inside the edit disfluency is marked by "*". The editing term, which follows the IP and precedes the correction, is optional. The edit disfluency structure is embedded in utterances and so may be preceded and followed by words that are not part of the edit disfluency.

These types of structural information will be described in more detail in Chapter 3. Below is the annotation of our human transcription example.2 All the words that interrupt the fluency of speech are shown in bold face in this example. Table 1.1 summarizes the meanings of the symbols used in the annotated transcriptions.

but <uh> (i'm * i * i think that <you know> <i mean> i've) * i've had (a) * a lot of good experiences (with) * <uh> with (many) * many people especially where they've had <uh> extended family ./

(and i * and) * an- (i) * i kind of see (that) * that <you know> perhaps <you know> we may need to like get close to the family environment (and) and get down to the values of <you know> <i mean> .../

(<uh> it's) * money seems to be too big of an issue (wi- * with * with * with * with) * with what's going on today ./

The transcriptions containing this structural information are called rich transcriptions because they contain much richer information than a simple stream of words.

1 These disfluencies are also called speech repairs in the literature.
2 The human transcription is used here to illustrate the importance of structural information in order to factor out the effect of speech recognition errors.


Table 1.1
Symbols used for the structural events in the example of annotated transcriptions.

    Symbol        Meaning
    ./ or .../    sentence boundaries (complete or incomplete)
    < >           filler words
    ( )           reparandum in an edit disfluency
    *             interruption point in an edit disfluency

Given this structural information (either human annotated or automatically generated), human transcriptions or recognition output can be cleaned up for improved readability. For example, if the disfluencies and fillers are removed from the previous transcription and each sentence is presented with the appropriate punctuation, the cleaned-up transcription would be as follows:

But I've had a lot of good experiences with many people especially where they've had extended family. I kind of see that perhaps we may need to get close to the family environment and get down to the value of... Money seems to be too big of an issue with what's going on today.

Clearly this cleaned-up transcription is more readable, is easier to understand, and is more appropriate for subsequent language processing modules.
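To make the clean-up step concrete, the following is a minimal Python sketch that strips the Table 1.1 markup from an annotated string. It assumes exactly the symbols listed there ("( )" reparanda, "*" IPs, "< >" fillers, "./ .../ ?/" SU boundaries); the function name and the regular expressions are ours, not the thesis's.

```python
import re

def clean_transcript(annotated: str) -> str:
    """Remove reparanda, IP markers, and fillers; map SU symbols to punctuation."""
    s = re.sub(r"\([^()]*\)", " ", annotated)   # drop reparandum spans "( ... )"
    s = re.sub(r"<[^<>]*>", " ", s)             # drop filler spans "< ... >"
    s = s.replace("*", " ")                     # drop interruption-point markers
    s = s.replace(".../", "...").replace("?/", "?").replace("./", ".")
    return re.sub(r"\s+", " ", s).strip()       # normalize whitespace

print(clean_transcript("(<uh> it's) * money seems to be too big of an issue "
                       "(wi- * with * with * with * with) * with what's going on today ./"))
# -> "money seems to be too big of an issue with what's going on today ."
```

Modulo capitalization, this recovers the last sentence of the cleaned-up transcription above.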

There has been a growing interest recently in the study of the impact of structural events. Jones et al. [1] have conducted experiments showing that cleaned-up transcriptions improve human readability compared to the original transcriptions. Other recent research has considered whether automatically generated sentence information can play a role in parsing. Gregory et al. [2] have found that using sentence-internal prosodic cues degrades parsing performance; however, the method used for automatically generating sentence-internal annotations was not state-of-the-art. On the other hand, Kahn et al. [3] have achieved significant error reductions in parsing performance when using sentence boundary information from a state-of-the-art automatic detection system.

    1.2 Scope of the Thesis

    1.2.1 Structural Event Detection Tasks

Automatic structural event detection is a crucial step for improving the readability of speech recognition output and for making spontaneous speech understanding systems possible. The goal of this thesis is to enrich the recognition output with multiple levels of structural information, including sentence boundaries, filled pause and discourse marker words, and edit disfluencies. We will construct and evaluate algorithms that automatically detect such structural event types.

Note that the problem of sentence boundary detection differs from its analog in text processing, which is sometimes called sentence splitting or sentence boundary disambiguation. The goal of the sentence splitting task is to identify sentence boundaries in written text where punctuation is available; hence, the problem is effectively reduced to deciding which symbols that potentially denote sentence boundaries (i.e., ".", "!", "?") actually do. The sentence splitting problem is not deterministic, since these punctuation symbols do not always occur at the end of sentences. For example, in "I watch C. N. N.", only the final period denotes the end of a sentence. In the sentence boundary detection task using speech, no punctuation is available, yet the availability of speech provides additional useful information.
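The text-side task can be made concrete with a small sketch. The heuristic below is ours and far simpler than the learned systems reviewed in Chapter 2; it decides whether a ".", "!", or "?" token ends a sentence, treating runs of single capital letters as abbreviations so that only the final period of "I watch C. N. N." counts.

```python
def is_sentence_end(tokens: list, i: int) -> bool:
    """Heuristically decide whether the punctuation token tokens[i] ends a sentence."""
    if tokens[i] in ("!", "?"):
        return True
    prev = tokens[i - 1] if i > 0 else ""
    nxt = tokens[i + 1] if i + 1 < len(tokens) else None
    if nxt is None:                      # end of input: call it a boundary
        return True
    # Periods between single capital letters sit inside an abbreviation,
    # as in "I watch C. N. N.", and do not end the sentence.
    if len(prev) == 1 and prev.isupper() and len(nxt) == 1 and nxt.isupper():
        return False
    return nxt[0].isupper()              # otherwise expect a capitalized next sentence

tokens = "I watch C . N . N . Tomorrow I will not".split()
print([i for i, t in enumerate(tokens) if t == "." and is_sentence_end(tokens, i)])
# -> [7]: only the final period of "C. N. N." is treated as a sentence boundary
```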

We will investigate structural event detection across corpora, on both broadcast news and conversational telephone speech. Broadcast news comprises read speech, formal interviews, man-on-the-street interviews, and some spontaneous speech, although not usually conversational speech. In contrast, conversational telephone speech is spontaneous, and much of it is quite informal. Broadcast news usually has fewer edit disfluencies than spontaneous conversational speech, and many of those that occur may be caused by reading errors. Our algorithms will be evaluated on both the human transcriptions and recognition output to investigate the effect of incorrect words in ASR output on system performance.

    1.2.2 Our Approach to the Problem

The framework of most current speech recognition systems is to find the most likely word sequence given the speech signal. Because the hidden structure of the utterance (sentence boundaries and disfluencies) does not have an explicit acoustic signal,3 it is hard to integrate the problem of structural event detection with word recognition in current speech recognition systems. Therefore, we will address this problem by using a post-processing approach that generates the structural information after the recognition results are available. Several knowledge sources will be employed, involving both textual information and prosodic cues, to reduce the ambiguity inherent in any one knowledge source. Figure 1.1 shows a diagram of our approach, the final output of which is a rich transcription or cleaned-up transcription. As the figure shows, prosodic information is obtained from a combination of the speech signal and the recognition output, which is used to provide word and phone alignments.

In our investigations, textual information is obtained from the word strings in the transcriptions generated either by a human transcriber or by the ASR system. This type of information is no doubt very important. In many cases, people have no problem inferring appropriate structural events from word transcriptions. Some textual cues are quite useful for automatic identification of structural events; for example, words like "I" often start a new sentence, and a repeated or revised word string often signals disfluencies. In addition, the syntactic and semantic information derived from the words provides valuable cues for structural event detection.

3 There are some implicit prosodic cues at the boundary points, which will be described in Chapter 5.


[Figure 1.1 shows the system flow: the speech signal is passed to the ASR system, producing an ASR transcription; textual features are extracted from the processed transcription, and prosodic features are extracted from the speech signal together with the recognizer's alignments; both feature streams feed the structural event detection systems, whose structural event output yields the rich or cleaned-up transcription.]

Fig. 1.1. A flow diagram for the automatic structural event detection task.

In some cases, the use of textual information alone may not completely disambiguate structural events. The following example is extracted from the broadcast news data:

Anne what are the chances we'll hear uh something of substance again from the President prior to the vote ?/

And that's a possible next step ?/


A purely textual model would not be able to determine whether the second sentence is a statement or a question. However, the rising tone in the speech signal would enable the listener to determine that a question is intended.

In the face of high word error rates, word-level information may be unreliable and possibly misleading. In such a case, the lexical, syntactic, and semantic patterns used for detecting sentence boundaries and disfluencies will be less reliable due to the word errors. The following example compares ASR output with a human transcription of the speech:

ASR output:
It's been a while for the good for the tackle that stuff

Human transcription:
It's been a while since I've uh uh since I've tackled that stuff

It will be difficult, if not impossible, for a word-based language model to identify the repetition or the existing disfluencies using this ASR output.

Prosody, the "rhythm and melody" of speech, is important for automating rich transcription. Past research results [4-14] suggest that speakers use prosody to impose structure on both spontaneous and read speech. Examples of such prosodic indicators include pause duration, change in pitch range and amplitude, global pitch declination, melody and boundary tone distribution, vowel duration lengthening, and speaking rate variation. Since these features provide information complementary to the word sequence, they provide an additional, potentially valuable source of information for structural event detection. Additionally, since they may be more robust than textual features to word errors, they may provide a more reliable knowledge source.
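Several of these boundary cues can be computed directly from the recognizer's word alignment and a pitch track. The sketch below is a simplification for illustration: the AlignedWord container and the three features are ours, standing in for the much larger prosodic feature set actually used in this thesis (see Appendix B).

```python
from dataclasses import dataclass

@dataclass
class AlignedWord:
    text: str
    start: float    # word start time in seconds, from the recognizer's alignment
    end: float      # word end time in seconds
    mean_f0: float  # mean F0 over the word (Hz), from a pitch tracker

def boundary_features(words: list, i: int) -> dict:
    """A few prosodic cues at the word boundary following words[i]."""
    cur, nxt = words[i], words[i + 1]
    return {
        "pause_dur": nxt.start - cur.end,        # silence duration at the boundary
        "word_dur": cur.end - cur.start,         # duration-lengthening cue
        "f0_reset": nxt.mean_f0 - cur.mean_f0,   # pitch change across the boundary
    }

words = [AlignedWord("today", 1.00, 1.42, 110.0), AlignedWord("and", 1.95, 2.05, 160.0)]
print(boundary_features(words, 0))  # a long pause plus an F0 reset suggest an SU boundary
```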

Textual and prosodic knowledge sources have been exploited in previous research [12, 13, 15-18], and their combination has proven to be beneficial to performance for structural event detection. This thesis builds upon this prior work, which combined these knowledge sources using a hidden Markov model (HMM) approach. We will focus on developing a richer feature set for these knowledge sources, building more effective models to capture such information, and integrating various knowledge sources for structural event detection by using different modeling approaches.
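To preview the combination scheme this prior work uses (developed further in Chapter 4), the sketch below shows hidden-event decoding in miniature: the labels at each word boundary are hidden states, a hidden-event LM supplies transition scores, and the prosody model supplies per-boundary posteriors (which the actual system would convert to likelihoods by dividing out class priors). The two-state inventory, the array shapes, and all the numbers are illustrative assumptions, not the thesis implementation.

```python
import numpy as np

STATES = ["no_event", "su_boundary"]

def viterbi(lm_trans: np.ndarray, prosody_post: np.ndarray) -> list:
    """lm_trans[t, i, j]: LM score for moving from state i to state j at
    boundary t; prosody_post[t, j]: prosody posterior of state j there."""
    T, S = prosody_post.shape
    delta = np.log(prosody_post[0] + 1e-12)          # start with prosody only
    back = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        cand = (delta[:, None] + np.log(lm_trans[t] + 1e-12)
                + np.log(prosody_post[t] + 1e-12)[None, :])
        back[t] = cand.argmax(axis=0)                # best predecessor per state
        delta = cand.max(axis=0)
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):                    # trace predecessors backward
        path.append(int(back[t, path[-1]]))
    return [STATES[s] for s in reversed(path)]

T = 3
lm = np.full((T, 2, 2), 0.5)                         # uninformative LM for the demo
post = np.array([[0.9, 0.1], [0.2, 0.8], [0.7, 0.3]])
print(viterbi(lm, post))  # -> ['no_event', 'su_boundary', 'no_event']
```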

The investigations in this thesis should help to answer several questions with respect to the automatic detection of structural events: What knowledge sources are helpful? What is the best modeling approach for combining different knowledge sources? How is the model performance affected by various factors such as corpora, transcriptions, and event types?


    2. RELATED WORK

In the past decade, a substantial amount of research has been conducted in the areas of detecting intonational and linguistic boundaries in conversational speech, as well as in detecting and correcting speech disfluencies. In this chapter, we introduce research related to the automatic detection of different structural events, namely, sentence boundaries, edit disfluencies, and filler words. For each type, related research is categorized based on what knowledge sources have been used. Additionally, for completeness, studies from linguistics or psychology are discussed where appropriate.

    2.1 Sentence Boundary Detection

For speech recognition, sentences are usually defined by acoustic segment boundaries that correspond to long stretches of silence or a change of conversational turn.1 In contrast, linguistic segment boundaries mark a unit that represents a complete idea but may not necessarily represent a grammatical sentence, nor begin or end with a long silence or turn change. Experiments by Meteer and Iyer in [19] suggest that language model perplexity can be reduced by working with linguistic segments rather than acoustic segments. Our goal is to automatically find such linguistic sentence-like units.

Some of the previous research has focused on detecting major sentence boundaries;2 other efforts have investigated detecting subtypes of sentences (e.g., questions, statements). Prior research related to sentence and sentence-subtype detection can be divided into two categories based on the knowledge sources employed: a text-based approach, or an approach using both textual and acoustic information. The text-based approach uses only textual information; hence, it is suitable for both transcribed speech and written text. Text-based methods may not be able to resolve some ambiguities using information found in text, as in the example in Section 1.2.2, for which the question type is detected based on the rising tone. A combination approach uses both the acoustic cues and textual information. In most cases, it is difficult to compare the results of prior research since they often differ on the corpora used for training and testing, as well as in the information used by their systems.

1 The definition of turn varies in the literature. In this thesis, a turn is a portion of speech uttered by a single speaker and bounded by silence from that speaker. See http://secure.ldc.upenn.edu/intranet/Annotation/MDE/guidelines/2004/control_floor.shtml for more details.
2 The definition of sentence varies across these past research efforts. The term used in this thesis will be defined in Chapter 3.

    2.1.1 Text-based Processing for Sentence Boundary Detection

    As mentioned in Chapter 1, the sentence boundary detection problem in written

    text aims to disambiguate punctuation marks with the goal of identifying sentence

boundaries. Palmer and Hearst [20] developed an efficient automatic sentence bound-

    ary labeling algorithm, which uses the part-of-speech (POS) probabilities of the

    tokens surrounding a punctuation mark as input to a feed-forward neural network

    to obtain the role of the punctuation mark. Because sentence boundaries were not

    available to their part-of-speech tagger, they used the prior probabilities of all parts

    of speech for a word. They tested their system on a portion of the Wall Street Jour-

    nal (WSJ) corpus. Their experiments found that a context of six surrounding tokens

    and a hidden layer with two units yielded the best accuracy on the test set. When

    training and testing were conducted using texts in lower-case-only format, the net-

    work was able to disambiguate 96.2% of the boundaries. Other approaches have also

    been used to investigate this problem, for example, Reynar and Ratnaparkhi [21]

    used a maximum entropy algorithm, and Schmid [22] employed an unsupervised

    learning method. Walker et al. [23] compared three different methods for sentence

    boundary detection as a preprocessing step in machine translation. They showed

    that the maximum entropy method [21] outperforms the other two systems, i.e.,


    the direct model and the rule-based system. They also argued that high recall is

    more important for the application of machine translation: fragmenting sentences is

better than combining two sentences. This insight may prove useful when our structural event detection results feed downstream language processing modules such as machine translation. The sentence boundary problem

    in text processing is different from that in speech processing in that punctuation

    information is available in text (although it is not deterministic). However, some

    knowledge obtained from such a task is useful to our automatic sentence boundary

    detection in speech, such as the lexical cues that are most effective for determining

    the role of punctuation.
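To make the flavor of these text-based systems concrete, the following Python sketch illustrates a Palmer-and-Hearst-style classifier: the prior POS-tag probabilities of the tokens surrounding a candidate punctuation mark are concatenated into a context vector and fed to a small feed-forward network with two hidden units. The tag inventory, feature layout, and training data below are illustrative placeholders, not a reproduction of the system in [20].

    import numpy as np
    from sklearn.neural_network import MLPClassifier

    POS_TAGS = ["NN", "VB", "DT", "JJ", "IN", "."]  # assumed toy tag inventory

    def context_vector(pos_priors, center, width=6):
        """Concatenate the prior POS distributions of the `width` tokens
        surrounding the candidate punctuation mark (half before, half after)."""
        feats = []
        for i in range(center - width // 2, center + width // 2 + 1):
            if i == center:
                continue  # skip the punctuation mark itself
            if 0 <= i < len(pos_priors):
                feats.extend(pos_priors[i])
            else:
                feats.extend([0.0] * len(POS_TAGS))  # pad beyond the edges
        return np.array(feats)

    # Placeholder training data: each row is a context vector around a
    # period; the label says whether it ends a sentence (1) or not (0).
    rng = np.random.default_rng(0)
    X = rng.random((200, 6 * len(POS_TAGS)))
    y = rng.integers(0, 2, size=200)

    # Two hidden units, the configuration reported as best in [20].
    clf = MLPClassifier(hidden_layer_sizes=(2,), max_iter=2000).fit(X, y)

In a real system, the context vectors would be derived from POS priors estimated on tagged training text rather than from random placeholders.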

An automatic punctuation system, called Cyberpunc, which is based only on lexical information, was developed by Beeferman et al. [24]. They counted the oc-

    currence of each punctuation mark in the 42 million tokens of the WSJ corpus and

    reported that about 10.5% of the tokens in that corpus were punctuation, mostly

    commas (4.658%) and periods (4.174%). Cyberpunc generates only commas, as-

    suming that sentence boundaries are provided or pre-determined. They extended

    a language model to account for punctuation by explicitly including commas in an

N-gram LM and allowing commas to occur at interword boundaries. Commas were added to the testing word strings by finding the best hypothesis using a Viterbi

    algorithm. They evaluated this method for generating commas on 2,317 reference

    sentences of the Penn Treebank WSJ corpus that were stripped of punctuation marks.

    They obtained a recall rate of 66% and precision of 76% for this comma generation

    task. The goal of this research differs from sentence boundary detection in speech

    because the task is to find commas assuming that the major sentence boundaries are

    known. Beeferman et al. [24] claimed that a punctuation-aware language model canbe applied to rescore speech recognition lattices in general, but they did not evaluate

    this.
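The decoding idea behind Cyberpunc can be sketched as follows: treat the comma as an ordinary LM token and, at each interword slot, keep the higher-scoring of inserting it or not. With a bigram LM the history collapses to the previous token, so this local decision is exact; a higher-order LM would require a full Viterbi search over histories. The probability table below is a hypothetical stand-in for a trained LM.

    import math

    def bigram_logp(prev, word):
        # Hypothetical stand-in for a trained n-gram LM that includes ","
        # as a vocabulary item; unseen pairs get a flat back-off score.
        table = {("said", ","): -0.7, (",", "however"): -1.0,
                 ("said", "however"): -3.0, ("however", "that"): -0.9,
                 (",", "that"): -2.5}
        return table.get((prev, word), -2.0)

    def insert_commas(words):
        """Restore commas by maximizing the LM score of the token string."""
        logp, seq, last = bigram_logp("<s>", words[0]), [words[0]], words[0]
        for w in words[1:]:
            skip = logp + bigram_logp(last, w)                # no comma
            ins = logp + bigram_logp(last, ",") + bigram_logp(",", w)
            if ins > skip:
                logp, seq = ins, seq + [",", w]
            else:
                logp, seq = skip, seq + [w]
            last = w
        return seq

    print(" ".join(insert_commas(["he", "said", "however", "that"])))
    # -> he said , however that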

    Stevenson and Gaizauskas [25] also conducted experiments on identifying sen-

    tence boundaries in transcriptions of the WSJ corpus using a memory-based learn-


    ing (MBL) algorithm. For each word boundary, they obtained a feature vector of

    13 elements from the word and its neighboring words, including the probability of

    the word starting or ending a sentence, their POS tags, and so on. The precision

and recall of their approach were around 35% when case information of the word was

    removed. The results were much improved when case information was provided to

    their sentence boundary detection system. Clearly, case information is important for

    this method, suggesting that it may not extend well to ASR outputs, which do not

    capture case information and often contain incorrect words.
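Memory-based learning amounts to storing the training boundaries and labeling each test boundary by its nearest stored neighbors. The sketch below uses a k-nearest-neighbor classifier over a reduced, illustrative feature set (the 13 features of [25] are not reproduced), with the symbolic POS features one-hot encoded; a dedicated MBL package such as TiMBL would instead apply an overlap metric directly to the symbols.

    from sklearn.neighbors import KNeighborsClassifier

    POS = ("NN", "VB", "PRP", "DT")

    def one_hot(tag):
        return [1.0 if tag == t else 0.0 for t in POS]

    def boundary_features(p_end, p_start, pos_left, pos_right):
        """p_end/p_start: corpus probabilities that the left word ends and
        the right word starts a sentence; pos_*: tags of the two words."""
        return [p_end, p_start] + one_hot(pos_left) + one_hot(pos_right)

    # Toy memory of labeled boundaries (1 = sentence boundary).
    X = [boundary_features(0.80, 0.60, "NN", "PRP"),
         boundary_features(0.05, 0.02, "DT", "NN"),
         boundary_features(0.70, 0.50, "VB", "PRP")]
    y = [1, 0, 1]

    mbl = KNeighborsClassifier(n_neighbors=1).fit(X, y)
    print(mbl.predict([boundary_features(0.75, 0.55, "NN", "PRP")]))  # [1]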

2.1.2 Combining Textual and Prosodic Information for Sentence Boundary Detection

    Some past research has been conducted on combining prosodic information and

    textual information to find sentence boundaries and their subtypes in speech. It

    is known that there is a strong correspondence between discourse structure and

    prosodic information. A comparison between syntactic and prosodic phrasing was

presented by Fach [26]. In that study, the syntactic structure was generated by Abney's chunk parser [27] and prosodic structure was given by ToBI label files [28]. This work showed that at least 65% of the syntactic boundaries were prosodic boundaries

    in read speech.

    Chen [29] proposed a method combining speech recognition with punctuation

    generation based on acoustic and lexical information using a business letter corpus.

    Punctuation marks were treated as words in the dictionary, with acoustic baseforms

    of silence, breath, and other non-speech sounds, and her language model was mod-

    ified to include punctuation. Chen found that 75.6% of all pauses correspond to

    punctuation marks, and that only 6.5% of the punctuation marks do not correspond

    to pauses. This finding suggests that pauses are closely related to punctuation in

    read speech. Chen conducted a speech recognition and automatic punctuation ex-

    periment on a business letter with 330 words, read aloud by 3 speakers. For different


    testing conditions, Chen reported a result of about 70%-80% accuracy on punctu-

    ation placement, but lower accuracy on correct identification of punctuation types.

Whether these results would carry over to conversational speech or a larger corpus is unknown.

A sentence boundary recognizer using textual information and pause duration was developed by Gotoh and Renals [15]. In their work, for each interword boundary, a

    decision is made about whether there is a sentence boundary or not. Their algorithm

    finds the sequence of sentence boundary classes using speech recognition output by

    combining probabilities from a language model and a pause duration model. They

conducted sentence boundary experiments on 16 hours of the Broadcast News corpus

    using acoustic and duration models trained on 300 hours of acoustic data and using

a language model trained on 9 million words. The word error rate (WER) for their test set was 26.3%. They obtained a recall rate of about 62% and a precision rate

    of 80% for sentence boundary detection. Their study found that a pause duration

    model when used alone performs more accurately than using an N-gram language

model for sentence boundary detection. This is possibly because the language model suffers considerably from the word errors in the recognition output. They found that

    the result is improved by combining these two information sources.
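A per-boundary version of this combination can be sketched as a log-linear interpolation of the two knowledge sources. The component scores below are hypothetical placeholders, and the weight lam is an assumed tuning parameter; the actual system of [15] decodes a sequence of boundary classes rather than classifying each boundary independently.

    import math

    def lm_logp(boundary):
        # Placeholder for the LM probability of the boundary class given
        # the word context (a real model conditions on the actual words).
        return math.log(0.3 if boundary else 0.7)

    def pause_logp(pause_sec, boundary):
        # Placeholder pause-duration model: long pauses favor boundaries.
        p_boundary = min(0.95, pause_sec / 1.0)
        p = p_boundary if boundary else 1.0 - p_boundary
        return math.log(max(p, 1e-6))

    def classify(pause_sec, lam=0.5):
        """Log-linearly combine the two models and pick the better class."""
        scores = {b: (1 - lam) * lm_logp(b) + lam * pause_logp(pause_sec, b)
                  for b in (True, False)}
        return max(scores, key=scores.get)

    print(classify(0.8))   # long pause  -> True (boundary)
    print(classify(0.05))  # short pause -> False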

Shriberg, Stolcke, and their colleagues have built a general HMM framework for combining lexical and prosodic cues for tagging speech with various kinds of hidden

    structural information, including sentence boundaries, disfluencies, topic boundaries,

    dialogue acts, emotion, and so on [12,3033]. Experimental results have shown that

    the combination of the prosody model and language models generally performs better

    than using each knowledge source alone.

    In [12], Shriberg et al. directly compared two corpora (Switchboard and Broad-

    cast News) on the task of sentence segmentation. Experiments were conducted on

    both human transcriptions and speech recognition outputs to compare the degra-

    dation of the prosody model and LM in the face of ASR errors. They extracted

    prosodic features such as pause, phone and rhyme duration, and F0 features, as well

    as other non-prosodic features such as turn change and gender. The features were


    used as inputs to a decision tree model, which predicted the appropriate segment

    boundary type at each interword boundary. They investigated the performance of

    the prosody model, a statistical LM that captures lexical correlations with segment

    boundaries, and a combination of the two models. On Broadcast News, the prosodic

    model alone performed as well as (or even better than) the word-based statistical LM,

    for both human transcriptions and recognized words. They found that the prosody

    model often degraded less in the face of recognition errors. Furthermore, for all tasks

    and corpora, they obtained a significant improvement over the word-only models by

    combining models. Analysis of the decision trees revealed that the prosody model

    captures language-independent boundary indicators, such as pre-boundary length-

    ening, boundary tones, and pitch resets. In addition, feature usage was found to

    be corpus dependent. While pause features were heavily used in both corpora, they

found that duration cues dominated in Switchboard conversational speech; whereas, pitch was a more informative feature in Broadcast News.
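A minimal sketch of this hidden-event HMM decoding follows: the event labels at the word boundaries form the hidden state sequence, a hidden-event LM supplies the transition scores, and the prosody model's output supplies the observation scores. Both scoring functions below are toy placeholders for models estimated from data.

    import math

    EVENTS = ("E", "N")  # E = sentence boundary event, N = no event

    def viterbi(n_boundaries, lm_logp, prosody_logp):
        """Return the best event sequence over n_boundaries word boundaries,
        combining LM transition scores with prosodic observation scores."""
        best = {e: (lm_logp(None, e, 0) + prosody_logp(e, 0), [e])
                for e in EVENTS}
        for i in range(1, n_boundaries):
            new = {}
            for e in EVENTS:
                cands = [(lp + lm_logp(pe, e, i) + prosody_logp(e, i),
                          seq + [e]) for pe, (lp, seq) in best.items()]
                new[e] = max(cands)
            best = new
        return max(best.values())[1]

    # Toy scores: the LM has a mild prior against events; the prosody
    # model strongly favors an event at boundary 2.
    lm = lambda prev_e, e, i: math.log(0.2 if e == "E" else 0.8)
    pros = lambda e, i: math.log(0.9 if (e == "E") == (i == 2) else 0.1)
    print(viterbi(4, lm, pros))  # -> ['N', 'N', 'E', 'N']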

    Kim and Woodland [16] also combined prosodic and lexical information in a

    system designed to identify full stops, question marks, and commas in Broadcast

    News. Their approach is similar to the one used by Shriberg et al. [12]. A prosodic

    decision tree was tested alone and in combination with a language model, with some

    improvements reported through the use of the combined model.

    Christensen et al. [17] investigated two different approaches to automatically

    identify punctuation using the Broadcast News corpus. A finite state approach com-

    bining a linguistic model with a prosody model significantly reduced the detection

    error rate and increased the related precision and recall measures, especially when

    using pause duration. They also showed how prosodic features like pause duration

    increased detection accuracy for full stops but had very little impact for detecting

the other types of punctuation marks. The second approach used a multi-layer perceptron (MLP) to model the prosodic features. This approach provides insight into

    the relationship between the individual prosodic features and the various punctua-


    tion marks. The results confirmed that pause duration features are the most useful

    features for finding full stops.

    Huang and Zweig [34] developed a maximum entropy based method to add punc-

tuation (period, comma, and question mark) into transcriptions for the Switchboard corpus. Features used in their models involve the neighboring words, the tags (punc-

    tuation marks) associated with the previous words, and pause features. They evalu-

    ated this approach on both the reference transcription and speech recognition output.

    Performance was measured using precision, recall, and F-measure. Results showed

    that performance varies for the different punctuation marks, and adding the bigram

    type of features (features about the previous and the current position, or the current

and the next position) improves F-measure by about 4% over unigram information. They noticed that adding pause information yields only a small gain, in contrast to the results reported for Broadcast News speech (such as [16]). This could be

    attributed to the different data sets, or to a suboptimal use of pause information

in this maximum entropy approach. They also observed that a comma is hard to distinguish from no punctuation, and that a question mark is often confused with a

    period. This approach provides a good framework for designing additional features.

    The maximum entropy approach will be investigated further in Chapter 8.
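The flavor of such a model can be sketched with binary indicator features and multinomial logistic regression, which is equivalent to a maximum entropy classifier. The feature templates and the tiny training set below are illustrative, not the feature set of [34].

    from sklearn.feature_extraction import DictVectorizer
    from sklearn.linear_model import LogisticRegression

    def features(prev_word, next_word, pause_sec):
        return {
            "prev=" + prev_word: 1,
            "next=" + next_word: 1,
            "bigram=" + prev_word + "_" + next_word: 1,  # "bigram-type" feature
            "pause_bin=" + ("long" if pause_sec > 0.5 else "short"): 1,
        }

    train = [(features("know", "so", 0.9), "PERIOD"),
             (features("well", "i", 0.3), "COMMA"),
             (features("of", "the", 0.02), "NONE")]
    vec = DictVectorizer()
    X = vec.fit_transform([f for f, _ in train])
    y = [label for _, label in train]
    model = LogisticRegression(max_iter=1000).fit(X, y)

    print(model.predict(vec.transform([features("know", "so", 0.8)])))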

    In the 2003 NIST sentence boundary detection evaluation, all the systems used

    both prosodic and textual features for sentence boundary detection [35]. The ap-

    proaches used are similar to the HMM approach used in [12]. For example, one

    system estimated the likelihood of three classes: complete sentence, incomplete sen-

    tence, and non-sentence. They used 48 acoustic-prosodic features estimated for each

    word boundary, including pause, speaking rate, energy, and pitch features. These

    prosodic features were used to train a 2-layer neural network. A linguistic subsystem

    used a trigram LM which has sentence tokens inserted between words. The com-

    bined decoder used the likelihood of the sentence classes from the acoustic-prosodic

    subsystem and the likelihood from the linguistic system, along with a Viterbi al-

    gorithm to find the class hypothesis at each word boundary. In another system, a


    decision tree was used to predict 4 classes: complete sentence, incomplete sentence,

    interruption point in edit disfluencies, or non-event boundary. The prosodic features

    provided to the decision tree are similar to the ones described in [12]. In addition,

    the posterior probability from the LMs was included as a feature in the decision

    tree. These two systems were further combined using a 2-layer neural network which

    uses the minimum square error back-propagation algorithm to hypothesize a binary

    score at each word boundary. These systems were evaluated on both the Conversa-

    tional Telephone Speech (CTS) and Broadcast News speech (BN), using both human

    transcriptions and speech recognition output.
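The final combination step in these systems can be sketched as follows: the per-boundary scores from the two subsystems become the inputs to a small network trained with squared error to produce a single boundary score. The sizes, threshold, and data below are illustrative, not those of the evaluation systems.

    import numpy as np
    from sklearn.neural_network import MLPRegressor

    # Columns: P(boundary) from subsystem 1 and from subsystem 2;
    # targets are the reference boundary labels.
    X = np.array([[0.9, 0.8], [0.2, 0.4], [0.7, 0.3], [0.1, 0.1]])
    y = np.array([1.0, 0.0, 1.0, 0.0])

    combiner = MLPRegressor(hidden_layer_sizes=(4,), max_iter=5000,
                            random_state=0).fit(X, y)
    score = combiner.predict([[0.8, 0.6]])[0]
    print("boundary" if score > 0.5 else "no boundary")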

    There is also some work that relies on only the prosodic information for finding

the sentence units. Wang and Narayanan [36] developed a method that uses only prosodic features (mostly pitch features) in a multi-pass approach. They did not use any word or phone alignments and thus avoided using a speech recognizer. In the first pass, they fit the pitch contour with two linear folds and search for major breaks in the contour.

    Then in the second pass, sentence boundaries are detected based on some pre-defined

    rules and statistics. They evaluated this algorithm using a subset of the Switchboard

    corpus, and obtained a false alarm rate of 17.9% and a miss rate of 7.1%. This result

is encouraging since only pitch information is used. However, in conversational speech, pitch may not be a very effective feature for sentence boundary detection.

We would expect that adding other prosodic and textual information would yield further improvement.
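The first pass of such a pitch-only approach can be sketched as a two-piece linear fit: for each candidate breakpoint, fit one line to each side of the pitch track, keep the breakpoint with the smallest squared error, and inspect the pitch reset across it. The synthetic contour and the notion of what counts as a major break below are illustrative.

    import numpy as np

    def best_break(t, f0):
        """Return (breakpoint index, residual, pitch reset) for the best
        two-piece linear fit to the pitch track f0 sampled at times t."""
        best = None
        for k in range(2, len(t) - 2):
            resid, ends = 0.0, []
            for sl in (slice(0, k), slice(k, len(t))):
                coef = np.polyfit(t[sl], f0[sl], 1)
                resid += float(np.sum((f0[sl] - np.polyval(coef, t[sl])) ** 2))
                ends.append((np.polyval(coef, t[sl][-1]),  # value at segment end
                             np.polyval(coef, t[sl][0])))  # value at segment start
            reset = ends[1][1] - ends[0][0]  # start of 2nd line minus end of 1st
            if best is None or resid < best[1]:
                best = (k, resid, reset)
        return best

    # Synthetic contour: a falling declination followed by a pitch reset.
    t = np.arange(20.0)
    f0 = np.concatenate([180 - 4 * np.arange(10.0), 200 - 2 * np.arange(10.0)])
    k, resid, reset = best_break(t, f0)
    print(k, round(reset, 1))  # -> 10 56.0 (a large reset suggests a boundary)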

    2.1.3 Summary of Past Research on Sentence Boundary Detection

    Finding sentence-like units and their subtypes can make transcriptions more read-

    able, while also aiding downstream language processing modules, which typically

    expect sentence-like segments. Previous work has shown that lexical cues are a

    valuable knowledge source for determining punctuation roles and detecting sentence

    boundaries, and that prosody provides additional important information for spo-


    ken language processing. Useful prosodic features include pause, word lengthening,

    and pitch patterns. Past experiments also show that detecting sentence boundaries

is easier than reliably determining sentence subtypes or sentence-internal

    breaks (e.g., commas). The poor performance of sentence-internal structure detec-

    tion also affects downstream processing, such as parsing [2]. Table 2.1 summarizes

    important attributes of much of the previous research. Most make use of textual

    information, either by using a statistical LM or employing other machine learning

    strategies. The value of adding more syntactic information to the task of sentence

    detection is an open question. The approaches listed in the first five rows are simi-

    lar to the approach taken in this thesis, since textual and prosodic information are

    combined for sentence boundary detection.

    2.2 Edit Disfluency Processing

    Disfluencies have been investigated using a variety of approaches. Linguists and

    psychologists have considered disfluencies largely from a production and perception

    standpoint; whereas, computational linguists have been more concerned with recog-

    nizing disfluencies and thus improving machine recognition of spontaneous speech.

    Although the latter is our main focus, we believe that a better understanding of the

underlying theory of disfluency production and its effect on listeners' comprehension

    can help to construct a better model for the automatic detection of disfluencies;

    therefore, we will briefly discuss some studies in psychology and linguistics.

    2.2.1 Production and Properties of Disfluencies

    Disfluency Production

    Disfluencies are very common in spontaneous speech. When speakers cannot

    formulate an entire utterance at once or when they change their minds about what

they are saying, they may suspend their speech and introduce a pause or filler before continuing, or add, delete, or replace words they have already produced.


Table 2.1
A summary of some important prior studies on sentence boundary detection. Column two is the task chosen for each investigation: "boundary" means the sentence boundary detection task, as opposed to its subtype or punctuation detection; column three describes the model or the information sources used by each investigation; column four is the corpus on which the experiments were conducted; column five indicates whether the experiments were performed on human transcriptions (Ref) or recognition results (ASR). Note that CTS (i.e., conversational telephone speech) is used in the corpus column for those experiments that were conducted on the Switchboard corpus. Even though no textual information is used in the pitch-only detection model, the Ref condition is used in that study for its evaluation.

Investigation              | Classification Task   | Model                  | Corpus            | Ref or ASR
Shriberg et al. [12]       | boundary              | prosody, word-LM       | CTS, BN           | Ref, ASR
Gotoh, Renals [15]         | boundary              | pause, word-LM         | BN                | ASR
Kim, Woodland [16]         | punctuation           | prosody, word-LM       | BN                | Ref
Huang, Zweig [34]          | punctuation           | Maxent (word, pause)   | CTS               | Ref, ASR
NIST eval systems [35]     | boundary              | prosody, word-LM       | CTS, BN           | Ref, ASR
Beeferman [24]             | commas given boundary | word-LM                | WSJ               | Ref
Stevenson, Gaizauskas [25] | boundary              | MBL (word, POS)        | WSJ               | Ref
Chen [29]                  | punctuation           | punctuation token with | a business letter | ASR
                           |                       | acoustic information   |                   |
Wang, Narayanan [36]       | boundary              | pitch                  | CTS               | Ref


Spontaneous

    speech is systematically shaped by the problems speakers encounter while planning

    an utterance, accessing lexical items, and articulating a speech plan. Speech errors

    and disfluencies produced by normal speakers have been studied for decades to learn

about linguistic production and the cognitive processes of speech planning [37-39].

    Disfluency has been used as evidence for cognitive load in speech planning. Ovi-

    att [40] and Shriberg [41] have shown in different types of task-oriented conversations

    that long utterances have a higher disfluency rate than short ones. This effect may

    be related to the planning load of the utterance, i.e., speakers have more difficulty

    planning longer utterances, while making task-oriented plans at the same time. An-

    other observation is that disfluencies occur more frequently at the beginning of an

    utterance when the utterance is at an early planning stage, providing evidence of

    the impact of utterance planning on disfluencies.

    Clark and Wasow [42] studied the phenomenon of repeated words in spontaneous

    speech. In their work, repeats are divided into four stages: initial commitment,

    suspension of speech, hiatus, and restart of the constituent. These four stages cor-

    respond to the four components (i.e., reparandum, interruption, editing term, and

    correction) that have been laid out in Chapter 1 for all edit disfluencies. They pro-

posed a commit-and-restore model of repeated words, as well as three hypotheses to

    account for the repeats, namely, the complexity hypothesis, the continuity hypoth-

    esis, and the commitment hypothesis. They hypothesize that the more complex a

    constituent, the more likely speakers are to suspend it after an initial commitment

    to it (i.e., complexity hypothesis), and that speakers prefer to produce constituents

    with a continuous delivery (i.e., continuity hypothesis), and that speakers make a

    preliminary commitment to constituents, expecting to suspend them afterward (i.e.,

    commitment hypothesis). They analyzed repeated articles and pronouns in two large

    corpora, the Switchboard corpus and the London-Lund corpus,3 and found strong

    empirical evidence to support the proposed commit-and-restore model, along with

    3See [42] for a description of the corpus.


    evidence for all three hypotheses. They noticed that speakers are more likely to make

    a premature commitment, and then immediately suspend it when the constituent

becomes more complex, and that speakers are more likely to restart a constituent the more their suspension disrupts the utterance. One example is the frequent

    occurrence of function words in repeats. It has long been recognized for English

    that function words are repeated far more often than content words. When speakers

want to make an initial commitment to a constituent, the word they most commonly use is a function word. Overall, Clark and Wasow [42] found that function

    words were repeated more than ten times as often as content words, 25.2 versus 2.4

    per thousand in the Switchboard corpus. This more frequent occurrence of function

    words in repeats is explained by the three hypotheses they proposed.

    Knowing the types of words that speakers tend to repeat (or revise) is helpful

    for building a better model of spontaneous speech. For example, when speakers

repair a content word, they often return to a major constituent boundary, such as "on Friday, I mean, on Monday." Such an observation is beneficial for defining

    disfluency patterns and can aid in automatically identifying them.

    Effect on Listeners

    It is also valuable to understand how human listeners cope with disfluent input.

Studies by Lickley [43] and by Lickley and Bard [5] have shown that listeners generally miss disfluencies or report their occurrence incorrectly, suggesting that

    disfluencies may have been filtered out for utterance comprehension. Psycholinguists

    believe that disfluencies play specific roles in our communication, such as sending sig-

    nals to the listener to do things like pay more attention, help the speaker find a word,

    or be patient while the speaker gathers his or her thoughts. Disfluencies provide in-

    formation that enables people in a conversation to better coordinate interaction and

    manage turn-taking [41].


    Brennan [44] investigated how comprehension is affected when listeners hear dis-

    fluent speech. In her experiments, listeners followed fluent and disfluent instructions

    for selection of an object in a graphical display. She found that listeners make fewer

    errors when hearing less misleading information before the interruption points of

    disfluencies. She also observed that mid-word interruptions are better signals than

    between-word interruptions that a word was produced in error and that the speaker

intends to replace it. This supports Levelt's hypothesis [38] that by interrupting

    a word, a speaker signals to the addressee that that word is an error. If a word is

    completed, the speaker intends the listeners to interpret it as correctly delivered.

    Brennan also found in her experiments that there is information in disfluencies that

partially compensates for any disruption that listeners encounter while processing disflu-

    ent speech.

Fox Tree [45] studied how naturally occurring speech disfluencies affect listeners'

    comprehension. She observed that disfluencies do not always have a negative effect

    on comprehension. For example, repetitions do not hinder the listeners, because they

    can help listeners to recover information missing in the first occurrence of words that

    are repeated. However, it does take longer to identify words when there is a false

start. When false starts begin utterances, listeners may abort the false starts with no cost to comprehension. But if false starts occur in the middle of utterances, listeners

    have to figure out where the false start begins, what to abort, and where to attach

    the restarted information. This process slows down comprehension.

    Disfluency Rates

    A conservative estimate (excluding silent hesitations) for the rate of disfluencies4

    in spontaneous speech is approximately 6 words per 100 words [45]. There are a

    variety of