
STRUCTURAL EVENT DETECTION FOR RICH TRANSCRIPTION OF SPEECH

    A Thesis

    Submitted to the Faculty

    of

    Purdue University

    by

    Yang Liu

    In Partial Fulfillment of the

    Requirements for the Degree

    of

    Doctor of Philosophy

    December 2004


    To my parents and my husband.


    ACKNOWLEDGMENTS

I started this research on structural event detection at Purdue University and continued it at ICSI, where I have been for the past two and a half years. ICSI has provided a wonderful environment for me to enrich my research on speech and language processing.

I gratefully acknowledge my major advisor, Mary Harper, for both academic and moral support over the past few years. Even while I have been away from campus, she has been in constant touch via email and phone, supporting my research. I have benefited from her insightful guidance and discussion, as well as her encouragement. She has given me the intellectual freedom to do research in spoken language processing and has provided lots of advice. She has taught me how to be a researcher through all these years, while ploughing through the many paper drafts she has revised.

I would like to thank Elizabeth Shriberg and Andreas Stolcke for giving me the opportunity to continue my research at ICSI. I thank them for their valuable suggestions and comments when I encountered difficulties in my research. I have learned from them how to look at a problem from both a scientific and an engineering point of view. Special thanks to Elizabeth Shriberg for teaching me about linguistics as well as for providing academic advice over the past two years.

I thank my other Ph.D. committee members: Leah Jamieson and Jack Gandour at Purdue University. They have been very generous with their time and supportive of my research topic. I have benefited from discussions with Leah Jamieson about speech processing in my first two years of study at Purdue University.

Many people at ICSI also deserve acknowledgment. On the academic front, Barbara Peskin shared her vision of the entire structural event detection project, and at the same time was always willing to spend her time working out details. Nelson Morgan, as the director of ICSI, has created an excellent environment that nurtures research and learning. Chuck Wooters and James Fung deserve special thanks for generating speaker diarization results. Jeremy Ang, Kofi Boakye, Barry Chen, Dave Gelbart, Dan Gillick, Andy Hatch, Yan Huang, Adam Janin, Nikki Mirghafori, and Qifeng Zhu have been helpful office mates and neighbors at ICSI. They have made my time at ICSI more enjoyable.

There are so many other people who have contributed to my research. Luciana Ferrer at SRI has helped much with prosodic feature extraction. I thank Mari Ostendorf and Dustin Hillard at the University of Washington for their collaboration on the structural event detection work. Wen Wang, who finished her Ph.D. at Purdue University and is at SRI now, has been so patient with all my questions regarding language models. I am glad that I had the chance to work together with Lei Chen at Purdue University using a multimodal corpus for sentence boundary detection. Nitesh Chawla at CIBC has been a wonderful source for answers to my machine learning questions. Thanks also to Andrew McCallum at the University of Massachusetts and Fernando Pereira at the University of Pennsylvania for their support and advice on the CRF model. I also thank Julia Hirschberg and Yoav Freund at Columbia University for their assistance with the boosting algorithm.

Most of all, I thank my family for their support of my education. I would not be able to reach the end of this journey without consistent support and encouragement from my husband. His belief in me has made this thesis possible. The love from my parents and sister has also supported me during the difficult times in my graduate life.


    TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
ABSTRACT
1 INTRODUCTION
   1.1 Motivation
   1.2 Scope of the Thesis
       1.2.1 Structural Event Detection Tasks
       1.2.2 Our Approach to the Problem
2 RELATED WORK
   2.1 Sentence Boundary Detection
       2.1.1 Text-based Processing for Sentence Boundary Detection
       2.1.2 Combining Textual and Prosodic Information for Sentence Boundary Detection
       2.1.3 Summary of Past Research on Sentence Boundary Detection
   2.2 Edit Disfluency Processing
       2.2.1 Production and Properties of Disfluencies
       2.2.2 Past Research on Automatic Disfluency Detection
       2.2.3 Summary of Past Research on Disfluencies
   2.3 Filler Word Processing
       2.3.1 Production and Perception of Fillers
       2.3.2 Past Research on Filler Word Processing
   2.4 Chapter Summary
3 DATA RESOURCES AND TASKS
   3.1 Structural Speech Event Types
       3.1.1 Sentence-like Units (SUs)
       3.1.2 Fillers
       3.1.3 Edit Disfluencies
   3.2 Structural Event Detection Task Description
       3.2.1 Task Description
       3.2.2 Performance Measures
   3.3 Corpora
4 THE HMM APPROACH TO STRUCTURAL EVENT DETECTION
   4.1 Overview
   4.2 Feature Types
       4.2.1 Prosodic Features
       4.2.2 Textual Features
   4.3 The Models
       4.3.1 The Prosody Model
       4.3.2 The Language Model (LM)
   4.4 Model Combination
   4.5 Chapter Summary
5 HMM BASELINE PERFORMANCE
   5.1 System Description
       5.1.1 Choice of Classes
       5.1.2 Training Procedures
       5.1.3 Testing Procedures
   5.2 Baseline System Performance
       5.2.1 Task 1: SU Detection
       5.2.2 Task 2: Filler Word Detection
       5.2.3 Task 3: Edit Word and IP Detection
       5.2.4 Summary for All the Tasks
   5.3 Chapter Summary
6 INCORPORATING TEXTUAL KNOWLEDGE SOURCES INTO THE HMM SYSTEM
   6.1 Review of Related Language Model Techniques
   6.2 Various Knowledge Sources
       6.2.1 Word-LM
       6.2.2 Automatically Induced Classes (AIC)
       6.2.3 Part-of-speech (POS) Tags
       6.2.4 Syntactic Chunk Tags
       6.2.5 Word LMs from Additional Corpora
   6.3 Integration Methods for the LMs in an HMM
   6.4 Experiments on SU Detection Task
       6.4.1 CTS SU Task
       6.4.2 BN SU Task
   6.5 Chapter Summary
7 PROSODY MODEL
   7.1 Addressing the Imbalanced Data Set Problem
       7.1.1 The Imbalanced Class Distribution Problem
       7.1.2 Approaches to Address the Problem
   7.2 Pilot Study for SU Detection
       7.2.1 Experimental Setup
       7.2.2 Sampling Results
       7.2.3 Bagging Results
   7.3 Sampling and Bagging Across SU and IP Tasks
       7.3.1 Experimental Setup
       7.3.2 Results Across SU and IP Tasks
   7.4 Evaluation on the Full NIST SU Task
       7.4.1 Experimental Setup
       7.4.2 Results on the NIST SU Task
   7.5 Chapter Summary
       7.5.1 Summary
       7.5.2 Discussion
8 APPROACHES TO COMBINE KNOWLEDGE SOURCES
   8.1 Knowledge Sources
   8.2 A Review of the HMM for SU Detection
   8.3 The Maxent Posterior Probability Model for SU Detection
       8.3.1 Description of the Maxent Model
       8.3.2 Features Used
       8.3.3 Comparisons of the Maxent and HMM Approaches
       8.3.4 Results and Discussion for the Maxent SU Model
   8.4 The Conditional Random Field (CRF) Model for SU Detection
       8.4.1 Description of the CRF Model
       8.4.2 Comparisons of CRF and Other Models
       8.4.3 Results and Discussion for the CRF SU Model
   8.5 Chapter Summary
9 SYSTEM FOR RT-04
   9.1 RT-04 Tasks and Data
   9.2 System Performance for SU Boundary Detection
   9.3 SU/SU-Subtype Detection
   9.4 Edit Word Detection
       9.4.1 Methods
       9.4.2 Edit Detection Results
   9.5 Chapter Summary
10 RELATED EFFORTS
   10.1 Factors Impacting Performance
       10.1.1 Word Error Rates (WER)
       10.1.2 Speaker Label for SU Detection
   10.2 Word Fragment Detection
       10.2.1 Introduction
       10.2.2 Acoustic and Prosodic Features
       10.2.3 Experiments
   10.3 Chapter Summary
11 FINAL REMARKS
   11.1 Impact on Other Research Efforts
       11.1.1 Using Structural Event Information for Word Recognition
       11.1.2 SU Detection in a Multi-modal Corpus
       11.1.3 Dialog Act Detection in Meeting Corpus
   11.2 Summary of Experiments
   11.3 Contributions
   11.4 Future Work
LIST OF REFERENCES
APPENDICES
   Appendix A: ADT Boosting For SU and IP Detection
       A.1 ADT Boosting Description
       A.2 Experimental Results
       A.3 ADT Boosting Summary
   Appendix B: Prosodic Features
VITA


    LIST OF TABLES

1.1 Symbols used for the structural events in the example of annotated transcriptions.

2.1 A summary of some important prior studies on sentence boundary detection. Column two is the task chosen for each investigation: "boundary" means the sentence boundary detection task, as opposed to its subtype or punctuation detection; column three describes the model or the information sources used by each investigation; column four is the corpus on which the experiments were conducted; column five represents whether the experiments were performed on human transcriptions (Ref) or recognition results (ASR). Note that CTS (i.e., conversational telephone speech) is used in the corpus column for those experiments that were conducted on the Switchboard corpus. Even though no textual information is used in this automatic detection model, the Ref condition is used in that study for its evaluation.

2.2 A summary of some important prior studies on disfluency detection. Column two is the task for each investigation; column three describes the model or information sources used by each investigation; column four is the corpus on which the experiments were conducted; column five represents whether the experiments were performed on human transcriptions (Ref) or recognition results (ASR). In Core [52], preliminary repair information is provided, and the parser further corrects them.

3.1 Structural events annotated by LDC and investigated in this thesis. Note that the subtype of an edit disfluency is not annotated by LDC, nor is the correction in an edit disfluency.

3.2 Information on the CTS and BN corpora, including the data set sizes, the percentage of the different types of structural events in the training set, and the word error rate (WER) of the speech recognizer on the test set.

4.1 Examples of cue words that are highly representative of some structural event types.

4.2 Examples of the prosodic features used for the SU detection problem that appear in the decision tree shown in Figure 4.3.

5.1 CTS SU detection results using the NIST SU error rate (%) and the boundary-based CER (% in parentheses) on human transcriptions (REF) and recognition output (STT), for the LM and the prosody model individually and in combination. The baseline error rate, assuming there is no SU boundary at each word boundary, is 100% for the NIST SU error rate and 15.7% for CER.

5.2 Deletion and insertion error rates (NIST SU error rate in %) for the CTS REF condition, using the LM and the prosody model alone and in their combination.

5.3 Feature usage (%) for SU detection on CTS.

5.4 BN SU detection results using the NIST SU error rate (%) and the CER (% in parentheses) using the prosody model, the LM, and their combination. Results are shown for both REF and STT conditions. The baseline error rate is 100% for the NIST SU error rate and 7.2% for CER.

5.5 Deletion and insertion error rates (NIST SU error rate in %) for the BN REF condition, using the LM and the prosody model alone and in their combination.

5.6 Feature usage (%) for SU detection on BN.

5.7 Results for CTS filler word (including FP and DM) detection, FP detection, and DM boundary detection using the NIST error rate (%) and CER (% in parentheses) for the prosody model, the LM, and their combination. Results are for both the REF and STT conditions. The baseline CER is 8.3% for filler word detection, 3.6% for FP detection, and 2.8% for DM boundary detection.

5.8 Feature usage (%) for the FP and DM detection tasks in CTS.

5.9 CTS edit word and IP detection results using the NIST error rate (%) and CER (% in parentheses) for the prosody model, the LM, and their combination. Results are for the REF and STT conditions. The baseline CER is 8.3% for edit word detection and 4.8% for edit IP detection.

5.10 Feature usage (%) for IP detection on the CTS corpus.

5.11 System performance (NIST error rate in %) for all the structural event detection tasks on the CTS and BN test sets. Results are presented for both the REF and STT conditions.

6.1 Two examples of automatically induced classes for the CTS SU detection task, depicting member words and each word's probability given the class.

6.2 The POS and chunk tags for a sentence from the BN corpus: "the top selling car of nineteen ninety-seven was announced today and the winner is toyota camry".

6.3 SU detection results (NIST error rate in %) for human transcriptions of CTS data using various LMs, alone and in combination with the prosody model. The deletion (DEL), insertion (INS), and total error rates are reported.

6.4 SU detection results (NIST error rate in %) for human transcriptions of the BN data using various LMs, alone and in combination with the prosody model. The deletion (DEL), insertion (INS), and total error rates are reported.

7.1 Description of the data set used in the pilot study for the CTS SU detection task.

7.2 SU detection results (CER in % and F-measure) for different sampling approaches in the pilot study of the CTS corpus, using the prosody model alone and in combination with the LM. The CER of the LM alone on the test set is 5.02%.

7.3 Recall and precision results for the sampling methods in the pilot study of CTS SU detection. Using the LM alone yields a recall of 74.6% and a precision of 84.9%.

7.4 CTS SU detection results (CER in % and F-measure) with bagging applied to a randomly downsampled data set (DS), an ensemble of downsampled training sets, and the original training set. The results for the training conditions without bagging are also shown for comparison.

7.5 Description of the data sets used for the SU and IP detection tasks. The data set used in the pilot study is shown in the second column; it is a subset of the data set used in this investigation (denoted "large set" in the table).

7.6 IP and SU detection results in CER (%). DS denotes downsampled. Chance performance is 4.36% on the original test set for IP and 13.64% for SU. The CER using the LM alone is 2.34% on the IP task and 5.27% on the SU task.

7.7 SU detection results (NIST error rate in %) for both the CTS and BN corpora, on the REF and STT conditions.

8.1 SU detection results (NIST error rate in %) for different state configurations using the trigram LM alone on the CTS reference condition. The insertion (INS), deletion (DEL), and total error rates are shown.

8.2 SU detection results (NIST error rate in %) using the Maxent and the HMM approaches individually and in combination on BN and CTS, on reference transcriptions (REF) and recognition output (STT).

8.3 Deletion, insertion, and total error rates (NIST error rate in %) of the HMM and Maxent approaches on reference transcriptions of BN and CTS.

8.4 SU detection results (NIST error rate in %) using different knowledge sources on BN and CTS, evaluated on the reference transcriptions.

8.5 Comparison of using the posterior probabilities from the prosody model as binary features versus continuous-valued features in the Maxent approach for SU detection in the CTS reference transcription condition.

8.6 Some of the N-gram features with the highest IG weights for the CTS SU detection task.

8.7 Notation for a 2×2 contingency table used in Chi-square statistics.

8.8 SU detection results (NIST error rate in %) using different feature selection metrics and different pruning thresholds (number of preserved features), for the CTS REF condition.

8.9 SU detection results (NIST error rate in %) using the HMM, Maxent, and CRF approaches individually and in combination on BN and CTS, on reference transcriptions (REF) and recognition output (STT). The combination of the three approaches is obtained via a majority vote.

8.10 CTS SU detection results (NIST error rate in %) using the HMM, Maxent, and CRF individually, using different knowledge sources. Note that the "all features" condition uses all the knowledge sources described in Section 8.3.2.

8.11 BN SU detection results (NIST error rate in %) using the HMM, Maxent, and CRF individually, using different knowledge sources.

9.1 Data description for CTS and BN used in the RT-04 NIST evaluation. BN training data is the combined RT-03 and RT-04 data; CTS contains only the RT-04 training data.

9.2 SU boundary detection results (NIST SU error rate in %) on the RT-04 evaluation data. The combination is the majority vote of the Maxent, CRF, and improved HMM approaches. DS denotes a downsampled training set.

9.3 Percentage of SU subtypes for CTS and BN.

9.4 SU/SU-subtype detection results (%) on the RT-04 CTS evaluation data. Results are reported using the NIST SU boundary error rate, substitution error rate, and the subtype classification error rate (CER).

9.5 SU subtype detection results (confusion matrix) for the CTS human transcription condition. Each cell shows the count and percentage (%) of a reference subtype (row) that is hypothesized as the subtype shown in the column.

9.6 States and transitions used by the CRF for edit word and edit IP detection. The class tags are: the beginning of an edit (B-E) and the inside of an edit (I-E), each of which has a possible IP associated with it (B-E+IP or I-E+IP), and outside of an edit (O).

9.7 Results (NIST error rate in %) for edit word and IP detection, using the HMM, Maxent, and CRF approaches on the reference and recognition output conditions of the CTS data.

9.8 Results (NIST error rate in %) for edit word and IP detection, using the HMM and Maxent approaches.

10.1 SU and edit word detection results (NIST error rate in %) for CTS and BN, on REF and various STT conditions using the RT-04 data. For SU detection, results are reported for the SU boundary detection error. STT-1 and STT-2 are two different STT outputs; their WERs (%) are shown in the table.

10.2 Comparison of different ways to derive speaker labels on the RT-04 test set for the BN SU boundary detection task. Results are shown using the NIST error rate (%) for the HMM on the reference transcription condition.

10.3 Word fragment detection results (confusion matrix) on the downsampled data of the Switchboard corpus.

10.4 Feature usage (%) for word fragment detection using the Switchboard data.

11.1 WER (%) when SU information is fed back to re-segment and re-recognize speech, compared to the baseline using the acoustic segments, evaluated on half of the RT-03 BN data.

11.2 SU detection results (NIST error rate in %) on the Wombat data. Note that the combined result is not shown when using textual information only, in order to keep the results parallel to those in Chapter 8 (Tables 8.2 and 8.4).

11.3 DA boundary detection results (NIST error rate in %) on the ICSI Meeting data. Results are for the reference transcriptions (REF) and STT output, using the pause decision tree (pause DT) model, the hidden event LM, and the HMM combination of them.

11.4 DA subtype classification accuracy (%) using the reference DA boundaries of the ICSI Meeting corpus, for the human transcriptions and recognition output. Two conditions are used: word-based features only, and the combined word-based features and binned posterior probabilities from the decision tree (DT). Chance performance is obtained when the majority type (statement) is hypothesized for each DA.

A.1 SU and IP detection results (classification error rate in %) using the ADT learning algorithm and bagging. Training and testing were conducted using a downsampled training and testing set. Chance performance is 50%.


    LIST OF FIGURES

1.1 A flow diagram for the automatic structural event detection task.

3.1 Examples of transcriptions for CTS and BN, respectively. SU boundaries are not shown in the examples.

4.1 The waveform, pitch and energy contours, word alignment, and SU boundaries for the utterance "um no I hadn't heard of that".

4.2 The raw and stylized F0 contours for the utterance "um no I hadn't heard of that".

4.3 An example of a decision tree for SU detection. Each line represents a node in the tree, with the associated question regarding one particular prosodic feature, the class distribution, and the most likely class among the examples going through this node (S stands for SU boundary, and 0 for non-SU boundary). The indentation represents the level of the decision tree. Some of the features used in this tree are described in Table 4.2.

5.1 Data preparation for model training.

5.2 System flow diagram of the testing procedure.

5.3 System diagram for edit word and IP detection.

5.4 Valid state transitions for repetitions of up to 3 words. The X and Y axes represent the position in the reparandum and repetition regions respectively, with events denoted as ORIG- and REP-. In ORIG-n, n means the position of a word in the reparandum; in REP-m.n, m is the total number of repeated words and n represents the position of the event in the repeat region. Optional filler words are allowed after the IP in the transition.

5.5 A rule-based method for determining the reparandum region after IPs are hypothesized. SU hypotheses are used in the rules.

6.1 Integration methods for the various LMs and the prosody model.

7.1 The bagging algorithm. T is 50 in our experiments. In each bag, the class distribution is the same as in the original data S.

7.2 ROC curves and their AUCs for the decision trees trained from different sampling approaches and the original training set.

7.3 ROC curves and their AUCs for the decision trees when bagging is used on the downsampled training set (bag-ds), the ensemble of downsampled training sets (bag-ensemble), and the original training set (bag-original).

7.4 ROC curves for IP and SU detection using the prosody model alone on the CTS corpus.

8.1 The graphical model for the SU detection problem. Only one word-event pair is depicted in each state, but in a model based on N-grams the previous N-1 tokens would condition the transition to the next state. O are observations consisting of words W and prosodic features F, and E are structural events.

8.2 The graphical model for the POS tagging problem. POS tags are the hidden states in this problem. S are POS tags, and W are words.

8.3 The graphical representation of a CRF for the sentence boundary detection problem. E represents the state tags (i.e., SU boundary or not), while W and F are word and prosodic features respectively. O are observations consisting of W and F.

8.4 The graphical model representations of the HMM, CMM, and CRF approaches. O are observations, and S are events (or tags).

10.1 An illustration of how speaker change is obtained for the CTS data. An arrow represents a speaker change after that segment.

10.2 The pruned decision tree used to detect word fragments. The decision is made in the leaf nodes; however, in the figure the decision for an internal node in the tree is also shown.

11.1 Using SU information for re-recognition in BN.

A.1 An example of an alternating decision tree (ADT).


    ABSTRACT

Liu, Yang. Ph.D., Purdue University, December 2004. Structural Event Detection for Rich Transcription of Speech. Major Professor: Mary P. Harper.

Although speech recognition technology has significantly improved during the past few decades, current speech recognition systems output only a stream of words without providing other useful structural information that could aid a human reader and downstream language processing modules. This thesis research focuses on the automatic detection of several helpful structural events in speech, including sentence boundaries, type of utterance, filled pauses, discourse markers, and edit disfluencies. The systems evaluated combine prosodic cues and textual information sources in a variety of ways to support automatic detection of these structural events. Experiments were conducted across corpora (conversational speech and broadcast news speech) and with different transcription quality (human transcriptions versus recognition output).

The imbalanced data problem is investigated for training the decision tree prosody model component of our system because structural events are much less frequent than non-events. A variety of sampling approaches and bagging are used to address this imbalance. Significant performance improvements are obtained via bagging. Some of the sampling methods are useful depending on the performance metrics used. Sentence boundary detection and disfluency detection tasks are impacted differently by sampling, bagging, and boosting, suggesting inherent differences between the two tasks.

A variety of methods for combining knowledge sources are examined: a hidden Markov model (HMM), the maximum entropy (Maxent) model, and the conditional random field (CRF). The Maxent and CRF approaches are discriminatively trained to model the posterior probabilities and thus correlate with the performance measures. They also support the use of more correlated features and so enable the combination of a variety of textual information sources. The HMM and CRF both model sequence information, unlike the Maxent, which explicitly models local information. A model that combines these three approaches is superior to any method alone.

Interactions with other research efforts suggest that the methods developed in this thesis generalize well to other corpora (e.g., a multimodal corpus, a multiparty meeting corpus) and to similar tasks (e.g., a gestural model, dialog act segmentation and classification).


    1. INTRODUCTION

    1.1 Motivation

Speech recognition technology has improved significantly during the past few decades; for tasks involving read or pre-planned speech, recognition accuracy is often greater than 90%. However, the word-level transcription accuracy for spontaneous conversational speech falls far short of this level, generally lower than 80%. The acoustic properties of spontaneous conversational speech are quite challenging to model due to phenomena such as coarticulation, word fragments, and filled pauses. Additionally, disfluencies and ungrammatical utterances pose serious problems for language models (LMs). These factors combine to affect the performance of speech recognizers on spontaneous speech. The following is an excerpt of a transcription of spontaneous conversational speech. Both the human transcription and the recognition output are shown in the example below. The presence of a word fragment is represented by a "-" after the partial word. In the recognition output, misrecognized words are followed by the corresponding correct words inside curly braces (corresponding to deletion or substitution errors).

Human Transcription:

but uh i'm i i i think that you know i mean we always uh i mean i've i've had a a lot of good experiences with uh with many many people especially where they've had uh extended family and i and an- i i kind of see that that you know perhaps you know we may need to like get close to the family environment and and get down to the values of you know i mean uh it's money seems to be too big of an issue wi- with with with with with what's going on today


    Recognition Output:

but um that that {uh i'm i i} i think that you know we {i mean} we always uh i mean i've i've had it there {a} a lot of good experiences with the {uh} with many many people especially with have {where they've} had extended family night and i and {an- i} i kind of see that that you know perhaps you know we may need to like you're {get} close to the family environment and in {and} get down to the values of you know i mean no and {uh it's} money seems to be too big of an issue we would {wi- with with with} with with really was we would what's going on today

As can be seen from the recognition output example, current automatic speech recognition (ASR) systems simply output a stream of words. Structural information (such as the location of punctuation, disfluencies, and speaker turns) is missing, making the output difficult for a human to read and for downstream automatic processors to deal with. As shown in the example above, even the human transcription, which contains no word errors, is still hard to read due to the absence of punctuation and the presence of speech disfluencies and filler words.

The transcriptions can be marked with different types of structural information to enhance readability or ease downstream processing. In this thesis, the following types of structural events are considered:

Sentence boundaries: A sentence ends with "./" for a statement, ".../" for an incomplete statement, and "?/" for a question in the marked-up transcription examples in this thesis.

Filler words: These include filled pauses (e.g., "uh" and "um") and discourse marker words (such as "you know" and "well"). The tokens "<" and ">" are used to mark the extent of these filler words.

Edit disfluencies: Disfluencies are highly prevalent in conversational speech. In this thesis, the term edit disfluency is used for the disfluencies1 with the following structure (see Chapter 3 for more details):

(reparandum) * <editing term> correction

The edited portion of a disfluency (i.e., the reparandum) is marked in examples with parentheses "(" and ")". For example, in "a a lot" in the human transcription shown above, the first "a" is the reparandum, so it is marked with parentheses. The interruption point (IP) inside the edit disfluency is marked by "*". The editing term, which follows the IP and precedes the correction, is optional. The edit disfluency structure is embedded in utterances and so may be preceded and followed by words that are not part of the edit disfluency.

These types of structural information will be described in more detail in Chapter 3. Below is the annotation of our human transcription example.2 All the words that interrupt the fluency of speech are shown in bold face in this example. Table 1.1 summarizes the meanings of the symbols used in the annotated transcriptions.

but <uh> (i'm * i * i think that <you know> <i mean> i've) * i've had (a) * a lot of good experiences (with) * <uh> with (many) * many people especially where they've had <uh> extended family ./

(and i * and) * an- (i) * i kind of see (that) * that <you know> perhaps <you know> we may need to like get close to the family environment (and) and get down to the values of <you know> <i mean> .../

(<uh> it's) * money seems to be too big of an issue (wi- * with * with * with * with) * with what's going on today ./

The transcriptions containing this structural information are called rich transcriptions because they contain much richer information than a simple stream of words.

1 These disfluencies are also called speech repairs in the literature.
2 The human transcription is used here to illustrate the importance of structural information in order to factor out the effect of speech recognition errors.


Table 1.1
Symbols used for the structural events in the example of annotated transcriptions.

    Symbol        Meaning
    ./ or .../    sentence boundaries (complete or incomplete)
    < >           filler words
    ( )           reparandum in an edit disfluency
    *             interruption point in an edit disfluency

Given this structural information (either human annotated or automatically generated), human transcriptions or recognition output can be cleaned up for improved readability. For example, if the disfluencies and fillers are removed from the previous transcription and each sentence is presented with the appropriate punctuation, the cleaned-up transcription would be as follows:

But I've had a lot of good experiences with many people especially where they've had extended family. I kind of see that perhaps we may need to get close to the family environment and get down to the value of... Money seems to be too big of an issue with what's going on today.

Clearly this cleaned-up transcription is more readable, is easier to understand, and is more appropriate for subsequent language processing modules.
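To make the clean-up step concrete, the following is a minimal Python sketch that strips the Table 1.1 markup from an annotated string. It assumes exactly the symbols listed there ("( )" reparanda, "*" IPs, "< >" fillers, "./ .../ ?/" SU boundaries); the function name and the regular expressions are ours, not the thesis's.

```python
import re

def clean_transcript(annotated: str) -> str:
    """Remove reparanda, IP markers, and fillers; map SU symbols to punctuation."""
    s = re.sub(r"\([^()]*\)", " ", annotated)   # drop reparandum spans "( ... )"
    s = re.sub(r"<[^<>]*>", " ", s)             # drop filler spans "< ... >"
    s = s.replace("*", " ")                     # drop interruption-point markers
    s = s.replace(".../", "...").replace("?/", "?").replace("./", ".")
    return re.sub(r"\s+", " ", s).strip()       # normalize whitespace

print(clean_transcript("(<uh> it's) * money seems to be too big of an issue "
                       "(wi- * with * with * with * with) * with what's going on today ./"))
# -> "money seems to be too big of an issue with what's going on today ."
```

Modulo capitalization, this recovers the last sentence of the cleaned-up transcription above.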

There has been a growing interest recently in the study of the impact of structural events. Jones et al. [1] have conducted experiments showing that cleaned-up transcriptions improve human readability compared to the original transcriptions. Other recent research has considered whether automatically generated sentence information can play a role in parsing. Gregory et al. [2] have found that using sentence-internal prosodic cues degrades parsing performance; however, the method used for automatically generating sentence-internal annotations was not state-of-the-art. On the other hand, Kahn et al. [3] have achieved significant error reductions in parsing performance when using sentence boundary information from a state-of-the-art automatic detection system.

    1.2 Scope of the Thesis

    1.2.1 Structural Event Detection Tasks

Automatic structural event detection is a crucial step for improving the readability of speech recognition output and for making spontaneous speech understanding systems possible. The goal of this thesis is to enrich the recognition output with multiple levels of structural information, including sentence boundaries, filled pause and discourse marker words, and edit disfluencies. We will construct and evaluate algorithms that automatically detect such structural event types.

Note that the problem of sentence boundary detection differs from its analog in text processing, which is sometimes called sentence splitting or sentence boundary disambiguation. The goal of the sentence splitting task is to identify sentence boundaries in written text where punctuation is available; hence, the problem is effectively reduced to deciding which symbols that potentially denote sentence boundaries (i.e., ".", "!", "?") actually do. The sentence splitting problem is not deterministic, since these punctuation symbols do not always occur at the end of sentences. For example, in "I watch C. N. N.", only the final period denotes the end of a sentence. In the sentence boundary detection task using speech, no punctuation is available, yet the availability of speech provides additional useful information.
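The text-side task can be made concrete with a small sketch. The heuristic below is ours and far simpler than the learned systems reviewed in Chapter 2; it decides whether a ".", "!", or "?" token ends a sentence, treating runs of single capital letters as abbreviations so that only the final period of "I watch C. N. N." counts.

```python
def is_sentence_end(tokens: list, i: int) -> bool:
    """Heuristically decide whether the punctuation token tokens[i] ends a sentence."""
    if tokens[i] in ("!", "?"):
        return True
    prev = tokens[i - 1] if i > 0 else ""
    nxt = tokens[i + 1] if i + 1 < len(tokens) else None
    if nxt is None:                      # end of input: call it a boundary
        return True
    # Periods between single capital letters sit inside an abbreviation,
    # as in "I watch C. N. N.", and do not end the sentence.
    if len(prev) == 1 and prev.isupper() and len(nxt) == 1 and nxt.isupper():
        return False
    return nxt[0].isupper()              # otherwise expect a capitalized next sentence

tokens = "I watch C . N . N . Tomorrow I will not".split()
print([i for i, t in enumerate(tokens) if t == "." and is_sentence_end(tokens, i)])
# -> [7]: only the final period of "C. N. N." is treated as a sentence boundary
```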

We will investigate structural event detection across corpora, on both broadcast news and conversational telephone speech. Broadcast news comprises read speech, formal interviews, man-on-the-street interviews, and some spontaneous speech, although not usually conversational speech. In contrast, conversational telephone speech is spontaneous, and much of it is quite informal. Broadcast news usually has fewer edit disfluencies than spontaneous conversational speech, and many of those that occur may be caused by reading errors. Our algorithms will be evaluated on both the human transcriptions and recognition output to investigate the effect of incorrect words in ASR output on system performance.

    1.2.2 Our Approach to the Problem

The framework of most current speech recognition systems is to find the most likely word sequence given the speech signal. Because the hidden structure of the utterance (sentence boundaries and disfluencies) does not have an explicit acoustic signal,3 it is hard to integrate the problem of structural event detection with word recognition in current speech recognition systems. Therefore, we will address this problem by using a post-processing approach that generates the structural information after the recognition results are available. Several knowledge sources will be employed, involving both textual information and prosodic cues, to reduce the ambiguity inherent in any one knowledge source. Figure 1.1 shows a diagram of our approach, the final output of which is a rich transcription or cleaned-up transcription. As the figure shows, prosodic information is obtained from a combination of the speech signal and the recognition output, which is used to provide word and phone alignments.

In our investigations, textual information is obtained from the word strings in the transcriptions generated either by a human transcriber or by the ASR system. This type of information is no doubt very important. In many cases, people have no problem inferring appropriate structural events from word transcriptions. Some textual cues are quite useful for automatic identification of structural events; for example, words like "I" often start a new sentence, and a repeated or revised word string often signals disfluencies. In addition, the syntactic and semantic information derived from the words provides valuable cues for structural event detection.

3 There are some implicit prosodic cues at the boundary points, which will be described in Chapter 5.


[Figure 1.1 shows the system flow: the speech signal is passed to the ASR system, producing an ASR transcription; textual features are extracted from the processed transcription, and prosodic features are extracted from the speech signal together with the recognizer's alignments; both feature streams feed the structural event detection systems, whose structural event output yields the rich or cleaned-up transcription.]

Fig. 1.1. A flow diagram for the automatic structural event detection task.

In some cases, the use of textual information alone may not completely disambiguate structural events. The following example is extracted from the broadcast news data:

Anne what are the chances we'll hear uh something of substance again from the President prior to the vote ?/

And that's a possible next step ?/


A purely textual model would not be able to determine whether the second sentence is a statement or a question. However, the rising tone in the speech signal would enable the listener to determine that a question is intended.

In the face of high word error rates, word-level information may be unreliable and possibly misleading. In such a case, the lexical, syntactic, and semantic patterns used for detecting sentence boundaries and disfluencies will be less reliable due to the word errors. The following example compares ASR output with a human transcription of the speech:

ASR output:
It's been a while for the good for the tackle that stuff

Human transcription:
It's been a while since I've uh uh since I've tackled that stuff

It will be difficult, if not impossible, for a word-based language model to identify the repetition or the existing disfluencies using this ASR output.

Prosody, the "rhythm and melody" of speech, is important for automating rich transcription. Past research results [4-14] suggest that speakers use prosody to impose structure on both spontaneous and read speech. Examples of such prosodic indicators include pause duration, change in pitch range and amplitude, global pitch declination, melody and boundary tone distribution, vowel duration lengthening, and speaking rate variation. Since these features provide information complementary to the word sequence, they provide an additional, potentially valuable source of information for structural event detection. Additionally, since they may be more robust than textual features to word errors, they may provide a more reliable knowledge source.
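Several of these boundary cues can be computed directly from the recognizer's word alignment and a pitch track. The sketch below is a simplification for illustration: the AlignedWord container and the three features are ours, standing in for the much larger prosodic feature set actually used in this thesis (see Appendix B).

```python
from dataclasses import dataclass

@dataclass
class AlignedWord:
    text: str
    start: float    # word start time in seconds, from the recognizer's alignment
    end: float      # word end time in seconds
    mean_f0: float  # mean F0 over the word (Hz), from a pitch tracker

def boundary_features(words: list, i: int) -> dict:
    """A few prosodic cues at the word boundary following words[i]."""
    cur, nxt = words[i], words[i + 1]
    return {
        "pause_dur": nxt.start - cur.end,        # silence duration at the boundary
        "word_dur": cur.end - cur.start,         # duration-lengthening cue
        "f0_reset": nxt.mean_f0 - cur.mean_f0,   # pitch change across the boundary
    }

words = [AlignedWord("today", 1.00, 1.42, 110.0), AlignedWord("and", 1.95, 2.05, 160.0)]
print(boundary_features(words, 0))  # a long pause plus an F0 reset suggest an SU boundary
```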

Textual and prosodic knowledge sources have been exploited in previous research [12, 13, 15-18], and their combination has proven to be beneficial to performance for structural event detection. This thesis builds upon this prior work, which combined these knowledge sources using a hidden Markov model (HMM) approach. We will focus on developing a richer feature set for these knowledge sources, building more effective models to capture such information, and integrating various knowledge sources for structural event detection by using different modeling approaches.
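To preview the combination scheme this prior work uses (developed further in Chapter 4), the sketch below shows hidden-event decoding in miniature: the labels at each word boundary are hidden states, a hidden-event LM supplies transition scores, and the prosody model supplies per-boundary posteriors (which the actual system would convert to likelihoods by dividing out class priors). The two-state inventory, the array shapes, and all the numbers are illustrative assumptions, not the thesis implementation.

```python
import numpy as np

STATES = ["no_event", "su_boundary"]

def viterbi(lm_trans: np.ndarray, prosody_post: np.ndarray) -> list:
    """lm_trans[t, i, j]: LM score for moving from state i to state j at
    boundary t; prosody_post[t, j]: prosody posterior of state j there."""
    T, S = prosody_post.shape
    delta = np.log(prosody_post[0] + 1e-12)          # start with prosody only
    back = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        cand = (delta[:, None] + np.log(lm_trans[t] + 1e-12)
                + np.log(prosody_post[t] + 1e-12)[None, :])
        back[t] = cand.argmax(axis=0)                # best predecessor per state
        delta = cand.max(axis=0)
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):                    # trace predecessors backward
        path.append(int(back[t, path[-1]]))
    return [STATES[s] for s in reversed(path)]

T = 3
lm = np.full((T, 2, 2), 0.5)                         # uninformative LM for the demo
post = np.array([[0.9, 0.1], [0.2, 0.8], [0.7, 0.3]])
print(viterbi(lm, post))  # -> ['no_event', 'su_boundary', 'no_event']
```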

The investigations in this thesis should help to answer several questions with respect to the automatic detection of structural events: What knowledge sources are helpful? What is the best modeling approach for combining different knowledge sources? How is the model performance affected by various factors such as corpora, transcriptions, and event types?


    2. RELATED WORK

In the past decade, a substantial amount of research has been conducted in the areas of detecting intonational and linguistic boundaries in conversational speech, as well as in detecting and correcting speech disfluencies. In this chapter, we introduce research related to the automatic detection of different structural events, namely, sentence boundaries, edit disfluencies, and filler words. For each type, related research is categorized based on what knowledge sources have been used. Additionally, for completeness, studies from linguistics or psychology are discussed where appropriate.

    2.1 Sentence Boundary Detection

For speech recognition, sentences are usually defined by acoustic segment boundaries that correspond to long stretches of silence or a change of conversational turn.1 In contrast, linguistic segment boundaries mark a unit that represents a complete idea but may not necessarily represent a grammatical sentence, nor begin or end with a long silence or turn change. Experiments by Meteer and Iyer in [19] suggest that language model perplexity can be reduced by working with linguistic segments rather than acoustic segments. Our goal is to automatically find such linguistic sentence-like units.

Some of the previous research has focused on detecting major sentence boundaries;2 other efforts have investigated detecting subtypes of sentences (e.g., questions, statements). Prior research related to sentence and sentence-subtype detection can be divided into two categories based on the knowledge sources employed: a text-based approach, or an approach using both textual and acoustic information. The text-based approach uses only textual information; hence, it is suitable for both transcribed speech and written text. Text-based methods may not be able to resolve some ambiguities using information found in text, as in the example in Section 1.2.2, for which the question type is detected based on the rising tone. A combination approach uses both the acoustic cues and textual information. In most cases, it is difficult to compare the results of prior research since they often differ on the corpora used for training and testing, as well as in the information used by their systems.

1 The definition of turn varies in the literature. In this thesis, a turn is a portion of speech uttered by a single speaker and bounded by silence from that speaker. See http://secure.ldc.upenn.edu/intranet/Annotation/MDE/guidelines/2004/control_floor.shtml for more details.
2 The definition of sentence varies across these past research efforts. The term used in this thesis will be defined in Chapter 3.

    2.1.1 Text-based Processing for Sentence Boundary Detection

    As mentioned in Chapter 1, the sentence boundary detection problem in written

    text aims to disambiguate punctuation marks with the goal of identifying sentence

boundaries. Palmer and Hearst [20] developed an efficient automatic sentence bound-

    ary labeling algorithm, which uses the part-of-speech (POS) probabilities of the

    tokens surrounding a punctuation mark as input to a feed-forward neural network

    to obtain the role of the punctuation mark. Because sentence boundaries were not

    available to their part-of-speech tagger, they used the prior probabilities of all parts

    of speech for a word. They tested their system on a portion of the Wall Street Jour-

    nal (WSJ) corpus. Their experiments found that a context of six surrounding tokens

    and a hidden layer with two units yielded the best accuracy on the test set. When

    training and testing were conducted using texts in lower-case-only format, the net-

    work was able to disambiguate 96.2% of the boundaries. Other approaches have also

    been used to investigate this problem, for example, Reynar and Ratnaparkhi [21]

    used a maximum entropy algorithm, and Schmid [22] employed an unsupervised

    learning method. Walker et al. [23] compared three different methods for sentence

    boundary detection as a preprocessing step in machine translation. They showed

    that the maximum entropy method [21] outperforms the other two systems, i.e.,


    the direct model and the rule-based system. They also argued that high recall is

    more important for the application of machine translation: fragmenting sentences is

better than combining two sentences. This insight may prove useful when our structural event detection results feed downstream language processing modules such as machine translation. The sentence boundary problem

    in text processing is different from that in speech processing in that punctuation

    information is available in text (although it is not deterministic). However, some

    knowledge obtained from such a task is useful to our automatic sentence boundary

    detection in speech, such as the lexical cues that are most effective for determining

    the role of punctuation.
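To make the flavor of these text-based systems concrete, the following Python sketch illustrates a Palmer-and-Hearst-style classifier: the prior POS-tag probabilities of the tokens surrounding a candidate punctuation mark are concatenated into a context vector and fed to a small feed-forward network with two hidden units. The tag inventory, feature layout, and training data below are illustrative placeholders, not a reproduction of the system in [20].

    import numpy as np
    from sklearn.neural_network import MLPClassifier

    POS_TAGS = ["NN", "VB", "DT", "JJ", "IN", "."]  # assumed toy tag inventory

    def context_vector(pos_priors, center, width=6):
        """Concatenate the prior POS distributions of the `width` tokens
        surrounding the candidate punctuation mark (half before, half after)."""
        feats = []
        for i in range(center - width // 2, center + width // 2 + 1):
            if i == center:
                continue  # skip the punctuation mark itself
            if 0 <= i < len(pos_priors):
                feats.extend(pos_priors[i])
            else:
                feats.extend([0.0] * len(POS_TAGS))  # pad beyond the edges
        return np.array(feats)

    # Placeholder training data: each row is a context vector around a
    # period; the label says whether it ends a sentence (1) or not (0).
    rng = np.random.default_rng(0)
    X = rng.random((200, 6 * len(POS_TAGS)))
    y = rng.integers(0, 2, size=200)

    # Two hidden units, the configuration reported as best in [20].
    clf = MLPClassifier(hidden_layer_sizes=(2,), max_iter=2000).fit(X, y)

In a real system, the context vectors would be derived from POS priors estimated on tagged training text rather than from random placeholders.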

An automatic punctuation system, called Cyberpunc, which is based only on lexical information, was developed by Beeferman et al. [24]. They counted the oc-

    currence of each punctuation mark in the 42 million tokens of the WSJ corpus and

    reported that about 10.5% of the tokens in that corpus were punctuation, mostly

    commas (4.658%) and periods (4.174%). Cyberpunc generates only commas, as-

    suming that sentence boundaries are provided or pre-determined. They extended

    a language model to account for punctuation by explicitly including commas in an

N-gram LM and allowing commas to occur at interword boundaries. Commas were added to the testing word strings by finding the best hypothesis using a Viterbi

    algorithm. They evaluated this method for generating commas on 2,317 reference

    sentences of the Penn Treebank WSJ corpus that were stripped of punctuation marks.

    They obtained a recall rate of 66% and precision of 76% for this comma generation

    task. The goal of this research differs from sentence boundary detection in speech

    because the task is to find commas assuming that the major sentence boundaries are

    known. Beeferman et al. [24] claimed that a punctuation-aware language model canbe applied to rescore speech recognition lattices in general, but they did not evaluate

    this.
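The decoding idea behind Cyberpunc can be sketched as follows: treat the comma as an ordinary LM token and, at each interword slot, keep the higher-scoring of inserting it or not. With a bigram LM the history collapses to the previous token, so this local decision is exact; a higher-order LM would require a full Viterbi search over histories. The probability table below is a hypothetical stand-in for a trained LM.

    import math

    def bigram_logp(prev, word):
        # Hypothetical stand-in for a trained n-gram LM that includes ","
        # as a vocabulary item; unseen pairs get a flat back-off score.
        table = {("said", ","): -0.7, (",", "however"): -1.0,
                 ("said", "however"): -3.0, ("however", "that"): -0.9,
                 (",", "that"): -2.5}
        return table.get((prev, word), -2.0)

    def insert_commas(words):
        """Restore commas by maximizing the LM score of the token string."""
        logp, seq, last = bigram_logp("<s>", words[0]), [words[0]], words[0]
        for w in words[1:]:
            skip = logp + bigram_logp(last, w)                # no comma
            ins = logp + bigram_logp(last, ",") + bigram_logp(",", w)
            if ins > skip:
                logp, seq = ins, seq + [",", w]
            else:
                logp, seq = skip, seq + [w]
            last = w
        return seq

    print(" ".join(insert_commas(["he", "said", "however", "that"])))
    # -> he said , however that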

    Stevenson and Gaizauskas [25] also conducted experiments on identifying sen-

    tence boundaries in transcriptions of the WSJ corpus using a memory-based learn-


    ing (MBL) algorithm. For each word boundary, they obtained a feature vector of

    13 elements from the word and its neighboring words, including the probability of

    the word starting or ending a sentence, their POS tags, and so on. The precision

and recall of their approach were around 35% when case information of the word was

    removed. The results were much improved when case information was provided to

    their sentence boundary detection system. Clearly, case information is important for

    this method, suggesting that it may not extend well to ASR outputs, which do not

    capture case information and often contain incorrect words.
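Memory-based learning amounts to storing the training boundaries and labeling each test boundary by its nearest stored neighbors. The sketch below uses a k-nearest-neighbor classifier over a reduced, illustrative feature set (the 13 features of [25] are not reproduced), with the symbolic POS features one-hot encoded; a dedicated MBL package such as TiMBL would instead apply an overlap metric directly to the symbols.

    from sklearn.neighbors import KNeighborsClassifier

    POS = ("NN", "VB", "PRP", "DT")

    def one_hot(tag):
        return [1.0 if tag == t else 0.0 for t in POS]

    def boundary_features(p_end, p_start, pos_left, pos_right):
        """p_end/p_start: corpus probabilities that the left word ends and
        the right word starts a sentence; pos_*: tags of the two words."""
        return [p_end, p_start] + one_hot(pos_left) + one_hot(pos_right)

    # Toy memory of labeled boundaries (1 = sentence boundary).
    X = [boundary_features(0.80, 0.60, "NN", "PRP"),
         boundary_features(0.05, 0.02, "DT", "NN"),
         boundary_features(0.70, 0.50, "VB", "PRP")]
    y = [1, 0, 1]

    mbl = KNeighborsClassifier(n_neighbors=1).fit(X, y)
    print(mbl.predict([boundary_features(0.75, 0.55, "NN", "PRP")]))  # [1]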

2.1.2 Combining Textual and Prosodic Information for Sentence Boundary Detection

    Some past research has been conducted on combining prosodic information and

    textual information to find sentence boundaries and their subtypes in speech. It

    is known that there is a strong correspondence between discourse structure and

    prosodic information. A comparison between syntactic and prosodic phrasing was

presented by Fach [26]. In that study, the syntactic structure was generated by Abney's chunk parser [27] and prosodic structure was given by ToBI label files [28]. This work showed that at least 65% of the syntactic boundaries were prosodic boundaries

    in read speech.

    Chen [29] proposed a method combining speech recognition with punctuation

    generation based on acoustic and lexical information using a business letter corpus.

    Punctuation marks were treated as words in the dictionary, with acoustic baseforms

    of silence, breath, and other non-speech sounds, and her language model was mod-

    ified to include punctuation. Chen found that 75.6% of all pauses correspond to

    punctuation marks, and that only 6.5% of the punctuation marks do not correspond

    to pauses. This finding suggests that pauses are closely related to punctuation in

    read speech. Chen conducted a speech recognition and automatic punctuation ex-

    periment on a business letter with 330 words, read aloud by 3 speakers. For different


    testing conditions, Chen reported a result of about 70%-80% accuracy on punctu-

    ation placement, but lower accuracy on correct identification of punctuation types.

Whether these results would carry over to conversational speech or a larger corpus is unknown.

A sentence boundary recognizer using textual information and pause duration was developed by Gotoh and Renals [15]. In their work, for each interword boundary, a

    decision is made about whether there is a sentence boundary or not. Their algorithm

    finds the sequence of sentence boundary classes using speech recognition output by

    combining probabilities from a language model and a pause duration model. They

conducted sentence boundary experiments on 16 hours of the Broadcast News corpus

    using acoustic and duration models trained on 300 hours of acoustic data and using

a language model trained on 9 million words. The word error rate (WER) for their test set was 26.3%. They obtained a recall rate of about 62% and a precision rate

    of 80% for sentence boundary detection. Their study found that a pause duration

    model when used alone performs more accurately than using an N-gram language

model for sentence boundary detection. This is possibly because the language model suffers considerably from the word errors in the recognition output. They found that

    the result is improved by combining these two information sources.
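A per-boundary version of this combination can be sketched as a log-linear interpolation of the two knowledge sources. The component scores below are hypothetical placeholders, and the weight lam is an assumed tuning parameter; the actual system of [15] decodes a sequence of boundary classes rather than classifying each boundary independently.

    import math

    def lm_logp(boundary):
        # Placeholder for the LM probability of the boundary class given
        # the word context (a real model conditions on the actual words).
        return math.log(0.3 if boundary else 0.7)

    def pause_logp(pause_sec, boundary):
        # Placeholder pause-duration model: long pauses favor boundaries.
        p_boundary = min(0.95, pause_sec / 1.0)
        p = p_boundary if boundary else 1.0 - p_boundary
        return math.log(max(p, 1e-6))

    def classify(pause_sec, lam=0.5):
        """Log-linearly combine the two models and pick the better class."""
        scores = {b: (1 - lam) * lm_logp(b) + lam * pause_logp(pause_sec, b)
                  for b in (True, False)}
        return max(scores, key=scores.get)

    print(classify(0.8))   # long pause  -> True (boundary)
    print(classify(0.05))  # short pause -> False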

Shriberg, Stolcke, and their colleagues have built a general HMM framework for combining lexical and prosodic cues for tagging speech with various kinds of hidden

    structural information, including sentence boundaries, disfluencies, topic boundaries,

    dialogue acts, emotion, and so on [12,3033]. Experimental results have shown that

    the combination of the prosody model and language models generally performs better

    than using each knowledge source alone.

    In [12], Shriberg et al. directly compared two corpora (Switchboard and Broad-

    cast News) on the task of sentence segmentation. Experiments were conducted on

    both human transcriptions and speech recognition outputs to compare the degra-

    dation of the prosody model and LM in the face of ASR errors. They extracted

    prosodic features such as pause, phone and rhyme duration, and F0 features, as well

    as other non-prosodic features such as turn change and gender. The features were


    used as inputs to a decision tree model, which predicted the appropriate segment

    boundary type at each interword boundary. They investigated the performance of

    the prosody model, a statistical LM that captures lexical correlations with segment

    boundaries, and a combination of the two models. On Broadcast News, the prosodic

    model alone performed as well as (or even better than) the word-based statistical LM,

    for both human transcriptions and recognized words. They found that the prosody

    model often degraded less in the face of recognition errors. Furthermore, for all tasks

    and corpora, they obtained a significant improvement over the word-only models by

    combining models. Analysis of the decision trees revealed that the prosody model

    captures language-independent boundary indicators, such as pre-boundary length-

    ening, boundary tones, and pitch resets. In addition, feature usage was found to

    be corpus dependent. While pause features were heavily used in both corpora, they

found that duration cues dominated in Switchboard conversational speech; whereas, pitch was a more informative feature in Broadcast News.
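A minimal sketch of this hidden-event HMM decoding follows: the event labels at the word boundaries form the hidden state sequence, a hidden-event LM supplies the transition scores, and the prosody model's output supplies the observation scores. Both scoring functions below are toy placeholders for models estimated from data.

    import math

    EVENTS = ("E", "N")  # E = sentence boundary event, N = no event

    def viterbi(n_boundaries, lm_logp, prosody_logp):
        """Return the best event sequence over n_boundaries word boundaries,
        combining LM transition scores with prosodic observation scores."""
        best = {e: (lm_logp(None, e, 0) + prosody_logp(e, 0), [e])
                for e in EVENTS}
        for i in range(1, n_boundaries):
            new = {}
            for e in EVENTS:
                cands = [(lp + lm_logp(pe, e, i) + prosody_logp(e, i),
                          seq + [e]) for pe, (lp, seq) in best.items()]
                new[e] = max(cands)
            best = new
        return max(best.values())[1]

    # Toy scores: the LM has a mild prior against events; the prosody
    # model strongly favors an event at boundary 2.
    lm = lambda prev_e, e, i: math.log(0.2 if e == "E" else 0.8)
    pros = lambda e, i: math.log(0.9 if (e == "E") == (i == 2) else 0.1)
    print(viterbi(4, lm, pros))  # -> ['N', 'N', 'E', 'N']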

    Kim and Woodland [16] also combined prosodic and lexical information in a

    system designed to identify full stops, question marks, and commas in Broadcast

    News. Their approach is similar to the one used by Shriberg et al. [12]. A prosodic

    decision tree was tested alone and in combination with a language model, with some

    improvements reported through the use of the combined model.

    Christensen et al. [17] investigated two different approaches to automatically

    identify punctuation using the Broadcast News corpus. A finite state approach com-

    bining a linguistic model with a prosody model significantly reduced the detection

    error rate and increased the related precision and recall measures, especially when

    using pause duration. They also showed how prosodic features like pause duration

    increased detection accuracy for full stops but had very little impact for detecting

the other types of punctuation marks. The second approach used a multi-layer perceptron (MLP) to model the prosodic features. This approach provides insight into

    the relationship between the individual prosodic features and the various punctua-


    tion marks. The results confirmed that pause duration features are the most useful

    features for finding full stops.

    Huang and Zweig [34] developed a maximum entropy based method to add punc-

tuation (period, comma, and question mark) into transcriptions for the Switchboard corpus. Features used in their models involve the neighboring words, the tags (punc-

    tuation marks) associated with the previous words, and pause features. They evalu-

    ated this approach on both the reference transcription and speech recognition output.

    Performance was measured using precision, recall, and F-measure. Results showed

    that performance varies for the different punctuation marks, and adding the bigram

    type of features (features about the previous and the current position, or the current

and the next position) improves F-measure by about 4% over unigram information. They noticed that adding pause information yields only a small gain, in contrast to the results reported for Broadcast News speech (such as [16]). This could be

    attributed to the different data sets, or to a suboptimal use of pause information

in this maximum entropy approach. They also observed that a comma is hard to distinguish from no punctuation, and that a question mark is often confused with a

    period. This approach provides a good framework for designing additional features.

    The maximum entropy approach will be investigated further in Chapter 8.
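The flavor of such a model can be sketched with binary indicator features and multinomial logistic regression, which is equivalent to a maximum entropy classifier. The feature templates and the tiny training set below are illustrative, not the feature set of [34].

    from sklearn.feature_extraction import DictVectorizer
    from sklearn.linear_model import LogisticRegression

    def features(prev_word, next_word, pause_sec):
        return {
            "prev=" + prev_word: 1,
            "next=" + next_word: 1,
            "bigram=" + prev_word + "_" + next_word: 1,  # "bigram-type" feature
            "pause_bin=" + ("long" if pause_sec > 0.5 else "short"): 1,
        }

    train = [(features("know", "so", 0.9), "PERIOD"),
             (features("well", "i", 0.3), "COMMA"),
             (features("of", "the", 0.02), "NONE")]
    vec = DictVectorizer()
    X = vec.fit_transform([f for f, _ in train])
    y = [label for _, label in train]
    model = LogisticRegression(max_iter=1000).fit(X, y)

    print(model.predict(vec.transform([features("know", "so", 0.8)])))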

    In the 2003 NIST sentence boundary detection evaluation, all the systems used

    both prosodic and textual features for sentence boundary detection [35]. The ap-

    proaches used are similar to the HMM approach used in [12]. For example, one

    system estimated the likelihood of three classes: complete sentence, incomplete sen-

    tence, and non-sentence. They used 48 acoustic-prosodic features estimated for each

    word boundary, including pause, speaking rate, energy, and pitch features. These

    prosodic features were used to train a 2-layer neural network. A linguistic subsystem

    used a trigram LM which has sentence tokens inserted between words. The com-

    bined decoder used the likelihood of the sentence classes from the acoustic-prosodic

    subsystem and the likelihood from the linguistic system, along with a Viterbi al-

    gorithm to find the class hypothesis at each word boundary. In another system, a


    decision tree was used to predict 4 classes: complete sentence, incomplete sentence,

    interruption point in edit disfluencies, or non-event boundary. The prosodic features

    provided to the decision tree are similar to the ones described in [12]. In addition,

    the posterior probability from the LMs was included as a feature in the decision

    tree. These two systems were further combined using a 2-layer neural network which

    uses the minimum square error back-propagation algorithm to hypothesize a binary

    score at each word boundary. These systems were evaluated on both the Conversa-

    tional Telephone Speech (CTS) and Broadcast News speech (BN), using both human

    transcriptions and speech recognition output.
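The final combination step in these systems can be sketched as follows: the per-boundary scores from the two subsystems become the inputs to a small network trained with squared error to produce a single boundary score. The sizes, threshold, and data below are illustrative, not those of the evaluation systems.

    import numpy as np
    from sklearn.neural_network import MLPRegressor

    # Columns: P(boundary) from subsystem 1 and from subsystem 2;
    # targets are the reference boundary labels.
    X = np.array([[0.9, 0.8], [0.2, 0.4], [0.7, 0.3], [0.1, 0.1]])
    y = np.array([1.0, 0.0, 1.0, 0.0])

    combiner = MLPRegressor(hidden_layer_sizes=(4,), max_iter=5000,
                            random_state=0).fit(X, y)
    score = combiner.predict([[0.8, 0.6]])[0]
    print("boundary" if score > 0.5 else "no boundary")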

    There is also some work that relies on only the prosodic information for finding

the sentence units. Wang and Narayanan [36] developed a method that uses only prosodic features (mostly pitch features) in a multi-pass approach. They did not use any word or phone alignments and thus avoided using a speech recognizer. In the first pass, they fit the pitch contour with two linear folds and search for major breaks in the contour.

    Then in the second pass, sentence boundaries are detected based on some pre-defined

    rules and statistics. They evaluated this algorithm using a subset of the Switchboard

    corpus, and obtained a false alarm rate of 17.9% and a miss rate of 7.1%. This result

is encouraging since only pitch information is used. However, in conversational speech, pitch may not be a very effective feature for sentence boundary detection.

We would expect that adding other prosodic and textual information would yield further improvement.
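The first pass of such a pitch-only approach can be sketched as a two-piece linear fit: for each candidate breakpoint, fit one line to each side of the pitch track, keep the breakpoint with the smallest squared error, and inspect the pitch reset across it. The synthetic contour and the notion of what counts as a major break below are illustrative.

    import numpy as np

    def best_break(t, f0):
        """Return (breakpoint index, residual, pitch reset) for the best
        two-piece linear fit to the pitch track f0 sampled at times t."""
        best = None
        for k in range(2, len(t) - 2):
            resid, ends = 0.0, []
            for sl in (slice(0, k), slice(k, len(t))):
                coef = np.polyfit(t[sl], f0[sl], 1)
                resid += float(np.sum((f0[sl] - np.polyval(coef, t[sl])) ** 2))
                ends.append((np.polyval(coef, t[sl][-1]),  # value at segment end
                             np.polyval(coef, t[sl][0])))  # value at segment start
            reset = ends[1][1] - ends[0][0]  # start of 2nd line minus end of 1st
            if best is None or resid < best[1]:
                best = (k, resid, reset)
        return best

    # Synthetic contour: a falling declination followed by a pitch reset.
    t = np.arange(20.0)
    f0 = np.concatenate([180 - 4 * np.arange(10.0), 200 - 2 * np.arange(10.0)])
    k, resid, reset = best_break(t, f0)
    print(k, round(reset, 1))  # -> 10 56.0 (a large reset suggests a boundary)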

    2.1.3 Summary of Past Research on Sentence Boundary Detection

    Finding sentence-like units and their subtypes can make transcriptions more read-

    able, while also aiding downstream language processing modules, which typically

    expect sentence-like segments. Previous work has shown that lexical cues are a

    valuable knowledge source for determining punctuation roles and detecting sentence

    boundaries, and that prosody provides additional important information for spo-


    ken language processing. Useful prosodic features include pause, word lengthening,

    and pitch patterns. Past experiments also show that detecting sentence boundaries

is easier than reliably determining sentence subtypes or sentence-internal

    breaks (e.g., commas). The poor performance of sentence-internal structure detec-

    tion also affects downstream processing, such as parsing [2]. Table 2.1 summarizes

    important attributes of much of the previous research. Most make use of textual

    information, either by using a statistical LM or employing other machine learning

    strategies. The value of adding more syntactic information to the task of sentence

    detection is an open question. The approaches listed in the first five rows are simi-

    lar to the approach taken in this thesis, since textual and prosodic information are

    combined for sentence boundary detection.

    2.2 Edit Disfluency Processing

    Disfluencies have been investigated using a variety of approaches. Linguists and

    psychologists have considered disfluencies largely from a production and perception

    standpoint; whereas, computational linguists have been more concerned with recog-

    nizing disfluencies and thus improving machine recognition of spontaneous speech.

    Although the latter is our main focus, we believe that a better understanding of the

underlying theory of disfluency production and its effect on listeners' comprehension

    can help to construct a better model for the automatic detection of disfluencies;

    therefore, we will briefly discuss some studies in psychology and linguistics.

    2.2.1 Production and Properties of Disfluencies

    Disfluency Production

    Disfluencies are very common in spontaneous speech. When speakers cannot

    formulate an entire utterance at once or when they change their minds about what

they are saying, they may suspend their speech and introduce a pause or filler before continuing, or add, delete, or replace words they have already produced.


Table 2.1
A summary of some important prior studies on sentence boundary detection. Column two is the task chosen for each investigation: "boundary" means the sentence boundary detection task, as opposed to its subtype or punctuation detection; column three describes the model or the information sources used by each investigation; column four is the corpus on which the experiments were conducted; column five indicates whether the experiments were performed on human transcriptions (Ref) or recognition results (ASR). Note that CTS (i.e., conversational telephone speech) is used in the corpus column for those experiments that were conducted on the Switchboard corpus. Even though no textual information is used in the pitch-only detection model, the Ref condition is used in that study for its evaluation.

Investigation              | Classification Task   | Model                  | Corpus            | Ref or ASR
Shriberg et al. [12]       | boundary              | prosody, word-LM       | CTS, BN           | Ref, ASR
Gotoh, Renals [15]         | boundary              | pause, word-LM         | BN                | ASR
Kim, Woodland [16]         | punctuation           | prosody, word-LM       | BN                | Ref
Huang, Zweig [34]          | punctuation           | Maxent (word, pause)   | CTS               | Ref, ASR
NIST eval systems [35]     | boundary              | prosody, word-LM       | CTS, BN           | Ref, ASR
Beeferman [24]             | commas given boundary | word-LM                | WSJ               | Ref
Stevenson, Gaizauskas [25] | boundary              | MBL (word, POS)        | WSJ               | Ref
Chen [29]                  | punctuation           | punctuation token with | a business letter | ASR
                           |                       | acoustic information   |                   |
Wang, Narayanan [36]       | boundary              | pitch                  | CTS               | Ref


Spontaneous

    speech is systematically shaped by the problems speakers encounter while planning

    an utterance, accessing lexical items, and articulating a speech plan. Speech errors

    and disfluencies produced by normal speakers have been studied for decades to learn

about linguistic production and the cognitive processes of speech planning [37-39].

    Disfluency has been used as evidence for cognitive load in speech planning. Ovi-

    att [40] and Shriberg [41] have shown in different types of task-oriented conversations

    that long utterances have a higher disfluency rate than short ones. This effect may

    be related to the planning load of the utterance, i.e., speakers have more difficulty

    planning longer utterances, while making task-oriented plans at the same time. An-

    other observation is that disfluencies occur more frequently at the beginning of an

    utterance when the utterance is at an early planning stage, providing evidence of

    the impact of utterance planning on disfluencies.

    Clark and Wasow [42] studied the phenomenon of repeated words in spontaneous

    speech. In their work, repeats are divided into four stages: initial commitment,

    suspension of speech, hiatus, and restart of the constituent. These four stages cor-

    respond to the four components (i.e., reparandum, interruption, editing term, and

    correction) that have been laid out in Chapter 1 for all edit disfluencies. They pro-

posed a commit-and-restore model of repeated words, as well as three hypotheses to

    account for the repeats, namely, the complexity hypothesis, the continuity hypoth-

    esis, and the commitment hypothesis. They hypothesize that the more complex a

    constituent, the more likely speakers are to suspend it after an initial commitment

    to it (i.e., complexity hypothesis), and that speakers prefer to produce constituents

    with a continuous delivery (i.e., continuity hypothesis), and that speakers make a

    preliminary commitment to constituents, expecting to suspend them afterward (i.e.,

    commitment hypothesis). They analyzed repeated articles and pronouns in two large

    corpora, the Switchboard corpus and the London-Lund corpus,3 and found strong

    empirical evidence to support the proposed commit-and-restore model, along with

    3See [42] for a description of the corpus.


    evidence for all three hypotheses. They noticed that speakers are more likely to make

    a premature commitment, and then immediately suspend it when the constituent

becomes more complex, and that speakers are more likely to restart a constituent the more their suspension disrupts the utterance. One example is the frequent

    occurrence of function words in repeats. It has long been recognized for English

    that function words are repeated far more often than content words. When speakers

want to make an initial commitment to a constituent, the word they most commonly use is a function word. Overall, Clark and Wasow [42] found that function

    words were repeated more than ten times as often as content words, 25.2 versus 2.4

    per thousand in the Switchboard corpus. This more frequent occurrence of function

    words in repeats is explained by the three hypotheses they proposed.

    Knowing the types of words that speakers tend to repeat (or revise) is helpful

    for building a better model of spontaneous speech. For example, when speakers

repair a content word, they often return to a major constituent boundary, such as "on Friday, I mean, on Monday." Such an observation is beneficial for defining

    disfluency patterns and can aid in automatically identifying them.

    Effect on Listeners

    It is also valuable to understand how human listeners cope with disfluent input.

Studies by Lickley [43] and by Lickley and Bard [5] have shown that listeners generally miss disfluencies or report their occurrence incorrectly, suggesting that

    disfluencies may have been filtered out for utterance comprehension. Psycholinguists

    believe that disfluencies play specific roles in our communication, such as sending sig-

    nals to the listener to do things like pay more attention, help the speaker find a word,

    or be patient while the speaker gathers his or her thoughts. Disfluencies provide in-

    formation that enables people in a conversation to better coordinate interaction and

    manage turn-taking [41].


    Brennan [44] investigated how comprehension is affected when listeners hear dis-

    fluent speech. In her experiments, listeners followed fluent and disfluent instructions

    for selection of an object in a graphical display. She found that listeners make fewer

    errors when hearing less misleading information before the interruption points of

    disfluencies. She also observed that mid-word interruptions are better signals than

    between-word interruptions that a word was produced in error and that the speaker

intends to replace it. This supports Levelt's hypothesis [38] that by interrupting

    a word, a speaker signals to the addressee that that word is an error. If a word is

    completed, the speaker intends the listeners to interpret it as correctly delivered.

    Brennan also found in her experiments that there is information in disfluencies that

partially compensates for any disruption that listeners encounter while processing disflu-

    ent speech.

Fox Tree [45] studied how naturally occurring speech disfluencies affect listeners'

    comprehension. She observed that disfluencies do not always have a negative effect

    on comprehension. For example, repetitions do not hinder the listeners, because they

    can help listeners to recover information missing in the first occurrence of words that

    are repeated. However, it does take longer to identify words when there is a false

start. When false starts begin utterances, listeners may abort the false starts with no cost to comprehension. But if false starts occur in the middle of utterances, listeners

    have to figure out where the false start begins, what to abort, and where to attach

    the restarted information. This process slows down comprehension.

    Disfluency Rates

    A conservative estimate (excluding silent hesitations) for the rate of disfluencies4

    in spontaneous speech is approximately 6 words per 100 words [45]. There are a

    variety of