
Interactively Skimming Recorded Speech

    Barry Michael Arons

B.S. Civil Engineering, Massachusetts Institute of Technology, 1980
M.S. Visual Studies, Massachusetts Institute of Technology, 1984

Submitted to the Program in Media Arts and Sciences in partial fulfillment of the requirements for the degree of Doctor of Philosophy at the Massachusetts Institute of Technology

    February 1994

    Author: Program in Media Arts and Sciences January 7, 1994

    Certified by: Christopher M. Schmandt Principal Research Scientist Thesis Supervisor

    Accepted by: Stephen A. Benton Chair, Departmental Committee on Graduate Students Program in Media Arts and Sciences


© 1994 Massachusetts Institute of Technology. All rights reserved.


    Interactively Skimming Recorded Speech

    Barry Michael Arons

Submitted to the Program in Media Arts and Sciences on January 7, 1994, in partial fulfillment of the requirements for the degree of Doctor of Philosophy.

Abstract

Listening to a speech recording is much more difficult than visually scanning a document because of the transient and temporal nature of audio. Audio recordings capture the richness of speech, yet it is difficult to directly browse the stored information. This dissertation investigates techniques for structuring, filtering, and presenting recorded speech, allowing a user to navigate and interactively find information in the audio domain. This research makes it easier and more efficient to listen to recorded speech by using the SpeechSkimmer system.

First, this dissertation describes Hyperspeech, a speech-only hypermedia system that explores issues of speech user interfaces, browsing, and the use of speech as data in an environment without a visual display. The system uses speech recognition input and synthetic speech feedback to aid in navigating through a database of digitally recorded speech. This system illustrates that managing and moving in time are crucial in speech interfaces. Hyperspeech uses manually segmented and structured speech recordings—a technique that is practical only in limited domains.

Second, to overcome the limitations of Hyperspeech while retaining browsing capabilities, a variety of speech analysis and user interface techniques are explored. This research exploits properties of spontaneous speech to automatically select and present salient audio segments in a time-efficient manner. Two speech processing technologies, time compression and adaptive speech detection (to find hesitations and pauses), are reviewed in detail with a focus on techniques applicable to extracting and displaying speech information.

Finally, this dissertation describes SpeechSkimmer, a user interface for interactively skimming speech recordings. SpeechSkimmer uses simple speech processing techniques to allow a user to hear recorded sounds quickly, and at several levels of detail. User interaction, through a manual input device, provides continuous real-time control of the speed and detail level of the audio presentation. SpeechSkimmer incorporates time-compressed speech, pause removal, automatic emphasis detection, and non-speech audio feedback to reduce the time needed to listen. This dissertation presents a multi-level structural approach to auditory skimming, and user interface techniques for interacting with recorded speech.

Thesis Supervisor: Christopher M. Schmandt
Principal Research Scientist, Program in Media Arts and Sciences

This work was performed in the Media Laboratory at MIT. Support for this research was provided, in part, by Apple Computer, Inc., Interval Research Corporation, and Sun Microsystems, Inc. The ideas expressed herein do not necessarily reflect those of the supporting agencies.


    Doctoral Dissertation Committee

Thesis Advisor: Christopher M. Schmandt
Principal Research Scientist
Program in Media Arts and Sciences

Thesis Reader: Nathaniel I. Durlach
Senior Research Scientist
Department of Electrical Engineering and Computer Science

Thesis Reader: Thomas W. Malone
Patrick J. McGovern Professor of Information Systems
Sloan School of Management


    Acknowledgments

Chris Schmandt and I have been building interactive speech systems together for over a decade. Chris is one of the founding fathers of the conversational computing field, an area of emerging importance that has kept me motivated and busy over the years. Thanks for helping me get here.

Nat Durlach has been a friendly and supportive influence, challenging me to think about how people operate and how we can get machines to communicate with them efficiently and seamlessly.

Tom Malone has provided much encouragement and insight into the way humans, machines, and interfaces should work in the future.

Don Jackson, from Sun Microsystems, provided support and friendship since before this dissertation process even started.

Eric Hulteen, from Apple Computer, critiqued many of my early ideas and helped shape their final form.

David Liddle and Andrew Singer, from Interval Research, got me involved in what could be the best place since the ArcMac.

Lisa Stifelman provided valuable input in user interface design, assisted in the design and the many hours of carrying out the usability test, taught me the inner workings of the Macintosh, and helped edit this document.

Meg Withgott lent her help and expertise on pitch, emphasis, and segmentation.

Michele Covell and Michael Halle provided technical support (when I needed it most) beyond the call of duty. I hope to repay the favor someday.

Special thanks to Linda Peterson for her time, patience, and help in keeping me (and the entire Media Arts and Sciences program) on track.

Steve Benton provided behind the scenes backing and a helping hand (or signature) when it was needed.


Gayle Sherman provided advice, signatures, and assistance in managing the bureaucracy at the Media Laboratory and MIT.

Doug Reynolds ran his speaker identification software on my recordings. Don Hejna provided the SOLAFS implementation. Jeff Herman integrated the BBC radio data into SpeechSkimmer.

Thanks to Marc Davis, Abbe Don, Judith Donath, Nicholas Negroponte, and William Stasior for their assistance, advice, and for letting me use recordings of their speech. Tom Malone’s spring 1993 15.579A class also allowed me to record parts of two of their discussions.

Jonathan Cohen and Gitta Salomon provided important help along the way. Walter Bender, Janet Cahn, George Furnas, Andrew Kass, Paul Resnick, Louis Weitzman, and Yin Yin Wong also contributed to this work in important ways.

Portions of this dissertation originally appeared in ACM conference proceedings as Arons 1991a and Arons 1993a, Copyright 1991 and 1993, Association for Computing Machinery, Inc., reprinted by permission. Parts of this work originally appeared in AVIOS conference proceedings as Arons 1992a and Arons 1991b, reprinted by permission of the American Voice I/O Society.

When I completed my Master of Science degree in 1984 the idea of going on to get a doctorate was completely out of the question, so I dedicated my thesis to the memory of my parents. I didn’t need or want a Ph.D., writing was a struggle, and I wanted to get out of MIT. It is now ten years later, I’m back from the west coast, and I am completing my dissertation. Had I known, I would have waited and instead dedicated this dissertation to them for instilling me with a sense of curiosity, a passion for learning and doing, and the insight that has brought me to this point in my life.


    Contents

    Abstract 3

    Doctoral Dissertation Committee 5

    Acknowledgments 7

    Contents 9

    Figures 13

1 Motivation and Related Work 15
1.1 Defining the Title 16
1.2 Introduction 17

1.2.1 The Problem: Current User Scenarios 17
1.2.1.1 Searching for Audio in Video 18
1.2.1.2 Lecture Retrieval 18

1.2.2 Speech Is Important 19
1.2.3 Speech Storage 20
1.2.4 A Non-Visual Interface 21
1.2.5 Dissertation Goal 22

1.3 Skimming this Document 22
1.4 Related Work 23

1.4.1 Segmentation 24
1.4.2 Speech Skimming and Gisting 25
1.4.3 Speech and Auditory Interfaces 28

1.5 A Taxonomy of Recorded Speech 29
1.6 Input (Information Gathering) Techniques 32

1.6.1 Explicit 32
1.6.2 Conversational 33
1.6.3 Implicit 33

1.6.3.1 Audio and Stroke Synchronization 35
1.7 Output (Presentation) Techniques 35

1.7.1 Interaction 36
1.7.2 Audio Presentation 36

    1.8 Summary 37

2 Hyperspeech: An Experiment in Explicit Structure 39
2.1 Introduction 39

2.1.1 Application Areas 40
2.1.2 Related Work: Speech and Hypermedia Systems 40

2.2 System Description 41
2.2.1 The Database 41
2.2.2 The Links 44
2.2.3 Hardware Platform 45
2.2.4 Software Architecture 46

2.3 User Interface Design 46
2.3.1 Version 1 47
2.3.2 Version 2 47

2.4 Lessons Learned on Skimming and Navigating 50
2.4.1 Correlating Text with Recordings 52


2.4.2 Automated Approaches to Authoring 52
2.5 Thoughts on Future Enhancements 53

2.5.1 Command Extensions 53
2.5.2 Audio Effects 54

    2.6 Summary 55

3 Time Compression of Speech 57
3.1 Introduction 57

3.1.1 Time Compression Considerations 58
3.1.2 A Note on Compression Figures 59

3.2 General Time Compression Techniques 59
3.2.1 Speaking Rapidly 59
3.2.2 Speed Changing 60
3.2.3 Speech Synthesis 60
3.2.4 Vocoding 60
3.2.5 Pause Removal 60

3.3 Time Domain Techniques 61
3.3.1 Sampling 61
3.3.2 Sampling with Dichotic Presentation 62
3.3.3 Selective Sampling 63
3.3.4 Synchronized Overlap Add Method 64

3.4 Frequency Domain Techniques 65
3.4.1 Harmonic Compression 65
3.4.2 Phase Vocoding 66

3.5 Tools for Exploring the Sampling Technique 66
3.6 Combined Time Compression Techniques 67

3.6.1 Pause Removal and Sampling 67
3.6.2 Silence Removal and SOLA 68
3.6.3 Dichotic SOLA Presentation 68

3.7 Perception of Time-Compressed Speech 69
3.7.1 Intelligibility versus Comprehension 69
3.7.2 Limits of Compression 69
3.7.3 Training Effects 71
3.7.4 The Importance of Pauses 72

    3.8 Summary 74

4 Adaptive Speech Detection 75
4.1 Introduction 75
4.2 Basic Techniques 76
4.3 Pause Detection for Recording 78

4.3.1 Speech Group Empirical Approach: Schmandt 79
4.3.2 Improved Speech Group Algorithm: Arons 80
4.3.3 Fast Energy Calculations: Maxemchuk 82
4.3.4 Adding More Speech Metrics: Gan 82

4.4 End-point Detection 83
4.4.1 Early End-pointing: Rabiner 83
4.4.2 A Statistical Approach: de Souza 83
4.4.3 Smoothed Histograms: Lamel et al. 84
4.4.4 Signal Difference Histograms: Hess 87
4.4.5 Conversational Speech Production Rules: Lynch et al. 89

4.5 Speech Interpolation Systems 89
4.5.1 Short-term Energy Variations: Yatsuzuka 91
4.5.2 Use of Speech Envelope: Drago et al. 91
4.5.3 Fast Trigger and Gaussian Noise: Jankowski 91

4.6 Adapting to the User’s Speaking Style 92
4.7 Summary 93

5 SpeechSkimmer 95
5.1 Introduction 95


5.2 Time Compression and Skimming 97
5.3 Skimming Levels 98

5.3.1 Skimming Backward 100
5.4 Jumping 101
5.5 Interaction Mappings 102
5.6 Interaction Devices 103
5.7 Touchpad Configuration 105
5.8 Non-Speech Audio Feedback 106
5.9 Acoustically Based Segmentation 107

5.9.1 Recording Issues 108
5.9.2 Processing Issues 108
5.9.3 Speech Detection for Segmentation 109
5.9.4 Pause-based Segmentation 113
5.9.5 Pitch-based Emphasis Detection for Segmentation 114

5.10 Usability Testing 118
5.10.1 Method 119

5.10.1.1 Subjects 119
5.10.1.2 Procedure 119

5.10.2 Results and Discussion 120
5.10.2.1 Background Interviews 121
5.10.2.2 First Intuitions 121
5.10.2.3 Warm-up Task 121
5.10.2.4 Skimming 122
5.10.2.5 No Pause 122
5.10.2.6 Jumping 123
5.10.2.7 Backward 123
5.10.2.8 Time Compression 124
5.10.2.9 Buttons 124
5.10.2.10 Non-Speech Feedback 125
5.10.2.11 Search Strategies 125
5.10.2.12 Follow-up Questions 126
5.10.2.13 Desired Functionality 126

5.10.3 Thoughts for Redesign 127
5.10.4 Comments on Usability Testing 129

5.11 Software Architecture 129
5.12 Use of SpeechSkimmer with BBC Radio Recordings 130
5.13 Summary 131

6 Conclusion 133
6.1 Evaluation of the Segmentation 133
6.2 Future Research 135
6.3 Evaluation of Segmentation Techniques 135

6.3.1 Combining SpeechSkimmer with a Graphical Interface 136
6.3.2 Segmentation by Speaker Identification 137
6.3.3 Segmentation by Word Spotting 137

    6.4 Summary 138

    Glossary 141

    References 143


    Figures

Fig. 1-1. A consumer answering machine with time compression. 27
Fig. 1-2. A close-up view of the digital message shuttle. 27
Fig. 1-3. A view of the categories in the speech taxonomy. 31
Fig. 2-1. The “footmouse” built and used for workstation-based transcription. 43
Fig. 2-2. Side view of the “footmouse.” 43
Fig. 2-3. A graphical representation of the nodes in the database. 44
Fig. 2-4. Graphical representation of all links in the database (version 2). 45
Fig. 2-5. Hyperspeech hardware configuration. 46
Fig. 2-6. Command vocabulary of the Hyperspeech system. 48
Fig. 2-7. A sample Hyperspeech dialog. 49
Fig. 2-8. An interactive repair. 50
Fig. 3-1. Sampling terminology. 61
Fig. 3-2. Sampling techniques. 62
Fig. 3-3. Synchronized overlap add (SOLA) method. 65
Fig. 3-4. Parameters used in the sampling tool. 67
Fig. 4-1. Time-domain speech metrics for frames N samples long. 77
Fig. 4-2. Threshold used in Schmandt algorithm. 79
Fig. 4-3. Threshold values for a typical recording. 81
Fig. 4-4. Recording with speech during initial frames. 81
Fig. 4-5. Energy histograms with 10 ms frames. 85
Fig. 4-6. Energy histograms with 100 ms frames. 86
Fig. 4-7. Energy histograms of speech from four different talkers. 87
Fig. 4-8. Signal and differenced signal magnitude histograms. 88
Fig. 4-9. Hangover and fill-in. 90
Fig. 5-1. Block diagram of the interaction cycle of the speech skimming system. 97
Fig. 5-2. Ranges and techniques of time compression and skimming. 97
Fig. 5-3. Schematic representation of time compression and skimming ranges. 98
Fig. 5-4. The hierarchical “fish ear” time-scale continuum. 99
Fig. 5-5. Speech and silence segments played at each skimming level. 99
Fig. 5-6. Schematic representation of two-dimensional control regions. 102
Fig. 5-7. Schematic representation of one-dimensional control regions. 103
Fig. 5-8. Photograph of the thumb-operated trackball tested with SpeechSkimmer. 104
Fig. 5-9. Photograph of the joystick tested with SpeechSkimmer. 104
Fig. 5-10. The touchpad with paper guides for tactile feedback. 105
Fig. 5-11. Template used in the touchpad. 106
Fig. 5-12. Mapping of the touchpad control to the time compression range. 106
Fig. 5-13. A 3-D plot of average magnitude and zero crossing rate histogram. 110
Fig. 5-14. Average magnitude histogram showing a bimodal distribution. 111
Fig. 5-15. Histogram of noisy recording. 111
Fig. 5-16. Sample speech detector output. 113
Fig. 5-17. Sample segmentation output. 114
Fig. 5-18. F0 plot of a monologue from a male talker. 116
Fig. 5-19. Close-up of F0 plot in figure 5-18. 117
Fig. 5-20. Pitch histogram for 40 seconds of a monologue from a male talker. 117
Fig. 5-21. Comparison of three F0 metrics. 118
Fig. 5-22. Counterbalancing of experimental conditions. 120
Fig. 5-23. Evolution of SpeechSkimmer templates. 127
Fig. 5-24. Sketch of a revised skimming template. 128
Fig. 5-25. A jog and shuttle input device. 129
Fig. 5-26. Software architecture of the skimming system. 130


    1 Motivation and Related Work

As life gets more complex, people are likely to read less and listen more.

    (Birkerts 1993, 111)

Speech has evolved as an efficient means of human-to-human communication, with our vocal output reasonably tuned to our listening and cognitive capabilities. While we have traditionally been constrained to listen only as fast as someone speaks, the ancient Greek philosopher Zeno said “Nature has given man one tongue, but two ears so that we may hear twice as much as we speak.” In the last 40 years technology has allowed us to increase our speech listening rate, but only by a factor of about two. This dissertation postulates that through appropriate processing and interaction techniques it is possible to overcome the time bottleneck traditionally associated with using speech—that we can skim and listen many times faster than we can speak.

This research addresses issues of accessing and listening to speech in new ways. It reviews a variety of areas related to high-speed listening, presents two technical explorations of skimming and navigating in speech recordings, and provides a framework for thinking about such systems. This work is not an incremental change from what exists today—the techniques and user interfaces presented herein are a whole new way to think about speech and audio.

Important ideas addressed by this work include:

• The importance of time when listening to and navigating in recorded speech.
• Techniques to provide multi-level structural representations of the content of speech recordings.
• User interfaces for accessing speech recordings.

This chapter describes why skimming speech is an important but difficult issue, reviews background material, and provides an overview of this research.


1.1 Defining the Title

The most difficult problem in performing a study on speech-message information retrieval is defining the task.

    (Rose 1991, 46)

The title of this work, Interactively Skimming Recorded Speech, is important for a variety of reasons. It has explicitly been made concise to describe and exemplify what this document is about—all unnecessary words and redundancies were removed. Working backward through the title, each word is defined in the context of this document:

Speech is “the communication or expression of thoughts in spoken words” (Webster 1971, 840). Although portions of this work are applicable to general audio, “speech” is used because it is our primary form of interpersonal communication, and the research focuses on exploiting information that exists only in the speech channel.

Recorded indicates that speech is stored for later retrieval. The form and format of the storage are not important, although a random access digital store is used and assumed throughout this document. While the systems described run in “real time,” they must be used on existing speech. While someone is speaking, it is possible to review what they have already said, but it is impossible to browse forward in time beyond the current instant.

Skimming means “to remove the best … contents from” and “to pass lightly or hastily: glide or skip along, above, or near a surface” (Webster 1971, 816). Skim is used to mean quickly extracting the salient information from a speech recording, and is similar to browsing where one wanders around to see what turns up. Skimming and browsing techniques are often used while searching—where one is looking for a particular piece of information. Scanning is similar to skimming in many ways, yet it connotes a more careful examination using vision. Skimming here is meant to be performed with the ears. Note that the research concentrates on skimming rather than summarization, since it is most general, computationally tractable, and applicable to a wide range of problems.

Interactively indicates that the speech is not only presented, but segments of speech are selected and played through the mutual actions of the listener and the skimming system. The listener is an active and critical component of the system.


1.2 Introduction

Reading, because we control it, is adaptive to our needs and rhythms.… Our ear, and with it our whole imaginative apparatus, marches in lockstep to the speaker’s baton.

    (Birkerts 1993, 111)

Skimming, browsing, and searching are traditionally considered visual tasks. One easily performs them while reading a newspaper, window shopping, or driving a car. However, there is no natural way for humans to skim speech information because of the transient nature of audio—the ear cannot skim in the temporal domain the way the eyes can browse in the spatial domain.

Speech is a powerful communications medium—it is, among other things, natural, portable, and rich in information, and it can be used while doing other things. Speech is efficient for the talker, but is usually a burden on the listener (Grudin 1988). It is faster to speak than it is to write or type; however, it is slower to listen than it is to read. This research integrates information from multiple sources to overcome some of the limitations of listening to speech. This is accomplished by exploiting some of the regular properties of speech to enable high-speed skimming.

    1.2.1 The Problem: Current User Scenarios

Recorded audio is currently used by many people in a variety of situations including:

• lectures and interviews on microcassette
• voice mail
• tutorial and motivational material on audio cassette
• conference proceedings on tape
• books on tape
• time-shifted, or personalized, radio and television programs

Such “personal” uses are in addition to commercial uses such as:

• story segments for radio shows
• using the audio track for editing a video tape story line
• information gathering by law enforcement agencies

This section presents some everyday situations that demonstrate the problems of using stored speech that are addressed by this research.


    1.2.1.1 Searching for Audio in Video

When browsing the meeting record sequentially, it is convenient to replay it in meaningful units. In a medium such as videotape, this can be difficult since there is no way of identifying the start of a meaningful unit. When fast forwarding through a videotape of the meeting, people [reported they] … frequently ended up in the middle of a discussion rather than the start.

    (Wolf 1992, 4)

There are important problems in the field of video production, logging, and editing that are better addressed in the audio domain than in the visual domain (Davenport 1991; Pincever 1991). For example, after a television reporter and crew conduct an interview for the nightly news they edit the material under tight time and content constraints. After the interview, the reporter’s primary task is typically to find the most appropriate “sound bite” before the six o’clock news. This is often done in the back of a news van using the reporter’s hand-scribbled notes, and by searching around an audio recording on a microcassette recorder.

Finding information on an audio cassette is difficult. Besides the fact that a tape only provides linear access to the recording, there are several confounding factors that make browsing audio difficult. Although speaking is faster than writing or typing, listening to speech is slow compared to reading. Moreover, the ear cannot browse an audio tape. A recording can be sped up on playback; however, on most conventional tape players this is accompanied by a change in pitch, resulting in a loss of intelligibility.

Note that the task may not be any easier if the reporter has a videotape of the interview. The image of a “talking head” adds few useful cues since the essential content information is in the audio track. When the videotape is shuttled at many times normal speed an identifiable image can be displayed, yet the audio rapidly becomes unintelligible.

    1.2.1.2 Lecture Retrieval

When attending a presentation or a lecture, one often takes notes by hand or with a notebook computer. Typically, the listener has to choose between listening for the sake of understanding the high-level ideas of the speaker or taking notes so that none of the low-level details are lost. The listener may attempt to capture one level of detail or the other, but because of time constraints it is often difficult to capture both perspectives. Few people make audio recordings of talks or lectures. It is much easier to browse one’s hand-written notes than it is to listen to a recording. Listening is a time-consuming real time process.

If an audio recording is made, it is difficult to access a specific piece of information. For example, in a recorded lecture one may wish to review a short segment that describes a single mathematical detail. The listener has two choices in attempting to find this small chunk of speech: the entire tape can be played from the beginning until the desired segment is heard, or the listener can jump around the tape attempting to find the desired information. The listener’s search is typically inefficient, time-consuming, and frustrating because of the linear nature of the tape and the medium. In listening to small snippets of speech from the tape, it is difficult to find audio landmarks that can be used to constrain the search. Users may try to minimize their search time by playing only short segments from the tape, yet they are as likely to play an unrelated comment or a pause as they are to stumble across an emphasized word or an important phrase. It is difficult to perform an efficient search even when using a recorder that has a tape counter or time display.

This audio search task is analogous to trying to find a particular scene in a video tape, where the image can only be viewed in play mode (i.e., the screen is blank while fast forwarding or rewinding). The user is caught between the slow process of looking or listening, and an inefficient method of searching in the visual or auditory domain.

    1.2.2 Speech Is Important

Speech is a rich and expressive medium (Chalfonte 1991). In addition to the lexical content of our spoken words, our emotions and important syntactic and semantic information are captured by the pitch, timing, and amplitude of our speech. At times, more semantic information can be transmitted by the use of silence than by the use of words. Such information is difficult to convey in a text transcription, and is best captured in the sounds themselves.

Transcripts are useful in electronically searching for keywords or visually skimming for content. Transcriptions, however, are expensive—a one hour recording of carefully dictated business correspondence takes at least an hour to transcribe and will usually cost roughly $20 per hour. A one hour interactive meeting or interview will often take over six hours to transcribe and cost over $150. Note that automatic speech recognition-based transcriptions of spontaneous speech, meetings, or conversations are not practical in the foreseeable future (Roe 1993).


Speech is becoming increasingly important for I/O and for data storage as personal computers continue to shrink in size. Screens and keyboards lose their effectiveness in tiny computers, yet the transducers needed to capture and play speech can be made negligible in size. Negroponte said:

The … consequence of this view of the future is that the form factor of such dynadots suggests that the dominant mode of computer interaction will be speech. We can speak to small things. (Negroponte 1991, 185)

    1.2.3 Speech Storage

Until recently, the use of recorded speech has been constrained by storage, bandwidth, computational, and I/O limitations. These barriers are quickly being overcome by recent advances in electronics and related disciplines, so that it is now becoming technologically feasible to record, store, and randomly access large amounts of recorded speech. Personal computers and workstations are now capable of recording and playing audio, regularly contain tens or hundreds of megabytes of RAM, and can store tens of gigabytes of data on disks.

As the usage scenarios in section 1.2.1 illustrate, recorded speech is important in many existing interfaces and applications. Stored speech is becoming more important as electronics and storage costs continue to decrease, and portable and hand-held computers (“personal digital assistants” and “personal communicators”) become pervasive. Manufacturers are supplying the base hardware and software technologies with these devices, but there is currently no means for interacting with, or finding information in, large amounts of stored speech.

One interesting commercial device is the Jbird digital recorder (Adaptive 1991). This portable pocket-sized device is a complete digital recorder, with a signal processing chip for data compression, that stores up to four hours of digitized audio to RAM. This specialized device is designed to be used covertly by law enforcement agencies, but similar devices may eventually make it to the consumer market as personal recording devices and memory aids (Lamming 1991; Stifelman 1993; Weiser 1991).


    1.2.4 A Non-Visual Interface

If you were driving home in your car right now, you couldn’t be reading a newspaper.

    Heard during a public radio fund drive.

Speech is fundamental for human communication, yet this medium is difficult to skim, browse, and navigate because of the transient nature of audio. In displaying a summary of a movie, television show, or home video one can show a time line of key frames (Davis 1993; Mills 1992) or a video extrusion (Elliott 1993), possibly augmented with text or graphics, that provides a visual context. It is not possible to display a conversation or radio show in an analogous manner. If the highlights of the radio program were to be played simultaneously, the resulting cacophony would be neither intelligible nor informative.

A waveform, spectrogram, or other graphical representation can be displayed, yet this provides little content information.1 Text tags (or a full transcription) can be shown; however, this requires an expensive transcoding from one medium to another, and causes the rich attributes of speech to be lost. Displaying speech information graphically for the sake of finding information in the signal is somewhat like taking aspirin for a broken arm—it makes you feel better, but it does not attack the fundamental problem.

This research therefore concentrates on non-visual, or speech-only interfaces that do not use a display or a keyboard, but take advantage of the audio channel. A graphical user interface may make some speech searching and skimming tasks easier, but there are several reasons for exploring non-visual interfaces. First, there are a variety of situations where a graphical interface cannot be used, such as while walking or driving an automobile, or if the user is visually impaired. Second, the important issue addressed in this research is structuring and extracting information from the speech signal. Once non-visual techniques are developed to extract and present speech information, they can be taken advantage of in visual interfaces. However, tools and techniques learned from graphical interfaces are less applicable to non-visual interfaces.

1 A highly trained specialist can slowly “read” spectrograms; however, this approach is impractical and slow, and does not provide the cues that make speech a powerful communication medium.


    1.2.5 Dissertation Goal

The focus of this research is to provide simple and efficient methods for skimming, browsing, navigating and finding information in speech interfaces. Several things are required to improve information access and add skimming capabilities in scenarios such as those described in section 1.2.1. Tools and algorithms, such as time compression and semantically based segmentation, are needed to enable high-speed listening to recorded speech. In addition, user interface software and technologies must be developed to allow a user to access the recorded information and control its presentation.

This dissertation addresses these issues and problems by helping users skim and navigate in speech. Computer-based tools and interaction techniques have been developed to assist in interactively skimming and finding information purely in the audio domain. This is accomplished by matching the system output to the listener’s cognitive and perceptual capabilities. The focus of this research is not on developing new fundamental speech processing algorithms,2 but on combining interaction techniques with speech processing technologies in novel and powerful new ways. The goal is to provide auditory “views” of speech recordings at different time scales and abstraction levels under interactive user control—from a high-level audio overview to a detailed presentation of information.

1.3 Skimming this Document

auditory information is temporally fleeting: once uttered, special steps have to be taken to refer to it again, unlike visually presented information that may be referred to at leisure.

    (Tucker 1991, 148)

This section provides a brief road map to the dissertation research and this document, encouraging quick visual skimming in areas of interest to the reader. The number of bullets indicates the chapters that a reader should consider looking at first.

••• Casual readers will find chapter 1, particularly section 1.2, most interesting, as it describes the fundamental problems being addressed and the general approach to their solution through several user scenarios.

2 However, they were developed where needed.


•• Chapter 2 describes Hyperspeech, a preliminary investigation of speech-only browsing and navigation techniques in a manually authored hypermedia database. The development of this system inspired the research described in this dissertation.

• Chapter 3 reviews methods to time compress speech, including perceptual limits, and the significance and importance of pauses in understanding speech.

• Chapter 4 reviews techniques of finding speech versus background noise in recordings, focusing on techniques that can be used to segment recordings, and methods that adapt to different background noise levels.

••• Chapter 5 describes SpeechSkimmer, a user interface for interactively skimming recorded speech. Section 5.9 details algorithms developed for segmenting speech recordings based on pauses and on pitch.

•• Chapter 6 discusses the contributions of the research, and areas for continued work.

    1.4 Related Work

This research draws from diverse disciplines, integrating theories, ideas, and techniques in important new ways. There is a wide body of knowledge and literature that addresses particular aspects of this problem; however, none of them provides an adequate solution to navigating in recorded speech.

Time compression technologies allow the playback speed of a recording to be increased, but there are perceptual limits to the maximum speed increase. The use of time-compressed speech plays a major role in this dissertation and is reviewed in detail in chapter 3.

There has been some work done in the area of summarizing and gisting (see section 1.4.2), but these techniques have been constrained to limited domains. Research on speech interfaces has laid the groundwork for this exploration by providing insight into the problems of using speech, but has not directly addressed the issue of finding information in speech recordings. There has been significant work in presenting and retrieving information, but this has focused on textual information and graphical interfaces.


    1.4.1 Segmentation

There are several techniques for segmenting speech, but these have not been applied to the problem of skimming and navigating. Chapter 2 describes some manual and computer-assisted techniques for segmenting recorded speech. A speech detector that determines the presence or absence of speech can be effective at segmenting recordings. A variety of speech detection techniques are described in chapter 4. Sections 5.9.3 and 5.9.5 describe other segmentation techniques based on pauses and the fundamental frequency of speech.

Kato and Hosoya investigated several techniques to enable fast message searching in telephone-based information systems (Kato 1992; Kato 1993). They broke up messages on hesitation boundaries, and presented either the initial portion of each phrase or segments based on high energy portions of speech. They found that combining these techniques with time compression enabled fast message browsing.

Hawley describes many techniques and algorithms for extracting structure out of sound (Hawley 1993). Hawley’s work focuses on finding music and speech in audio recordings with an eye towards parsing movie sound tracks.

Wolf and Rhyne present a method for selectively reviewing meetings based on characteristics captured by a computer-supported meeting tool (Wolf 1992). They found the temporal pattern of workstation-based turn-taking to be a useful index to points of interest within the meeting log. They did this by analyzing patterns of activity that are captured by the computerized record of the meetings. They then attempted to characterize the structure of the meetings by correlating these data with points of interest to assist in reviewing the meeting. The intent of each turn taken during a meeting was coded into one of five categories. The turn categories of most interest for assisting in browsing the meeting record were preceded by longer gaps than the other turn types. They found, for example, that finding parts following gaps of ten seconds or longer provides a more efficient way of browsing the meeting record than simply replaying the entire recording. Wolf and Rhyne found that the temporal pattern of turn-taking was effective in identifying interesting points in the meeting record. They also suggest that using a combination of indicators such as user identification combined with a temporal threshold might make the selective review of meetings more effective. While these gaps do not have a one-to-one correspondence with pauses in speaking, the general applicability of this technique appears valid.
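The gap-based selection that Wolf and Rhyne describe can be illustrated with a few lines of code. The sketch below is a hypothetical reconstruction rather than their implementation: it assumes each logged turn carries start and end times in seconds, and it returns the turns that follow a gap of at least a chosen threshold (ten seconds in their example).

    # Minimal sketch (assumed data model, not Wolf and Rhyne's code): return
    # the turns in a meeting log that follow a long gap, as candidate entry
    # points for selective review of the recording.
    from dataclasses import dataclass
    from typing import List

    @dataclass
    class Turn:                 # hypothetical record of one logged turn
        start: float            # seconds from the start of the recording
        end: float
        speaker: str

    def turns_after_gaps(turns: List[Turn], min_gap: float = 10.0) -> List[Turn]:
        """Return turns preceded by at least min_gap seconds of inactivity."""
        ordered = sorted(turns, key=lambda t: t.start)
        return [cur for prev, cur in zip(ordered, ordered[1:])
                if cur.start - prev.end >= min_gap]

Playback would then begin at the start time of each returned turn, rather than replaying the entire recording.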


    1.4.2 Speech Skimming and Gisting

A promising alternative to the fully automated recognition and understanding of speech is the detection of a limited number of key words, which would be automatically combined with linguistic and non-linguistic cues and situation knowledge in order to infer the general content or “gist” of incoming messages.

    (Maksymowicz 1990, 104)

Maxemchuk suggests three techniques (after Maxemchuk 1980, 1395) for skimming speech messages:

• Text descriptors can be associated with points in a speech message. These pointers can be listed, and the speech message played back starting at a selected pointer. This is analogous to using the index in a text document to determine where to start reading.

• While playing back a speech message it is possible to jump forward or backward in the message. This is analogous to flipping through pages in a text document to determine the area of interest.

• Finally, the playback rate can be increased. When the highest playback rate is selected, not every word is intelligible; however, the meaning can generally be extracted. This is analogous to skimming through a text document to determine the areas of interest.
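These three operations map naturally onto a small playback controller. The sketch below is only an illustration of the analogy; the class, its method names, and the default values are invented here and do not correspond to Maxemchuk’s system.

    # Hypothetical skimming controls for a stored speech message.
    class MessagePlayer:
        def __init__(self, length_s, index=None):
            self.length_s = length_s      # total duration in seconds
            self.position = 0.0           # current playback point (seconds)
            self.rate = 1.0               # playback speed multiplier
            self.index = index or {}      # text descriptors -> start times

        def play_from(self, descriptor):
            """Technique 1: start playback at a point named by a text descriptor."""
            self.position = self.index.get(descriptor, self.position)

        def jump(self, seconds):
            """Technique 2: jump forward (positive) or backward (negative)."""
            self.position = min(max(self.position + seconds, 0.0), self.length_s)

        def set_rate(self, rate):
            """Technique 3: increase (or decrease) the playback rate."""
            self.rate = rate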

Several systems have been designed that attempt to obtain the gist of a recorded message (Houle 1988; Maksymowicz 1990; Rose 1991) from acoustical information. These systems use a form of keyword spotting (Wilcox 1991; Wilcox 1992a; Wilpon 1990) in conjunction with syntactic and/or timing constraints in an attempt to broadly classify a message. Similar work has recently been reported in the areas of retrieving speech documents (Glavitsch 1992) and editing applications (Wilcox 1992b).

Rose demonstrated the first complete system that takes speech messages as input, and produces as output an estimated “message class” (Rose 1991). Rose’s system does not attempt to be a complete speech message understanding system that fully describes the utterance at all acoustic, syntactic and semantic levels of information. Rather, the goal is only to attempt to extract a general notion of topic or category of the input speech utterance according to a pre-defined notion of topic. The system uses a limited vocabulary Hidden Markov Model (Rabiner 1989) word spotter that provides an incomplete transcription of the speech. A second stage of processing interprets this incomplete transcription and classifies the message according to a set of pre-defined topics.

Houle et al. proposed a post-processing system for automatic gisting of speech (Houle 1988). Keyword spotting is used to detect, classify and summarize speech messages that are then used to notify an operator whenever a high-priority message arrives. They say that such a system using keyword spotting is controlled by the trade-off between the probability of detecting the spoken words and the false alarm rate. If, for example, a speaker-independent word spotting system correctly detects 80% of the individual keywords, there will be 120 false alarms per hour. With an active vocabulary of ten words in the spotting system, this would correspond to 1200 false alarms per hour, or one false alarm every three seconds. Such a system is impractical because if each keyword that was detected was used to draw the attention of the operator, the entire speech recording would effectively need to be monitored. A decreased false alarm rate was achieved by looking for two or more keywords within a short phrase window. The addition of these syntactic constraints greatly improves the effectiveness of the keyword spotting system. The other technique that they use is termed “credibility adjustment.” This is effectively setting the rejection threshold of the keyword spotter to maximize the number of correct recognitions and to minimize the number of false acceptances. The application of these two kinds of back-end filters significantly reduces the number of false alarms.
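The false alarm arithmetic quoted above is easy to check; the snippet below simply restates the figures from Houle et al. (120 false alarms per hour for each keyword and a ten-word active vocabulary) and is not part of their system.

    # Back-of-the-envelope check of the false alarm rate quoted above.
    false_alarms_per_keyword_per_hour = 120
    vocabulary_size = 10

    total_per_hour = false_alarms_per_keyword_per_hour * vocabulary_size   # 1200
    seconds_between_alarms = 3600 / total_per_hour                         # 3.0
    print(total_per_hour, "false alarms per hour,", seconds_between_alarms, "s apart")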

It appears to be much easier to skim synthetic speech than recorded speech since the words and layout of the text (sentences, paragraphs, and formatting information) provide knowledge about the structure and content of a document. Raman (Raman 1992a, Raman 1992b) and Stevens (Edwards 1993; Stevens 1993) use such a technique for speaking documents containing mathematics based on TeX or LaTeX formatting information (Knuth 1984; Lamport 1986).

For example, as the skimming speed increases, along with raising the speed of the synthesizer, simple words such as “a” and “the” could be dropped. Wallace (Wallace 1983) and Condray (Condray 1987) applied such a technique to recorded “telegraphic speech”3 and found that listener efficiency (the amount of information acquired per unit time) increased under such conditions. When the skimming speed of synthetic speech is increased, only selected content words or sentences could be presented. For example, perhaps only the first two sentences from each paragraph are presented. With higher speed skimming, only the first sentence (assumed to be the “topic” sentence) would be synthesized.

3 Telegraph operators often dropped common words to speed the transmission of messages.
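One plausible way to realize this kind of text-based skimming is to reduce the text progressively as the skimming level rises. The sketch below is an assumption for illustration only, not Raman’s or Stevens’ method: a moderate level drops a small set of function words, and a higher level keeps only the first (topic) sentence of each paragraph before handing the result to a synthesizer.

    # Illustrative text reduction for skimming synthetic speech.
    FUNCTION_WORDS = {"a", "an", "the", "of", "to", "and"}   # assumed stop list

    def skim_text(paragraphs, level):
        """level 0: full text; level 1: drop function words;
        level 2: keep only the first sentence of each paragraph."""
        if level >= 2:
            paragraphs = [p.split(". ")[0].rstrip(".") + "." for p in paragraphs]
        text = " ".join(paragraphs)
        if level >= 1:
            text = " ".join(w for w in text.split()
                            if w.lower().strip(".,;") not in FUNCTION_WORDS)
        return text   # pass the reduced text to the speech synthesizer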

Consumer products have begun to appear with rudimentary speech skimming features. Figures 1-1 and 1-2 show a telephone answering machine that incorporates time compression. The “digital message shuttle” allows the user to play voice messages at 0.7x, 1.0x, 1.3x, and 1.6x of normal speed, permits jumping back about 5 seconds within a message, and skipping forward to the next message (Sony 1993).

    Fig. 1-1. A consumer answering machine with time compression.

    Fig. 1-2. A close-up view of the digital message shuttle.

This dissertation addresses the issues raised by these previous explorations and integrates new techniques into a single interface. This research differs from previous approaches by presenting an interactive multi-level representation based on simple speech processing and filtering of the audio signal. While existing gisting and word spotting techniques have a limited domain of applicability, the techniques presented here are invariant across all topics.

    1.4.3 Speech and Auditory Interfaces

This dissertation also builds on ideas of conversational interfaces pioneered at MIT’s Architecture Machine Group and Media Laboratory. Phone Slave (Schmandt 1984) and the Conversational Desktop (Schmandt 1985; Schmandt 1987) explored interactive message gathering and speech interfaces to simple databases of voice messages. Phone Slave, for example, segmented voice mail messages into five chunks4 through an interactive dialogue with the caller.

VoiceNotes (Stifelman 1993) explores the creation and management of a self-authored database of short speech recordings. VoiceNotes investigates many of the user interface issues addressed in the Hyperspeech and SpeechSkimmer systems (chapters 2 and 5) in the context of a hand-held computer.

Resnick (Resnick 1992a; Resnick 1992b; Resnick 1992c) designed several voice bulletin board systems accessible through a touch tone interface. These systems are unique because they encourage many-to-many communication by allowing users to dynamically add voice recordings to the database over the telephone. Resnick’s systems address issues of navigation among speech recordings, and include tone-based commands equivalent to “where am I” and “where can I go?” However, they require users to fill out an “audio form” to provide improved access in telephone-based information services.

These predecessor systems all structure the recorded speech information through interaction with the user, placing a burden on the creator or author of the speech data. The work presented herein automatically structures the existing recordings from information inherent in a conversational speech signal.

Muller and Daniel’s description of the HyperPhone system (Muller 1990) provides a good overview of many important issues in speech-I/O hypermedia (see chapter 2). They state that navigation tends to be modeled spatially in almost any interface, and that voice navigation is particularly difficult to map into the spatial domain. HyperPhone “voice documents” are a collection of extensively interconnected fine-grained hypermedia objects that can be accessed through a speech recognition interface. The nodes contain small fragments of ASCII text to be synthesized, and are connected by typed links. Hyperspeech differs from HyperPhone in that it is based on recordings of spontaneous speech rather than synthetic speech, there is no default path through the nodes, and no screen or keyboard interface of any form is provided (chapter 2).

4 Name, subject, phone number, time to call, and detailed message.

Non-speech sounds can be used as an “audio display” for presenting information in the auditory channel (Blattner 1989; Buxton 1991). This area has been explored for applications ranging from the presentation of multidimensional data (Bly 1982) to “auditory icons” that use everyday sounds (e.g., scrapes and crashes) as feedback for actions in a graphical user interface (Gaver 1989a; Gaver 1989b; Gaver 1993).

The “human memory prosthesis” is envisioned to run on a lightweight wireless notepad-style computer (Lamming 1991). The intent is to help people remember things such as names of visitors, reconstructing past events, and locating information after it has been filed. This system is intended to gather information through active badges (Want 1992), computer workstation use, computer-based note taking, and conversations. It is noted that video is often used to record significant events such as design meetings and seminars, but that searching through this information is a tedious task. Users must play back a sufficient amount of data to reestablish context so that they can locate a small, but important, snippet of audio or video. By time-stamping the audio and video streams and correlating these with time stamps of the note taking, it is possible to quickly jump to a desired point in the audio or video stream simply by selecting a point in the hand-written notes (section 1.6.3.1).

    1.5 A Taxonomy of Recorded Speech

This section broadly classifies the kinds of speech that can be captured for later playback. This taxonomy is not exhaustive, but lists the situations that are of most interest for subsequent browsing and review. Included are lists of attributes that help distinguish the classifications, and cues that can assist in segmenting and skimming the recorded speech. For example, if a user explicitly made a recording, or was present when a recording was made, the user’s high-level content knowledge of the recording can assist in interactively retrieving information.

The classifications are listed roughly from hardest to easiest in terms of the ability to extract the underlying structure purely from the audio signal. Note that these classification boundaries are easily blurred. For example, there is a meeting-to-lecture continuum: parts of meetings may be more structured, as in a formal lecture, and parts of lectures may be unstructured, as in a meeting discussion.

These categories can be organized along many dimensions, such as structure, interactivity, number of people, whether self-authored, etc. Items can also be classified both within or across categories. For example, a voice mail message is from a single person, yet a collection of voice mail messages are from many people. Figure 1-3 plots these classifications as a function of number of participants and structure. The annotated taxonomy begins here:

Captured speech.
Such as a recording of an entire day’s activities. This includes informal voice communication such as conversations that occur when running into someone in the hall or elevator.

• least structured
• user present
• unknown number of talkers
• variable noise
• all of the remaining items in this list

Meetings.
Including design, working, and administrative gatherings.

• more interactive than a lecture
• more talkers than a lecture
• may be a written agenda
• user may have been present or participated
• user may have taken notes

Possible cues for retrieval: who was speaking based on speaker identification, written or typed notes.

Lectures.
Including formal and informal presentations.

• typically a monologue
• organization may be more structured than a meeting
• may be a written outline or lecture notes
• user may have been present
• user may have taken notes

Possible cues for retrieval: question-and-answer period, visual aids, demonstrations, etc.


[Figure 1-3 is a two-dimensional plot of the taxonomy categories (voice notes, dictation, voice message, phone call, meeting, lecture), arranged from most structured to least structured along one axis and from self-authored through other person to other people along the other.]

Fig. 1-3. A view of the categories in the speech taxonomy.

Personal dictation.
Such as letters or notes that would traditionally be transcribed.

• single talker
• user authored
• can be explicitly categorized by user (considered as long voice notes—see below)

Possible cues for retrieval: date, time, and place of recording.

Recorded telephone calls.

• different media than face-to-face communication
• well defined beginning and ending
• typically two talkers
• well documented speaking and hesitation characteristics (Brady 1965; Brady 1968; Brady 1969; Butterworth 1977)
• user participation in conversation
• can differentiate caller from callee (Hindus 1993)
• consistent audio quality within calls

    Possible cues for retrieval: caller identification, date, time, length of call.

Voice mail.
Speech messages gathered by a computer-based answering system.

• single talker per message
• typically short
• typically contain similar types of information
• user not present
• consistent audio quality within messages


Possible cues for retrieval: caller identification, date, time, length of message.

Voice notes.
Short personal speech recordings organized by the user (Stifelman 1993). In addition to the VoiceNotes system, several other Media Laboratory projects have used small snippets of recorded speech in applications such as calendars, address books, and things-to-do lists (Schmandt 1993).

• single talker
• typically short notes
• authored by user
• consistent audio quality

    Possible cues for retrieval: notes are categorized by user when authored.

Most predecessor systems rely on speech recordings that are structured in some fashion (i.e., in the lower left quadrant of figure 1-3). This dissertation attempts to segment recordings that are unstructured (i.e., in the upper right quadrant of figure 1-3).

1.6 Input (Information Gathering) Techniques

When the medium of communication is free-hand sketching and writing, conventional keyword searches of the meeting are not possible. The content and structure of the meeting must be inferred from other information.

    (Wolf 1992, 6)

This section describes several data collection techniques that can occur when a recording is created. These data could subsequently be used as mechanisms to access and index the speech recordings.

    1.6.1 Explicit

Explicit (or active) techniques require the user to manually identify interesting or important audio segments such as with button or keyboard presses (Degen 1992). Explicit techniques are uninteresting in the context of this dissertation as they burden the user during the recording process. Such techniques place an additional cognitive load on the user at record time or during authoring (see chapter 2), and do not generalize across the entire speech taxonomy. Such techniques assume that the importance and archival nature of the recording is known ahead of time. If large quantities of audio are recorded (e.g., everything that is said or heard every day), explicitly marking all recordings becomes tedious and impractical.

    1.6.2 Conversational

It is also possible to use structured input techniques, or an interactive conversation with the user, to gather classification and segmentation information about a recording (see section 1.4.3). Such techniques are useful for short recordings and limited domains. However, these methods increase the complexity of creating a recording and cannot be used for all classes of recordings.

    1.6.3 Implicit

One of the requirements in providing access to the meeting record is to do so in a way which is efficient for users who are searching the history and is not burdensome for users as they generate the meeting record.

    (Wolf 1992, 1)

Implicit (or passive) techniques provide audio segmentation and search cues without requiring additional action from the user. Implicit input techniques include:

Synchronizing keyboard input with the audio recording.
Keystrokes are time-stamped and synchronized with an audio recording to provide an access mechanism into the recording (Lamming 1991). This technique is discussed further in section 1.6.3.1.

Pen or stylus synchronization with audio.
This is similar to keystroke synchronization, but uses a pen rather than a keyboard for input. Audio can thus be synchronized with handwritten notes that are recognized and turned into text, or with handwriting, figures, and diagrams that are recorded as digital “ink” or bitmaps.

CSCW keyboard synchronization.
This is a superset of keystroke or pen synchronization, but allows for multi-person input and the sharing of synchronization information between many machines.

Synchronizing an existing on-line document with audio.
The structure of an existing document, agenda, or presentation can be synchronized with button or mouse presses to provide cues to accessing chunks of recorded speech. This technique typically provides coarse-grained chunking, but is highly correlated with topics. A technique such as this could form the basis for a CSCW application by broadcasting the speaker’s slides and mouse clicks to personal machines in the audience as a cue for subsequent retrieval.

    Electronic white board synchronization.
    The technology is similar to pen input, but is less likely to support character recognition. White board input has different social implications than keyboard or pen synchronization since the drawing space is shared rather than private.

    Direction sensing microphones.
    There are a variety of microphone techniques that can be used to determine where a talker is located. Each person can use a separate microphone for person identification, but this is a physical imposition on each talker. An array of microphones can be used to determine the talker’s location, based on energy or phase differences between the arrival of signals at the microphone location (Flanagan 1985; Compernolle 1990). Such talker identification information can be used to narrow subsequent audio searches. Note that this technique can be used to distinguish between talkers, even if the talkers’ identities are not known (a minimal localization sketch appears after this list).

    Active badge or UNIX “finger” information.
    These sources can provide coarse granularity information about who is in a room, or logged on a machine (Want 1992; Manandhar 1991). These data could be combined with other information (such as direction-sensing microphones) to provide more precise information than any of the individual technologies can support.

    Miscellaneous external information.
    A variety of external sources such as room lights, video projectors, computer displays, or laser pointers can also be utilized for relevant synchronization information. However, it is difficult to obtain and use this information in a general way.
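    To make the direction-sensing idea concrete, the following fragment is a minimal sketch (not part of any system described in this dissertation) of locating a talker from the time difference of arrival between two microphones; the sample rate, microphone spacing, and function names are assumed for illustration. The lag that maximizes the cross-correlation of the two signals is converted to a rough bearing, which is sufficient to distinguish between talkers seated in different directions.

        /* Hypothetical sketch: estimate a talker's bearing from the time
         * difference of arrival (TDOA) between two synchronously sampled
         * microphones.  Constants and names are illustrative only. */
        #include <math.h>

        #define RATE     8000        /* samples per second */
        #define MIC_DIST 0.5         /* meters between microphones (assumed) */
        #define SOUND    343.0       /* speed of sound, m/s */
        #define MAX_LAG  ((int)(MIC_DIST / SOUND * RATE) + 1)

        /* Lag (in samples) that maximizes the cross-correlation of a and b. */
        int tdoa_lag(const float *a, const float *b, int n)
        {
            int best_lag = 0;
            double best = -1.0e30;
            for (int lag = -MAX_LAG; lag <= MAX_LAG; lag++) {
                double sum = 0.0;
                for (int i = 0; i < n; i++) {
                    int j = i + lag;
                    if (j >= 0 && j < n)
                        sum += (double)a[i] * b[j];
                }
                if (sum > best) {
                    best = sum;
                    best_lag = lag;
                }
            }
            return best_lag;
        }

        /* Convert the lag to an angle (radians) off the array's broadside. */
        double tdoa_angle(int lag)
        {
            double x = (double)lag / RATE * SOUND / MIC_DIST;
            if (x > 1.0)  x = 1.0;
            if (x < -1.0) x = -1.0;
            return asin(x);
        }

    In a meeting-capture setting, a bearing estimated over successive windows could be quantized into per-talker labels and time-stamped alongside the recording to narrow later searches.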

    These techniques appear quite powerful; however, there are many times where it is useful to retrieve information from a recording created in a situation where notes were not taken, or where the other information-gathering techniques were not available. A recording, for example, may have been created under conditions where it was inconvenient or inappropriate to take notes. Additionally, something may have been said that did not seem important at the time, but on reflection, may be important to retrieve and review (Stifelman 1992a). Therefore it is crucial to develop techniques that do not require input from the user during recording.

    1.6.3.1 Audio and Stroke Synchronization

    Keyboard and pen synchronization techniques are the most interesting of the techniques described in section 1.6.3. They are tractable and provide fine granularity information that is readily obtained from the user. Keystrokes (or higher level constructs such as words or paragraphs) are time-stamped and synchronized with an audio recording.5 The user’s keyboard input can then be used as an access mechanism into the recording.6

    Keyboard synchronization technology is straightforward, but may become complex with text that is manipulated and edited, or if the notes span across days or meetings. It is hypothesized that such a synchronization mechanism can be significantly more effective if it is combined with audio segmentation (section 5.9). The user’s textual notes can be used as the primary means of summarizing and retrieving information; the notes provide random access to the recordings, while the recording captures all the spoken details, including things missed in the notes. These techniques are promising and practical, as laptop and palmtop computers are becoming increasingly common in public situations.
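    The core of such a note-to-audio mapping can be quite small. The sketch below is hypothetical (it is not the Hyperspeech or SpeechSkimmer code, and the names and constants are assumed): each note is stamped with the elapsed recording time, and a note’s timestamp, minus a short lead-in, later serves as a seek offset into the sound file.

        /* Hypothetical keystroke-audio synchronization: notes are stamped with
         * the elapsed recording time, which later indexes the recording. */
        #include <string.h>
        #include <time.h>

        #define MAX_NOTES 1024
        #define RATE      8000      /* samples per second in the recording */
        #define LEAD_IN   2.0       /* seconds to back up before the note  */

        struct note {
            double when;            /* seconds from the start of the recording */
            char   text[256];
        };

        static struct note notes[MAX_NOTES];
        static int         n_notes;
        static time_t      record_start;

        void start_recording(void) { record_start = time(NULL); }

        /* Called whenever the user finishes typing a note. */
        void take_note(const char *text)
        {
            if (n_notes >= MAX_NOTES) return;
            notes[n_notes].when = difftime(time(NULL), record_start);
            strncpy(notes[n_notes].text, text, sizeof notes[n_notes].text - 1);
            n_notes++;
        }

        /* Map a note back to a sample offset, backing up slightly so playback
         * starts just before the remark that prompted the note. */
        long note_to_sample(int i)
        {
            double t = notes[i].when - LEAD_IN;
            if (t < 0.0) t = 0.0;
            return (long)(t * RATE);
        }

    Snapping such offsets to a nearby pause boundary, as suggested by the combination with audio segmentation above, would be a natural refinement of this sketch.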

    Note that while the intent of this technique is to not place any additional burden on the user, such a technology may change the way people work. For example, the style and quantity of notes people take in a meeting may change if they know that they can access a detailed audio recording based on their notes.

    1.7 Output (Presentation) Techniques

    Once recordings are created, they are processed and presented to the user. This research also explores interaction and output techniques for presenting this speech information.

    5L. Stifelman has prototyped such a system on a Macintosh.
    6The information that is provided by written notes is analogous to the use of keyword spotting.


    1.7.1 Interaction

    In a few seconds, or even fractions of a second, you can tell whether the sound is a news anchorperson, a talk show, or music. What is really daunting is that, in the space of those few seconds, you effortlessly recognize enough about the vocal personalities and musical styles to tell whether or not you want to listen!

    (Hawley 1993, 53)

    Interactive user control of the audio presentation is synergistically tied to the other techniques described in this document to provide a skimming interface. User interaction is perhaps the most important and powerful of all the techniques, as it allows the user to filter and listen to the recordings in the most appropriate manner for a given search task (see chapters 2 and 5).

    For example, most digital car radios have “scan” and “seek” buttons. Scan is automatic, simply going from one station to the next. Seek allows users to go to the next station under their own control. Scan mode on a radio can be frustrating since it is simply interval-based—after roughly seven seconds, the radio jumps to the next station regardless of whether a commercial, a favorite song, or a disliked song is playing.7 Since scan mode is automatic and hands-off, by the time one realizes that something of interest is playing, the radio has often passed the desired station. The seek command brings the listener into the loop, allowing the user to control the listening period for each station, thus producing a more desirable and efficient search. This research takes advantage of this concept, creating a closed-loop system with the user actively controlling the presentation of information.

    1.7.2 Audio Presentation

    There are two primary methods of presenting supplementary and navigational information in a speech-only interface: (1) the use of non-speech audio and (2) taking advantage of the spatial and perceptual processing capabilities of the human auditory system.

    The use of non-speech audio cues and sound effects can be applied to this research in a variety of ways. In a speech- or sound-only interface, non-speech audio can provide terse, but informative, feedback to the user. In this research, non-speech audio is explored for providing feedback to the user regarding the internal state of the system, and for navigational cues (see also Stifelman 1993).

    7Current radios have no cues to the semantic content of the broadcasts.

    The ability to focus one’s listening attention on a single talker among a cacophony of conversations and background noise is sometimes called the “cocktail party effect.” It may be possible to exploit some of the perceptual, spatial, or other characteristics of speech and audition that give humans this powerful ability to select among multiple audio streams. A spatial audio display can be used to construct a 3-D audio space of multiple simultaneous sounds external to the head (Durlach 1992; Wenzel 1988; Wenzel 1992). Such a system could be used in the context of this research to present multiple speech channels simultaneously, allowing a user to “move” between parallel speech presentations. In addition, one can take advantage of perceptually based audio streams (Bregman 1990)—speech signals can be mixed with a “primary” signal using signal processing techniques to enhance the primary sound, bringing it into the foreground of attention while still allowing the other streams to be attended to (Ludwig 1990; Cohen 1991; Cohen 1993). These areas are outside the primary research of this dissertation, but are discussed in Arons 1992b.

    1.8 Summary

    This chapter provides an overview of what speech skimming is, its utility, and why it is a difficult problem. A variety of related work has been reviewed, and a range of techniques that can assist in skimming speech have been presented.

    Chapter 2 goes on to describe an experimental system that addresses many of these issues in the context of a hypermedia system. Subsequent chapters address the deficiencies of this experimental system and present new methods for segmenting and skimming speech recordings.


    2 Hyperspeech: An Experiment in Explicit Structure

    The Hyperspeech system began with a simple question: How can one navigate in a speech database using only speech? In attacking this question a variety of important issues were raised regarding structuring speech information, levels of detail, and browsing in speech user interfaces. This experiment thus motivated the remainder of this research into the issues of skimming and navigating in speech recordings.8

    2.1 Introduction

    Hyperspeech is a speech-only hypermedia application that explores issues of navigation and system architecture in an audio environment without a visual display. The system uses speech recognition to maneuver in a database of digitally recorded speech segments; synthetic speech is used for control information and user feedback.

    In this prototype system, recorded audio interviews were manually segmented by topic; hypertext-style links were added to connect logically related comments and ideas. The software architecture is data-driven, with all knowledge embedded in the links and nodes, allowing the software that traverses through the network to be straightforward and concise. Several user interfaces were prototyped, emphasizing different styles of speech interaction and feedback between the user and the machine.

    Interactive “hypertext” systems have been proposed for nearly half a century (Bush 1945; Nelson 1974), and realizable since the 1960’s (Conklin 1987; Engelbart 1984). Attempts have continually been made to create “hypermedia” systems by integrating audio and video into traditional hypertext frameworks (Multimedia 1989; Backer 1982). Most of these systems are based on a graphical user interface paradigm using a mouse, or touch sensitive screen, to navigate through a two-dimensional space. In contrast, Hyperspeech is an application for presenting “speech as data,” allowing a user to wander through a database of recorded speech without any visual cues.

    8This chapter is based on Arons 1991a and contains portions of Arons 1991b.

    Speech interfaces must present information sequentially while visual interfaces can present information simultaneously (Gaver 1989a; Muller 1990). These confounding features lead to significantly different design issues when using speech (Schmandt 1989), rather than text, video, or graphics. Recorded speech cannot be manipulated, viewed, or organized on a display in the same manner as text or video images. Schematic representations of speech signals (e.g., waveform, energy, or magnitude displays) can be viewed in parallel and managed graphically, but the speech signals themselves cannot be listened to simultaneously (Arons 1992b). Browsing such a display is easy since it relies “on the extremely highly developed visuospatial processing of the human visual system” (Conklin 1987, 38).

    Navigation in the audio domain is more difficult than in the spatial domain. Concepts such as highlighting, to-the-right-of, and menu selection must be accomplished differently in audio interfaces than in visual interfaces. For instance, one cannot “click here” in the audio world to get more information—by the time a selection is made, time has passed, and “here” no longer exists.

    2.1.1 Application Areas

    Applications for such a technology include the use of recorded speech, rather than text, as a brainstorming tool or personal memory aid. A Hyperspeech-like system would allow a user to create, organize, sort, and filter “audio notes” under circumstances where a traditional graphical interface would not be practical (e.g., while driving) or appropriate (e.g., for someone who is visually impaired). Speech interfaces are particularly attractive for hand-held computers that lack keyboards or large displays. Many of these ideas are discussed further in Stifelman 1992a and Stifelman 1993.

    2.1.2 Related Work: Speech and Hypermedia Systems

    Compared with traditional hypertext or multimedia systems, little work has been done in the area of interactive speech-only hypertext-like systems (see also section 1.4.3). Voice mail and telephone-accessible databases can loosely be placed in this category; however, they are far from what is considered “hypermedia.” These systems generally present only a single view of the underlying data, have a limited 12-button interface, do not encourage free-form exploration of the information, and do not allow personalization of how the information is presented.

    Parunak (Parunak 1989) describes five common hypertext navigational strategies in geographical terms. Hyperspeech uses a “beaten path” mechanism and typed links as additional navigational aids that reduce the complexity of the hypertext database. A beaten path mechanism (e.g., bookmarks or a back-up stack) allows a user to easily return to places already visited.

    Zellweger states “Users are less likely to feel disoriented or lost when they are following a pre-defined path rather than browsing freely, and the cognitive overhead is reduced because the path either makes or narrows their choices” (Zellweger 1989, 1). Hyperspeech encourages free-form browsing, allowing users to focus on accessing information rather than navigation. Zellweger presents a path mechanism that leads a user through a hypermedia database. These paths are appropriate for scripted documents and narrations; this system focuses on conversational interactions.

    IBIS has three types of nodes and a variety of link types including questions, objects-to, and refers-to. Trigg’s Textnet proposed a taxonomy of link types encapsulating ideas such as refutation and support. The system described in this chapter has two node types, and several link types similar to the argumentation links in Textnet and IBIS (Conklin 1987).

    2.2 System Description

    This section describes how the Hyperspeech database and links were created, and provides an overview of the hardware and software systems.

    2.2.1 The Database

    If a man can … make a better mouse-trap … the world will make a beaten path to his door.

    R. W. Emerson

    Audio interviews were conducted with five academic, research, and industrial experts in the user interface field.9 All but one of the interviews was conducted by telephone, since only the oral content was of interest. Note that videotaping similar interviews for a video hypermedia system would have been more expensive and difficult to schedule than telephone interviews.10

    9The interviewees and their affiliations at the time were: Cecil Bloch (Somosomo Affiliates), Brenda Laurel (Telepresence Research), Marvin Minsky (MIT), Louis Weitzman (MCC Human Interface Group), and Laurie Vertelney (Apple Human Interface Group).

    A short list of questions was discussed with the interviewees to help them formulate their responses before a scheduled telephone call. A telemarketing-style program then called, played recorded versions of the questions, and digitally recorded the response to each question in a different data file. Recordings were terminated without manual intervention using speech detection (see chapter 4). There were five short biographical questions (name, title, background, etc.), and three longer questions relating to the scope, present, and future of the human interface.11 The interviews were deliberately kept short; the total time for each automated interview was roughly five minutes.

    The recordings were then manually transcribed on a Sun SparcStation using a conventional text editor while simultaneously controlling audio playback with a custom-built foot pedal (figures 2-1 and 2-2). A serial mouse was built into the foot pedal, with button clicks controlling the playback of the digital recordings.

    The transcripts for each question were then manually categorized into major themes (summary nodes) with supporting comments (detail nodes). Figure 2-3 is a schematic representation of the nodes in the database.12 The starting and stopping points of the speech files corresponding to these categories were then determined with a segmentation tool. Note that most of the boundaries between segments occurred at natural pauses between phrases, rather than between words within a phrase. This attribute is of use in systems that segment speech recordings automatically (see chapter 5).

    10One of the participants was in a bathtub during their telephone interview.
    11The questions were:

    1. What is the scope, or boundaries, of the human interface? What does the human interface mean to you?
    2. What do you perceive as the most important human interface research issues?
    3. What is the future of the human interface? Will we ever achieve “the ultimate” human interface, and if so, what will it be?

    12The node and link images are included here with some hesitation. Such images, intended only for the author of the database, can bias a user of the system, forcing a particular spatial mapping onto the database. When the application is running, there is no visual display of any information.


    Fig. 2-1. The “footmouse” built and used for workstation-based transcription.

    Fig. 2-2. Side view of the “footmouse.” Note the small screws used to depress the mouse buttons.

    After manually analyzing printed transcripts to find interesting speech segments, a separate segmentation utility was used to determine the corresponding begin/end points in the sound file. This utility played small fragments of the recording, allowing the database author to determine segment boundaries within the sound files. Keyboard-based commands analogous to fine-, medium-, and coarse-grained cursor motions of the Emacs text editor (Stallman 1979) were used to move through the sound file and determine the proper segmentation points.13

    13E.g., the forward character command (Control-F) moved forward slightly (50 ms), the forward word command (Meta-F) moved forward a small amount (250 ms), and the forward page command moved a medium amount (1 s). The corresponding backward commands also allowed movement within the recorded sound file. Other keyboard and foot-pedal commands allowed larger movements.
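    A minimal sketch of this style of cursor motion follows; the granularities match those in footnote 13, while the function names and clamping behavior are assumed for illustration. Each command simply moves a sample cursor by a fixed amount at the recording's 8000 samples/second rate.

        /* Illustrative Emacs-style cursor motion over a sound file; the step
         * sizes follow footnote 13, other details are assumed. */
        #define RATE 8000                      /* samples per second */

        enum step { CHAR_STEP, WORD_STEP, PAGE_STEP };

        static const long step_samples[] = {
            (long)(0.050 * RATE),              /* "character": 50 ms  */
            (long)(0.250 * RATE),              /* "word":      250 ms */
            (long)(1.000 * RATE),              /* "page":      1 s    */
        };

        /* Move the segmentation cursor forward (dir = +1) or backward (dir = -1),
         * clamping to the bounds of the recording. */
        long move_cursor(long cursor, long total_samples, enum step s, int dir)
        {
            cursor += dir * step_samples[s];
            if (cursor < 0)             cursor = 0;
            if (cursor > total_samples) cursor = total_samples;
            return cursor;
        }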


    Fig. 2-3. A graphical representation of the nodes in the database. Detail nodes (circles) that are related to a summary node (diamonds) are horizontally contiguous. The three columns (narrow, wide, narrow) correspond to each of the primary questions. The five rows correspond to each of the interviewees.

    Of the data gathered14 (approximately 19 minutes of speech, including trailing silences, um’s, and pauses), over 70 percent was used in the final speech database. Each of the 80 nodes15 contains short speech segments, with a mean length of 10 seconds (SD = 6 seconds, maximum of 25 seconds). These brief segments parallel Muller’s fine-grained hypermedia objects (Muller 1990). However, in this system each utterance represents a complete idea or thought, rather than a sentence fragment.

    2.2.2 The Links

    For this prototype, an X Window System-based tool designed for working with Petri nets (Thomas 1990) was used to link the nodes in the database. All the links in the system were typed according to function. Initially, a small number of supporting and opposing links between talkers were identified. For example, Minsky’s comments about “implanting electrodes and other devices that can pick information out of the brain and send information into the brain” are opposed to Bloch’s related view that ends “and that, frankly, makes my blood run cold.”

    14In the remainder of the chapter, references to nodes and links do not include responses to the biographical questions.
    15There are roughly equal numbers of summary nodes and detail nodes.

    As the system and user interface developed, a large number of links and new link types were added (there are over 750 links in the current system). Figure 2-4 shows the links within the database. The figure also illustrates a general problem of hypermedia systems—the possibility of getting lost within a web of links. The problems of representing and manipulating a hypermedia database become much more complex in the speech domain than with traditional media.

    Fig. 2-4. Graphical representation of all links in the database (version 2). Note that many links are overlaid upon one another.16

    2.2.3 Hardware Platform

    The telephone interviews were gathered on a Sun 386i workstation equipped with an analog telephone interface and digitization board. The Hyperspeech system is implemented on a Sun SparcStation, using its built-in codec for playing the recorded sound segments. The telephone-quality speech files are stored uncompressed (8-bit µ-law coding, 8000 samples/second). A DECtalk serial-controlled text-to-speech synthesizer is used for user feedback. The recorded and synthesized speech sounds are played over a conventional loudspeaker system (figure 2-5).

    16A more appropriate authoring tool would provide a better layout of the links and visual differentiation of the link types.


    [Figure 2-5 is a block diagram showing the SparcStation under computer control of an isolated word recognizer and a text-to-speech synthesizer.]

    Fig. 2-5. Hyperspeech hardware configuration.

    Isolated word, speaker-dependent, speech recognition is provided by a TI-Speech board in a microcomputer; this machine is used as an RS-232 controlled recognition server by the host workstation (Arons 1989; Schmandt 1988). A headset-mounted noise-canceling microphone provides the best possible recognition performance in this noisy environment with multiple sound sources (user + recordings + synthesizer).

    2.2.4 Software Architecture

    The software is written in C, and runs in a standard UNIX operating environment. A simple recursive stack model tracks all nodes that have been visited, and permits the user to return (pop) to a previously heard node at any time.

    Because so much semantic and navigational information is embedded in the links, the software that traverses through the nodes in the database is straightforward and concise. This data-driven architecture allows the program that handles all navigation, user interaction, and feedback to be implemented in approximately 300 lines of C code.17 Note that this data-driven approach allows the researcher to scale up the size of the database without having to modify the underlying software system. This data-driven philosophy was also followed in SpeechSkimmer (see chapter 5).
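    The flavor of such a data-driven design can be suggested with a short sketch. The declarations below are hypothetical (they are not the actual Hyperspeech source, and the names, sizes, and field layout are assumed); they show how typed links, nodes that record sound-segment boundaries, and a back-up stack are enough to drive navigation.

        /* Hypothetical sketch of a data-driven speech-hypermedia structure.
         * Types, names, and sizes are illustrative, not the Hyperspeech source. */
        enum link_type { NAME, SUPPORTING, OPPOSING, MORE, BROWSE, SCAN };
        enum node_type { SUMMARY, DETAIL };

        struct node;

        struct link {
            enum link_type type;
            struct node   *dest;       /* node reached by following this link */
        };

        struct node {
            enum node_type type;
            const char    *soundfile;  /* recording holding this node's speech */
            long           start, end; /* segment boundaries, in samples       */
            struct link    links[16];
            int            n_links;
        };

        /* Back-up ("beaten path") stack: every visited node is pushed, and the
         * return command pops back to the previously heard node. */
        static struct node *visited[256];
        static int          depth;

        void visit(struct node *n) { visited[depth++] = n; /* ...then play n */ }

        struct node *go_back(void)
        {
            if (depth > 1)
                depth--;               /* drop the current node */
            return visited[depth - 1]; /* previously heard node */
        }

        /* Follow the first link of the requested type out of the current node. */
        struct node *follow(struct node *cur, enum link_type t)
        {
            for (int i = 0; i < cur->n_links; i++)
                if (cur->links[i].type == t)
                    return cur->links[i].dest;
            return cur;                /* no such link: stay put */
        }

    Because traversal code of this kind only interprets link types, adding nodes or entire new interviews changes the data files but not the navigation logic, which is the scaling property attributed to the data-driven approach above.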

    2.3 User Interface Design

    The Hyperspeech user interface evolved during development of the system; many improvements were made throughout an iterative design process. Some of the issues described in the following sections illustrate the differences between visual and speech interfaces, and are important design considerations for those implementing speech-based systems.

    17Excluding extensive library routines and drivers that control the speech I/O devices.


    2.3.1 Version 1

    The initial system was sparsely populated with links and had a simple user interface paradigm: explicit menu control. After the sound segment associated with a node was played, a list of valid command options (links to follow) was spoken by the synthesizer. The user then uttered their selection, and the cycle was repeated.

    The initial system tried to be “smart” about transitioning between the nodes. After playing the recording, if no links exited that node, the system popped the user back to the previous node, as no valid links could be followed or navigational commands issued. This automatic return-to-previous-node function was potentially several levels deep. Also, once a node had been heard, it was not mentioned in succeeding menus in order to keep the prompts as short as possible—it was (incorrectly) assumed that a user would not want to be reminded about the same node twice.

    Navigation in this version was very difficult. The user was inundated with feedback from the system—the content of the recordings became lost in the noise of the long and repetitive menu prompts. The supposedly “smart” node transitions and elision of menu items brought users to unknown places, and left them stranded without landmarks because the menus were constantly changing.

    2.3.2 Version 2

    This section describes the current implementation of the Hyperspeech system. The most significant change from Version 1 was the addition of a variety of new link types and a large number of links.

    A name link will transition to a node of a particular talker. For example, user input of Minsky causes a related comment by Marvin Minsky to be played.

    Links were also added for exploring the database at three levels of detail. The more link allows a user to step through the database at the lowest level of detail, playing all the information from a particular talker. The browse link permits a user to skip ahead to the next summary node without hearing the detailed statements. This lets a user skim and get an overview of a particular talker’s important ideas. The scan18 command automatically jumps between the dozen or so nodes that provide a high-level overview path through the entire database, allowing a user to skim over all the recordings to find a topic of interest.

    18In retrospect, these may not have been the most appropriate names (see section 1.1).


    In order to reduce the amount of feedback to the user, the number of links was greatly increased so that a link of every type exists for each speech node. Since any link type can be followed from any node, command choices are uniform across the database, and menus are no longer needed. This is analogous to having the same graphical menu active at every node in a hypermedia interface; clicking anywhere produces a reasonable response, without having to explicitly highlight active words or screen areas. Figure 2-6 shows the vocabulary and link types of the system.

    Link type    Command          Description
    name         Bloch            Transition to related comments from a particular talker
                 Laurel
                 Vertelney
                 Weitzman
                 Minsky
    dialogical   supporting       Transition to a node that supports this viewpoint
                 opposing         Transition to a node that opposes this viewpoint
    control      more             Transition to next detail node
                 continue         Transition to next detail node (alias for more)
                 browse           Transition to next summary node
                 scan             Play path through selected summary nodes

    Utilities    Command          Description
    control      return           Pop to previous node
                 repeat           Replay current node from beginning
    help         help             Synthesize a description of current location
                 options          List current valid commands
    on/off       pay attention    Turn on speech recognizer
                 stop listening   Turn off speech recognizer

    Fig. 2-6. Command vocabulary of the Hyperspeech system.

    A host of minor changes made the system more interactive and conversational. Since the most valuable commodity in speech systems is time rather than screen real estate, every effort was made to speed the interactions. The speech segments in the database are, by default, played back 1.25 times faster than they were recorded without a change of pitch (see chapter 3). If the repeat command is invoked, the node is replayed at normal speed for maximum intelligibility. The speaking rate of the synthetic speech has also been significantly increased (from the default of 180 wpm) to reduce user feedback time. Short repetitive types of feedback (e.g., direct echoing of recognized commands) are spoken at a faster rate (350 wpm) than help or navigation-related feedback (250 wpm). The output volume levels were also adjusted so that the speech recordings—the primary output of the system—are louder than the synthetic speech.
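    These presentation choices can be viewed as a small, tunable policy. The fragment below is purely illustrative (the constants echo the figures quoted above, but the structure and function names are assumed, and the actual synthesizer and playback control are not shown): terse, repetitive feedback gets the fastest rate, and full-speed playback is reserved for the repeat command.

        /* Illustrative presentation policy; the constants follow the text
         * above, everything else is assumed. */
        enum feedback { ECHO, HELP_OR_NAV };

        struct presentation {
            double playback_speed;     /* time-compression factor for recordings */
            int    echo_wpm;           /* rate for echoed commands               */
            int    help_wpm;           /* rate for help and navigation feedback  */
        };

        static const struct presentation policy = { 1.25, 350, 250 };

        int synthesis_rate(enum feedback f)
        {
            return (f == ECHO) ? policy.echo_wpm : policy.help_wpm;
        }

        double playback_speed(int is_repeat)
        {
            return is_repeat ? 1.0 : policy.playback_speed;  /* repeat: full speed */
        }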


    A sample interactive dialog that illustrates what the Hyperspeech system sounds like is shown in figure 2-7. See Arons 1993b for a videotape demonstrating the system in use.

    Talker: Utterance (Comments)

    Minsky: What I think will happen over the next fifty years is we’ll learn more and more about implanting electrodes, and other devices, that can pick information out of the brain and send information into the brain.
    User: opposing (User objects to idea, does anyone else?)
    Bloch: The ultimate is obviously some sort of direct coupling between the nervous system and artificial devices, and that, frankly makes my blood run cold.
    User: browse (Go to next summary from Bloch.)
    Bloch: In terms of ultimate development, I think that the thing that can be said is that it is unpredictable.
    User: Weitzman (What is Weitzman’s view?)
    Weitzman: I would hope that we never do achieve the ultimate interface.
    User: continue (Get more information.)
    Weitzman: We’ll always be able to improve on it, and just the fact that during the process of getting there …
    User: help (Interrupt to get information.)
    Synthesizer: This is Louie Weitzman on the future of the human interface.
    Weitzman: … we are going to learn new things and be able to see even better ways to attack the problem. (Continue playing comment.)
    User: Vertelney (What does the industrial designer think?)
    Vertelney: I think it’s like back in the Renaissance…
    User: return (Not of interest. Interrupt, and go back to previous node.)
    Weitzman: We’ll always be able to… (Weitzman again.)
    User: Minsky (What’s Minsky’s view of the future?)
    Minsky: And when it becomes smart enough we won’t need the person anymore, and the interface problem will disappear.

    Fig. 2-7. A sample Hyperspeech dialog.


    Explicit echoing (Hayes 1983) of recognized commands is no longer the default. However, at start-up time the system can be configured for various degrees of user feedback. Observers and first-time users of the system are often more comfortable with the interface if command echoing is turned on. As soon as a spoken command is recognized, speech output (synthesized or recorded) is immediately halted, providing crucial feedback to the user that a command was heard. The system response time is fast enough that a rejection error19 is immediately noticeable to an experienced user. If a substitution error20 occurs, the user can quickly engage the machine in a repair dialog (Schmandt 1986). Note that speech recognition parameters are typically set so that substitution errors are less common than rejection errors. Figure 2-8 illustrates what a repair dialog (with command echoing on) might sound like.

    Talker       Utterance                                    Description of action
    User         Weitzman                                     Desired command is spoken
    Synthesizer  “supporting”                                 Fast echoing (substitution error)
    Minsky       “The interfa…”                               Incorrect sound is started
    User         return                                       Interrupt recording, pop to previous node
    User         Weitzman                                     Repeat of misrecognized command
    Synthesizer  “Weitzman”                                   Echo of correctly recognized word
    Weitzman     “I hope we never do achieve the ultimate interface…”   Desired action is taken

    Fig. 2-8. An interactive repair.

    2.4 Lessons Learned on Skimming and Navigating

    Einstein is reported to have once said, “make everything as simple as possible, but not too simple.” This idea also holds true in user interfaces, particularly those involving speech. Since time is so valuable in a speech application, every effort must be made to streamline the interactions. Howev