

Proc. Natl. Acad. Sci. USA Vol. 92, pp. 9914-9920, October 1995
Colloquium Paper

This paper was presented at a colloquium entitled "Human-Machine Communication by Voice," organized by Lawrence R. Rabiner, held by the National Academy of Sciences at The Arnold and Mabel Beckman Center in Irvine, CA, February 8-9, 1993.

Scientific bases of human-machine communication by voice

RONALD W. SCHAFER

Department of Electrical and Computer Engineering, Georgia Institute of Technology, Atlanta, GA 30332-0250

ABSTRACT The scientific bases for human-machine communication by voice are in the fields of psychology, linguistics, acoustics, signal processing, computer science, and integrated circuit technology. The purpose of this paper is to highlight the basic scientific and technological issues in human-machine communication by voice and to point out areas of future research opportunity. The discussion is organized around the following major issues in implementing human-machine voice communication systems: (i) hardware/software implementation of the system, (ii) speech synthesis for voice output, (iii) speech recognition and understanding for voice input, and (iv) usability factors related to how humans interact with machines.

Humans communicate with other humans in many ways, including body gestures, printed text, pictures, drawings, and voice. But surely voice communication is the most widely used in our daily affairs. Flanagan (1) succinctly summarized the reasons for this with a pithy quote from Paget (2):

What drove man to the invention of speech was, as I imagine, not so much the need of expressing his thoughts (for that might have been done quite satisfactorily by bodily gesture) as the difficulty of "talking with his hands full."

Indeed, speech is a singularly efficient way for humans to express ideas and desires. Therefore, it is not surprising that we have always wanted to communicate with and command our machines by voice. What may be surprising is that a paradigm for this has been around for centuries. When machines began to be powered by draft animals, humans discovered that the same animals that provided the power for the machine also could provide enough intelligence to understand and act appropriately on voice commands. For example, the simple vocabulary of gee, haw, back, giddap, and whoa served nicely to allow a single human to control the movement of a large farm machine. Of course, voice commands were not the only means of controlling these horse- or mule-powered machines. Another system of more direct commands was also available through the reins attached to the bit in the animal's mouth. However, in many cases, voice commands offered clear advantages over the alternative. For example, the human was left completely free to do other things, such as walking alongside a wagon while picking corn and throwing it into the wagon. This eliminated the need for an extra person to drive the machine, and the convenience of not having to return to the machine to issue commands greatly improved the efficiency of the operation. (Of course, the reins were always tied in a conveniently accessible place just in case the voice control system failed to function properly!)

Clearly, this reliance on the modest intelligence of the animal source of power was severely limiting, and even that limited voice control capability disappeared as animal power was replaced by fossil fuel power. However, the allure of voice interaction with machines remained and became stronger as technology became more advanced and complex. The obvious advantages include the following:
* Speech is the natural mode of communication for humans.
* Voice control is particularly appealing when the human's hands or eyes are otherwise occupied.
* Voice communication with machines is potentially very helpful to handicapped persons.
* The ubiquitous telephone can be an effective remote terminal for two-way voice communication with machines that can also speak, listen, and understand.

Fig. 1 depicts the elements of a system for human-machine communication by voice. With a microphone to pick up the human voice and a speaker or headphones to deliver a synthetic voice from the system to the human ear, the human can communicate with the system, which in turn can command other machines or cause desired actions to occur. In order to do this, the voice communication system must take in the human voice input, determine what action is called for, and pass information to other systems or machines. In some cases "recognition" of the equivalent text or other symbolic representation of the speech is all that is necessary. In other cases, such as in natural language dialogue with a machine, it may be necessary to "understand" or extract the meaning of the utterance. Such a system can be used in many ways. In one class of applications, the human controls a machine by voice; for example, the system could do something simple like causing switches to be set, or it might gather information to complete a telephone call, or it might be used by a human to control a wheelchair or even a jet plane. Another class of applications involves access and control of information; for example, the system might respond to a request for information by searching a data base or doing a calculation and then providing the answers to the human by synthetic voice, or it might even attempt to understand the voice input and speak a semantically equivalent utterance in another language.

What are the important aspects of the system depicted in Fig. 1? In considering almost all the above examples and many others that we can imagine, there is a tendency to focus on the speech recognition aspects of the problem. While this may be the most challenging and glamorous part, the ability to recognize or understand speech is still only part of the picture. The system also must have the capability to produce synthetic voice output. Voice output can be used to provide feedback to assure the human that the machine has correctly understood the input speech, and it also may be essential for returning any information that may have been requested by the human through the voice transaction. Another important aspect of the problem concerns usability by the human. The system must be designed to be easy to use, and it must be flexible enough to cope with the wide variability that is common in human speech. Finally, the technology available for implementing such a system must be an overarching concern.





FIG. 1. Human-machine communication by voice (a human exchanges voice with a Human-Machine Voice Communication System, which in turn produces actions or information).

Like many new ideas in technology, voice communication with machines in its modern form may appear to be a luxury that is not essential to human progress and well-being. Some have questioned both the ultimate feasibility and the value of speech recognition by machine, arguing that anything short of the full capabilities of a native speaker would not be useful or interesting and that such capabilities are not feasible. The questions raised by Pierce (3) concerning the goals, the value, and the potential for success of research in speech recognition stimulated much valuable discussion and thought in the late 1960s and may even have dampened the enthusiasm of engineers and scientists for a while, but ultimately the research community answered with optimistic vigor. Although it is certainly true that the ambitious goal of providing a machine with the speaking and understanding capability of a native speaker is still far away, the past 25 years have seen significant progress in both speech synthesis and recognition, so that effective systems for human-machine communication by voice are now being deployed in many important applications, and there is little doubt that applications will increase as the technology matures.

The progress so far has been due to the efforts of researchers across a broad spectrum of science and technology, and future progress will require an even closer linkage between such diverse fields as psychology, linguistics, acoustics, signal processing, computer science, and integrated circuit technology. The purpose of this paper is to highlight the basic scientific and technological issues in human-machine communication by voice and to set the context for the next two papers in this volume, which describe in detail some of the important areas of progress and some of the areas where more research is needed. The discussion is organized around the following major issues in implementing human-machine voice communication systems: (i) hardware/software implementation of the system, (ii) speech synthesis for voice output, (iii) speech recognition and understanding for voice input, and (iv) usability factors related to how humans interact with machines.

DIGITAL COMPUTATION AND MICROELECTRONICS

Scientists and engineers have systematically studied the speech signal and the speech communication process for well over a century. Engineers began to use this knowledge in the first part of the twentieth century to experiment with ways of conserving bandwidth on telephone channels. However, the invention and rapid development of the digital computer were key to the rapid advances in both speech research and technology. First, computers were used as tools for simulating analog systems, but it soon became clear that the digital computer would ultimately be the only way to realize complex speech signal processing systems. Computer-based laboratory facilities quickly became indispensable tools for speech research, and it is not an exaggeration to say that one of the strongest motivating forces in the modern field of digital signal processing was the need to develop digital filtering, spectrum analysis, and signal modeling techniques for simulating and implementing speech analysis and synthesis systems (4-7).

In addition to its capability to do the numerical computations called for in analysis and synthesis of speech, the digital computer can provide the intelligence necessary for human-machine communication by voice. Indeed, any machine with voice input/output capability will incorporate or be interfaced to a highly sophisticated and powerful digital computer. Thus, the disciplines of computer science and engineering have already made a huge impact on the field of human-machine communication by voice, and they will continue to occupy a central position in the field.

Another area of technology that is critically intertwined with digital computation and speech signal processing is microelectronics technology. Without the mind-boggling advances that have occurred in this field, digital speech processing and human-machine communication by voice would still be languishing in the research laboratory as an academic curiosity. As an illustration, Fig. 2 shows the number of transistors per chip for several members of a popular family of digital signal processing (DSP) microcomputers plotted as a function of the year the chip was introduced. The upper graph in this figure shows the familiar result that integrated circuit device densities tend to increase exponentially with time, thereby leading inexorably to more powerful systems at lower and lower cost. The lower graph shows the corresponding time required to do a single multiply-accumulate operation of the form (previous sum + c*x[n]), which is a ubiquitous operation in DSP. From this graph we see that currently available chips can do this combination of operations in 40 nanoseconds or less, or the equivalent of 50 MFLOPS (million floating-point operations per second). Because of multiple busses and parallelism in the architecture, such chips can also do hundreds of millions of other operations per second and transfer hundreds of millions of bytes per second. This high performance is not limited to special-purpose microcomputers. Currently available workstations and personal computers also are becoming fast enough to do the real-time operations required for human-machine voice communication without any coprocessor support.
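To make the multiply-accumulate operation concrete, here is a small sketch (mine, not the paper's; Python is used purely for illustration) of the inner loop of an FIR filter, which performs exactly one such operation per coefficient, together with the arithmetic behind the 50-MFLOPS figure quoted above.

def fir_filter(x, coeffs):
    """Direct-form FIR filter: y[n] = sum over k of coeffs[k] * x[n-k]."""
    y = []
    for n in range(len(x)):
        acc = 0.0                          # the running "previous sum"
        for k, c in enumerate(coeffs):
            if n - k >= 0:
                acc = acc + c * x[n - k]   # one multiply-accumulate (MAC)
        y.append(acc)
    return y

# Throughput arithmetic from the text: one MAC per 40 ns is 25 million
# MACs/sec; counting the multiply and the add separately gives 50 MFLOPS.
print(1 / 40e-9 * 2 / 1e6)   # 50.0 (MFLOPS)

# For scale: an 8000-Hz speech signal through a 100-tap filter needs
# only 0.8 million MACs/sec, far below that budget.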

FIG. 2. Device density and computation speed for a family of DSP microcomputers (data courtesy of Texas Instruments, Inc.). (Upper panel: device density; lower panel: computation speed; horizontal axis: year introduced, 1982-1994.)

Thus, there is a tight synergism between speech processing, computer architecture, and microelectronics. It is clear that these areas will continue to complement and stimulate each other; indeed, in order to achieve a high level of success in human-machine voice communication, new results must continue to be achieved in areas of computer science and engineering, such as the following:


Microelectronics. Continued progress in developing more powerful and sophisticated general-purpose and special-purpose computers is necessary to provide adequate, inexpensive computer power for human-machine voice communication applications. At this time, many people in the microelectronics field are confidently predicting chips with a billion transistors by the end of the decade. This presents significant challenges and opportunities for speech researchers to learn how to use such massive information processing power effectively.

Algorithms. New algorithms can improve performance and increase speed just as effectively as increased computer power. Current research on topics such as wavelets, artificial neural networks, chaos, and fractals is already finding application in speech processing. Researchers in signal processing and computer science should continue to find motivation for their work in the problems of human-machine voice communication.

Multiprocessing. The problems of human-machine communication by voice will continue to challenge the fastest computers and the most efficient algorithms. As more sophisticated systems evolve, it is likely that a single processor with sufficient computational power may not exist or may be too expensive for an economical solution. In such cases, multiple parallel processors will be needed. The problems of human-machine voice communication are bound to stimulate many new developments in parallel computing.

Tools. As systems become more complex, the need for computer-aided tools for system development continues to increase. What is needed is an integrated and coordinated set of tools that make it easy to test new ideas and develop new system concepts, while also making it easy to move a research system through the prototype stage to final implementation. Such tools currently exist in rudimentary form, and it is already possible in some applications to do the development of a system directly on the same DSP microprocessor or workstation that will host the final implementation. However, much more can be done to facilitate the development and implementation of voice processing systems.

SPEECH ANALYSIS AND SYNTHESIS

In human-machine communication by voice, the basic information-carrying medium is speech. Therefore, fundamental knowledge of the speech signal (how it is produced, how information is encoded in it, and how it is perceived) is critically important.

Human speech is an acoustic wave that is generated by a well-defined physical system. Hence, it is possible to use the laws of physics to model and simulate the production of speech. The research in this area, which is extensive and spans many years, is described in the classic monographs of Fant (8) and Flanagan (1), in more recent texts by Rabiner and Schafer (7) and Deller et al. (9), and in a wealth of scientific literature. Much of this research has been based on the classic source/system model depicted in Fig. 3.

In this model the different sounds of speech are produced by changing the mode of excitation between quasi-periodic pulses for voiced sounds and random noise for fricatives, with perhaps a mixture of the two sources for voiced fricatives and transitional sounds. The vocal tract system response also changes with time to shape the spectrum of the signal to produce appropriate resonances, or formants. With such a model as a basis, the problem of speech analysis is concerned with finding the parameters of the model given a speech signal. The problem of speech synthesis then can be defined as obtaining the output of the model, given the time-varying control parameters of the model.

A basic speech processing problem is the representation of the analog acoustic waveform of speech in digital form. Fig. 4 depicts a general representation of a system for digital speech coding and processing.

Speech, like any other band-limited analog waveform, can be sampled and quantized with an analog-to-digital (A-to-D) converter to produce a sequence of binary numbers. These binary numbers represent the speech signal in the sense that they can be converted back to an analog signal by a digital-to-analog (D-to-A) converter, and, if enough bits are used in the quantization and the sampling rate is high enough, the reconstructed signal can be arbitrarily close to the original speech waveform. The information rate (bit rate) of such a digital waveform representation is simply the number of samples per second times the number of bits per sample. Since the bit rate determines the channel capacity required for digital transmission or the memory capacity required for storage of the speech signal, the major concern in digital speech coding is to minimize the bit rate while maintaining an acceptable perceived fidelity to the original speech signal.

One way to provide voice output from a machine is simply to prerecord all possible voice responses and store them in digital form so that they can be played back when required by the system. The information rate of the digital representation will determine the amount of digital storage required for this approach. With a bandwidth of 4000 Hz (implying an 8000-Hz sampling rate) and eight bits per sample (with μ-law or A-law compression), speech can be represented by direct sampling and quantization at a bit rate of 64,000 bits/sec and with a quality comparable to a good long-distance telephone connection (often called "toll quality").
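The bit-rate arithmetic and the μ-law companding curve mentioned above can be sketched as follows. This is my illustrative reconstruction, not the paper's; μ = 255 is the value used in North American telephony.

import numpy as np

fs = 8000          # samples/sec for 4000-Hz bandwidth speech
bits = 8           # bits per sample after companding
print(fs * bits)   # 64000 bits/sec: "toll quality"

def mulaw_compress(x, mu=255.0):
    """Map x in [-1, 1] through the mu-law characteristic."""
    return np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)

def mulaw_expand(y, mu=255.0):
    """Inverse mapping back to the linear amplitude domain."""
    return np.sign(y) * np.expm1(np.abs(y) * np.log1p(mu)) / mu

# Quantizing the compressed value uniformly to 8 bits gives finer
# effective resolution to the small amplitudes that dominate speech.
x = np.linspace(-1, 1, 5)
assert np.allclose(mulaw_expand(mulaw_compress(x)), x)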

FIG. 3. Source/system model for speech production (a voiced/unvoiced/mixture selector and pitch period control the excitation; vocal tract parameters control the system that produces the synthetic speech).
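As a rough illustration of how the model of Fig. 3 can be exercised, the following sketch drives an all-pole "vocal tract" filter with either a pulse train or noise. The formant frequencies, bandwidths, and gains here are invented for illustration and are not values from the paper.

import numpy as np
from scipy.signal import lfilter

fs = 8000  # sampling rate in Hz

def excitation(voiced, n, pitch_period=80):
    if voiced:                        # quasi-periodic pulses for voiced sounds
        e = np.zeros(n)
        e[::pitch_period] = 1.0       # 100-Hz pitch at fs = 8000
        return e
    return 0.1 * np.random.randn(n)   # random noise for fricatives

def vocal_tract(formants, bw=100):
    """All-pole filter with resonances at the given formant frequencies (Hz)."""
    a = np.array([1.0])
    for f in formants:
        r = np.exp(-np.pi * bw / fs)              # pole radius from bandwidth
        theta = 2 * np.pi * f / fs                # pole angle from frequency
        a = np.convolve(a, [1, -2 * r * np.cos(theta), r * r])
    return a

# A vowel-like frame followed by a fricative-like frame.
frame1 = lfilter([1.0], vocal_tract([700, 1200, 2600]), excitation(True, 800))
frame2 = lfilter([1.0], vocal_tract([2000, 3000]), excitation(False, 800))
synthetic = np.concatenate([frame1, frame2])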



FIG. 4. Digital speech coding (speech input passes through A-to-D conversion and coding to a digital representation for transmission, storage, or processing, then through decoding and D-to-A conversion to speech output).

To further reduce the bit rate while maintaining acceptable quality and fidelity, it is necessary to incorporate knowledge of the speech signal into the quantization process. This is commonly done by an analysis/synthesis coding system in which the parameters of the model are estimated from the sampled speech signal and then quantized for digital storage or transmission. A sampled speech waveform is then synthesized by controlling the model with the quantized parameters, and the output of the model is converted to analog form by a D-to-A converter.

In this case the first block in Fig. 4 contains the analysis and coding computations as well as the A-to-D converter, and the third block would contain the decoding and synthesis computations and the D-to-A converter. The output of the discrete-time model of Fig. 3 satisfies a linear difference equation; that is, a given sample of the output depends linearly on a finite number of previous samples and the excitation. For this reason, linear predictive coding (LPC) techniques have enjoyed huge success in speech analysis and coding. Linear predictive analysis is used to estimate parameters of the vocal tract system model in Fig. 3, and, either directly or indirectly, this model serves as the basis for a digital representation of the speech signal. Variations on the LPC theme include adaptive differential PCM (ADPCM), multipulse-excited LPC (MPLPC), code-excited LPC (CELP), self-excited LPC (SEV), mixed-excitation LPC (MELP), and pitch-excited LPC (1, 7, 9). With the exception of ADPCM, which is a waveform coding technique, all the other methods are analysis/synthesis techniques. Coding schemes like CELP and MPLPC also incorporate frequency-weighted distortion measures in order to build in knowledge of speech perception along with the knowledge of speech production represented by the synthesis model. Another valuable approach uses frequency-domain representations and knowledge of auditory models to distribute quantization error so as to be less perceptible to the listener. Examples of this approach include sinusoidal models, transform coders, and subband coders (1, 7, 9).
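To make the linear-prediction idea concrete, here is a compact sketch of the autocorrelation method with the Levinson-Durbin recursion, one standard way of estimating the all-pole model parameters. The frame length and predictor order are illustrative choices of mine, though they are typical for 8-kHz speech.

import numpy as np

def lpc(frame, p):
    """Return coefficients of the inverse filter A(z) = 1 + a[1]z^-1 + ...
    such that -a[1..p] are the best linear predictor coefficients."""
    r = np.array([np.dot(frame[:len(frame) - k], frame[k:])
                  for k in range(p + 1)])                 # autocorrelation
    a = np.zeros(p + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, p + 1):
        k = -(r[i] + np.dot(a[1:i], r[i-1:0:-1])) / err   # reflection coeff
        a[1:i] = a[1:i] + k * a[i-1:0:-1]                 # order update
        a[i] = k
        err *= (1 - k * k)                                # residual energy
    return a, err

# A 10th-order fit to a 20-ms Hamming-windowed frame (160 samples at 8 kHz).
frame = np.hamming(160) * np.random.randn(160)   # stand-in for real speech
a, err = lpc(frame, 10)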

In efforts to reduce the bit rate, an additional trade-off comes into play: the complexity of the analysis/synthesis modeling processes. In general, any attempt to lower the bit rate while maintaining high quality will increase the complexity (and computational load) of the analysis and synthesis operations. At present, toll-quality analysis/synthesis representations can be obtained at about 8000 bits/sec, or an average of about one bit per sample (10). Attempting to lower the bit rate further leads to degradation in the quality of the reconstructed signal; however, intelligible speech can be reproduced with bit rates as low as 2000 bits/sec (10).

The waveform of human speech contains a significant amount of information that is often irrelevant to the message conveyed by the utterance. An estimate under simple assumptions shows that the fundamental information transmission rate for a human reading text is on the order of 100 bits/sec. This implies that speech can in principle be stored or transmitted an order of magnitude more efficiently if we can find ways of representing the phonetic/linguistic content of the speech utterance in terms of the parameters of a speech synthesizer.
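The paper does not spell out its estimate, but one plausible back-of-envelope version runs as follows; the phoneme rate and symbol inventory are round numbers of my own choosing, not the paper's.

import math

# English has roughly 40-50 phonemes; pad to 64 symbols to cover stress
# and boundary markers, and assume about 12 symbols/sec in fluent reading.
symbols_per_sec = 12
bits_per_symbol = math.log2(64)           # 6 bits if symbols were equally likely
print(symbols_per_sec * bits_per_symbol)  # 72 bits/sec, i.e., order of 100

# Against 64,000 bits/sec waveform coding, that is a potential
# compression factor of several hundred.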

Fig. 5 shows this approach, denoted text-to-speech synthesis. The text of a desired speech utterance is analyzed to determine its phonetic and prosodic variations as a function of time. These in turn are used to determine the control parameters for the speech model, which then computes the samples of a synthetic speech waveform. This involves literally a pronouncing dictionary (along with rules for exceptions, acronyms, and irregularities) for determining phonetic content, as well as extensive linguistic rules for producing durations, intensity, voicing, and pitch. Thus, the complexity of the synthesis system is greatly increased while the bit rate of the basic representation is greatly reduced.
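A toy illustration of the dictionary-plus-rules idea is sketched below. The mini-lexicon and letter-to-sound rules are invented placeholders; a real system is vastly larger and must also assign the durations, intensity, and pitch mentioned above.

# Exceptions are looked up first; a crude letter-to-sound fallback
# handles everything else. Symbols are ARPAbet-like, for illustration.
EXCEPTIONS = {
    "one": ["W", "AH", "N"],
    "two": ["T", "UW"],
    "colonel": ["K", "ER", "N", "AH", "L"],
}

LETTER_RULES = {   # grossly simplified one-letter-one-phoneme rules
    "a": "AE", "b": "B", "c": "K", "d": "D", "e": "EH",
    "g": "G", "i": "IH", "n": "N", "o": "AA", "t": "T", "w": "W",
}

def to_phonemes(word):
    word = word.lower()
    if word in EXCEPTIONS:        # irregular forms take priority
        return EXCEPTIONS[word]
    return [LETTER_RULES[ch] for ch in word if ch in LETTER_RULES]

print(to_phonemes("two"))   # dictionary hit: ['T', 'UW']
print(to_phonemes("bat"))   # rule-based: ['B', 'AE', 'T']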

In using digital speech coding and synthesis for voice response from machines, the following four considerations lead to a wide range of trade-off configurations: (i) complexity of analysis/synthesis operations, (ii) bit rate, (iii) perceived quality, and (iv) flexibility to modify or make new utterances. Clearly, straightforward playback of sampled and quantized speech is the simplest approach, requiring the highest bit rate for good quality and offering almost no flexibility other than that of simply splicing waveforms of words and phrases together to make new utterances. Therefore, this approach is usually attractive only where a fixed and manageable number of utterances is required. At the other extreme is text-to-speech synthesis, which, for a single investment in program, dictionary, and rule base storage, offers virtually unlimited flexibility to synthesize speech utterances. Here the text-to-speech algorithm may require significant computational resources. The usability and perceived quality of text-to-speech synthesis has progressed from barely intelligible and "machine-like" in the early days of synthesis research to highly intelligible and only slightly unnatural today. This has been achieved with a variety of approaches, ranging from concatenation of diphone elements of natural speech represented in analysis/synthesis form to pure computation of synthesis parameters for physical models of speech production.

Speech analysis and synthesis have received much attention from researchers for over 60 years, with great strides occurring in the 25 years since digital computers became available for speech research. Synthesis research has drawn support from many fields, including acoustics, digital signal processing, linguistics, and psychology. Future research will continue to synthesize knowledge from these and other related fields in order to provide the capability to represent speech with high quality at lower and lower information rates, leading ultimately to the capability of producing synthetic speech from text that compares favorably with that of an articulate human speaker. Some specific areas where new results would be welcome are the following:

FIG. 5. Text-to-speech synthesis (text input is converted to control parameters, from which a digital representation of the synthetic speech output is computed).



Language Modeling. A continuing goal must be to understand how linguistic structure manifests itself in the acoustic waveform of speech. Learning how to represent phonetic elements, syllables, stress, emphasis, etc., in a form that can be effectively coupled to speech modeling, analysis, and synthesis techniques should continue to have high priority in speech research. Increased knowledge in this area is obviously essential for text-to-speech synthesis, where the goal is to ensure that linguistic structure is correctly introduced into the synthetic waveform, but more effective application of this knowledge in speech analysis techniques could lead to much improved analysis/synthesis coders as well.

Acoustic Modeling. The linear source/system model of Fig. 3 has served well as the basis for speech analysis and coding, but it cannot effectively capture many subtle nonlinear phenomena in speech. New research in modeling wave propagation in the vocal tract (10) and new models based on modulation theory, fractals, and chaos (11, 12) may lead to improved analysis and synthesis techniques that can be applied to human-machine communication problems.

Auditory Modeling. Models of hearing and auditory perception are now being applied with dramatic results in high-quality audio coding (10). New ways of combining both speech production and speech perception models into speech coding algorithms should continue to be a high priority in research.

Analysis by Synthesis. The analysis-by-synthesis approach to speech analysis is depicted in Fig. 6, which shows that the parametric representation of the speech signal is obtained by adjusting the parameters of the model until the synthetic output of the model matches the original input signal accurately enough according to some error criterion. This principle is the basis for MPLPC, CELP, and SEV coding systems. In these applications the speech synthesis model is a standard LPC source/system model, and the "perceptual comparison" is a frequency-weighted mean-squared error. Although great success has already been achieved with this approach, it should be possible to apply the basic idea with more sophisticated comparison mechanisms based on auditory models and with other signal models. Success will depend on the development of appropriate computationally tractable optimization approaches (10).

FIG. 6. Speech analysis by synthesis.
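A schematic version of the closed loop of Fig. 6 might look like the following sketch. It searches a small random excitation codebook under a plain squared-error criterion; this is a toy stand-in for the frequency-weighted "perceptual comparison" of an actual CELP-style coder, and the model and codebook sizes are invented.

import numpy as np
from scipy.signal import lfilter

def analysis_by_synthesis(target, a, codebook, gains):
    """Pick the (codebook entry, gain) whose synthetic output best
    matches the target frame; the winning indices are what a coder
    would quantize and transmit."""
    best = (np.inf, None, None)
    for ci, code in enumerate(codebook):
        synth = lfilter([1.0], a, code)              # speech synthesis model
        for g in gains:
            err = np.sum((target - g * synth) ** 2)  # crude "perceptual" error
            if err < best[0]:
                best = (err, ci, g)
    return best

rng = np.random.default_rng(0)
a = [1.0, -0.9]                              # trivial one-pole vocal tract
codebook = rng.standard_normal((64, 40))     # 64 random excitation vectors
err, idx, gain = analysis_by_synthesis(rng.standard_normal(40), a,
                                       codebook, gains=[0.5, 1.0, 2.0])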

SPEECH RECOGNITION AND UNDERSTANDING

The capability of recognizing or extracting the text-level information from a speech signal (speech recognition) is a major part of the general problem of human-machine communication by voice. As in the case of speech synthesis, it is critical to build on fundamental knowledge of speech production and perception and to understand how the linguistic structure of language is expressed and manifested in the speech signal. Clearly there is much that is common among speech analysis, coding, synthesis, and speech recognition.

Fig. 7 depicts the fundamental structure of a typical speech recognition system. The "front-end" processing extracts a parametric representation, or input pattern, from the digitized input speech signal using the same types of techniques (e.g., linear predictive analysis or filter banks) that are used in speech analysis/synthesis systems. These acoustic features are designed to capture the linguistic features in a form that facilitates accurate linguistic decoding of the utterance. Cepstrum coefficients, derived either from LPC parameters or from spectral amplitudes obtained from FFT or filter bank outputs, are widely used as features (13). Such analysis techniques are often combined with vector quantization to provide a compact and effective feature representation. At the heart of a speech recognition system is the set of algorithms that compare the feature pattern representation of the input to members of a set of stored reference patterns that have been obtained by a training process. Equally important are algorithms for making a decision about the pattern to which the input is closest. Cepstrum distance measures are widely used for comparison of feature vectors, and dynamic time warping (DTW) and hidden Markov models (HMMs) have been shown to be very effective in dealing with the variability of speech (13). As shown in Fig. 7, the most sophisticated systems also employ grammar and language models to aid in the decision process.
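A minimal sketch of such an FFT-based cepstral front end follows; the frame length, hop size, and number of retained coefficients are illustrative choices of mine rather than values given in the text.

import numpy as np

def cepstral_features(signal, frame_len=256, hop=80, n_coeffs=13):
    """One low-order real-cepstrum vector per 10-ms frame at 8 kHz."""
    feats = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len] * np.hamming(frame_len)
        spectrum = np.abs(np.fft.rfft(frame)) + 1e-10   # spectral amplitudes
        cep = np.fft.irfft(np.log(spectrum))            # real cepstrum
        feats.append(cep[:n_coeffs])                    # keep low quefrencies
    return np.array(feats)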

Speech recognition systems are often classified according to the scope of their capabilities. Speaker-dependent systems must be "trained" on the speech of an individual user, while speaker-independent systems attempt to cope with the variability of speech among speakers. Some systems recognize a large number of words or phrases, while simpler systems may recognize only a few words, such as the digits 0 through 9. Finally, it is simpler to recognize isolated words than to recognize fluent (connected) speech. Thus, a limited-vocabulary, isolated-word, speaker-dependent system would generally be the simplest to implement, while approaching the capabilities of a native speaker would require a large-vocabulary, connected-speech, speaker-independent system. The accuracy of current speech recognition systems depends on the complexity of the operating conditions. Recognition error rates below 1 percent have been obtained for highly constrained vocabularies and controlled speaking conditions, but for large-vocabulary, connected-speech systems the word error rate may exceed 25 percent.
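For concreteness, the dynamic time warping comparison mentioned above can be sketched as follows; this is a bare-bones version with Euclidean local distances, omitting the slope constraints and path normalizations used in practice.

import numpy as np

def dtw_distance(test, ref):
    """Accumulated distance between a test feature sequence and a stored
    reference template, allowing the timing to stretch and compress."""
    n, m = len(test), len(ref)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(test[i - 1] - ref[j - 1])
            D[i, j] = cost + min(D[i - 1, j],      # stretch the test
                                 D[i, j - 1],      # stretch the reference
                                 D[i - 1, j - 1])  # advance both together
    return D[n, m]

# Isolated-word recognition: score the input against each word's template
# and pick the smallest accumulated distance.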


FIG. 7. Speech recognition.


FIG. 8. Speech recognition/synthesis (speech input is recognized as text or symbols, which may be stored, transmitted, or processed, then converted back to speech output by synthesis).


Clearly, different applications will require different capabilities. Closing switches, entering data, or controlling a wheelchair might very well be achieved with the simplest system. As an example where high-level capabilities are required, consider the system depicted in Fig. 8, which consists of a speech recognizer producing a text or symbolic representation, followed by storage, transmission, or further processing, and then text-to-speech synthesis for conversion back to an acoustic representation. In this case it is assumed that the output of the text-to-speech synthesis system is sent to a listener at a remote location, such that the machine is simply an intermediary between two humans. This system is the ultimate speech compression system, since the bit rate at the text level is only about 100 bits/sec. Also shown in Fig. 8 is the possibility that the text might be processed before being sent to the text-to-speech synthesizer. An example of this type of application is when processing is applied to the text to translate one natural language, such as English, into another, such as Japanese. Then the voice output is produced by a Japanese text-to-speech synthesizer, thereby resulting in automatic interpretation in the second language.

If the goal is to create a machine that can speak and understand speech as well as a human being, then speech synthesis is probably further along than recognition. In the past, synthesis and recognition have been treated as separate areas of research, generally carried out by different people and by different groups within research organizations. Obviously, there is considerable overlap between the two areas, and both would benefit from closer coupling. The following are topics where both recognition and synthesis would clearly benefit from new results:

Language Modeling. As in the case of speech synthesis, a continuing goal must be to understand how linguistic structure is encoded in the acoustic speech waveform and, in the case of speech recognition, to learn how to incorporate such models into both the pattern analysis and pattern matching phases of the problem.

Robustness. A major limitation of present speech recognition systems is that their performance degrades significantly with changes in the speaking environment, the transmission channel, or the condition of the speaker's voice. Solutions to these problems may involve the development of more robust feature representations having a basis in auditory models, new distance measures that are less sensitive to nonlinguistic variations, and new techniques for normalization of speakers and speaking conditions.

Computational Requirements. Computation is often a dominant concern in speech recognition systems. Search procedures, hidden Markov model training and analysis, and new feature representations based on detailed auditory models all require much computation. All these aspects and more will benefit from increased processor speed and parallel computation.

Speaker Identity and Normalization. It is clear that speaker identity is represented in the acoustic waveform of speech, but much remains to be done to quantify the acoustic correlates of speaker identity. Greater knowledge in this area would be useful for normalization of speakers in speech recognition systems, for incorporation of speaker characteristics in text-to-speech synthesis, and for its own sake as a basis for speaker identification and verification systems.

Analysis by Synthesis. The analysis-by-synthesis paradigm of Fig. 6 may also be useful for speech recognition applications. Indeed, if the block labeled "Model Parameter Generator" were a speech recognizer producing text or some symbolic representation as output, the block labeled "Speech Synthesis Model" could be a text-to-speech synthesizer. In this case the symbolic representation would be obtained as a by-product of the matching of the synthetic speech signal to the input signal. Such a scheme, although appealing in concept, clearly presents significant challenges. Obviously, the matching metric could not simply compare waveforms but would have to operate on a higher level. Defining a suitable metric and developing an appropriate optimization algorithm would require much creative research, and the implementation of such a system would challenge present computational resources.

USABILITY ISSUES

Given the technical feasibility of speech synthesis and speech recognition, and given adequate low-cost computational resources, the question remains as to whether human-machine voice communication is useful and worthwhile. Intuition suggests that there are many situations where significant improvements in efficiency and performance could result from the use of voice communication/control, even of a limited and constrained nature. However, we must be careful not to make assumptions about the utility of human-machine voice communication based on conjecture or on our personal experience with human-human communication. What is needed is hard experimental data from which general conclusions can be drawn. In some very special cases, voice communication with a machine may allow something to be done that cannot be done any other way, but such situations are not the norm. Even for what seem to be obvious areas of application, it generally can be demonstrated that some other means of accomplishing the task either already exists or could be devised. Therefore, the choice usually will be determined by such factors as convenience, accuracy, and efficiency. If voice communication with machines is more convenient or accurate, it may be considered worth the extra cost even if alternatives exist. If it is more efficient, its use will be justified by the money it saves.

The scientific basis for making decisions about such questions is at best incomplete. The issues are difficult to quantify and are not easily encapsulated in a neat theory. In many cases even careful experiments designed to test the efficacy of human-machine communication by voice have used humans to simulate the behavior of the machine. Some of the earliest work showed that voice communication capability significantly reduced the time required to perform tasks involving simulated human-computer interaction (14), and subsequent research has added to our understanding. However, widely applicable procedures for the design of human-machine voice communication systems are not yet available. The paper by Cohen and Oviatt (15) is a valuable contribution because it summarizes the important issues and current state of knowledge on human-machine interaction and points the way to research that is needed as a basis for designing systems.

The paradigm of the voice-controlled team and wagon has features that are very similar to those found in some computer-based systems in use today: a limited vocabulary of acoustically distinct words, spoken in isolation, with an alternate communication/control mechanism conveniently accessible to the human in case it is necessary to override the voice control system.


Given a computer system with such constrained capabilities, we could certainly go looking for applications for it. In the long term, however, a much more desirable approach would be to determine the needs of an application and then specify the voice communication interface that would meet the needs effectively. To do this we must be in a better position to understand the effect on the human's performance and acceptance of the system of such factors as:
* vocabulary size and content,
* fluent speech vs. isolated words,
* constraints on grammar and speaking style,
* the need for training of the recognition system,
* the quality and naturalness of synthetic voice response,
* the way the system handles its errors in speech understanding, and
* the availability and convenience of alternate communication modalities.

These and many other factors come into play in determining whether humans can effectively use a system for voice communication with a machine and, just as important, whether they will prefer using voice communication over other modes of communication that might be provided.

Mixed-Mode Communication. Humans soon tire of repetition and welcome anything that saves steps. Graphical interfaces involving pointing devices and menus are often tedious for repetitive tasks, and for this reason most systems make available an alternate short-cut for entering commands. This is no less true for voice input; humans are also likely to tire of talking to their machines. Indeed, sometimes we would even like for the machine to anticipate our next command; for example, something like the well-trained team of mules that automatically stopped the corn-picking wagon as the farmer fell behind and moved it ahead as he caught up (William H. Schafer, personal communication). While mind reading may be out of the question, clever integration of voice communication with alternative sensing mechanisms, alternate input/output modalities, and maybe even machine learning will ultimately lead to human-machine interfaces of greatly improved usability.

Experimental Capabilities. With powerful workstations and fast coprocessors readily available, it is now possible to do real-time experiments with real human-machine voice communication systems. These experiments will help answer questions about the conditions under which these systems are most effective, about how humans learn to use human-machine voice communication systems, and about how the interaction between human and machine should be structured; then the new "theory of modalities" called for by Cohen and Oviatt (15) may begin to emerge.

CONCLUSION

Along the way to giving machines human-like capability to speak and understand speech, there remains much to be learned about how structure and meaning in language are encoded in the speech signal and about how this knowledge can be incorporated into usable systems. Continuing improvement in the effectiveness and naturalness of human-machine voice communication systems will depend on creative synthesis of concepts and results from many fields, including microelectronics, computer architecture, digital signal processing, acoustics, auditory science, linguistics, phonetics, cognitive science, statistical modeling, and psychology.

This work was supported by the John and Mary Franklin Foundation.

1. Flanagan, J. L. (1972) Speech Analysis, Synthesis, and Perception (Springer, New York).
2. Paget, R. (1930) Human Speech (Harcourt, New York).
3. Pierce, J. R. (1969) J. Acoust. Soc. Am. 47, 1049-1050.
4. Gold, B. & Rader, C. M. (1969) Digital Processing of Signals (McGraw-Hill, New York).
5. Oppenheim, A. V. & Schafer, R. W. (1975) Digital Signal Processing (Prentice-Hall, Englewood Cliffs, NJ).
6. Rabiner, L. R. & Gold, B. (1975) Theory and Application of Digital Signal Processing (Prentice-Hall, Englewood Cliffs, NJ).
7. Rabiner, L. R. & Schafer, R. W. (1978) Digital Processing of Speech Signals (Prentice-Hall, Englewood Cliffs, NJ).
8. Fant, G. (1960) Acoustic Theory of Speech Production (Mouton, The Hague, The Netherlands).
9. Deller, J. R., Jr., Proakis, J. G. & Hansen, J. H. L. (1993) Discrete-Time Processing of Speech Signals (Macmillan, New York).
10. Flanagan, J. L. (1995) Proc. Natl. Acad. Sci. USA 92, 9938-9945.
11. Maragos, P. (1991) in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (Toronto), pp. 417-420.
12. Maragos, P., Kaiser, J. F. & Quatieri, T. F. (1994) IEEE Trans. Signal Process., in press.
13. Rabiner, L. R. & Juang, B.-H. (1993) Fundamentals of Speech Recognition (Prentice-Hall, Englewood Cliffs, NJ).
14. Chapanis, A. (1975) Sci. Am. 232, 36-49.
15. Cohen, P. R. & Oviatt, S. L. (1995) Proc. Natl. Acad. Sci. USA 92, 9921-9927.
