
Received 14 April 1967 9.10

Computer Recognition of Connected Speech*

D. R. REDDY

Computer Science Department, Stanford University, Stanford, California 94305

A system for obtaining a phonemic transcription from a connected speech sample entered into the com- puter by a microphone and an analog-to-digital converter is described. A feature-extraction program divides the speech utterance into segments approximately corresponding to phonemes, determines pitch periods of those segments where pitch analysis is appropriate, and computes a list of parameters for each segment. A classification program assigns a phoneme-group label (vowellike segment, fricativelike segment, etc.) to each segment, determines whether a segment should be classified as a phoneme or whether it represents a phoneme boundary between two phonemes, and then assigns a phoneme label to each segment that is not rejected as being a phoneme boundary. About 30 utterances of 1-2 sec duration were analyzed using the above programs on an interconnected IBM 7090-PDP1 system. Correct identification of many vowel and consonantal phonemes was achieved for a single speaker using the same speech material that was used for developing the recognition procedure. The time for analysis of each utterance was about 40 times real time.

INTRODUCTION

SPEECH recognition by a computer without the use of any special-purpose hardware poses several interesting problems and requires a different approach from those used before the availability of suitable computers. Firstly, the problem of speech recognition by machine is yet to be satisfactorily solved. Secondly, any attempt at simulating the solutions that now use filters would require excessive computer time. This paper describes the details of a system developed at the Stanford Computer Science Department for obtaining a phonemic transcription from a connected speech sample.

Several requirements and objectives were formulated during the development of this system that concern the level of recognition attempted, the use of linguistic and probabilistic constraints, the number of speakers to be recognized, and the use of special-purpose hardware. In this Section, we discuss these objectives and outline how they affect the design of the system.

There are several speech-recognition problems of varying degrees of complexity. Input to the machine may consist of isolated syllables, words, or connected speech. Output from the machine usually consists of a sequence of discrete symbols, e.g., phonemes, syllables, or words. If recognition can be achieved at the level of continuous speech, then one can easily recognize isolated syllables or words; the opposite is not the case. Thus, we decided to attempt recognition of continuous speech although such a task is considered extremely hard at the present time. Output in the form of phonemic symbols was chosen in preference to syllable or word-string output. Although the influence of neighboring phonemes cannot be properly accounted for at the phoneme level, it is attractive because one has to distinguish between only 40 or so different entities.

* This paper is based on the material in "An Approach to Computer Speech Recognition by Direct Analysis of the Speech Wave," PhD thesis, Computer Sci. Dept., Stanford Univ. (1966). This thesis is available in published form as Tech. Rept. No. CS49, Computer Sci. Dept., Stanford Univ., Stanford, Calif. (Sept. 1966).


A computer was used in all the phases of the investigation without the use of any preprocessing hardware. Since we were mainly interested in computer speech recognition, it seemed appropriate to see how much can be achieved by the use of the computer alone. Thus, our speech input equipment consisted of a microphone and an analog-to-digital converter. An advantage of this approach is that one is not committed by the hardware to specific recognition algorithms. This is not meant to imply that special-purpose hardware is not useful in computer speech recognition but only that one should develop optimal algorithms using a computer before implementing them in hardware form. Restricting ourselves to the use of the computer alone forced us to seek new and different solutions to the problems of speech processing, since any attempt at simulating the approaches that require the use of filters would have required excessive computer time.




Some of the differences in approach that have led to substantial reductions in the processing time are the following:

• A running spectrum of the whole utterance is not computed. Even the limited spectral analysis that is required forms the last rather than the first part of the feature-extraction procedure.

• Many of the features required for determining the phoneme string are obtained by operating directly on the speech wave.

• Segmentation and pitch-detection procedures use information from the speech wave that would be difficult to recover from the output of a filter bank.

No attempt was made to use linguistic or probabilistic constraints. The reasons for this are manifold. Firstly, it was felt that a first approximation to the phoneme string must be obtained before one can apply linguistic and probabilistic constraints to improve recognition. Secondly, it was felt that not all the information obtainable from the acoustic waveform has been utilized in the past. While extensive use has been made of the spectral information, the information contained in the prosodic parameters is yet to be tapped. It was considered desirable to see how much recognition can be achieved by the use of acoustic information alone.

The data consisted of about 30 utterances of continuous speech of 1-2 sec duration by a single speaker. This sample size, requiring the recognition of about 300 phonemes, was considered satisfactory since we were mainly interested in the feasibility of obtaining a phoneme string from continuous speech rather than in obtaining statistical results based on large amounts of data. The advantage of a single-speaker approach is that it is easier to write programs, adjust their parameters, and test them. In order to permit the recognition program to adapt to a wider variety of speakers, it will then be necessary to develop a tune-in program that will modify the recognition program.

The classification programs are based on what we know about acoustic parameters of speech in general. Speaker-dependent information is used only in resolving ambiguities when there is insufficient information to effect classification. Use of speaker-dependent cues is necessary because the so-called significant acoustic cues of various phonemes are not always present, and humans recognize the sounds because of the presence of compensatory acoustic cues. Machines can use a similar strategy (which is in fact used in our procedures) in which the absence of primary cues leads the program to look for the secondary cues. For example, in the classification of stops, if the second-formant transitions are not well defined, the program attempts to use the frequency of the burst following the silence, although the burst frequency is not as reliable a cue as the slope of the formant transitions.

Our objectives are similar in many respects to other attempts at speech recognition by machine. Comprehensive surveys of these attempts may be found in Flanagan¹ and Lindgren.² Fry and Denes³ and Sakai and Doshita⁴ have built special machines for recognition of phonemes from continuous speech. Forgie and Forgie⁵ have used a computer as a major tool in their recognition of vowels. Hughes and Hemdal⁶ used a computer to obtain phoneme strings from words using the single-speaker approach. Talbert et al.⁷ and Martin et al.⁸ have used neural nets to recognize words and specific phonemes, respectively. The main difference between the above attempts and the present one is that almost all of them use the output from a bank of filters as input to the system, and this in turn leads to the use of a completely different strategy in speech analysis. In the remaining Sections, we describe the details of our system and the results obtained using it.

I. METHOD AND MATERIALS

The procedures described in this paper were formulated by experimenting with various possibilities on an interconnected IBM 7090-PDP1 computer. An Electro-Voice model 664 microphone and an AR-2a loudspeaker attached to the PDP1 provide the speech input-output capabilities to the system. The signal from the microphone is amplified using a Burr-Brown model 9647 amplifier and digitized using an A-D converter connected to the PDP1. The output from the D-A converter of the PDP1 is connected to the loudspeaker using an operational amplifier.

The speech signal is sampled at a rate of 20 000 samples/sec and digitized to 9-bit samples (±256 levels of amplitude). The 20-kHz rate was dictated by the 10-kHz frequency range of interest. The number of levels of digitization was dictated by the intensity range of interest and the noise level of the room. For the English language, the sample size must provide for about a 40-dB (≈8 bits including sign) range. In our case, it would have been difficult to obtain any more than nine significant bits of information, since the noise level of the computer room was already affecting the low-order bit.

¹ J. L. Flanagan, Speech Analysis Synthesis and Perception (Academic Press Inc., New York, 1965).

² N. Lindgren, "Automatic Speech Recognition," IEEE Spectrum 2, 3, 114-136 (1965).

³ D. B. Fry and P. Denes, "Experiments in Mechanical Speech Recognition," Information Theory [Butterworth and Co. (Publishers) Ltd., London, 1955], pp. 206-212.

⁴ T. Sakai and S. Doshita, "The Automatic Speech Recognition System for Conversational Sound," IEEE Trans. Electron. Computers 12, 835-846 (1963).

⁵ J. W. Forgie and C. D. Forgie, "Results Obtained from a Vowel Recognition Computer Program," J. Acoust. Soc. Am. 31, 1480-1489 (1959).

⁶ G. W. Hughes and J. F. Hemdal, "Speech Analysis," Purdue Res. Foundation, Lafayette, Ind., TR-EE65-9 (1965).

⁷ L. R. Talbert et al., "A Real-Time Adaptive Speech Recognition System," Stanford Electronic Labs., Stanford, Calif., Rept. ASD-TDR-63-660 (1963).

⁸ T. B. Martin, A. L. Nelson, and H. J. Zadell, "Speech Recognition by Feature Abstraction Techniques," Wright-Patterson AFB AF Avionics Labs., Rept. AL-TDR 64-176 (1964).


Use of an acoustic chamber to reduce the noise level artificially was not considered, since that would not be the normal mode of man-machine communication.
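As a quick check on these figures (our own illustration, not from the paper), each bit of linear amplitude resolution contributes about 6 dB of dynamic range:

    import math

    def dynamic_range_db(bits):
        """Approximate dynamic range of signed linear PCM with `bits` total bits."""
        levels = 2 ** (bits - 1)        # magnitude levels; the sign takes one bit
        return 20 * math.log10(levels)

    print(round(dynamic_range_db(9)))   # 9-bit samples (+/-256 levels): ~48 dB
    print(round(dynamic_range_db(8)))   # ~42 dB, covering the ~40-dB range cited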

The digitized speech utterance is displayed on a CRT and the portion of interest is selected using a light pen. The selected portion is played back over the loudspeaker to ensure that the desired utterance is contained within the selected segment. The digitized utterance is then stored on the IBM 1301 disk file and is analyzed by the IBM 7090. The analysis programs were written in the BALGOL language. The results of analyses are displayed along with the waveform on a CRT display attached to the PDP1. The waveform and results are then graphed on a CALCOMP plotter.


Thirty-two continuous utterances of 1-2 sec duration were used in the analyses. Of these, 22 sequences were nonsense sounds in which each consonant was combined with the five long vowels /a/, /u/, /i/, /e/, and /o/. Occasionally, when a given sequence was too long to fit within the computer storage space available, only that portion of the sequence that would fit was analyzed. The remaining utterances consisted of a few simple sentences and the digits spoken three at a time. The sounds were all uttered by a single speaker but not in a single recording session. This set of sounds provided the test material in all phases of this investigation.


The various steps involved in obtaining a phoneme string starting from the raw speech wave can be summarized as follows. First, the utterance is normalized so that the largest amplitude has a value of 1. The normalized sound is segmented into sustained and transitional parts. A procedure for pitch extraction determines a typical pitch period for each of the segments accepted as being voiced. A pitch-synchronous analysis of the typical pitch period determines the energy present up to the 100th harmonic. Results obtained from the above procedures are used to determine a list of properties for each segment, such as duration, intensity, and frequencies of energy concentration. Each segment is initially classified as being a vowellike segment, fricativelike segment, etc. Tests of acoustic closeness are used to determine whether a given segment has a phoneme associated with it or whether it is a transitional segment between two phonemes. Those segments that are not rejected as being transitions are then associated with phonemes of English. A flowchart of the various operations involved in transforming the raw speech wave into a phoneme string is given in Fig. 1.

II. NORMALIZATION, SEGMENTATION, AND PITCH DETERMINATION

Every utterance was amplitude normalized to reduce the variability resulting from the differing average intensity levels of the utterances.

[FIG. 1. Flowchart of operations in phoneme transcription of a speech utterance: record speech utterance → normalize → segment → determine pitch periods → pitch-synchronous spectrum analysis → determine parameters → classify segments into phoneme groups → detect null segments → classify non-null segments into phonemes → output phoneme string.]

Amplitude normalization may be defined as follows. Let the speech wave of an utterance be represented by an n-vector x whose elements are the ordinates of the speech wave at a number of equidistant points. Then the normalized sound is given by the vector x′, where

x′ⱼ = xⱼ / max(|x₁|, …, |xₙ|),  1 ≤ j ≤ n.

The above definition of intensity normalization is crude but simple and was found to be satisfactory. It is possible, of course, to develop better and more sophisticated algorithms for normalization. However, it is not yet clear how to choose an optimal normaliza- tion scheme.
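A minimal sketch of this normalization (ours, assuming NumPy; the guard against an all-zero record is our addition):

    import numpy as np

    def normalize(x):
        """Scale the sampled speech wave so that the largest amplitude is 1."""
        x = np.asarray(x, dtype=float)
        peak = np.max(np.abs(x))
        return x / peak if peak > 0 else x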

Since segmentation is usually the first step in speech analysis, it is desirable to be able to segment sounds using parameters that are simple to obtain. Using two easily obtained parameters, intensity and zero crossings, we have developed a procedure to segment the speech wave into sustained parts, where the characteristics of the sound are regular, and transitional parts, where the characteristics vary with time. First, the speech wave is divided into a succession of minimal segments of 10-msec duration. The intensity level (the maximum amplitude of the speech wave within the minimal segment) and the number of zero crossings are calculated for each minimal segment. These parameters are used to group together acoustically similar minimal segments to form larger segments, resulting in the segmentation of the sound.


[FIG. 2. Spectral properties of /a/: a computer printout listing, for each harmonic analyzed, the harmonic number, its frequency, the sine and cosine coefficients, the amplitude, and the sum of adjacent amplitudes, with the derived parameters (formant and zero frequencies and amplitudes, voicing, noise level, and noise-energy measures) printed at the bottom.]

The definition of acoustic similarity and the description of the segmentation procedure are given by Reddy.⁹
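The grouping rules themselves are given in Reddy⁹; the two per-minimal-segment parameters they operate on can be sketched as follows (the constants and names FS and MINSEG are ours):

    import numpy as np

    FS = 20_000              # sampling rate used in this study (samples/sec)
    MINSEG = FS // 100       # one 10-msec minimal segment = 200 samples

    def minimal_segment_params(x):
        """Intensity (maximum |amplitude|) and zero-crossing count for each
        10-msec minimal segment of the normalized speech wave x."""
        params = []
        for start in range(0, len(x) - MINSEG + 1, MINSEG):
            seg = np.asarray(x[start:start + MINSEG])
            intensity = float(np.max(np.abs(seg)))
            s = np.signbit(seg)
            zero_crossings = int(np.count_nonzero(s[1:] != s[:-1]))
            params.append((intensity, zero_crossings))
        return params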

A pitch-period-extraction procedure operates on the segmented sound to divide each segment into a sequence of pitch periods. Only those segments where voicing is likely to be present are subjected to pitch-period analysis. Very-low-intensity segments, which can be considered as silence, and medium-intensity segments with a large number of zero crossings, which may be considered as high-frequency noise (as in unvoiced fricatives), were not analyzed. The pitch-period-extraction procedure is based on waveform analysis akin to that of Gold.¹⁰ However, rather than developing several simple pitch-period extractors and then attempting to combine them, we developed a single but more complex pitch-period extractor. The procedure uses the constraints that the range of the fundamental frequency is about 75-400 Hz and that, for a given speaker, it does not differ by more than 20% within any segment of a normal utterance. Errors are corrected by permitting additional scans of the speech wave. Further details of the pitch-period-extraction procedure are given by Reddy.¹¹

⁹ D. R. Reddy, "Segmentation of Speech Sounds," J. Acoust. Soc. Am. 40, 307-312 (1966).

¹⁰ B. Gold, "Computer Program for Pitch Extraction," J. Acoust. Soc. Am. 34, 916-921 (1962).

¹¹ D. R. Reddy, "Pitch Period Determination of Speech Sounds," Comm. ACM 10, 343-348 (1967).

A typical pitch period within a segment is defined to be the pitch period that is located about the middle of the segment and contains at least one pitch period adjacent to it whose duration does not differ from its own by more than 10%. If the latter condition is not satisfied, then a scan is made within the segment to find a pitch period close to the center that satisfies the condition.
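A sketch of that selection rule (our own rendering; `periods` holds the successive pitch-period durations of a segment, in msec):

    def typical_pitch_period(periods):
        """Return the index of a 'typical' pitch period: one near the middle
        of the segment having an adjacent period whose duration differs from
        its own by no more than 10%, scanning outward from the center;
        None if no period qualifies."""
        mid = len(periods) // 2
        for i in sorted(range(len(periods)), key=lambda k: abs(k - mid)):
            for j in (i - 1, i + 1):
                if 0 <= j < len(periods) and abs(periods[j] - periods[i]) <= 0.1 * periods[i]:
                    return i
        return None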

III. SPECTRAL ANALYSIS

A typical pitch period within each segment is used to estimate the spectral characteristics of the segment by means of pitch-synchronous analysis. Since the segmentation procedure groups acoustically similar minimal segments together to form sustained segments, it was considered sufficient to analyze only one typical pitch period per segment to determine the spectral properties of the whole segment. However, when transitions of formants were considered to be important, two pitch periods located towards the ends of the segment were analyzed to determine the slopes of the formant transitions.

Determination of the spectral envelope by pitch-synchronous analysis consists of subjecting a pitch period of the speech wave to Fourier series expansion and was first proposed by Mathews, Miller, and David.¹² Let aₙ and bₙ be the nth sine and cosine coefficients of the Fourier series. Then the amplitude of the spectrum at the nth harmonic, Pₙ = (aₙ² + bₙ²)^½, represents the nth ordinate of the spectral envelope. Since we are only interested in Pₙ in this investigation, the choice of the beginning and the end points of a pitch period was not considered important. This differs from the work of Mathews, Miller, and David, who define the beginning and the end of a pitch period to be the zero crossing before the principal peak in the period.
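A sketch of the pitch-synchronous expansion (ours; as noted later in the text, the absolute scaling of aₙ and bₙ is irrelevant since only Pₙ is used):

    import numpy as np

    def harmonic_amplitudes(period, n_max=100):
        """Fourier-series amplitudes P_n = sqrt(a_n^2 + b_n^2), n = 0..n_max,
        treating the samples of one pitch period as one fundamental cycle."""
        x = np.asarray(period, dtype=float)
        T = len(x)
        t = np.arange(T)
        P = np.empty(n_max + 1)
        for n in range(n_max + 1):
            a_n = np.sum(x * np.sin(2 * np.pi * n * t / T))   # sine coefficient
            b_n = np.sum(x * np.cos(2 * np.pi * n * t / T))   # cosine coefficient
            P[n] = np.hypot(a_n, b_n)
        return P    # P[0] plays the role of Harmonic 0 in Figs. 2 and 3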

The low-intensity segments (silence) and the high-frequency noise segments that do not exhibit any periodicity do not satisfy the hypothesis for pitch-synchronous analysis. Thus, the spectral envelope for these segments should ideally be calculated using one of the Fourier transform methods. However, we found that, if we were willing to accept errors of about 5-10%, Fourier series expansion of the speech wave over a sufficient duration (about 10 msec) within each segment was adequate for our purposes. The approach seemed justifiable since it is known that the ear makes only a crude form of frequency analysis and since we are mainly interested in the noise frequency of maximum power. Since the noise frequency of interest was always contained in the spectral envelope in the range 1000-10 000 Hz, the choice of the time window of 10 msec ensured the presence of more than 10 waves of the frequency of excitation of the vocal-tract constriction.

¹² M. V. Mathews, J. E. Miller, and E. E. David, Jr., "Pitch Synchronous Analysis of Voiced Sounds," J. Acoust. Soc. Am. 33, 179-186 (1961).


TABLE I. Parameters of speech sounds.

1. Duration in milliseconds
2. Intensity (range 0-32)
3. Standard deviation of intensity
4. Mean zero crossings/10 msec
5. Standard deviation of zero crossings
6. Pitch in hertz
7. Sum of amplitude at Harmonics 1 and 2
8. Frequency of Formant 1
9. Amplitude at Formant 1
10. Frequency of Formant 2
11. Amplitude at Formant 2
12. Frequency of Formant 3
13. Amplitude at Formant 3
14. Frequency of Zero 1
15. Amplitude at Zero 1
16. Frequency of Zero 2
17. Amplitude at Zero 2
18. Frequency of noise-energy concentration
19. Amplitude of noise-energy concentration
20. Number of points above noise threshold
21. Sum of amplitude at the above points

Further, by choosing the time window to be about the same as the average pitch period of the speaker, it was possible to correct for the errors that occasionally resulted from assuming that some segments representing voiced stops and fricatives were unvoiced. Error analyses were conducted on the effect of using an incorrect pitch period on the spectral envelope. Because of the averaging technique used in our definition of formant and noise frequencies (to be defined later), the error was seldom more than 10%, and it became smaller as the number of waves contained in the time window increased, as expected.

The success of this technique was dependent on reasonable separation of the poles of the vocal tract.

Most vocal cavities function as damped resonators, resulting in little difference in amplitude over a small bandwidth around a pole. At higher frequencies, this bandwidth can be expected to overlap several harmonics. Consequently, calculation of amplitude can be omitted at some harmonics at higher frequencies with no resulting loss of information. In this investigation, Pₙ was calculated for each of the first 20 harmonics, every second harmonic between the 22nd and 40th, every third between the 43rd and 60th, and every fourth between the 64th and 100th harmonics. In the case of women or children, it is likely that only the first 40 harmonics need be calculated. Since calculation of the value of the sine and cosine functions every time would have involved computer times of the order of 10 msec per evaluation, a sine table of 512 values over the interval [0, 2π] was used in this investigation. The accuracy afforded was considered sufficient for our purposes. No attempt was made to convert to a decibel scale since the range of numbers posed no problems.
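The harmonic schedule and the table-lookup trick can be written down directly (a sketch; endpoint handling follows the sentence above):

    import math

    # Harmonics evaluated: 1-20, every 2nd from 22-40, every 3rd from 43-60,
    # and every 4th from 64-100.
    HARMONICS = (list(range(1, 21)) + list(range(22, 41, 2))
                 + list(range(43, 61, 3)) + list(range(64, 101, 4)))

    # A 512-entry sine table over [0, 2*pi), standing in for repeated sin() calls.
    SINE_TABLE = [math.sin(2 * math.pi * k / 512) for k in range(512)]

    def table_sin(theta):
        """Nearest-entry lookup of sin(theta) from the 512-value table."""
        k = int(round(theta * 512 / (2 * math.pi))) % 512
        return SINE_TABLE[k]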

IV. TYPICAL SPECTRAL ENVELOPES

Figures 2 and 3 give detailed results of the spectral analysis of the sounds /a/ and /s/.

[FIG. 3. Spectral properties of /s/: a computer printout in the same format as Fig. 2, with the derived formant, zero, voicing, and noise parameters printed at the bottom.]

The parameters determined using the spectral envelope are given at the bottom of the figures. The output of the spectral analysis consists of the harmonic number n, the frequency Fₙ in hertz of that harmonic, the sine coefficient bₙ, the cosine coefficient aₙ, the amplitude Pₙ, and the sum of adjacent amplitudes Qₙ. The terms aₙ and bₙ are scaled, and the scaling factor is irrelevant for our purposes. Harmonic 0 represents the amplitude defined by the coefficient a₀ of the Fourier expansion. P₀ might occasionally be a crude measure of the volume of airflow, which is useful in the classification of whispered vowels and /h/. The spectral envelopes of /a/ and /s/ can be seen to be of similar shape to those in the literature. Note that the amplitude is given on the pressure (linear) scale rather than the decibel (logarithmic) scale. The pressure scale was adopted for efficiency reasons and because the range of numbers was well within the limits of the computer. Note that the envelope for /s/ was calculated by subjecting a 10-msec time window of the /s/ segment to Fourier series expansion (rather than a Fourier transform).

V. PARAMETER DETERMINATION

Table I gives the list of all the parameters calculated for each segment for use with the classification procedures. This list is an extension of a list of significant parameters proposed by Peterson.¹³

¹³ G. E. Peterson, "Automatic Speech Recognition Procedures," Language and Speech 4, 200-219 (1961).


The first five parameters of Table I are calculated as soon as the utterance is segmented. The rest of the parameters are calculated after the determination of the spectral envelope. Not all these parameters were used in the present classification procedures, e.g., pitch and the frequencies of Zero 1 and Zero 2. But we believe that they will be necessary in classification procedures involving several speakers.

The parameters given in Table I were not defined in a manner identical to that used in the past literature. However, an attempt was made to define the parameters so that comparison with past results would be easy and make them more meaningful. Several other parameters, e.g., measures of noisiness, are believed to be new. Parameters such as duration, intensity, and zero crossings were found to be at least as useful as the formant information. The procedures for parameter determination made use of the following algorithmic definitions.

Duration of a segment is the elapsed time in milliseconds between the start of the segment and the start of the following segment.

Intensity of a segment is defined to be the mean of the intensity levels of the minimal segments contained within the segment.

Mean zero crossings of a segment represents the average number of zero crossings per 10 msec within the segment and is defined to be the mean of the zero crossings of the minimal segments contained in the segment.

Standard deviation of the intensity of a segment and zero crossings of a segment are defined in the obvious statistical sense.

Let the pitch period of a segment be the mean period of all the pitch periods contained within the segment. Then the pitch of a segment is defined to be the frequency in hertz specified by the pitch period of the segment.
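A sketch of parameters 1-6 computed from these definitions (our data layout: per-minimal-segment (intensity, zero-crossing) pairs, plus the segment's pitch periods in msec for voiced segments):

    import statistics

    def basic_segment_params(minseg_params, pitch_periods_ms=()):
        """Parameters 1-6 of Table I for one segment."""
        intensities = [p[0] for p in minseg_params]
        zcs = [p[1] for p in minseg_params]
        params = {
            "duration_ms": 10 * len(minseg_params),   # 10 msec per minimal segment
            "intensity": statistics.mean(intensities),
            "sd_intensity": statistics.pstdev(intensities),
            "mean_zc_per_10ms": statistics.mean(zcs),
            "sd_zc": statistics.pstdev(zcs),
        }
        if pitch_periods_ms:                           # voiced segments only
            params["pitch_hz"] = 1000.0 / statistics.mean(pitch_periods_ms)
        return params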

The spectral envelope defined by the amplitudes Pₙ of the nth harmonic is used to determine the spectrum-dependent properties of the segment.

The definition of the sum of the amplitude at Harmonics 1 and 2 is obvious; it was used as an indication of voicing.

Let F and P be n-vectors such that Pᵢ represents the amplitude at frequency Fᵢ. Let Q be a vector such that

Qᵢ = Pᵢ₋₁ + Pᵢ + Pᵢ₊₁  for i = 1, …, n−1.

The summing of the three adjacent elements of the spectral envelope has the effect of smoothing local variations and can be thought of as a form of local integration.

Let j be an index such that Fⱼ₋₁ ≤ 850 Hz and Fⱼ > 850 Hz, and let k be an index such that Qₖ ≥ Qᵢ for i = 1, …, j. Then the frequency of Formant 1 is defined to be

F1 = (Pₖ₋₁·Fₖ₋₁ + Pₖ·Fₖ + Pₖ₊₁·Fₖ₊₁)/Qₖ.

The amplitude at Formant 1 is defined to be

AF1 = max(Pₖ₋₁, Pₖ, Pₖ₊₁).

Thus, F1 is the weighted average of the frequencies around the maximum peak of the smoothed spectrum Q in the approximate frequency range 100-950 Hz. One can also visualize the definition of F1 in another way. Associate with a given frequency Fᵢ the area Qᵢ under the spectral envelope defined by P in the frequency interval (Fᵢ₋₁, Fᵢ₊₁). Then F1 can be thought of as the frequency of the center of gravity of the maximum area Qₖ, where Qₖ ≥ Qᵢ for i = 1, …, j.

F2 and F3 and the amplitudes at F2 and F3 are defined analogously. Let l be an index such that l = max(j+1, k+4), where j and k were determined during the calculation of F1. Let m be an index such that Fₘ₋₁ < 2000 and Fₘ ≥ 2000. Then the frequency interval [Fₗ, Fₘ] specifies the range in which F2 is expected to be found. The definition of l ensures that there is no overlap of the areas considered for F1 and F2. The expected range of F2 was 900-2000 Hz.

Let k′ be an index such that Qₖ′ ≥ Qᵢ for i = l, …, m. Then F2 and the amplitude at F2 can be determined by replacing k by k′ in the equation for F1. Let n = k′+4, and let p be an index such that Fₚ₋₁ < 3200 and Fₚ ≥ 3200. Then the frequency range for F3 is given by the interval [Fₙ, Fₚ]. F3 and the amplitude at F3 are calculated analogously.

No significant formant structure exists for stoplike and fricativelike segments. However, F1, F2, F3, etc., were also calculated for these segments because of the uniformity of handling and because the amplitudes at F1, F2, and F3 still provide an indication of the shape of the spectral envelope.

Let k and k′ be the indices determined in the calculation of F1 and F2. Let j be an index such that Qⱼ ≤ Qᵢ for i = k+1, …, k′−1. Then the frequency of Zero 1 is defined to be

Z1 = (Fⱼ₋₁/Pⱼ₋₁ + Fⱼ/Pⱼ + Fⱼ₊₁/Pⱼ₊₁) / (1/Pⱼ₋₁ + 1/Pⱼ + 1/Pⱼ₊₁).

The amplitude at Zero 1 is defined to be

AZ1 = min(Pⱼ₋₁, Pⱼ, Pⱼ₊₁).

Z2 and AZ2 are defined analogously by assuming that Z2 is contained in the frequency interval (F2,F3).
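The peak- and zero-picking arithmetic of these definitions can be sketched as follows (ours; F, P are NumPy arrays, and the bookkeeping that fixes the search ranges at the 850-, 2000-, and 3200-Hz crossings is condensed into the lo/hi arguments):

    import numpy as np

    def smooth(P):
        """Q_i = P_{i-1} + P_i + P_{i+1}, the local integration described above."""
        P = np.asarray(P, dtype=float)
        Q = P.copy()
        Q[1:-1] = P[:-2] + P[1:-1] + P[2:]
        return Q

    def formant(F, P, Q, lo, hi):
        """Weighted-average frequency and amplitude of the strongest smoothed
        peak with index in [lo, hi), per the F1/AF1 equations."""
        k = lo + int(np.argmax(Q[lo:hi]))
        freq = (P[k-1]*F[k-1] + P[k]*F[k] + P[k+1]*F[k+1]) / Q[k]
        return k, freq, max(P[k-1], P[k], P[k+1])

    def spectral_zero(F, P, Q, k, k2):
        """Frequency and amplitude of the spectral zero between two formant
        peaks k and k2, per the Z1/AZ1 equations (harmonic-mean weighting)."""
        j = k + 1 + int(np.argmin(Q[k+1:k2]))
        w = 1/P[j-1] + 1/P[j] + 1/P[j+1]
        freq = (F[j-1]/P[j-1] + F[j]/P[j] + F[j+1]/P[j+1]) / w
        return freq, min(P[j-1], P[j], P[j+1])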

An index of the noisiness of a segment is often useful for distinguishing between ambiguous cases such as overlapping formant regions, as in the case of /a/ and /o/. Further, in distinguishing among fricatives, it is necessary to have a knowledge of the frequency of maximum power in the frequency interval (1200, 10 000).


[FIG. 4. Typical parameters of selected phonemes: a computer-printed table giving, for each of the phonemes A, AA, O, U, OO, K, P, B, F, I, EE, EH, S, SH, Z, M, N, W, Y, R, and L, the 21 parameters of Table I (duration, intensity and its standard deviation, zero crossings and their standard deviation, pitch, sum of power at Harmonics 1 and 2, formant and zero frequencies and amplitudes, and the noise-energy measures).]

Let i be an index such that Pᵢ ≥ Pⱼ for all j for which Fⱼ is in the interval (1200, 10 000). Then the frequency of noise-energy concentration is defined, using the same weighted average as for the formants, to be

NF = (Pᵢ₋₁·Fᵢ₋₁ + Pᵢ·Fᵢ + Pᵢ₊₁·Fᵢ₊₁)/Qᵢ.

The amplitude at NF is given by Pᵢ.

Let t define a threshold noise level such that all the harmonics that have amplitude greater than t may be considered significant. Let H be a Boolean vector defined by

Hᵢ = 1, if Pᵢ > t and Fᵢ > 1200;  Hᵢ = 0, otherwise.

The number of points above threshold is given by

Σ Hᵢ,  for i = 1, …, n.

The total amplitude at the points above threshold is given by

Σ Hᵢ·Pᵢ,  for i = 1, …, n.

In this investigation, t was taken to be a function of the amplitude at F1 of the form a + b·AF1 (actually 20 + AF1/20). The choice of the threshold level and the frequency range was devised in an ad hoc manner and merits further research to determine an optimal definition.
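A sketch of these noisiness measures, items 18-21 of Table I (ours; the weighted-average form of NF mirrors the formant definition, as suggested by the remark in Sec. III about the averaging used in the formant and noise-frequency definitions):

    import numpy as np

    def noise_params(F, P, Q, AF1):
        """Noise-energy frequency/amplitude and the above-threshold point
        count and amplitude sum, with threshold t = 20 + AF1/20."""
        F, P, Q = (np.asarray(v, dtype=float) for v in (F, P, Q))
        band = np.flatnonzero((F > 1200) & (F < 10_000))
        i = int(band[np.argmax(P[band])])            # strongest point in band
        NF = (P[i-1]*F[i-1] + P[i]*F[i] + P[i+1]*F[i+1]) / Q[i]
        t = 20 + AF1 / 20                            # ad hoc noise threshold
        H = (P > t) & (F > 1200)                     # significant points
        return {"noise_freq": NF, "noise_ampl": P[i],
                "points_above": int(H.sum()), "ampl_sum": float(P[H].sum())}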

VI. TYPICAL PARAMETERS

Figure 4 gives a list of the parameters for some selected phonemes. The list of phonemes was chosen to be representative and depended on the number of phonemes that can be conveniently presented on a single sheet of computer paper. Entries such as "pitch" and formants that are irrelevant in the case of unvoiced sounds were still calculated for the sake of uniformity. For unvoiced sounds, pitch was defined to be the inverse of the time window used in the analysis. Further, formant frequencies of unvoiced sounds (calculated as in the case of voiced sounds) give an indication of the shape of the spectrum. The results can be seen to agree in general with the results reported in the literature in the case of those parameters that have been investigated in the past.

Note that the slopes of formant transitions are not given. These may be considered as functions of spectral parameters and are calculated when necessary. The parameters of stops /k/, /p/, and /b/ represent the analysis of the burst segment or the silence segment preceding the burst.

VII. CLASSIFICATION TECHNIQUE

In order to distinguish an element from a set of n elements, one can use different classification strategies. Pairwise comparison among all the elements of the set requires n(n−1)/2 tests or comparisons. A one-out-of-n strategy requires n tests to determine the best match.


A binary-tree-type classification requires only log₂ n tests to identify any element uniquely. The first two strategies are often used in machines built using adaptive networks. Given enough hardware, all the comparisons can be made in parallel, thus requiring no more time than would be required to make one comparison. The redundancy in testing also helps to reduce errors. But when using a serial computer, such brute-force techniques can be time consuming. Thus, if one can discover suitable tests, the binary-tree-type classification net would be ideal for computer speech recognition.
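A minimal sketch of such a binary classification net (our own illustration of the strategy, not the paper's program; the toy tree and its thresholds are hypothetical):

    from dataclasses import dataclass
    from typing import Callable, Union

    @dataclass
    class Node:
        """One test node; with binary branching, one of n labels is reached
        in about log2(n) tests, versus n(n-1)/2 pairwise or n one-out-of-n
        comparisons."""
        test: Callable[[dict], bool]       # predicate on the segment parameters
        if_true: Union["Node", str]        # subtree, or a terminal label
        if_false: Union["Node", str]

    def classify(node, params):
        while isinstance(node, Node):
            node = node.if_true if node.test(params) else node.if_false
        return node

    # A two-test toy tree over Table I parameters:
    toy = Node(lambda s: s["intensity"] > 15, "vowellike",
               Node(lambda s: s["mean_zc_per_10ms"] > 30, "fricativelike", "other"))
    print(classify(toy, {"intensity": 4, "mean_zc_per_10ms": 46}))  # fricativelike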

Jakobson, Fant, and Halle¹⁴ proposed a set of distinctive features that permit a binary-tree-type classification of the phonemes of English. Unfortunately, it has not been possible to find a one-to-one mapping between the distinctive features and acoustic features of phonemes. Hughes and Hemdal,⁶ while attempting to stay within the distinctive-feature framework, state that the compact-versus-diffuse feature is manifest as the location of Formant 1 (F1 ≤ 500 vs F1 > 500) in the case of vowels and as greater-versus-lesser spread among formants as well as greater-versus-lesser intensity in the case of fricatives. Grouping together such varied acoustic features under one name does not result in any simplified concept and can only result in a loss of intuitive understanding of the processes involved. In view of the above considerations, the following guidelines were used in the development of our classification procedures.

• Binary-tree-type classification is used as far as possible. However, a node may have more than two branches when necessary. Note that if a node had n branches, then it would be the same as a one-out-of-n classification strategy.

• Tests at each node are formulated in terms of distinctive acoustic features. After a satisfactory set of distinctive acoustic features is discovered, it may be possible to confirm or suggest modifications to the "distinctive feature" approach. By distinctive acoustic features, we mean a set of parameters directly obtainable from the speech wave. They need not be minimal. Nor do they have to be independent, i.e., the type of tests to be made at a particular node may depend on the outcome of the previous tests.

• Various allophones of a given phoneme are characterized by the presence (or absence) of several acoustic phenomena, not all of which are necessarily present in any one given acoustic manifestation. Thus, the recognition scheme must be based on a large number of specific rules, at least for the present, rather than on a single all-encompassing rule.

• The above assumption does not deny the possibility of the existence of a single, dominant characteristic for each phoneme. The rules for classification start with tests for the dominant primary cues, the secondary cues being used if the tests for primary cues prove inconclusive. Thus, we admit the possibility of the existence of a probabilistic scale among the various properties of a phoneme.

¹⁴ R. Jakobson, C. G. M. Fant, and M. Halle, Preliminaries to Speech Analysis (The MIT Press, Cambridge, Mass., 1952).


• The algorithms described here were developed for a specific speaker, and the absolute parameters used and the type of tests made may have to be modified before they can be used for any other speaker. The classification procedures were developed in an ad hoc way and did not use any machine-learning procedures. However, there are few tests in the program that would not be considered reasonable in the light of previously published research on acoustic features of phonemes.

VIII. CLASSIFICATION INTO PHONEME GROUPS

Most classification procedures (Hughes and Hemdal,⁶ Hughes and Halle,¹⁵ Lehiste,¹⁶ and Martin et al.⁸) developed for machine recognition of phonemes assume that it is possible to classify a speech sound as belonging to a phoneme group such as vowel, stop, fricative, etc., and proceed to find properties that will permit classification of sounds within each group. Some of our preliminary investigations have shown that the conventional grouping of phonemes has several inherent disadvantages for machine recognition of speech.¹⁷ In view of these difficulties, we found that a division of phonemes into subsets without requiring that they be mutually exclusive would permit us to classify a segment of speech using easily obtainable acoustic parameters and would reduce the number of errors.

After observing the acoustic parameters of about 275 phonemes in connected speech, we decided to group the sounds of English into four nonmutually exclusive subsets, viz., stoplike sounds, fricativelike sounds, nasal-liquidlike sounds, and vowellike sounds. It is possible to divide them into even smaller groups using characteristics such as voicing. However, the tendency towards erroneous grouping seems to increase in proportion to the number of groups. The elements of each of the groups are given in Table II. The rationale behind the choice of such a grouping is discussed by Reddy.¹⁷

Classification of a segment into a phoneme group is based on the hypothesis that segments that are either local maxima or local minima on a time-versus-segment-intensity graph can always be associated with a phoneme (vowel and consonant, respectively). The remaining segments may either represent a phoneme or may be null segments, i.e., segments representing a phoneme boundary that cannot be associated with any phoneme.

¹⁵ G. W. Hughes and M. Halle, "Spectral Properties of Fricative Consonants," J. Acoust. Soc. Am. 28, 303-310 (1956).

¹⁶ I. Lehiste, "Acoustical Characteristics of Selected English Consonants," Intern. J. Am. Linguistics 30, 3, 1-197 (1964).

¹⁷ D. R. Reddy, "Phoneme Grouping for Speech Recognition," J. Acoust. Soc. Am. 41, 1295-1300 (1967).


[FIG. 5. Classification tree for stoplike sounds. Notes: (a) A p-type burst has no significant noise component, but the amplitude of the waveform increases owing to a sudden increase in the volume of airflow; it is characterized by a segment with high amplitude and few zero crossings, and occasionally by a large variation in the zero crossings. (b) The test of the noisiness of a sound is useful to distinguish between aspirated and unaspirated stops. (c) The affricatives are characterized by a burst of higher intensity and longer duration than that of unvoiced stops. (d) In most cases, there is no noticeable burst following voiced stops. (e) Liberman et al. (1954) and Halle et al. (1957) have also attempted to use burst frequency as a parameter, but it appears to be unreliable.]

Thus, the phoneme-group classification procedure not only indicates what group a segment may belong to but also whether it is likely to be a phoneme boundary.

If a segment is noiselike, then it is labeled "fricativelike." Otherwise, if the segment intensity is a local maximum, then the segment is labeled "vowellike." Otherwise, the segment is labeled "stoplike," "fricativelike," or "nasal-liquidlike," depending on the segment parameters intensity and zero crossings. A segment that is neither a local maximum nor a local minimum is given an additional label, "may-be," to indicate the possibility of its being a null segment. Further details of the classification procedure are given by Reddy.¹⁷
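A sketch of this labeling rule (ours; the intensity and zero-crossing thresholds shown are illustrative placeholders, not the paper's values):

    def phoneme_group(seg, is_noiselike, is_local_max, is_local_min):
        """Phoneme-group label for a segment, plus a 'may-be' flag marking
        candidates for null segments; `seg` holds the Table I parameters."""
        if is_noiselike:
            label = "fricativelike"
        elif is_local_max:
            label = "vowellike"
        elif seg["mean_zc_per_10ms"] > 30:    # many zero crossings: noisy
            label = "fricativelike"
        elif seg["intensity"] < 3:            # near silence: stop gap
            label = "stoplike"
        else:
            label = "nasal-liquidlike"
        may_be = not (is_local_max or is_local_min)
        return label, may_be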

IX. NULL-SEGMENT DETERMINATION

Ideally, the segment boundaries resulting from the segmentation procedure should also represent the phoneme boundaries (if such a term is definable!). However, our segmentation procedure divides the speech utterance only into a series of sustained and transitional segments. In such a scheme, a segment might represent a transition between two phonemes (i.e., a null segment that cannot be associated with any phoneme), or a single phoneme, or occasionally more than one phoneme. Note that not all transitional segments are null segments, and not all sustained segments have phonemes associated with them (see Fig. 12).


Determining whether a given segment is a null segment or not has proved to be at least as complicated as discovering adequate classification procedures. Unfortunately, there exists no deterministic procedure at present that would permit us to decide whether a given segment is a null segment. After a number of trials, we have devised the procedure outlined below, which was satisfactory within the limited scope of this investigation. A more comprehensive study of a large number of sentences will be required to develop a more reliable procedure.

Null segments are determined and rejected in three stages. First, immediately after segmentation, all the transitional segments of 10-msec (one minimal segment) duration are rejected as null segments. Since the duration of the segments increases in multiples of 10 msec, this is equivalent to a conjecture that no phoneme can be of less than 20-msec duration.


[FIG. 6. Classification tree for fricativelike sounds. Notes: (a) Formant 1 is considered dominant if the amplitude of Formant 1 is above a certain threshold and the amplitude of F1 is at least twice the amplitude of F2. (b) Two factors appear to be useful in distinguishing /h/: zero crossings are fewer (<4) than expected for a fricative segment, and formantlike energy concentrations occur at about the same frequencies as the formants of the following vowel. (c) Since most unvoiced fricatives are longer than 70-msec duration, the possible candidates are stoplike sounds /t/, /k/, /f/, and /θ/ for which the expected silence segment has been mixed with the fricative segment. (d) Hughes and Halle (1956) and Hughes and Hemdal (1965) also suggest similar parameters.]


Most null segments are detected just before final classification. Closeness index and transition index are computed for all the segments that were given a label "may-be" during the subclassification. "Closeness index" and "transition index" are defined below.

Let D, X, F1, F2, F3, and NF represent the parameters duration, zero crossings, Formant 1, Formant 2, Formant 3, and frequency of noise-energy concentration of a given segment, respectively. Let the indices p and n denote the parameters for the previous and next non-may-be segments; viz., F1p represents the Formant 1 of the previous segment that is not labeled "may-be." Let f₁, …, f₉ be Boolean-valued functions defined as follows:

f₁ = 1, if D ≤ 30 msec; f₁ = 0, otherwise.
f₂ = 1, if D > 60 msec; f₂ = 0, otherwise.
f₃ = 1, if |X − Xp| ≤ X/4; f₃ = 0, otherwise.
f₄ = 1, if |F1 − F1p| ≤ 75; f₄ = 0, otherwise.
f₅ = 1, if |F1 − F1p| ≥ 250; f₅ = 0, otherwise.
f₆ = 1, if |F2 − F2p| ≤ 150; f₆ = 0, otherwise.
f₇ = 1, if |F2 − F2p| ≥ 400; f₇ = 0, otherwise.
f₈ = 1, if |F3 − F3p| ≤ 275; f₈ = 0, otherwise.
f₉ = 1, if |NF − NFp| ≤ 200; f₉ = 0, otherwise.


TABLE II. Elements of phoneme groups.

Phoneme group      Elements of the group
Stoplike           p, t, k, f, θ, tʃ, h, b, d, g, v, ð, dʒ
Fricativelike      s, ʃ, f, θ, h, z, ʒ, v
Nasal-liquidlike   (j, I, i), r, l, (w, U, u), m, n, ŋ, (b, d, g), h
Vowellike          I, i, e, ɛ, æ, ʌ, a, ɔ, o, U, u, r, l

Let Cp = f₁ − f₂ + f₃ + f₄ − f₅ + f₆ − f₇ + f₈ + f₉. Then Cp represents the closeness index of the segment with respect to the previous segment. Let Cn denote the closeness index of the segment with respect to the next segment, calculated by replacing p by n in the equations for f₁ to f₉. Further, let g₁, …, g₆ be defined as follows:

g₁ = 1, if transitional segment; g₁ = 0, otherwise.

g₂ = 1, if [(Xp − X)·(X − Xn) ≥ 0] ∨ [min(|X − Xp|, |X − Xn|) ≤ …]; g₂ = 0, otherwise.

g₃ = 1, if (F1p − F1)·(F1 − F1n) ≥ 0; g₃ = 0, otherwise.

g₄ = 1, if [(F2p − F2)·(F2 − F2n) ≥ 0] ∨ [min(|F2 − F2p|, |F2 − F2n|) ≤ 50]; g₄ = 0, otherwise.
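The closeness index lends itself directly to code; a sketch under the f₁-f₉ definitions above (`seg` and `ref` are parameter dictionaries for the segment and a neighboring non-"may-be" segment):

    def closeness_index(seg, ref):
        """C = f1 - f2 + f3 + f4 - f5 + f6 - f7 + f8 + f9 with respect to
        `ref` (the previous segment for Cp, the next for Cn)."""
        f1 = seg["D"] <= 30                       # very short segment
        f2 = seg["D"] > 60                        # long segment
        f3 = abs(seg["X"] - ref["X"]) <= seg["X"] / 4
        f4 = abs(seg["F1"] - ref["F1"]) <= 75
        f5 = abs(seg["F1"] - ref["F1"]) >= 250
        f6 = abs(seg["F2"] - ref["F2"]) <= 150
        f7 = abs(seg["F2"] - ref["F2"]) >= 400
        f8 = abs(seg["F3"] - ref["F3"]) <= 275
        f9 = abs(seg["NF"] - ref["NF"]) <= 200
        return f1 - f2 + f3 + f4 - f5 + f6 - f7 + f8 + f9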




a. Nasal sounds are usually of lower intensity and have fewer zero crossings than liquids.

b. /w/ sounds appear to have a formant structure similar to that of nasals. For /m/, /n/, and /w/, F1 is usually in the interval (140, 340); F2 is in (925, 1425) or (1700, 2300); and F3 > 1800.

c. An effective parameter, which was useful in distinguishing between /w/, /m/, and /n/ (at least for this speaker), was the ratio of the amplitude of Formant 3 to the amplitude of Formant 2.

d. If the mean zero crossings are very low, then it is unlikely that the sound is one of the liquids. The possibilities are voiced stops for which the expected closure did not eventuate, or /h/, where the large volume of airflow shifts the waveform above the zero line.

e. /U/, as in the diphthong /aU/, is characterized by an F1 slightly lower than that of the preceding vowel and an F3 that increases by more than 600.

f. Parameters for /y/, /r/, /l/, and /w/ were similar to those obtained by Lehiste (1964).


FIG. 7. Classification tree for nasal-liquidlike sounds.
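To show how a single node of such a tree reduces to code, the formant-range test of note b of Fig. 7 can be written as follows. This is a minimal sketch in Python; the function name and calling convention are ours, and the intervals are the single-speaker values quoted in the note.

    def in_nasal_w_formant_range(F1, F2, F3):
        # Note b of Fig. 7: expected formant ranges for /m/, /n/, and /w/
        # (values for this one speaker; other speakers may differ).
        return (140 < F1 < 340
                and (925 < F2 < 1425 or 1700 < F2 < 2300)
                and F3 > 1800)

The remaining nodes of the tree would be coded in the same way, as nested conditionals on the segment parameters.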


a. /I/ overlaps with the formant regions of /u/ and /o/ but is noisier and has a high F3.

b. /o/ overlaps with the formant regions of /u/ and /a/. It is distinguished from /u/ in that its F1 is slightly higher than the F1 of /u/. It is distinguished from /a/ in that it is not as noisy.

c. The F2 region is bounded by a line that is constant up to F1 = 725, after which it increases linearly with F1.

d. Note that /I/ and /U/ appear at more than one node.

e. Hughes and Hemdal (1965) and Forgie and Forgie (1958) use some similar parameters.

FIG. 8. Classification tree for vowellike sounds.


[Machine listing of segment-by-segment parameters (TIME, DUR, INS, ZCS, PTCH, VOIC, FM1, AF1, FM2, AF2, FM3, AF3, Z1, AZ1, Z2, AZ2, NFR AMPL, TPT AMPL, SUBGRP, PHONM). Final summary: SOUND FOUR FIVE SIX; INTENDED PHONEMES F O R F A I V S I K S; COMPUTED PHONEMES F H O F AA I S EH P S K.]

FIG. 9. Parameters of "four five six."

[Waveform of "four five six," each segment labeled as sustained (SUST) or transitional (TRAN), as may-be where applicable, by phoneme group, and by phoneme.]

FIG. 10. Phoneme classification of "four five six."


[Machine listing of segment-by-segment parameters, in the same format as Fig. 9. Final summary: SOUND SEVEN EIGHT NINE; INTENDED PHONEMES S EH V A N EH I T N A I N; COMPUTED PHONEMES S EH W EH N EE T N AA N.]

FIG. 11. Parameters of "seven eight nine."

g5 = 1, if (F3p − F3)·(F3 − F3n) ≥ 0; g5 = 0, otherwise.

g6 = 1, if (NFp − NF)·(NF − NFn) ≥ 0; g6 = 0, otherwise.

Let T = g2 + g3 + g4 + g5 + g6. Then T represents the "transition index" of the segment with respect to the previous and next segments. A "may-be" segment is a null segment if

(Cp ≥ 3) ∨ (Cn ≥ 3) ∨ {[(Cp > 0) ∨ (Cn > 0)] ∧ (T ≥ 3)}

is true. This definition has to be slightly modified to handle boundary conditions.

Segments that are not labeled may-be may also turn out to be null segments. This happens when there is a cluster of stoplike segments or fricativelike segments (see Fig. 12, the stop in "eight"). One way to handle this case is to reject all but the segment of longest duration within the cluster as being null segments.

Final rejection of segments as being null segments comes after the final phoneme classification.
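A sketch of these two null-segment decisions follows (again in Python, with names of our own choosing; g is assumed to be a table mapping i to the value gi computed as defined above, and closeness_index is the function sketched earlier):

    def transition_index(g):
        # T = g2 + g3 + g4 + g5 + g6, per the definition in the text.
        return g[2] + g[3] + g[4] + g[5] + g[6]

    def is_null_segment(Cp, Cn, T):
        # Null test for a may-be segment, before the slight modification
        # needed to handle boundary conditions.
        return Cp >= 3 or Cn >= 3 or ((Cp > 0 or Cn > 0) and T >= 3)

    def reject_cluster_nulls(cluster):
        # For a cluster of stoplike or fricativelike segments, keep only
        # the longest-duration segment; the rest are null segments.
        keeper = max(cluster, key=lambda s: s["D"])
        return [s for s in cluster if s is not keeper]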

TABLE III. Phonemes of English and their equivalent computer output.

Phoneme   As in     Computer output
i         Eve       EE
I         It        I
ɛ         Met       EH
e         Mate      EI
æ         At        AE
ɝ         Bird      ER
ʌ         Up        A
a         Father    AA
ɔ         All       AU
o         Obey      O
U         Foot      U
u         Boot      OO
r         Read      R
l         Let       L
w         We        W
j         You       Y
m         Me        M
n         No        N
ŋ         Sing      NY
h         He        H
z         Zoo       Z
ʒ         Azure     ZH
v         Vote      V
ð         Then      DH
dʒ        Jar       J
s         See       S
ʃ         She       SH
f         For       F
θ         Thin      TH
tʃ        Chew      CH
b         Be        B
d         Day       D
g         Go        G
p         Pay       P
t         To        T
k         Key       K
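In programming terms, Table III is simply a fixed symbol table. A fragment of it, covering the vowels only, might be held as the following lookup structure (a sketch; the variable name is ours, and the key words stand in for the IPA column):

    # Computer output symbol -> key word containing the phoneme (Table III).
    VOWEL_SYMBOLS = {
        "EE": "Eve",  "I": "It",     "EH": "Met",  "EI": "Mate",
        "AE": "At",   "ER": "Bird",  "A": "Up",    "AA": "Father",
        "AU": "All",  "O": "Obey",   "U": "Foot",  "OO": "Boot",
    }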


[Waveform of "seven eight nine," each segment labeled as in Fig. 10.]

FIG. 12. Phoneme classification of "seven eight nine."

Those segments that are labeled may-be and did not get rejected earlier are checked to see if they have been classified as an identical or similar phoneme to the previous or next non-may-be segments. If so, such a segment is rejected as being a null segment. A phoneme such as /w/ or /U/ is considered similar to /U/ or /u/, and /y/ and /I/ are considered similar to /I/, /i/, or /e/. Sometimes such rejection results in the loss of a valid phoneme, but it is necessary in many cases.
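A sketch of this final check (Python, with our own names; the similarity table encodes only the equivalences named in the text, in IPA symbols):

    # Phonemes counted as "similar" for final null-segment rejection.
    SIMILAR = {
        "w": {"U", "u"}, "U": {"U", "u"},
        "y": {"I", "i", "e"}, "I": {"I", "i", "e"},
    }

    def reject_as_null(seg, prev_seg, next_seg):
        # Reject a surviving may-be segment whose phoneme label is identical
        # or similar to that of an adjacent non-may-be segment.
        ph = seg["phoneme"]
        for nb in (prev_seg, next_seg):
            if ph == nb["phoneme"] or nb["phoneme"] in SIMILAR.get(ph, ()):
                return True
        return False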

X. PHONEME CLASSIFICATION

Figures 5-8 illustrate the algorithms used in the classification of nonnull segments of the different phoneme groups, viz., stoplike sounds, fricativelike sounds, nasal-liquidlike sounds, and vowellike sounds. We have attempted to make the flowcharts specific and, at the same time, intuitively meaningful, the latter taking precedence over the former. In doing so, certain tests of a secondary nature have been left out so as to avoid cluttering up the flowcharts with details. The symbols used in the flowcharts for the computer transcription of the phonemes of English, and their equivalent IPA symbols, are given in Table III. The actual algorithms18 used, written in Algol 60 notation, can be found in Reddy.19

18 The author feels that publication of algorithms in speech research in a form similar to that used in the Communications of the ACM is highly desirable and would reduce much of the current duplication of research.

19 D. R. Reddy, "An Approach to Computer Speech Recognition by Direct Analysis of the Speech Wave," Stanford Univ. Computer Sci. Dept. Tech. Rept. No. CS 49 (1966).


The present classification procedures are deterministic and do not provide for the output of any probabilistic information. Some of the classification procedures are incomplete (e.g., the classification of stops using vowel-formant information) and are shown on the flowcharts by means of dotted lines. Descriptions of tests that are new and/or require further explanation are given in the footnotes for each flowchart. Various papers by Liberman et al.,20 Halle et al.,21 Hughes and Halle,15 Hughes and Hemdal,6 Lehiste,16 and Forgie and Forgie5 were useful in formulating some of the tests. Since the flowcharts are self-explanatory, we do not describe individual tests. Note that the parameters given in the various tests are for a single speaker and may require modification for other speakers.

XI. TYPICAL CLASSIFICATION

Figures 9 and 10 contain the parameters and the results of classification of the sequence "four five six." Figure 9 contains parameters for each segment of the sound. The first and the last lines indicate boundary segments of null duration that facilitate the analysis of the beginning and ending segments. Figure 10 contains the waveform, each segment of which is labeled: as sustained or transitional, as belonging to a particular phoneme group, and as representing a particular phoneme (including the null phoneme). Figures 11 and 12 contain similar results for the sequence "seven eight nine," and Figs. 13 and 14 for the sound "zero."

20 A. M. Liberman, P. C. Delattre, and F. S. Cooper, "The Role of Consonant-Vowel Transitions in the Perception of the Stop and Nasal Consonants," Psychol. Monographs 68, No. 8, 1-13 (1954).

21 M. Halle, G. W. Hughes, and J.-P. A. Radley, "Acoustic Properties of Stop Consonants," J. Acoust. Soc. Am. 29, 107-116 (1957).


[Machine listing of segment-by-segment parameters, in the same format as Fig. 9. Final summary: SOUND ZERO; INTENDED PHONEMES Z EE R O; COMPUTED PHONEMES J EE R O.]

FIG. 13. Parameters of "zero."

[Waveform of "zero," each segment labeled as in Fig. 10.]

FIG. 14. Phoneme classification of "zero."


1. SOUND KAAKEEKOOKEHKO
INTENDED PHONEMES K AA K EE K OO K EH K O
COMPUTED PHONEMES K AA K I K OO K EH H O

2. SOUND TAATEETOOTEHTO
INTENDED PHONEMES T AA T EE T OO T EH T O
K
COMPUTED PHONEMES T AA K EE T OO T EH G O

3. SOUND PAAPEEPOOPEHPO
INTENDED PHONEMES P AA P EE P OO P EH P O
T T
COMPUTED PHONEMES P AA P EE P OO F EH P O
K K

4. SOUND FAAFEEFOOFEH
INTENDED PHONEMES F AA F EE F OO F EH
COMPUTED PHONEMES F AA TH EE F OO F EH

5. SOUND THAATHEETHOOTHEH
INTENDED PHONEMES TH AA TH EE TH OO TH EH
COMPUTED PHONEMES TH AA TH I TH OO TH EH

6. SOUND HAAHEEHOOHEHHO
INTENDED PHONEMES H AA H EE H OO H EH H O
COMPUTED PHONEMES K AA H I H OO H EH H O

7. SOUND SHAASHEESHOOSHEHSH
INTENDED PHONEMES SH AA SH EE SH OO SH EH SH
COMPUTED PHONEMES SH AA SH EE SH OO SH EH SH

8. SOUND SAASEESOOSEHS
INTENDED PHONEMES S AA S EE S OO S EH S
COMPUTED PHONEMES S AA S EE S OO S EH S

9. SOUND CHAACHEECHOOCHEHCH
INTENDED PHONEMES CH AA CH EE CH OO CH EH CH
COMPUTED PHONEMES CH AA CH EE T OO CH EH CH

10. SOUND GAAGEEGOOGEHGO
INTENDED PHONEMES G AA G EE G OO G EH G O
D
COMPUTED PHONEMES AA G EE V OO G EH B O
G

11. SOUND DAADEEDOODEHDO
INTENDED PHONEMES D AA D EE D OO D EH D O
D D D D D
COMPUTED PHONEMES B AA B EE B OO B EH B O
G G G G G

12. SOUND BAABEEBOOBEHBO
INTENDED PHONEMES B AA B EE B OO B EH B O
D D D D D
COMPUTED PHONEMES B AA B EE B OO B I B O
G G G G G

13. SOUND VAAVEEVOOVEH
INTENDED PHONEMES V AA V EE V OO V EH
T D
COMPUTED PHONEMES V AA V I P OO B Y
K G

14. SOUND DHAADHEEDHOODHEH
INTENDED PHONEMES DH AA DH EE DH OO DH EH
D D D D
COMPUTED PHONEMES B AA B EE B OO B EH
G G G G

15. SOUND ZAAZEEZOOZEHZO
INTENDED PHONEMES Z AA Z EE Z OO Z EH Z O
COMPUTED PHONEMES J AA Z EE Z OO Z EH Z O

16. SOUND JAAJEEJOOJEHJ
INTENDED PHONEMES J AA J EE J OO J EH J
COMPUTED PHONEMES J AA J EE J OO J EH D

FIG. 15. Classification of sounds.

17. SOUND WAAWEEWOOWEHWO
INTENDED PHONEMES W AA W EE W OO W EH W O
COMPUTED PHONEMES W AA M EE OO OO W EE W O

18. SOUND MAAMEEMOOMEHMO
INTENDED PHONEMES M AA M EE M OO M EH M O
COMPUTED PHONEMES M AA M EE N OO M EH M O

19. SOUND NAANEENOONEHNO
INTENDED PHONEMES N AA N EE N OO N EH N O
COMPUTED PHONEMES N AA N EE N OO N EE N AA

20. SOUND YAAYEEYOOYEHY
INTENDED PHONEMES Y AA Y EE Y OO Y EH Y
D
COMPUTED PHONEMES B Y AA EE OO EH Y
G

21. SOUND LAALEELOOLEHLO
INTENDED PHONEMES L AA L EE L OO L EH L O
COMPUTED PHONEMES L AA L EE L OO EH L O

22. SOUND RAAREEROOREHRO
INTENDED PHONEMES R AA R EE R OO R EH R O
COMPUTED PHONEMES AA R EE R OO R EH R O

23. SOUND JOHN HAS A BOOK
INTENDED PHONEMES J AA N H AE Z EH B U K
D
COMPUTED PHONEMES J AA M AE Z EH B U K
G

24. SOUND HOW ARE YOU
INTENDED PHONEMES H A U AA R I OO
COMPUTED PHONEMES H AA U AA R I OO

25. SOUND WHAT IS IT
INTENDED PHONEMES W AA T EE Z I T
K K
COMPUTED PHONEMES W A T I S I T

26. SOUND HE CAN COME
INTENDED PHONEMES H EE K EH N K A M
COMPUTED PHONEMES H EE K I K A M

27. SOUND SHE IS GAY
INTENDED PHONEMES SH EE EE S G EH
K
COMPUTED PHONEMES SH EE S T EH

28. SOUND HELLO TONY
INTENDED PHONEMES H A L O T O N EE
COMPUTED PHONEMES H A L O TH AA N EE

29. SOUND ONE TWO THREE
INTENDED PHONEMES W A N T OO TH R EE
T
COMPUTED PHONEMES W AA N P OO TH EE
K

30. SOUND FOUR FIVE SIX
INTENDED PHONEMES F O R F A I V S I K S
T
COMPUTED PHONEMES F H O F AA I S EH P S K

31. SOUND SEVEN EIGHT NINE
INTENDED PHONEMES S EH V A N EH I T N A I N
K
COMPUTED PHONEMES S EH W EH N EE T N AA N

32. SOUND ZERO
INTENDED PHONEMES Z EE R O
COMPUTED PHONEMES J EE R O

FIG. 16. Classification of sounds.


[Confusion matrix between intended phonemes (rows) and computed phonemes (columns). Row totals and percent correct: PTK 24, 88; BDG 17, 82; F 7, 72; V 6, 33; TH 4, 100; DH 4, 0; CH 5, 80; J 7, 86; H 9, 78; SH 6, 100; S 9, 100; Z 8, 63; M 6, 83; N 11, 91; Y 5, 40; R 9, 66; L 6, 83; W 7, 72; EE 28, 78; EH 27, 82; A 7, 43; AA 18, 89; U 2, 100; OO 24, 100.]

FIG. 17. Confusion matrix for phoneme classification.


XII. RESULTS

Figures 15 and 16 represent the results of classification of the 32 continuous speech utterances (see the Section on Method and Materials). The results consist of the sound, the intended phonemes, and the computed phonemes. The intended and the computed phonemes are transcribed in the notation given in Table III. Note that the intended phonemes are not the same as those given in a dictionary but represent the phoneme string that would be obtained after allowing for speaker characteristics and possible modification of phonemes at word junctions.

For most sounds, the computed phonemes bear satisfactory similarity to the intended phonemes. Even when errors occur, the computed phoneme usually belongs to the same category as the intended phoneme. Time for analysis of the utterances on the IBM 7090 varied from 45 to 70 sec for sequences of 1-2 sec duration. Figure 17 contains the confusion matrix for the final phoneme classification between the intended phonemes and the computed phonemes. The dominant diagonal of the matrix indicates that many of the phonemes were classified correctly. In fact, about 81% of the total of 287 phonemes were classified correctly. However, for some phonemes the performance was not as high as desired. Details of the types of errors and their causes are discussed in the following Section.

XIII. ERROR ANALYSES

In this Section, we outline various causes of errors and examples of their occurrence in the results of Figs. 15 and 16. The errors may be classified as being due to boundary conditions, incorrect segmentation, incorrect null-segment detection, and inadequate classification procedures.

A. Errors Due to Boundary Conditions at the Beginning and the End of a Sound

Since each utterance begins and ends in silence, some of the phonemes do not contain proper acoustic cues for the recognition of the initial and final phonemes. In Sound 20, an extra "B D G" phoneme was introduced because of the incorrect inclusion of a small portion of the silence segment preceding the sound. Sounds 10


and 22 contain examples of the loss of the beginning phoneme because of loss of part of the sound. Sounds 13 and 16 contain examples of incorrect classification of the ending phoneme because of loss of part of the sound. Sounds 6, 15, and 32 contain examples of incorrect classification of the beginning phoneme because of the assumption that silence precedes the sound; i.e., some sounds were classified as J, an affricate characterized by silence followed by a noise segment, as opposed to just a noise segment as in Z. These errors could easily have been corrected by rerecording the sounds but were left in to illustrate the type of errors that can occur.

B. Errors Due to Incorrect Segmentation

The classification procedures assume that there exists at least one segment for each phoneme and ignore the possibility that more than one phoneme may be associated with a given segment. This results in the loss of a phoneme (rather than an incorrect classification) in the string of phonemes. The most common errors of this type were found among diphthongs, where both vowels are classified as one (see Sound 31), and in cases where R preceded a vowel (Sound 29). This accounts for the low (50% correct) recognition for the vowel "I" in Fig. 17. The error can be corrected either by modifying the segmentation procedure to be more sensitive to changes or by including the possibility of multiple phonemes for each segment in the classification procedures.

C. Errors Due to Incorrect Null-Segment Detection

Sounds 20, 23, 26, 27, and 30 contain examples in which a valid segment has been rejected as a null segment because of incorrect null-segment detection. In Sound 20, "YAAYEEYOOYEHY," the Y EE Y parts are all acoustically very similar and, therefore, were rejected except for the vowel part EE. Similar errors occurred in the case of Y EH in Sound 20 and EE EE in Sound 27. In Sound 30, an H was inserted as an extra phoneme because of incorrect null-segment detection. This error accounts for the low (40% correct) score for the phoneme Y in Fig. 17.

D. Errors Due to Inadequate Classification Procedures

More analysis is required to determine the significant acoustic cues needed to correct the remaining errors. The phonemes involved are the unaspirated stops P, T, K, B, D, and G, the voiced fricatives V and DH, the vowel group I, EE, and EH, and the vowel group A and AA. Note that these errors, while annoying, do not affect the readability of the phonetic transcription of the sounds very much.

Many of the above errors can be corrected by more sophisticated algorithms. However, a much larger sample of sounds should be analyzed to determine the necessary modifications.

XIV. CONCLUSION

The research reported here leads to the following conclusions:

• The results indicate that machine recognition of connected speech is no longer a utopian dream, as one researcher chose to call it a few years ago, but indeed a distinct possibility. The Section on Error Analyses shows that many of the errors of the present system are transient in nature and can be corrected to improve recognition.

• It is no longer absolutely necessary to preprocess the acoustic signal because (1) a present-day computer can easily handle the bit rates, (2) parameters can be extracted just as easily from the acoustic signal as from the output of a bank of filters, and (3) processing of the acoustic signal does not appear to require any excessive computer time (about 40 times real time) if one chooses a suitable strategy.

• Although 40 times real time is still far from our original goal of close-to-real-time recognition, it is a marked improvement over earlier attempts to simulate the filters, where pitch determination alone was taking about 1000 times real time. Use of faster computers and machine-language programming should further reduce the present computation time.

• It is desirable to use a computer without any special-purpose hardware in all the phases of speech-recognition research. This permits the researcher to experiment with different algorithms and parameters without having to modify the hardware. Given a suitable algorithm, one can always implement it in hardware form.

• A good on-line speech-research system is essential for experimenting with a computer. Essentially, this implies a flexible system for handling input-output to displays, disk, plotter, and analog-to-digital and digital-to-analog converters, coupled with a time-sharing system with a fast modify-compile-debug cycle time.

• Parameters such as duration and intensity (or, equivalently, average spectral power) seem to be at least as important as any spectrum-dependent parameters when one has to classify among all the 40 phonemes and, therefore, deserve greater time and energy than has heretofore been spent on them.

• There may be properties based on the spectrum, in addition to formants, that are useful in phoneme classification, e.g., measures of noisiness (parameters 18 to 21 of Table I). More effort should be devoted to discovering such measures.

• The problems of how to group phonemes for speech recognition and how to detect null segments (segments that represent phoneme boundaries) appear to be new and have heretofore attracted little attention. We have

346 Volume 42 Number 2 1967

Redistribution subject to ASA license or copyright; see http://acousticalsociety.org/content/terms. Download to IP: 128.42.202.150 On: Sun, 23 Nov 2014

11:19:03

Page 19: Computer Recognition of Connected Speech

COMPUTER RECOGNITION OF CONNECTED SPEECH

developed two possible solutions. Further investiga- tions are necessary to formulate better solutions based on a large number of speakers.

• The procedure and the results described in this paper have the following limitations: all the speech sounds used in this investigation were uttered by a single speaker; classification procedures are incomplete for stops; the program does not handle the case where two phonemes lie within one segment; and we have as yet no satisfactory solution for boundary-condition errors (see the Section on Error Analyses).

At present, we are continuing research in several areas of speech recognition along similar lines, using a PDP-6 with 65K of 2-μsec memory. These are outlined below:

1. Improvement of the above procedures based on a large number of sounds of a single speaker.

2. Extension of the above to several speakers by developing a tune-in program that uses some form of machine learning procedure.

3. Use of probabilistic, linguistic, and/or vocabulary constraints to improve recognition.

4. Procedures for grouping phonemes into words, phrases and sentences, and finally

5. Man-machine speech communication.

Most of the above problems have to be solved for the realization of a generalized automatic speech- recognition system. However, limited applications of man-machine speech communication only need partial solutions to some of the above problems and can be undertaken directly on the basis of the present investigation.

ACKNOWLEDGMENTS

The author would like to thank Professor John McCarthy, his thesis advisor who not only initiated the author into the speech research area but also suggested the possible usefulness of the acoustic wave- form in speech analysis. The author would like to thank Professor Earl D. Schubert and Lester Earnest

for their valuable comments on the manuscript. The research reported herein was supported in part

by the Advanced Research Projects Agency of the Office of the Secretary of Defense.
