
Chapter 1. Speech Communication


Academic and Research Staff

Professor Kenneth N. Stevens, Professor Jonathan Allen, Professor Morris Halle, Professor Samuel J. Keyser, Dr. Corine A. Bickley, Dr. Suzanne E. Boyce, Dr. Carol Y. Espy-Wilson, Dr. Marie K. Huffman, Dr. Michel T. Jackson, Dr. Melanie L. Matthies, Dr. Joseph S. Perkell, Dr. Mark A. Randolph, Dr. Stefanie R. Shattuck-Hufnagel, Dr. Mario A. Svirsky, Dr. Victor W. Zue

Visiting Scientists and Research Affiliates

Giulia Arman-Nassi,1 Dr. Richard S. Goldhor,2 Dr. Robert E. Hillman,3 Dr. Jeannette D. Hoit,4 Eva B. Holmberg,5 Dr. Tetsuo Ichikawa,6 Dr. Harlan L. Lane,7 Dr. John L. Locke,8 Dr. John I. Makhoul,9 Dr. Carol C. Ringo,10 Dr. Noriko Suzuki,11 Jane W. Webster12

Graduate Students

Abeer A. Alwan, Marilyn Y. Chen, Helen M. Hanson, Andrew W. Howitt, Caroline B. Huang, Lorin F. Wilde

Undergraduate Students

Anita Rajan, Lorraine Sandford, Veena Trehan, Monnica Williams

Technical and Support Staff

Ann F. Forestell, Seth M. Hall, D. Keith North, Arlene Wint

1.1 Introduction

Sponsors

C.J. Lebel Fellowship

Dennis Klatt Memorial Fund

Digital Equipment Corporation

1 Milan Research Consortium, Milan, Italy.

2 Audiofile, Inc., Lexington, Massachusetts.

3 Boston University, Boston, Massachusetts.

4 Department of Speech and Hearing Sciences, University of Arizona, Tucson, Arizona.

5 MIT and Department of Speech Disorders, Boston University, Boston, Massachusetts.

6 Tokushima University, Tokushima, Japan.

7 Department of Psychology, Northeastern University, Boston, Massachusetts.

8 Massachusetts General Hospital, Boston, Massachusetts.

9 Bolt Beranek and Newman, Inc., Cambridge, Massachusetts.

10 University of New Hampshire, Durham, New Hampshire.

11 First Department of Oral Surgery, School of Dentistry, Showa University, Tokyo, Japan.

12 Massachusetts Eye and Ear Infirmary, Boston, Massachusetts.


National Institutes of Health
Grants T32 DC00005, 5-R01 DC00075, F32 DC00015, S15 NS28048, R01 NS21183,13 P01 NS23734,14 and 1-R01 DC00776

National Science Foundation
Grants IRI 88-05680,13 and IRI 89-10561

The overall objective of our research in speech communication is to gain an understanding of the processes whereby (1) a speaker transforms a discrete linguistic representation of an utterance into an acoustic signal, and (2) a listener decodes the acoustic signal to retrieve the linguistic representation. The research includes development of models for speech production, speech perception, and lexical access, as well as studies of impaired speech communication.

1.2 Models, Theory, and Data in Speech Physiology

1.2.1 Comments on Speech Production Models

A paper on speech production models has been prepared for the 12th International Congress of Phonetic Sciences (Aix-en-Provence, France, August 1991). The paper discusses modeling which covers the entire transformation from a linguistic-like input to a sound output. Such modeling can make two kinds of contribution to our understanding of speech production: (1) forcing theories of speech production to be stated explicitly and (2) serving as an organizing framework for a focused program of experimentation. As an example, a "task dynamic" production model is cited. The model incorporates underlying phonological primitives that consist of "abstract articulatory gestures." Initial efforts have been made to use the model in interpreting experimental data. Several issues that arise in such modeling are discussed: the validity of the underlying theory, incorporating realistic simulations of peripheral constraints, accounting for timing, difficulties in using models to evaluate experimental data, and the use of neural networks. Suggestions for an alternative modeling approach center around the need to include the influence of speech acoustics and perceptual mechanisms on strategies of speech production.

1.2.2 Models for Articulatory Instantiation of Feature Values

Data from previous studies of speech movements have been used as a basis for several theoretical papers focusing on models for the articulatory instantiation of particular distinctive feature values. Two of the papers (prepared in collaboration with colleagues from Haskins Laboratories) address models of coarticulation for the features +/- ROUND and +/- NASAL. In both papers, we review conflicting reports in the literature on the extent of anticipatory coarticulation and present evidence to suggest that these conflicts can be resolved by assuming that segments not normally associated with a rounding or nasalization feature show stable articulatory patterns associated with those features. In one of the papers, we point out that these patterns may have been previously interpreted as anticipatory coarticulation of rounding or nasality. We further argue that timing and suprasegmental variables can be used to show that this is the case and that the coproduction model of coarticulation can best explain the data. In the other paper, we use similar data to argue against the "targets and (interpolated) connections" model proposed by Keating. In a third paper on this general topic, coarticulatory patterns of rounding in Turkish and English are compared, and evidence is given to show that their modes of articulatory organization may be different.

1.2.3 Modeling Midsagittal Tongue Shapes

Two projects on articulatory modeling are in progress. The first is concerned with the construction of a quantitative, cross-linguistic articulatory model of vowel production, based on cineradiographic data from Akan, Arabic, Chinese, and French. Programs for displaying, processing, and partially automating the process of taking measurements from cineradiographic data have been developed for the purpose of gathering data for this work. The second project is concerned with quantitatively testing the generalizability of this model to similar data gathered from speakers of English, Icelandic, Spanish, and Swedish.

Future work will focus on the degree to which this model generalizes from vowel production to consonant production. Velar and uvular consonants, which, like vowels, are articulated primarily with the tongue body, should be able to be generated by the model with no additional parameters. It is an open question whether the same will be true of other consonants such as pharyngeals and palatals; it almost certainly will not be true of coronals.

13 Under subcontract to Boston University.

14 Under subcontract to Massachusetts General Hospital.


1.3 Speech Synthesis

Research on speech synthesis has concentrated for the most part on developing improved procedures for the synthesis of consonant-vowel and vowel-consonant syllables. With the expanded capabilities of the KLSYN88 synthesizer developed by Dennis Klatt, it is possible to synthesize utterances that simulate in a more natural way the processes involved in human speech production. The details of these procedures are based on theoretical and acoustical studies of the production of various classes of speech sounds, so that the manipulation of the synthesis parameters is in accord with the mechanical, aerodynamic, and acoustic constraints inherent in natural speech production.

For example, using the improved glottal source, including appropriately shaped aspiration noise, procedures for synthesizing the transitions in source characteristics between vowels and voiceless consonants and for synthesizing voicing in obstruent consonants have been specified. Methods for manipulating the frequencies of a pole-zero pair in the cascade branch of the synthesizer have led to the synthesis of nasal, lateral, and retroflex consonants with spectral and temporal characteristics that match closely the characteristics of naturally produced syllables. The synthesis work is leading to an inventory of synthetic consonant-vowel and vowel-consonant syllables with precisely specified acoustic characteristics. These syllables will be used for testing the phonetic discrimination and identification capabilities of an aphasic population that is being studied by colleagues at Massachusetts General Hospital.
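
The spectral effect of such a pole-zero pair can be made concrete numerically. The sketch below is a minimal illustration, not KLSYN88 itself; the sample rate, frequencies, and bandwidths are arbitrary assumptions. It evaluates, on the unit circle, the magnitude contribution of one complex-conjugate pole-zero pair of the kind a cascade branch multiplies into the vocal-tract transfer function.

```python
import numpy as np

def pole_zero_response(freqs_hz, f_pole, bw_pole, f_zero, bw_zero, fs=10000.0):
    """Magnitude response (dB) of one complex-conjugate pole-zero pair,
    evaluated on the unit circle. Illustrative only; not the KLSYN88 code."""
    z = np.exp(2j * np.pi * np.asarray(freqs_hz, float) / fs)  # unit-circle points

    def conj_pair(f, bw):
        r = np.exp(-np.pi * bw / fs)        # radius set by the bandwidth
        w = 2 * np.pi * f / fs
        return r * np.exp(1j * w), r * np.exp(-1j * w)

    p1, p2 = conj_pair(f_pole, bw_pole)
    q1, q2 = conj_pair(f_zero, bw_zero)
    h = ((z - q1) * (z - q2)) / ((z - p1) * (z - p2))
    return 20 * np.log10(np.abs(h))

# Assumed example values: a 1000 Hz pole paired with a 1400 Hz zero.
freqs = np.arange(100, 5000, 100.0)
resp = pole_zero_response(freqs, f_pole=1000, bw_pole=100, f_zero=1400, bw_zero=150)
print(resp[:5])
```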

An additional facet of the synthesis research is the development of ways of characterizing different male and female voices by manipulating the synthesizer parameters, particularly the waveform of the glottal source, including the noise component of this waveform. We have found informally that it is possible to reproduce the spectral properties of a variety of different male and female voices by adjustment of these parameters.

1.4 Speech Production of Cochlear Implant Patients

1.4.1 Longitudinal Changes in Vowel Acoustics in Cochlear Implant Patients

Acoustic parameters were measured for vowels spoken in /hVd/ context by three postlingually-deafened cochlear implant recipients. Two female subjects became deaf in adulthood; the male subject, in childhood. Longitudinal recordings were made before and at intervals following processor activation. The measured parameters included formant frequencies, F0, SPL, duration, and the amplitude difference between the first two harmonic peaks in the log magnitude spectrum (H1 - H2). A number of changes were observed from pre- to post-implant, with inter-subject differences. The male subject showed changes in F1, vowel durations, and F0. These changes were consistent with one another; however, they were not necessarily in the direction of normalcy. On the other hand, the female subjects showed changes in F2, vowel durations, F0, and SPL which were in the direction of normal values and, for some parameters, tended to enhance phonemic contrasts. For all three subjects, H1 - H2 changed in a direction which was consistent with previously-made flow measurements. Similar data from additional subjects are currently being analyzed.
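
As an illustration of the H1 - H2 measure, here is a minimal sketch, assuming a steady vowel frame and a known F0 estimate; it is not the laboratory's analysis code. It reads the first two harmonic peak amplitudes off the log magnitude spectrum and returns their difference in dB.

```python
import numpy as np

def h1_minus_h2(frame, fs, f0):
    """Amplitude difference (dB) between the first two harmonic peaks of the
    log magnitude spectrum. `frame` is a vowel segment; `f0` is an estimated
    fundamental frequency in Hz. Sketch only, under assumed inputs."""
    spec = 20 * np.log10(np.abs(np.fft.rfft(frame * np.hanning(len(frame)))) + 1e-12)
    freqs = np.fft.rfftfreq(len(frame), 1.0 / fs)

    def peak_near(f, tol=0.2):
        # Highest spectral level within +/- 20% of the expected harmonic.
        band = (freqs > f * (1 - tol)) & (freqs < f * (1 + tol))
        return spec[band].max()

    return peak_near(f0) - peak_near(2 * f0)

fs = 10000
t = np.arange(2048) / fs
vowel = np.sin(2 * np.pi * 120 * t) + 0.3 * np.sin(2 * np.pi * 240 * t)
print(round(h1_minus_h2(vowel, fs, f0=120.0), 1))   # about 10.5 dB for this toy signal
```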

1.4.2 Effects on Vowel Production and Speech Breathing of Interrupting Stimulation from the Speech Processor of a Cochlear Prosthesis

We have completed a study of the short-term changes in speech breathing and speech acoustics caused by 24 hours of processor deactivation and by brief periods of processor activation and deactivation during a single experimental session in cochlear implant users. We measured the same acoustic parameters employed in the above-described longitudinal study of vowel production and the same physiological parameters used in a longitudinal study of changes in speech breathing.

We found significant and asymmetric effects of turning the processor on and off for a number of speech parameters. Parameter changes induced by short-term changes in the state of the processor were consistent with longitudinal changes for two of the three subjects. Most changes were in the direction of enhancing phonetic contrasts when the processor was on.


Although each subject behaved differently, within a subject the pattern of changes across vowels was generally consistent. Certain parameters, such as breathiness and airflow, proved to be closely coupled in their changes, as in the longitudinal studies. (This coupling occurred even in the subject whose responses to processor activation were inconsistent with the long-term changes obtained in the longitudinal study.) The greater responsiveness of speech parameters to the onset of processor activation than to its offset supports the hypothesis that auditory stimulation has a "calibrational" role in speech production, retuning pre-existing articulatory routines. Many of the findings thus far implicate control of the larynx as a major focus of recalibration, but other speech production parameters need to be analyzed. As suggested by certain findings about parameter relations in the longitudinal study, we expect that as we progress in our understanding of those relations, we will be able to give a principled account of many of the apparent differences we observed among these initial subjects.

1.4.3 Methods for Longitudinal Measures of Speech Nasality in Cochlear Implant Patients

We have been examining the validity of longitudinal measures of nasality in cochlear implant patients, based on acoustic spectra, sound levels, and the outputs of nasal and throat accelerometers. Speech materials consist of isolated utterances and reading passages. Preliminary observations indicate the following: the ratio of RMS values of nasal and throat accelerometer outputs (a technique used in published experiments) may be influenced by (1) variation in the relative levels of the two signals during the recommended calibration maneuver, i.e., production of a sustained /m/, and (2) substantial changes in SPL that accompany the onset of "auditory" stimulation from a cochlear prosthesis. These observations raise uncertainty about using the throat accelerometer output as a reference and about the sensitivity of this kind of measure to longitudinal changes in nasality across experimental sessions. In addition, measures of harmonic and formant amplitudes from acoustic spectra may be confounded by changes in coupling to tracheal resonances that also accompany the activation of the prosthesis. These observations and additional measures and calibration strategies are being explored further.
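
A hedged sketch of the RMS-ratio computation summarized above follows, with the /m/ calibration normalization that makes the concern in point (1) visible. Signal names, units, and the toy data are assumptions, not the published procedure's code.

```python
import numpy as np

def nasal_throat_ratio(nasal, throat, cal_nasal, cal_throat):
    """RMS ratio of nasal to throat accelerometer outputs, normalized by the
    same ratio measured during a sustained /m/ calibration. Sketch of the
    technique as summarized in the text, not the lab's own implementation."""
    rms = lambda x: np.sqrt(np.mean(np.asarray(x, float) ** 2))
    raw = rms(nasal) / rms(throat)
    cal = rms(cal_nasal) / rms(cal_throat)
    # Any drift in the calibration ratio propagates directly into the measure.
    return raw / cal

rng = np.random.default_rng(0)
print(nasal_throat_ratio(rng.normal(0, 0.2, 8000), rng.normal(0, 1.0, 8000),
                         rng.normal(0, 0.5, 8000), rng.normal(0, 1.0, 8000)))
```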

1.5 Phonatory Function Associated with Misuse of the Vocal Mechanism

1.5.1 Vocal Nodules in Females

In this study, we used our methods for obtaining measurements from the inverse filtered glottal waveform, the electroglottographic signal (EGG), and average transglottal pressure and glottal airflow to compare phonatory function for a group of twelve female patients with bilateral vocal-fold nodules with a control group of age-matched normal females. Statistical analysis of the data involved analysis of variance (ANOVA) and analysis of covariance (ANCOVA, with SPL as the covariate) procedures to test for significant differences between the groups. Standard and regressed Z scores were also used to compare measurements for individual nodules subjects with the normal group.

The largest number of significant differences between the nodules and normal groups was found for the loud voice condition, with normal voice having the second largest number of significant differences and soft voice the smallest. A majority of the significant differences between nodules and normal subjects were associated with the higher SPL used by the nodules group. Increased SPL in the nodules group was accompanied, most notably, by increases in parameters that are believed to reflect greater potential for vocal-fold trauma (i.e., increased transglottal pressure, AC (modulated) glottal flow, peak glottal flow, and maximum rate of glottal flow declination) due to high vocal-fold closure velocities and collision forces. In addition, an examination of individual Z-score profiles revealed that three of the twelve nodules patients displayed instances in which SPL was within normal limits, but AC flow and/or maximum flow declination rate was abnormally high. There were also four nodules patients who displayed instances in which SPL, AC flow, and maximum flow declination rate were all abnormally high, but AC flow and/or maximum flow declination rate was proportionally higher than the SPL.
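
The distinction between standard and regressed Z scores can be sketched as follows. This is an ANCOVA-flavored illustration under assumed toy data, not the study's actual statistical code: the regressed score asks whether a patient's value is abnormal for her SPL, rather than abnormal outright.

```python
import numpy as np

def z_scores(patient_value, patient_spl, norm_values, norm_spls):
    """Standard Z score, plus a 'regressed' Z score in which the normal-group
    value is first predicted from SPL by linear regression. Sketch only."""
    norm_values = np.asarray(norm_values, float)
    norm_spls = np.asarray(norm_spls, float)
    z_std = (patient_value - norm_values.mean()) / norm_values.std(ddof=1)
    slope, intercept = np.polyfit(norm_spls, norm_values, 1)
    residuals = norm_values - (slope * norm_spls + intercept)
    predicted = slope * patient_spl + intercept
    z_reg = (patient_value - predicted) / residuals.std(ddof=1)
    return z_std, z_reg

# Toy normal group: flow rises with SPL; the patient's flow is high even for her SPL.
spls = np.array([70, 72, 74, 76, 78, 80.0])
flows = 0.01 * spls + np.array([0.0, 0.01, -0.01, 0.02, -0.02, 0.0])
print(z_scores(patient_value=0.95, patient_spl=78, norm_values=flows, norm_spls=spls))
```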

In summary, the results suggest that for most female nodules patients, abnormalities in vocal function are primarily associated with use of excessive loudness. Thus, for patients who fit this profile, simply reducing loudness through direct facilitation and/or vocal hygiene training may represent appropriate treatment. However, there appears to be a smaller proportion of female nodules patients whose abnormalities in vocal function are not completely accounted for by excessive loudness. These patients may be displaying an underlying "pattern" of vocal hyperfunction which could require additional therapeutic procedures to ameliorate.

1.5.2 Speech Respiration Associated with Vocal Hyperfunction: A Pilot Study

This pilot study compared respiratory function in an adult female nodules patient with a sex-, age-, and body-type-matched normal control. Inductance plethysmography (a Respitrace) was used to obtain a variety of speech and non-speech respiratory measures at normal and "loud" vocal intensity.

The non-speech respiratory performance of both subjects was within normal limits. The matched control also displayed normal values for measures of speech respiration. In contrast, the subject with vocal-fold nodules showed clear evidence of deviant speech respiration during oral reading, such as abnormally low inspiratory volumes, with the majority of expiratory limbs for speech being initiated at levels within the quiet tidal volume range. During loud reading, the nodules patient initiated speech at lung volumes that were equal to or less than those used at normal loudness. This result is quite unlike normal speakers, who tend to inhale to higher lung volumes in preparation for louder utterances. The nodules patient also displayed consistent, marked encroachment into the expiratory reserve volume during reading at normal loudness, with even further encroachment during loud reading. This pattern deviates from that of normal speakers, who tend to terminate utterances close to the functional residual capacity. Taken together, the results for the nodules subject reflect reduced efficiency of speech respiratory function. This patient produced the alveolar pressures required for speech by expending greater expiratory muscular effort, instead of inhaling, like normal speakers, to larger lung volumes, thereby taking advantage of greater natural recoil forces to assist in generating the required pressures. The nodules patient also tended to stop at inappropriate (non-juncture) points within sentences to inhale, and evidenced loss of substantial volumes of air prior to the initiation of phonation.

Thus, the results indicate that abnormalities in speech respiration can occur in a patient with a hyperfunctionally-related voice disorder (nodules). This finding will be pursued in future studies of additional patients.

1.5.3 Comparisons Between Inter- and Intra-speaker Variation in Aerodynamic and Acoustic Parameters of Voice Production

In this study, intraspeaker variation of non-invasive aerodynamic and acoustic measurements of voice production was examined across three recordings for three normal female and three normal male speakers. Data from one recording for each of fifteen females and fifteen males served as normative references. The following measurements were made for productions of sequences of the syllable /pae/ in soft, normal, and loud voice and at low and high pitch: measures from the inverse filtered airflow waveform, average transglottal air pressure and glottal airflow, and the amplitude difference between the first two harmonics of the acoustic spectrum. Linear regression analyses between individual values of SPL and each of the other parameters were performed for data pooled across the three recordings. In addition, comparisons between the individual subjects and the reference groups were performed using Z-score analysis.

Preliminary results showed that a small amount of SPL variation could be accompanied by large variation in other parameters. For some parameters (e.g., maximum flow declination rate and AC flow), both inter- and intra-speaker variation were significantly related to SPL. For such parameters the effects of SPL can be removed statistically, and comparisons between individual subjects and group data can readily be made. For other parameters, which showed large inter-speaker variation not significantly related to SPL, intra-speaker variation for individual subjects could still be orderly related to SPL (e.g., for DC flow offset). For such parameters, data from several baseline recordings can be useful to establish the normal range of variation. Parameters for which both inter- and intra-speaker variation was large and not significantly related to SPL (e.g., average flow) were considered less useful as indices of abnormal vocal function. These data suggest that in order to quantify normal variation, both inter- and intra-speaker variation should be considered.


1.6 Studies of Acoustics and Perception of Speech Sounds

1.6.1 Vowels

Previous reports have described an acoustic study of vowels in which the effects of consonantal context, lexical stress, and speech style were compared using the same database of utterances. That study, which had been extended, was completed during the past year and included perceptual experiments in which listeners identified vowels excised from the utterances in the database. The completed data compilation verifies that consonant context affects the vowel midpoints more than lexical stress and speech style. The direction and magnitude of the formant frequency shifts were consistent with findings of previous studies. The liquid and glide contexts, /w/, /r/, and /l/, lowered the F2 frequency of front vowels, especially lax front vowels, on the order of one Bark relative to the F2 frequencies when the same vowels are adjacent to stop consonants. Shifts in F1 tended to be smaller than shifts in F2, even on a Bark scale, and were less consistent across speakers.

The formant frequency midpoints and durations of vowels carrying primary stress were shown to differ only slightly on average from those of vowels carrying secondary stress, if the other factors were held constant. Vowels in speech read continuously also differed only slightly on average from vowels in spontaneous speech.

In general, the data show that variations in vowel midpoint formant frequencies, durations, and trajectory shapes are correlated with the perception of the vowel by human listeners. For example, /e/ tokens which have F1 - F2 midpoint values typical of /A/ tend to be identified as /A/, and /e/ tokens which are short and lack a /y/ offglide, typical characteristics of lax vowels, tend to be misidentified as lax vowels.

Aspects of the trajectories which are important for characterizing the vowel were sought. The trajectory was used to derive a representation of the vowel by one point per formant, a modified "midpoint." Performance by a Gaussian classifier was the criterion used to evaluate different representations of the vowels. If the effect of perceptual overshoot for F2 and perceptual averaging for F1 was simulated, and the resulting modified midpoint was used as input for the classifier, performance was somewhat better than if the durational midpoints were used as input. However, the best performance was achieved if the raw data (the quarter-point, midpoint, and three-quarter point of the trajectory, and the duration) were used as input to the classifier. The improved performance with the raw data over the modified midpoints shows that not all of the significant aspects of the trajectory have been captured in a one-point representation. It may be that a new one-point representation could be found which would result in performance as high as that obtained with the raw data. Alternatively, it may be necessary to use more than a modified midpoint to fully characterize a vowel.

Of all the representations used as input to the statistical classifier, the raw data also result in the best agreement of the classifier with human performance. If the classifier is also allowed to train and test on vowels in stop and liquid-glide contexts separately, agreement with the listeners' responses (and performance in the conventional sense, i.e., agreement with the transcriber's phonemic labels) improves further. The improvement due to separating the contexts suggests that humans perform vowel identification in a context-dependent manner.

1.6.2 Analysis and Modeling of Stop and Affricate Consonants

We have been developing models for the production of various types of stop consonants, including voiceless aspirated stops, voiced stops, affricates, and prenasal stops. In all of these models, estimates are made of the changes in cross-sectional area of the airways as the various articulators (glottis, velopharyngeal opening, supraglottal constriction, pharyngeal expansion or contraction) are manipulated to produce the different classes of consonants. From these estimates, aerodynamic parameters such as flows and pressures are calculated, and the time course of different acoustic sources at the glottis and near the supraglottal constriction is then determined. The modification of the sources by the vocal-tract transfer function is included to obtain predictions of the spectral changes in the radiated sound for different phases of the consonant production. The predictions are then compared with acoustic measurements of utterances produced by several speakers. In some cases where there are discrepancies between the acoustic data and the predictions, revised estimates of the model parameters are made. This process leads to reasonable agreement between the acoustic measurements and the predictions and provides a way of estimating the time course of the articulatory changes from the acoustic data.
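
The aerodynamic step of such models can be illustrated with the standard quasi-static orifice approximation relating pressure drop, flow, and constriction area, dP = rho U^2 / (2 A^2). The sketch below, including its constants and areas, is a textbook-style assumption rather than the group's actual model.

```python
import numpy as np

RHO = 1.14e-3   # approximate density of warm, humid air, g/cm^3

def flow_and_pressures(a_glottis, a_supra, p_sub):
    """Quasi-static flow (cm^3/s) and intraoral pressure (cm H2O) for glottal
    and supraglottal constrictions in series, using the kinetic-pressure
    (orifice) approximation dP = rho * U^2 / (2 A^2). Illustrative sketch."""
    p_sub_cgs = p_sub * 980.0                    # cm H2O -> dyn/cm^2
    k = RHO / 2.0
    # Series constrictions: total drop = k * U^2 * (1/Ag^2 + 1/As^2).
    u = np.sqrt(p_sub_cgs / (k * (a_glottis**-2 + a_supra**-2)))
    p_oral = k * (u / a_supra) ** 2 / 980.0      # drop across the supraglottal constriction
    return u, p_oral

u, p_oral = flow_and_pressures(a_glottis=0.1, a_supra=0.2, p_sub=8.0)
print(round(u, 1), "cm^3/s;", round(p_oral, 2), "cm H2O behind the constriction")
```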

In the case of voiceless aspirated stop consonants, the model includes a glottal spreading maneuver and manipulation of the pharyngeal walls to inhibit expansion of the vocal-tract volume during the consonantal closure interval.


The model generates a quantitative description of the sequence of acoustic events following the consonantal release: an initial transient, an interval of frication noise (at the supraglottal constriction), an interval of aspiration noise (at the glottis), a time in which there is breathy vocal-fold vibration, and, finally, modal vocal-fold vibration. For a voiced stop consonant, active pharyngeal volume expansion is included in the model, and the increase in intraoral pressure is thereby delayed.

The model for the production of the affricate /č/ in English specifies four components: (1) an initial release phase for the tongue tip, (2) a relatively slow downward movement of the tip (30-50 ms), (3) a steady palato-alveolar fricative portion, and (4) a release of the fricative portion into the vowel. From the model it is possible to calculate the amplitude, spectrum, and time course of the components of the sound resulting from these movements, as well as the airflow. The acoustic and aerodynamic predictions are in the range observed in actual productions of affricates. In the initial 30-50 ms there is a brief transient (1-2 ms), followed by a gradually rising noise amplitude, with a spectral prominence corresponding to the acoustic cavity in front of the tongue tip. During the later fricative phase, there is acoustic coupling to the palatal channel, leading to a second major prominence in the spectrum. The single prominence in the initial phase and the two prominences in the later frication spectrum achieve amplitudes that are comparable to or greater than the corresponding spectral prominences in the adjacent vowel. The sequence of acoustic events cannot be described simply as a /t/ followed by /š/ or as a /š/ with an abrupt onset, but is unique to the affricate.

1.6.3 Fricative Consonants: Modeling, Acoustics, and Perception

We have continued our modeling studies of the production of fricative consonants with an analysis of the aerodynamic and acoustic processes involved in the production of these consonants, and we have completed an investigation of the voicing distinction for fricatives. The fricative studies reported previously have been extended to include quantitative estimates of the amplitudes and locations of turbulence noise sources at the glottis and at the supraglottal constriction for various places of articulation. Reasonable agreement has been obtained between the predicted and measured acoustic characteristics of fricatives in intervocalic position. The acoustic and perceptual studies of voicing in fricatives (in collaboration with Sheila Blumstein of Brown University) have led to a delineation of the roles of the extent of formant transitions and the presence of glottal vibration at the edges of the obstruent interval in determining the voicing status of intervocalic fricatives. Data have also been obtained for fricative sequences with mixed voicing (e.g., the /zf/ sequence in his form), and have shown the predominant role of regressive voicing assimilation relative to progressive assimilation.

1.6.4 Acoustic Properties of Devoiced Semivowels

When the semivowels /w,y,r,l/ occur in clusters with unvoiced consonants, they are sometimes at least partially devoiced, so that much of the information about their features is in the devoiced region. While the acoustic characteristics of the semivowels which are manifest as sonorants are fairly well understood, little research has been conducted on those semivowels which are fricated. In this acoustic study, we recorded groups of words such as "keen," "clean," "queen," and "cream" at a slow and a fast rate by two speakers, one male and one female. We are investigating the spectral characteristics of the word-initial consonants and consonant clusters to determine the attributes in this region which signal the presence of a semivowel and the attributes which distinguish among the devoiced semivowels. So far, our findings suggest that the spectral characteristics of the consonant clusters can change considerably over time due to the presence of the semivowel. Data are being collected to quantify changes in the spectral characteristics within the consonant cluster, from an initial frication burst, to noise generated at the constriction for the semivowel, to voicing as the semivowel constriction is released. For example, one notable time-varying property of the [sw] cluster in "sweet" is the change from a rising spectral shape where the main concentration of energy is between 4 and 5 kHz (at the beginning of the frication noise) to a falling spectral shape where the main concentration of energy is between 1 and 2 kHz (towards the end of the frication noise), as the place of noise generation shifts from a post-dental to a velar location.
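
One way to quantify such a spectral change over time is a frame-by-frame band-energy balance. The band edges below follow the 4-5 kHz and 1-2 kHz regions mentioned above; the frame length and everything else are illustrative assumptions, not the study's analysis parameters.

```python
import numpy as np

def band_balance(signal, fs, frame_ms=10, lo=(1000, 2000), hi=(4000, 5000)):
    """Frame-by-frame difference (dB) between energy in a high band (4-5 kHz)
    and a low band (1-2 kHz); one way to track the falling-to-rising spectral
    change described for the [sw] cluster. Sketch under assumed band edges."""
    n = int(fs * frame_ms / 1000)
    out = []
    for start in range(0, len(signal) - n, n):
        frame = signal[start:start + n] * np.hanning(n)
        spec = np.abs(np.fft.rfft(frame)) ** 2
        freqs = np.fft.rfftfreq(n, 1.0 / fs)
        e = lambda band: spec[(freqs >= band[0]) & (freqs < band[1])].sum() + 1e-12
        out.append(10 * np.log10(e(hi) / e(lo)))
    return np.array(out)   # positive while energy sits at 4-5 kHz, negative later

fs = 16000
rng = np.random.default_rng(2)
noise = rng.normal(size=fs // 4)
print(band_balance(noise, fs)[:3])   # near 0 dB for white noise, as a sanity check
```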

1.6.5 Spreading of Retroflexion in American English

In this acoustic study we investigated the factors which influence the spreading of retroflexion from a postvocalic /r/ into the preceding vowel, /a/, in American English. We consider the acoustic correlate of the feature "retroflex" to be a low third formant (F3) which is close in frequency to the second formant (F2). Minimal pair words with initial and final obstruent consonants were recorded by six speakers at a slow and a fast speaking rate.


Several basic F3 trajectories could be observed during the /ar/ region, depending upon the timing of the minimum in the F3 trajectory in relation to the beginning and ending of the sonorant region. Given the same word said by different speakers at different speaking rates, the minimum in the F3 trajectory can occur at the beginning, in the middle, or at the end of the sonorant region. In addition, F3 can remain fairly flat at a low frequency throughout the sonorant region. We believe that this variability in the F3 trajectory can be explained by several factors. First, there is a certain basic F3 trajectory needed to articulate a postvocalic /r/: a downward movement from the F3 frequency of the previous sound and an upward movement to either a neutral position or the F3 position for the following sound. The F3 slopes on either side of the minimum will depend upon the context. Second, only a portion of this full F3 trajectory may be observable during the sonorant region. How much and what portion of the F3 trajectory can be observed will depend upon the duration of the sonorant region and on how early the speaker starts to produce the /r/. The anticipation of /r/ as well as the duration of the sonorant region appear to depend on speaking rate, consonantal context, and speaker differences. Further analysis is planned to determine more accurately the minimum time needed to execute the articulation of /r/.
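
Locating the F3 minimum within the sonorant region is straightforward once a formant track is available. The sketch below, with an assumed frame rate and function names of our own invention, classifies where the minimum falls in the way the paragraph describes (beginning, middle, or end).

```python
import numpy as np

def f3_minimum_position(f3_track, sonorant_start, sonorant_end, frame_rate=200.0):
    """Locate the minimum of an F3 trajectory within the sonorant region and
    report which third of the region it falls in. `f3_track` holds one F3
    value (Hz) per analysis frame; names and frame rate are assumptions."""
    i0, i1 = int(sonorant_start * frame_rate), int(sonorant_end * frame_rate)
    region = np.asarray(f3_track[i0:i1], float)
    rel = np.argmin(region) / max(len(region) - 1, 1)   # relative position, 0..1
    third = "beginning" if rel < 1/3 else ("middle" if rel < 2/3 else "end")
    return region.min(), rel, third

# Synthetic track with a dip late in a 300 ms sonorant region.
track = 2700 - 900 * np.exp(-((np.arange(60) - 40) / 10.0) ** 2)
print(f3_minimum_position(track, sonorant_start=0.0, sonorant_end=0.3))
```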

1.6.6 Studies of Unstressed Syllables

We have begun an investigation of the acoustic properties of the vowels and consonants in unstressed syllables in utterances produced with different speech styles. The aim of this study is to determine the kinds of reductions that can occur in unstressed syllables and to attempt to establish what aspects of the syllables are retained in spite of the reductions. As a first step in this project, we are examining vowels and consonants in unstressed syllables in which the consonants preceding and following the vowel are both voiceless (as in support, classical, potato). These are situations in which the vowel can become devoiced, but in which some remnant of the vowel remains in voiceless form, or in which temporal properties of the surrounding segments signal the presence of the syllable.

When there is severe reduction of the vowel, the presence of the unstressed syllable appears to be signaled in various ways: (1) the vowel is devoiced, but there is an interval in which the presence of a more open vocal-tract configuration between the consonants is signaled by aspiration noise; (2) in a word like support, when the vowel is devoiced, the frication (and possibly aspiration) noise within the initial /s/ shows an increased amplitude toward the end of the consonant, indicating that the consonant is being released into a more open vocal-tract configuration or that the consonant itself is more open toward the end of its duration; (3) properties of the adjacent consonant (such as aspiration in /p/ in support or in /k/ in classical) indicate that the consonant sequence could not have arisen unless an unstressed vowel were inserted between the consonants. These properties have been observed in a preliminary study, and a larger database of utterances is being prepared to conduct a more detailed investigation.

1.6.7 Nasalization in English

Nasalization is one of the sound properties, or features, languages use to distinguish words. Understanding the acoustic consequences of nasalization is therefore necessary for understanding how nasal sounds pattern in natural languages, and also for developing models of speech acoustics for use in speech synthesis and recognition. We have been investigating the important spectral consequences of nasalization using analysis of natural speech, and we have carried out perception experiments using both natural and synthetic speech. One set of perceptual experiments investigated how readily listeners can use nasalization on a (natural) vowel to guess the nasality of a following consonant (bead vs. bean). Our results indicate that English speakers notice nasality on a vowel more readily when the experimental task focuses their attention on it. This observation may be because they are not expecting nasalization on vowels, since in English vowels are nasalized only incidentally, before a nasal consonant, rather than being nasalized contrastively, as in French. One of the most consistent spectral effects of nasalization is damping of the first formant. Our analyses of natural speech items indicate that F1 prominence decreases with increasing opening of the velopharyngeal port. Perceptual experiments with synthetic speech indicate that even in the absence of other spectral cues to nasalization, a decrease in F1 prominence increases the likelihood that a vowel will be heard as nasal. This effect is stronger when F1 prominence decreases over the course of the vowel, rather than remaining static, suggesting that a decrease in F1 prominence over time is an essential part of the spectral pattern which is heard as nasal. However, it could also be simply due to the fact that time-varying synthetic items sound more natural to the listener than do static items. To choose between these alternative explanations, we plan to run similar experiments with listeners whose native language has contrastively nasalized vowels, in which the spectral cues to nasalization change little over the course of the vowel.


A somewhat less consistent spectral effect of nasalization is the presence of a visible zero in the spectral valley between F1 and F2. Our perceptual studies with natural speech items suggest that the zero can be an important factor in determining when a vowel is heard as nasal. One may infer from this result that a measure of relative spectral balance in the region between F1 and F2 may correlate well with independent indications of perceived nasality. This measure is currently under development. The results of this research, along with work reported previously, should lead to an improved understanding of the use of nasalization in natural speech.
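
Since the balance measure is described as still under development, the following is purely a hypothetical illustration of one form it could take: the depth of the spectral valley between F1 and F2 relative to the F1 peak, which a nasal zero in that region would deepen.

```python
import numpy as np

def f1_f2_valley_depth(spectrum_db, freqs, f1, f2):
    """Depth (dB) of the spectral valley between F1 and F2, relative to the F1
    peak. A hypothetical form of the 'relative spectral balance' measure the
    text says is under development; not the group's actual measure."""
    spectrum_db = np.asarray(spectrum_db, float)
    freqs = np.asarray(freqs, float)
    near = lambda f, w: spectrum_db[(freqs > f - w) & (freqs < f + w)]
    f1_peak = near(f1, 50).max()
    valley = spectrum_db[(freqs > f1 + 50) & (freqs < f2 - 50)].min()
    return f1_peak - valley   # a zero between F1 and F2 increases this value

# Toy spectrum: peaks near 500 and 1800 Hz with a dip (zero) near 1000 Hz.
freqs = np.arange(0, 3000, 10.0)
spec = (-0.01 * freqs + 20 * np.exp(-((freqs - 500) / 80) ** 2)
        + 15 * np.exp(-((freqs - 1800) / 120) ** 2)
        - 12 * np.exp(-((freqs - 1000) / 60) ** 2))
print(round(f1_f2_valley_depth(spec, freqs, f1=500, f2=1800), 1))
```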

1.6.8 Modeling Speech Perception in Noise: A Case Study of Prevocalic Stop Consonants

The present study is part of an attempt to model speech perception in noise by integrating knowledge of the acoustic cues for phonetic distinctions with theories of auditory masking. The methodology in the study is twofold: (1) modeling speech perception in noise based on results of auditory-masking theory and (2) evaluating the theoretical predictions by conducting a series of perceptual experiments in which particular acoustic attributes that contribute to several specific phonetic distinctions are selectively masked. The model predicts the level and spectrum of the masker needed to mask out important acoustic attributes in the speech signal, and it predicts listeners' responses to noisy speech signals.

We have been focusing on the importance of the F2 trajectory in signaling the place of articulation for the consonants /b,d/ in /CV/ syllables with the vowels /a/ and /ε/. In the /Ca/ context, the F2 trajectory from /d/ to the vowel falls, whereas the F2 trajectory in /ba/ syllables rises into the vowel and is less extensive than the /da/ trajectory. The reverse is true for /Cε/ syllables: the F2 trajectory is less extensive for /dε/ than it is for /bε/, i.e., there is almost no transition in F2 from the consonant to the vowel in /dε/.

Two perceptual experiments using synthetic /CV/ syllables were conducted. /C/ was either /b/ or /d/; the first set of experiments used the vowel /a/, and the second set used the vowel /ε/. The synthetic pairs (/ba/, /da/) and (/bε/, /dε/) were identical except for differences in the F2 trajectories (as described above). The utterances were degraded by adding various levels of white noise and were presented to subjects in identification tests. Results show that for /Ca/ stimuli with most of the F2 transition masked, yielding the perception of an almost "flat" trajectory, the listeners label the /da/ stimuli as /ba/. The reverse is true for /Cε/ syllables: when most of the F2 transition is masked, leaving a flat trajectory appropriate for /dε/, the listeners label the /bε/ stimuli as /dε/. These results indicate that the shape of the F2 trajectory is crucial in signaling the place of articulation for these consonants in noise. Results also show that the thresholds at which confusions in place of articulation occur can be estimated successfully from masking theory.

In future experiments we will examine listeners' responses to speech signals degraded by shaped noise, rather than white noise, in addition to using a larger set of consonants.

1.7 Speech Production Planning

Our work in speech planning continued the analysis of patterns in segmental speech errors, which provide evidence for models of the phonological planning process. For example, when two target segments in an utterance exchange, as in "pat fig" for "fat pig," they usually share position in both their words and stressed syllables. Since this constraint is widespread, it suggests that one or both of these larger units plays a role in the segmental planning process. In a series of elicitation experiments, we contrasted stimuli like "parade fad" (where /p/ and /f/ share word position but not stress) and "repeat fad" (where /p/ and /f/ share stress but not word position) with "ripple fad" (where /p/ and /f/ share neither word position nor stress). Results show that word position induces the largest number of errors, stress the next-largest, and pairs of segments that share neither positional factor participate in almost no errors. We conclude that both word structure and syllable stress must be incorporated into models of the representation that speakers use for phonological planning.

We are further pursuing the issue of position constraints on segmental errors and their implications for processing representations by examining an earlier finding that word-final segments participate in more errors when speakers produce lists of words (leap note nap lute), and fewer errors when those words occur in grammatically well-formed phrases (From the leap of the note to the nap of the lute) or in spontaneous speech. One possible account of this observation is that the planning of a more complex syntactic structure somehow protects word-final segments; another is that the presence of intervening words provides the protection; and a third possibility is that the more complex prosodic planning of longer grammatical phrases is responsible.


We are comparing the error patterns in short word lists (e.g., lame run ram Len) with sentences of similar length (Len's ram runs lame), and with longer sentences (But if Len has a ram it can run if it's lame) similar to the phrase condition in the earlier experiments. If syntactic structure alone is the factor that changes the proportion of final-segment errors, then short sentences should be as effective as longer sentences in eliciting the difference.

In a second line of investigation, we are asking how complex prosody affects the process of phonological planning. Our initial experiments, in which speakers used reiterant speech to imitate target sentences (i.e., using repeated instances of the syllable /ma/ to reproduce the prosody of the original), showed that it is more difficult to imitate metrically irregular sentences like "He told us it was too hot to work" than metrically regular sentences like "It was hot so we swam in the pool." Difficulty is measured by whether the number of syllables in the reiterant imitation matches that of the target sentence and by the speaker's estimate of whether the imitation was hard, medium, or easy. If prosodic planning and segmental planning interact, then prosodically irregular utterances with normal segment structure should elicit more segmental errors in normally-produced utterances as well. We are currently investigating this possibility.

Finally, in collaboration with Mari Ostendorf of Boston University and Patti Price of the Stanford Research Institute, we have investigated a third issue in production planning: the role of prosodic information (prominence, duration, etc.) in signaling aspects of both the structure and the meaning of utterances. In a study of seven types of syntactic ambiguity, using trained FM-radio speakers, we confirmed earlier (but not unanimous) findings showing that listeners can reliably select the meaning intended by the speaker for some but not all types of bracketing ambiguity. Phonological and acoustic analyses showed that speakers signal these distinctions in part by placement of the boundaries of prosodic constituents and that these boundaries are indicated by such prosodic cues as boundary tones and duration lengthening in the pre-boundary syllable.

1.8 Models Relating Phonetics, Phonology, and Lexical Access

In recent years phonologists have proposed that the features that characterize the segments in the representation of an utterance in memory should be organized in a hierarchical fashion. A geometrical tree-like arrangement for depicting phonological segments has emerged from this proposal. We have been examining whether phonetic considerations might play a role in the development of these new approaches to lexical representation of utterances. Our efforts in this direction are following two different paths: (1) a theoretical path that is attempting to modify or restructure the feature hierarchy derived from phonological considerations, and (2) a more practical path in which we are examining how a lexicon based on these principles might be implemented and how it might be accessed from acoustic data.

Theoretical considerations have led to a separation of features into two classes: (1) articulator-free features that specify the kind of constriction that is formed in the vocal tract, and (2) articulator-bound features indicating what articulators are to be manipulated. The articulator-bound features for a given segment can, in turn, be placed into two categories: (1) features specifying which articulator is implementing the articulator-free feature, and (2) features specifying additional secondary articulators that create a distinctive modulation of the sound pattern. The articulator-free features are manifested in the sound stream as landmarks with distinctive acoustic properties. The articulator-bound features for a given segment are represented in the sound in the vicinity of these landmarks. Details of the feature hierarchy that emerges from this view will appear in papers that are in press or in preparation.
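
The two-class separation can be pictured as a small data structure. The field names below are illustrative guesses at the organization described, not the notation of the papers in press.

```python
from dataclasses import dataclass, field

@dataclass
class Segment:
    """Toy hierarchical feature bundle separating articulator-free features
    (constriction type) from articulator-bound ones, in the spirit of the
    organization described above. Field names are illustrative only."""
    articulator_free: dict = field(default_factory=dict)      # e.g. {"continuant": False}
    primary_articulator: dict = field(default_factory=dict)   # implements the constriction
    secondary_articulators: dict = field(default_factory=dict)  # distinctive modulations

# A nasal stop /m/: the constriction is made by the lips; the open velum is a
# secondary articulator that modulates the sound pattern.
m = Segment(
    articulator_free={"consonantal": True, "continuant": False, "sonorant": True},
    primary_articulator={"labial": True},
    secondary_articulators={"nasal": True},
)
print(m.primary_articulator, m.secondary_articulators)
```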

Implementation of a lexicon based on this hierarchical arrangement is beginning, and we are planning the development of a system for automatic extraction of acoustic properties that identify some of the articulator-free and articulator-bound features in running speech.


1.9 Other Research Relating to Special Populations

1.9.1 Breathiness in the Speech of Hearing-impaired Children

One of the more prevalent speech abnormalities that contributes to reduced intelligibility in the speech of the hearing-impaired is excessive breathiness. According to a past study of normal speakers by D.H. Klatt and L.C. Klatt, H1 - H2 and aspiration noise are highly correlated with breathiness judgments. (H1 and H2 are the amplitudes of the first two harmonics, in dB, of the spectrum of a vowel.) A study was done to analyze quantitatively the middles of the vowels of hearing-impaired children and a few normal-hearing children. The acoustic parameters F0 (fundamental frequency), H1 - H2, aspiration noise, and spectral tilt were examined in relation to perceptual judgments of breathiness made by two experienced phoneticians who heard each vowel in a word or phrase context. The correlation coefficients were unexpectedly low for all of the acoustic parameters. Also, according to the judges, nasality often accompanied breathiness in the vowels of the hearing-impaired, and it was difficult to distinguish the two. To lessen the effect of nasality on the breathiness judgments, only vowels with low nasality judgments were analyzed. Although the correlation improved, the change was not large.

The main difference between this study and that of Klatt and Klatt was that the judgments in the Klatt study were made on isolated vowels. Another study was therefore conducted to examine the effect of adjacent segments on perceptual judgments of breathiness by presenting to the listeners vowels that were both isolated and imbedded in a word. In this preliminary study, a large difference was observed in the judgments of imbedded vowels in certain words: more breathiness was detected in a vowel when it was in a word context than when it was isolated. However, more experiments and examination of acoustic events near consonant-vowel boundaries must be done.

Another characteristic of breathiness is the introduction of extra pole-zero pairs due to tracheal coupling. Analysis-by-synthesis was done using KLSYN88, developed by Dennis Klatt. An extra pole-zero pair was often observed around 2 kHz for hearing-impaired speakers but not for normal speakers. However, there was low correlation between the prominence of the extra peak and the breathiness judgments, suggesting that it may not be possible to use the prominence of extra peaks to quantify breathiness. Synthesis was also done in which the extra pole-zero pair was removed. In some cases, the breathiness judgments were lower than for the vowels synthesized with the pole-zero pair, although in most cases there was not much difference.

1.9.2 Speech Planning Capabilities in Children

A project at Emerson College is studying the processes of speech encoding in children both with and without language impairment. One component of that project involves collaboration with our Speech Communications group at RLE. The purpose of this research is to examine the two populations for differences in ability to encode phonological information and translate it into a motor program. We have designed a test of children's word repetition skills that will give insight into their speech planning capabilities, and this test is being run on the two populations of children.

1.9.3 Training the /r/-/l/ Distinction for Japanese Speakers

For adult second-language learners, certain phoneme contrasts are particularly difficult to master. A classic example is the difficulty native Japanese speakers have with the English /r/-/l/ distinction. Even when these speakers learn to produce /r/ and /l/ correctly, their ability to perceive the distinction is sometimes at chance.

In conjunction with the Athena second-language learning project, we developed a computer-aided training program to assist in teaching the /r/-/l/ perceptual distinction. The basis of the training is the use of multiple natural tokens and the ability to select tokens randomly and to obtain feedback after each presentation. While training was slow, it was effective for a majority of subjects. Many reached 100 percent correct identification when tested after 30-50 sessions.
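
The training procedure (random token selection with immediate feedback) can be sketched as a simple loop. Token names and the simulated learner response below are placeholders, not the Athena program itself.

```python
import random

def training_session(tokens, n_trials=20, seed=None):
    """Console sketch of the identification training described above: randomly
    selected natural tokens with immediate feedback. `tokens` maps a label
    ("r" or "l") to a list of recordings; purely illustrative."""
    rng = random.Random(seed)
    correct = 0
    for _ in range(n_trials):
        label = rng.choice(sorted(tokens))
        token = rng.choice(tokens[label])   # this recording would be played
        answer = rng.choice(["r", "l"])     # stand-in for the learner's response
        if answer == label:
            correct += 1
        verdict = "correct" if answer == label else f"no, it was /{label}/"
        print(f"token={token!r}  response={answer}  {verdict}")
    print(f"score: {100.0 * correct / n_trials:.0f}%")

training_session({"r": ["rock.wav", "rake.wav"], "l": ["lock.wav", "lake.wav"]},
                 n_trials=3, seed=4)
```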

As another component of this project, a fast spectrograph display was developed as an aid to teaching the production of /r/ and /l/.


1.10 Facilities

1.10.1 Articulatory Movement Transduction

Development and testing of our system for Electromagnetic Midsagittal Articulometry has been completed. The final stages of this effort included: revising the electronics to eliminate two sources of transduction error; development of a more efficient transducer calibration method and construction of an apparatus to implement the method; improvement of the transducer connecting wire to overcome insulation failures; updating and elaboration of the signal processing software; development of a new computer program to display and extract synchronous data from articulatory and acoustic signals; and running a variety of tests to validate system performance. These tests indicate that performance is now acceptably close to its theoretical limit. Two trial experiments have been run with subjects. Those experiments have resulted in several important improvements to experimental methods, including techniques for attaching transducers to subjects and for presenting utterance materials. Preparations have been made to run experiments on "motor equivalence" and anticipatory coarticulation.

1.10.2 Computer Facilities

Work is in progress on a program for displaying spectrograms on our workstation screens, to increase the efficiency of certain kinds of acoustic analyses. A RISC-based compute server has been integrated into our network of workstations. Among other functions, it will perform compute-intensive calculations (such as spectrogram generation) at ten times the rate currently possible on our VAXStations. The WAVES acoustic analysis package has been installed on the RISC workstation for future use in work on lexical access.

1.11 Publications

Journal Articles and Published Papers

Butzburger, J., M. Ostendorf, P.J. Price, and S. Shattuck-Hufnagel. "Isolated Word Intonation Recognition Using Hidden Markov Models." Proc. ICASSP 90, pp. 773-776 (1990).

Boyce, S.E. "Coarticulatory Organization for Lip Rounding in Turkish and English." J. Acoust. Soc. Am. 88: 2584-2595 (1990).

Boyce, S.E., R.A. Krakow, F. Bell-Berti, and C.E. Gelfer. "Converging Sources of Evidence for Dissecting Articulatory Movements into Core Gestures." J. Phonetics 18: 173-188 (1990).

Hillman, R.E., E.B. Holmberg, J.S. Perkell, M. Walsh, and C. Vaughn. "Phonatory Function Associated with Hyperfunctionally Related Vocal Fold Lesions." J. Voice 4: 52-63 (1990).

Klatt, D.H., and L.C. Klatt. "Analysis, Synthesis, and Perception of Voice Quality Variations Among Female and Male Talkers." J. Acoust. Soc. Am. 87: 820-857 (1990).

Manuel, S.Y. "The Role of Contrast in Limiting Vowel-to-Vowel Coarticulation in Different Languages." J. Acoust. Soc. Am. 88: 1286-1298 (1990).

Perkell, J.S. "Testing Theories of Speech Production: Implications of Some Detailed Analyses of Variable Articulatory Data." In Speech Production and Speech Modeling. Eds. W.J. Hardcastle and A. Marchal. Boston: Kluwer Academic Publishers, 1990, pp. 263-288.

Stevens, K.N., and C.A. Bickley. "Higher-level Control Parameters for a Formant Synthesizer." Proceedings of the First International Conference on Speech Synthesis, Autrans, France, 1990, pp. 63-66.

Veilleux, N., M. Ostendorf, S. Shattuck-Hufnagel, and P.J. Price. "Markov Modeling of Prosodic Phrases." Proc. ICASSP 90, pp. 777-780 (1990).

Wodicka, G.R., K.N. Stevens, H.L. Golub, and D.C. Shannon. "Spectral Characteristics of Sound Transmission in the Human Respiratory System." IEEE Trans. Biomed. Eng. 37(12): 1130-1134 (1990).

Papers Submitted for Publication

Bickley, C.A. "Vocal-fold Vibration in a Computer Model of a Larynx." Submitted to Proceedings of the Vocal Fold Physiology Conference, Stockholm, Sweden, 1989. New York: Raven Press.

Halle, M., and K.N. Stevens. "The Postalveolar Fricatives of Polish." Submitted to Osamu Fujimura Festschrift.


Halle, M., and K.N. Stevens. "Knowledge of Language and the Sounds of Speech." Submitted to Proceedings of the Symposium on Music, Language, Speech and Brain, Stockholm, Sweden.

Huang, C.B. "Effects of Context, Stress, and Speech Style on American Vowels." Submitted to ICSLP'90, Kobe, Japan.

Lane, H., J.S. Perkell, M. Svirsky, and J. Webster. "Changes in Speech Breathing Following Cochlear Implant in Postlingually Deafened Adults." Submitted to J. Speech Hear. Res.

Perkell, J.S., and M.H. Cohen. "Token-to-token Variation of Tongue-body Vowel Targets: The Effect of Context." Submitted to Osamu Fujimura Festschrift.

Perkell, J.S., E.B. Holmberg, and R.E. Hillman. "A System for Signal Processing and Data Extraction from Aerodynamic, Acoustic and Electroglottographic Signals in the Study of Voice Production." Submitted to J. Acoust. Soc. Am.

Perkell, J.S., and M.L. Matthies. "Temporal Measures of Labial Coarticulation for the Vowel /u/." Submitted to J. Acoust. Soc. Am.

Shattuck-Hufnagel, S. "The Role of Word and Syllable Structure in Phonological Encoding in English." Submitted to Cognition (special issue).

Stevens, K.N. "Some Factors Influencing the Precision Required for Articulatory Targets: Comments on Keating's Paper." Papers in Laboratory Phonology I. Eds. J.C. Kingston and M.E. Beckman. Cambridge: Cambridge University Press.

Stevens, K.N. "Vocal-fold Vibration for Obstruent Consonants." Proc. of the Vocal Fold Physiology Conference, Stockholm, Sweden. New York: Raven Press.

Stevens, K.N., and C.A. Bickley. "Constraints among Parameters Simplify Control of Klatt Formant Synthesizer." Submitted to J. Phonetics.
