Voice Transformations: From Speech Synthesis to Mammalian Vocalizations

Min Tang, Chao Wang, Stephanie Seneff
Spoken Language Systems Group, Laboratory for Computer Science
Massachusetts Institute of Technology, Cambridge, Massachusetts 02139 USA
{mtang, wangc, seneff}@sls.lcs.mit.edu

Abstract

This paper describes a phase vocoder based technique for voice transformation. This method provides a flexible way to manipulate various aspects of the input signal, e.g., fundamental frequency of voicing (F0), duration, energy, and formant positions, without explicit F0 extraction. The modifications to the signal can be specific to any feature dimensions, and can vary dynamically over time.

There are many potential applications for this technique. In concatenative speech synthesis, the method can be applied to transform the speech corpus to different voice characteristics, or to smooth any pitch or formant discontinuities between concatenation boundaries. The method can also be used as a tool for language learning: we can modify the prosody of the student's own speech to match that of a native speaker, and use the result as guidance for improvements. The technique can also be used to convert other biological signals, such as killer whale vocalizations, to a signal that is more appropriate for human auditory perception. Our initial experiments show encouraging results for all of these applications.

1. Introduction

Voice transformation, as defined in this paper, is the process of transforming one or more features of an input signal to new target values. By features we mean fundamental frequency of voicing, duration, energy, and formant positions. Aspects that are not targeted for change should be maintained during the transformation process. The reconstructed signal should also be of high quality, without artifacts due to the signal processing.
This notion of voice transformation is related to, but distinct from, research on voice conversion, where a source speech waveform is modified so that the resulting signal sounds as if it were spoken by a target speaker [1]. A conversion of this type is often done by mapping between detailed features of the two speakers. Our research focuses instead on simultaneous manipulation along the dimensions listed above, in order to achieve interesting transformations of speech and other vocal signals. We will use "conversion" and "transformation" interchangeably in the rest of this paper.

In the past, research in speech synthesis has led to several voice transformation methods, mainly based on TD-PSOLA [2] or sinusoidal models [3]. The method we propose here is closely related to those methods, and can perform general transformation tasks while preserving excellent speech quality. (This research was supported by a contract from BellSouth and by DARPA under contract N660001-99-1-8904, monitored through Naval Command, Control, and Ocean Surveillance Center.)

Our system is based on a phase vocoder, where the signal is broken down into a set of narrow-band filter channels, and each channel is characterized in terms of its magnitude and phase [4, 5]. The magnitude spectrum is first flattened to remove the effects of the vocal tract resonances, and then a transformed version of the magnitude spectrum can be re-applied to produce apparent formant shifts. The phase spectrum, once it is unwrapped, represents the dominant frequency component of each channel; it can be multiplied by a factor to produce an apparent change in the fundamental frequency of voicing. Additional harmonics may need to be generated; these can be created by duplicating and shifting the original phase spectrum.
In this fashion, we can perform a variety of interesting transformations on the input signal, from simply speeding it up to generating an output signal that sounds like a completely different person. We believe that the technique will be useful in many application areas, including speech synthesis, language learning, and even the analysis of animal vocalizations.

Voice transformation has the potential to solve several problems associated with concatenative speech synthesis, where synthetic speech is constructed by concatenating speech units selected from a large corpus. The unit selection criteria are typically complex, attempting to account for the detailed phonological and prosodic contexts of the synthesis target. Such a corpus-based method often succeeds in producing high quality speech, but several problems remain. First, the cost of collecting and transcribing a corpus is high, making it impractical to collect data from multiple speakers. Second, there usually exist discontinuities in both formants and F0 values at concatenation boundaries. Third, high-level prosodic constraints are often difficult to capture, and an incorrect prosodic contour leads to the perception of a substantially inferior quality in the synthetic speech.

To deal with the problem of generating multiple voices, a voice transformation system can be used to generate speech that sounds distinctly different from the voice of the input corpus. The entire corpus can be preprocessed through a transformation phase that changes a female voice into a male-like or child-like voice, while preserving the temporal aspects. In this way, a corpus from a single speaker can be leveraged to yield an apparent multiple-speaker synthesis system.

The transformation system can also be utilized to alter the shape of the F0 contour of a concatenated sequence, in order to gain explicit control over the desired intonation phrases.
By applying voice transformation techniques to edit concatenation units post hoc, we can relax the selection criteria and consequently reduce the size of the corpus. The constraints for unit selection can then focus on manner/place/voicing conditions, without addressing the issue of the prosodic contour.

Another area where we envision utility of the voice transformation system is as an aid to teaching pronunciation, particularly for learners of a foreign language.

Presented at the 7th European Conference on Speech Communication and Technology, Aalborg, Denmark, September, 2001.




Figure 1: Illustration of deconvolving the vocal-tract filter and excitation. The original magnitude spectrum is on the left, and the spectral envelope and excitation magnitude spectrum are on the right (all plotted against DFT sample index).

The idea is to transform the student's own speech to repair inappropriately uttered prosodic aspects. This can be particularly effective for tone languages, where speakers who are unfamiliar with the concept of tone may not be able to perceive what is incorrect about their utterances. By transforming the student's speech to repair incorrectly uttered tones, the system allows the user to compare their own speech with the same speech tonally repaired. The student will hopefully be better able to distinguish which aspect is erroneous, because the speaker quality, temporal aspects, and phonetic content are held constant.

Finally, we have also used the voice transformation system to process other biological signals, in particular, killer whale vocalizations. These signals typically contain significant information at frequencies well beyond the human hearing range. The phase vocoder can be used to compress the frequency content while preserving the temporal aspects of the signal, bringing it into a range where human perception is acute.

The rest of the paper is organized as follows: in Section 2, we briefly summarize the methodology of our phase-vocoder based voice transformation system. In Section 3, implementation details are discussed. Finally, in Section 4, we present and discuss the experimental results.

2. Methodology

The source-filter model considers speech to be the convolution of an excitation source and a vocal-tract filter. In a phase vocoder, the input signal is first passed through a filter bank containing N contiguous band-pass filters, to obtain a magnitude spectrum and a phase spectrum. The spectral envelope of the magnitude spectrum, which characterizes the frequency response of the vocal-tract filter, can be obtained by low-pass filtering the magnitude spectrum. Deconvolution can then be done by point-by-point division of the magnitude spectrum by the spectral envelope samples; the result is the excitation magnitude spectrum. Figure 1 shows the effect of deconvolving the vocal-tract filter and the excitation magnitude spectrum.
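The envelope/excitation split described above can be sketched as follows. This is an illustrative reconstruction, not the authors' code: the FFT size, the analysis window, and the moving-average smoother (standing in for the low-pass filter applied to the magnitude spectrum) are all assumed parameters.

```python
import numpy as np

def deconvolve_frame(frame, n_fft=256, smooth_len=15):
    """Split one windowed frame into a spectral envelope and an excitation
    magnitude spectrum, in the source-filter view of the text."""
    spec = np.fft.rfft(frame, n_fft)
    mag = np.abs(spec)
    phase = np.angle(spec)
    # Low-pass filter the magnitude spectrum along frequency to estimate
    # the envelope (a simple moving average, chosen for illustration).
    kernel = np.ones(smooth_len) / smooth_len
    envelope = np.convolve(mag, kernel, mode="same")
    # Point-by-point division flattens the spectrum, leaving the excitation.
    excitation = mag / np.maximum(envelope, 1e-12)
    return envelope, excitation, phase
```

Reconvolving a (possibly resampled) envelope with the flattened excitation, as in Section 3.3, then amounts to an element-wise multiplication before the inverse transform.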

The phase spectrum of each filter is unwrapped in time, in order to estimate the frequency of the energy contained in that filter. The first derivative of the unwrapped phases equals the frequency of the sinusoid component passing through the corresponding band-pass filter. Multiplying the first derivative by a factor β and then reconstructing the phase spectrum will change the frequency of the particular sinusoid component to

Figure 2: Schematic of phase (in radians), unwrapped phase, and doubled unwrapped phase, for a single channel [4] (phase in radians plotted against time index).

a frequency of βf, where f is its original frequency. If the frequencies of all sinusoid components are changed by the same factor β, the effect is to change the pitch by a factor β. This process is illustrated in Figure 2. Feature transformation can then be realized by modifying the spectral envelope and/or the phase spectrum. After modification, signals are reconstructed for each band-pass filter. Finally, the time-domain signal is reconstructed by combining the signals from each filter.
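For a single channel, the pitch-scaling step (unwrap the phase in time, differentiate, multiply by β, re-integrate) can be sketched as below. This is an illustrative reconstruction under the assumption of one phase sample per analysis frame, not the system's implementation.

```python
import numpy as np

def scale_channel_pitch(phase_frames, beta):
    """Scale the instantaneous frequency of one vocoder channel by beta.

    phase_frames: wrapped phase of this channel at successive frames.
    Returns a phase track whose first difference (the instantaneous
    frequency) is beta times the original."""
    unwrapped = np.unwrap(phase_frames)   # remove 2*pi jumps across frames
    dphi = np.diff(unwrapped)             # instantaneous frequency per step
    # Re-integrate the scaled derivative to rebuild the phase track.
    return np.concatenate(([unwrapped[0]], unwrapped[0] + np.cumsum(beta * dphi)))
```

Applying the same β to every channel shifts the pitch by β, as stated in the text.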

3. Implementation Issues

In this section, we discuss techniques for basic transformation tasks. A composite of these basic tasks can achieve more complex transformations, such as male↔female conversion.

3.1. Iterative DFT

In our system, an N-point DFT of the input signal is computed at the original sampling rate, which is equivalent to passing the signal through a bank of N contiguous band-pass filters. For efficiency, the system uses an iterative method, as described in [4], to compute the DFT, where each new output is obtained through incremental adjustments of the preceding output.
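One standard form of this incremental update is the sliding DFT: each new spectrum equals the previous one, corrected by the sample entering and the sample leaving the window, then rotated by one bin phase step. A sketch of the idea (the exact recurrence in [4] may differ in detail):

```python
import numpy as np

def sliding_dft(x, n_fft):
    """Successive N-point DFTs of a sliding window via incremental update.

    Each step costs O(N) instead of O(N log N): subtract the outgoing
    sample, add the incoming one, and rotate every bin by exp(2*pi*j*k/N)."""
    twiddle = np.exp(2j * np.pi * np.arange(n_fft) / n_fft)
    padded = np.concatenate([np.zeros(n_fft), np.asarray(x, dtype=float)])
    spec = np.zeros(n_fft, dtype=complex)  # DFT of the initial all-zero window
    out = []
    for n in range(n_fft, len(padded)):
        spec = twiddle * (spec + padded[n] - padded[n - n_fft])
        out.append(spec.copy())
    return np.array(out)
```

After the window has filled, each output row matches the direct DFT of the most recent N samples, which is what makes the method usable as a bank of N band-pass filters.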

3.2. Deconvolving the DFT Spectrum

The N DFT coefficients are then converted from rectangular to polar coordinates. The magnitude spectral envelope is estimated by low-pass filtering the magnitude spectrum. The excitation magnitude spectrum is obtained by dividing the magnitude spectrum by the smooth spectrum. The phase spectrum is unwrapped and first-differenced to obtain the instantaneous frequency.

The N-point excitation spectrum (excitation magnitude spectrum and phase spectrum) ranges from 0 to π. As we will soon see, sometimes we need to generate additional samples of the excitation spectrum beyond π. Generating these phantom excitation samples can be done by a cloning method described in [4].

3.3. Formant Modifications

The apparent positions of the formant frequencies can be altered by resampling the smoothed magnitude spectrum, and re-convolving it with the previously flattened excitation spectrum. For example, if we interpolate the spectral envelope by a factor of 1.2 and discard the extra points at the upper end, the formants will be moved up by roughly 20 percent. To move formants down, we need to decimate the spectral envelope. When reconstructing the speech signal, we discard some points at the upper end of the excitation spectrum and phase spectrum.
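The envelope-resampling step can be sketched as follows; linear interpolation and the clamping behavior at the upper end are illustrative assumptions, not the system's exact resampling scheme.

```python
import numpy as np

def shift_formants(envelope, factor):
    """Resample a smoothed magnitude spectrum so that apparent formant
    positions move by `factor` (e.g. 1.2 raises them ~20%), keeping the
    number of frequency samples fixed."""
    n = len(envelope)
    # Read the old envelope at frequency k/factor, so a peak at bin p
    # moves to bin p*factor; for factor > 1 the top of the old envelope
    # is discarded, loosely matching the "discard extra points" step.
    src = np.arange(n) / factor
    return np.interp(src, np.arange(n), envelope)
```

Multiplying the shifted envelope back onto the flattened excitation spectrum then yields the formant-shifted magnitude spectrum.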



Correspondingly, the reconstructed speech will lose some high-frequency energy.

3.4. Pitch Modification

Pitch modification can be done by modifying the first derivatives of the unwrapped phases. To preserve the formant positions, we have to interpolate the spectral samples to compensate for the shifted phase spectrum. Thus, if we want to keep the spectral envelope intact when we change the pitch by a factor of β, the spectral envelope needs to be interpolated by a factor of 1/β to hold the formants in position.

Another issue which needs to be considered is the change in the frequency range that will be caused by pitch modification. If the Nyquist frequency of the input speech is 4000 Hz (which means a sampling rate of 8000 Hz) and the pitch is changed by a factor of β, the frequency range after the pitch modification would be 0 to 4000β Hz. When β > 1, the original 8000 Hz sampling rate is insufficient, and there will be frequency aliasing. A simple solution is to discard the signal above 4000 Hz. However, if β < 1, the reconstructed signal will lose energy in the high frequencies. To make up for the loss, phantom excitation samples can be generated prior to the pitch modification.
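The bookkeeping above amounts to a one-line calculation; a sketch with the paper's 8000 Hz example rate as an assumed default:

```python
def post_pitch_bandwidth(beta, fs=8000.0):
    """Top frequency occupied after scaling all channel frequencies by beta,
    and whether it exceeds the original Nyquist limit (i.e. would alias
    unless the excess is discarded)."""
    top = beta * fs / 2.0
    return top, top > fs / 2.0
```

For β > 1 the second value is True (discard above Nyquist, or resample); for β < 1 it is False, and the emptied band above 4000β Hz is where phantom excitation samples are needed.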

3.5. Temporal Modification / Frequency Compression

Temporal characteristics of the input signal can also be manipulated. To slow down the speech, extra samples need to be generated. The magnitude spectrum of the generated samples can be obtained by interpolating the magnitude spectrum of the input signal in the time dimension. The phase-derivative spectrum is interpolated similarly, and then the phase spectrum is restored using the interpolated phase derivatives. When the signal is reconstructed, there will be extra samples. If the signal is then played at the original sampling rate, the effect is that the speech is slowed down. Similarly, by decimating the magnitude and phase-derivative spectra before reconstruction, we can reduce the number of samples, and consequently speed up the signal. By manipulating the sampling rate, we can convert temporal change to frequency change.
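The time-dimension interpolation of the magnitude and phase-derivative tracks can be sketched as below; the (frames × channels) layout and the use of linear interpolation are assumptions for illustration.

```python
import numpy as np

def stretch_frames(mags, dphis, rate):
    """Time-scale a vocoder analysis by resampling per-channel magnitude
    and phase-derivative tracks along time.

    mags, dphis: arrays of shape (frames, channels).
    rate < 1 generates extra frames (slower speech when played back at
    the original sampling rate); rate > 1 decimates (faster speech)."""
    n_in = mags.shape[0]
    n_out = int(round(n_in / rate))
    t_in = np.arange(n_in)
    t_out = np.linspace(0, n_in - 1, n_out)
    new_mags = np.stack(
        [np.interp(t_out, t_in, mags[:, k]) for k in range(mags.shape[1])], axis=1)
    new_dphis = np.stack(
        [np.interp(t_out, t_in, dphis[:, k]) for k in range(dphis.shape[1])], axis=1)
    new_phases = np.cumsum(new_dphis, axis=0)  # restore phases from derivatives
    return new_mags, new_phases
```

Playing the stretched reconstruction back at a different sampling rate is what converts this temporal change into the frequency compression used for the whale recordings in Section 4.3.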

4. Experimental Results

4.1. Male ↔ Female Conversion

To test the idea of voice conversion, we have conducted two experiments: (1) converting a synthesis corpus of female recordings into a male-like voice, and (2) converting a male Dectalk voice into a female voice. In the first experiment, all of the conversion can be performed in advance, creating an entire corpus with exactly the same temporal characteristics as the original one, but with a substantial change in the speaker characteristics.

The original synthesis corpus [6] was converted by lowering the fundamental frequency by 30% and the spectrum (formants) by 25%; the spectral envelope was thus interpolated by a factor of 0.75/0.7 ≈ 1.07, and an additional thirteen phantom excitation samples were generated. The conversion preserves temporal characteristics, and hence we can reuse all the aligned transcription data from the original corpus. Figure 3 shows that, after conversion, the pitch and formant positions of the speech are similar to those of a corresponding natural male utterance.

Qualitative analysis reveals that male-to-female conversion achieves better quality than female-to-male conversion. We suspect this may be due to the extra complexity of generating the extra phantom excitation samples.

4.2. Language Learning

One of the difficulties in learning a foreign language is in mastering the prosodic aspects of the new language. For example, a native Chinese speaker generally has difficulty with stress and timing when speaking English. On the other hand, it is very hard for a native English speaker to speak Mandarin Chinese with correct tones, especially in complete sentences.

Voice transformation can potentially be used to enhance the experience of learning a foreign language. We can convert the prosody of a student's utterance to better match that expected from a native speaker, and use the modified utterance as a target for improvements. We believe that such a feedback mechanism would be more valuable to the student than an example spoken by a native speaker, because he or she can better perceive the differences and imitate the target speech, without the distractions from less relevant factors such as voice characteristics. The conversion can also be controlled to change only a certain aspect of the speech, to make the distinctions more apparent and easier for the student to follow.

We tested this idea by modeling the F0 contours of Mandarin Chinese phrases spoken by a native English speaker. Mandarin Chinese is a tonal language, in which the syllable F0 contour is essential to the meaning of sounds. There are four lexical tones in the language, each defined by a canonical F0 pattern: high-level, high-rising, low-dipping, and high-falling. Although it is not too difficult for a non-native speaker to learn the tones in isolated syllables, the task becomes much more difficult for continuous speech due to tone coarticulation effects. The tone contours are perturbed in particular ways to form a smooth and coherent sentence F0 contour. The process is very natural to a native speaker, but very hard for non-native speakers to learn, except by mimicking and practicing.

To predict a high-quality tone contour for continuous Mandarin speech, we obtained a complete profile of tones in all contexts from a spontaneous Mandarin speech corpus recorded from native Chinese speakers as examples [7]. The contextual effects are captured by means of context-dependent tone model parameters, which are discrete Legendre coefficients of syllable F0 contours [8]. The conversion is carried out as follows. For each non-native Mandarin utterance, we first obtain its phonetic alignment and context-dependent tone labels. Then a basic target F0 contour is constructed by piecing together the corresponding "standard" tone templates, with any remaining voiced gaps filled by linear interpolation. The new contour is smoothed and scaled to match the average F0 of the speaker. A conversion ratio is thus obtained for each voiced frame of the speech signal, which is then used to modify the F0 of the original waveform. Unvoiced frames are kept unchanged. Figure 4 illustrates this process with an example sentence. We tested this method on a few utterances and obtained good results. The tones in the reconstructed speech are significantly improved, as judged by native Chinese speakers, while the other qualities of the waveforms are largely preserved.
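The Legendre parameterization of a syllable F0 contour can be sketched with numpy's Legendre utilities; the polynomial order and the [-1, 1] time normalization here are assumptions for illustration, not the exact model of [8].

```python
import numpy as np

def tone_legendre_coeffs(f0, order=3):
    """Fit a syllable F0 contour with low-order Legendre polynomials over
    normalized time in [-1, 1]; the coefficients give a compact,
    duration-independent tone-shape parameterization."""
    t = np.linspace(-1, 1, len(f0))
    return np.polynomial.legendre.legfit(t, f0, order)

def contour_from_coeffs(coeffs, n):
    """Reconstruct a smooth F0 contour (n frames) from the coefficients."""
    t = np.linspace(-1, 1, n)
    return np.polynomial.legendre.legval(t, coeffs)
```

Reconstructing a target contour from context-dependent coefficients and dividing it, frame by frame, by the speaker's original F0 yields the per-frame conversion ratios described in the text.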

4.3. Mammalian Vocalization

The voice transformation system can also be used to process non-speech signals. Killer whales produce a wide variety of distinct vocalizations [9], and a significant portion of the content is at frequencies that are well beyond the human hearing range. We have access to a large number of high-quality recordings obtained by Patrick Miller in Puget Sound using a beamforming array [10]. In a pilot study, we tried compressing frequencies by a factor of three, while preserving temporal aspects. The result



Figure 3: Waveform and spectrogram of a female utterance before and after conversion (panels labeled "Original" and "Converted"). Notice that the formants in the converted spectrogram are shifted downward, and the higher frequency region is excited by phantom excitations.

Figure 4: Spectrogram and pitch contour (Hz) of the utterance "shang4 hai3 ne5?" (How about Shanghai?) before and after tone modification. The target pitch contour is generated using context-dependent tone model parameters. Some high frequency energy components are lost when the F0 is lowered, which can be compensated by generating phantom excitations.

is a signal that is much easier to study using standard tools for examining speech, and that also, perhaps more importantly, is more accessible to the human auditory system. A final advantage is that the storage requirements are reduced by a factor of three.

5. Summary

In this paper, we have proposed a voice transformation method based on the phase vocoder. This method demonstrates the ability to transform speech signals to various desired targets. The reconstructed speech has been found to be of high quality.

We have shown that a synthesis corpus can be transformed to generate speech of a different voice quality. In addition, we can adjust the intonation contour after concatenation to improve the prosodic quality. We have also demonstrated that the voice transformation system could help in language learning, particularly with regard to teaching non-native speakers of a tone language to utter the tones correctly. Finally, we have found the system to be useful for transforming killer whale vocalizations down into the human auditory frequency range.

6. Acknowledgements

Patrick Miller of the Woods Hole Oceanographic Institute provided us with example recordings from killer whales living in the Puget Sound area.

7. References

[1] Y. Stylianou, O. Cappe, E. Moulines, "Continuous probabilistic transform for voice conversion", IEEE Trans. Speech and Audio Proc., 6(2):131-142, 1998.

[2] E. Moulines, F. Charpentier, "Pitch synchronous waveform processing techniques for text-to-speech synthesis using diphones", Speech Communication, V.9, pp. 453-467, 1990.

[3] M. P. Pollard, B. M. G. Cheetham, C. C. Goodyear, M. D. Edgington, A. Lowry, "Enhanced shape-invariant pitch and time-scale modification for concatenative speech synthesis", ICASSP 97, V.2, pp. 919-922, Munich, Germany, 1997.

[4] S. Seneff, "Speech Transformation System (Spectrum and/or Excitation) without Pitch Extraction", Technical Report 541, Lincoln Laboratory, MIT.

[5] J. L. Flanagan, R. M. Golden, "Phase Vocoder", Bell System Tech. J., V.45, pp. 1493-1500, 1966.

[6] J. R. W. Yi, J. R. Glass, "Natural-sounding speech synthesis using variable-length units", ICSLP 98, pp. 1167-1170, Sydney, Australia, 1998.

[7] C. Wang, J. Glass, H. Meng, J. Polifroni, S. Seneff, V. Zue, "Yinhe: a Mandarin Chinese Version of the Galaxy system", EUROSPEECH 97, pp. 351-354, Rhodes, Greece, 1997.

[8] C. Wang, S. Seneff, "Improved tone recognition by normalizing for coarticulation and intonation effects", ICSLP 00, V.2, pp. 83-86, Beijing, China, 2000.

[9] J. K. B. Ford, "Acoustic behavior of Resident Killer Whales (Orcinus orca) off Vancouver Island, British Columbia," Canadian Journal of Zoology 67, pp. 727-745, 1989.

[10] P. J. Miller and P. L. Tyack, "A Small Towed Beamforming Array to Identify Vocalizing Resident Killer Whales (Orcinus orca) concurrent with focal behavioral observations," Deep-Sea Research II, V. 45, 1998.
