Voice Transformations: From Speech Synthesis to Mammalian Vocalizations

Min Tang, Chao Wang, Stephanie Seneff
Spoken Language Systems Group, Laboratory for Computer Science
Massachusetts Institute of Technology, Cambridge, Massachusetts 02139 USA
{mtang, wangc, seneff}@sls.lcs.mit.edu

Abstract

This paper describes a phase vocoder based technique for voice transformation. This method provides a flexible way to manipulate various aspects of the input signal, e.g., fundamental frequency of voicing (F0), duration, energy, and formant positions, without explicit F0 extraction. The modifications to the signal can be specific to any feature dimensions, and can vary dynamically over time.

There are many potential applications for this technique. In concatenative speech synthesis, the method can be applied to transform the speech corpus to different voice characteristics, or to smooth any pitch or formant discontinuities between concatenation boundaries. The method can also be used as a tool for language learning: we can modify the prosody of the student's own speech to match that of a native speaker, and use the result as guidance for improvements. The technique can also be used to convert other biological signals, such as killer whale vocalizations, to a signal that is more appropriate for human auditory perception. Our initial experiments show encouraging results for all of these applications.

1. Introduction

Voice transformation, as defined in this paper, is the process of transforming one or more features of an input signal to new target values. By features we mean fundamental frequency of voicing, duration, energy, and formant positions. Aspects that are not targeted for change should be maintained during the transformation process. The reconstructed signal should also be of high quality, without artifacts due to the signal processing.
This notion of voice transformation is related to, but distinct from, research on voice conversion, where a source speech waveform is modified so that the resulting signal sounds as if it were spoken by a target speaker [1]. A conversion of this type is often done by mapping between detailed features of the two speakers. Our research focuses instead on simultaneous manipulation along the dimensions listed above, in order to achieve interesting transformations of speech and other vocal signals. We will use "conversion" and "transformation" interchangeably in the rest of this paper.

In the past, research in speech synthesis has led to several voice transformation methods, mainly based on TD-PSOLA [2] or sinusoidal models [3]. The method we propose here is closely related to those methods, and can perform general transformation tasks while preserving excellent speech quality. (This research was supported by a contract from BellSouth and by DARPA under contract N660001-99-1-8904, monitored through Naval Command, Control, and Ocean Surveillance Center.)

Our system is based on a phase vocoder, where the signal is broken down into a set of narrow-band filter channels, and each channel is characterized in terms of its magnitude and phase [4, 5]. The magnitude spectrum is first flattened to remove the effects of the vocal tract resonances, and then a transformed version of the magnitude spectrum can be re-applied to produce apparent formant shifts. The phase spectrum, once it is unwrapped, represents the dominant frequency component of each channel; it can be multiplied by a factor to produce an apparent change in the fundamental frequency of voicing. Additional harmonics may need to be generated; these can be created by duplicating and shifting the original phase spectrum.
In this fashion, we can perform a variety of interesting transformations on the input signal, from simply speeding it up to generating an output signal that sounds like a completely different person. We believe that the technique will be useful in many application areas, including speech synthesis, language learning, and even the analysis of animal vocalizations.

Voice transformation has the potential to solve several problems associated with concatenative speech synthesis, where synthetic speech is constructed by concatenating speech units selected from a large corpus. The unit selection criteria are typically complex, attempting to account for the detailed phonological and prosodic contexts of the synthesis target. Such a corpus-based method often succeeds in producing high quality speech, but several problems remain. First, the cost of collecting and transcribing a corpus is high, making it impractical to collect data from multiple speakers. Second, there usually exist discontinuities in both formants and F0 values at concatenation boundaries. Third, high-level prosodic constraints are often difficult to capture, and an incorrect prosodic contour leads to the perception of a substantially inferior quality in the synthetic speech.

To deal with the problem of generating multiple voices, a voice transformation system can be used to generate speech that sounds distinctly different from the voice of the input corpus. The entire corpus can be preprocessed through a transformation phase that changes a female voice into a male-like or child-like voice, while preserving the temporal aspects. In this way, a corpus from a single speaker can be leveraged to yield an apparent multiple-speaker synthesis system.

The transformation system can also be utilized to alter the shape of the F0 contour of a concatenated sequence, in order to gain explicit control over the desired intonation phrases.
By applying voice transformation techniques to edit concatenation units post hoc, we can relax the selection criteria and consequently reduce the size of the corpus. The constraints for unit selection can then focus on manner/place/voicing conditions, without addressing the issue of the prosodic contour.

Another area where we envision utility of the voice transformation system is as an aid to teaching pronunciation, particularly for learners of a foreign language.

Presented at the 7th European Conference on Speech Communication and Technology, Aalborg, Denmark, September, 2001.




Figure 1: Illustration of deconvolving the vocal-tract filter and excitation. The original magnitude spectrum is on the left, and the spectral envelope and excitation magnitude spectrum are on the right (all plotted against DFT sample index).

The idea is to transform the student's own speech to repair inappropriately uttered prosodic aspects. This can be particularly effective for tone languages, where speakers who are unfamiliar with the concept of tone may not be able to perceive what is incorrect about their utterances. By transforming the student's speech to repair incorrectly uttered tones, the system allows the user to compare their own speech with the same speech tonally repaired. The student will hopefully be better able to distinguish which aspect is erroneous, because the speaker quality, temporal aspects, and phonetic content are held constant.

Finally, we have also used the voice transformation system to process other biological signals, in particular, killer whale vocalizations. These signals typically contain significant information at frequencies well beyond the human hearing range. The phase vocoder can be used to compress the frequency content while preserving the temporal aspects of the signal, bringing it into a range where human perception is acute.

The rest of the paper is organized as follows: in Section 2, we briefly summarize the methodology of our phase-vocoder based voice transformation system. In Section 3, implementation details are discussed. Finally, in Section 4, we present and discuss the experimental results.

2. Methodology

The source-filter model considers speech to be the convolution of an excitation source and a vocal-tract filter. In a phase vocoder, the input signal is first passed through a filter bank containing N contiguous band-pass filters, to obtain a magnitude spectrum and a phase spectrum. The spectral envelope of the magnitude spectrum, which characterizes the frequency response of the vocal-tract filter, can be obtained by low-pass filtering the magnitude spectrum. Deconvolution can then be done by point-by-point division of the magnitude spectrum by the spectral envelope samples; the result is the excitation magnitude spectrum. Figure 1 shows the effect of deconvolving the vocal-tract filter and the excitation magnitude spectrum.
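The envelope/excitation split described above can be sketched as follows. This is an illustrative reconstruction, not the authors' code: the FFT size, the analysis window, and the moving-average smoother (standing in for the low-pass filter applied to the magnitude spectrum) are all assumed parameters.

```python
import numpy as np

def deconvolve_frame(frame, n_fft=256, smooth_len=15):
    """Split one windowed frame into a spectral envelope and an excitation
    magnitude spectrum, in the source-filter view of the text."""
    spec = np.fft.rfft(frame, n_fft)
    mag = np.abs(spec)
    phase = np.angle(spec)
    # Low-pass filter the magnitude spectrum along frequency to estimate
    # the envelope (a simple moving average, chosen for illustration).
    kernel = np.ones(smooth_len) / smooth_len
    envelope = np.convolve(mag, kernel, mode="same")
    # Point-by-point division flattens the spectrum, leaving the excitation.
    excitation = mag / np.maximum(envelope, 1e-12)
    return envelope, excitation, phase
```

Reconvolving a (possibly resampled) envelope with the flattened excitation, as in Section 3.3, then amounts to an element-wise multiplication before the inverse transform.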

The phase spectrum of each filter is unwrapped in time, in order to estimate the frequency of the energy contained in that filter. The first derivative of the unwrapped phases equals the frequency of the sinusoid component passing through the corresponding band-pass filter. Multiplying the first derivative by a factor β and then reconstructing the phase spectrum will change the frequency of the particular sinusoid component to

Figure 2: Schematic of phase (in radians), unwrapped phase, and doubled unwrapped phase, for a single channel [4] (phase in radians plotted against time index).

a frequency of βf, where f is its original frequency. If the frequencies of all sinusoid components are changed by the same factor β, the effect is to change the pitch by a factor β. This process is illustrated in Figure 2. Feature transformation can then be realized by modifying the spectral envelope and/or the phase spectrum. After modification, signals are reconstructed for each band-pass filter. Finally, the time-domain signal is reconstructed by combining the signals from each filter.
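For a single channel, the pitch-scaling step (unwrap the phase in time, differentiate, multiply by β, re-integrate) can be sketched as below. This is an illustrative reconstruction under the assumption of one phase sample per analysis frame, not the system's implementation.

```python
import numpy as np

def scale_channel_pitch(phase_frames, beta):
    """Scale the instantaneous frequency of one vocoder channel by beta.

    phase_frames: wrapped phase of this channel at successive frames.
    Returns a phase track whose first difference (the instantaneous
    frequency) is beta times the original."""
    unwrapped = np.unwrap(phase_frames)   # remove 2*pi jumps across frames
    dphi = np.diff(unwrapped)             # instantaneous frequency per step
    # Re-integrate the scaled derivative to rebuild the phase track.
    return np.concatenate(([unwrapped[0]], unwrapped[0] + np.cumsum(beta * dphi)))
```

Applying the same β to every channel shifts the pitch by β, as stated in the text.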

3. Implementation Issues

In this section, we discuss techniques for basic transformation tasks. A composite of these basic tasks can achieve more complex transformations, such as male↔female conversion.

3.1. Iterative DFT

In our system, an N-point DFT of the input signal is computed at the original sampling rate, which is equivalent to passing the signal through a bank of N contiguous band-pass filters. For efficiency, the system uses an iterative method, as described in [4], to compute the DFT, where each new output is obtained through incremental adjustments of the preceding output.
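One standard form of this incremental update is the sliding DFT: each new spectrum equals the previous one, corrected by the sample entering and the sample leaving the window, then rotated by one bin phase step. A sketch of the idea (the exact recurrence in [4] may differ in detail):

```python
import numpy as np

def sliding_dft(x, n_fft):
    """Successive N-point DFTs of a sliding window via incremental update.

    Each step costs O(N) instead of O(N log N): subtract the outgoing
    sample, add the incoming one, and rotate every bin by exp(2*pi*j*k/N)."""
    twiddle = np.exp(2j * np.pi * np.arange(n_fft) / n_fft)
    padded = np.concatenate([np.zeros(n_fft), np.asarray(x, dtype=float)])
    spec = np.zeros(n_fft, dtype=complex)  # DFT of the initial all-zero window
    out = []
    for n in range(n_fft, len(padded)):
        spec = twiddle * (spec + padded[n] - padded[n - n_fft])
        out.append(spec.copy())
    return np.array(out)
```

After the window has filled, each output row matches the direct DFT of the most recent N samples, which is what makes the method usable as a bank of N band-pass filters.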

3.2. Deconvolving the DFT Spectrum

The N DFT coefficients are then converted from rectangular to polar coordinates. The magnitude spectral envelope is estimated by low-pass filtering the magnitude spectrum. The excitation magnitude spectrum is obtained by dividing the magnitude spectrum by the smooth spectrum. The phase spectrum is unwrapped and first-differenced to obtain the instantaneous frequency.

The N-point excitation spectrum (excitation magnitude spectrum and phase spectrum) ranges from 0 to π. As we will soon see, sometimes we need to generate additional samples of the excitation spectrum beyond π. Generating these phantom excitation samples can be done by a cloning method described in [4].

3.3. Formant Modifications

The apparent positions of the formant frequencies can be altered by resampling the smoothed magnitude spectrum, and re-convolving it with the previously flattened excitation spectrum. For example, if we interpolate the spectral envelope by a factor of 1.2 and discard the extra points at the upper end, the formants will be moved up by roughly 20 percent. To move formants down, we need to decimate the spectral envelope. When reconstructing the speech signal, we discard some points at the upper end of the excitation spectrum and phase spectrum.
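The envelope-resampling step can be sketched as follows; linear interpolation and the clamping behavior at the upper end are illustrative assumptions, not the system's exact resampling scheme.

```python
import numpy as np

def shift_formants(envelope, factor):
    """Resample a smoothed magnitude spectrum so that apparent formant
    positions move by `factor` (e.g. 1.2 raises them ~20%), keeping the
    number of frequency samples fixed."""
    n = len(envelope)
    # Read the old envelope at frequency k/factor, so a peak at bin p
    # moves to bin p*factor; for factor > 1 the top of the old envelope
    # is discarded, loosely matching the "discard extra points" step.
    src = np.arange(n) / factor
    return np.interp(src, np.arange(n), envelope)
```

Multiplying the shifted envelope back onto the flattened excitation spectrum then yields the formant-shifted magnitude spectrum.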



Correspondingly, the reconstructed speech will lose some high-frequency energy.

3.4. Pitch Modification

Pitch modification can be done by modifying the first derivatives of the unwrapped phases. To preserve the formant positions, we have to interpolate the spectral samples to compensate for the shifted phase spectrum. Thus, if we want to keep the spectral envelope intact when we change the pitch by a factor of β, the spectral envelope needs to be interpolated by a factor of 1/β to hold the formants in position.

Another issue which needs to be considered is the change in the frequency range that will be caused by pitch modification. If the Nyquist frequency of the input speech is 4000 Hz (which means a sampling rate of 8000 Hz) and the pitch is changed by a factor of β, the frequency range after the pitch modification would be 0 to 4000β Hz. When β > 1, the original 8000 Hz sampling rate is insufficient, and there will be frequency aliasing. A simple solution is to discard the signal above 4000 Hz. However, if β < 1, the reconstructed signal will lose energy in the high frequencies. To make up for the loss, phantom excitation samples can be generated prior to the pitch modification.
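The bookkeeping above amounts to a one-line calculation; a sketch with the paper's 8000 Hz example rate as an assumed default:

```python
def post_pitch_bandwidth(beta, fs=8000.0):
    """Top frequency occupied after scaling all channel frequencies by beta,
    and whether it exceeds the original Nyquist limit (i.e. would alias
    unless the excess is discarded)."""
    top = beta * fs / 2.0
    return top, top > fs / 2.0
```

For β > 1 the second value is True (discard above Nyquist, or resample); for β < 1 it is False, and the emptied band above 4000β Hz is where phantom excitation samples are needed.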

3.5. Temporal Modification / Frequency Compression

Temporal characteristics of the input signal can also be manipulated. To slow down the speech, extra samples need to be generated. The magnitude spectrum of the generated samples can be obtained by interpolating the magnitude spectrum of the input signal in the time dimension. The phase-derivative spectrum is interpolated similarly, and then the phase spectrum is restored using the interpolated phase derivatives. When the signal is reconstructed, there will be extra samples. If the signal is then played at the original sampling rate, the effect is that the speech is slowed down. Similarly, by decimating the magnitude and phase-derivative spectra before reconstruction, we can reduce the number of samples, and consequently speed up the signal. By manipulating the sampling rate, we can convert temporal change to frequency change.
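The time-dimension interpolation of the magnitude and phase-derivative tracks can be sketched as below; the (frames × channels) layout and the use of linear interpolation are assumptions for illustration.

```python
import numpy as np

def stretch_frames(mags, dphis, rate):
    """Time-scale a vocoder analysis by resampling per-channel magnitude
    and phase-derivative tracks along time.

    mags, dphis: arrays of shape (frames, channels).
    rate < 1 generates extra frames (slower speech when played back at
    the original sampling rate); rate > 1 decimates (faster speech)."""
    n_in = mags.shape[0]
    n_out = int(round(n_in / rate))
    t_in = np.arange(n_in)
    t_out = np.linspace(0, n_in - 1, n_out)
    new_mags = np.stack(
        [np.interp(t_out, t_in, mags[:, k]) for k in range(mags.shape[1])], axis=1)
    new_dphis = np.stack(
        [np.interp(t_out, t_in, dphis[:, k]) for k in range(dphis.shape[1])], axis=1)
    new_phases = np.cumsum(new_dphis, axis=0)  # restore phases from derivatives
    return new_mags, new_phases
```

Playing the stretched reconstruction back at a different sampling rate is what converts this temporal change into the frequency compression used for the whale recordings in Section 4.3.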

4. Experimental Results

4.1. Male ↔ Female Conversion

To test the idea of voice conversion, we have conducted two experiments: (1) converting a synthesis corpus of female recordings into a male-like voice, and (2) converting a male Dectalk voice into a female voice. In the first experiment, all of the conversion can be performed in advance, creating an entire corpus with exactly the same temporal characteristics as the original one, but with a substantial change in the speaker characteristics.

The original synthesis corpus [6] was converted by lowering the fundamental frequency by 30% and the spectrum (formants) by 25%; the spectral envelope was thus interpolated by a factor of 0.75/0.7 ≈ 1.07, and an additional thirteen phantom excitation samples were generated. The conversion preserves temporal characteristics, and hence we can reuse all the aligned transcription data from the original corpus. Figure 3 shows that, after conversion, the pitch and formant positions of the speech are similar to those of a corresponding natural male utterance.

Qualitative analysis reveals that male-to-female conversion achieves better quality than female-to-male conversion. We suspect this may be due to the extra complexity of generating the extra phantom excitation samples.

4.2. Language Learning

One of the difficulties in learning a foreign language is in mastering the prosodic aspects of the new language. For example, a native Chinese speaker generally has difficulty with stress and timing when speaking English. On the other hand, it is very hard for a native English speaker to speak Mandarin Chinese with correct tones, especially in complete sentences.

Voice transformation can potentially be used to enhance the experience of learning a foreign language. We can convert the prosody of a student's utterance to better match that expected from a native speaker, and use the modified utterance as a target for improvements. We believe that such a feedback mechanism would be more valuable to the student than an example spoken by a native speaker, because he or she can better perceive the differences and imitate the target speech, without the distractions from less relevant factors such as voice characteristics. The conversion can also be controlled to change only a certain aspect of the speech, to make the distinctions more apparent and easier for the student to follow.

We tested this idea by modeling the F0 contours of Mandarin Chinese phrases spoken by a native English speaker. Mandarin Chinese is a tonal language, in which the syllable F0 contour is essential to the meaning of sounds. There are four lexical tones in the language, each defined by a canonical F0 pattern: high-level, high-rising, low-dipping, and high-falling. Although it is not too difficult for a non-native speaker to learn the tones in isolated syllables, the task becomes much more difficult for continuous speech due to tone coarticulation effects. The tone contours are perturbed in particular ways to form a smooth and coherent sentence F0 contour. The process is very natural to a native speaker, but very hard for non-native speakers to learn, except by mimicking and practicing.

To predict a high-quality tone contour for continuous Mandarin speech, we obtained a complete profile of tones in all contexts from a spontaneous Mandarin speech corpus recorded from native Chinese speakers as examples [7]. The contextual effects are captured by means of context-dependent tone model parameters, which are discrete Legendre coefficients of syllable F0 contours [8]. The conversion is carried out as follows. For each non-native Mandarin utterance, we first obtain its phonetic alignment and context-dependent tone labels. Then a basic target F0 contour is constructed by piecing together the corresponding "standard" tone templates, with any remaining voiced gaps filled by linear interpolation. The new contour is smoothed and scaled to match the average F0 of the speaker. A conversion ratio is thus obtained for each voiced frame of the speech signal, which is then used to modify the F0 of the original waveform. Unvoiced frames are kept unchanged. Figure 4 illustrates this process with an example sentence. We tested this method on a few utterances and obtained good results. The tones in the reconstructed speech are significantly improved, as judged by native Chinese speakers, while the other qualities of the waveforms are largely preserved.
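The Legendre parameterization of a syllable F0 contour can be sketched with numpy's Legendre utilities; the polynomial order and the [-1, 1] time normalization here are assumptions for illustration, not the exact model of [8].

```python
import numpy as np

def tone_legendre_coeffs(f0, order=3):
    """Fit a syllable F0 contour with low-order Legendre polynomials over
    normalized time in [-1, 1]; the coefficients give a compact,
    duration-independent tone-shape parameterization."""
    t = np.linspace(-1, 1, len(f0))
    return np.polynomial.legendre.legfit(t, f0, order)

def contour_from_coeffs(coeffs, n):
    """Reconstruct a smooth F0 contour (n frames) from the coefficients."""
    t = np.linspace(-1, 1, n)
    return np.polynomial.legendre.legval(t, coeffs)
```

Reconstructing a target contour from context-dependent coefficients and dividing it, frame by frame, by the speaker's original F0 yields the per-frame conversion ratios described in the text.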

4.3. Mammalian Vocalization

The voice transformation system can also be used to process non-speech signals. Killer whales produce a wide variety of distinct vocalizations [9], and a significant portion of the content is at frequencies that are well beyond the human hearing range. We have access to a large number of high-quality recordings obtained by Patrick Miller in Puget Sound using a beamforming array [10]. In a pilot study, we tried compressing frequencies by a factor of three, while preserving temporal aspects. The result



Figure 3: Waveform and spectrogram of a female utterance before and after conversion (panels labeled "Original" and "Converted"). Notice that the formants in the converted spectrogram are shifted downward, and the higher frequency region is excited by phantom excitations.

Figure 4: Spectrogram and pitch contour (Hz) of the utterance "shang4 hai3 ne5?" (How about Shanghai?) before and after tone modification. The target pitch contour is generated using context-dependent tone model parameters. Some high frequency energy components are lost when the F0 is lowered, which can be compensated by generating phantom excitations.

is a signal that is much easier to study using standard tools for examining speech, and that also, perhaps more importantly, is more accessible to the human auditory system. A final advantage is that the storage requirements are reduced by a factor of three.

5. Summary

In this paper, we have proposed a voice transformation method based on the phase vocoder. This method demonstrates the ability to transform speech signals to various desired targets. The reconstructed speech has been found to be of high quality.

We have shown that a synthesis corpus can be transformed to generate speech of a different voice quality. In addition, we can adjust the intonation contour after concatenation to improve the prosodic quality. We have also demonstrated that the voice transformation system could help in language learning, particularly with regard to teaching non-native speakers of a tone language to utter the tones correctly. Finally, we have found the system to be useful for transforming killer whale vocalizations down into the human auditory frequency range.

6. Acknowledgements

Patrick Miller of the Woods Hole Oceanographic Institute provided us with example recordings from killer whales living in the Puget Sound area.

7. References

[1] Y. Stylianou, O. Cappe, E. Moulines, "Continuous probabilistic transform for voice conversion", IEEE Trans. Speech and Audio Proc., 6(2):131-142, 1998.

[2] E. Moulines, F. Charpentier, "Pitch synchronous waveform processing techniques for text-to-speech synthesis using diphones", Speech Communication, V.9, pp. 453-467, 1990.

[3] M. P. Pollard, B. M. G. Cheetham, C. C. Goodyear, M. D. Edgington, A. Lowry, "Enhanced shape-invariant pitch and time-scale modification for concatenative speech synthesis", ICASSP 97, V.2, pp. 919-922, Munich, Germany, 1997.

[4] S. Seneff, "Speech Transformation System (Spectrum and/or Excitation) without Pitch Extraction", Technical Report 541, Lincoln Laboratory, MIT.

[5] J. L. Flanagan, R. M. Golden, "Phase Vocoder", Bell System Tech. J., V.45, pp. 1493-1500, 1966.

[6] J. R. W. Yi, J. R. Glass, "Natural-sounding speech synthesis using variable-length units", ICSLP 98, pp. 1167-1170, Sydney, Australia, 1998.

[7] C. Wang, J. Glass, H. Meng, J. Polifroni, S. Seneff, V. Zue, "Yinhe: a Mandarin Chinese Version of the Galaxy system", EUROSPEECH 97, pp. 351-354, Rhodes, Greece, 1997.

[8] C. Wang, S. Seneff, "Improved tone recognition by normalizing for coarticulation and intonation effects", ICSLP 00, V.2, pp. 83-86, Beijing, China, 2000.

[9] J. K. B. Ford, "Acoustic behavior of Resident Killer Whales (Orcinus orca) off Vancouver Island, British Columbia," Canadian Journal of Zoology 67, pp. 727-745, 1989.

[10] P. J. Miller and P. L. Tyack, "A Small Towed Beamforming Array to Identify Vocalizing Resident Killer Whales (Orcinus orca) concurrent with focal behavioral observations," Deep-Sea Research II, V. 45, 1998.
