implementation of a speech analysis-synthesis toolbox using harmonic plus noise model didier cadic...

Post on 14-Jan-2016


Implementation of a Speech Analysis-Synthesis Toolbox using Harmonic plus Noise Model

Didier Cadic (1), engineering student

supervised by

Olivier Cappé (1), Maurice Charbit (1), Gérard Chollet (1), Eric Moulines (1)

(presented here by Guido Aversano (1,2))

(1) Département TSI, ENST, Paris, France

(2) IIASS, Vietri sul Mare (SA), Italy

Plan of the presentation

Text-to-speech: classic methods

HNM model

Analysis

Synthesis

Analysis-Synthesis examples

Conclusions

Text-To-Speech by concatenation

Examples produced on the AT&T web site:

English, male

English, female (vocal server example)

English, female (another vocal server example)

German, male

French, female

Text-To-Speech by concatenation

2 major challenges:

smooth connection between acoustic units

flexible prosody

TD-PSOLA method

Analysis:

Pitch estimation

Pitch-synchronous windowing

Synthesis:

Rearrangement of frames
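
The analysis and synthesis steps above can be sketched in a few lines. This is a minimal illustration, not the code behind the audio examples: it assumes the pitch marks are already known, that the pitch period is constant (`period` samples), and that `rate` below 1 means slow motion. The function name and parameters are hypothetical.

```python
import numpy as np

def psola_time_stretch(x, marks, period, rate):
    """Time-scale x by overlap-adding two-period grains (rate < 1 slows down)."""
    n_out = int(len(x) / rate)
    y = np.zeros(n_out)
    # Periodic Hann window over two periods: copies shifted by one period sum to 1.
    w = 0.5 * (1.0 - np.cos(2.0 * np.pi * np.arange(2 * period) / (2 * period)))
    for ts in range(period, n_out - period, period):      # synthesis pitch marks
        # Rearrangement of frames: reuse the analysis grain nearest in original time.
        m = marks[np.argmin(np.abs(marks - ts * rate))]
        if m - period < 0 or m + period > len(x):
            continue                                       # grain would fall off the signal
        y[ts - period:ts + period] += w * x[m - period:m + period]
    return y
```

With `rate = 0.5` every analysis grain is used about twice, so the duration doubles while the spacing between grains (and hence the pitch) stays the same. A real implementation would follow a time-varying pitch period.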

TD-PSOLA method

Some very good-quality results:

Singing, original

Singing, modified

Time-scaling

Cello, original

Cello, modified

Pitch-shifting

TD-PSOLA method

Artifacts appearing in non-voiced sounds:

"rain", original

"rain", 0.5 rate

"ss", original

"ss", slowed down (classic method)

"ss", slowed down (improved)

Phase Vocoder method

Intuitive description:

Compression/stretching of the (narrow-band) spectrogram's time-frequency scales…

time-scaling

pitch-shifting

Phase Vocoder method

Examples:

"rain", male voice

Slow-motion by Vocoder (PSOLA: [audio example])

"The quick fox …", female voice

Slow-motion by Vocoder

Main problem: phase coherence is lost in the synthesized signal

TD-PSOLA and Vocoder allow basic prosodic modifications.

The problem of unit concatenation for TTS is not solved.

Other kinds of modifications (timbre, denoising, …) should be considered.

We need a parametric model

Harmonic plus Noise Model (HNM)

Main assumption:

stationary segments of a speech signal can always be seen as the superposition of a periodic part and a noisy part

HNM Model

Modelling:

$s(t) = H(t) + B(t)$

where: $H(t) = \sum_k A_k \cos(2\pi k f_0 t + \varphi_k)$

and $B(t)$ = white noise passed through an AR filter
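
A minimal sketch of the model for one stationary frame: a sum of harmonics of $f_0$ plus Gaussian white noise passed through an all-pole (AR) filter. The function name, the argument layout and the normalization of the AR filter to a leading coefficient of 1 (so `ar_coeffs` holds only the remaining coefficients) are assumptions made for illustration.

```python
import numpy as np

def synth_hnm_frame(f0, amps, phases, ar_coeffs, noise_gain, fs, n, rng):
    """Return (harmonic, noise) parts of one stationary frame of n samples."""
    t = np.arange(n) / fs
    k = np.arange(1, len(amps) + 1)[:, None]           # harmonic indices 1..K
    # H(t) = sum_k A_k cos(2 pi k f0 t + phi_k)
    harmonic = np.sum(amps[:, None] * np.cos(2 * np.pi * k * f0 * t + phases[:, None]), axis=0)

    # B(t): white noise through 1 / (1 + c_1 z^-1 + ... + c_N z^-N)
    excitation = noise_gain * rng.standard_normal(n)
    noise = np.zeros(n)
    for i in range(n):
        acc = excitation[i]
        for j, c in enumerate(ar_coeffs, start=1):
            if i - j >= 0:
                acc -= c * noise[i - j]
        noise[i] = acc
    return harmonic, noise
```

The frame itself is `harmonic + noise`; keeping the two parts separate makes it possible to listen to each one on its own.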

HNM analysis of a frame

1. Pitch estimation

Spectral comb method
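
The slides do not detail the comb method, so the following is only one common form of it: for each candidate pitch, add up the spectral magnitude found at its first few multiples and keep the best-scoring candidate. The search range, the 1 Hz grid, the number of comb teeth and the FFT size are all illustrative assumptions.

```python
import numpy as np

def comb_pitch(frame, fs, f_min=80.0, f_max=400.0, n_teeth=5, nfft=8192):
    """Estimate f0 (Hz) as the candidate whose harmonic comb collects the most magnitude."""
    windowed = frame * np.hanning(len(frame))
    spectrum = np.abs(np.fft.rfft(windowed, nfft))      # zero-padded magnitude spectrum
    candidates = np.arange(f_min, f_max + 1.0, 1.0)     # 1 Hz search grid
    teeth = np.arange(1, n_teeth + 1)
    scores = []
    for f in candidates:
        bins = np.rint(teeth * f * nfft / fs).astype(int)   # nearest bin of each tooth
        bins = bins[bins < len(spectrum)]
        scores.append(spectrum[bins].sum())
    return candidates[int(np.argmax(scores))]
```

Note that a comb placed at half the true pitch also lands on every true harmonic, which is one plausible reason why this family of estimators can return $f_0/2$.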

HNM analysis of a frame

1. Pitch estimation

Good results are obtained

In some cases the method erroneously returns $f_0/2$

Possibility of tracking…

"aka…aga"

HNM analysis of a frame

2. Harmonic part: extraction of amplitudes

Least squares method

$H(t) = \sum_k \left[ a_k \cos(2\pi k f_0 t) + b_k \sin(2\pi k f_0 t) \right]$

$\min_{a_k, b_k} \| s(t) - H(t) \|^2$
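
Because $H(t)$ is linear in $a_k$ and $b_k$, the minimization is an ordinary linear least-squares problem. A minimal sketch, assuming the pitch `f0` of the frame is already known and the number of harmonics `n_harm` is chosen by the caller (both names are illustrative):

```python
import numpy as np

def harmonic_least_squares(s, f0, n_harm, fs):
    """Fit a_k, b_k of the harmonic part to frame s; return (a, b, harmonic)."""
    t = np.arange(len(s)) / fs
    k = np.arange(1, n_harm + 1)
    arg = 2 * np.pi * np.outer(t, k) * f0               # shape (n_samples, n_harm)
    basis = np.hstack([np.cos(arg), np.sin(arg)])       # columns: cos terms, then sin terms
    coeffs, *_ = np.linalg.lstsq(basis, s, rcond=None)
    a, b = coeffs[:n_harm], coeffs[n_harm:]
    return a, b, basis @ coeffs

def to_amplitude_phase(a, b):
    """Convert a_k cos(x) + b_k sin(x) to A_k cos(x + phi_k)."""
    return np.hypot(a, b), np.arctan2(-b, a)
```

Subtracting the returned harmonic part from the frame gives what is left for the noisy part.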

HNM analysis of a frame

2. Extraction of amplitudes

Problem: the noisy part makes a non-zero contribution to the spectral power

Gain correction for the harmonics (using a heuristic formula $g(DV)$, where $DV$ is the estimated voicing degree)

HNM analysis of a frame

2. Extraction of amplitudes

Residual: $R(t) = s(t) - H(t)$

HNM analysis of a frame

2. Extraction of amplitudes

Possibility of improving harmonic estimation

HNM analysis of a frame

3. AR filter estimation for the residual:

Linear prediction method

$R(t) = (B_g \ast F)(t)$

where $B_g$ = Gaussian white noise

and $F(t)$ = AR filter, $F(z) = \dfrac{1}{a_0 + a_1 z^{-1} + \dots + a_N z^{-N}}$
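
One standard way to carry out linear prediction is the autocorrelation (Yule-Walker) method, sketched below. Whether the toolbox uses this variant is not stated in the slides. The sketch normalizes the filter so that its leading coefficient is 1 and returns the excitation level separately as `gain`; that convention and the function name are assumptions.

```python
import numpy as np

def lpc_autocorrelation(residual, order):
    """Return (a, gain) for the all-pole model gain / (1 + a[1] z^-1 + ... + a[order] z^-order)."""
    n = len(residual)
    # Biased autocorrelation estimates for lags 0..order
    acf = np.array([np.dot(residual[: n - lag], residual[lag:]) / n for lag in range(order + 1)])
    # Yule-Walker equations: Toeplitz(acf[0..order-1]) @ a[1:] = -acf[1:]
    toeplitz = np.array([[acf[abs(i - j)] for j in range(order)] for i in range(order)])
    a_tail = np.linalg.solve(toeplitz, -acf[1:])
    a = np.concatenate(([1.0], a_tail))
    gain = np.sqrt(np.dot(a, acf))      # standard deviation of the white excitation
    return a, gain
```

For long frames a Levinson-Durbin recursion solves the same equations more cheaply; the direct solve is kept here for readability.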

HNM Synthesis

Interpolation for each harmonic between two successive frames

$H(t) = \sum_k \left[ a_k(t) \cos(2\pi k f_0(t)\, t) + b_k(t) \sin(2\pi k f_0(t)\, t) \right] = \sum_k A_k(t) \cos \varphi_k(t)$

$A_k(t_a)$ and $\varphi_k(t_a)$ are known at analysis instants $t_a$

$\varphi_k'(t_a) = 2\pi k f_0(t_a)$ is known by pitch analysis.

HNM Synthesis

Erroneous pitch (usually $f_0/2$)

The harmonic correspondence problem is solved by introducing fictitious harmonics
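
The slides do not say how the fictitious harmonics are built. One plausible reading, sketched here purely as an assumption: when a neighbouring frame was analysed at $f_0/2$, the frame analysed at $f_0$ is re-expressed on the $f_0/2$ grid by inserting zero-amplitude harmonics at the odd positions, so that harmonic $k$ of one frame lines up with harmonic $2k$ of the other.

```python
import numpy as np

def to_half_pitch_grid(f0, amps):
    """Re-express harmonics of f0 on the grid of f0/2 with zero-amplitude odd harmonics."""
    doubled = np.zeros(2 * len(amps))
    doubled[1::2] = amps            # harmonic k of f0 becomes harmonic 2k of f0/2
    return f0 / 2.0, doubled
```

After this step both frames have the same harmonic indexing, and each harmonic can be interpolated with its true counterpart.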

HNM Synthesis

$A_k(t) \cos \varphi_k(t)$

Amplitudes $A_k$: linear interpolation

Phases $\varphi_k$: unwrapping + cubic interpolation
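
A sketch of how one harmonic could be interpolated between two frames, using the cubic phase interpolation familiar from sinusoidal modelling (McAulay and Quatieri). That the toolbox uses exactly these formulas is an assumption. Phases are in radians, frequencies `w0`, `w1` in rad/s, and `T` is the frame spacing in seconds.

```python
import numpy as np

def cubic_phase_coeffs(phi0, w0, phi1, w1, T):
    """Coefficients of phi(t) = phi0 + w0 t + alpha t^2 + beta t^3 on [0, T]."""
    # Unwrapping: pick the multiple of 2*pi that gives the smoothest phase track.
    M = np.rint(((phi0 + w0 * T - phi1) + (w1 - w0) * T / 2.0) / (2.0 * np.pi))
    d = phi1 + 2.0 * np.pi * M - phi0 - w0 * T
    alpha = 3.0 * d / T**2 - (w1 - w0) / T
    beta = -2.0 * d / T**3 + (w1 - w0) / T**2
    return alpha, beta, M

def interpolate_track(A0, A1, phi0, w0, phi1, w1, T, n):
    """Return (amplitude, phase) of one harmonic at n samples between two frames."""
    t = np.linspace(0.0, T, n, endpoint=False)
    alpha, beta, _ = cubic_phase_coeffs(phi0, w0, phi1, w1, T)
    amp = A0 + (A1 - A0) * t / T                          # linear amplitude interpolation
    phase = phi0 + w0 * t + alpha * t**2 + beta * t**3    # cubic phase interpolation
    return amp, phase
```

The harmonic's contribution to the output is `amp * np.cos(phase)`. The cubic matches both phase and frequency at the two frame boundaries, which avoids discontinuities where frames join.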

HNM Synthesis

Noisy part

Generation of normally distributed random numbers

AR filtering (abrupt changes of coefficients between two windows have no noticeable effect…)

HNM Synthesis

Results

"Carottes": synthesized, original

"Lawyer": synthesized, original

Tuba: synthesized, original

"wazi": synthesized, original

a-e-i-o-u: synthesized, original

singing: synthesized, original

HNM Synthesis

Results

Discours: synthesized, original

"aka aga": synthesized, original

Dussolier: synthesized, original

Andie: synthesized, original, noisy part

"coiffe": synthesized, original

Synthesis with time-stretching

Synthesis instants ($t_s$), analysis instants ($t_a$)

The following parameters remain unchanged:

Noisy part parameters

The pitch

The amplitudes $A_k$ of the harmonics
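
A small sketch of the bookkeeping this implies. It assumes that the synthesis instants are the analysis instants divided by the rate (so a rate below 1 means slow motion) and that the phase of the fundamental is rebuilt by integrating the unchanged pitch over the stretched time axis. Both are illustrative assumptions, not necessarily the phase treatment used in the toolbox.

```python
import numpy as np

def stretch_instants(t_a, f0, rate):
    """Map analysis instants to synthesis instants; return (t_s, fundamental phase at t_s)."""
    t_s = t_a[0] + (t_a - t_a[0]) / rate                 # rate < 1 spreads the frames apart
    # Phase advance between consecutive synthesis instants (trapezoidal rule on the pitch)
    dphi = 2 * np.pi * 0.5 * (f0[1:] + f0[:-1]) * np.diff(t_s)
    phi = np.concatenate(([0.0], np.cumsum(dphi)))
    return t_s, phi
```

Pitch, harmonic amplitudes and noise parameters are simply carried over from each analysis instant to the matching synthesis instant; only the phases need to be adapted to the new frame spacing.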

Synthesis with time-stretching

Phase adaptation

Simple phase trajectories resampling

or

"harmonic" rephasing

a-e-i-o-u: original

slow-motion with phase "stretching"

slow-motion with "harmonic" rephasing

Final results

Original, then synthesized with rate: 0.4, 0.5, 0.6, 0.7, 0.8, 1, 1.2, 1.5, 2

Examples: "carottes", "lawyer", tuba, "wazi", singing, "a-e-i-o-u", Dussolier, Discours, Andie, "aka aga", "coiffe"

Conclusions

Good results, showing the method's potential for different applications, including TTS

Future work will include other kinds of modifications (pitch shifting, timbre, etc.)
