ph.d defence (shinnosuke takamichi)

2015©Shinnosuke TAKAMICHI

12/22/2015

Acoustic modeling and speech parameter generation

for high-quality statistical parametric speech synthesis （高音質な統計的パラメトリック音声合成のための

音響モデリング法と音声パラメータ生成法）

Nara Institute of Science and Technology

Shinnosuke Takamichi

Ph.D defense

/46

Research target

2

Speech

/46

Speech synthesis and its benefits

Speech synthesis: a method to synthesize speech by a computer

– Text-To-Speech (TTS) [Sagisaka et al., 1988.]

– Voice Conversion (VC) [Stylianou et al., 1988.]

What is required?

– Flexible control of voice beyond ability of one human

– High-quality speech generation like human

3

Text TTS

VC

/46

Statistical parametric speech synthesis

Statistical parametric speech synthesis [Zen et al., 2009.]

– Statistical modeling of relationship between input/output

– Better flexibility than unit selection synthesis [Iwahashi et al., 1993.]

HMM-based TTS & GMM-based VC* [Tokuda et al., 2013.] [Toda et

al., 2007.]

– Mathematical support of the flexibility

– Application from other research areas

– But…

4 *HMM: Hidden Markov Model, GMM: Gaussian Mixture Model

/46

Natural speech vs. synthetic speech

in speech quality

5

Natural speech

spoken by human

Synthetic speech

of HMM-based TTS & GMM-based VC

Why?

/46

Problem definition and rest of this talk

6

Text

Parameteri-

zation error

Insufficient

modeling

Over-

smoothing

Parameteri-

zation error

Text

analysis

Speech

analysis

Speech

parameter

generation

Waveform

synthesis

Acoustic

Modeling

Approaches

in this thesis

Modeling of individual

speech segment

Chapter 3

Modulation spectrum

for over-smoothing

Chapter 4 Chapter 5

Chapter 2

/46

Speech synthesis

Analysis Generation Synthesis Modeling


speech segment

Modulation spectrum

for over-smoothing

Chapter 4 Chapter 5

Text

Chapter 2

Chapter 3

/46

2 approaches to speech synthesis

Unit selection synthesis [Iwahashi et al., 1993.]

– High quality but low flexibility

8

Pre-recorded speech database Synthetic speech

Segment selection

Text Text

analysis

Speech

analysis

Param.

Gen.

Wave.

synthesis

Acoustic

modeling

Statistical parametric speech synthesis [Zen et al., 2009.]

– High flexibility but low quality

/46

Text/speech analysis

and waveform synthesis

Text analysis (e.g., [Sagisaka et al., 1990.])

Speech analysis (e.g., [Kawahara et al., 1999.])

9 ana gen syn model

j i

あらゆる現実を・・・ Sentence

Accent phrase

a ts r a y u r u g e n u o Phoneme

Low High

Pow

er

Frequency

Fourier transform

& pow.

Envelope = spectral parameters

Periodicity in detail = Pitch (F0)

/46

Acoustic modeling in HMM-based TTS

10

𝝀 = argmax 𝑃 𝒀|𝑿, 𝝀

ML training of HMM parameter sets 𝝀

ana gen syn model

“Hello”

Speech

analysis

Text

analysis

Context labels

𝒀 Speech features

Time

sil-h+e h-e+l e-l+o

HMM 𝝀

Context-tied

Gaussian distribution

𝑁 ⋅; 𝝁, 𝚺

e-l+o a-l+o

o-l+o

𝑿

[Zen et al., 2007.]

/46

Acoustic modeling in GMM-based VC

ML training of GMM parameter sets 𝝀

11

𝝀 = argmax 𝑃 𝒀𝑡 , 𝑿𝑡|𝝀

𝑿 Speech features

𝒀 Speech features

Speech

analysis

Speech

analysis

ana gen syn model

GMM 𝝀

𝑿𝑡

𝒀𝑡

𝑁 ⋅; 𝝁, 𝚺 𝑿𝑡

𝒀𝑡

Joint vector

at time 𝑡

[Stylianou et al., 1988.]

/46

Probability to generate features

in HMM-based TTS

12

Text

analysis

HMM parameter sets 𝝀

“Hello”

𝑿

ana gen syn model

Probability to generate the synthetic speech features 𝒀

𝑃 𝒀|𝑿, 𝒒 , 𝝀 = 𝑁 𝒀;𝑬𝒒 , 𝑫𝒒

“h”

“o”

𝝁1

𝝁2

𝝁𝑇

𝝁𝑡

𝑬𝒒

𝜮1−1

𝜮2−1

𝜮𝑇−1

𝜮𝑡−1

𝑫𝒒 −1

Mean vector Covariance matrix

𝒒

[Tokuda et al., 2000.]

/46

Probability to generate features

in GMM-based VC

13

Probability to generate the synthetic speech features 𝒀

𝑃 𝒀|𝑿, 𝒒 , 𝝀 = 𝑁 𝒀;𝑬𝒒 , 𝑫𝒒

Speech

analysis

GMM parameter sets 𝝀

𝑿

𝝁1

𝝁2

𝝁𝑇

𝝁𝑡

𝑬𝒒

𝜮1−1

𝜮2−1

𝜮𝑇−1

𝜮𝑡−1

𝑫𝒒 −1

Mean vector Covariance matrix

𝒒

[Toda et al., 2007.]

ana gen syn model

/46

Speech parameter generation

ML generation of synthetic speech parameters 𝒚 𝒒

– Computationally-efficient generation (solved in a closed form)

14

𝒚 𝒒 = argmax 𝑃 𝒀|𝑿, 𝒒 , 𝝀 = argmax 𝑃 𝒚, Δ𝒚|𝑿, 𝒒 , 𝝀

Time

Sta

tic 𝒚

Tem

pora

l

de

lta

Δ𝒚

𝒚 𝒒

Δ𝒚 𝒒

Mean and variance


ana gen syn model

/46

Statistical sample-based speech synthesis



speech segment

Modulation spectrum

for over-smoothing

Chapter 3 Chapter 4 Chapter 5

Text

Chapter 2

/46

Quality degradation

by acoustic modeling

16

Averaging across input features

Context-tied Gaussian in HMM-based TTS

→ Robust to the unseen context

→ Loses info. of individual speech parameters.

𝑁 ⋅; 𝝁, 𝚺

e-l+o a-l+o

o-l+o

Proposed approach

– Models individual speech parameters while keeping robustness.

– Select one model in parameter generation.

– → Able to alleviate the quality degradation caused by averaging

/46

Acoustic modeling

of the proposed method

17

From the tied model to Rich context-GMM (R-GMM)

𝑁 ⋅; 𝝁, 𝚺

e-l+o a-l+o

o-l+o

Rich context models [Yan et al., 2009.] Less-averaged models having robustness

R-GMM Model that is formed as the same as the

conventional tied model

Update the mean while tying the covariance.

Gathers them with the same mixture weights.

/46


from R-GMMs

18

ML generation of synthetic speech parameters 𝒚 𝒒

– Iterative generation with the explicit model selection*

𝒚 𝒒 = argmax 𝑃 𝒚, Δ𝒚|𝒎 ,𝑿 𝑃 𝒎 |𝒚, Δ𝒚, 𝑿

Mean

± variance

Time Time

Sta

tic fe

atu

re 𝒚

𝒎

Tied model R-GMM

∗ 𝝀 (HMM/GMM parameter sets) is omitted.

/46

Discussion

Initialization of the parameter generation (Sec. 3.5)

– Uses speech parameters from the over-trained statistics.

– → Avoids averaging by initialization, and alleviating over-training by

parameter generation.

Comparison to unit selection synthesis (Sec. 2.2)

– The model selection corresponds to the waveform segment selection.

– → Integrates unit selection in the statistical modeling.

Comparison to conventional hybrid methods

– Able to apply voice controlling methods, e.g., [Yamagishi et al., 2007.].

– → Better flexibility than [Yan et al., 2009.][Ling et al., 2007.] (Sec. 2.8)

19

/46

Subjective evaluation

(preference test on speech quality)

20

Pre

fere

nce s

core

0.0

0.2

0.4

0.6

0.8

1.0 HMM-based TTS

Spectrum H H R R T

F0 H R H R T

H/G: HMM/GMM (= tied model), R: R-GMM, T: Target (``R’’ using reference)

95% conf. interval

GMM-based VC

Pre

fere

nce

sco

re

0.0

0.2

0.4

0.6

0.8

1.0

G R T

/46

Modulation spectrum-based post-filter



speech segment

Modulation spectrum

for over-smoothing

Chapter 4 Chapter 5

Text

Chapter 2

Chapter 3

/46

Over-smoothing

in parameter generation

22

Time Natural speech parameters

Time

Synthetic speech parameters

Speech

parameter

generation

Acoustic

modeling

/46

Revisits speech parameter generation (Sec. 2.6)

ML generation of synthetic speech parameters 𝒚 𝒒 *

23

Time

Sp

ectr

al p

ara

me

ter

Natural

𝒚 𝒒 = argmax 𝑃 𝒚, Δ𝒚|𝑿

𝑿: input features

𝝀 (HMM/GMM parameter sets) is omitted.


HMM

/46

Global Variance (GV) and

parameter generation w/ GV

ML generation with GV constraint

24

Time

Natural

HMM HMM+GV

Sp

ectr

al p

ara

me

ter

𝒗(𝒚)

𝒚 𝒒 = argmax 𝑃 𝒚, Δ𝒚|𝑿 𝑃 𝒗 𝒚𝜔

𝒗 𝒚 : GV (= 2nd moment), 𝜔: weight of the GV term

[Toda et al., 2007.]

Something is still different between them...

→ What is it?

/46

Modulation Spectrum (MS) definition

MS: power spectrum of the sequence

– Represents temporal fluctuation. [Atlas et al., 2003.]

– Segment features in speech recognition [Thomas et al., 2009.]

– Captures speech intelligibility. [Drullman et al., 1994.]

26

2nd

moment

DFT

& pow.

GV (scalar)

MS (vector)

Time

Sp

ee

ch

pa

ram

ete

r

DFT: Discrete Fourier Transform

/46

HMM

natural

Modulation frequency

Mo

du

latio

n s

pe

ctr

um

Speech quality will be improved by filling this gap!

Example of the MS

HMM+GV

27

/46

Post-filtering process

28

Training data

Speech param. MS Statistics

(Gaussian)

HMMs

Filtering

Training

Synthesis

Post-filtering in the MS domain

– Linear conversion (interpolation) using 2 Gaussian distributions

/46

Filtered speech parameter sequence

29

Time

HMM

HMM+GV

natural

HMM → post-filter

Sp

ectr

al p

ara

me

ter

Generate fluctuating speech parameters by the post-filtering!

/46

Discussion 1: What is the MS?

30

Speech

parameter GV (temporal power)

Freq. 1

Freq. 2

Freq. 𝐷s

+

+ …

=

MS (frequency power)

Sum of MSs = GV

Fourier transform

Time

/46

Discussion 2

Why post-filter?

– Independent on the original speech synthesis process

– → High portability and high quality

Further application

– For spectrum, F0 (non-continuous), duration (unactual param.)

– Segment-level filter (faster process)

Advantages compared to the conventional post-filters

– Automatic design/tuning [Eyben et al., 2014.][Yoshimura et al., 1999.]

31

/46



32

Pre

fere

nce

sco

re

0.0

0.2

0.4

0.6

0.8

1.0

Pre

fere

nce

sco

re

0.0

0.2

0.4

0.6

0.8

1.0

Spectrum

in HMM-based TTS

Spectrum

in GMM-based VC

HMM GMM

+GV

HMM

+GV

post-filtering

/46

Speech synthesis integrating modulation spectrum



speech segment

Modulation spectrum

for over-smoothing

Chapter 5

Text

Chapter 2

Chapter 3 Chapter 4

/46

Problems of the MS-based post-filter

MS-based post-filter

– External process for MS emphasis

– → Causes over-emphasis ignoring speech synthesis criteria.

– → Difficult to utilize flexibility that HMM/GMMs have

34

Approaches: Joint optimization using HMM/GMMs and MS

– Integrate MS statistics as the one of the acoustic models.

– Speech parameter generation with MS … high-quality

– Acoustic model training with MS … high-quality and fast

/46


considering the MS

35

ML generation with MS constraint

𝒚 𝒒 = argmax 𝑃 𝒚, Δ𝒚|𝑿 𝑃 𝒔 𝒚𝜔

𝒔 𝒚 : MS (= power spectrum), 𝜔: weight of the MS term

𝑬𝒒 𝑫𝒒

𝑃 𝒚, Δ𝒚|𝑿 = 𝑁 𝒚, Δ𝒚 ; 𝑬𝒒 , 𝑫𝒒 𝑃 𝒔 𝒚 = 𝑁 𝒔 𝒚 ; 𝝁s, 𝚺𝐬

Modulation freq. M

S

Natural

Quadratic function of 𝒚

/46

Discussion

(comparison to MS-based post-filter)

Initialization

– Basic ML generation (``HMM’’) → MS-based post-filter

– → Part optimization by initialization, and joint optimization by iteration

36

HMM → post-filter

HMM

Time

Spectr

al para

mete

r

HMM+MS

/46

Effect in the MS

37

HMM

HMM+GV

natural HMM+MS

Modulation frequency

Mo

du

latio

n s

pe

ctr

um

Fills the gap by the proposed generation algorithm!

/46

Effect in the GV

38

Index of speech parameters

Log G

V

HMM HMM+GV

natural

Recovers the GV w/o considering the GV!

HMM+MS

/46



39

Pre

fere

nce

sco

re

0.0

0.2

0.4

0.6

0.8

1.0

Pre

fere

nce

sco

re

0.0

0.2

0.4

0.6

0.8

1.0 GMM-based VC

HMM

+MS

HMM

+GV

GMM

+MS

GMM

+GV

HMM-based TTS

*+GV Parameter generation w/ GV (Sec. 2.9)

*+MS Parameter generation w/ MS

/46

Problems of parameter generation

and MS-constrained training

40

Speech parameter generation considering the MS

– Iterative process in synthesis

– → Computationally-inefficient speech synthesis

𝝀 = argmax 𝑃 𝒚|𝑿 𝑃 𝒔 𝒚𝜔

𝑃 𝒚|𝑿 = 𝑁 𝒚; 𝒚 𝒒 , 𝜮 : Trajectory likelihood (Sec. 2.8)

𝑃 𝒔 𝒚 = 𝑁 𝒔 𝒚 ; 𝒔 𝒚 𝒒 , 𝜮𝐬 : MS likelihood

Minimizes difference between 𝒚 and 𝒚 𝒒 .

Minimizes difference between 𝒔 𝒚 and 𝒔 𝒚 𝒒 .

Acoustic model training constrained with MS

– Train HMMs/GMMs 𝝀 to generate param. 𝒚 𝒒 having natural MS

/46

Trained HMM parameters

41

Basic training (Sec. 2.4-5)

Trajectory training (Sec. 2.8)

Time

De

lta

fe

atu

re

Updates HMM/GMM param. to generate fluctuating param.!

MS-constrained training

/46

Discussion

Computational efficiency in parameter generation

– Basic generation algorithm (Sec. 2.6) can be used without MS.

– → Not only high-quality but also computationally-efficient

Which is better in quality, proposed param. gen. or training?

– Structures of HMMs/GMMs have limitation for recovering MS.

– → The parameter generation considering the MS is better.

42

Portability Quality Computation time

Post-filter Best! (no depend-ency on models)

Better Better (120 ms)

Param. gen. Better Best!(optimization in synthesis)

Worse (1 min~)

Training Worse Better Best! (5 ms)

/46



43

Pre

fere

nce

sco

re

0.0

0.2

0.4

0.6

0.8

1.0

Pre

fere

nce

sco

re 0.0

0.2

0.4

0.6

0.8

1.0 HMM-based TTS GMM-based VC

HMM TRJ GV MS-

TRJ

GMM TRJ GV MS-

TRJ

HMM/GMM Basic HMM/GMM training (Sec. 2.4-5)

TRJ Trajectory HMM training (Sec. 2.8)

GV GV-constrained training (Sec. 2.9)

MS-TRJ MS-constrained trajectory training

/46

Conclusion

/46

Conclusion

Problem in this thesis

– Quality degradation in synthetic speech, which is caused by

parameterization error, insufficient modeling, and over-smoothing

Chapter 3: statistical parametric speech synthesis

– Addresses the insufficiency in the acoustic modeling.

– Models the individual speech parameter with rich context models.

Chapter 4 & 5: approaches using Modulation Spectrum (MS)

– Addresses the over-smoothing in the parameter generation.

– 1. MS-based post-filter: high portability

– 2. Parameter generation w/ MS: highest quality

– 3. MS-constrained training: computationally-efficient generation

45

/46

Future work

Improvements of rich context modeling

– Quality degradation even if the best models are selected. (Sec. A.5)

Theoretical analysis of MS

– Why is the speech quality improved by the MS?

MS for DNN-based speech synthesis

– More flexible structures to integrate the MS

GPU implementation of the proposed methods

– Rich-context-model selection & param. generation with the MS

46