ph.d defence (shinnosuke takamichi)
TRANSCRIPT
2015©Shinnosuke TAKAMICHI
12/22/2015
Acoustic modeling and speech parameter generation
for high-quality statistical parametric speech synthesis (高音質な統計的パラメトリック音声合成のための
音響モデリング法と音声パラメータ生成法)
Nara Institute of Science and Technology
Shinnosuke Takamichi
Ph.D defense
/46
Research target
2
Speech
/46
Speech synthesis and its benefits
Speech synthesis: a method to synthesize speech by a computer
– Text-To-Speech (TTS) [Sagisaka et al., 1988.]
– Voice Conversion (VC) [Stylianou et al., 1988.]
What is required?
– Flexible control of voice beyond ability of one human
– High-quality speech generation like human
3
Text TTS
VC
/46
Statistical parametric speech synthesis
Statistical parametric speech synthesis [Zen et al., 2009.]
– Statistical modeling of relationship between input/output
– Better flexibility than unit selection synthesis [Iwahashi et al., 1993.]
HMM-based TTS & GMM-based VC* [Tokuda et al., 2013.] [Toda et
al., 2007.]
– Mathematical support of the flexibility
– Application from other research areas
– But…
4 *HMM: Hidden Markov Model, GMM: Gaussian Mixture Model
/46
Natural speech vs. synthetic speech
in speech quality
5
Natural speech
spoken by human
Synthetic speech
of HMM-based TTS & GMM-based VC
Why?
/46
Problem definition and rest of this talk
6
Text
Parameteri-
zation error
Insufficient
modeling
Over-
smoothing
Parameteri-
zation error
Text
analysis
Speech
analysis
Speech
parameter
generation
Waveform
synthesis
Acoustic
Modeling
Approaches
in this thesis
Modeling of individual
speech segment
Chapter 3
Modulation spectrum
for over-smoothing
Chapter 4 Chapter 5
Chapter 2
/46
Speech synthesis
Analysis Generation Synthesis Modeling
Modeling of individual
speech segment
Modulation spectrum
for over-smoothing
Chapter 4 Chapter 5
Text
Chapter 2
Chapter 3
/46
2 approaches to speech synthesis
Unit selection synthesis [Iwahashi et al., 1993.]
– High quality but low flexibility
8
Pre-recorded speech database Synthetic speech
Segment selection
Text Text
analysis
Speech
analysis
Param.
Gen.
Wave.
synthesis
Acoustic
modeling
Statistical parametric speech synthesis [Zen et al., 2009.]
– High flexibility but low quality
/46
Text/speech analysis
and waveform synthesis
Text analysis (e.g., [Sagisaka et al., 1990.])
Speech analysis (e.g., [Kawahara et al., 1999.])
9 ana gen syn model
j i
あ ら ゆ る 現 実 を・・・ Sentence
Accent phrase
a ts r a y u r u g e n u o Phoneme
Low High
Pow
er
Frequency
Fourier transform
& pow.
Envelope = spectral parameters
Periodicity in detail = Pitch (F0)
/46
Acoustic modeling in HMM-based TTS
10
𝝀 = argmax 𝑃 𝒀|𝑿, 𝝀
ML training of HMM parameter sets 𝝀
ana gen syn model
“Hello”
Speech
analysis
Text
analysis
Context labels
𝒀 Speech features
Time
sil-h+e h-e+l e-l+o
HMM 𝝀
Context-tied
Gaussian distribution
𝑁 ⋅; 𝝁, 𝚺
e-l+o a-l+o
o-l+o
𝑿
[Zen et al., 2007.]
/46
Acoustic modeling in GMM-based VC
ML training of GMM parameter sets 𝝀
11
𝝀 = argmax 𝑃 𝒀𝑡 , 𝑿𝑡|𝝀
𝑿 Speech features
𝒀 Speech features
Speech
analysis
Speech
analysis
ana gen syn model
GMM 𝝀
𝑿𝑡
𝒀𝑡
𝑁 ⋅; 𝝁, 𝚺 𝑿𝑡
𝒀𝑡
Joint vector
at time 𝑡
[Stylianou et al., 1988.]
/46
Probability to generate features
in HMM-based TTS
12
Text
analysis
HMM parameter sets 𝝀
“Hello”
𝑿
ana gen syn model
Probability to generate the synthetic speech features 𝒀
𝑃 𝒀|𝑿, 𝒒 , 𝝀 = 𝑁 𝒀;𝑬𝒒 , 𝑫𝒒
“h”
“o”
𝝁1
𝝁2
𝝁𝑇
𝝁𝑡
𝑬𝒒
𝜮1−1
𝜮2−1
𝜮𝑇−1
𝜮𝑡−1
𝑫𝒒 −1
Mean vector Covariance matrix
𝒒
[Tokuda et al., 2000.]
/46
Probability to generate features
in GMM-based VC
13
Probability to generate the synthetic speech features 𝒀
𝑃 𝒀|𝑿, 𝒒 , 𝝀 = 𝑁 𝒀;𝑬𝒒 , 𝑫𝒒
Speech
analysis
GMM parameter sets 𝝀
𝑿
𝝁1
𝝁2
𝝁𝑇
𝝁𝑡
𝑬𝒒
𝜮1−1
𝜮2−1
𝜮𝑇−1
𝜮𝑡−1
𝑫𝒒 −1
Mean vector Covariance matrix
𝒒
[Toda et al., 2007.]
ana gen syn model
/46
Speech parameter generation
ML generation of synthetic speech parameters 𝒚 𝒒
– Computationally-efficient generation (solved in a closed form)
14
𝒚 𝒒 = argmax 𝑃 𝒀|𝑿, 𝒒 , 𝝀 = argmax 𝑃 𝒚, Δ𝒚|𝑿, 𝒒 , 𝝀
Time
Sta
tic 𝒚
Tem
pora
l
de
lta
Δ𝒚
𝒚 𝒒
Δ𝒚 𝒒
Mean and variance
[Tokuda et al., 2000.]
ana gen syn model
/46
Statistical sample-based speech synthesis
Analysis Generation Synthesis Modeling
Modeling of individual
speech segment
Modulation spectrum
for over-smoothing
Chapter 3 Chapter 4 Chapter 5
Text
Chapter 2
/46
Quality degradation
by acoustic modeling
16
Averaging across input features
Context-tied Gaussian in HMM-based TTS
→ Robust to the unseen context
→ Loses info. of individual speech parameters.
𝑁 ⋅; 𝝁, 𝚺
e-l+o a-l+o
o-l+o
Proposed approach
– Models individual speech parameters while keeping robustness.
– Select one model in parameter generation.
– → Able to alleviate the quality degradation caused by averaging
/46
Acoustic modeling
of the proposed method
17
From the tied model to Rich context-GMM (R-GMM)
𝑁 ⋅; 𝝁, 𝚺
e-l+o a-l+o
o-l+o
Rich context models [Yan et al., 2009.] Less-averaged models having robustness
R-GMM Model that is formed as the same as the
conventional tied model
Update the mean while tying the covariance.
Gathers them with the same mixture weights.
/46
Speech parameter generation
from R-GMMs
18
ML generation of synthetic speech parameters 𝒚 𝒒
– Iterative generation with the explicit model selection*
𝒚 𝒒 = argmax 𝑃 𝒚, Δ𝒚|𝒎 ,𝑿 𝑃 𝒎 |𝒚, Δ𝒚, 𝑿
Mean
± variance
Time Time
Sta
tic fe
atu
re 𝒚
𝒎
Tied model R-GMM
∗ 𝝀 (HMM/GMM parameter sets) is omitted.
/46
Discussion
Initialization of the parameter generation (Sec. 3.5)
– Uses speech parameters from the over-trained statistics.
– → Avoids averaging by initialization, and alleviating over-training by
parameter generation.
Comparison to unit selection synthesis (Sec. 2.2)
– The model selection corresponds to the waveform segment selection.
– → Integrates unit selection in the statistical modeling.
Comparison to conventional hybrid methods
– Able to apply voice controlling methods, e.g., [Yamagishi et al., 2007.].
– → Better flexibility than [Yan et al., 2009.][Ling et al., 2007.] (Sec. 2.8)
19
/46
Subjective evaluation
(preference test on speech quality)
20
Pre
fere
nce s
core
0.0
0.2
0.4
0.6
0.8
1.0 HMM-based TTS
Spectrum H H R R T
F0 H R H R T
H/G: HMM/GMM (= tied model), R: R-GMM, T: Target (``R’’ using reference)
95% conf. interval
GMM-based VC
Pre
fere
nce
sco
re
0.0
0.2
0.4
0.6
0.8
1.0
G R T
/46
Modulation spectrum-based post-filter
Analysis Generation Synthesis Modeling
Modeling of individual
speech segment
Modulation spectrum
for over-smoothing
Chapter 4 Chapter 5
Text
Chapter 2
Chapter 3
/46
Over-smoothing
in parameter generation
22
Time Natural speech parameters
Time
Synthetic speech parameters
Speech
parameter
generation
Acoustic
modeling
/46
Revisits speech parameter generation (Sec. 2.6)
ML generation of synthetic speech parameters 𝒚 𝒒 *
23
Time
Sp
ectr
al p
ara
me
ter
Natural
𝒚 𝒒 = argmax 𝑃 𝒚, Δ𝒚|𝑿
𝑿: input features
𝝀 (HMM/GMM parameter sets) is omitted.
[Tokuda et al., 2000.]
HMM
/46
Global Variance (GV) and
parameter generation w/ GV
ML generation with GV constraint
24
Time
Natural
HMM HMM+GV
Sp
ectr
al p
ara
me
ter
𝒗(𝒚)
𝒚 𝒒 = argmax 𝑃 𝒚, Δ𝒚|𝑿 𝑃 𝒗 𝒚𝜔
𝒗 𝒚 : GV (= 2nd moment), 𝜔: weight of the GV term
[Toda et al., 2007.]
Something is still different between them...
→ What is it?
/46
Modulation Spectrum (MS) definition
MS: power spectrum of the sequence
– Represents temporal fluctuation. [Atlas et al., 2003.]
– Segment features in speech recognition [Thomas et al., 2009.]
– Captures speech intelligibility. [Drullman et al., 1994.]
26
2nd
moment
DFT
& pow.
GV (scalar)
MS (vector)
Time
Sp
ee
ch
pa
ram
ete
r
DFT: Discrete Fourier Transform
/46
HMM
natural
Modulation frequency
Mo
du
latio
n s
pe
ctr
um
Speech quality will be improved by filling this gap!
Example of the MS
HMM+GV
27
/46
Post-filtering process
28
Training data
Speech param. MS Statistics
(Gaussian)
HMMs
Filtering
Training
Synthesis
Post-filtering in the MS domain
– Linear conversion (interpolation) using 2 Gaussian distributions
/46
Filtered speech parameter sequence
29
Time
HMM
HMM+GV
natural
HMM → post-filter
Sp
ectr
al p
ara
me
ter
Generate fluctuating speech parameters by the post-filtering!
/46
Discussion 1: What is the MS?
30
Speech
parameter GV (temporal power)
Freq. 1
Freq. 2
Freq. 𝐷s
+
+ …
=
MS (frequency power)
Sum of MSs = GV
Fourier transform
Time
/46
Discussion 2
Why post-filter?
– Independent on the original speech synthesis process
– → High portability and high quality
Further application
– For spectrum, F0 (non-continuous), duration (unactual param.)
– Segment-level filter (faster process)
Advantages compared to the conventional post-filters
– Automatic design/tuning [Eyben et al., 2014.][Yoshimura et al., 1999.]
31
/46
Subjective evaluation
(preference test on speech quality)
32
Pre
fere
nce
sco
re
0.0
0.2
0.4
0.6
0.8
1.0
Pre
fere
nce
sco
re
0.0
0.2
0.4
0.6
0.8
1.0
Spectrum
in HMM-based TTS
Spectrum
in GMM-based VC
HMM GMM
+GV
HMM
+GV
post-filtering
/46
Speech synthesis integrating modulation spectrum
Analysis Generation Synthesis Modeling
Modeling of individual
speech segment
Modulation spectrum
for over-smoothing
Chapter 5
Text
Chapter 2
Chapter 3 Chapter 4
/46
Problems of the MS-based post-filter
MS-based post-filter
– External process for MS emphasis
– → Causes over-emphasis ignoring speech synthesis criteria.
– → Difficult to utilize flexibility that HMM/GMMs have
34
Approaches: Joint optimization using HMM/GMMs and MS
– Integrate MS statistics as the one of the acoustic models.
– Speech parameter generation with MS … high-quality
– Acoustic model training with MS … high-quality and fast
/46
Speech parameter generation
considering the MS
35
ML generation with MS constraint
𝒚 𝒒 = argmax 𝑃 𝒚, Δ𝒚|𝑿 𝑃 𝒔 𝒚𝜔
𝒔 𝒚 : MS (= power spectrum), 𝜔: weight of the MS term
𝑬𝒒 𝑫𝒒
𝑃 𝒚, Δ𝒚|𝑿 = 𝑁 𝒚, Δ𝒚 ; 𝑬𝒒 , 𝑫𝒒 𝑃 𝒔 𝒚 = 𝑁 𝒔 𝒚 ; 𝝁s, 𝚺𝐬
Modulation freq. M
S
Natural
Quadratic function of 𝒚
/46
Discussion
(comparison to MS-based post-filter)
Initialization
– Basic ML generation (``HMM’’) → MS-based post-filter
– → Part optimization by initialization, and joint optimization by iteration
36
HMM → post-filter
HMM
Time
Spectr
al para
mete
r
HMM+MS
/46
Effect in the MS
37
HMM
HMM+GV
natural HMM+MS
Modulation frequency
Mo
du
latio
n s
pe
ctr
um
Fills the gap by the proposed generation algorithm!
/46
Effect in the GV
38
Index of speech parameters
Log G
V
HMM HMM+GV
natural
Recovers the GV w/o considering the GV!
HMM+MS
/46
Subjective evaluation
(preference test on speech quality)
39
Pre
fere
nce
sco
re
0.0
0.2
0.4
0.6
0.8
1.0
Pre
fere
nce
sco
re
0.0
0.2
0.4
0.6
0.8
1.0 GMM-based VC
HMM
+MS
HMM
+GV
GMM
+MS
GMM
+GV
HMM-based TTS
*+GV Parameter generation w/ GV (Sec. 2.9)
*+MS Parameter generation w/ MS
/46
Problems of parameter generation
and MS-constrained training
40
Speech parameter generation considering the MS
– Iterative process in synthesis
– → Computationally-inefficient speech synthesis
𝝀 = argmax 𝑃 𝒚|𝑿 𝑃 𝒔 𝒚𝜔
𝑃 𝒚|𝑿 = 𝑁 𝒚; 𝒚 𝒒 , 𝜮 : Trajectory likelihood (Sec. 2.8)
𝑃 𝒔 𝒚 = 𝑁 𝒔 𝒚 ; 𝒔 𝒚 𝒒 , 𝜮𝐬 : MS likelihood
Minimizes difference between 𝒚 and 𝒚 𝒒 .
Minimizes difference between 𝒔 𝒚 and 𝒔 𝒚 𝒒 .
Acoustic model training constrained with MS
– Train HMMs/GMMs 𝝀 to generate param. 𝒚 𝒒 having natural MS
/46
Trained HMM parameters
41
Basic training (Sec. 2.4-5)
Trajectory training (Sec. 2.8)
Time
De
lta
fe
atu
re
Updates HMM/GMM param. to generate fluctuating param.!
MS-constrained training
/46
Discussion
Computational efficiency in parameter generation
– Basic generation algorithm (Sec. 2.6) can be used without MS.
– → Not only high-quality but also computationally-efficient
Which is better in quality, proposed param. gen. or training?
– Structures of HMMs/GMMs have limitation for recovering MS.
– → The parameter generation considering the MS is better.
42
Portability Quality Computation time
Post-filter Best! (no depend-ency on models)
Better Better (120 ms)
Param. gen. Better Best!(optimization in synthesis)
Worse (1 min~)
Training Worse Better Best! (5 ms)
/46
Subjective evaluation
(preference test on speech quality)
43
Pre
fere
nce
sco
re
0.0
0.2
0.4
0.6
0.8
1.0
Pre
fere
nce
sco
re 0.0
0.2
0.4
0.6
0.8
1.0 HMM-based TTS GMM-based VC
HMM TRJ GV MS-
TRJ
GMM TRJ GV MS-
TRJ
HMM/GMM Basic HMM/GMM training (Sec. 2.4-5)
TRJ Trajectory HMM training (Sec. 2.8)
GV GV-constrained training (Sec. 2.9)
MS-TRJ MS-constrained trajectory training
/46
Conclusion
/46
Conclusion
Problem in this thesis
– Quality degradation in synthetic speech, which is caused by
parameterization error, insufficient modeling, and over-smoothing
Chapter 3: statistical parametric speech synthesis
– Addresses the insufficiency in the acoustic modeling.
– Models the individual speech parameter with rich context models.
Chapter 4 & 5: approaches using Modulation Spectrum (MS)
– Addresses the over-smoothing in the parameter generation.
– 1. MS-based post-filter: high portability
– 2. Parameter generation w/ MS: highest quality
– 3. MS-constrained training: computationally-efficient generation
45
/46
Future work
Improvements of rich context modeling
– Quality degradation even if the best models are selected. (Sec. A.5)
Theoretical analysis of MS
– Why is the speech quality improved by the MS?
MS for DNN-based speech synthesis
– More flexible structures to integrate the MS
GPU implementation of the proposed methods
– Rich-context-model selection & param. generation with the MS
46