hidden markov model (hmm) based speech synthesis using hts ... nawaz.pdf · if we consider all the...
TRANSCRIPT
Hidden Markov Model (HMM) based Speech Synthesis using HTS Toolkit.
Presenter:
Omer Nawaz
Research Officer (III)
Hidden Markov Model (HMM) based Speech Synthesis using HTS Toolkit.
Speech Synthesis Overview:
Text to be
Synthesized
Natural Language
Processing
(NLP)(NLP)
Speech Synthesis Overview:
2
Speech Synthesis
Engine
Synthesized
Speech
Introduction:Rule-based, formant synthesis
� Hand-crafting each phonetic units by rules
CORPUS-BASED:
�Concatenative synthesis
� High quality speech can be synthesized using waveform concatenation algorithms.concatenation algorithms.
� To obtain various voices, a large amount of speech data is necessary.
�Statistical parametric synthesis
� Generate speech parameters from statistical models
� Voice quality can easily be changed by transforming HMM parameters.
crafting each phonetic units by rules
High quality speech can be synthesized using waveform
To obtain various voices, a large amount of speech data
Generate speech parameters from statistical models
Voice quality can easily be changed by transforming
3
Approaches at CLE:CORPUS-BASED:
�Unit Selection
�HMM based.
Comparison of two Approaches:
UnitUnitUnitUnit SelectionSelectionSelectionSelection
Advantages:Advantages:Advantages:Advantages:Advantages:Advantages:Advantages:Advantages:
High Quality at Waveform level
(Specific Domain)
• Small Foot Print
• Smooth
• Stable Quality
Disadvantages:Disadvantages:Disadvantages:Disadvantages:
• Large footprints
• Discontinuous
• Unstable quality
Vocoder
(Domain
HMM basedHMM basedHMM basedHMM based
4
Small Foot Print
Smooth
Stable Quality
Vocoder sound
(Domain-independent)
Synthesis Model:
Linear
time-invariant
system)(ne
ExcitationPulse train
SourceSourceSourceSource excitation part Vocal tract Vocal tract Vocal tract Vocal tract
Source Filter Model:
system
)(nh
)(ne
White noise
� The h(n) is defined by the state output vector of the HMM mel-cepstrum
invariantSpeech
Vocal tract Vocal tract Vocal tract Vocal tract resonance part
55
)(*)()( nenhnx =Speech
is defined by the state output vector of the HMM e.g
General Overview(HTS):
Extract Spectrum,
F0, labels
Train Acoustic
Models
Speech Input
Labels
Parameter
Generation
Synthesis
Filter
Text Input
Extract Spectrum,
Train Acoustic
Stored
Training PartTraining PartTraining PartTraining Part
66
Synthesized
Speech
Stored
ModelsSynthesis PartSynthesis PartSynthesis PartSynthesis Part
Challenges:
Generation of the full-context style labels.
Addition of Stress/Syllable Layer.
Defining the Question Set.
Optimizing the Synthesized Quality.Optimizing the Synthesized Quality.
context style labels.
77
Full-Context Label Style:
P A K I S T_D A N
P-A-K T_D
Phoneme
sequence
TriTriTriTri----phone context dependent modelphone context dependent modelphone context dependent modelphone context dependent model
P A K I S T_D A NPhoneme
sequence
x^Px^Px^Px^P----AAAA+K+K+K+K====I@x_xI@x_xI@x_xI@x_x/A …/A …/A …/A … S^T_DS^T_DS^T_DS^T_D----
FullFullFullFull----context style context style context style context style context dependent modelcontext dependent modelcontext dependent modelcontext dependent model
P A K I S T_D A N
T_D-A-N
88
phone context dependent modelphone context dependent modelphone context dependent modelphone context dependent model
P A K I S T_D A N
----AAAA+N=+N=+N=+N=x@x_xx@x_xx@x_xx@x_x/A …/A …/A …/A …
context dependent modelcontext dependent modelcontext dependent modelcontext dependent model
Full-Context Format:
x^x-SIL+A=L@1_0/A:0_0_0/B:0-0-0@1-0&x^SIL-A+L=I_I@1_1/A:0_0_0/B:0-0-1@1-2&
SIL^A-L+I_I=A@1_2/A:0_0_1/B:0-0-2@2-1
A^L-I_I+A=P@2_1/A:0_0_1/B:0-0-2@2-1&
&1-1#1-1$1-1!0-0;0- …&1-9#1-3$1-1!0-2;0- …
1&2-8#1-3$1-1!0-1;0-0 …
&2-8#1-3$1-1!0-1;0- …
99
��� ۔۔۔ �� ا�
Full-Context Format:SIL^A-L+I_I=A@ 1_2/A:0_0_1/B:0-0-2@2-1&2
0-0|I_I/C:1+0+2/D:0_0/E:content+2@1+5&1+4#0+1/F:content_2/G:0_0/H:9=5^1=2|NONE/I:8=6/J:17+11-2
SupraSegmental Context
SegmentalSegmentalSegmentalSegmental
• Current Phoneme
• Previous two Phonemes
• Next two Phonemes
• Syllable
• Stress
• Word
• Phrase
• POS
1&2-8#1-3$1-1!0-1;0|I_I/C:1+0+2/D:0_0/E:content+2@1+5&1+
4#0+1/F:content_2/G:0_0/H:9=5^1=2|NONE
Supra-Segmental
Context
1010
Context
SupraSupraSupraSupra----SegmentalSegmentalSegmentalSegmental
Syllable
Stress
Word
Phrase
POS
Steps to Generate Full-Context Labels:Extract
Segmental &
Word Layer
Apply Stress &
Syllabification
Rules
TextGrid File
Rules
Align Syllable
Boundaries with
Segmental Layer
Generate new
TextGrid File with
Additional Layers
Context Labels:
1111
Convert to Full-
Context format
Generate new
File with
Additional Layers
Steps to Generate Full-Context Labels:Extract
Segmental &
Word Layer
Apply Stress &
Syllabification
Rules
TextGrid File
Rules
Align Syllable
Boundaries with
Segmental Layer
Generate new
TextGrid File with
Additional Layers
Context Labels:
1313
Convert to Full-
Context format
Generate new
File with
Additional Layers
Context Clustering (Question Set) 1/2:
Number of possible combinations are quite enormous with these 53535353 different contexts.
With only Segmental Context Possible models are:
665 ≈ 1252 million
If we consider all the context, it will be practically infinite.If we consider all the context, it will be practically infinite.
Solution:Solution:Solution:Solution:
Record data having maximum phoneme coverage at trior di-phone level.
Apply context clustering technique to classify and share acoustically similar models
Context Clustering (Question Set) 1/2:
Number of possible combinations are quite enormous with
With only Segmental Context Possible models are:
1252 million
If we consider all the context, it will be practically infinite.
1515
If we consider all the context, it will be practically infinite.
Record data having maximum phoneme coverage at tri-phone
Apply context clustering technique to classify and share
Context Clustering (Question Set) 2/2:
Phoneme� {preceding, current, succeeding} phonemes
Stress/Syllable/Word/� # of phonemes at {preceding, current, succeeding} syllable# of phonemes at {preceding, current, succeeding} syllable� stress of {preceding, current, succeeding} syllable� Position of current syllable in current word� # of syllables {from previous, to next} stressed � Vowel within current syllable� # of syllables in {preceding, current, succeeding}
Context Clustering (Question Set) 2/2:
{preceding, current, succeeding} phonemes
# of phonemes at {preceding, current, succeeding} syllable
1616
# of phonemes at {preceding, current, succeeding} syllable
of {preceding, current, succeeding} syllable
Position of current syllable in current word
stressed syllable
of syllables in {preceding, current, succeeding} word
Some Synthesized Examples:
Seen ContextSeen ContextSeen ContextSeen Context::::
UnUnUnUn----seen seen seen seen ContextContextContextContext::::
Different Carrier Word:Different Carrier Word:Different Carrier Word:Different Carrier Word:
Some Synthesized Examples:
Training Set:Training Set:Training Set:Training Set:
1717