hidden markov model (hmm) based speech synthesis using hts ... nawaz.pdf · if we consider all the...

18
Hidden Markov (HMM) based S Synthesis using Presenter: Omer Nawaz Research Officer (III) v Model Speech g HTS Toolkit.

Upload: others

Post on 28-May-2020

2 views

Category:

Documents


0 download

TRANSCRIPT

Hidden Markov Model (HMM) based Speech Synthesis using HTS Toolkit.

Presenter:

Omer Nawaz

Research Officer (III)

Hidden Markov Model (HMM) based Speech Synthesis using HTS Toolkit.

Speech Synthesis Overview:

Text to be

Synthesized

Natural Language

Processing

(NLP)(NLP)

Speech Synthesis Overview:

2

Speech Synthesis

Engine

Synthesized

Speech

Introduction:Rule-based, formant synthesis

� Hand-crafting each phonetic units by rules

CORPUS-BASED:

�Concatenative synthesis

� High quality speech can be synthesized using waveform concatenation algorithms.concatenation algorithms.

� To obtain various voices, a large amount of speech data is necessary.

�Statistical parametric synthesis

� Generate speech parameters from statistical models

� Voice quality can easily be changed by transforming HMM parameters.

crafting each phonetic units by rules

High quality speech can be synthesized using waveform

To obtain various voices, a large amount of speech data

Generate speech parameters from statistical models

Voice quality can easily be changed by transforming

3

Approaches at CLE:CORPUS-BASED:

�Unit Selection

�HMM based.

Comparison of two Approaches:

UnitUnitUnitUnit SelectionSelectionSelectionSelection

Advantages:Advantages:Advantages:Advantages:Advantages:Advantages:Advantages:Advantages:

High Quality at Waveform level

(Specific Domain)

• Small Foot Print

• Smooth

• Stable Quality

Disadvantages:Disadvantages:Disadvantages:Disadvantages:

• Large footprints

• Discontinuous

• Unstable quality

Vocoder

(Domain

HMM basedHMM basedHMM basedHMM based

4

Small Foot Print

Smooth

Stable Quality

Vocoder sound

(Domain-independent)

Synthesis Model:

Linear

time-invariant

system)(ne

ExcitationPulse train

SourceSourceSourceSource excitation part Vocal tract Vocal tract Vocal tract Vocal tract

Source Filter Model:

system

)(nh

)(ne

White noise

� The h(n) is defined by the state output vector of the HMM mel-cepstrum

invariantSpeech

Vocal tract Vocal tract Vocal tract Vocal tract resonance part

55

)(*)()( nenhnx =Speech

is defined by the state output vector of the HMM e.g

General Overview(HTS):

Extract Spectrum,

F0, labels

Train Acoustic

Models

Speech Input

Labels

Parameter

Generation

Synthesis

Filter

Text Input

Extract Spectrum,

Train Acoustic

Stored

Training PartTraining PartTraining PartTraining Part

66

Synthesized

Speech

Stored

ModelsSynthesis PartSynthesis PartSynthesis PartSynthesis Part

Challenges:

Generation of the full-context style labels.

Addition of Stress/Syllable Layer.

Defining the Question Set.

Optimizing the Synthesized Quality.Optimizing the Synthesized Quality.

context style labels.

77

Full-Context Label Style:

P A K I S T_D A N

P-A-K T_D

Phoneme

sequence

TriTriTriTri----phone context dependent modelphone context dependent modelphone context dependent modelphone context dependent model

P A K I S T_D A NPhoneme

sequence

x^Px^Px^Px^P----AAAA+K+K+K+K====I@x_xI@x_xI@x_xI@x_x/A …/A …/A …/A … S^T_DS^T_DS^T_DS^T_D----

FullFullFullFull----context style context style context style context style context dependent modelcontext dependent modelcontext dependent modelcontext dependent model

P A K I S T_D A N

T_D-A-N

88

phone context dependent modelphone context dependent modelphone context dependent modelphone context dependent model

P A K I S T_D A N

----AAAA+N=+N=+N=+N=x@x_xx@x_xx@x_xx@x_x/A …/A …/A …/A …

context dependent modelcontext dependent modelcontext dependent modelcontext dependent model

Full-Context Format:

x^x-SIL+A=L@1_0/A:0_0_0/B:0-0-0@1-0&x^SIL-A+L=I_I@1_1/A:0_0_0/B:0-0-1@1-2&

SIL^A-L+I_I=A@1_2/A:0_0_1/B:0-0-2@2-1

A^L-I_I+A=P@2_1/A:0_0_1/B:0-0-2@2-1&

&1-1#1-1$1-1!0-0;0- …&1-9#1-3$1-1!0-2;0- …

1&2-8#1-3$1-1!0-1;0-0 …

&2-8#1-3$1-1!0-1;0- …

99

��� ۔۔۔ �� ا�

Full-Context Format:SIL^A-L+I_I=A@ 1_2/A:0_0_1/B:0-0-2@2-1&2

0-0|I_I/C:1+0+2/D:0_0/E:content+2@1+5&1+4#0+1/F:content_2/G:0_0/H:9=5^1=2|NONE/I:8=6/J:17+11-2

SupraSegmental Context

SegmentalSegmentalSegmentalSegmental

• Current Phoneme

• Previous two Phonemes

• Next two Phonemes

• Syllable

• Stress

• Word

• Phrase

• POS

1&2-8#1-3$1-1!0-1;0|I_I/C:1+0+2/D:0_0/E:content+2@1+5&1+

4#0+1/F:content_2/G:0_0/H:9=5^1=2|NONE

Supra-Segmental

Context

1010

Context

SupraSupraSupraSupra----SegmentalSegmentalSegmentalSegmental

Syllable

Stress

Word

Phrase

POS

Steps to Generate Full-Context Labels:Extract

Segmental &

Word Layer

Apply Stress &

Syllabification

Rules

TextGrid File

Rules

Align Syllable

Boundaries with

Segmental Layer

Generate new

TextGrid File with

Additional Layers

Context Labels:

1111

Convert to Full-

Context format

Generate new

File with

Additional Layers

TextGrid Format:

1212

Steps to Generate Full-Context Labels:Extract

Segmental &

Word Layer

Apply Stress &

Syllabification

Rules

TextGrid File

Rules

Align Syllable

Boundaries with

Segmental Layer

Generate new

TextGrid File with

Additional Layers

Context Labels:

1313

Convert to Full-

Context format

Generate new

File with

Additional Layers

extGrid Format with Additional Layers:Format with Additional Layers:

1414

Context Clustering (Question Set) 1/2:

Number of possible combinations are quite enormous with these 53535353 different contexts.

With only Segmental Context Possible models are:

665 ≈ 1252 million

If we consider all the context, it will be practically infinite.If we consider all the context, it will be practically infinite.

Solution:Solution:Solution:Solution:

Record data having maximum phoneme coverage at trior di-phone level.

Apply context clustering technique to classify and share acoustically similar models

Context Clustering (Question Set) 1/2:

Number of possible combinations are quite enormous with

With only Segmental Context Possible models are:

1252 million

If we consider all the context, it will be practically infinite.

1515

If we consider all the context, it will be practically infinite.

Record data having maximum phoneme coverage at tri-phone

Apply context clustering technique to classify and share

Context Clustering (Question Set) 2/2:

Phoneme� {preceding, current, succeeding} phonemes

Stress/Syllable/Word/� # of phonemes at {preceding, current, succeeding} syllable# of phonemes at {preceding, current, succeeding} syllable� stress of {preceding, current, succeeding} syllable� Position of current syllable in current word� # of syllables {from previous, to next} stressed � Vowel within current syllable� # of syllables in {preceding, current, succeeding}

Context Clustering (Question Set) 2/2:

{preceding, current, succeeding} phonemes

# of phonemes at {preceding, current, succeeding} syllable

1616

# of phonemes at {preceding, current, succeeding} syllable

of {preceding, current, succeeding} syllable

Position of current syllable in current word

stressed syllable

of syllables in {preceding, current, succeeding} word

Some Synthesized Examples:

Seen ContextSeen ContextSeen ContextSeen Context::::

UnUnUnUn----seen seen seen seen ContextContextContextContext::::

Different Carrier Word:Different Carrier Word:Different Carrier Word:Different Carrier Word:

Some Synthesized Examples:

Training Set:Training Set:Training Set:Training Set:

1717

Questions ?Questions ?Questions ?

1818

Questions ?