the creation of emotional effects for an arabic speech synthesis system prepared by: waleed mohamed...

The Creation of Emotional Effects for an Arabic Speech Synthesis System

Prepared by: Waleed Mohamed Azmy

Under Supervision: Prof. Dr. Mahmoud Ismail Shoman, Dr. Sherif Mahdy Abdou

Agenda

◦ Motivations & Applications
◦ Emotional Synthesis Approaches
◦ Unit Selection & Blending Data Approach
◦ Arabic Speech Database & Challenges
◦ Festival – Speech Synthesis framework
◦ Arabic Voice Building
◦ Proposed Utterance Structure
◦ Proposed Target Cost Function
◦ System Evaluation & Results
◦ Conclusion & Future Works

Motivations & Applications

Emotions are inseparable components of natural human speech.

Because of that, synthetic speech can only reach the level of human speech if it is able to synthesize emotions.

With the tremendous increase in the demand for speech synthesis systems, emotional synthesis has become one of the major trends in this field.

Motivations & Applications

Emotional speech synthesis can be applied to different applications like:
◦ Spoken dialogue systems
◦ Customer-care centers
◦ Task planning
◦ Tutorial systems
◦ Automated agents

Future trends are towards approaching Artificial Intelligence.

Emotional Synthesis Approaches

Attempts to add emotion effects to synthesised speech have existed for more than a decade.

Most emotional synthesis approaches inherit the properties of the normal synthesis techniques they are built on.

Different emotional synthesis techniques provide control over acoustic parameters to very different degrees.
◦ Important emotion-related acoustic parameters include pitch, duration, speaking rate, etc.

Emotional Synthesis Approaches

Emotional Formant Synthesis
◦ Also known as rule-based synthesis.
◦ No human speech recordings are involved at run time.
◦ The resulting speech sounds relatively unnatural and "robot-like".

Emotional Diphone Concatenation
◦ Uses diphone concatenation on emotional recordings.
◦ The majority of reports show a degree of success in conveying some emotions.
◦ May harm voice quality.

Emotional Synthesis Approaches

HMM-Based Parametric Synthesis
◦ Uses HMM models to change the speaking style and emotional expression of the synthetic speech.
◦ Models are trained on speech produced with different speaking styles.
◦ Requires several hundred sentences spoken in a different speaking style.

Emotional-Based Unit Selection
◦ Uses variable-length units.
◦ Units that are similar to the given target are likely to be selected.
◦ This is the approach used in our proposed system.

Emotional Unit Selection

Units are selected from large databases of natural speech.

The quality of unit selection synthesis depends heavily on the quality of the speaker and on the coverage of the database.

The quality of the synthesis relies on the fact that little or no signal processing is done on the selected units.

Two major cost functions are used:
◦ Join (concatenation) cost function
◦ Target cost function
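How the two cost functions interact can be illustrated with a minimal sketch of the usual unit selection search: a Viterbi pass that picks one candidate per target so that the sum of target costs and join costs is minimal. The function name `select_units`, the `(phone, pitch)` unit tuples and both toy cost functions are assumptions made for illustration; they are not taken from the presented system or from Festival.

```python
from typing import Callable, List, Sequence, Tuple

Unit = Tuple[str, float]  # hypothetical unit: (phone label, pitch in Hz)


def select_units(
    targets: Sequence[Unit],
    candidates: Sequence[Sequence[Unit]],
    target_cost: Callable[[Unit, Unit], float],
    join_cost: Callable[[Unit, Unit], float],
) -> List[Unit]:
    """Return one candidate per target with minimal total target + join cost."""
    if not targets or len(targets) != len(candidates):
        raise ValueError("need one candidate list per target")
    # best[i][k]: cheapest cost of any path that ends in candidates[i][k]
    best = [[target_cost(targets[0], u) for u in candidates[0]]]
    back = [[-1] * len(candidates[0])]
    for i in range(1, len(targets)):
        row, ptr = [], []
        for u in candidates[i]:
            # cost of reaching u from each candidate of the previous target
            costs = [best[i - 1][k] + join_cost(p, u)
                     for k, p in enumerate(candidates[i - 1])]
            k_min = min(range(len(costs)), key=costs.__getitem__)
            row.append(costs[k_min] + target_cost(targets[i], u))
            ptr.append(k_min)
        best.append(row)
        back.append(ptr)
    # follow the back-pointers from the cheapest final candidate
    k = min(range(len(best[-1])), key=best[-1].__getitem__)
    path: List[Unit] = []
    for i in range(len(targets) - 1, -1, -1):
        path.append(candidates[i][k])
        k = back[i][k]
    path.reverse()
    return path


def pitch_distance(a: Unit, b: Unit) -> float:
    """Toy cost used for both target and join cost: scaled pitch difference."""
    return abs(a[1] - b[1]) / 100.0


example_targets = [("a", 110.0), ("b", 100.0)]
example_candidates = [
    [("a", 100.0), ("a", 200.0)],
    [("b", 210.0), ("b", 90.0)],
]
chosen = select_units(example_targets, example_candidates,
                      pitch_distance, pitch_distance)
```

In this toy example the search keeps the two low-pitched candidates, because they are close to their targets and also join smoothly with each other.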

Blending Data Approach

Two main techniques for building emotional voices for unit selection:
◦ Tiering techniques
   Limited-domain synthesis.
   Requires a large number of databases for different emotions.
◦ Blending techniques
   General-purpose applications.
   All databases are placed in one data pool used for selection.
   This is the technique used in our system.

In the blending approach, the system chooses units from only one database, the combined pool.

It requires a careful selection criterion to match the target speaking style.

Arabic Speech Database

In our system we used the RDI TTS Saudi speaker database.

This database consists of 10 hours of recordings with neutral emotion and one hour of recordings for four different emotions: sadness, happiness, surprise and questioning.

The electroglottograph (EGG) signal is also recorded to support pitch marking during the synthesis process.

The database is of a male speaker, sampled at a 16 kHz sampling rate.

An HMM-based Viterbi alignment procedure is used to produce the phone-level segmentation boundaries.

Festival – Speech Synthesis framework

The Festival TTS system was developed at the University of Edinburgh by Alan Black and Paul Taylor.

Festival is primarily a research toolkit for text-to-speech (TTS) synthesis.

It provides a state-of-the-art unit selection synthesis module called "Multisyn".

Festival uses a data structure called an utterance structure that consists of items and relations.

An utterance represents some chunk of text that is to be rendered as speech.
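The idea of items and relations can be sketched in a few lines of Python. This is only an illustrative model, not Festival's actual implementation (Festival is scripted in Scheme, and its relations can also be trees): the class names `Item` and `Utterance`, the `add` helper, and the example word and phone labels are all assumptions.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Item:
    """A bundle of named features, e.g. one word or one phone segment."""
    features: Dict[str, object] = field(default_factory=dict)


@dataclass
class Utterance:
    """Named relations over items; each relation is a simple ordered list here."""
    relations: Dict[str, List[Item]] = field(default_factory=dict)

    def add(self, relation: str, item: Item) -> Item:
        """Append an item to the named relation, creating the relation if needed."""
        self.relations.setdefault(relation, []).append(item)
        return item


utt = Utterance()
# One item in the "Word" relation (illustrative label)
word = utt.add("Word", Item({"name": "hello"}))
# Four items in the "Segment" relation, each linked back to its word
for phone in ["h", "@", "l", "ou"]:
    utt.add("Segment", Item({"name": phone, "word": word}))
```

The point of the structure is that the same utterance can be viewed through several relations (words, segments, and so on), with features attached to each item.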

Festival – Speech Synthesis framework

Example of utterance structure:

Arabic Voice Building

The voice building phase is one of the major steps in building a unit selection synthesizer.

The voice building toolkit that comes with Festival was used to build our Arabic emotional voice. The following steps were used:
1. Generate power factors and normalize the wave files
2. Use the EGG signals for accurate pitchmark extraction
3. Generate LPC coefficients and residuals from the wave files
4. Generate MFCCs, pitch values and spectral coefficients

Arabic Voice Building (Pitch marking)

In unit selection, accurate estimation of pitchmarks is necessary for pitch modification to assure optimal quality of the synthetic speech.

We used the EGG signal to extract pitchmarks.

Pitchmarking enhancements were carried out by a matrix optimization process.

The low-pass and high-pass filter cut-off frequencies were the parameters chosen for the optimization process.

Arabic Voice Building (Pitch marking)

Example of the default parameters of the pitchmarking application versus the optimized ones.

[Figure: Default Pitchmarking]

[Figure: Optimized Pitchmarking]

Proposed Utterance Structure

The HMM-based alignment transcription is converted to the utterance structure.

ASMO-449 is used to transliterate the Arabic text to English characters in the utterance files.

The utterance structures of both the utterances in the training databases and the target utterances are changed to carry emotional information.

A new feature called emotion has been added to the word item type in the utterance structure.

Proposed Utterance Structure

The proposed system is designed to work with three emotional states (normal, sad and question).

So, the emotion feature takes one of three emotional values:
1. Normal: normal state
2. Sad: sad emotion state
3. Question: question emotion state

Proposed Target Cost Function

The target cost is a weighted sum of functions that check whether features in the target utterance match those features in the candidate utterance.

The standard target cost in Festival does not include any term for differences in the emotional state of the utterance.

A new emotional target cost is proposed.

Proposed Emotional Target Cost Function

The target cost:

$$C^{t}(t_i, u_i) = \sum_{j=1}^{p} w^{t}_{j}\, C^{t}_{j}(t_i, u_i)$$

The emotional target cost:

$$C^{t}_{emo}(t_i, u_i) = \begin{cases} 0 & \text{if } emo(t_i) = emo(u_i) \\ 1 & \text{if } emo(t_i) \neq emo(u_i) \end{cases}$$
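A weighted-sum target cost with an added emotion term could look roughly like the following sketch. The feature dictionaries, the `stress_cost` stand-in for a standard sub-cost, and the weight values are all hypothetical; the slides only state that the emotion weight was tuned by hand, not what its value is.

```python
from typing import Callable, List, Mapping, Tuple

Features = Mapping[str, object]
SubCost = Callable[[Features, Features], float]


def emotion_cost(target: Features, unit: Features) -> float:
    """0 when the emotion labels match, 1 when they differ."""
    return 0.0 if target.get("emotion") == unit.get("emotion") else 1.0


def stress_cost(target: Features, unit: Features) -> float:
    """Hypothetical stand-in for one of the standard target sub-costs."""
    return 0.0 if target.get("stress") == unit.get("stress") else 1.0


def target_cost(target: Features, unit: Features,
                weighted: List[Tuple[float, SubCost]]) -> float:
    """Weighted sum of sub-costs between a target and a candidate unit."""
    return sum(w * cost(target, unit) for w, cost in weighted)


# Assumed weights for illustration only
weighted_costs = [(1.0, stress_cost), (2.0, emotion_cost)]

target = {"stress": 1, "emotion": "sad"}
unit_same = {"stress": 1, "emotion": "sad"}
unit_other_emotion = {"stress": 1, "emotion": "normal"}
unit_all_different = {"stress": 0, "emotion": "normal"}

costs = [target_cost(target, u, weighted_costs)
         for u in (unit_same, unit_other_emotion, unit_all_different)]
```

With a larger emotion weight, a candidate recorded in the wrong emotional state is penalized more heavily than one that only mismatches on another feature, which is what steers selection within a blended data pool.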

Proposed Target Cost Function

The algorithm in general favors units that are similar in emotional state or classified as emotional, using two stages of penalties.

The weighting factor of the emotional target cost is tuned by trial and error to find an appropriate value.

System Evaluation

Six sentences were synthesized for each emotional state.

The sentences were taken from news websites, everyday conversations and the Holy Qur'an.

Two major sets of evaluations were used:
◦ Deterministic evaluation
◦ Perceptual evaluation

In the deterministic evaluation tests, the emotion classifier system Emovoice is run on the output utterances.

Emovoice is trained on the Arabic data using an SVM model and its complete feature set.

System Evaluation

In the perceptual tests, 15 participants were involved.

Two types of listening tests were performed in order to evaluate the system perceptually.
◦ Intelligibility test
   The listeners were asked to type in what they heard.
   The Word Error Rate (WER) is computed.
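WER is conventionally the word-level edit distance (substitutions, deletions and insertions) between what the listener typed and the reference sentence, divided by the number of reference words. A minimal sketch, assuming whitespace tokenization and no text normalization (the slides do not say how the transcripts were preprocessed); the function name `word_error_rate` is illustrative:

```python
from typing import List


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref: List[str] = reference.split()
    hyp: List[str] = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j]: edit distance between the reference words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words matches an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion of a reference word
                curr[j - 1] + 1,         # insertion of a hypothesis word
                prev[j - 1] + (r != h),  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# One substitution and one deletion against a four-word reference
example_wer = word_error_rate("a b c d", "a x c")
```

Because insertions count as errors, WER computed this way can exceed 1.0 when the typed response is much longer than the reference.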

System Evaluation

◦ Naturalness & emotiveness test
   The experiment took place on a computer, where subjects listened to the synthesized sentences using headphones.
   The listeners were asked to rate the quality of the output voice in terms of naturalness and emotiveness.
   Participants were asked to give ratings between 1 and 4, for poor, acceptable, good and excellent respectively.

System Evaluation - Results

The confusion matrix of the classifier output emotion against the target emotional state of the utterance is:

                   Classified As
Target       Normal    Sad    Question
Normal          5       1        0
Sad             0       6        0
Question        1       0        5
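The matrix can be summarized as an overall accuracy and a per-emotion recall. The short sketch below assumes that rows are the target emotions and columns are the classifier outputs, as the "Classified As" header suggests; the variable names are illustrative.

```python
labels = ["normal", "sad", "question"]

# Rows: target emotion; columns: emotion assigned by the classifier
confusion = [
    [5, 1, 0],
    [0, 6, 0],
    [1, 0, 5],
]

total = sum(sum(row) for row in confusion)
correct = sum(confusion[i][i] for i in range(len(labels)))
accuracy = correct / total

# Recall per target emotion: diagonal count divided by the row total
recall = {
    label: confusion[i][i] / sum(confusion[i])
    for i, label in enumerate(labels)
}
```

With these counts, 16 of the 18 synthesized utterances fall on the diagonal, and the sad utterances are all classified as sad.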

System Evaluation - Results

The WER results are shown in the figure.

The maximum average WER is 8%, for the sad emotion.

[Figure: WER]

System Evaluation - Results

The descriptive statistics of the naturalness and emotiveness ratings are summarized below.

Naturalness
Rating       Mean    Standard Deviation
Normal       2.72    1.01
Sad          2.71    1.02
Question     2.56    0.96
All          2.67    1.06

Emotiveness
Rating       Mean    Standard Deviation
Normal       3.28    0.71
Sad          2.76    1.09
Question     3.08    0.81
All          3.04    0.95

System Evaluation - Results

The overall mean of the naturalness ratings is 2.7, which approaches good-quality naturalness.

The overall average emotiveness rating is 3.1, which indicates a good emotive state of the synthesized speech.

The naturalness and emotiveness ratings for question sentences have a lower mean value and high variance, which means that they were, to some extent, not recognized as natural human speech. However, they still show good emotiveness scores.

Conclusions

The main goal of this research was to develop an emotional Arabic TTS voice.

This research focused on three important emotional states: normal, sad and question.

According to the different tests performed on the system, it shows promising results. For the most part, the participants perceived an acceptably natural voice with a clear, good emotive state.

Future works

It is recommended to increase the duration of acted or real emotional utterances in the RDI Arabic speech database.

Despite the work done on accurate pitch-marking, some further enhancements are needed, especially for question utterances.

Optimize the pitch modification module in Festival for better concatenation with different emotions.

Use emotional speech conversion based on signal processing and feature modelling technologies. The initial key features are commonly known to be pitch and duration.

Questions
