the creation of emotional effects for an arabic speech synthesis system prepared by: waleed mohamed...

30
The Creation of Emotional The Creation of Emotional Effects for An Arabic Effects for An Arabic Speech Synthesis System Speech Synthesis System Prepared by: Waleed Mohamed Azmy Under Supervision: Prof. Dr. Mahmoud Ismail Shoman Dr. Sherif Mahdy Abdou

Upload: michael-scott

Post on 12-Jan-2016

214 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: The Creation of Emotional Effects for An Arabic Speech Synthesis System Prepared by: Waleed Mohamed Azmy Under Supervision: Prof. Dr. Mahmoud Ismail Shoman

The Creation of Emotional The Creation of Emotional Effects for An Arabic Speech Effects for An Arabic Speech Synthesis SystemSynthesis System

Prepared by:Waleed Mohamed Azmy

Under Supervision:Prof. Dr. Mahmoud Ismail ShomanDr. Sherif Mahdy Abdou

Page 2: The Creation of Emotional Effects for An Arabic Speech Synthesis System Prepared by: Waleed Mohamed Azmy Under Supervision: Prof. Dr. Mahmoud Ismail Shoman

AgendaAgendaMotivations & ApplicationsEmotional Synthesis ApproachesUnit Selection & Blending Data

ApproachArabic Speech Database & ChallengesFestival – Speech Synthesis

framework Arabic Voice BuildingProposed Utterance StructureProposed Target Cost FunctionSystem Evaluation & ResultsConclusion & Future Works

Page 3: The Creation of Emotional Effects for An Arabic Speech Synthesis System Prepared by: Waleed Mohamed Azmy Under Supervision: Prof. Dr. Mahmoud Ismail Shoman

Motivations & ApplicationsMotivations & ApplicationsEmotions are inseparable

components of the natural human speech.

Because of that, the level of human speech can only be achieved with the ability to synthesize emotions.

As the tremendous increasing in the demand of speech synthesis systems, Emotional synthesis become one of the major tends in this field.

Page 4: The Creation of Emotional Effects for An Arabic Speech Synthesis System Prepared by: Waleed Mohamed Azmy Under Supervision: Prof. Dr. Mahmoud Ismail Shoman

Motivations & ApplicationsMotivations & ApplicationsEmotional speech synthesis can be

applied to different applications like:◦Spoken Dialogue Systems◦Customer-care centers◦Task planning◦Tutorial systems◦Automated agents

Future trends are towards approaching Artificial Intelligence

Page 5: The Creation of Emotional Effects for An Arabic Speech Synthesis System Prepared by: Waleed Mohamed Azmy Under Supervision: Prof. Dr. Mahmoud Ismail Shoman

Emotional Synthesis Emotional Synthesis ApproachesApproachesAttempts to add emotion effects to

synthesised speech have existed for more than a decade.

Most emotional synthesis approaches inherent properties of the various normal synthesis techniques used.

Different emotional synthesis techniques provide control over acoustic parameters to very different degrees.◦ Important emotional related acoustic

parameters like: Pitch, Duration, speaking rate …etc

Page 6: The Creation of Emotional Effects for An Arabic Speech Synthesis System Prepared by: Waleed Mohamed Azmy Under Supervision: Prof. Dr. Mahmoud Ismail Shoman

Emotional Synthesis Emotional Synthesis ApproachesApproachesEmotional Formant Synthesis

◦Also known as rule-based synthesis◦No human speech recordings are

involved at run time.◦The resulting speech sounds relatively

unnatural and “robot-like”.Emotional Diphone Concatenation

◦Use Diphone concatenation on Emotional recordings

◦A majority reports shows a degree of success in conveying some emotions.

◦May harm voice quality

Page 7: The Creation of Emotional Effects for An Arabic Speech Synthesis System Prepared by: Waleed Mohamed Azmy Under Supervision: Prof. Dr. Mahmoud Ismail Shoman

Emotional Synthesis Emotional Synthesis ApproachesApproachesHMM-Based Parametric synthesis

◦Use HMM models to change speaking style and emotional expression of the synthetic speech.

◦Train models on speech produced with different speaking styles.

◦Requires several hundreds of sentences spoken in a different speaking style.

Emotional-Based Unit-Selection◦Use variable length units◦Units that are similar to the given target

are likely to be selected. ◦ It is the used approach in our proposed

system.

Page 8: The Creation of Emotional Effects for An Arabic Speech Synthesis System Prepared by: Waleed Mohamed Azmy Under Supervision: Prof. Dr. Mahmoud Ismail Shoman

Emotional Unit SelectionEmotional Unit SelectionUnits are selected from large

databases of natural speech.The quality of unit selection

synthesis depends heavily on the quality of the speaker and on the coverage of the database.

The quality of the synthesis relies on the fact that little or no signal processing is done on the selected units.

Two major cost functions are used◦ Joint Cost function◦Target Cost function

Page 9: The Creation of Emotional Effects for An Arabic Speech Synthesis System Prepared by: Waleed Mohamed Azmy Under Supervision: Prof. Dr. Mahmoud Ismail Shoman

Blending Data ApproachBlending Data ApproachTwo main techniques of building emotional

voices for unit selection◦ Tiering techniques

Limited domain synthesis. Requires a large number of databases for different

emotions.◦ Blending techniques

General purpose application. All databases are located in one data pool used for

selection. … used technique in our system.

In blending approach, The system will choose the units from only one Database.

Requires careful selection criterion to match target speaking style.

Page 10: The Creation of Emotional Effects for An Arabic Speech Synthesis System Prepared by: Waleed Mohamed Azmy Under Supervision: Prof. Dr. Mahmoud Ismail Shoman

Arabic Speech DatabaseArabic Speech DatabaseIn our system we used RDI TTS Saudi

speaker database.This database consists of 10 hours of

recording with neutral emotion and one hour of recordings for four different emotions that are sadness, happiness, surprise and questioning.

The EGG signal is also recorded to support pitch marking during the synthesis process.

The database is for male speaker sampled with 16 kHZ sampling rate.

HMMs based Viterbi alignment procedure is used to produce the phone level segmentation boundaries

Page 11: The Creation of Emotional Effects for An Arabic Speech Synthesis System Prepared by: Waleed Mohamed Azmy Under Supervision: Prof. Dr. Mahmoud Ismail Shoman

Festival – Speech Synthesis Festival – Speech Synthesis frameworkframeworkThe Festival TTS system was developed

at the University of Edinburgh by Alan Black and Paul Taylor.

Festival is primarily a research toolkit for speech synthesis (TTS).

It provides a state-of-the-art unit selection synthesis module called “Multisyn”.

Festival uses a data structure called an utterance structure that consists of items and relations.

An utterance represents some chunk of text that is to be rendered as speech

Page 12: The Creation of Emotional Effects for An Arabic Speech Synthesis System Prepared by: Waleed Mohamed Azmy Under Supervision: Prof. Dr. Mahmoud Ismail Shoman

Festival – Speech Synthesis Festival – Speech Synthesis frameworkframeworkExample of utterance structure:

Page 13: The Creation of Emotional Effects for An Arabic Speech Synthesis System Prepared by: Waleed Mohamed Azmy Under Supervision: Prof. Dr. Mahmoud Ismail Shoman

Arabic Voice BuildingArabic Voice BuildingVoice building phase is one of the major

steps in building a unit selection synthesizer.

The voice building toolkit that comes with festival has been used for building our Arabic Emotional voice. The following steps were used:1. Generate power factors and wave file

normalization2. Using EGGs for accurate pitch marking

extraction3. Generating LPC and residuals from wave files4. Generate MFCCs, pitch values and spectral

coefficients

Page 14: The Creation of Emotional Effects for An Arabic Speech Synthesis System Prepared by: Waleed Mohamed Azmy Under Supervision: Prof. Dr. Mahmoud Ismail Shoman

Arabic Voice Building(Pitch Arabic Voice Building(Pitch marking)marking)In unit selection; Accurate estimation

of pitchmarks is necessary for pitch modification to assure optimal quality of synthetic speech.

We used EGG signal to extract pitchmarks.

Pitchmarking enhancements has been carried out by matrix optimization process.

The low pass filter and high pass filter cut-off frequencies have been chosen for the optimization process.

Page 15: The Creation of Emotional Effects for An Arabic Speech Synthesis System Prepared by: Waleed Mohamed Azmy Under Supervision: Prof. Dr. Mahmoud Ismail Shoman

Arabic Voice Building(Pitch Arabic Voice Building(Pitch marking)marking)Example of the default

parameters of the pitchmarking application versus the optimized ones.

Default Pitchmarking

Optimized Pitchmarking

Page 16: The Creation of Emotional Effects for An Arabic Speech Synthesis System Prepared by: Waleed Mohamed Azmy Under Supervision: Prof. Dr. Mahmoud Ismail Shoman

Proposed Utterance Proposed Utterance StructureStructureThe HMM-based alignment

transcription is converted to the utterance structure.

ASMO-449 is used to transliterate the Arabic text to English characters in the utterance files.

The utterance structures of both the utterances in the training databases and the target utterances is changed to carry emotional information.

A new feature called emotion has been added to the word item type in the utterance structure.

Page 17: The Creation of Emotional Effects for An Arabic Speech Synthesis System Prepared by: Waleed Mohamed Azmy Under Supervision: Prof. Dr. Mahmoud Ismail Shoman

Proposed Utterance Proposed Utterance StructureStructureThe proposed system is designed

to work with three emotional state (normal, sad and question).

So, The emotion feature takes one of three emotional values.1. Normal Normal state2. Sad Sad emotion state3. Question Question emotion sate

Page 18: The Creation of Emotional Effects for An Arabic Speech Synthesis System Prepared by: Waleed Mohamed Azmy Under Supervision: Prof. Dr. Mahmoud Ismail Shoman

Proposed Target Cost Proposed Target Cost FunctionFunctionThe target cost is a weighted sum

of functions that check if features in the target utterance match those features in the candidate utterance.

The standard target cost in festival does not contain any computation differences for emotional state of the utterance.

A new emotional target cost is proposed

Page 19: The Creation of Emotional Effects for An Arabic Speech Synthesis System Prepared by: Waleed Mohamed Azmy Under Supervision: Prof. Dr. Mahmoud Ismail Shoman

Proposed Emotional Target Proposed Emotional Target Cost FunctionCost Function

)u,(t Cw=) u,(t C iitj

p

1j=

tii

t

jThe target cost:

The Emotional target cost:

)()(1

)()(0{)u,(tC ii

temo emouemot

emouemot

ii

ii

Page 20: The Creation of Emotional Effects for An Arabic Speech Synthesis System Prepared by: Waleed Mohamed Azmy Under Supervision: Prof. Dr. Mahmoud Ismail Shoman

Proposed Target Cost Proposed Target Cost FunctionFunctionThe algorithm in general favors

units that are similar in the emotional sate or classified to be emotional in a two stages of penalties.

The tuning weighting factor of the emotional target cost is optimized by try-and-error to find appropriate value.

Page 21: The Creation of Emotional Effects for An Arabic Speech Synthesis System Prepared by: Waleed Mohamed Azmy Under Supervision: Prof. Dr. Mahmoud Ismail Shoman

System EvaluationSystem EvaluationSix sentences were synthesized for each

emotional state.The Sentences were from the news

websites, usual conversations and the holy Qur’an.

Two major set of evaluation used◦ Deterministic Evaluation◦ Perceptual Evaluation

In deterministic evaluation tests the emotional classifier system Emovoice is used across the output utterance.

Emovoice is trained on the Arabic data using SVM model and its complete feature set.

Page 22: The Creation of Emotional Effects for An Arabic Speech Synthesis System Prepared by: Waleed Mohamed Azmy Under Supervision: Prof. Dr. Mahmoud Ismail Shoman

System EvaluationSystem EvaluationIn perceptual tests, 15

participant have involvedTwo types of listening tests were

performed in order to evaluate the system perceptually.◦Test intelligibility

The listener was asked to type in what they heard

Word Error Rate (WER) is computed

Page 23: The Creation of Emotional Effects for An Arabic Speech Synthesis System Prepared by: Waleed Mohamed Azmy Under Supervision: Prof. Dr. Mahmoud Ismail Shoman

System EvaluationSystem Evaluation◦Test Naturalness & Emotiveness

The actual experiment took place on a computer where subjects had to listen to the synthesized sentences using headphones.

The listeners are asked to rate the quality of the output voice in terms of (Naturalness & Emotiveness)

Participants were asked to give ratings between 1 and 4 for poor, acceptable, good and excellent respectively.

Page 24: The Creation of Emotional Effects for An Arabic Speech Synthesis System Prepared by: Waleed Mohamed Azmy Under Supervision: Prof. Dr. Mahmoud Ismail Shoman

System Evaluation - System Evaluation - ResultsResultsThe confusion matrix of the

classifier output emotion and the target emotional state of the utterance is

Classified As

Normal Sad Question

Normal 5 1 0

Sad 0 6 0

Question 1 0 5

Page 25: The Creation of Emotional Effects for An Arabic Speech Synthesis System Prepared by: Waleed Mohamed Azmy Under Supervision: Prof. Dr. Mahmoud Ismail Shoman

System Evaluation - System Evaluation - ResultsResultsThe results of the WER is shown

in figure.It shows a maximum average of

8% in sad emotionWER

Page 26: The Creation of Emotional Effects for An Arabic Speech Synthesis System Prepared by: Waleed Mohamed Azmy Under Supervision: Prof. Dr. Mahmoud Ismail Shoman

System Evaluation - System Evaluation - ResultsResultsThe descriptive statistics of the

naturalness and emotiveness ratings is summarized..

Rating Mean

Standard Deviation

Normal 2.72 1.01

Sad 2.71 1.02

Question 2.56 0.96

All 2.67 1.06

Rating Mean

Standard Deviation

Normal 3.28 0.71

Sad 2.76 1.09

Question 3.08 0.81

All 3.04 0.95

Naturalness Emotiveness

Page 27: The Creation of Emotional Effects for An Arabic Speech Synthesis System Prepared by: Waleed Mohamed Azmy Under Supervision: Prof. Dr. Mahmoud Ismail Shoman

System Evaluation - System Evaluation - ResultsResultsThe overall mean of naturalness ratings

is 2.7 which approach a good quality naturalness.

The average ratings of overall emotiveness are 3.1 which indicate good emotive state of the synthesized speech

The naturalness and emotiveness ratings for question emotion sentences has lower mean value and high variance which means that they were not –to some extent- recognized as natural human speech. However it shows a good emotiveness scores.

Page 28: The Creation of Emotional Effects for An Arabic Speech Synthesis System Prepared by: Waleed Mohamed Azmy Under Supervision: Prof. Dr. Mahmoud Ismail Shoman

ConclusionsConclusionsThe main goal of this research was

to develop an Emotional Arabic TTS voice.

This research focused on three important emotional sates; normal, sad and questions.

According to the different tests performed on the system, it shows promising results. At most the participants feel acceptable natural voice with clear good emotive state.

Page 29: The Creation of Emotional Effects for An Arabic Speech Synthesis System Prepared by: Waleed Mohamed Azmy Under Supervision: Prof. Dr. Mahmoud Ismail Shoman

Future worksFuture worksIt is recommended to increase the

duration of acted or real emotional utterances in the RDI Arabic speech database.

However the work done for accurate pitch-marking, some further enhancements are needed especially for question speech utterances.

Optimizing the pitch modification module in festival for better concatenation with different emotions.

Use emotional speech conversion based on signal processing and feature modelling technologies. The initial key features are commonly known to be pitch and duration

Page 30: The Creation of Emotional Effects for An Arabic Speech Synthesis System Prepared by: Waleed Mohamed Azmy Under Supervision: Prof. Dr. Mahmoud Ismail Shoman

QuestionsQuestions