the creation of emotional effects for an arabic speech synthesis system prepared by: waleed mohamed...
Post on 12-Jan-2016
214 Views
Preview:
TRANSCRIPT
The Creation of Emotional The Creation of Emotional Effects for An Arabic Speech Effects for An Arabic Speech Synthesis SystemSynthesis System
Prepared by:Waleed Mohamed Azmy
Under Supervision:Prof. Dr. Mahmoud Ismail ShomanDr. Sherif Mahdy Abdou
AgendaAgendaMotivations & ApplicationsEmotional Synthesis ApproachesUnit Selection & Blending Data
ApproachArabic Speech Database & ChallengesFestival – Speech Synthesis
framework Arabic Voice BuildingProposed Utterance StructureProposed Target Cost FunctionSystem Evaluation & ResultsConclusion & Future Works
Motivations & ApplicationsMotivations & ApplicationsEmotions are inseparable
components of the natural human speech.
Because of that, the level of human speech can only be achieved with the ability to synthesize emotions.
As the tremendous increasing in the demand of speech synthesis systems, Emotional synthesis become one of the major tends in this field.
Motivations & ApplicationsMotivations & ApplicationsEmotional speech synthesis can be
applied to different applications like:◦Spoken Dialogue Systems◦Customer-care centers◦Task planning◦Tutorial systems◦Automated agents
Future trends are towards approaching Artificial Intelligence
Emotional Synthesis Emotional Synthesis ApproachesApproachesAttempts to add emotion effects to
synthesised speech have existed for more than a decade.
Most emotional synthesis approaches inherent properties of the various normal synthesis techniques used.
Different emotional synthesis techniques provide control over acoustic parameters to very different degrees.◦ Important emotional related acoustic
parameters like: Pitch, Duration, speaking rate …etc
Emotional Synthesis Emotional Synthesis ApproachesApproachesEmotional Formant Synthesis
◦Also known as rule-based synthesis◦No human speech recordings are
involved at run time.◦The resulting speech sounds relatively
unnatural and “robot-like”.Emotional Diphone Concatenation
◦Use Diphone concatenation on Emotional recordings
◦A majority reports shows a degree of success in conveying some emotions.
◦May harm voice quality
Emotional Synthesis Emotional Synthesis ApproachesApproachesHMM-Based Parametric synthesis
◦Use HMM models to change speaking style and emotional expression of the synthetic speech.
◦Train models on speech produced with different speaking styles.
◦Requires several hundreds of sentences spoken in a different speaking style.
Emotional-Based Unit-Selection◦Use variable length units◦Units that are similar to the given target
are likely to be selected. ◦ It is the used approach in our proposed
system.
Emotional Unit SelectionEmotional Unit SelectionUnits are selected from large
databases of natural speech.The quality of unit selection
synthesis depends heavily on the quality of the speaker and on the coverage of the database.
The quality of the synthesis relies on the fact that little or no signal processing is done on the selected units.
Two major cost functions are used◦ Joint Cost function◦Target Cost function
Blending Data ApproachBlending Data ApproachTwo main techniques of building emotional
voices for unit selection◦ Tiering techniques
Limited domain synthesis. Requires a large number of databases for different
emotions.◦ Blending techniques
General purpose application. All databases are located in one data pool used for
selection. … used technique in our system.
In blending approach, The system will choose the units from only one Database.
Requires careful selection criterion to match target speaking style.
Arabic Speech DatabaseArabic Speech DatabaseIn our system we used RDI TTS Saudi
speaker database.This database consists of 10 hours of
recording with neutral emotion and one hour of recordings for four different emotions that are sadness, happiness, surprise and questioning.
The EGG signal is also recorded to support pitch marking during the synthesis process.
The database is for male speaker sampled with 16 kHZ sampling rate.
HMMs based Viterbi alignment procedure is used to produce the phone level segmentation boundaries
Festival – Speech Synthesis Festival – Speech Synthesis frameworkframeworkThe Festival TTS system was developed
at the University of Edinburgh by Alan Black and Paul Taylor.
Festival is primarily a research toolkit for speech synthesis (TTS).
It provides a state-of-the-art unit selection synthesis module called “Multisyn”.
Festival uses a data structure called an utterance structure that consists of items and relations.
An utterance represents some chunk of text that is to be rendered as speech
Festival – Speech Synthesis Festival – Speech Synthesis frameworkframeworkExample of utterance structure:
Arabic Voice BuildingArabic Voice BuildingVoice building phase is one of the major
steps in building a unit selection synthesizer.
The voice building toolkit that comes with festival has been used for building our Arabic Emotional voice. The following steps were used:1. Generate power factors and wave file
normalization2. Using EGGs for accurate pitch marking
extraction3. Generating LPC and residuals from wave files4. Generate MFCCs, pitch values and spectral
coefficients
Arabic Voice Building(Pitch Arabic Voice Building(Pitch marking)marking)In unit selection; Accurate estimation
of pitchmarks is necessary for pitch modification to assure optimal quality of synthetic speech.
We used EGG signal to extract pitchmarks.
Pitchmarking enhancements has been carried out by matrix optimization process.
The low pass filter and high pass filter cut-off frequencies have been chosen for the optimization process.
Arabic Voice Building(Pitch Arabic Voice Building(Pitch marking)marking)Example of the default
parameters of the pitchmarking application versus the optimized ones.
Default Pitchmarking
Optimized Pitchmarking
Proposed Utterance Proposed Utterance StructureStructureThe HMM-based alignment
transcription is converted to the utterance structure.
ASMO-449 is used to transliterate the Arabic text to English characters in the utterance files.
The utterance structures of both the utterances in the training databases and the target utterances is changed to carry emotional information.
A new feature called emotion has been added to the word item type in the utterance structure.
Proposed Utterance Proposed Utterance StructureStructureThe proposed system is designed
to work with three emotional state (normal, sad and question).
So, The emotion feature takes one of three emotional values.1. Normal Normal state2. Sad Sad emotion state3. Question Question emotion sate
Proposed Target Cost Proposed Target Cost FunctionFunctionThe target cost is a weighted sum
of functions that check if features in the target utterance match those features in the candidate utterance.
The standard target cost in festival does not contain any computation differences for emotional state of the utterance.
A new emotional target cost is proposed
Proposed Emotional Target Proposed Emotional Target Cost FunctionCost Function
)u,(t Cw=) u,(t C iitj
p
1j=
tii
t
jThe target cost:
The Emotional target cost:
)()(1
)()(0{)u,(tC ii
temo emouemot
emouemot
ii
ii
Proposed Target Cost Proposed Target Cost FunctionFunctionThe algorithm in general favors
units that are similar in the emotional sate or classified to be emotional in a two stages of penalties.
The tuning weighting factor of the emotional target cost is optimized by try-and-error to find appropriate value.
System EvaluationSystem EvaluationSix sentences were synthesized for each
emotional state.The Sentences were from the news
websites, usual conversations and the holy Qur’an.
Two major set of evaluation used◦ Deterministic Evaluation◦ Perceptual Evaluation
In deterministic evaluation tests the emotional classifier system Emovoice is used across the output utterance.
Emovoice is trained on the Arabic data using SVM model and its complete feature set.
System EvaluationSystem EvaluationIn perceptual tests, 15
participant have involvedTwo types of listening tests were
performed in order to evaluate the system perceptually.◦Test intelligibility
The listener was asked to type in what they heard
Word Error Rate (WER) is computed
System EvaluationSystem Evaluation◦Test Naturalness & Emotiveness
The actual experiment took place on a computer where subjects had to listen to the synthesized sentences using headphones.
The listeners are asked to rate the quality of the output voice in terms of (Naturalness & Emotiveness)
Participants were asked to give ratings between 1 and 4 for poor, acceptable, good and excellent respectively.
System Evaluation - System Evaluation - ResultsResultsThe confusion matrix of the
classifier output emotion and the target emotional state of the utterance is
Classified As
Normal Sad Question
Normal 5 1 0
Sad 0 6 0
Question 1 0 5
System Evaluation - System Evaluation - ResultsResultsThe results of the WER is shown
in figure.It shows a maximum average of
8% in sad emotionWER
System Evaluation - System Evaluation - ResultsResultsThe descriptive statistics of the
naturalness and emotiveness ratings is summarized..
Rating Mean
Standard Deviation
Normal 2.72 1.01
Sad 2.71 1.02
Question 2.56 0.96
All 2.67 1.06
Rating Mean
Standard Deviation
Normal 3.28 0.71
Sad 2.76 1.09
Question 3.08 0.81
All 3.04 0.95
Naturalness Emotiveness
System Evaluation - System Evaluation - ResultsResultsThe overall mean of naturalness ratings
is 2.7 which approach a good quality naturalness.
The average ratings of overall emotiveness are 3.1 which indicate good emotive state of the synthesized speech
The naturalness and emotiveness ratings for question emotion sentences has lower mean value and high variance which means that they were not –to some extent- recognized as natural human speech. However it shows a good emotiveness scores.
ConclusionsConclusionsThe main goal of this research was
to develop an Emotional Arabic TTS voice.
This research focused on three important emotional sates; normal, sad and questions.
According to the different tests performed on the system, it shows promising results. At most the participants feel acceptable natural voice with clear good emotive state.
Future worksFuture worksIt is recommended to increase the
duration of acted or real emotional utterances in the RDI Arabic speech database.
However the work done for accurate pitch-marking, some further enhancements are needed especially for question speech utterances.
Optimizing the pitch modification module in festival for better concatenation with different emotions.
Use emotional speech conversion based on signal processing and feature modelling technologies. The initial key features are commonly known to be pitch and duration
QuestionsQuestions
top related