Copyright 2007, Toshiba Corporation.
How (not) to Select Your Voice Corpus: Random Selection vs. Phonologically Balanced
Tanya Lambert, Norbert Braunschweiler, Sabine Buchholz
6th ISCA Workshop on Speech Synthesis, Bonn, Germany, 22-24 August 2007
Overview
Text selection for a TTS voice
Random sub-corpus
Phonologically balanced sub-corpus
Phonetic and phonological inventory of full corpus and its sub-corpora
Phonetic and phonological coverage of units in test sentences with respect to the full corpus and its sub-corpora
Voice building - automatic annotation and training
Objective and subjective evaluations
Conclusions
Selection of Text for a TTS Voice
Voice preparation for a TTS system is affected by:
Text domain from which text is selected
Text annotations (phonetic, phonological, prosodic, syntactic)
The linguistic and signal processing capabilities of the TTS system
Unit selection method and the type of units selected for speech synthesis
Corpus training
Speech annotation (automatic/manual; phonetic details, post lexical effects)
Other factors (time and financial resources, voice talent, recording quality, the target audience of a TTS application, etc.)
Text Selection
Our case study tries to answer the following question: what is the effect of different script selection methods on a half-phone unit selection system, on automatic corpus annotation, and on corpus training?
Full corpus: The ATR American English Speech Corpus for Speech Synthesis (~ 8 h) used in this year’s Blizzard Challenge.
Random sub-corpus (0.8 h); Phonologically-rich sub-corpus (0.8 h)
[Diagram: from the full corpus (~8 h), a phonologically balanced selection yields the Phonbal sub-corpus and a random selection yields the Random sub-corpus.]
Phonologically-Rich Sub-Corpus
The Phonbal sub-corpus was selected as follows:
- The full corpus was phonetically and phonologically transcribed.
- A set cover algorithm selected sentences to cover the lexical units of the full corpus (with stress in consonants removed), yielding Sub-corpus A (1133 sentences); 594 of these sentences each covered only 1 unit.
- Sub-corpus A was cut to the 539 sentences above the cut point.
- Sub-corpus B was added: sentences from the full corpus with an emphasis on interrogative and exclamatory sentences, multisyllabic phrases, and consonant clusters before and after silence.
- Sub-corpus A (539 sentences) + Sub-corpus B = Phonbal sub-corpus (728 sentences, ~2906 sec).
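The slides name a set cover algorithm but not its exact form; a common choice is the greedy approximation, sketched below (the function name and the toy unit labels are illustrative, not from the paper):

```python
def greedy_set_cover(sentences, target_units):
    """Greedily pick sentences until all target units are covered
    (or no remaining sentence adds new coverage).

    sentences: list of (sentence_id, set_of_units) pairs.
    """
    uncovered = set(target_units)
    selected = []
    remaining = list(sentences)
    while uncovered and remaining:
        # Pick the sentence covering the most still-uncovered units.
        best = max(remaining, key=lambda s: len(s[1] & uncovered))
        if not (best[1] & uncovered):
            break  # nothing left adds new coverage
        selected.append(best[0])
        uncovered -= best[1]
        remaining.remove(best)
    return selected

# Toy example: units are diphone strings.
sents = [("s1", {"a-b", "b-c"}), ("s2", {"b-c", "c-d"}), ("s3", {"a-b"})]
print(greedy_set_cover(sents, {"a-b", "b-c", "c-d"}))  # → ['s1', 's2']
```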
Random Sub-Corpus
The Random sub-corpus was selected as follows:
- The sentences of the full corpus were put into a randomized sequence, and sentences including foreign words were removed.
- Sentences were taken from this sequence until just under the target duration: 686 sentences (< 2914 sec).
- Adding 1 more sentence gave the final sub-corpus: 687 sentences, ~2914 sec.
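The random selection described above can be sketched as follows (the field names, the seed, and the function name are our assumptions, not the authors' script):

```python
import random

def random_subcorpus(sentences, target_sec, seed=0):
    """Shuffle sentences, drop those containing foreign words, and
    accumulate until the target duration is reached or exceeded.

    sentences: list of dicts with keys "id", "dur", "has_foreign".
    """
    rng = random.Random(seed)  # fixed seed for reproducibility
    pool = [s for s in sentences if not s["has_foreign"]]
    rng.shuffle(pool)
    picked, total = [], 0.0
    for s in pool:
        picked.append(s["id"])
        total += s["dur"]
        if total >= target_sec:  # the last sentence may push us over
            break
    return picked, total
```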
Textual and Duration Characteristics of Corpora
              Full     Arctic   Phonbal   Random
seconds       28,591   2,914    2,906     2,914
sentences     6,579    1,032    728       687
words         79,182   9,196    8,156     8,094
words/sent.   12.0     8.9      11.2      11.8
% sent. with:
  1-9 words   37.7     54.9     41.0      38.6
  10-15 words 27.6     45.1     18.6      26.9
  >15 words   34.8     -        40.4      34.5
'?'           868      1        96        94
'!'           4        -        -         1
','           3,977    430      452       410
';'           30       6        4         3
':'           17       -        -         -
Corpus Selection – Considerations
- Selection of text based on broad phonetic transcription may be insufficient.
- Inclusion of phonological, prosodic and syntactic markings: how to make it effective for a half-phone unit selection system?

Distribution of Unit Types in Full Corpus and its Sub-Corpora

Unit Types                   Full    Arctic   Phonbal   Random
diph. (no stress)            1607    1385     1510      1322
lex. diphones                4332    2716     3306      2735
lex. triphones               17032   7945     8716      8144
sil_CV clusters (no stress)  104     42       46        43
VC_sil clusters (no stress)  184     84       100       75
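Coverage comparisons over unit-type counts like these reduce to set arithmetic; a minimal sketch (the function name and toy figures are ours):

```python
def coverage_pct(full_units, sub_units):
    """Percentage of the full corpus's unit types that also occur
    in a sub-corpus."""
    full, sub = set(full_units), set(sub_units)
    return 100.0 * len(full & sub) / len(full)

# From the table: Phonbal covers 1510 of the 1607 stressless diphone
# types of the full corpus.
print(round(100.0 * 1510 / 1607, 1))  # → 94.0
```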
Percentage Distribution of Units in Full Corpus and its Sub-corpora
[Two bar charts showing, for the Arctic, Random, and Phonologically rich sub-corpora, the percentage of the full corpus's unit types covered: diphones (no stress), lexical diphones, lexical triphones, sil_CV clusters, and VC_sil clusters.]
Distribution of Unit Types in Test Sentences
[Bar chart: occurrences of diphone (no stress), lexical diphone, and lexical triphone types in 400 test sentences, 100 each from: conv = conversational; mrt = modified rhyme test; news = news texts; novel = sentences from a novel; sus = semantically unpredictable sentences.]
Distribution of Lexical Diphone Types per Corpus per Text Genre
[Bar chart: occurrences of lexical diphone types per text genre (conv, mrt, news, novel, sus) in the Full corpus and the Arctic, Phonologically rich, and Random sub-corpora.]
Missing Diphone Types from Each Corpus in Relation to Test Sentences
[Two bar charts: numbers of lexical diphone types and of diphone types (no stress) occurring in the test sentences of each genre (conv, mrt, news, novel, sus) but missing from the Full corpus, Arctic, Random, and Phonologically rich corpora.]
Diphone Types in Each Corpus but not Required in Test Sentences
[Two bar charts: numbers of lexical diphone types and of diphone types (no stress) present in each corpus (Full, Arctic, Random, Phonologically rich) but not required by the test sentences of each genre (conv, mrt, news, novel, sus).]
Voice Building – Automatic Annotation and Training
Synthesis voices were created from both the Phonbal and the Random corpus.

Automatic synthesis voice creation encompasses:
- Grapheme-to-phoneme conversion
- Automatic phone alignment
- Automatic prosody annotation
- Automatic prosody training (duration, F0, pauses, etc.)
- Speech unit database creation

Automatic phone alignment:
- Depends on the quality of grapheme-to-phoneme conversion
- Depends on the output of text normalisation
- Uses HMMs with a flat start, i.e. depends on corpus size
- Respects pronunciation variants
- Acoustic model topology: three-state, left-to-right with no skips, context-independent, single-Gaussian monophone HMMs
Voice Building – Automatic Annotation and Training
Automatic prosody annotation:
- Prosodizer creates ToBI markup for each sentence
- Rule-based
- Depends on the quality of the phone alignments
- Depends on the quality of the text analysis module, i.e. uses PoS, etc.

Automatic prosody training:
- Depends on phone alignments, ToBI markup, and text analysis
- Creates prediction models for:
  • Phone duration
  • Prosodic chunk boundaries
  • Presence or absence of pauses
  • The length of previously predicted pauses
  • The accent property of each word: de-accented, accented, high
  • The F0 contour of each word

The quality of the predicted prosody is an important factor for overall voice quality.
Objective Evaluation – how good are the phone alignments?
Comparison of phone alignments in the Phonbal and Random sub-corpora against those in the Full corpus
Phone alignment of Random corpus is slightly better than that of Phonbal
Metric                    Phonbal   Random
Overlap Rate              95.26     96.35
RMSE of boundaries        6.3 ms    3.3 ms
boundaries within 5 ms    86.6 %    91.8 %
boundaries within 10 ms   97.1 %    99.1 %
boundaries within 20 ms   99.1 %    99.9 %
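Metrics of this kind can be computed from paired boundary times; a minimal sketch, assuming boundaries are given in milliseconds (the function name is ours):

```python
import math

def boundary_metrics(auto_ms, ref_ms, tolerances=(5, 10, 20)):
    """RMSE of boundary placement and the share of boundaries falling
    within each tolerance (ms), against reference boundaries."""
    errs = [a - r for a, r in zip(auto_ms, ref_ms)]
    rmse = math.sqrt(sum(e * e for e in errs) / len(errs))
    within = {t: 100.0 * sum(abs(e) <= t for e in errs) / len(errs)
              for t in tolerances}
    return rmse, within
```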
Objective Evaluation – Accuracy of Prosody Prediction
Comparison of the accuracy of pause prediction, prosodic chunk prediction, and word accent prediction by the modules trained on the Phonbal or on the Random sub-corpus, against the automatic markup of 1000 sentences not in either sub-corpus
Some prosody modules trained on Random corpus are better
                    Phonbal   Random
Chunks   Precision  58.9      56.3
         Recall     34.2      38.7
Pauses   Precision  63.1      63.4
         Recall     34.1      38.0
acc      Precision  69.7      69.5
         Recall     78.4      78.9
high     Precision  54.7      57.1
         Recall     38.6      41.1
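Precision and recall over predicted events (pauses, chunk boundaries, accents) can be computed as below, under the assumption that events are identified by their position (names are ours, not the evaluation code used in the paper):

```python
def precision_recall(predicted, reference):
    """Precision and recall (in %) of predicted event positions
    against reference event positions."""
    pred, ref = set(predicted), set(reference)
    tp = len(pred & ref)  # true positives: events found in both
    precision = 100.0 * tp / len(pred) if pred else 0.0
    recall = 100.0 * tp / len(ref) if ref else 0.0
    return precision, recall
```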
Subjective Evaluation – Preference Listening Test
Subject   Phonbal   Random
Non-American listeners
  1       20        33
  2       21        32
  3       24        29
  4       25        28
  All     90        122
American English listeners
  1       21        32
  2       21        32
  3       16        37
  4       23        30
  5       25        28
  All     106       159
Result of preference test comparing 53 test sentences synthesized with voice Phonbal or voice Random
Two groups of listeners: non-American listeners and native American English listeners
Columns 2 and 3 show the number of times each subject preferred each voice
Each of the 9 subjects preferred the Random voice
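The slides report raw preference counts only; whether a split like 106 vs. 159 is statistically significant can be checked with a two-sided binomial sign test (our addition, not part of the paper's evaluation):

```python
from math import comb

def sign_test_p(wins_a, wins_b):
    """Two-sided binomial sign test: probability of a preference split
    at least this extreme under a 50/50 null hypothesis."""
    n, k = wins_a + wins_b, max(wins_a, wins_b)
    # Upper-tail probability of seeing k or more wins out of n.
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# American listeners: 106 Phonbal vs 159 Random preferences.
print(sign_test_p(106, 159))
```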
Conclusions
Two synthesis voices were compared in this study:
- The two voices are based on two separate selections of sentences from the same source corpus.
- The Random corpus was created by a random selection of sentences from the source corpus.
- The Phonbal corpus was created by selecting sentences which optimise its phonetic and phonological coverage.

Listeners consistently preferred the TTS voice built with our system from the Random corpus.

Investigation of the differences between the two sub-corpora revealed:
- Phonbal has better diphone and lexical diphone coverage.
- Random has better phone alignments.
- Random has slightly better prosody prediction performance.
Future
Is the prosody prediction performance only due to better automatic prosody annotation which is due to better phone alignment?
Is the random selection inherently better suited to train prosody models on, e.g. because its distribution of sentence lengths is not as skewed as the Phonbal one?
What exactly is the relation between phone frequency and alignment accuracy?
Why does the Random corpus have so much better pause alignment when it contains fewer pauses?
Is it worth trying to construct some kind of prosodically balanced corpus to boost the performance of the trained modules, or would that result in a similar detrimental effect on alignment accuracy?