speech synthesis : up from state-of-the-art corpus …
TRANSCRIPT
![Page 1: SPEECH SYNTHESIS : UP FROM STATE-OF-THE-ART CORPUS …](https://reader031.vdocuments.net/reader031/viewer/2022020620/61e4396866a80454c8550bc1/html5/thumbnails/1.jpg)
eNTERFACEeNTERFACE’’05, Friday Aug. 5th05, Friday Aug. 5th
SPEECH SYNTHESIS : SPEECH SYNTHESIS : UP FROM STATEUP FROM STATE--OFOF--THETHE--ART ART
CORPUSCORPUS--BASED APPROACHES?BASED APPROACHES?
Provided to you equationProvided to you equation--free by:free by:
Thierry DutoitThierry [email protected]@fpms.ac.be
TCTS Lab Faculté Polytechnique de Mons Belgium
![Page 2: SPEECH SYNTHESIS : UP FROM STATE-OF-THE-ART CORPUS …](https://reader031.vdocuments.net/reader031/viewer/2022020620/61e4396866a80454c8550bc1/html5/thumbnails/2.jpg)
TTS = NLP + DSPTTS = NLP + DSP
TEXT
SPEECH
DIGITAL SIGNALPROCESSING
NATURAL LANGUAGEPROCESSING
PhonesInt/Dur
TEXT-TO-SPEECH SYNTHESIZER
NarrowPhonetic
Transcription
PhonetizationSpeech
Synthesis
Intonation/Duration
Generation
(a) (b) (c)
_ 210t 40U 55 0 173 75 173b 80 10 160i: 198 5 173 75 235…
To be or not to be, that is the
question.
![Page 3: SPEECH SYNTHESIS : UP FROM STATE-OF-THE-ART CORPUS …](https://reader031.vdocuments.net/reader031/viewer/2022020620/61e4396866a80454c8550bc1/html5/thumbnails/3.jpg)
ChallengesChallenges
• Accurate automatic phonetization (≠dictionnary look-up)• Prosody generation(i.e., intonation and phoneme
durations) must be “coherent”; easy to produce unnatural prosody
• Synthesis of phoneme sequences with corresponding prosody
– Coarticulation! (~Harris, 53)– Segmental quality should be maintained after pitch and
duration modification• Engineering
– Low design and maintenance cost
– Low computational and memory cost
– Easy adaptation to other languages
Intelligible – Natural – Cost effective
![Page 4: SPEECH SYNTHESIS : UP FROM STATE-OF-THE-ART CORPUS …](https://reader031.vdocuments.net/reader031/viewer/2022020620/61e4396866a80454c8550bc1/html5/thumbnails/4.jpg)
TTS = NLP + DSPTTS = NLP + DSP
TEXT
SPEECH
DIGITAL SIGNALPROCESSING
NATURAL LANGUAGEPROCESSING
PhonesInt/Dur
TEXT-TO-SPEECH SYNTHESIZER
NarrowPhonetic
Transcription
PhonetizationSpeech
Synthesis
Intonation/Duration
Generation
(a) (b) (c)
_ 210t 40U 55 0 173 75 173b 80 10 160i: 198 5 173 75 235…
To be or not to be, that is the
question.
![Page 5: SPEECH SYNTHESIS : UP FROM STATE-OF-THE-ART CORPUS …](https://reader031.vdocuments.net/reader031/viewer/2022020620/61e4396866a80454c8550bc1/html5/thumbnails/5.jpg)
ContentsContents
• Acoustic speech synthesis (DSP)
– Model-based (rule-based) approach
– Instance-based (concatenative) approach
• Diphone concatenation
• Corpus-based (Unit Selection) Synthesis
• Is there a future after Corpus-based
synthesis?
![Page 6: SPEECH SYNTHESIS : UP FROM STATE-OF-THE-ART CORPUS …](https://reader031.vdocuments.net/reader031/viewer/2022020620/61e4396866a80454c8550bc1/html5/thumbnails/6.jpg)
Von Von KempelenKempelen’’ss talking talking machine (1791)machine (1791)
Mouth
Nostrils
Main bellows
Small bellows
'S' pipe
'Sh' pipe
'Sh' lever'S' lever
(J.S. Liénard, LIMSI)
![Page 7: SPEECH SYNTHESIS : UP FROM STATE-OF-THE-ART CORPUS …](https://reader031.vdocuments.net/reader031/viewer/2022020620/61e4396866a80454c8550bc1/html5/thumbnails/7.jpg)
Omer DudleyOmer Dudley’’s s VoderVoder(Bell Labs, 1936)(Bell Labs, 1936)
NoiseSource
Oscillator
Resonnance Cont rol Amplifier
106 7 8 9
"Quiet "
t -dp-bk-g
Energy swit chwrist bar
VoderConsoleKeyboard
1 2 3 4
5
Pit ch-cont rolpedal
UV
V
![Page 8: SPEECH SYNTHESIS : UP FROM STATE-OF-THE-ART CORPUS …](https://reader031.vdocuments.net/reader031/viewer/2022020620/61e4396866a80454c8550bc1/html5/thumbnails/8.jpg)
John HolmesJohn Holmes’’ formant formant synthesizer (1964)synthesizer (1964)
Rule-based Synthesis
Haskins Labs (1968) DecTalk (1983)InfoVox (1983-95) LIMSI’s Polyglot (92)
Intelligibility Naturalness Mem/CPU New Voice(<100 kB)
![Page 9: SPEECH SYNTHESIS : UP FROM STATE-OF-THE-ART CORPUS …](https://reader031.vdocuments.net/reader031/viewer/2022020620/61e4396866a80454c8550bc1/html5/thumbnails/9.jpg)
ContentsContents
• Acoustic speech synthesis (DSP)
– Model-based (rule-based) approach
– Instance-based (concatenative)
approach• Diphone concatenation
• Corpus-based (Unit Selection) synthesis
• Is there a future after corpus-based
synthesis?
![Page 10: SPEECH SYNTHESIS : UP FROM STATE-OF-THE-ART CORPUS …](https://reader031.vdocuments.net/reader031/viewer/2022020620/61e4396866a80454c8550bc1/html5/thumbnails/10.jpg)
DiphoneDiphone concatenation (1977)concatenation (1977)
![Page 11: SPEECH SYNTHESIS : UP FROM STATE-OF-THE-ART CORPUS …](https://reader031.vdocuments.net/reader031/viewer/2022020620/61e4396866a80454c8550bc1/html5/thumbnails/11.jpg)
DiphoneDiphone concatenation concatenation (1968)(1968)
DiphoneDat abase
Prosody
Modificat ion
_ d o g _
50ms 80ms 160ms 70ms 50ms
F0
_d do og g _
Smooth joints
0 1000 2000 3000 4000 5000 6000 7000 8000-1
-0.50
0.51 x 104
AT&T: LPC (1980)
France Telecom : PSOLA (1990)
![Page 12: SPEECH SYNTHESIS : UP FROM STATE-OF-THE-ART CORPUS …](https://reader031.vdocuments.net/reader031/viewer/2022020620/61e4396866a80454c8550bc1/html5/thumbnails/12.jpg)
The MBROLA project (95)The MBROLA project (95)
http://tcts.fpms.ac.be/synthesis/
![Page 13: SPEECH SYNTHESIS : UP FROM STATE-OF-THE-ART CORPUS …](https://reader031.vdocuments.net/reader031/viewer/2022020620/61e4396866a80454c8550bc1/html5/thumbnails/13.jpg)
DiphoneDiphone concatenation (1977)concatenation (1977)
DiphoneDat abase
Prosody
Modificat ion
_ d o g _
50ms 80ms 160ms 70ms 50ms
F0
_d do og g _
Smooth joints
0 1000 2000 3000 4000 5000 6000 7000 8000-1
-0.50
0.51 x 104
Intelligibility Naturalness~ Mem/CPU New Voice ~(5 MB : “High DENSITY TTS”)
![Page 14: SPEECH SYNTHESIS : UP FROM STATE-OF-THE-ART CORPUS …](https://reader031.vdocuments.net/reader031/viewer/2022020620/61e4396866a80454c8550bc1/html5/thumbnails/14.jpg)
ContentsContents
• Acoustic speech synthesis (DSP)
– Model-based (rule-based) approach
– Instance-based (concatenative) approach
• Diphone concatenation
• Corpus-based (Unit Selection) synthesis
• Is there a future after corpus-based
synthesis?
![Page 15: SPEECH SYNTHESIS : UP FROM STATE-OF-THE-ART CORPUS …](https://reader031.vdocuments.net/reader031/viewer/2022020620/61e4396866a80454c8550bc1/html5/thumbnails/15.jpg)
CorpusCorpus--based synthesisbased synthesis
DiphoneDat abase
Prosody
Modificat ion
_ d o g _
50ms 80ms 160ms 70ms 50ms
F0
_d do og g _
Smooth joints
0 1000 2000 3000 4000 5000 6000 7000 8000-1
-0.50
0.51 x 104
Diphone-based synthesis
![Page 16: SPEECH SYNTHESIS : UP FROM STATE-OF-THE-ART CORPUS …](https://reader031.vdocuments.net/reader031/viewer/2022020620/61e4396866a80454c8550bc1/html5/thumbnails/16.jpg)
CorpusCorpus--based synthesisbased synthesis
(Univ. Edinburgh, 1997)
VERY LARGE
CORPUS
Prosody
Modificat ion
_ d o g _
50ms 80ms 160ms 70ms 50ms
F0
_d do og g _
Smooth joints
0 1000 2000 3000 4000 5000 6000 7000 8000-1
-0.50
0.51 x 104
Unit selection -coprus-based synthesis
(AT&T, 1998)(L&H, 1999)
(ATR, 1996)
(Loquendo, 2001)(Babel Technologies, 2003)
![Page 17: SPEECH SYNTHESIS : UP FROM STATE-OF-THE-ART CORPUS …](https://reader031.vdocuments.net/reader031/viewer/2022020620/61e4396866a80454c8550bc1/html5/thumbnails/17.jpg)
CorpusCorpus--based synthesisbased synthesisHow to get the best sequence of units for a
given utterance? Viterbi search
Use a model, but give last word to the dataOr : Choose the best, modify the least
Target j
Unit i-1 Unit i+1
Concatenation cost cc(i-1,i)
Concatenation cost cc(i,i+1)
target cost tc(j,i)
Unit i
Intelligibility Naturalness Mem/CPU ~ New Voice(1 GB : “High QUALITY TTS”)
![Page 18: SPEECH SYNTHESIS : UP FROM STATE-OF-THE-ART CORPUS …](https://reader031.vdocuments.net/reader031/viewer/2022020620/61e4396866a80454c8550bc1/html5/thumbnails/18.jpg)
ContentsContents
• Acoustic speech synthesis (DSP)
– Model-based (rule-based) approach
– Instance-based (concatenative) approach
• Diphone concatenation
• Corpus-based (Unit Selection) synthesis
• Is there a future after corpus-based
synthesis? ?
![Page 19: SPEECH SYNTHESIS : UP FROM STATE-OF-THE-ART CORPUS …](https://reader031.vdocuments.net/reader031/viewer/2022020620/61e4396866a80454c8550bc1/html5/thumbnails/19.jpg)
Speech Science?Speech Science?This time is over– planes do not flap their wings– replace experts by corpora
cf. Jelinek ’s «Each time I fire a linguist my recognition rate goes 1% higher»
1. Future milestones in speech processing will come from labs with strong commitment to solid, portable, and extensible code;2. Speech scientists and software engineers will soon be the same people.
SpokenSpoken LanguageLanguage Engineering!Engineering!ICASSP-INTERSPEECH : “Speech” synthesis “Spoken Language” Synthesis
![Page 20: SPEECH SYNTHESIS : UP FROM STATE-OF-THE-ART CORPUS …](https://reader031.vdocuments.net/reader031/viewer/2022020620/61e4396866a80454c8550bc1/html5/thumbnails/20.jpg)
![Page 21: SPEECH SYNTHESIS : UP FROM STATE-OF-THE-ART CORPUS …](https://reader031.vdocuments.net/reader031/viewer/2022020620/61e4396866a80454c8550bc1/html5/thumbnails/21.jpg)
HoweverHowever……• Engineering is now in the hands of companies
– Reduce the footprint of TTS systems (a few Megs)– Create new voices as fast as possible
• (Academic) TTS research? – Speech coding? +-DEAD– Voice conversion? YES– Speaker adaptation? YES– Expressive speech synthesis?
-Corpus-based : (ex: Loquendo)-DSP-based : eNTERFACE #6 ☺ Who will win?
• At FPMs : Back to acoustic speech modeling Voice quality analysis
– Breathy, Creaky, Diplophonic, Tense, Relaxed, etc.– Using acoustic features (spectral tilt, glottal formant
estimation, open quotient of glottal waveform, etc.)
![Page 22: SPEECH SYNTHESIS : UP FROM STATE-OF-THE-ART CORPUS …](https://reader031.vdocuments.net/reader031/viewer/2022020620/61e4396866a80454c8550bc1/html5/thumbnails/22.jpg)
Summertime …and the living is easy…