speech synthesis : up from state-of-the-art corpus …

eNTERFACE’05, Friday Aug. 5th

SPEECH SYNTHESIS: UP FROM STATE-OF-THE-ART CORPUS-BASED APPROACHES?

Provided to you equation-free by:

Thierry Dutoit ([email protected])

TCTS Lab Faculté Polytechnique de Mons Belgium

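TTS = NLP + DSP

[Figure: block diagram of a TEXT-TO-SPEECH SYNTHESIZER. TEXT enters NATURAL LANGUAGE PROCESSING (Phonetization; Intonation/Duration Generation), which outputs a narrow phonetic transcription (phones, intonation/duration); DIGITAL SIGNAL PROCESSING (Speech Synthesis) turns it into SPEECH. The labels (a), (b), (c) appear on the diagram.]

Example text: To be or not to be, that is the question.

Example narrow phonetic transcription:
_ 210
t 40
U 55 0 173 75 173
b 80 10 160
i: 198 5 173 75 235
…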
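The numeric string shown for "To be or not to be" can be read as one entry per phoneme: a symbol, a duration in milliseconds, then optional (position in %, pitch in Hz) pairs. That reading is an assumption based on the MBROLA .pho convention, and the names `Phone` and `parse_pho` are hypothetical. A minimal parsing sketch under those assumptions:

```python
from typing import List, NamedTuple, Tuple


class Phone(NamedTuple):
    symbol: str                    # phoneme symbol, "_" stands for silence
    duration_ms: int               # segment duration in milliseconds
    pitch: List[Tuple[int, int]]   # (position in % of duration, F0 in Hz)


def parse_pho(text: str) -> List[Phone]:
    """Parse 'symbol duration [position pitch]...' lines (assumed layout)."""
    phones = []
    for line in text.strip().splitlines():
        fields = line.split()
        # Skip blank lines and ';' comment lines (the .pho comment marker).
        if not fields or fields[0].startswith(";"):
            continue
        symbol, duration = fields[0], int(fields[1])
        values = [int(v) for v in fields[2:]]
        if len(values) % 2:
            raise ValueError(f"unpaired pitch value in line: {line!r}")
        # Pair up the remaining numbers as (position, pitch) targets.
        pitch = list(zip(values[0::2], values[1::2]))
        phones.append(Phone(symbol, duration, pitch))
    return phones


# The first five phonemes from the slide, one entry per line.
EXAMPLE = """
_ 210
t 40
U 55 0 173 75 173
b 80 10 160
i: 198 5 173 75 235
"""

phones = parse_pho(EXAMPLE)
total_ms = sum(p.duration_ms for p in phones)
print(len(phones), total_ms)  # 5 583
```

Keeping the pitch targets as pairs mirrors the split in the diagram: the NLP side decides durations and intonation, and the DSP side only has to realise them.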

Challenges

• Accurate automatic phonetization (≠ dictionary look-up)
• Prosody generation (i.e., intonation and phoneme durations) must be “coherent”; it is easy to produce unnatural prosody
• Synthesis of phoneme sequences with corresponding prosody
– Coarticulation! (~Harris, 53)
– Segmental quality should be maintained after pitch and duration modification
• Engineering
– Low design and maintenance cost
– Low computational and memory cost
– Easy adaptation to other languages

Intelligible – Natural – Cost effective


Contents

• Acoustic speech synthesis (DSP)
– Model-based (rule-based) approach
– Instance-based (concatenative) approach
• Diphone concatenation
• Corpus-based (Unit Selection) synthesis
• Is there a future after corpus-based synthesis?

Von Kempelen’s talking machine (1791)

[Figure: the machine, with labelled parts: Mouth, Nostrils, Main bellows, Small bellows, 'S' pipe, 'Sh' pipe, 'S' lever, 'Sh' lever. (J.S. Liénard, LIMSI)]

Homer Dudley’s Voder (Bell Labs, 1936)

[Figure: schematic of the Voder, with labels: Noise source, Oscillator, Resonance control, Amplifier; Voder console keyboard with keys 1 to 10, a "Quiet" key and keys t-d, p-b, k-g; Energy switch, Wrist bar, Pitch-control pedal; UV / V.]

John Holmes’ formant synthesizer (1964)

Rule-based Synthesis

Haskins Labs (1968), DecTalk (1983), InfoVox (1983-95), LIMSI’s Polyglot (92)

Intelligibility Naturalness Mem/CPU New Voice (<100 kB)


Diphone concatenation (1977)

Diphone concatenation (1968)

[Figure: diphone synthesis of “dog”. Input: phonemes _ d o g _ with durations 50 ms, 80 ms, 160 ms, 70 ms, 50 ms and an F0 contour. Blocks: Diphone database, Prosody modification, Smooth joints. Units: _d, do, og, g_. Output: waveform plot.]

AT&T: LPC (1980)

France Telecom : PSOLA (1990)
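The "_ d o g _" example shows how a phoneme string maps onto diphone units: each phoneme is paired with its successor. A minimal sketch follows; the function names are hypothetical. The duration helper additionally assumes that every diphone runs from the midpoint of one phoneme to the midpoint of the next, which is a simplification and not something the slides state.

```python
from typing import List


def to_diphones(phonemes: List[str]) -> List[str]:
    """Pair each phoneme with its successor: n phonemes give n-1 diphones."""
    return [left + right for left, right in zip(phonemes, phonemes[1:])]


def diphone_durations(durations_ms: List[float]) -> List[float]:
    """Target duration of each diphone, ASSUMING midpoint-to-midpoint cuts."""
    halves = [d / 2 for d in durations_ms]
    return [a + b for a, b in zip(halves, halves[1:])]


phonemes = ["_", "d", "o", "g", "_"]
durations = [50, 80, 160, 70, 50]  # milliseconds, as on the slide

print(to_diphones(phonemes))         # ['_d', 'do', 'og', 'g_']
print(diphone_durations(durations))  # [65.0, 120.0, 115.0, 60.0]
```

Under the midpoint assumption, the stored diphones rarely have these durations or the requested F0, which is why the prosody modification block sits between the database and the joints.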

The MBROLA project (95)

http://tcts.fpms.ac.be/synthesis/

Diphone concatenation (1977)


Intelligibility Naturalness ~ Mem/CPU New Voice ~ (5 MB: “High DENSITY TTS”)


Corpus-based synthesis

[Figure, captioned “Diphone-based synthesis”: the same diagram as for diphone concatenation (Diphone database, Prosody modification, Smooth joints; phonemes _ d o g _ with F0; units _d, do, og, g_; waveform plot).]

[Figure, captioned “Unit selection - corpus-based synthesis”: the same diagram, with a VERY LARGE CORPUS in place of the diphone database.]

(ATR, 1996) (Univ. Edinburgh, 1997) (AT&T, 1998) (L&H, 1999) (Loquendo, 2001) (Babel Technologies, 2003)

Corpus-based synthesis

How to get the best sequence of units for a given utterance? Viterbi search

Use a model, but give last word to the data
Or: Choose the best, modify the least

[Figure: Target j is matched against Unit i through the target cost tc(j,i); Unit i is linked to its neighbours Unit i-1 and Unit i+1 through the concatenation costs cc(i-1,i) and cc(i,i+1).]

Intelligibility Naturalness Mem/CPU ~ New Voice (1 GB: “High QUALITY TTS”)
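The Viterbi search over target costs tc and concatenation costs cc can be sketched as a small dynamic programme. This is a toy illustration, not the cost model of any system named above: the function name `select_units`, the (corpus position, F0) unit representation and both cost functions are assumptions made for the example.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")  # a target specification (what we want)
U = TypeVar("U")  # a candidate unit (what the corpus offers)


def select_units(
    targets: Sequence[T],
    candidates: Sequence[Sequence[U]],
    tc: Callable[[T, U], float],
    cc: Callable[[U, U], float],
) -> Tuple[List[U], float]:
    """Return the cheapest unit sequence and its total cost (Viterbi search).

    candidates[j] lists the units that could realise targets[j]; every
    list is assumed to be non-empty.
    """
    if not targets:
        return [], 0.0
    # best[j][k]: cost of the cheapest path ending in candidates[j][k].
    best = [[tc(targets[0], u) for u in candidates[0]]]
    back = [[-1] * len(candidates[0])]  # back-pointers into the previous row
    for j in range(1, len(targets)):
        row, ptr = [], []
        for unit in candidates[j]:
            joined = [best[j - 1][k] + cc(prev, unit)
                      for k, prev in enumerate(candidates[j - 1])]
            k_min = min(range(len(joined)), key=joined.__getitem__)
            row.append(joined[k_min] + tc(targets[j], unit))
            ptr.append(k_min)
        best.append(row)
        back.append(ptr)
    # Pick the cheapest final unit, then follow the back-pointers.
    k = min(range(len(best[-1])), key=best[-1].__getitem__)
    total = best[-1][k]
    path = []
    for j in range(len(targets) - 1, -1, -1):
        path.append(candidates[j][k])
        k = back[j][k]
    return path[::-1], total


# Toy data (hypothetical): targets are desired F0 values in Hz,
# units are (position in corpus, recorded F0 in Hz).
targets = [100, 110, 120]
candidates = [
    [(0, 100), (10, 104)],
    [(5, 110), (11, 113)],
    [(12, 118), (20, 120)],
]


def target_cost(f0: int, unit: Tuple[int, int]) -> int:
    return abs(f0 - unit[1])  # distance to the requested pitch


def concat_cost(left: Tuple[int, int], right: Tuple[int, int]) -> int:
    # Units that were adjacent in the corpus join for free.
    return 0 if right[0] == left[0] + 1 else 10


path, cost = select_units(targets, candidates, target_cost, concat_cost)
print(path, cost)  # [(10, 104), (11, 113), (12, 118)] 9
```

In this toy run the three units that were contiguous in the corpus win, although each is slightly off target, because they need no joins: one way to read "choose the best, modify the least".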


Speech Science?

This time is over
– planes do not flap their wings
– replace experts by corpora

cf. Jelinek’s «Each time I fire a linguist my recognition rate goes 1% higher»

1. Future milestones in speech processing will come from labs with strong commitment to solid, portable, and extensible code;
2. Speech scientists and software engineers will soon be the same people.

Spoken Language Engineering!

ICASSP-INTERSPEECH: “Speech” synthesis → “Spoken Language” synthesis

However…

• Engineering is now in the hands of companies
– Reduce the footprint of TTS systems (a few Megs)
– Create new voices as fast as possible
• (Academic) TTS research?
– Speech coding? +/- DEAD
– Voice conversion? YES
– Speaker adaptation? YES
– Expressive speech synthesis?
  - Corpus-based: (ex: Loquendo)
  - DSP-based: eNTERFACE #6 ☺
  Who will win?
• At FPMs: Back to acoustic speech modeling: voice quality analysis
– Breathy, Creaky, Diplophonic, Tense, Relaxed, etc.
– Using acoustic features (spectral tilt, glottal formant estimation, open quotient of glottal waveform, etc.)

Summertime …and the living is easy…
