Proposals for Extending SSML 1.0 from the Point-of-View of Hungarian TTS Developers

Géza Németh, Géza Kiss, Bálint Tóth
Laboratory of Speech Technology, Department of Telecommunications and Media Informatics
Budapest University of Technology and Economics, Budapest, Hungary

Post on 31-Mar-2015

Page 1: Proposals for Extending SSML 1.0 from the Point-of-View of Hungarian TTS Developers Géza Németh, Géza Kiss, Bálint Tóth Laboratory of Speech Technology,

Proposals for Extending SSML 1.0 from the Point-of-View of Hungarian TTS Developers

Géza Németh, Géza Kiss, Bálint Tóth

Laboratory of Speech Technology, Department of Telecommunications and Media Informatics

Budapest University of Technology and Economics, Budapest, Hungary

{nemeth,kgeza,toth.b}@tmit.bme.hu

Page 2: Budapest University of Technology & Economics (BME)

Dept. of Telecommunications & Media Informatics (TMIT)

Speech activities:

Coordinator: Gordos Géza D.Sc.

• Speech Technology Lab (STL): Németh Géza PhD and Olaszy Gábor D.Sc.
• Telecommunications & Signal Processing Lab (TSP): Tatai Péter MSc
• Laboratory of Speech Acoustics (LSA): Vicsi Klára D.Sc.

In each lab
• 4-6 PhD students
• Graduate students: 306 in the Speech Information Systems subject (2005)

Page 3: Basic research

• Multi-lingual artificial speech generation (synthesis, STL)
  • limited vocabulary (e.g., numbers, date, address)
  • multi-lingual TTS (Hungarian, German, Polish, Spanish)
  • speech profiles (variability, individual features)
  • expression/emotion presentation (user's manual <-> news)
• Speech recognition (TSP, LSA)
  • noise handling (telephone, in-car, ..., TSP)
  • dictation (good quality, continuous, LSA)
  • audio indexing (e.g. radio archives, broadcast news, TSP)
  • speech segmentation (TSP, LSA)
  • emotion detection (TSP)
• Speech understanding (TSP)
• Speech databases (LSA, TSP)

Page 4: Applied research

• Fully proprietary components and solutions: all parameters controlled, systems are tailor-made for the end-user
• Integration of original research results, unique products
• T-Mobile Hungary services: e-mail reader 1999-, name- and address reader in reverse directory, 2003 (Motto: Why is the human operator speaking, not the machine?!), Symbian SMS-reader 2002- (STL)
• Others: SMS reader 2001-, bookreader 2002- (STL)
• Voice portals (Generali Hungary name dial-in 2004, Hungarian VoiceXML browser, 2003, TSP+STL)
• Industrial information systems (STL, TSP)
• Unified Messaging (STL)
• Call Center (STL, TSP)
• Audio user interfaces (especially portable/mobile devices, car information systems, wearable devices, STL, TSP)
• Disability (1986-, speech, vision, Hungarian version of Jaws for Windows, notetaker for blind people, STL, TSP, LSA)

Page 5: Contact information

Tel: (+36 1) 463-38-83

Fax: (+36 1) 463-31-07

http://speechlab.tmit.bme.hu

email: nemeth@tmit.bme.hu

Page 6: Overview

• Text structure
• Text-to-phoneme conversion
• Text normalization
• Prosody prediction
• Prosody prescription
• Summary

Page 9: Text structure

Text structure elements already contained in SSML 1.0:
• paragraph
• sentence

Suggested further structuring:
• word
• syllables

Page 10: Text structure

This can be used
• to help text-to-phoneme conversion, prosody prediction and prescription, … by giving higher level information, namely
  • syllable structure
  • part-of-speech information
  (Examples given later)
• to indicate words in languages that do not use spaces to separate words

Page 11: Text structure

Reasons to use text structure elements instead of e.g. phoneme, prosody, break, emphasis:
• Easier for a human editor to add
• Replacing the synthesis processor may necessitate rewriting
  • phoneme specification
  • prosody prescription

Page 12: Text structure

Suggested word element
<w [syllables="…-…"] [POS="…" [number="…" …]]> … </w>

E.g.

<w syllables="hosz-szú"> hosszú </w>

<w POS="noun" number="plural" case="accusative"> halászsasokat </w>
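A minimal sketch of how an authoring tool might emit the proposed word element, written in Python with the standard xml.etree.ElementTree module. The helper name make_word is hypothetical; the element and attribute names come from the slide, and w is a proposal here, not part of SSML 1.0.

```python
import xml.etree.ElementTree as ET


def make_word(text, syllables=None, **features):
    """Build a proposed <w> element; every attribute is optional."""
    attrs = {}
    if syllables is not None:
        attrs["syllables"] = syllables
    # Morphological features such as POS="noun", number="plural".
    attrs.update(features)
    w = ET.Element("w", attrs)
    w.text = text
    return w


# The two examples from the slide.
with_syllables = make_word("hosszú", syllables="hosz-szú")
with_pos = make_word("halászsasokat", POS="noun", number="plural", case="accusative")

print(ET.tostring(with_syllables, encoding="unicode"))
print(ET.tostring(with_pos, encoding="unicode"))
```

Keeping every attribute optional mirrors the bracketed syntax on the slide: an editor adds only the hints the synthesis processor cannot infer by itself.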

Page 13: Text structure

Suggestion extended from other proposals
<w [syllables="…-…"] [POS="…" [number="…" gender="…" case="…" …]] [morph="…+…"] [tone="h+l+…"]> … </w>

When not a word, but an expression is labeled:

<e [POS="…" [number="…" …]]> … </e>

E.g. three kilos
<e POS="cardinal" number="plural" gender="neuter" case="genitive"> 3 k. </e>

Page 15: Text-to-phoneme conversion

When pronunciation cannot be determined, you can

1. Add a lexicon element (BUT hard to add all)
2. Specify using phoneme (BUT hard to write & read for humans)
3. Add a textual replacement using sub
4. Provide higher level information (currently this is only say-as)

Page 16: Text-to-phoneme conversion

Other types of higher level information (easier, more natural)
• Syllable structure
• Part-of-speech information
• Language of included foreign text

We are going to give you some examples.

Page 17: Syllable structure

Hungarian:
• highly agglutinative
• pronunciation inference rules are used
• rules can be tricked by some words

E.g. “egészség” (“health”)

Letter combinations might be “s+zs” [ʃ]+[ʒ]→[ʒː]

but they are in fact “sz+s” [s]+[ʃ]→[ʃː]

Page 18: Syllable structure

Enough to know syllable structure.

Instead of
<phoneme alphabet="ipa" ph="&#x25B;ge&#x2D0;&#x283;&#x283;e&#x2D0;g"> egészség </phoneme>

you can write
<w syllables="e-gész-ség"> egészség </w>

(Note: here you could also write
<sub alias="e-gész-ség"> egészség </sub>)
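Why the syllable hint is enough can be sketched in a few lines of Python. The sketch assumes that a digraph never straddles a hyphen in the syllables attribute (true of the examples in these slides); the function names and the simplified digraph list (the trigraph "dzs" is left out) are illustrative choices, not part of the proposal.

```python
import unicodedata

# Two-letter Hungarian consonant digraphs (simplified; "dzs" is omitted).
DIGRAPHS = ("cs", "dz", "gy", "ly", "ny", "sz", "ty", "zs")


def digraph_candidates(text):
    """Return (index, digraph) for every letter pair that could be a digraph."""
    text = unicodedata.normalize("NFC", text)
    return [
        (i, text[i:i + 2])
        for i in range(len(text) - 1)
        if text[i:i + 2] in DIGRAPHS
    ]


def candidates_with_syllables(syllables):
    """Same search, but a digraph may not cross a syllable boundary ('-')."""
    found, offset = [], 0
    for syllable in unicodedata.normalize("NFC", syllables).split("-"):
        found.extend((offset + i, d) for i, d in digraph_candidates(syllable))
        offset += len(syllable)
    return found


# Without the hint, "sz" and "zs" overlap and a rule has to guess.
print(digraph_candidates("egészség"))
# With the hint, only the digraph inside "gész" survives.
print(candidates_with_syllables("e-gész-ség"))
```

The unhinted word yields two overlapping candidates, "sz" and "zs"; the hinted form leaves only "sz", so the reading "sz+s" follows without any phoneme string.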

Page 19: Part-of-speech

Word forms may have several meanings/pronunciations.

Specifying part-of-speech may help.

E.g.
I will <w POS="verb" tense="present"> read </w> the book

I have <w POS="participle"> read </w> the book
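A sketch of how a synthesis processor could use the POS hint, assuming a lexicon keyed by word form and part of speech. The table PRONUNCIATIONS, the function pronounce, and the IPA strings are illustrative assumptions, not taken from the slides.

```python
import xml.etree.ElementTree as ET

# Hypothetical mini-lexicon keyed by (word form, POS); IPA values are illustrative.
PRONUNCIATIONS = {
    ("read", "verb"): "riːd",
    ("read", "participle"): "rɛd",
}


def pronounce(w_markup):
    """Look up the pronunciation of a proposed <w> element by its POS attribute."""
    w = ET.fromstring(w_markup)
    word = (w.text or "").strip()
    # Returns None when the lexicon has no entry for this word/POS pair.
    return PRONUNCIATIONS.get((word, w.get("POS")))


print(pronounce('<w POS="verb" tense="present"> read </w>'))
print(pronounce('<w POS="participle"> read </w>'))
```

The markup stays readable for a human editor, and the lexicon stays inside the processor, so replacing the processor does not invalidate the document.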

Page 20: Language

• Foreign parts often occur in texts
• Using the same voice, currently you can
  • Do nothing
  • Specify using phoneme
• Another desirable approach
  • Specify lexicon for language and specify language of text

Page 21: Language

Instead of…
<speak … xml:lang="en-US">
The title of the movie is:
<phoneme alphabet="ipa" ph="&#x2C8;l&#x251; &#x2C8;vi&#x2D0;&#x27E;&#x259; &#x2C8;&#x294;e&#x26A; &#x2C8;b&#x25B;l&#x259;">
La vita è bella </phoneme> (Life is beautiful).

Page 22: Language

you could write…
<speak … xml:lang="en-US">
The title of the movie is:
<phoneme lang="it"> La vita è bella </phoneme> (Life is beautiful).

Page 23: Language

Suggested language attribute
<phoneme [lang="…" | "x-unknown"] [ph="…" [alphabet="…"]]> … </phoneme>

• If both lang and ph are given, lang has priority.
• If language is "x-unknown", LID (language identification) is used.
• We suggest that "x-unknown" can be used with xml:lang also.
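The precedence rules for the suggested lang attribute can be sketched as a small decision function. This is only an illustration of the rules as stated on the slide: the function name pronunciation_source, the strategy labels, and the identify_language callback are hypothetical, and no real LID component is implied.

```python
import xml.etree.ElementTree as ET


def pronunciation_source(phoneme_markup, identify_language=None):
    """Return (strategy, value) for a proposed <phoneme> element."""
    el = ET.fromstring(phoneme_markup)
    lang, ph = el.get("lang"), el.get("ph")
    if lang == "x-unknown":
        # Language identification is delegated to a caller-supplied function.
        guess = identify_language(el.text or "") if identify_language else None
        return ("lid", guess)
    if lang is not None:
        # lang has priority even when ph is also given.
        return ("lang", lang)
    if ph is not None:
        return ("ph", ph)
    return ("default", None)


print(pronunciation_source('<phoneme lang="it"> La vita è bella </phoneme>'))
```

A processor without an Italian lexicon could still fall back to ph when both attributes are present; that fallback is a design option outside what the slide specifies.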

Page 25: Text normalization

• Text normalization is effectively assisted by the say-as element.
• The constructs we found appropriate in our practice include: date, time (including time intervals like opening hours), number, currency, name, address.
• We additionally suggest as standard values: acronym/abbreviation, web, e-mail, phone, program-code, table, equation.
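As an illustration, a small Python helper can emit say-as markup restricted to the values named above. The interpret-as attribute is the one defined for say-as in SSML 1.0; the helper name say_as, the two sets, and the split of "acronym/abbreviation" into two separate values are assumptions made for this sketch.

```python
import xml.etree.ElementTree as ET

# Values the slide reports as useful in practice.
IN_USE = {"date", "time", "number", "currency", "name", "address"}
# Values the slide suggests standardizing ("acronym/abbreviation" split in two here).
PROPOSED = {"acronym", "abbreviation", "web", "e-mail", "phone",
            "program-code", "table", "equation"}


def say_as(text, interpret_as):
    """Serialize a <say-as> element, rejecting values not named on the slide."""
    if interpret_as not in IN_USE | PROPOSED:
        raise ValueError(f"unsupported interpret-as value: {interpret_as!r}")
    el = ET.Element("say-as", {"interpret-as": interpret_as})
    el.text = text
    return ET.tostring(el, encoding="unicode")


print(say_as("(+36 1) 463-38-83", "phone"))
```

A fixed value list is what makes documents portable between processors, which is the motivation for proposing standard values.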

Page 27: Prosody prediction

• We speak differently in different situations (e.g. speaking with friends, giving a talk at a conference, reading news, reading stories to children): speaking style
• Differences in prosody can be quantified
• Emotional speech is also in the focus of research
• Modern TTS systems are likely to be able to imitate these to some extent

Page 28: Speaking style

Suggested speaking-style attribute
• Can be used wherever xml:lang can be used, i.e. on voice, speak, p, s, w
• Synthesis processors can define their own set of supported speaking styles
• They should support "spelling" (it can be viewed as a special reading style)
• They may support e.g. "syllabification", "casual", "news reading", "story telling"

Page 29: Emotion

Suggested emotion attribute
• Mentioned here, although prosody is only one of its aspects
• Complementary to speaking-style, therefore a separate attribute is suggested
• Can be used wherever xml:lang can be used, i.e. on voice, speak, p, s, w
• Possible values: "happiness", "sadness", "anger", "surprise", "disgust", "fear".
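Since speaking-style and emotion are meant to be usable on the same elements as xml:lang, one plausible reading is that they are inherited the same way: an inner element overrides the value of its ancestors. The sketch below implements that reading in Python. The inheritance rule is an assumption of this sketch rather than something the slides state, and effective_attributes is a hypothetical helper.

```python
import xml.etree.ElementTree as ET


def effective_attributes(root, names=("speaking-style", "emotion")):
    """List (word text, inherited attribute values) for every <w> under root."""
    result = []

    def walk(element, inherited):
        current = dict(inherited)
        for name in names:
            if element.get(name) is not None:
                current[name] = element.get(name)  # inner value overrides outer
        if element.tag == "w":
            result.append(((element.text or "").strip(), current))
        for child in element:
            walk(child, current)

    walk(root, {})
    return result


doc = ET.fromstring(
    '<speak speaking-style="news reading">'
    '<s><w>SSML</w></s>'
    '<s emotion="surprise"><w speaking-style="spelling">SSML</w></s>'
    '</speak>'
)
for word, attrs in effective_attributes(doc):
    print(word, attrs)
```

Here the first word is read in the news-reading style, while the second is spelled out with surprise, which shows why the two attributes are kept separate.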

Page 30: Part-of-speech

• Part-of-speech (POS) of a word may affect emphasis and other aspects of prosody
• Not always possible to determine automatically
• More desirable to specify POS than to prescribe prosody (higher level, speaking style can override it)

Example in Hungarian:
• “Mondd, hogy vagy?” (“Tell me, how are you?”): interrogative adverb, strong (focus) emphasis
• “Igaz, hogy jól vagy?” (“Is it true that you are alright?”): conjunction, reduced emphasis
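To make the "hogy" example concrete, here is a sketch of a processor-side rule that turns a POS label into an SSML 1.0 emphasis level. The mapping table and the function render_with_emphasis are hypothetical; "strong" and "reduced" are existing values of the level attribute of emphasis.

```python
# Hypothetical rule table: POS label -> SSML 1.0 emphasis level.
EMPHASIS_BY_POS = {
    "interrogative adverb": "strong",
    "conjunction": "reduced",
}


def render_with_emphasis(word, pos):
    """Wrap the word in <emphasis> when the POS label has a rule, else return it as is."""
    level = EMPHASIS_BY_POS.get(pos)
    if level is None:
        return word
    return f'<emphasis level="{level}">{word}</emphasis>'


print(render_with_emphasis("hogy", "interrogative adverb"))
print(render_with_emphasis("hogy", "conjunction"))
```

Because the rule lives in the processor, a speaking style can replace the table without touching the document, which is the advantage claimed for specifying POS rather than prosody.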

Page 32: Prosody prescription

Analytic languages (e.g. English, Chinese)
• Words are usually short
• They convey only one portion of the meaning
• Individual words can be stressed

Synthetic languages (e.g. Hungarian, Korean)
• Words are often long
• Made up of several morphemes and have very complex meanings
• Stress, pitch changes, etc. may need to be realized on certain morphemes (~syllables)

Page 33: Prosody prescription

Example 1: contrastive sentences

English: “The book is not in the box, but on the box.”
• Speaker can emphasize one word.

Hungarian: “Nem a dobozon, hanem a dobozban van a könyv.”
• Speaker sometimes has to emphasize one syllable.
• Stress is expressed mainly by pitch; it may be aided by a short pause, slower rate, higher volume.

Page 34: Prosody prescription

Example 2: pitch change on syllable

1. “Elmentek.” (“They are gone.”) Pitch is continuously falling.
2. “Elmentek?” (“Are they gone?”) Pitch rises at the beginning of the second syllable and falls on the third syllable.

[Figure: pitch contours of sentences 1 and 2]

Page 35: Prosody prescription

Suggestion for extensions to prosody:
• Stress and prosody can be described on a per-syllable basis
• Extension to prosody: time can be a syllable position
  • decimal fractions can also be used
  • negative values indicate the nth position from the end
  • the special symbol syl_end indicates the end of the expression

E.g.:
<prosody contour="(syl1,…) (syl1.5,…) (syl2,…) … (syl-1,…) (syl_end,…)">
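The syllable positions can be resolved mechanically once the number of syllables is known. The Python sketch below is one possible interpretation, not the definition: it assumes that sylN marks the start of the Nth syllable, that a fraction moves into that syllable, and that syl-1 is the last syllable. The function name resolve_position and the 0-based result are choices of this sketch.

```python
import re

# Matches syl1, syl1.5, syl-1 and the special symbol syl_end.
_POSITION = re.compile(r"^syl(_end|-?\d+(?:\.\d+)?)$")


def resolve_position(token, syllable_count):
    """Return a 0-based offset, in syllables, from the start of the expression."""
    match = _POSITION.match(token)
    if match is None:
        raise ValueError(f"not a syllable position: {token!r}")
    value = match.group(1)
    if value == "_end":
        return float(syllable_count)
    number = float(value)
    if number < 0:
        # Count from the end: syl-1 is the last syllable.
        number = syllable_count + 1 + number
    if not 1 <= number < syllable_count + 1:
        raise ValueError(f"position out of range: {token!r}")
    return number - 1.0


# A three-syllable expression, e.g. "Elmentek".
for token in ("syl1", "syl1.5", "syl2", "syl-1", "syl_end"):
    print(token, resolve_position(token, 3))
```

Resolving to syllable offsets rather than to milliseconds keeps the contour valid when the speaking rate or the synthesis processor changes.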

Page 36: Prosody prescription

Suggestion for optional extensions:

• some synthesis processors may process pitch-contour (= contour), rate-contour, volume-contour
  • time positions: the same as in contour
  • rate / volume: described as in rate / volume
• emphasis and break extended with a position attribute; value can be a syllable position. In this case break will not be an empty element.

Page 38: Summary

Suggested extensions

1. <w [syllables="…-…"] [POS="…" [number="…" …]]> … </w>

2. <phoneme [lang="…" | "x-unknown"] [ph="…" [alphabet="…"]]> … </phoneme>

3. <voice | speak | p | s | w [speaking-style="spelling" | "syllabification" | "casual" | "news reading" | "story telling" | …] [emotion="happiness" | "sadness" | "anger" | "surprise" | "disgust" | "fear"] [xml:lang="…" | "x-unknown"]> … </voice>

4. <prosody contour="(syl1,…) (syl2,…) (syl2.5,…) … (syl-2,…) (syl-1,…) (syl_end,…)">
   optionally: pitch-contour (= contour), rate-contour, volume-contour; break, emphasis

Page 40: Thank you for your attention!