text-to-speech text-to-speech part i 1 intelligent robot lecture note

Text-to-Speech

Text-to-SpeechPart I

Text-to-SpeechPart I

1Intelligent Robot Lecture Note

Text-to-Speech

2

Introduction

Intelligent Robot Lecture Note

Text-to-Speech

Introduction

• History► Long before electronic signal processing was invented, speech

researchers tried to build machine to create human speech.◦ Gerbert of Aurillac (d. 1003 AD)◦ Albertus Magnus (1198-1280)◦ Roger Bacon (1214-1294)

► In 1779, the Danish scientist Christian Kratzenstein built models of the human vocal tract that could produce the vowel sounds.◦ This was followed by the bellows-operated “acoustic-mechanical speech

machine” by Wolfgang von Kempelen.◦ This machine added models of the tongue and lips, enabling it to produce

consonants as well as vowels.◦ In 1837, Charles Wheatstone produced a “speaking machine” based on

von Kempelen’s design.◦ In 1857, M. Faber built the “Euphonia”.


Text-to-Speech

Introduction

• History► In the 1930s, Bell Labs developed the VOCODER, a keyboard-

operated electronic speech analyzer and synthesizer that was said to be clearly intelligible.

► The pattern playback was build by Franklin S. Cooper and his colleagues in the late 1940s and completed in 1950.◦ There were several different versions of this hardware device but only one

currently survives.◦ The machine converts pictures of the acoustic patterns of speech in the

form of a spectrogram back into sound.◦ Using this device, Alvin Liberman and colleagues were able to discover

acoustic cues for the perception of phonetic segments (consonants and vowels).


Text-to-Speech

5

Text and Phonetic Analysis


Text-to-Speech

Text and Phonetic Analysis

• Modules and Data Flow• Text Normalization• Linguistic Analysis• Homograph Disambiguation• Morphological Analysis• Letter-to-Sound Conversion• Evaluation


Text-to-Speech

Modules and Data Flow

7

Lexicon

Document Structure Detection

Linguistic Analysis

Text Normalization

Homograph Disambiguation

Morphological Analysis

Letter-to-Sound Conversion

raw text or tagged text

tagged text

Text Analysis

Phonetic Analysis

tagged text & phones

Modularized functional blocks for text and phonetic analysis components [Huang et al., 2001]


Text-to-Speech


• Modules► Document structure detection

◦ Document structure is important to provide a context for all later processes. In addition, some elements of document structure, such as sentence breaking and paragraph segmentation, may have direct implications for prosody. (not be covered here)

► Text normalization◦ Text normalization is the conversion from the variety of symbols, numbers,

and other non-orthographic entities of text into a common orthographic transcription suitable for subsequent phonetic conversion.

► Linguistic analysis◦ Linguistic analysis recovers the syntactic constituency and semantic

features of words, phrases, clauses, and sentences, which is important for both pronunciation and prosodic choices in the successive processes.


Text-to-Speech


• Modules► Homograph disambiguation

◦ It is important to disambiguate words with different senses to determine proper phonetic pronunciations, such as object (/ah b jh eh k t/) as a verb or as a noun (/aa b jh eh k t/).

► Morphological analysis◦ Analyzing the component morphemes provides important cues to attain

the pronunciations for inflectional and derivational words.► Letter-to-sound conversion

◦ The last stage of the phonetic analysis generally includes general letter-to-sound rules (or modules) and a dictionary lookup to produce accurate pronunciations for any arbitrary word.


Text-to-Speech


• Data flow

10

SS [f1, f2, …, fn]

NP [f1, f2, …, fn] VP [f1, f2, …, fn]

W W1 W2 W3 W4

Σ A skilled e lec tri cian re por ted

C

ax skihld

ih lehk

trih

shaxn

riy

paor

taxd

P

F1 F2 F3 F4 F5

IP1 [f1, f2, …, fn] IP2 [f1, f2, …, fn]

U [f1, f2, …, fn]

Annotation tiers indicating incremental analysis based on an input (text) sentence “A skilled electrician reported.”Flow of incremental annotation is indicated by arrows on the left side. [Huang et al., 2001]


Text-to-Speech


• Data flow► W(ords) Σ, C(ontrols)

◦ The syllabic structure (Σ) and the basic phonemic form of a word are derived from lexical lookup and/or the application of rules. The Σ tier shows the syllable divisions (written in text form for convenience). The C tier, at this stage, shows the basic phonemic symbols for each word’s syllables.

► W(ords) S(yntax/semantics)◦ The word stream from text is used to infer a syntactic and possibly

semantic structure (S tier) for an input sentence. Syntactic and semantic structure above the word would include syntactic constituents such as Noun Phrase (NP), Verb Phrase (VP), etc. and any semantic features that can be recovered from the current sentence or analysis of other contexts that may be available (such as an entire paragraph or document). The lower-level phrases such as NP and VP may be grouped into broader constituents such as Sentence (S), depending on the parsing architecture.


Text-to-Speech


• Data flow► S(yntex/semantics) P(rosody)

◦ The P(rosodic) tier is also called the symbolic prosodic module. If a word is semantically important in a sentence, that importance can be reflected in speech with a little extra phonetic prominence, called an accent. Some synthesizers begin building a prosodic structure by placing metrical foot boundaries to the left of every accented syllable. The resulting metrical foot structure is shown as F1, F2, etc. Over the metrical foot structure, higher-order prosodic constituents, with their own characteristic relative pitch ranges, boundary pitch movements, etc. can be constructed, shown as intonational phrases IP1, IP2.


Text-to-Speech

Text Normalization

• Abbreviations► An abbreviation is a shortened form of a word or phrase.► Since any abbreviation is potentially ambiguous, and there are several

distinct types of ambiguity, a system should combine knowledge from a variety of contextual sources such as document structure and origin, in order to resolve abbreviations.◦ Dr. (doctor or drive)◦ MD (medicinae doctor or Maryland)

► Types of abbreviations◦ Acronyms*◦ Apocope◦ Clipping◦ Elision◦ Syncope◦ Syllabic abbreviation◦ Portmanteau


Text-to-Speech

Text Normalization

• Abbreviations► Acronyms are words created from the first letters or other words.► Examples of acronyms

◦ Pronounced as a word– NATO: North Atlantic Treaty Organization– scuba: self-contained underwater breathing apparatus

◦ Pronounced as the names of letters– DNA: deoxyribonucleic acid– LED: light-emitting diode

◦ Pronounced as the names of letters but with a shortcut– IEEE: Institute of Electrical and Electronics Engineers– W3C: World Wide Web Consortium

◦ Pseudo-acronyms– IOU: “I owe you”– CQR: “secure”, a brand of boat anchor


Text-to-Speech

Text Normalization

• Number formats► Phone numbers

► Dates

15

02-1234-5678

(02) 1234-5678

+82-2-1234-5678

12/19/94 December nineteenth ninety four

04/27/1992 April twenty seventh nineteen ninety two

May 27, 1995 May twenty seventh nineteen ninety five

1,994 one thousand nine hundred and ninety four

1994 nineteen ninety four


Text-to-Speech

Text Normalization

• Number formats► Times

► Money and currency

16

＄ 40 forty dollars

￡ 200 two hundred pounds

5 ￥ five yen

300 € three hundred euros

11:15 eleven fifteen

5:20 am five twenty am

12:15:20 twelve hours fifteen minutes and twenty seconds

07:55:46 seven hours fifty-five minutes and forty-six seconds


Text-to-Speech

Text Normalization

• Number formats► Ordinal numbers

► Cardinal numbers

17

123 one two three one hundred (and) twenty three

1,234 one thousand two hundred (and) thirty four

2426two four two six twenty four twenty six

two thousand four hundred (and) twenty six

1/2 one half

1/3 one third

1/4 one quarter or one fourth

3/10 three tenths


Text-to-Speech

Text Normalization

• Domain-specific tags► Mathematical expressions (MathML)

◦ (x + 2)2

► Chemical formulae (CML)◦ C2OCHO4

18

<EXPR><EXPR>x<PLUS/>2</EXPR><POWER/>2

</EXPR>

<FORMULA><XVAR BUILTIN=“STOICH”>C C O C O H H H H</XVAR>

</FORMULA>


Text-to-Speech

Text Normalization

• Miscellaneous formats► Approximately or tilde

◦ The symbol ~ is spoken as approximately before numeral or currency amount, otherwise it is the character named tilde.

► Accented Roman characters◦ A table of folding characters can be provided so that for a term such as

Über-mensch, rather than spell out the word Über, or ignore it, the system can convert it to its nearest English-orthography equivalent: Uber.

► High ASCII characters◦ Ⓒ (copyright), ™ (trademark), ＠ (at), ® (registered mark)

► Asterisk◦ “Larry has *never* been here.”◦ This may be suppressed for asterisks spanning two or more words.

► Emoticons◦ :-) (smiley face), :-( (frowning face), ;-) (winking smiley face)


Text-to-Speech

Linguistic Analysis

• A minimal set of modular functions or services► Sentence breaking

◦ Abbreviation processing◦ Rules or CART built upon features based on: document structure,

whitespace, case conventions, etc.◦ Statistical frequencies on sentence-initial word likelihood◦ Statistical frequencies of typical lengths of sentences for various genres◦ Streaming syntactic/semantic (linguistic) analysis

20

Mr. Smith came by. He knows that it costs $1.99, but I don’t know when he’ll be back (he didn’t ask, “when should I return?”)... His Web site is www.mrsmithhhhhh.com. The car is 72.5 in. long (we don’t know which parking space he’ll put his car in.) but he said “...and the truth shall set you free,” an interesting quote.


Text-to-Speech

Linguistic Analysis

• A minimal set of modular functions or services► POS tagging [Manning et al., 1999]

◦ A system uses the POS labels to decide alternative pronunciations and to assign differing degrees of prosodic prominence.

◦ In addition, the bracketing might assist in deciding where to place pauses for great intelligibility.

21

Influential/JJ members/NNS of/IN the/DT House/NNP Ways/NNP and/CC Means/NNP Committee/NNP introduced/VBD legislation/NN that/WDT would/MD restrict/VB how/WRB the/DT new/JJ savings-and-loan/NN bailout/NN agency/NN can/MD raise/VB capital/NN ,/, creating/VBG another/DT potential/JJ obstacle/NN to/TO the/DT government/NN ’s/POS sale/NN of/IN sick/JJ thrifts/NNS ./.


Text-to-Speech

Linguistic Analysis

• A minimal set of modular functions or services► Homograph disambiguation

◦ Homograph disambiguation in general refers to the case of words with the same orthographic representation (written form) but having different semantic meanings and sometimes even different pronunciations.

► Noun phrase (NP) and clause detection◦ Basic NP and clause information could be critical for a prosodic generation

module to generate segmental durations. It also provides useful clues to introduce necessary pauses for intelligibility and naturalness. Phrase and clause structure are well covered in any parsing techniques.

► Sentence type identification◦ Sentence types (declarative, yes-no question, etc.) are critical for macro-

level prosody for the sentence.


Text-to-Speech


• Homograph variation► Homograph variation can often be resolved on POS (grammatical)

category.► Deep semantic and/or discourse analysis would be required to resolve

the tense ambiguity.◦ “If you read the book, he’ll be angry.”

• Two special sources of pronunciation ambiguity► A variation of dialects

◦ Tomato: tom[ey]to, tom[aa]to◦ Boston natives tend to reduce the /r/ sound in sentences like “Park your

car in Harvard yard.”► Speech rate and formality level

◦ The /g/ sound in recognize may be omitted in faster speech.


Text-to-Speech


• Examples of homographs► Stress homographs: noun with front-stress vowel, verb with end-stress

vowel◦ “an absent boy” vs. “Do you choose to absent yourself?”

► Voicing: noun/verb or adjective/verb distinction made by voice final consonant◦ “They will abuse him.” vs. “They won’t take abuse.”

► -ate words: noun/adjective sense uses schwa, verb sense uses a full vowel◦ “He will graduate.” vs. “He is a graduate.”

► Double stress: front-stressed before noun, end-stressed when final in phrase◦ “an overnight bag” vs. “Are you staying overnight?”

► -ed adjectives with matching verb past tenses◦ “He is a learned man.” vs. “He learned to play piano.”


Text-to-Speech


• Examples of homographs► Ambiguous abbreviations

◦ in, am, SAT (Saturday vs. Standard Aptitude Test)► Borrowed words from other languages

◦ They could sometimes be distinguishable based on capitalization.◦ “El Camino Real road in California” vs. “real world”◦ “polish shoes” vs. “Polish accent”

► Miscellaneous ◦ “The sewer overflowed.” vs. “a sewer is not a tailor.”◦ “He moped since his parents refused to buy a moped.”◦ “Agape is a Greek word.” vs. “His mouth was agape.”


Text-to-Speech


• Morphological analysis► Decomposition process

► When a dictionary does not list a given orthographic form explicitly, it is sometimes possible to analyze the new word in terms of shorter forms already present.

26

re nation al ize prefix stem suffix


Text-to-Speech


• Prefixes and suffixes► The prefixes and suffixes are generally considered bound, in the

sense that they cannot stand alone but must combine with a stem.◦ A word such as establishment may be decomposed into a “stem” establish

and a suffix -ment.◦ Should a system attempt to further decompose the stem establish into

establ and -ish?– Since a difference that makes no difference is no difference, it is best to be

conservative and list only obvious and highly productive affixes.

► Prefix and suffix stripping gives an analysis for many common inflected and some derived words.◦ It helps in saving system storage, but it also make mistakes (from a

synchronic point of view: basement is not base + -ment).◦ Adding irregular morphological formation into a system dictionary is always

a desirable solution.


Text-to-Speech


• Letter-to-sound conversion [Demper et al., 1998]► Also known as grapheme-to-phoneme conversion► For most words encountered in the input of a TTS system, a canonical

(or ‘baseform’) pronunciation is easily obtained by dictionary look-up.► The traditional default strategy uses a set of context-dependent

phonological rules written by an expert.◦ However, the task of manually writing such a set of rules, deciding the rule

order so as to resolve conflicts appropriately, maintaining the rules as mispronunciations are discovered etc., is very considerable and requires an expert with depth of knowledge of the specific language.

► More recent attention has focused on the application of automatic techniques based on machine-learning from large corpora.


Text-to-Speech


• Phonological rules► Assumption

◦ The pronunciation of a letter or letter substring can be found if sufficient is known of its context, i.e. the surrounding letters.

► The form of the rules is A[B]C D, which states that the letter substring B with left-context A and right-context C receives the pronunciation (i.e. phoneme substring) D.

► Because of the complexities of letter-to-sound correspondence, more than one rule generally applies at each stage of transcription.◦ The conflicts which arise are resolved by maintaining the rules in a set of

sublists, grouped by (initial) letter and with each sublist ordered by specificity.

◦ The most specific rule is usually at the top and most general at the bottom.◦ Transcription is usually a one-pass, left-to-right process.


Text-to-Speech


• Pronunciation by analogy► Pronunciation by analogy exploits the phonological knowledge

implicitly contained in a dictionary of words and their corresponding pronunciations.

► The underlying idea is that a pronunciation for an unknown word is assembled by matching substrings of the input to substrings of known, lexical words, hypothesizing a partial pronunciation for each matched substring from the phonological knowledge, and concatenating the partial pronunciations.


Text-to-Speech


• Decision trees [Black et al., 1998]► A decision tree has as the input grapheme sliding window with three

letters to the left and three to the right accordingly.► This method is appropriate for discreet characteristics and produces

rather compact models, whose size is defined by the total number of questions and leaf nodes in the output tree.

► Disadvantages◦ It assumes that the decisions are independent one from another so is that

it cannot use the prediction of the previous phone as the reference to predict the next one.

◦ Another limitations introduced by the binary decision trees that every time a question is asked the training corpus is divided into two parts and further questions are asked only over the remaining parts of the corpus.


Text-to-Speech


• Finite state transducers► This approach chooses the pronunciation φ that maximizes the

probability of a phoneme sequence given the letter sequence g.

► This estimation can be done using standard n-gram methods.

◦ where N is the number of letters in the word.

32

)}|({maxargˆ gp

N

i

iiii ggpgp

1

11

11 ),|,(),(


Text-to-Speech


• Finite state transducers► n-grams can be represented by a finite-state automaton, where a new

state is defined for each history h and arc a is created for each new grapheme-phoneme pair (g,φ).

► These arcs are labeled with a grapheme-phoneme pair and weighted with the probability of (g,φ) given the history h.

► To derive the finite state transducer the labels attached to the automaton edges are split in a way that letters become input and phonemes become output.


Text-to-Speech


• Hidden Markov models [Taylor, 2005]► Each phoneme is represented by one HMM and letters are the emitted

observations.► The probability of transitions between models is equal to the

probability of the phoneme given the previous phoneme.► The objective of this method is to find the most probable sequence of

hidden models (phonemes) given the observations (letters), using the probability distributions found during the model training.

◦ where p(φ) is the prior probability of a sequence of phonemes occurring, and p(g,φ) is the likelihood of grapheme sequence g given phoneme sequence φ.

34

)()|(maxargˆ

pgp


Text-to-Speech

Evaluation

• Text and phonetic analysis► The evaluation is usually feasible, because the input and output of

such module is relatively well defined.► The evaluation focuses mainly on symbolic and linguistic level in

contrast to the acoustic level.

• Automatic detection of document structures► The evaluation typically focuses on sentence breaking and sentence

type detection.► Since the definitions of these two types of document structures are

straightforward, a standard evaluation database can be easily established.

• LTS conversion analysis► An automated test framework for the LTS conversion analysis

minimally includes a set of test words and their phonetic transcriptions for automated lookup and comparison tests.


Text-to-Speech

Reading List

• Black et al., 1998. “Issues in building general letter to sound rules”. 3rd ESCA workshop on speech synthesis, Jenolah Caves, Australia, pp. 77-80.

• Demper et al., 1998. “A comparison of letter-to-sound conversion techniques for English text-to-speech synthesis”. Institute of Acoustics 20(6), pp. 245-254.

• Huang et al., 2001. Spoken Language Processing: A Guide to Theory, Algorithm, and System Development. New Jersey, Prentice Hall.

• Manning and Schütze, 1999. Foundations of Statistical Natural Language Processing. Cambridge, MIT Press.

• Taylor, 2005. “Hidden Markov Models for Grapheme to Phoneme Conversion”. Interspeech-2005, Lisbon, pp. 1973-1976.


Text-to-Speech

37

Prosody


Text-to-Speech

Prosody

• General Prosody• Speaking Style• Symbolic Prosody• Duration Assignment• Pitch Generation• Prosody Markup Languages• Prosody Evaluation


Text-to-Speech

General Prosody

• It is not what you said; it is how you said it!► An important supporting role in guiding a listener’s recovery of the

basic messages► A starring role in signaling connotation► The speaker’s attitude toward the message, toward the listener

• Prosody is often defined on two different levels► An abstract, phonological level

◦ (phrase, accent and tone structure)► A physical phonetic level

◦ (fundamental frequency, intensity or amplitude, and duration)


Text-to-Speech

General Prosody

• From the listener’s point of view, prosody consists of systematic perception and recovery of a speaker’s intentions based on:

► Pauses◦ To indicate phrases and to avoid running out of air

► Pitch◦ Rate of vocal-fold cycling (fundamental frequency or F0) as a function of

time► Rate/relative duration

◦ Phoneme durations, timing, and rhythm► Loudness

◦ Relative amplitude/volume


Text-to-Speech

General Prosody

• The analysis of prosody► Two approaches in the prosody literature

◦ Create an abstract descriptive system witch characterizes observations of the behavior of the parameters of prosody within the acoustic signal (fundamental frequency movement, intensity changes and durational movement), and promote the system to a symbolic phonological role.

◦ Create a symbolic phonological system which can be used to input to processes which eventually result in an acoustic signal judged by listeners to have a proper prosody.


Text-to-Speech

General Prosody

42

sound wave strip randomelement

strip non-randomdevice-constrainedphonetic variations

strip non-random,but non-essentiallanguage-specific

phonologicalvariations

abstractunderlying

representation

Phonetic analysis Phonological analysis

preserve generalized stripped information in a static phonetic

model

preserve generalized stripped information

in a static phonological model

add random element

reconstruct phonetic variations

reconstruct phonological

variations

abstract markup

synthesized sound wave

Low level synthesis High level synthesis

The diagram of how linguists take a sound wave [Monaghan, 2002]Intelligent Robot Lecture Note

Text-to-Speech

General Prosody

43

Pause insertion and prosodic phrasing

F0 contour Duration Loudness

Parsed text and phone string

Enriched prosodic representation

Speaking style

Block diagram of a prosody generation system


Text-to-Speech

Speaking Style

• Prosody depends not only on the linguistic content of a sentence.• Different people generate different prosody for the same sentence.• Even the same person generates a different prosody depending

on his or her mood.


Text-to-Speech

Speaking Style

• Character► As a determining element in prosody, refers primarily to long-term,

stable, extra-linguistic properties of a speaker.◦ Speaker’s region and economic status◦ Gender, age, speech defects◦ Fatigue, inebriation, talking with mouth full

► Since many of these elements have implications for both the prosodic and voice quality of speech output.


Text-to-Speech

Speaking Style

• Emotion► Temporary emotional conditions such as amusement, anger,

contempt, grief, sympathy, suspicion, etc. have an effect on prosody.► A few preliminary conclusion from existing research on emotion in

speech [Murray et al., 1993].◦ Speakers vary in their ability to express emotive meaning vocally in

controlled situations.◦ Listeners vary in their ability to recognize and interpret emotions from

recorded speech.◦ Some emotions are more readily expressed and identified than others.◦ Similar intensity of two emotions can lead to confusing one with the other.


Text-to-Speech

Speaking Style

• Some basic emotions that have been studied in speech include:► Anger, though well studied in the literature, may be too broad a

category for coherent analysis. One could imagine a threatening kind of anger with a more overtly expressive type of tantrum could be correlated with a wide, raised pitch range.

► Joy is generally correlated with increase in pitch and pitch range, with in crease in speech rate, Smiling generally raises F0 and formant frequencies and can be well identified by untrained listeners.

► Sadness generally has normal or lower than normal pitch realized in a narrow range, with a slow rate and tempo, It may also be characterized by slurred pronunciation and irregular rhythm.

► Fear is characterized by high pitch in a wide range, variable rate, precise pronunciation, and irregular voicing (perhaps due to disturbed respiratory pattern).


Text-to-Speech

Symbolic Prosody

• Abstract or symbolic prosodic structure is the link between the infinite multiplicity of pragmatic, semantic, and syntactic features of an utterance and the relatively limited F0, phone durations, energy, and voice quality.

• Symbolic prosody deals with:► Braking the sentence into prosodic phrases, possibly separated by

pauses► Assigning labels, such as emphasis, to different syllables or words

within each prosodic phrase.


Text-to-Speech

Symbolic Prosody

• Words are normally spoken continuously, unless there are specific linguistic reasons to signal a discontinuity.

• The term juncture refers to prosodic phrasing that is, where do words cohere, and where do prosodic breaks (pauses and/or special pitch movements) occur.

• Juncture effects, expressing the degree of cohesion or discontinuity between adjacent words, are determined by physiology (running out of breath), phonetics, syntax, semantics, and pragmatics.


Text-to-Speech

Symbolic Prosody

• The primary phonetic means of signaling juncture are:► Silence insertion► Characteristic pitch movements in the phrase-final syllable.► Lengthening of a few phones in the phrase-final syllable.► Irregular voice quality such as vocal fry.


Text-to-Speech

Symbolic Prosody

51Pitch generation decomposed in symbolic and phonetic prosody

Symbolic Prosody

PausesProsodic Phrases

AccentToneTune

Prosody Attributes

F0 Contour Generation

Speaking Style

F0 contour

Parsed text and phone string

Pitch RangeProminenceDeclination


Text-to-Speech

Symbolic Prosody

• Pauses► In a long sentence, speakers normally and naturally pause a number

of times.► These pauses have traditionally been thought to correlate with

syntactic but might more properly be thought of as markers of information structure [Steedman, 2000].

► They may also be motivated by poorly understood stylistic idiosyncrasies of the speaker, or physical constraints.

► There are many reasonable places to pauses in a long sentence, but a few where it is critical not to pause. The goal of a TTS system should be to avoid placing pauses anywhere that might lead to ambiguity, misinterpretation, or complete breakdown of understanding.


Text-to-Speech

Symbolic Prosody

• Pauses► The CART can be used for pause assignment [Ostendorf et al., 1998].

◦ Use POS categories of words, punctuation, and a few structural measures, such as overall length of a phrase, and length relative to neighboring phrases to construct the classification tree.

◦ Is this a sentence boundary (marked by punctuation)?◦ Is the left word a content word and the right word a function word?◦ What is the function word type of word to the right?◦ Is either adjacent word a proper name?◦ What is the current location in the sentence?◦ …


Text-to-Speech

Symbolic Prosody

• Prosodic phrases► An end-of-sentence period may trigger an extreme lowering of pitch, a

comma-terminated prosodic phrase may exhibit a small continuation rise at its end, signaling more to come, etc.

► Prosodic junctures that are clearly signaled by silence (and usually by characteristic pitch movement as well), also called intonational phrases, are required between utterances and usually at punctuation boundaries.

► Prosodic junctures that are not signaled by silence but rather by characteristic pitch movement only, also called phonological phrases, may be harder to place with certainty and to evaluate.


Text-to-Speech

Symbolic Prosody

• Prosodic phrases► To discuss linguistically significant juncture types and pitch movement,

it helps to have a simple standard vocabulary.► ToBI (for Tones and Break Indices) [Silverman, 1992] is a proposed

standard for transcribing symbolic intonation of American English utterances, though it can be adapted to other languages as well.


Text-to-Speech

Symbolic Prosody

• Prosodic phrases► The Break Indices part of ToBI specifies an inventory of numbers

expressing the strength of a prosodic juncture.► The Break Indices are marked for any utterance on their own discrete

break index tier (or layer of information), with the BI notations aligned in time with a representation of the speech phonetics and pitch track.


Text-to-Speech

Symbolic Prosody

• Prosodic phrases► On the break index tier, the prosodic association of words in an

utterance is shown by labeling the end of each word for the subjective strength of its association with the next word, on a scale from 0 (strongest perceived conjoining) to 4 (most disjoint).◦ 0 for cases of clear phonetic marks of clitic groups (pronunced as part of

another word, as in ve in I’ve)◦ 1 most phrase-medial word boundaries.◦ 2 a strong disjuncture marked by a pause or virtual pause, but with no

tonal marks.◦ 3 intermediate intonation phrase boundrary.◦ 4 full intonation phrase boundary.◦ “Did/0 you/1 want/0 an/1 example?/4”


Text-to-Speech

Symbolic Prosody

• Accents► We should briefly clarify use of terms such as stress and accent.► Stress generally refers to an idealized location in an English word that

is a potential site for phonetic prominence effects, such as extruded pitch and/or lengthened duration.◦ “Acme Industries is the biggest employer in the area.”

► Accent is the signaling of semantic salience by phonetic means.◦ “I didn’t say employer, I said employee.”


Text-to-Speech

Symbolic Prosody

• Tones► Tones can be understood as labels for perceptually salient levels or

movements of F0 on syllables.► Pitch levels and movements on accented and phrase-boundary

syllables can exhibit a bewildering diversity, based on the speaker’s characteristics, the nature of the speech event.


Text-to-Speech

Symbolic Prosody

• Tones► Chinese, a lexical tone language, is said to have an inventory of 4

lexical tones (5 if neutral tone is included).

60

The four Chinese tones


Text-to-Speech

Symbolic Prosody

• Tones► A basic set of tonal contrasts has been codified for American English

as part of the Tones and Break Indices (ToBI) system [Silverman, 1992].

► ToBI can be used for annotation of prosodic training data for machine learning, and also for internal modular control of F0 generation in a TTS system◦ The set specifies 2 abstract levels, H (High) and L (Low), indicating a

relatively higher or lower point in a speaker’s range.


Text-to-Speech

Symbolic Prosody

62

ToBI Pitch accent tonesToBI Tone Description Graph

H*Peak accent – a tone target on an accented syllables which is in the upper part of the speaker’s pitch range

L*Low accent – a tone target on an accented syllable which is in the lowest part of the speaker’s pitch range

L*+HScooped accent – a low tone target on an accented syllable which is immediately followed by a relatively sharp rise to a peak in the upper part of the speaker’s pitch range

L*+!HScooped downstep accent – a low tone target on an accented syllable which is immediately followed by a relatively flat rise to a downstep peak

L+H*

Rising peak accent – a high peak target on an accented syllable which is immediately preceded by a relatively sharp rise from a valley in the lowest part of the speaker’s pitch range

!H*

Downstep high tone –a clear step down onto an accented syllable from a high pitch which itself cannot be accounted for by an H phrasal tone ending the preceding phrase or by a preceding H pitch accent in the same phrase


Text-to-Speech

Symbolic Prosody

63

ToBI boundary tones

ToBI Tone Description

L-L%For a full intonation phrase with an L phrase accent ending its final intermediate phrase and an L% boundary tone falling to a point low in the speaker’s range, as in the standard ‘declarative’ contour of American English.

L-H%For a full intonation phrase with an L phrase accent closing the last intermediate phrase, followed by an H boundary tone, as in ‘continuation rise’.

H-H%

For an intonation phrase with a final intermediate phrase ending in an H phrase accent and a subsequent H boundary tone, as in the canonical ‘yes-no question’ contour. Note that the H- phrase accent causes ‘upstep’ on the following boundary tone, so that the H% after H- rises to a very high value.

H-L%For an intonation phrase in which the H phrase accent of the final intermediate phrase upsteps the L% to a value in the middle of the speaker’s range, producing a final level plateau.

%HHigh initial boundary tones; marks a phrase that begins relatively high in the speaker

‘s pitch range when not explained by an initial H* or preceding H%.

ToBI intermediate phrasal tones

ToBI Tone Description

L- Phrase accent, which occurs at an intermediate phrase boundary (level 3 and above).

H- Phrase accent, which occurs at an intermediate phrase boundary (level 3 and above).


Text-to-Speech

Symbolic Prosody

64

“Marianna made the marmalade”, with an H* accent on Marianna and marmalade,and a final L-L% marking the characteristic sentence-final pitch drop.


Text-to-Speech

Symbolic Prosody

• Prosodic transcription system► INTSINT

◦ INTSINT is a coding system of intonation [Hirst, 1994].► TILT

◦ TILT is one of the most interesting models of prosodic annotation.◦ It can represent a curve in both its qualitative (ToBI-like) and quantitative

(parameterized) aspects.► K-ToBI

◦ Prosodic transcription convention for standard Korean [Jun, 2000].


Text-to-Speech

Symbolic Prosody

• INTSINT► INTSINT is a coding system of intonation [Hirst, 1994].

◦ It provides a formal encoding of the symbolic or phonologically significant events on a pitch curve.

◦ Each such target point of the stylized curve is coded by a symbol, either as an absolute tone, scaled globally with respect to the speakers pitch range, or as a relative tone, defined locally in conjunction with the neighboring target points.

66

The definition of absolute tones

T Top of the speaker’s pitch range

M Initial, mid value

B Bottom of the speaker’s pitch range

H Target higher than both immediate neighbors

S Target not different from preceding target

L Target lower than both immediate neighbors

U Target in a rising sequence

D Target in a falling sequence


Text-to-Speech

Symbolic Prosody

• INTSINT

67

T

B

HS

LU

D M

Absolute toneRelative tone

Top

Downstepped

Bottom

Upstepped

Higher

Lower

Mid Same


Text-to-Speech

Symbolic Prosody

• TILT► The automatic parameterization of a pitch event on a syllable is in

terms of:◦ Starting F0 value (Hz)◦ Duration

◦ Amplitude of rise (Arise, in Hz)

◦ Amplitude of fall (Afall, in Hz)

◦ Starting point, time aligned with the signal and with the vowel onset► The tone shape, mathematically represented by its tilt, is a value

computed directly from the F0 curve by the following formula

68

||||

||||

fallrise

fallrise

AA

AAtilt


Text-to-Speech

Symbolic Prosody

• TILT

69

Label scheme for syllables

sil Silence

c Connection

a Major pitch accent

fb Falling boundary

rb Rising boundary

afb Accent + falling boundary

arb Accent + rising boundary

m Minor accent

mfb Minor accent + falling boundary

mrb Minor accent + rising boundary

l Level accent

lrb Level accent + rising boundary

lfb Level accent + falling boundary


Text-to-Speech

Symbolic Prosody

• K-ToBI► Prosodic transcription convention for standard Korean [Jun, 2000].► Based on the design principles of the original English ToBI and the

Japanese ToBI system.► Assumes intonational phonology with a close relationship to a

hierarchical model of prosodic constituents.


Text-to-Speech

Symbolic Prosody

• K-ToBI

71

IP: intonation phraseAP: accentual phraseW: phonological words: syllableT=H, when the syllable initial segment is aspirated/tense otherwise, T=L%: intonation phrase boundary tone

IP

AP (AP)

W (W)

s s … s s

T L HH %

Intonational structure of Korean


Text-to-Speech

Symbolic Prosody

• K-ToBI► Two prosodic units defined by intonation► Intonation Phrase (IP)

◦ Marked by a boundary tone (%) and phrase final lengthening► Accentual Phrase (AP)

◦ Smaller than an IP but larger than word and demarcated by phrasal tones◦ AP phrasal tones

– LHLH when the phrase has more than 3 syllables– LH, LLH, LHH when the phrase has fewer than 4 syllables

◦ AP initial tone– Changes depending on the laryngeal feature of the phrase initial segment– H when the AP begins with an aspirated or a tense obstruent (e.g., HHLH)– Otherwise L


Text-to-Speech

Symbolic Prosody

73

“I hate Younga”


Text-to-Speech

Symbolic Prosody

• Structure of K-ToBI

74

ToBI (4 tiers) K-ToBI (5 tiers)

Word tierWord tier

Tone tierPhonological tone tier

Phonetic tone tier

Break index tier Break index tier

Miscellaneous tier Miscellaneous tier


Text-to-Speech

Symbolic Prosody

• Why does K-ToBI need 2 tone tier?► Surface tonal variation are not distinctive or predictable in Korean

Prosody► Rather, what is distinctive is the phrasing and IP boundary tones.

◦ The presence or absence of an AP and IP boundary can change the meaning of an utterance, as with the distinction between wh-questions and yes/no-question and the disambiguation of syntactically ambiguous structures.

◦ Also, the IP boundary tone delivers semantic s sell as pragmatic meaning for an utterance.


Text-to-Speech

Symbolic Prosody

• Phonological tone tier► Labels the boundary of two prosodic units

◦ A boundary tone (X%) at the end of an IP– X% can be one of the 9 IP boundary tones (L%, H%, HL%, LH%, LHL%, HLH

%, HLHL%, LHLH%, LHLHL%)– We note that it is possible that not all of the IP boundary tones are distinctive

(e.g., LHLH% vs. LHLHL%), but until we find further evidence of distinctive meaning, or lack thereof, we will use all of these tones

◦ LHa at the end of an AP– Aligned with the end of AP final segment determined from the waveform.

◦ T%– Marks the end of an IP– Aligned with the end of IP final segment determined form the waveform


Text-to-Speech

Symbolic Prosody

• Phonetic tone tier► IP boundary tones: the same as those in the phonological tone tier.► AP tones: 14 types of surface tonal patterns

◦ (LHa, LHHa, LLHa, HLHa, HHa, HLa, LHLa, HHLa, HLLa, LLa, HHLHa, LHLHa, HHLLa, LHLLa)

◦ Labeled by 3 AP initial tones (L, H, +H) and 3 AP final tones (La, Ha, L+)◦ These 6 tones are not always realized, but the combination of these 6

tones can represent all 14 tonal types


Text-to-Speech

Symbolic Prosody

78

Schematic F0 contours of 14 tonal patterns of AP

Schematic F0 contours of 8 boundary tones of IP


Text-to-Speech

Symbolic Prosody

• The break index tier► Represents the degree of juncture perceived between each pair of words► 4 break indices

◦ 3: a strong phrasal disjuncture such as an IP◦ 2: minimal phrasal disjuncture such as an AP◦ 1: phrase-internal word boundaries◦ 0: a juncture smaller than a word boundary

► 3 more break index’s for a mismatch between the perceived degree of juncture and the tonal pattern◦ 3m: used when the juncture is 3 but has an AP tonal pattern◦ 2m: used when the juncture is 2 but has either no AP tonal pattern or the

tonal pattern of an IP◦ 1m: used when the juncture is 1 but there is an AP tonal pattern

► “-” diacritic affixed directly to the right of the higher break index◦ 1-: uncertainty between 0 and 1◦ 2-: uncertainty between 1 and 2


Text-to-Speech

Symbolic Prosody

• Miscellaneous tier► Contains labeler comments concerning events such as silence,

audible breathing, laughter, or other disfluencies.

80

“These days, that kind of church, eh, Year 2000, millennium…”


Text-to-Speech

Reading List

• Allen, J., M.S. Hunnicutt, and D.H. Klatt, From Text to Speech: the MITalk System, 1987, Cambridge, UK, University Press.

• Black, A. and A. Hunt, “Generating F0 Contours from toBI labels using Linear Regression,” Proc, of the Int. Conf, on Spoken Language Processing, 1996, pp. 1385-1388.

• Fujisaki, H. and H, Sudo, “A generative Model of the Prosody of Connected Speech in Japanese,” annual Report of Eng. Research Institute, 1971, 30, pp. 75-80.

• Hirst, D.H., “The Symbolic Coding of Fundamental Frequency Curves: from Acoustics to Phonology,” Proc. Of Int, Symposium on Prosody, 1994, Yokohama, Japan.

• Huang, X., et al., “Whistler: A Trainable Text-to-Speech System,” Int, Conf. on Spoken Language Processing, 1996, Philadelphia, PA, pp. 2387-2390.


Text-to-Speech

Reading List

• Jun, S., K-ToBI (Korean ToBI) labeling conventions (version 3.1), http://www.linguistics.ucla.edu/people/jun/ktobi/K-tobi.html, 2000.

• Monaghan, A. “State-of-the-art summary of European synthetic prosody R&D,” Improvements in Speech Synthesis. Chichester: Wiley, 1993, 93-103.

• Murray, I. and J. Arnott, “Toward the Simulation of Emotion in Synthetic Speech: A Review of the Literature on Human Vocal Emotion,” Journal Acoustical Society of Ameriac, 1993, 93(2), pp. 1097-1108.

• Ostendorf, M., and N. Veilleux, “A Hierarchical Stochastic Model for Automatic Prediction of Prosodic Boundary Location,” Computational Linguistics, 1994, 20(1), pp. 27-54.


Text-to-Speech

Reading List

• Plumpe, M. and S. Meredith, “which is More Important in a Concatenative Text-to-Speech System: Pitch, Duration, or Spectral Discontinuity,” Third ESCA/COCOSDA Int. Workshop on Speech Synthesis, 1998, Jenolan Caves, Australia, pp. 231-235.

• Silverman, K., The Structure and processing of fundamental Frequency Contours, Ph.D. Thesis, 1987, University of Cambridge, Cambridge, UK.

• Steedman, M., “Information Structure and the Syntax-Phonology Interface,” Linguistic Inquiry, 2000.

• van Santen, J., “Assignment of Segmental Duration in Text-to-Speech Synthesis,” computer Speech and Language, 1994, 8, pp. 95-128.

• W3C, Speech Synthesis Markup Requirements for Voice Markup Languages, 2000, http://www.w3.org/TR/voice-tts-reqs/.


text-to-speech text-to-speech part i 1 intelligent robot lecture note

Documents

speech text

text phones

speech modules

human speech

speech researchers

sound conversion raw

phonetic analysis modules

speech introduction