1 Processing with Prosody & Predicting Prosody Taal- en spraaktechnologie Fall 2005 Jennifer Spenader


1

Processing with Prosody & Predicting Prosody

Taal- en spraaktechnologie

Fall 2005
Jennifer Spenader

2

Today and tomorrow

Today:

1. Why we need to be able to recognize prosody

2. Elements that correlate with prosody in synthetic speech

Tomorrow:

1. How do categories like new-given relate to choice of lexical and syntactic form?

2. How do we determine the interpretation of underspecified forms?

3

Structure of Today’s Lecture

1. What makes speech sound good?

2. What role does prosody play in language understanding?

3. Categories that are relevant to generation of prosody
• Defining, identifying, operationalizing, implementing, testing

4. How is the information used to generate natural synthetic speech?

4

What makes good synthetic speech good?

• Idealized synthetic speech

• Good synthetic speech (AT&T’s Crystal)

• BAD synthetic speech

5

Characteristics of good synthetic speech

• Intelligibility
– It should support the listener’s decoding of the speaker’s message

• Naturalness
– It should follow the rules of discourse and information structure

• Pleasant to listen to? Friendly sounding?

6

How do we evaluate synthetic speech?

• Present listeners with samples and ask them

– Their opinion (give a rating, e.g. 1 to 5)
– To compare two samples
– To compare two samples to a third reference
– To ‘type what you hear’

7

Problems with evaluation

• Are all listeners informative subjects?
– consistency (do the scores make sense when taken together?)
– reliability (do people’s scores have the same range? the same mean?)
– native language, experience, etc.?

• What are we judging anyway?
– naturalness, understandability, likeableness, coverage, intelligibility: how are these the same or different?

Slide slightly modified from Tina Bennett (2004)
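The reliability question (same range? same mean?) can be checked with a few lines of code. The sketch below is only an illustration: the listener names and ratings are invented, and a real evaluation would use many more samples and a proper agreement statistic.

```python
from statistics import mean

# Hypothetical 1-to-5 ratings: listener -> scores for the same five samples.
ratings = {
    "listener_a": [4, 5, 3, 4, 4],
    "listener_b": [2, 3, 1, 2, 2],
    "listener_c": [5, 1, 5, 1, 3],
}

def summarize(scores):
    """Return (mean, range) of one listener's scores."""
    return mean(scores), max(scores) - min(scores)

# Per-listener summary, e.g. to spot listeners who use the scale differently.
summary = {name: summarize(scores) for name, scores in ratings.items()}
```

In this made-up data, listeners a and b use the same range but different means (b is simply stricter), while listener c uses the whole scale, which is the kind of difference the slide asks about.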

8

Prosody in synthetic speech

• Using the expected accentuation patterns makes synthetic speech more predictable
– If applications are used in the real world, e.g. in noisy environments, then we need high intelligibility
– (Does it make the message more redundant?)

• Meaning is sometimes affected by prosody
– Important for analysis, for machine translation, etc.
– Ex. (Swedish)
» Jag behöver en biljett. (“I need a ticket.”)
» Jag behöver EN biljett. (“I need ONE ticket.”)

9

What role does prosody play in language?

• Lexicon
– Some languages make meaningful lexical distinctions with prosody, e.g. Chinese, and also Japanese
» ame “candy” vs. ame “rain”

• Syntactic structure
– Identifies constituents or phrases?

• Discourse structure
– Identifies referents, distinguishes given from new
– Identifies contrasts, emphasizes key points?
– Marks topic changes
– Aids in identifying rhetorical relations

10

Prosodic prominence aids processing

• Word-initial phonemes are recognized faster in words with pitch accent
– (Shields et al. 1974; Cutler & Foss 1977)
– Phoneme identification tasks

• Mispronunciations are recognized faster if the word has pitch accent
– (Cole et al. 1978; Cole & Jakimik 1980)
– Words with pitch accent have clearer acoustics

11

Why not just give everything prosodic prominence?

• Information theory and coding, and “speaker economy”:
– An efficient code has a low average length per message compared to an inefficient code
– Giving everything prosodic prominence might be helpful to the hearer but makes things harder for the speaker
– Language is already redundant, and speakers make use of this
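The coding argument can be made concrete with a small calculation. The sketch below is not from the lecture: the message probabilities and the two codes are invented for illustration. It compares the average length per message of a fixed-length code with a code that gives frequent messages shorter codewords.

```python
def average_length(probabilities, lengths):
    """Expected codeword length: sum of p(message) * len(codeword)."""
    return sum(p * n for p, n in zip(probabilities, lengths))

# Hypothetical distribution over four messages (most frequent first).
probs = [0.5, 0.25, 0.125, 0.125]

fixed_code = [2, 2, 2, 2]      # every message costs 2 symbols
variable_code = [1, 2, 3, 3]   # frequent messages get short codewords

fixed_avg = average_length(probs, fixed_code)        # 2.0
variable_avg = average_length(probs, variable_code)  # 1.75
```

One way to read the analogy: accenting everything is like the fixed-length code, while deaccenting predictable material spends less effort on what the hearer can already recover.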

12

What does prosody tell us about the message?

• So far we’ve just said something about prosody being helpful in decoding and recognizing words

13

Syntactic form

• Rising and falling fundamental frequency, together with final lengthening, function as boundary tones

• For many years linguists assumed prosody mirrored syntactic structure

14

Prosody not isomorphic with syntax

• No one-to-one prosodic correlates of syntactic structure
– Accepted only fairly recently

• Major syntactic boundaries show greater F0 movement and longer segmental durations

• Major syntactic boundaries may be accurately located from prosodic information alone
– (Collier & ’t Hart, 1975)

• How good is Crystal?
– Note break before “during”

15

Prosody disambiguates local ambiguities

• Ex.
– John believes Mary implicitly.
– John believes Mary to be a professor.

• Prosody helps online processing
– Ex. Earlier my sister took a dip/in the pool/at the club/on the hill.
– Grosjean (1983): subjects could distinguish whether the target word “dip” was followed by zero, three, or six more words.

• Language-specific: French listeners could do no more than recognize the sentence finality of English sentences

16

Information structure

• From Eady and Cooper (1986) (a version of the “Question Test” of Hajičová et al. 1995)

• Ex. George has flowers for Mary.

1. Who has flowers for Mary?

2. What does George have for Mary?

3. Who does George have flowers for?

• Depending on the question (=context), different words will receive phonetic focus.

17

Listeners actively search for sentence focus

Cutler (1976) Phoneme monitoring task

(listen for a particular phoneme, e.g. /d/)

1. That summer four years ago I ate roast DUCK for the first time.

2. That summer four years ago I ate roast duck for EVERY MEAL.

• “duck” edited out and replaced by a neutral version
• Subjects were faster at recognizing the target word’s phoneme in the context where it would have been focused

18

Prosodic prominence also triggers extra semantic processing

• Homophones (Dutch: “gelijkklinkend woord”)
– hart (“heart”) vs. hard (“hard”), (de) bal (“ball”, the object) vs. (het) bal (“ball”, the dance)

• Blutner & Sommer (1988)
– If a homophone (a word with several meanings) is focused, its multiple meanings are activated
– An unaccented homophone activates only the contextually correct interpretation

19

Why deaccent and accent?

• Let your hearer know what’s important!
– New-given: new items receive accent, given items are deaccented

• Receive accent: the stressed syllable is produced so that it coincides with an F0 maximum…
– As well as longer duration, increased intensity?

• Be deaccented:
– Get cliticized. Clitic: an unstressed word that is incapable of standing on its own and attaches in pronunciation to a stressed word, with which it forms a single accentual unit.
» the pronoun 'em in I see 'em
» the definite article in French l'arme, "the arm"
– (modified from Free Online Dictionary)

20

Sentence processing is sensitive to new-given

• Responses in comprehension tasks are better with correct new-given prosody
– (Bock & Mazzella, 1983)

• Simple definition of new-given
– First occurrence = NEW
– Second occurrence = GIVEN

21

Correct accenting

• Mark “new information” by using the question test form (Hajičová et al. 1995)
– Ex.
– Who won the lottery?
– It was won by a phonologist.
(Target phonemes: /b/ or /f/)

• Cutler & Fodor (1979): phoneme identification is faster when the word containing the target phoneme is the focused word.

22

Correct deaccenting

• Verification of given information in pictures is faster when the given information is deaccented
– When this information was accented, reaction times became longer
– (Terken & Nooteboom 1987)

23

Do speakers deaccent to distinguish given information from new, or do they deaccent because they can?

24

New-given: how defined

• Until now we have used only a simple definition: repetition of the same word form

• This is also the type of data used in most testing

• But surely there is more to new-given!

25

26

When is something “given”?

27

When is something given?

• Threshold and scope of givenness
– How does an item become given?
– Same word earlier?
– Reference to same referent earlier?
– Reference to same concept earlier?
– How much earlier? Is 6 pages/20 minutes earlier too long ago? How long does something remain given?

28

Theories: Threshold and scope

• Chafe (1976)
– Scope of givenness depends on the number of intervening concepts and the number of words. A change of topic might remove given items from consciousness

• Grosz & Sidner (1981)
– Local focus: items that are now in focus, stored in a stack; these are “popped” at topic change
– Global focus: items that are always given, i.e. references to the topic of the article or conversation
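The local/global focus idea can be pictured as a small data structure. What follows is a toy sketch under stated assumptions, not Grosz & Sidner's actual model: the class name `FocusModel` and its methods are hypothetical, items are plain lowercase strings, and a topic change is signalled explicitly by the caller.

```python
class FocusModel:
    """Toy model: a stack of local focus spaces plus an always-given global set."""

    def __init__(self, global_items):
        self.global_focus = {item.lower() for item in global_items}
        self.local_stack = [set()]  # one focus space per open topic

    def mention(self, item):
        """Add an item to the current (top) local focus space."""
        self.local_stack[-1].add(item.lower())

    def push_topic(self):
        """Open a new focus space for a subtopic."""
        self.local_stack.append(set())

    def pop_topic(self):
        """Close the current topic; its local items stop being given."""
        if len(self.local_stack) > 1:
            self.local_stack.pop()
        else:
            self.local_stack[0].clear()

    def is_given(self, item):
        """An item is given if it is in global focus or any open local space."""
        key = item.lower()
        return key in self.global_focus or any(key in s for s in self.local_stack)


model = FocusModel(global_items=["prosody"])
model.mention("accent")
given_before = model.is_given("accent")   # in local focus
model.pop_topic()                         # topic change
given_after = model.is_given("accent")    # popped with the old topic
global_given = model.is_given("prosody")  # global focus survives topic changes
```

Chafe's distance-based decay (number of intervening concepts or words) is deliberately left out here; it would need a counter per item rather than a stack.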

29

Experimental evidence threshold and scope

• Terken & Nooteboom (1987) studied radio program speech. Mentioning a word once was enough for the item to be deaccented for the rest of the program

• If deaccentuation in this situation corresponds to givenness, then givenness is established after one mention

30

Inheritance of givenness

• Can items be considered given even if the same exact surface form wasn’t used before?

• Referents are given or new, not the words used to refer to them!
– E.g. purse - handbag

31

Deaccentuation of given forms or given concept?

• Donselaar (1995a)
– Ship-boat vs. boat-boat
– Subjects asked to make true-false judgements about spoken sentences

Ex. The millionaire bought a surprise for his wife. He gave her a boat/ship/mink.

The wife UNEXPECTEDLY got a BOAT/boat.
– BOAT: accented or not accented.
– Sentences with unaccented synonyms were verified more quickly than sentences with accented synonyms
– No difference for same word

32

Chafe (1976) Inheritance patterns

• Generic concept vs. specific instance
– I don’t like Norwegians. I met a Norwegian yesterday.
– I met a Norwegian yesterday. I don’t like Norwegians.

• Specific concepts imply more general concepts if the distance is not more than one step
– Table → furniture
– Mentioning furniture does not make tables given
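The one-step rule for specific and general concepts can be sketched with a tiny taxonomy. This is an illustration only: the `PARENT` table is invented (including the "artifact" level), and only the second bullet above is modelled, not the generic/specific Norwegian examples.

```python
# Hypothetical one-level taxonomy: concept -> its immediate more general concept.
PARENT = {"table": "furniture", "chair": "furniture", "furniture": "artifact"}

def given_concepts(mentioned):
    """Mentioned concepts plus each one's immediate parent (one step up only)."""
    given = set(mentioned)
    for concept in mentioned:
        parent = PARENT.get(concept)
        if parent is not None:
            given.add(parent)
    return given

after_table = given_concepts({"table"})          # table makes furniture given
after_furniture = given_concepts({"furniture"})  # furniture does not make table given
```

Givenness travels upward by one step only, so "table" makes "furniture" given but not "artifact", and nothing travels downward.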

33

When is something new?

1. We just bought a new house. The roof needs repairing.

2. We just bought a new house. The sauna is fabulous!

34

Summary

• Experimental results show that correct prosody aids in processing

• Incorrect prosody makes processing harder

• Getting the prosody right should greatly increase the intelligibility and naturalness of synthetic speech

35

Predicting prosody

What do we expect to be accented or deaccented?

36

Development of TTS

• Original TTS systems used one of two strategies:
– Accent all open class words, and deaccent all closed class words
» This results in too many accents
– Accent the last open class word in a phrase
» Deaccent everything else
» This sounds terrible for many languages, though it is “OK” for English
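Both baseline strategies are simple enough to sketch directly. The code below is a minimal illustration, not any actual TTS system's implementation: the closed-class word list is a deliberately tiny, hypothetical stand-in for a part-of-speech tagger, and capitals mark accent as on the slides.

```python
# Hypothetical, deliberately tiny closed-class list; a real system would use a tagger.
CLOSED_CLASS = {"the", "a", "an", "in", "on", "of", "to", "and", "he", "she", "it", "is"}

def is_open_class(word):
    return word.lower() not in CLOSED_CLASS

def accent_all_open(phrase):
    """Strategy 1: accent every open-class word, deaccent closed-class words."""
    return [w.upper() if is_open_class(w) else w.lower() for w in phrase]

def accent_last_open(phrase):
    """Strategy 2: accent only the last open-class word in the phrase."""
    last = max((i for i, w in enumerate(phrase) if is_open_class(w)), default=None)
    return [w.upper() if i == last else w.lower() for i, w in enumerate(phrase)]

phrase = ["the", "fox", "walks", "to", "the", "kitchen"]
all_open = accent_all_open(phrase)    # ['the', 'FOX', 'WALKS', 'to', 'the', 'KITCHEN']
last_open = accent_last_open(phrase)  # ['the', 'fox', 'walks', 'to', 'the', 'KITCHEN']
```

Even on this short phrase the first strategy accents three of six words, which hints at the "too many accents" problem.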

37

38

Vos en Haas:1

(Sylvia Vanden Heede, illustrations Thé Tjong-Khing)

• Koekboek

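Haas is niet thuis. Vos hangt lui in de stoel. Hij heeft nergens zin in. Of toch wel. Hij heeft zin in iets lekkers. Koek of zo. Iets ZOETS. Is er nog koek? Vast wel. Vos loopt naar de keuken. Hij doet de kast open. Daar staat de koektrommel. Maar er zit bijna niks meer in. Drie kleine koekjes! En hoop kruimels.

(English gloss: Hare is not at home. Fox hangs lazily in the chair. He doesn’t feel like anything. Or maybe he does. He feels like something tasty. Cake or something. Something SWEET. Is there any cake left? Surely there is. Fox walks to the kitchen. He opens the cupboard. There is the cake tin. But there is almost nothing left in it. Three small cookies! And a heap of crumbs.)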

39

Use other strategies

• Information structure
– Identify new-given information
– Accent new information, deaccent given information

• Identify contrasted elements
– Emphasize them

• Identify most important part of message
– Focus this

40

Hirschberg (1993)

• Algorithm to assign pitch accent
– Implemented in NewSpeak, Bell Laboratories TTS system
– Input: unrestricted text, output: tagged text
– Used FM Radio texts, ATIS texts and ??? to predict accent

• Closed-open class word strategy gets 85% of accents right in FM Radio texts
– Tendency for news readers to accent phrase-final content words even though most people would not
– E.g. TRIAL lawyer vs. TRIAL LAWYER

41

Not all function words deaccented

• “and” as a conjunction vs. “and” as discourse particle (Example from Hirschberg, 1992)

1. They left after lunch AND landed in France in time for dinner.

2. ?? They left after lunch. AND, they landed in France in time for dinner.

42

NewSpeak’s treatment of closed-class items

• Three categories

1. Closed-class and frequently deaccented
• Possessive pronouns, definite and indefinite articles, copulas, coordinating and subordinating conjunctions, existential “there”, have, accusative pronouns, most prepositions, positive modals, positive do, as well as certain adverbials, nominative they, nominative and accusative it, some nominal pronouns (e.g. something)

43

Commonly accented closed class items

• Negative article, negative modals, negative do, most nominal pronouns, most nominative and all reflexive pronouns, pre-quantifiers (e.g. all), post-determiners (e.g. next), nominal adverbials (e.g. here), interjections, particles, most wh-words, plus some prepositions

44

Not all content words are accented

• Complex nominals

– CAMPAIGN promise
– MASSACHUSETTS BAR Association

– Semantico-syntactic structure maps to differences in stress assignment

– Some stress to left, some to right.

45

Identifying new-given information

• Harder to “tag” for information structure than it is to construct your own examples

• For each word
– Identify its root
– If the root is already mentioned in the context, treat as given
– If the root isn’t mentioned in the context, treat as new
– Context = local context, should coincide with topics
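The per-word procedure above can be sketched as follows. This is a minimal illustration under loud assumptions: `naive_root` is a hypothetical suffix stripper standing in for a real morphological analyzer, and the local context is simply a set of roots that the caller would reset at each topic boundary.

```python
def naive_root(word):
    """Hypothetical stand-in for a real morphological analyzer: strip a few suffixes."""
    word = word.lower()
    for suffix in ("ation", "ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def tag_by_root(words, context_roots=None):
    """Tag each word given/new by whether its root is already in the local context."""
    context = set(context_roots or ())
    tagged = []
    for word in words:
        root = naive_root(word)
        tagged.append((word, "given" if root in context else "new"))
        context.add(root)  # after one mention the root counts as given
    return tagged

tags = tag_by_root(["inform", "information", "lawyers", "lawyer"])
```

Because it matches roots only, this sketch treats a synonym such as "handbag" after "purse" as new, and a crude stripper like this one will misidentify many roots.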

46

Vos en Haas:2

(Sylvia Vanden Heede, illustrations Thé Tjong-Khing)

• Koekboek

Haas is niet thuis. Vos hangt lui in de stoel. Hij heeft nergens zin in. Of toch wel. Hij heeft zin in iets lekkers. Koek of zo. Iets ZOETS. Is er nog koek? Vast wel. Vos loopt naar de keuken. Hij doet de kast open. Daar staat de koektrommel. Maar er zit bijna niks meer in. Drie kleine koekjes! En hoop kruimels.

47

Content vs. form words

• Hirschberg (1992)
– If a word has the same root as a word in the local focus stack, then it is treated as given
» Ignores synonyms! Introduces errors because roots can’t always be identified easily
» Horne et al. (1993) do the same thing but use a network of synonyms and hyponyms to identify given concepts when the referential form is different
– inform and information: same root!
– koek and koekjes: same root!

48

Contrastive elements

• NewSpeak: contrastiveness within a complex noun is identified
– If part of the complex noun is given while other parts are new, then the new items are contrastive

– TRIAL\N lawyer\N vs. CRIMINAL\contrastive lawyer\G
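The rule for complex nominals can be written out in a few lines. The sketch below only illustrates the rule as stated on the slide; it is not NewSpeak's code. The function name and the use of exact lowercase word matching for givenness are assumptions, and the tag labels follow the example above (N, G, contrastive).

```python
def tag_complex_nominal(words, given):
    """Tag parts of a complex nominal as 'N' (new), 'G' (given) or 'contrastive'.

    New parts count as contrastive only when some other part is given.
    """
    is_given = [w.lower() in given for w in words]
    mixed = any(is_given) and not all(is_given)
    tags = []
    for word, g in zip(words, is_given):
        if g:
            tags.append((word, "G"))
        else:
            tags.append((word, "contrastive" if mixed else "N"))
    return tags

first = tag_complex_nominal(["trial", "lawyer"], given=set())
second = tag_complex_nominal(["criminal", "lawyer"], given={"trial", "lawyer"})
```

With nothing given, both parts of "trial lawyer" come out as N; once "lawyer" is given, "criminal" is tagged contrastive and "lawyer" G, matching the slide's example.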

49

Focused elements

• Something the speaker considers particularly important

– ZOETS and LEEG in Koekboek?

50

Certain closed class words almost always get focused

• Negative adverbials

Haas is niet thuis. Vos hangt lui in de stoel. Hij heeft nergens zin in.

Maar er zit bijna niks meer in.

51

Vos en Haas: 3

(Sylvia Vanden Heede, illustrations Thé Tjong-Khing)

• Koekboek

Haas is niet thuis. Vos hangt lui in de stoel. Hij heeft nergens zin in. Of toch wel. Hij heeft zin in iets lekkers. Koek of zo. Iets ZOETS. Is er nog koek? Vast wel. Vos loopt naar de keuken. Hij doet de kast open. Daar staat de koektrommel. Drie kleine koekjes! En hoop kruimels.

52

Local and Global Focus

• Hirschberg considers all words in the first sentence to be in Global focus

• Local focus: experimented with sentence and paragraph boundaries
– Paragraph boundaries best

53

Remains: how to realize prominence?

• Should all prominences be realized the same way?

• Sentences also often have general pattern of F0 movement…

54

Associate different statuses with prosodic correlates

OPEN RESEARCH QUESTIONS:

• Is prosodic focus the same for each category?

• Does prosodic focus differ in form if it is within or at the end of a sentence?

• Is there a special type of contrastive focus that is different from the focus on new items?

55

Tomorrow

1. How do categories like new-given relate to choice of lexical and syntactic form?
• Definite vs. indefinite forms
• Marking with particles
• In-depth look at one theory of new-given (Prince 1981)

2. How do we determine the interpretation of underspecified forms?
• Resolution of anaphoric reference
• Interpretation of bridging NPs
– How lexical relationships help

56

References

• Hirschberg, J. (1992). Using discourse context to guide pitch accent decisions in synthetic speech. In G. Bailly, C. Benoit, and T. R. Sawallis (Eds.), Talking Machines: Theories, Models and Designs. Elsevier Science Publishing.

• Cutler, A., D. Dahan & W. Donselaar (1997). Prosody in the Comprehension of Spoken Language: A Literature Review.