TRANSCRIPT
Una Y. Chow & Stephen J. Winters
Designing an exemplar-based computational model of intonation perception of English statements and questions
Alberta Conference on Linguistics, November 1, 2014
Research question
Can exemplar theory account for native listeners’ perception of intonation in English statements and questions?
Issue: Variations in speech
Previous studies reveal significant variations in speech.
Peterson & Barney (1952):
frequency of F1 (x-axis) vs. frequency of F2 (y-axis) for 10 vowels (i, ɪ, ɛ, æ, ɑ, ɔ, ʊ, u, ʌ, ɝ) produced by 76 speakers
How do listeners perceive speech sounds given the amount of variance?
Background: Exemplar theory
Johnson (1997) proposed an exemplar theory to account for listeners’ perception of speech.
According to this theory (Johnson, 1997; Pierrehumbert, 2001), listeners store in memory the fine phonetic details of the words (or exemplars) that they hear, including sounds that are associated with the speaker’s identity, gender, and language.
When listeners hear a new word, they categorize the word with the exemplars in memory that are most similar to the new word, overall.
Objective: Intonation perception model
The objective of my project was to create an exemplar-based computational model that would learn to categorize English statements and questions based on how similar a sentence's intonation pattern is to those of previously encountered sentences.
If a similarity-based calculation model (Johnson, 1997) can accurately classify novel sentences at an acceptable rate on the basis of intonation alone, it can be expanded to account for the human perception of intonation more generally.
Design of the model
Design: Preanalysis function
Reads in audio-recorded samples of speech sounds (in .wav format), e.g. Ann teaches history.
Removes any silence or noise before and after the speech sound.
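As a minimal sketch in Python, this trimming step might look like the following (the function name and the 0.01 amplitude threshold are illustrative assumptions, not the model's actual code):

```python
def trim_silence(samples, threshold=0.01):
    """Remove leading and trailing near-silence from an audio signal.

    `samples` is a list of amplitude values normalized to [-1, 1];
    frames whose absolute amplitude falls below `threshold` at either
    end are treated as silence or low-level noise.
    """
    start, end = 0, len(samples)
    while start < end and abs(samples[start]) < threshold:
        start += 1
    while end > start and abs(samples[end - 1]) < threshold:
        end -= 1
    return samples[start:end]
```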
Design: Analysis function
This function analyzes the pitch contour of the input sentence for salient cues.
In English, the pitch of the voice tends to fall at the end of a statement but tends to rise at the end of an echo question (Wells, 2006). For example,
Statement: Mary has a little lamb.
Echo question: Mary has a little lamb?
Design: Analysis function (cont’d)
This step first fills the gaps within a pitch contour using interpolation (a mathematical method) in order to create a continuous curve.
It then locates the nuclear tone in the sentence, that is, the last fall or rise.
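A minimal sketch of these two operations, assuming the pitch track is a list of F0 values with None marking unvoiced gaps (the function names and the turning-point heuristic are illustrative assumptions, not the model's actual code):

```python
def fill_gaps(pitch):
    """Linearly interpolate across interior unvoiced gaps (None values)."""
    out = list(pitch)
    i = 0
    while i < len(out):
        if out[i] is None:
            j = i
            while j < len(out) and out[j] is None:
                j += 1
            if 0 < i and j < len(out):  # gap has voiced frames on both sides
                left, right = out[i - 1], out[j]
                for k in range(i, j):
                    out[k] = left + (right - left) * (k - i + 1) / (j - i + 1)
            i = j
        else:
            i += 1
    return out

def locate_nuclear_tone(pitch):
    """Index where the final pitch movement (the last fall or rise) begins."""
    i = len(pitch) - 2
    final_dir = pitch[-1] - pitch[-2]
    # walk back while the contour keeps moving in the same direction
    while i > 0 and (pitch[i] - pitch[i - 1]) * final_dir > 0:
        i -= 1
    return i
```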
Design: Extraction function
In order to calculate how similar a new exemplar (i.e., sentence) is to other exemplars in ‘memory’, we used the following perceptual dimensions: the speed of change in pitch value at the nuclear tone, the direction of the change, and the timing of the nuclear tone relative to its position in the sentence.
This step extracts these similarity measures from the new exemplars.
E.g. for the statement Ann teaches history.: category = S, exemplar = e07a21S, speed = 537, direction = -1, time = 0.6
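A sketch of how these three measures could be computed from a pitch contour (the frame duration, the units, and the exact definitions are assumptions for illustration; the model's real speed values, such as 537 above, may be in different units):

```python
def extract_dimensions(pitch, frame_dur=0.01):
    """Extract the three perceptual dimensions from a pitch contour.

    pitch     : list of F0 values (Hz), one per analysis frame
    frame_dur : assumed frame duration in seconds
    Returns (speed, direction, time):
      speed     -- magnitude of pitch change per second over the nuclear tone
      direction -- +1 for a final rise, -1 for a final fall
      time      -- onset of the nuclear tone as a fraction of the sentence
    """
    # locate the start of the final pitch movement (last turning point)
    i = len(pitch) - 2
    final_dir = pitch[-1] - pitch[-2]
    while i > 0 and (pitch[i] - pitch[i - 1]) * final_dir > 0:
        i -= 1
    duration = (len(pitch) - 1 - i) * frame_dur
    speed = abs(pitch[-1] - pitch[i]) / duration
    direction = 1 if final_dir > 0 else -1
    time = i / (len(pitch) - 1)
    return speed, direction, time
```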
Design: Training function
In calculating similarities, the model assigns different weights to the dimensions.
For example, the direction of the nuclear tone (whether it is a fall or rise) may serve as a better cue in identifying the sentence type than the timing of the nuclear tone. If that is the case, direction would be weighted more heavily than timing.
This step trains the model to learn the weight distribution of the dimensions that would yield the best accuracy rate in categorizing new sentences.
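This search might be sketched as an exhaustive grid over weight vectors that sum to 1, scored by leave-one-out accuracy on the training exemplars (a hypothetical illustration; `train_weights`, the step size, and the plug-in classifier are assumptions, not the model's actual code):

```python
def train_weights(exemplars, classify, step=0.1):
    """Grid-search weights on (speed, direction, time) that maximize
    leave-one-out accuracy on the training exemplars.

    exemplars : list of (features, category) pairs
    classify  : function(features, memory, weights) -> category
    """
    best_w, best_acc = None, -1.0
    steps = int(round(1 / step))
    for a in range(steps + 1):
        for b in range(steps + 1 - a):
            c = steps - a - b
            w = (a * step, b * step, c * step)
            correct = 0
            for i, (feats, cat) in enumerate(exemplars):
                # hold out exemplar i, classify it against the rest
                memory = exemplars[:i] + exemplars[i + 1:]
                if classify(feats, memory, w) == cat:
                    correct += 1
            acc = correct / len(exemplars)
            if acc > best_acc:
                best_w, best_acc = w, acc
    return best_w, best_acc
```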
Design: Testing function
This step tests how accurately the model can categorize statements and questions from a set of sentences that is different from the training set.
It uses the weighted sum of the dimensions to estimate to which category a new sentence belongs.
(Johnson 1997:147)
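This weighted-sum categorization can be sketched in the spirit of Johnson's (1997) exemplar model: similarity to each stored exemplar decays exponentially with the weighted feature distance, and similarities are summed per category (the function name and the exponential-decay form follow the general exemplar-model literature; they are not necessarily the exact formula cited above):

```python
import math

def categorize(features, memory, weights, sensitivity=1.0):
    """Return the category whose stored exemplars accumulate the most
    similarity to `features` (a simplified exemplar-model sketch).
    """
    evidence = {}
    for feats, cat in memory:
        # weighted city-block distance over the perceptual dimensions
        d = sum(w * abs(a - b) for w, a, b in zip(weights, features, feats))
        evidence[cat] = evidence.get(cat, 0.0) + math.exp(-sensitivity * d)
    return max(evidence, key=evidence.get)
```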
Design: Cross-validation function
To evaluate how well the model generalizes, this step uses a k-fold cross-validation (Refaeilzadeh et al., 2009).
K refers to the number of folds used.
In a k-fold cross-validation, the training and test data are separate in a given run, but they cross over in successive runs such that each exemplar eventually gets tested once and only once.
For example, a 3-fold cross-validation splits the exemplars into three folds and uses each fold as the test set exactly once.
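The fold logic itself can be sketched in a few lines (illustrative; the actual model may also randomize or stratify the folds):

```python
def k_fold_splits(data, k):
    """Yield (train, test) splits so that every item is tested exactly once."""
    folds = [data[i::k] for i in range(k)]  # k roughly equal interleaved folds
    for i in range(k):
        test = folds[i]
        train = [x for j, f in enumerate(folds) if j != i for x in f]
        yield train, test
```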
Testing: Stimuli
40 statements and 40 echo questions per speaker: 5 dialogues x 4 sentences x 2 repetitions
Speakers: one male and one female (18 years old), native speakers of Canadian English
Recruited from the online LING 201 (Introduction to Linguistics) Research Participation System at the University of Calgary.
Received 1% credit towards their LING 201 course grades for completing the one-hour recording session.
Testing: Stimuli (cont’d)
The stimuli were recorded in the sound booth in the Phonetics Lab at the University of Calgary.
Statements and questions of 5, 7, 9, 11, and 13 syllables long; 4 pairs of statements and questions for each length
E.g. Ann teaches history. / Ann teaches history?
Alice went horse riding with a friend. / Alice went horse riding with a friend?
Morris wants to visit the old mansion on Monday. / Morris wants to visit the old mansion on Monday?
Testing: Results
For testing, we used a 10-fold cross-validation.
There were 15 sentences that showed pitch halving or doubling, so these sentences and their corresponding statements or questions were removed from the training and test data. The total number of sentences of each type was reduced to 65.
All 65 questions had a rising intonation, but 5 of the 65 statements also had a rising intonation.
Testing: Results
With all the weight on the direction dimension, the 10-fold cross-validation method correctly trained 95.69% - 97.46% of the exemplars, and correctly categorized statements (100%) and questions (75% - 100%).
[Figure: 10-fold cross-validation of English statements (S) and questions (Q): percent correct per fold (1-10) for trained S & Q, categorized S, and categorized Q]
Discussion
How well the model categorizes the sentences depends on the intonation patterns of the sentences as well as the generalized weights.
The model works well for this data set when 100% of the weight is on the direction dimension. The accuracy declines when a weight is added to another dimension.
Therefore, this model would need to be modified in order to be able to deal with uptalk, a terminal rising intonation (Ladd, 2008), in statements.
It is also predicted to fail to work for languages that do not mainly rely on the pitch direction, such as Mandarin.
Future work
Mandarin is a tone language that uses lexical tones to differentiate meaning in words.
Some researchers (e.g. Yuan, Shih, & Kochanski, 2002) claim that Mandarin raises the pitch of the overall sentence to signal an echo question.
Can exemplar theory account for the perception of intonation in Mandarin sentences?
References
Johnson, K. (1997). Speech perception without speaker normalization: An exemplar model. In K. Johnson & J. W. Mullennix (Eds.), Talker variability in speech processing (pp. 145-165). San Diego: Academic Press.
Pierrehumbert, J. (2001). Exemplar dynamics: Word frequency, lenition, and contrast. In J. L. Bybee & P. J. Hopper (Eds.), Frequency and the emergence of linguistic structure (pp. 137-157). Philadelphia: John Benjamins.
Ladd, D. R. (2008). Intonational phonology. Cambridge: Cambridge University Press.
References (cont’d)
Refaeilzadeh, P., Tang, L., & Liu, H. (2009). Cross-validation. In L. Liu & M. T. Özsu (Eds.), Encyclopedia of database systems (pp. 532-538). New York: Springer.
Wells, J. C. (2006). English intonation: An introduction. Cambridge: Cambridge University Press.
Yuan, J., Shih, C., & Kochanski, G. (2002). Comparison of declarative and interrogative intonation in Chinese. In B. Bel, & I. Marlien (Eds.), Proceedings of the Speech Prosody 2002 Conference (pp. 711-714). Aix-en-Provence: Laboratoire Parole et Langage.
Acknowledgement
This research was funded by the University of Calgary Program for Undergraduate Research Experience (PURE), awarded to Una Chow in 2013.
Thank you!
Comments? Questions?