TRANSCRIPT
Una Y. Chow & Stephen J. Winters
Designing an exemplar-based computational model of intonation perception of English statements and questions
Alberta Conference on Linguistics, November 1, 2014
Research question
Can exemplar theory account for native listeners’ perception of intonation in English statements and questions?
Issue: Variations in speech
Previous studies reveal significant variations in speech.
Peterson & Barney (1952):
frequency of F1 (x-axis) vs. frequency of F2 (y-axis) for 10 vowels (i, ɪ, ɛ, æ, ɑ, ɔ, ʊ, u, ʌ, ɝ) produced by 76 speakers
How do listeners perceive speech sounds given the amount of variance?
Background: Exemplar theory
Johnson (1997) proposed an exemplar theory to account for listeners’ perception of speech.
According to this theory (Johnson, 1997; Pierrehumbert, 2001), listeners store in memory the fine phonetic details of the words (or exemplars) that they hear, including sounds that are associated with the speaker’s identity, gender, and language.
When listeners hear a new word, they categorize the word with the exemplars in memory that are most similar to the new word, overall.
Objective: Intonation perception model
The objective of my project was to create an exemplar-based computational model that would learn to categorize English statements and questions based on how similar a sentence's intonation pattern is to those of previously encountered sentences.
If a similarity-based calculation model (Johnson, 1997) can accurately classify novel sentences at an acceptable rate on the basis of intonation alone, it can be expanded to account for the human perception of intonation more generally.
Design of the model
Design: Preanalysis function
Reads in audio-recorded samples of speech sounds (in .wav format), e.g. Ann teaches history.
Removes any silence or noise before and after the speech sound.
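As a minimal sketch in Python, this trimming step might look like the following (the function name and the 0.01 amplitude threshold are illustrative assumptions, not the model's actual code):

```python
def trim_silence(samples, threshold=0.01):
    """Remove leading and trailing near-silence from an audio signal.

    `samples` is a list of amplitude values normalized to [-1, 1];
    frames whose absolute amplitude falls below `threshold` at either
    end are treated as silence or low-level noise.
    """
    start, end = 0, len(samples)
    while start < end and abs(samples[start]) < threshold:
        start += 1
    while end > start and abs(samples[end - 1]) < threshold:
        end -= 1
    return samples[start:end]
```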
Design: Analysis function
This function analyzes the pitch contour of the input sentence for salient cues.
In English, the pitch of the voice tends to fall at the end of a statement but tends to rise at the end of an echo question (Wells, 2006). For example,
Statement: Mary has a little lamb.
Echo question: Mary has a little lamb?
Design: Analysis function (cont’d)
This step first fills the gaps within a pitch contour using interpolation (a mathematical method) in order to create a continuous curve.
It then locates the nuclear tone in the sentence, that is, the last fall or rise.
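A minimal sketch of these two operations, assuming the pitch track is a list of F0 values with None marking unvoiced gaps (the function names and the turning-point heuristic are illustrative assumptions, not the model's actual code):

```python
def fill_gaps(pitch):
    """Linearly interpolate across interior unvoiced gaps (None values)."""
    out = list(pitch)
    i = 0
    while i < len(out):
        if out[i] is None:
            j = i
            while j < len(out) and out[j] is None:
                j += 1
            if 0 < i and j < len(out):  # gap has voiced frames on both sides
                left, right = out[i - 1], out[j]
                for k in range(i, j):
                    out[k] = left + (right - left) * (k - i + 1) / (j - i + 1)
            i = j
        else:
            i += 1
    return out

def locate_nuclear_tone(pitch):
    """Index where the final pitch movement (the last fall or rise) begins."""
    i = len(pitch) - 2
    final_dir = pitch[-1] - pitch[-2]
    # walk back while the contour keeps moving in the same direction
    while i > 0 and (pitch[i] - pitch[i - 1]) * final_dir > 0:
        i -= 1
    return i
```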
Design: Extraction function
In order to calculate how similar a new exemplar (i.e., sentence) is to other exemplars in ‘memory’, we used the following perceptual dimensions: the speed of change in pitch value at the nuclear tone, the direction of the change, and the timing of the nuclear tone relative to its position in the sentence.
This step extracts these similarity measures from the new exemplars.
E.g. for the statement Ann teaches history.: category = S, exemplar = e07a21S, speed = 537, direction = -1, time = 0.6
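A sketch of how these three measures could be computed from a pitch contour (the frame duration, the units, and the exact definitions are assumptions for illustration; the model's real speed values, such as 537 above, may be in different units):

```python
def extract_dimensions(pitch, frame_dur=0.01):
    """Extract the three perceptual dimensions from a pitch contour.

    pitch     : list of F0 values (Hz), one per analysis frame
    frame_dur : assumed frame duration in seconds
    Returns (speed, direction, time):
      speed     -- magnitude of pitch change per second over the nuclear tone
      direction -- +1 for a final rise, -1 for a final fall
      time      -- onset of the nuclear tone as a fraction of the sentence
    """
    # locate the start of the final pitch movement (last turning point)
    i = len(pitch) - 2
    final_dir = pitch[-1] - pitch[-2]
    while i > 0 and (pitch[i] - pitch[i - 1]) * final_dir > 0:
        i -= 1
    duration = (len(pitch) - 1 - i) * frame_dur
    speed = abs(pitch[-1] - pitch[i]) / duration
    direction = 1 if final_dir > 0 else -1
    time = i / (len(pitch) - 1)
    return speed, direction, time
```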
Design: Training function
In calculating similarities, the model assigns different weights to the dimensions.
For example, the direction of the nuclear tone (whether it is a fall or rise) may serve as a better cue in identifying the sentence type than the timing of the nuclear tone. If that is the case, direction would be weighted more heavily than timing.
This step trains the model to learn the weight distribution of the dimensions that would yield the best accuracy rate in categorizing new sentences.
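This search might be sketched as an exhaustive grid over weight vectors that sum to 1, scored by leave-one-out accuracy on the training exemplars (a hypothetical illustration; `train_weights`, the step size, and the plug-in classifier are assumptions, not the model's actual code):

```python
def train_weights(exemplars, classify, step=0.1):
    """Grid-search weights on (speed, direction, time) that maximize
    leave-one-out accuracy on the training exemplars.

    exemplars : list of (features, category) pairs
    classify  : function(features, memory, weights) -> category
    """
    best_w, best_acc = None, -1.0
    steps = int(round(1 / step))
    for a in range(steps + 1):
        for b in range(steps + 1 - a):
            c = steps - a - b
            w = (a * step, b * step, c * step)
            correct = 0
            for i, (feats, cat) in enumerate(exemplars):
                # hold out exemplar i, classify it against the rest
                memory = exemplars[:i] + exemplars[i + 1:]
                if classify(feats, memory, w) == cat:
                    correct += 1
            acc = correct / len(exemplars)
            if acc > best_acc:
                best_w, best_acc = w, acc
    return best_w, best_acc
```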
Design: Testing function
This step tests how accurately the model can categorize statements and questions from a set of sentences that is different from the training set.
It uses the weighted sum of the dimensions to estimate to which category a new sentence belongs.
(Johnson 1997:147)
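This weighted-sum categorization can be sketched in the spirit of Johnson's (1997) exemplar model: similarity to each stored exemplar decays exponentially with the weighted feature distance, and similarities are summed per category (the function name and the exponential-decay form follow the general exemplar-model literature; they are not necessarily the exact formula cited above):

```python
import math

def categorize(features, memory, weights, sensitivity=1.0):
    """Return the category whose stored exemplars accumulate the most
    similarity to `features` (a simplified exemplar-model sketch).
    """
    evidence = {}
    for feats, cat in memory:
        # weighted city-block distance over the perceptual dimensions
        d = sum(w * abs(a - b) for w, a, b in zip(weights, features, feats))
        evidence[cat] = evidence.get(cat, 0.0) + math.exp(-sensitivity * d)
    return max(evidence, key=evidence.get)
```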
Design: Cross-validation function
To evaluate how well the model generalizes, this step uses a k-fold cross-validation (Refaeilzadeh et al., 2009).
K refers to the number of folds used.
In a k-fold cross-validation, the training and test data are separate in a given run, but they cross over in successive runs such that each exemplar eventually gets tested once and only once.
For example, a 3-fold cross-validation splits the exemplars into three folds and uses each fold as the test set exactly once.
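The fold logic itself can be sketched in a few lines (illustrative; the actual model may also randomize or stratify the folds):

```python
def k_fold_splits(data, k):
    """Yield (train, test) splits so that every item is tested exactly once."""
    folds = [data[i::k] for i in range(k)]  # k roughly equal interleaved folds
    for i in range(k):
        test = folds[i]
        train = [x for j, f in enumerate(folds) if j != i for x in f]
        yield train, test
```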
Testing: Stimuli
40 statements and 40 echo questions per speaker: 5 dialogues x 4 sentences x 2 repetitions
Speakers: one male and one female (18 years old), native speakers of Canadian English
Recruited from the online LING 201 (Introduction to Linguistics) Research Participation System at the University of Calgary.
Received 1% credit towards their LING 201 course grades for completing the one-hour recording session.
Testing: Stimuli (cont’d)
The stimuli were recorded in the sound booth in the Phonetics Lab at the University of Calgary.
Statements and questions of 5, 7, 9, 11, and 13 syllables long; 4 pairs of statements and questions for each length
E.g. Ann teaches history. / Ann teaches history?
Alice went horse riding with a friend. / Alice went horse riding with a friend?
Morris wants to visit the old mansion on Monday. / Morris wants to visit the old mansion on Monday?
Testing: Results
For testing, we used a 10-fold cross-validation.
There were 15 sentences that showed pitch halving or doubling, so these sentences and their corresponding statements or questions were removed from the training and test data. The total number of sentences of each type was reduced to 65.
All 65 questions had a rising intonation, but 5 of the 65 statements also had a rising intonation.
Testing: Results
With all the weight on the direction dimension, the 10-fold cross-validation method correctly trained 95.69% - 97.46% of the exemplars, and correctly categorized statements (100%) and questions (75% - 100%).
[Figure: 10-fold cross-validation of English statements (S) and questions (Q): percent correct per fold (1-10) for trained S & Q, categorized S, and categorized Q]
Discussion
How well the model categorizes the sentences depends on the intonation patterns of the sentences as well as the generalized weights.
The model works well for this data set when 100% of the weight is on the direction dimension. The accuracy declines when a weight is added to another dimension.
Therefore, this model would need to be modified in order to be able to deal with uptalk, a terminal rising intonation (Ladd, 2008), in statements.
It is also predicted to fail to work for languages that do not mainly rely on the pitch direction, such as Mandarin.
Future work
Mandarin is a tone language that uses lexical tones to differentiate meaning in words.
Some researchers (e.g. Yuan, Shih, & Kochanski, 2002) claim that Mandarin raises the pitch of the overall sentence to signal an echo question.
Can exemplar theory account for the perception of intonation in Mandarin sentences?
References
Johnson, K. (1997). Speech perception without speaker normalization: An exemplar model. In K. Johnson & J. W. Mullennix (Eds.), Talker variability in speech processing (pp. 145-165). San Diego: Academic Press.
Pierrehumbert, J. (2001). Exemplar dynamics: Word frequency, lenition, and contrast. In J. L. Bybee & P. J. Hopper (Eds.), Frequency and the emergence of linguistic structure (pp. 137-157). Philadelphia: John Benjamins.
Ladd, D. R. (2008). Intonational phonology. Cambridge: Cambridge University Press.
References (cont’d)
Refaeilzadeh, P., Tang, L., & Liu, H. (2009). Cross-validation. In L. Liu & M. T. Özsu (Eds.), Encyclopedia of database systems (pp. 532-538). New York: Springer.
Wells, J. C. (2006). English intonation: An introduction. Cambridge: Cambridge University Press.
Yuan, J., Shih, C., & Kochanski, G. (2002). Comparison of declarative and interrogative intonation in Chinese. In B. Bel, & I. Marlien (Eds.), Proceedings of the Speech Prosody 2002 Conference (pp. 711-714). Aix-en-Provence: Laboratoire Parole et Langage.
Acknowledgement
This research was funded by the University of Calgary Program for Undergraduate Research Experience (PURE), awarded to Una Chow in 2013.
Thank you!
Comments? Questions?