Supplementary Materials I: Proof of Concept Experiment
Methods
Participants. Sixteen Cantonese-English bilingual listeners (Mean age = 20.3 years, SD
= 3.38 years) and 16 native English listeners (Mean age = 27.0 years, SD = 9.07 years)
participated in the proof-of-concept experiment. Demographic and language learning
background information for these participants was obtained through a language background
questionnaire (Choi, Tong, Gu, Tong, & Wong, 2017). According to participants’ self-
reports, all Cantonese-English bilingual listeners were born in Hong Kong and were native
speakers of Cantonese. They began learning English as a second language in early childhood
(Mean age of onset of English learning = 2.8 years, SD = 1.48 years). Their self-assessment
of proficiency in Cantonese and English on a 5-point Likert scale (1 = poor; 5 = excellent)
indicated that their proficiencies in Cantonese and English were 4.73 (SD = .59) and 4.00 (SD
= .65) respectively. The mean frequencies of language use on a 5-point Likert scale (1 =
never; 5 = always) were 4.93 (SD = .26) and 3.73 (SD = 1.10) for Cantonese and English,
respectively. In contrast, the native English listeners had acquired English as their first
language and had not received any formal or informal instruction in Cantonese, nor had they
learned any other tone languages. The native English listeners had lived in Hong Kong for a
mean of 2.2 years (SD = 3.84 years).
Materials and Procedure. The core stimuli for the English lexical stress discrimination
task were the four minimal pairs /'kaka/ - /ka'ka/, /'kɛkɛ/ - /kɛ'kɛ/, /'kiki/ - /ki'ki/, /'kɔkɔ/ -
/kɔ'kɔ/. The consonants and vowels were selected to be common in both Cantonese and
English in order to eliminate any possible influence of segmental contrasts. These were
recorded by two native American English speakers, one male and one female. The male
speaker was an English professor and the female speaker was an international graduate
student at The University of Hong Kong. They grew up in the United States and had acquired
English as their first language. At the time of recording, each had lived in Hong Kong for less
than 1.5 years. Prior to recording, they were given a list of English real words with
contrastive lexical stress patterns (e.g., IMport/imPORT; SUSpect/susPECT) to familiarize
them with iambic and trochaic stress patterns. During the recording, each pseudoword with a
marked lexical stress pattern (e.g., KAka) was presented on a computer screen. The speakers
recorded each pseudoword at their own pace. Each item was recorded three times, and the
best token (i.e., the most clearly articulated) was chosen by a speech-language pathologist
experienced in English phonetics. The selected tokens were normalized in duration to 500ms.
The stimuli were presented using an AX discrimination paradigm. On each trial, a pair of
stimuli was presented to participants over professional-quality stereo headphones with
an inter-stimulus interval of 500ms. Each pair consisted of one male and one female
recording, with the order of presentation of male and female recordings counterbalanced. The
two stimuli on a trial were always the same pseudoword but could either have the same
lexical stress pattern (e.g., /'kaka/ vs. /'kaka/) or different lexical stress patterns (e.g., /'kaka/
vs. /ka'ka/). The frequency of occurrence of each minimal pair (/'kaka/ - /ka'ka/, /'kɛkɛ/ -
/kɛ'kɛ/, /'kiki/ - /ki'ki/, /'kɔkɔ/ - /kɔ'kɔ/) was balanced.
The use of two quite different voices was designed to prevent listeners from attending to
acoustic similarity, per se. Prior to the experiment, listeners were explicitly told that they
would be hearing items that either matched or mismatched on stress pattern. On each trial,
participants responded by pressing one of two keys on a keyboard to indicate whether the two
sounds had the same stress pattern or not. There were 128 trials (8 stimuli × 2 speaker orders
× 4 repetitions × 2 trial types). Accuracy and response time were recorded on each trial.
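One way to see how the four factors multiply out to 128 trials is to enumerate the design. The sketch below assumes that the "8 stimuli" are the four pseudoword frames crossed with two stress patterns and serve as the A item, that "different" trials pair A with the opposite stress pattern, and that each trial always contains one male and one female token. The labels are hypothetical, and trial randomization and blocking are not specified in the text, so they are omitted.

```python
from itertools import product

# Hypothetical labels: four pseudoword frames, each with trochaic
# (first-syllable) or iambic (second-syllable) stress = 8 stimuli.
FRAMES = ["kaka", "kɛkɛ", "kiki", "kɔkɔ"]
STRESS = ["trochaic", "iambic"]
SPEAKER_ORDERS = [("male", "female"), ("female", "male")]
REPETITIONS = 4
TRIAL_TYPES = ["same", "different"]


def build_trials():
    """Return one dict per AX trial, fully crossing every factor."""
    trials = []
    for frame, a_stress, order, rep, ttype in product(
        FRAMES, STRESS, SPEAKER_ORDERS, range(REPETITIONS), TRIAL_TYPES
    ):
        # On "different" trials, X carries the opposite stress pattern to A.
        if ttype == "same":
            x_stress = a_stress
        else:
            x_stress = "iambic" if a_stress == "trochaic" else "trochaic"
        trials.append({
            "frame": frame,
            "a": (a_stress, order[0]),  # (stress pattern, speaker) of first item
            "x": (x_stress, order[1]),  # (stress pattern, speaker) of second item
            "type": ttype,
            "repetition": rep,
        })
    return trials


trials = build_trials()
print(len(trials))  # 128
```

Under these assumptions, each pseudoword frame appears on 32 trials and same and different trials are split 64/64, which matches the balancing described above.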
Results
Prior to our main analyses, we first converted the raw data of the discrimination task to
signal detection theory’s sensitivity index (d’) by subtracting the z-transform of the false
alarm rate from the z-transform of the hit rate (Macmillan & Creelman, 2005). The hit rate
was the proportion of different trials answered correctly, and the false alarm rate was the
proportion of same trials answered incorrectly.
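The d’ computation can be sketched as follows. It assumes, following the Figure S1 caption, that rates are clipped to the range .01 to .99 so that the z-transform stays finite. Clipping the hit rate from below and the false alarm rate from above is an added assumption. This is the simple yes/no formula described in the text; Macmillan and Creelman also describe same-different-specific models, which this sketch does not implement. The function name is hypothetical.

```python
from statistics import NormalDist

_z = NormalDist().inv_cdf  # inverse of the standard normal CDF


def d_prime(hits, n_different, false_alarms, n_same, floor=0.01, ceiling=0.99):
    """Sensitivity index d' = z(H) - z(F) for a same/different task.

    hits:         "different" responses on different trials
    false_alarms: "different" responses on same trials
    Both rates are clipped to [floor, ceiling] so z() stays finite.
    """
    hit_rate = min(max(hits / n_different, floor), ceiling)
    fa_rate = min(max(false_alarms / n_same, floor), ceiling)
    return _z(hit_rate) - _z(fa_rate)


# A perfect listener on 64 different and 64 same trials hits the ceiling.
print(round(d_prime(64, 64, 0, 64), 2))  # 4.65
```

The perfect-listener case reproduces the effective upper limit of 4.65 given in the Figure S1 caption, since z(.99) − z(.01) ≈ 2.33 + 2.33.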
Figure S1 shows the scatter plot of the d’ scores for Cantonese-English bilingual
listeners and native English listeners. The mean d’ score for the Cantonese-English bilingual
listeners was 3.10 (SD = .84), versus a mean of 1.26 (SD = .77) for the native English
listeners. An independent samples t-test revealed that Cantonese-English bilingual listeners
discriminated English lexical stress significantly more accurately than native English
listeners, t(30) = 6.44, p < .001, d = 2.28. This difference provides preliminary confirmation
of our hypothesis that there is a facilitative effect of tone language experience on English
lexical stress discrimination. Cantonese-English bilingual listeners also had significantly
shorter response times than native English listeners, t(30) = -3.14, p < .01, d = 1.05. Thus, the
non-native advantage cannot be attributed to a speed-accuracy trade-off.
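The reported t and Cohen's d can be checked from the group summary statistics. The sketch below assumes Student's independent-samples t-test with a pooled standard deviation, and d defined as the mean difference divided by that pooled SD. The helper names are hypothetical. Because the inputs are rounded means and SDs, the recomputed t (about 6.46) agrees with the reported 6.44 only to within rounding.

```python
import math


def pooled_sd(sd1, n1, sd2, n2):
    """Pooled standard deviation of two independent groups."""
    return math.sqrt(((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2))


def t_and_d(m1, sd1, n1, m2, sd2, n2):
    """Student's independent-samples t (df = n1 + n2 - 2) and Cohen's d."""
    sp = pooled_sd(sd1, n1, sd2, n2)
    t = (m1 - m2) / (sp * math.sqrt(1 / n1 + 1 / n2))
    d = (m1 - m2) / sp
    return t, d


# d' summary statistics from the proof-of-concept experiment (16 per group).
t, d = t_and_d(3.10, 0.84, 16, 1.26, 0.77, 16)
print(f"t(30) = {t:.2f}, d = {d:.2f}")  # t(30) = 6.46, d = 2.28
```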
Figure S1. Scatterplot of the d’ scores for Cantonese-English bilingual listeners and native
English listeners. Using .99 and .01 as maximum hit and minimum false alarm rates
respectively, the effective upper limit of d’ is 4.65.
Supplementary Materials II
Language and Musical Backgrounds of the Participants
According to the language background questionnaires (Choi et al., 2017), all Cantonese-
English bilingual listeners were born in Hong Kong, had acquired Cantonese as their first
language, and had learned English as their second language in early childhood (Mean age of
onset of English learning = 3.46 years, SD = 1.45 years). On a scale from 1 (poor) to 5
(excellent), their self-rated proficiencies in Cantonese and English were 4.82 (SD = .39) and
4.04 (SD = .39) respectively. On a scale from 1 (never) to 5 (always) for frequency of
language use, the mean frequencies for Cantonese and English usage were 4.96 (SD = .19)
and 3.64 (SD = .73) respectively.
The native English listeners had acquired American English as their first language and
had not received any formal or informal instruction in Cantonese or any other tonal
languages. Adopting the criteria used in prior research (e.g., Cooper & Wang, 2012;
Wayland, Herrera, & Kaan, 2010), nine Cantonese-English bilingual listeners and five native
English listeners were identified as musicians. Cantonese-English bilingual musicians (M =
9.7 years, SD = 2.50 years) and native English musicians (M = 10.2 years, SD = 2.86 years)
had similar amounts of musical training, t(12) = -.36, p = .722.
Supplementary Materials III
Stimuli in the Lexical Stress Discrimination Task
Phonotactically legal pseudowords. Four minimal pairs, i.e., /'kiki/ -
/ki'ki/, /'kwikwi/ - /kwi'kwi/, /'kimkim/ - /kim'kim/, /'kwimkwim/ - /kwim'kwim/ were
recorded. These stimuli are non-words in both languages, but are phonotactically legal in
both. The CVCV or CVCCVC structures were used because they are the most common
word-shapes with repeated syllable structures in Cantonese and English (Battistella, 1990).
The consonants and vowels selected are common in both Cantonese and English.
Phonotactically illegal pseudowords. We recorded the four minimal pairs /'riri/ -
/ri'ri/, /'krikri/ - /kri'kri/, /'kivkiv/ - /kiv'kiv/, /'krivkriv/ - /kriv'kriv/. Although these stimuli
have CVCV or CVCCVC structures, they are not phonotactically legal in Cantonese. The
consonants /v/ and /r/, and the cluster /kr/ are present in English, but do not occur in
Cantonese. These stimuli are thus phonotactically legal pseudowords in English but
phonotactically illegal non-words in Cantonese.
Real word stimuli. We recorded four minimal pairs of real English words, /ˈpɚmɪt -
pɚˈmɪt/, /ˈsəspekt - səsˈpekt/, /ˈɪnsɚt - ɪnˈsɚt/, and /ˈimpɔrt - imˈpɔrt/ (permit, suspect, insert,
import). We chose these four minimal pairs because their lexical stress change (e.g., PERmit
versus perMIT) did not involve vowel reduction (which is a segmental change, rather than a
variation in suprasegmental information).
All stimuli were digitally recorded in a sound-shielded booth, produced by two native
American English speakers (one male and one female), at a sample rate of 48 kHz. The male
speaker (the third author) has extensive experience recording stimuli for speech experiments;
he has lived in the New York City region for most of his life. The female speaker was an
undergraduate student living in the New York City region. During the recording, the male
speaker was given a written list of the stimuli and asked to produce the stimuli with the
lexical stress patterns indicated in the list (e.g., PERmit). The female speaker produced the
items by imitating recordings of the male productions: Prior to recording each item on the
list, she heard the item produced by the male speaker. Each item was recorded three times,
and the best token was chosen by the same speech-language pathologist who judged the
stimuli for the proof of concept experiment. Each chosen token was normalized in duration to 600ms in
Praat 5.4.02 (Institute of Phonetic Sciences, University of Amsterdam, the Netherlands). The
acoustic features of the naturally recorded stimuli are provided in Tables S1, S2, S3 and S4.
We then manipulated the acoustic parameters to create two additional stimulus sets: f0-only
stimuli and duration-and-intensity-only stimuli, as described below.
F0-only stimuli. The three sets of stimuli were acoustically manipulated in Praat to
remove the duration and intensity cues. Specifically, the durations of the first and second
syllables were normalized to 300ms, and their intensities neutralized to 66dB.
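The intensity neutralization step amounts to applying a single gain so that a syllable's mean intensity equals the target. The sketch below assumes, as Praat does, that samples are in pascals and that intensity is expressed in dB relative to 2 × 10⁻⁵ Pa. It is an illustration of the arithmetic, not the Praat procedure the authors ran. The function names and the synthetic test tone are hypothetical, and the duration and f0 manipulations (which require pitch-synchronous resynthesis) are not shown.

```python
import math

P_REF = 2e-5  # reference pressure in Pa (dB re 20 µPa)


def intensity_db(samples):
    """Mean intensity in dB of a sound whose samples are in pascals."""
    mean_square = sum(x * x for x in samples) / len(samples)
    return 10 * math.log10(mean_square / P_REF**2)


def scale_intensity(samples, target_db=66.0):
    """Multiply all samples by one gain so the mean intensity equals target_db."""
    gain = 10 ** ((target_db - intensity_db(samples)) / 20)
    return [x * gain for x in samples]


# Hypothetical 300 ms, 150 Hz tone at 48 kHz (not one of the real stimuli).
sr = 48_000
syllable = [0.05 * math.sin(2 * math.pi * 150 * n / sr) for n in range(int(0.3 * sr))]
print(round(intensity_db(scale_intensity(syllable)), 2))  # 66.0
```

Because the gain is uniform, the waveform shape and the f0 contour within the syllable are left unchanged; only the overall level moves.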
Duration-and-intensity-only stimuli. The three sets of stimuli were acoustically
manipulated to remove the dynamic f0 information. The f0 of each syllable was flattened to
120Hz (male recordings) or 210Hz (female recordings). These f0 values correspond to the
average fundamental frequency of male and female adults (Traunmüller & Eriksson, 1994).
Short-term Memory Task
We used two tasks that Zheng and Samuel (2018) had used to assess participants’
memory and intelligence. The task that measured short-term memory was a version of the
children’s game “Simon” (retrieved from
http://www.freegames.ws/games/kidsgames/simon/mysimon.htm). In this task, there was a
round object split into four different wedges corresponding to different colors, i.e., red, green,
yellow and blue. On each trial, different colors would light up in a sequence, e.g., red-blue-
green. Participants were asked to remember and reproduce the sequence, e.g., red-blue-green.
The first trial presented a single color. Each time a participant gave the correct answer, the
sequence would increase by one, e.g., blue-yellow-green-red, until the participant failed to
reproduce the sequence. Participants completed five rounds, and each participant’s score was
the median across the five rounds. In the usual game, each color corresponds to a specific musical
tone. We turned off the sound during the task to avoid the possible involvement of pitch-to-
color mapping.
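The adaptive span procedure and median scoring can be simulated as below. The text does not define the per-round score exactly; this sketch assumes it is the length of the longest sequence reproduced correctly before the first failure, and that a fresh random sequence is drawn on each trial (as in the examples above). The participant is modeled as a hypothetical callable that returns whether a sequence was reproduced correctly; all names are illustrative.

```python
import random
from statistics import median

COLORS = ["red", "green", "yellow", "blue"]


def play_round(can_reproduce, rng):
    """One round: lengthen the sequence by one after every correct reproduction.

    Returns the length of the longest sequence reproduced correctly
    (assumed scoring rule; 0 if even a single color is missed).
    """
    length = 1
    while True:
        sequence = [rng.choice(COLORS) for _ in range(length)]
        if not can_reproduce(sequence):
            return length - 1
        length += 1


def span_score(participant_rounds, seed=0):
    """Median of the per-round scores; one callable per round."""
    rng = random.Random(seed)
    scores = [play_round(can_reproduce, rng) for can_reproduce in participant_rounds]
    return median(scores), scores


# Hypothetical listener whose capacity differs from round to round.
capacities = [6, 8, 7, 5, 9]
rounds = [lambda seq, cap=cap: len(seq) <= cap for cap in capacities]
print(span_score(rounds))  # (7, [6, 8, 7, 5, 9])
```

Taking the median rather than the mean limits the influence of a single unusually good or bad round.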
Non-verbal Intelligence Task
Participants completed a set of non-verbal multiple-choice questions chosen from two
free online intelligence tests (retrieved from http://www.iq-test.com/free-iq-test/ and
http://www.quickiqtest.net). The questions were displayed on PowerPoint slides that
advanced automatically every 30 seconds. There were 14 questions, and the number of
correct answers was tallied to give the total score.
Table S1.
Original and modified durations of the naturally recorded stimuli originating from the male
speaker.
Stimuli  Overall (original, modified)  First syllable (original, modified)  Second syllable (original, modified)
Legal
kiKI 728 600 333 275 395 325
KIki 642 600 291 273 351 327
kimKIM 755 600 343 273 412 327
KIMkim 675 600 336 399 339 201
kwiKWI 714 600 289 243 425 357
KWIkwi 671 600 290 260 381 340
kwimKWIM 770 600 383 300 387 300
KWIMkwim 678 600 354 313 324 287
Illegal
kivKIV 913 600 363 237 550 363
KIVkiv 926 600 359 232 567 368
kriKRI 735 600 325 263 410 337
KRIkri 703 600 315 269 388 331
krivKRIV 999 600 393 236 606 364
KRIVkriv 951 600 407 257 544 343
riRI 663 600 344 243 319 357
RIri 769 600 368 260 401 340
Real words
imPORT 727 600 272 224 455 376
IMport 855 600 250 177 605 423
inSERT 748 600 169 139 579 461
INsert 816 600 166 124 650 476
perMIT 601 600 141 151 460 449
PERmit 600 600 141 155 459 445
susPECT 913 600 459 308 454 292
SUSpect 972 600 490 272 482 328
Note. Stressed syllables are capitalized. All values are represented in ms.
Table S2.
Original and modified durations of the naturally recorded stimuli originating from the
female speaker.
Stimuli  Overall (original, modified)  First syllable (original, modified)  Second syllable (original, modified)
Legal
kiKI 638 600 340 319 298 281
KIki 612 600 367 361 245 239
kimKIM 714 600 373 314 341 286
KIMkim 618 600 305 297 313 303
kwiKWI 692 600 355 306 337 294
KWIkwi 660 600 300 271 360 329
kwimKWIM 767 600 358 279 409 321
KWIMkwim 662 600 327 294 335 306
Illegal
kivKIV 878 600 364 247 514 353
KIVkiv 852 600 321 228 531 372
kriKRI 667 600 344 308 323 292
KRIkri 619 600 304 294 315 306
krivKRIV 843 600 352 252 491 348
KRIVkriv 1013 600 368 219 645 381
riRI 675 600 306 247 369 353
RIri 616 600 323 332 293 268
Real words
imPORT 832 600 349 250 483 350
IMport 829 600 403 290 426 310
inSERT 772 600 263 204 509 396
INsert 683 600 268 239 415 361
perMIT 796 600 277 209 519 391
PERmit 748 600 300 237 448 363
susPECT 1000 600 523 247 477 353
SUSpect 962 600 526 332 436 268
Note. Stressed syllables are capitalized. All values are represented in ms.
Table S3.
Fundamental frequency, duration and intensity values of the naturally recorded stimuli
originating from the male speaker.
Stimuli  First syllable: F0 (Hz), Duration (ms), Intensity (dB)  Second syllable: F0 (Hz), Duration (ms), Intensity (dB)
Legal
kiKI 127.12 275 66.03 162.57 325 66.81
KIki 176.19 273 69.03 98.24 327 61.93
kimKIM 121.19 273 64.90 121.38 327 67.52
KIMkim 205.40 399 68.89 98.43 201 60.72
kwiKWI 133.44 243 65.61 142.22 357 67.00
KWIkwi 189.53 260 66.46 95.78 340 66.59
kwimKWIM 123.16 300 63.52 121.04 300 68.26
KWIMkwim 170.04 313 67.63 87.12 287 64.84
Illegal
kivKIV 137.13 237 67.30 122.88 363 65.68
KIVkiv 166.85 232 68.82 99.55 368 63.87
kriKRI 133.27 263 65.60 144.29 337 66.50
KRIkri 190.56 269 68.10 97.10 331 63.36
krivKRIV 123.17 236 66.88 162.58 364 66.13
KRIVkriv 168.22 257 68.56 112.50 343 63.08
riRI 133.44 243 65.61 142.22 357 67.00
RIri 189.53 260 66.46 95.78 340 66.59
Real words
imPORT 109.81 224 58.19 123.29 376 67.77
IMport 163.19 177 59.91 103.00 423 66.77
inSERT 116.94 139 66.65 157.50 461 65.90
INsert 167.84 124 70.46 N/A* 476 62.97
perMIT 124.02 151 66.84 127.35 449 61.55
PERmit 168.02 155 69.34 188.23 445 57.52
susPECT 118.77 308 65.57 128.81 292 67.32
SUSpect 145.21 272 68.73 99.78 328 63.42
Note. Stressed syllables are capitalized. * = unable to obtain reliable f0 due to frication noise.
Table S4.
Fundamental frequency, duration and intensity values of the naturally recorded stimuli
originating from the female speaker.
Stimuli  First syllable: F0 (Hz), Duration (ms), Intensity (dB)  Second syllable: F0 (Hz), Duration (ms), Intensity (dB)
Legal
kiKI 215.92 319 65.28 214.86 281 67.65
KIki 297.58 361 67.01 235.04 239 65.62
kimKIM 235.17 314 65.55 226.74 286 67.4
KIMkim 271.63 297 68.1 231.5 303 64.28
kwiKWI 191.09 306 66.2 233.65 294 66.89
KWIkwi 347.38 271 68.37 240.89 329 64.29
kwimKWIM 230.07 279 64.87 248.49 321 67.64
KWIMkwim 325.53 294 68.55 233.43 306 63.03
Illegal
kivKIV 227 247 64.89 236 353 67.39
KIVkiv 299.83 228 68.53 236.92 372 64.81
kriKRI 235.22 308 65.37 264.89 292 67.55
KRIkri 283.04 294 67.12 233.36 306 65.92
krivKRIV 232.75 252 64.04 243.4 348 67.67
KRIVkriv 306.85 219 69.36 287.9 381 63.6
riRI 210.62 247 64.81 221.2 353 67.39
RIri 256.13 332 67.99 180.8 268 63.76
Real words
imPORT 221.69 250 62.96 239 350 67.86
IMport 262.08 290 68.23 205.04 310 62.43
inSERT 218.01 204 64.89 310.87 396 66.89
INsert 239.81 239 69.43 208.87 361 59.58
perMIT 227.81 209 69.19 260.65 391 64.32
PERmit 119.39 237 70.28 208.9 363 59.01
susPECT 210.62 247 64.81 221.2 353 67.39
SUSpect 256.13 332 67.99 180.8 268 63.76
Note. Stressed syllables are capitalized.
Supplementary Materials IV
Preliminary Analysis: Potential Group Differences in Control Measures
Before the main analysis, we tested whether the groups differed on non-verbal
intelligence and/or short-term memory. The Cantonese-English bilingual listeners had
significantly higher non-verbal intelligence scores (M = 11.40, SD = 1.96) than the native
English listeners (M = 9.10, SD = 2.71), t(58) = 3.77, p < .001, d = .97, but no significant
difference was found in short-term memory between Cantonese-English bilingual listeners
(M = 7.53, SD = 3.15) and native English listeners (M = 8.43, SD = 2.69), t(58) = -1.19,
p = .239. Thus, in the main analyses, only non-verbal intelligence was included as a covariate.
An alternative method for controlling for the group difference in non-verbal intelligence is
described below (removing participants from the two groups in a way that equalized the
average non-verbal intelligence scores between the groups).
Supplementary Analysis with Non-verbal Intelligence-matched Native and Non-native
Listeners
To match the two subject groups on their non-verbal intelligence scores, one third of the
Cantonese-English bilingual listeners (those at the upper end of the non-verbal intelligence
distribution), and one third of the native English listeners (those at the lower end of the non-
verbal intelligence distribution) were excluded, resulting in a data subset with 20 listeners per
group. In this data subset, Cantonese-English bilingual listeners (M = 10.50, SD = 1.79) and
native English listeners (M = 10.60, SD = 1.50) did not differ significantly in non-verbal
intelligence, t(38) = -.19, p = .849. This exclusion process led to a sample in which the native
English listeners (M = 9.55, SD = 2.26) had better short-term memory scores than the
Cantonese-English bilingual listeners (M = 7.75, SD = 2.77), t(38) = -2.25, p < .05, d = .71.
Thus, short-term memory was included as a covariate in the following analyses.
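The matching step can be sketched as a simple trimming rule: sort each group by non-verbal intelligence score, drop the highest-scoring third of the bilingual group and the lowest-scoring third of the native group. Individual scores are not reported here, so the data below are hypothetical, and how the authors resolved ties at the cut-off is not stated.

```python
from statistics import mean


def match_by_trimming(high_group, low_group, fraction=1 / 3):
    """Drop the top `fraction` of high_group and the bottom `fraction` of low_group."""
    k_high = round(len(high_group) * fraction)
    k_low = round(len(low_group) * fraction)
    kept_high = sorted(high_group)[: len(high_group) - k_high]  # remove highest scorers
    kept_low = sorted(low_group)[k_low:]  # remove lowest scorers
    return kept_high, kept_low


# Hypothetical scores out of 14 for 30 listeners per group (NOT the study's data).
bilingual = [9] * 5 + [10] * 5 + [11] * 5 + [12] * 5 + [13] * 5 + [14] * 5
native = [6] * 5 + [7] * 5 + [8] * 5 + [9] * 5 + [10] * 5 + [11] * 5

kept_b, kept_n = match_by_trimming(bilingual, native)
print(len(kept_b), len(kept_n))  # 20 20
print(mean(bilingual), mean(native))  # 11.5 8.5
print(mean(kept_b), mean(kept_n))  # 10.5 9.5
```

Trimming the groups in opposite directions narrows the mean difference, but it can shift other measures as a side effect, which is why short-term memory was rechecked in the reduced sample.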
Using the same factors as the main ANCOVA (but now with memory score as the
covariate), the ANCOVA produced the same significant main effect of group, F(1, 37) = 4.33,
p < .05, ηp² = .11: The Cantonese-English bilingual listeners (average d' = 2.37)
outperformed native English listeners (average d' = 1.80) on English lexical stress
discrimination, even after the groups were matched on non-verbal intelligence. The
ANCOVA also revealed significant two-way interactions between acoustic condition and
group, F(2, 74) = 5.76, p < .01, ηp² = .14, and between context and group, F(2, 74) = 4.46, p
< .05, ηp² = .11. Consistent with our central hypothesis, simple main effects analyses
revealed that Cantonese-English bilingual listeners outperformed native English listeners in
the all-cues condition, F(1, 37) = 9.29, p < .01, ηp² = .20, CI [.38, 1.88], and in the real word
context, F(1, 37) = 6.81, p < .05, ηp² = .16, CI [.18, 1.43]; the advantage in the legal context
was marginal, F(1, 37) = 3.62, p = .065, ηp² = .09, CI [-.04, 1.10].
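Partial eta squared can be recovered from any reported F ratio and its degrees of freedom via the standard identity ηp² = F·df_effect / (F·df_effect + df_error), which is equivalent to SS_effect / (SS_effect + SS_error). The sketch below is a consistency check on reported statistics, not part of the authors' analysis pipeline; the function name is hypothetical.

```python
def partial_eta_squared(f, df_effect, df_error):
    """Partial eta squared recovered from an F ratio and its degrees of freedom."""
    return (f * df_effect) / (f * df_effect + df_error)


for label, f, df1, df2 in [
    ("all-cues simple effect", 9.29, 1, 37),
    ("real-word simple effect", 6.81, 1, 37),
    ("context x group", 4.46, 2, 74),
]:
    print(f"{label}: {partial_eta_squared(f, df1, df2):.2f}")
```

For these three tests the recomputed values (.20, .16, .11) match those reported above; small discrepancies elsewhere would be expected from rounding of F.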
Supplementary Analysis of Response Times
To evaluate the possibility that the advantage found for the Cantonese-English
bilingual listeners was due to a speed-accuracy trade-off, we conducted a 3 × 3 × 2 mixed
ANCOVA on response times with acoustic condition (all-cues, pitch-only, and duration-and-
intensity-only) and phonotactic/lexical context (legal pseudoword, illegal pseudoword and
real word) as within-subjects factors, and group (Cantonese-English bilingual listeners and
native English listeners) as a between-subjects factor. Non-verbal intelligence was the
covariate. The main effect of group was not significant, F(1, 57) = .76, p = .387: Cantonese-
English bilingual listeners did not have significantly longer response times than native
English listeners when discriminating English lexical stress.
As shown in Figures S2 and S3, instead of having longer response times, Cantonese-
English bilingual listeners actually had shorter response times than native English listeners
under the all-cues and pitch-only conditions. This was also the case for the illegal and real
word contexts. The pattern resulted in the ANCOVA having significant interactions between
acoustic condition and group, F(2, 114) = 4.85, p < .05, ηp² = .08, and between
phonotactic/lexical context and group, F(2, 114) = 6.49, p < .01, ηp² = .10. Simple main
effects analyses confirmed that Cantonese-English bilingual listeners had significantly shorter
response times than native English listeners under the pitch-only condition, F(1, 57) = 4.01,
p = .05, ηp² = .07, CI [-337.75, .02], and marginally shorter response times in the real word
context, F(1, 57) = 3.37, p = .071, ηp² = .06, CI [-311.25, 13.45]. These results demonstrate
that the advantage found for the Cantonese-English bilingual listeners cannot be accounted
for by a speed-accuracy trade-off.
[Figure S2 plot: response time (ms) by condition (All, f0, di) for Cantonese-English bilingual listeners and native English listeners]
Figure S2. Mean response times for the Cantonese-English bilingual and native English
listeners across all acoustic conditions. All, f0 and di denote all cues, pitch-only, and
duration-and-intensity-only conditions respectively.
[Figure S3 plot: response time (ms) by condition (Legal, Illegal, Real word) for Cantonese-English bilingual listeners and native English listeners]
Figure S3. Mean response times for the Cantonese-English bilingual and native English
listeners across all phonotactic/lexical contexts.