supporting information...2020/01/01  · 49 consistently positively correlated with component scores...

Submitted Manuscript: Confidential

www.pnas.org/cgi/doi/10.1073/pnas.1910704117

Supporting Information

Methods Overview

Music samples. 2,168 diverse music samples were compiled primarily using crowdsourcing and data-scraping methods. See Methods S1 for further details.

Judgments of subjective experience. Judgments of the subjective experience evoked by the music samples were obtained using Amazon Mechanical Turk and Qualtrics. Two separate surveys were used to obtain judgments: one for the category judgments and one for the affective scale judgments. Each survey was written in English and then translated into Chinese by author X. F. A total of 2,777 participants, including 1,591 English-speaking US participants (815 women, mean age = 35.6) and 1,258 Chinese participants (805 women, mean age = 28.3), took part in the study. The experimental procedures for each survey were approved by the Institutional Review Board at the University of California, Berkeley or the Psychology Ethics Committee at the University of Amsterdam. All participants gave their informed consent.

Statistical analyses. Our statistical analyses are outlined briefly below. Data were analyzed primarily using custom code in Matlab. For a detailed description of each method, see SI Methods.

Category judgment proportions. For each music sample, we computed (1) the proportion of participants who chose each category and (2) the average judgments of each affective scale. To estimate the significance of the category judgment proportions of each music sample, we constructed a null distribution of category judgment proportions using a Monte Carlo simulation.

Signal correlations. To derive signal correlations between cultures for each judgment, we correlated the mean judgments from each culture across all music samples and divided by the estimated explainable variance. Explainable variance was estimated by dividing the mean of the squared standard errors (estimated using bootstrapping) by the total variance and subtracting this quantity from 1. To calculate standard errors and p-values for signal correlations, it was necessary to account for potential non-independence across ratings of different music samples due to the fact that each rater rated multiple samples. To do so, we applied a non-parametric bootstrap approach, using stratified resampling across individual raters rather than individual ratings. We validate these methods by demonstrating that signal correlations accurately estimate the respective population-level correlations in Monte Carlo simulations (Fig. S3).

Regression between category and affective scale judgments. We predicted affective feature judgments from category judgments using ordinary least squares (OLS) linear regression. Here, it may be worth acknowledging that methods specialized for sparse data could potentially have produced better prediction correlations. However, this only provides for a more conservative interpretation of our findings that category judgments explain the preservation of affective feature judgments in the two different cultural groups.

PPCA. We determined the number of dimensions necessary to explain the preservation of reports of subjective experience across two different cultures using a new method called principal preserved component analysis (PPCA). PPCA maximizes the objective function

$\mathrm{Cov}(X\alpha, Y\alpha)$. We tested the significance of each component by applying PPCA in a leave-one-rater-out fashion, determining whether held-out ratings projected onto each component were consistently positively correlated with component scores of ratings from the other country using a non-parametric Wilcoxon signed-rank test (39). After determining the number of significant PPCs, we applied varimax rotation in order to generate more interpretable components. To compute p- and q-values for the scores of each individual music sample on each component, we used a Monte Carlo simulation of the category ratings.

It is worth acknowledging that we do not establish here that PPCA is applicable to all distributions of data. However, we do establish that PPCA generates accurate results on randomly simulated data distributed identically to those of the present study, but with varying numbers of underlying dimensions (Fig. S8). Thus, PPCA is applicable here (also see (10) for further validation of method).

Interactive maps. To visualize the distribution of music samples within the multidimensional space derived using PPCA, we applied a method called t-distributed stochastic neighbor embedding (t-SNE) (45). We then assigned a color to each music sample in the map corresponding to a weighted average of the unique colors of its top two scores on the 13 categorical judgment dimensions.

Continuous versus discrete category models. We compared how well continuous versus discrete category models predicted affective scale ratings using ordinary least squares (OLS) regression. For the discrete models, we used OLS to predict the affective scale judgments from the maximally loading PPC, which was converted to a dummy variable (1 for the maximally loading PPC, 0 otherwise) to form a fully discrete model, and to a continuous intensity (keep the maximally loading PPC, convert others to 0) to form a discrete model with intensity. For these analyses, we collapsed across ratings from the US and China. To test for a difference in variance explained between the continuous and discrete category models, we used across-rater bootstrap resampling.

Data and code availability. The 2,168 music samples used in the present study, their mean ratings, and the analysis code can be requested here: https://goo.gl/forms/SvcNjZdkmwoojhI82.

Methods S1. Collecting Music Samples. Across two experiments, we used a total of 2,168 music samples drawn from five sources.

1. Our primary source for music samples was the same participant population from whom we collected our judgment data. Adopting a crowdsourcing paradigm, we recruited 111 participants (41 women, mean age = 32) from the US to each find 5-second music samples without lyrics on YouTube that evoked each of our 28 categories of subjective experience (Fig. S1). After excluding entries that contained discernible lyrics and some entries that overlapped in content (i.e., containing parts of the same sample of the same music track), we were left with 1,572 music samples.

2. For diversity, we added 88 clips of leitmotifs from Howard Shore’s Lord of the Rings soundtrack and 181 leitmotifs from Wagner’s Ring cycle, which are known to convey strong feelings with short segments of music (1) and are therefore ideal for the present study.


3. For Experiment 2, we used our crowdsourcing paradigm to recruit 22 participants (6 women, mean age = 30.7) from the US to each find 5-second music samples on YouTube that evoked 12 levels of valence and arousal (see main text). After excluding entries that contained discernible lyrics and some entries that overlapped, we were left with 138 music samples.

4. To collect Chinese traditional music samples, we recruited 6 native Chinese participants familiar with the genre to contribute Chinese traditional music tracks that conveyed as many as possible of the 12 valence and arousal levels. From each Chinese traditional music track, three evenly spaced 5-second samples were then extracted, for a total of 189 music samples.

Sound intensities for each music sample were normalized to similar volumes using the ReplayGain algorithm.

Methods S2. The Category and Scale of Affect Judgment Surveys. Three separate surveys were used to obtain judgments of subjective experience: one for the category judgments and two for the affective scale judgments. Surveys were translated into Chinese by author X. F. and a research assistant, both of whom speak Chinese as a first language.

The category judgment survey was used to obtain multiple choice judgments of the feelings conveyed by each music sample. Each of the 1,841 music samples in Experiment 1 was judged by an average of 31.2 observers, averaging 22.9 from the US and 9.3 from China, in terms of the 28 categories (listed in Table S1). Each of the 327 music samples in Experiment 2 was judged by an average of 58.6 observers, averaging 36.5 from the US and 22.0 from China. Observers were required to select one or more categories per music sample. In an individual survey, each observer provided data for 30 music samples in the US and up to 80 or 81 in China, ordered randomly. Observers in the US were allowed to complete as many versions of the survey as desired, with different music samples presented in each. Across Experiments 1 and 2, the average US participant provided responses to 71.4 music samples, and the average Chinese participant provided responses to 70.8 music samples. Payment for each survey was 55 cents in the US and 15 yuan in China.

The affective scale judgment survey was used to obtain rating scale judgments of the feelings conveyed by each music sample. Each of the music samples was rated by an average of 15.6 observers in Experiment 1, averaging 7.1 from the US and 8.5 from China, and 36.8 in Experiment 2, averaging 18.5 from the US and 18.3 from China. Each music sample was evaluated along the 11 affective scales and, in Experiment 1, was also judged for genre in terms of 18 options (ALTERNATIVE, AMBIENT, CLASSICAL, COUNTRY, DISCO, EDM, ELECTRONIC (OTHER), FOLK, HEAVY METAL, HIP-HOP, JAZZ, LATIN, NOISE, POP, R&B, REGGAE, ROCK, and WORLD MUSIC; see Fig. S2 for genre breakdown). The ratings were each obtained on a nine-point Likert scale with the number 5 anchored at neutral (aside from genre, which was multiple choice). See Table S2 for the questions corresponding to each affective feature. Within these surveys, which comprised more questions than the category judgment survey, observers provided data for 12 music samples in the US and up to 80 in China, ordered randomly. Observers in the US were allowed to complete as many versions of the survey as desired, with different music samples presented in each. Across Experiments 1 and 2, the average US participant provided responses to 23.6 music samples, and the average Chinese participant provided responses to 23.0 music samples. Payment for each survey was 85 cents in the US and 15 yuan in China.

Across experiments and survey formats, the average US participant provided responses to 46.1 music samples, and the average Chinese participant provided responses to 36.5 music samples (overall average = 41.8). For the distribution of responses to each question in each survey, see Fig. S2.

To recruit participants from the US, we ensured participants were registered within the US on Amazon Mechanical Turk and asked participants to self-report their country of origin. We excluded data from participants whose registration location and country of origin did not match. To recruit participants from China, we asked a network of research associates of author X. F. to recruit students and other participants from existing research participant pools at Chinese universities. We ensured that the participants were all from Mainland China and were studying or working in China when they completed the survey. Some Chinese participants who began the survey did not complete any trials; these participants were excluded.

Methods S3. Estimating Significance of Category Judgment Proportions and Mean Affective Scale Judgments of Individual Music Samples. To estimate the significance of the category judgment proportions of each music sample, we first constructed a null distribution of category judgments using a Monte Carlo simulation. We simulated N random 28-category judgments of each of 100,000 music samples. Judgments were drawn at random, with replacement, from the actual judgments of the 1,841 stimuli in Experiment 1. We calculated the P value for each proportion as the proportion of times that a greater proportion for that category appeared within the null distribution. Given that different music samples were rated differing numbers of times (N), we conducted the above simulation separately for each N. We controlled the FDR using the Benjamini–Hochberg procedure (2).

Similarly, to estimate the significance of the Cohen’s d for the mean affective scale judgments of each music sample, we constructed a null distribution using a Monte Carlo simulation. Here, we simulated N random judgments of each of 100,000 music samples along each of the 11 affective scales. Judgments were drawn at random, with replacement, from the actual judgments of the 1,841 stimuli. We computed Cohen’s d by subtracting the mean judgments of all other samples from the mean judgments of each sample and dividing by the standard deviation of the judgments of each sample. We calculated the P value for each Cohen’s d as the proportion of times that a greater Cohen’s d was observed for that affective scale within the null distribution. Given that different music samples were rated differing numbers of times (N), we conducted the above simulation separately for each N. We controlled the FDR using the Benjamini–Hochberg procedure (2).

Methods S4. Explainable Variance Calculation. To calculate explainable variance (80), we note that the variance of a given rating across stimuli is equal to the explainable variance plus the unexplainable variance. The unexplainable variance can be estimated as the mean of the squared standard errors across stimuli. Hence, the proportion of explainable variance can be estimated by simply dividing the mean of the squared standard errors by the total variance and subtracting this quantity from 1.
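The calculation just described can be sketched in a few lines of Python. This is an illustrative sketch only, not the authors' Matlab code: the function names, the 200 resamples used in the demo, and the simulated ratings are all assumptions made here for the example.

```python
import random
import statistics


def bootstrap_se(ratings, n_boot, rng):
    """Standard error of the mean, estimated as the SD of bootstrap means."""
    n = len(ratings)
    means = [statistics.fmean(rng.choices(ratings, k=n)) for _ in range(n_boot)]
    return statistics.stdev(means)


def explainable_variance(ratings_per_stimulus, n_boot=1000, seed=0):
    """Return 1 - mean(squared SE) / variance of the per-stimulus means."""
    rng = random.Random(seed)
    means = [statistics.fmean(r) for r in ratings_per_stimulus]
    squared_se = [bootstrap_se(r, n_boot, rng) ** 2 for r in ratings_per_stimulus]
    total_variance = statistics.pvariance(means)  # (1/n) * sum of squared deviations
    return 1.0 - statistics.fmean(squared_se) / total_variance


# Hypothetical data: 50 stimuli, 20 ratings each, rating noise SD = 1.
demo_rng = random.Random(1)
true_means = [demo_rng.uniform(1, 9) for _ in range(50)]
demo_ratings = [[m + demo_rng.gauss(0, 1) for _ in range(20)] for m in true_means]
r_exp = explainable_variance(demo_ratings, n_boot=200)
```

Because the simulated stimulus means differ far more than the noise in each mean, `r_exp` comes out close to 1 here; noisier ratings or fewer raters per stimulus would pull it down.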


More formally, let $\bar{Y}_j$ be the mean judgment of stimulus $j$, $\sigma_j^2$ be the squared standard error of the mean judgment $\bar{Y}_j$, and $\sigma^2$ be the variance of $\bar{Y}_j$ over all $n$ stimuli $j$. Note that the actual proportion of explainable variance in the mean judgments of stimuli in a given culture is given by

$$r_{\mathrm{exp}} = 1 - \frac{\frac{1}{n}\sum_{j=1}^{n}\sigma_j^2}{\sigma^2}.$$

If $\bar{Y}$ is the observed mean over all $\bar{Y}_j$, then we estimate $\sigma^2$ with

$$s^2 = \frac{1}{n}\sum_{j=1}^{n}\left(\bar{Y}_j - \bar{Y}\right)^2.$$

Then, we estimate the squared standard error for each stimulus, $\sigma_j^2$, with $s_j^2$ using a non-parametric bootstrapping approach.

Specifically, to obtain a standard error estimate, $s_j$, for the average judgment of each music sample (in terms of each individual category and affective scale), the ratings of each attribute for each music sample were resampled with replacement and then averaged. This was repeated 1,000 times, approximating the sampling distribution of each average with 1,000 resamples from the empirical distribution function of the observed data. The sample standard deviation across resamples was then used to estimate the standard errors for each average, which were used to estimate the explainable variance of judgments of each category and affective scale in each culture.

To ensure that the explainable variance estimates were accurate, we performed repeated simulations of our entire experiment, specifying sampling distributions that closely matched our actual data. The results, shown in Fig. S3, confirm that when the explainable variances we estimate (as above) are used to adjust our sample statistics, we accurately recover the population-level statistics that guided the simulation.

Explainable variance was also estimated for Spearman correlations between mean judgments in each culture. We note that the Spearman correlations are equivalent to Pearson correlations between the ranks (adjusted for ties) of each judgment mean across music samples. Thus, the explainable variance for each Spearman correlation was calculated by applying the ranking transformation to each of the 1,000 resample averages for each category and affective scale, estimating standard errors for each rank using the sample standard deviation across resamples, and using these standard error estimates to compute the explainable variance.

Methods S5. Estimates, Standard Errors and P-Values for Cross-Cultural Signal Correlations.
Dividing correlations in judgments across cultures by the explainable variance in each judgment in each culture ($r/r_{\mathrm{exp}}$) results in what we refer to here as a signal correlation: an unbiased estimate of what the correlation would be if we averaged an infinite number of ratings in each culture. We then calculate a standard error around that estimate. (Note that the standard error will depend on sample size while the expected value of the signal correlation will not.) See (3) for a review of this technique, applied in the field of neuroimaging. Signal correlations capture the degree of similarity between cultures in the recognition of each category and scale of affect from music while correcting for the sampling error arising from inconsistent judgments within each culture (see Fig. S3 for a demonstration that these methods are effective using simulated data). For example, if only two ratings were collected for each judgment in each culture, the correlation across cultures would be lower because the mean estimates would be less reliable. The signal correlation corrects for the noise in mean judgments and thus gives an unbiased estimate of the population-level correlation.
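To make the correction concrete, the following Python sketch simulates two cultures whose true mean judgments are identical, adds sampling noise to each, and compares the raw correlation with the corrected one. It is a sketch under stated assumptions: the text writes the correction only as $r/r_{\mathrm{exp}}$, so combining the two cultures' explainable variances as a geometric mean (the usual disattenuation form) is an assumption made here, as are the function names and the simulated data.

```python
import math
import random
import statistics


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def signal_correlation(means_a, means_b, r_exp_a, r_exp_b):
    """Raw correlation divided by the (assumed) geometric mean of r_exp values."""
    return pearson(means_a, means_b) / math.sqrt(r_exp_a * r_exp_b)


# Hypothetical simulation: 500 stimuli, identical latent means in both
# cultures, and a known standard error of 0.5 on every observed mean.
rng = random.Random(0)
latent = [rng.gauss(0, 1) for _ in range(500)]
se = 0.5
us_means = [t + rng.gauss(0, se) for t in latent]
cn_means = [t + rng.gauss(0, se) for t in latent]

# Explainable variance: 1 - mean squared SE / total variance of the means.
r_exp_us = 1 - se**2 / statistics.pvariance(us_means)
r_exp_cn = 1 - se**2 / statistics.pvariance(cn_means)

raw_r = pearson(us_means, cn_means)
signal_r = signal_correlation(us_means, cn_means, r_exp_us, r_exp_cn)
```

In this simulation the raw correlation is attenuated well below 1 by the noise in each mean, while the signal correlation lands near the true population value of 1.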

To calculate standard errors and p-values for cross-cultural signal correlations, it was necessary to account for potential non-independence across ratings of different music samples due to the fact that each rater rated multiple samples. To do so, we applied a second non-parametric bootstrap approach, resampling across individual raters (who can be assumed to be independent) rather than individual ratings.

To accomplish this, we first resampled 1,000 times with replacement from the list of raters who participated in each experiment in each culture. Because the raters each contributed a different number of ratings, we sought to mitigate the variation in the overall number of ratings using stratified resampling. Specifically, the raters were divided into four subgroups before resampling, each subgroup consisting of raters who contributed similar numbers of ratings. To accomplish this, the raters were sorted from fewest to most ratings contributed and then divided into subgroups such that Subgroups 1-3 each accounted for a quarter or less of the total ratings and Subgroup 4 accounted for the rest. The ratings from each resampled list of raters were then concatenated and averaged for each music sample.
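The stratified resampling across raters might look roughly like the Python sketch below. It is not the authors' implementation: the data layout (a dictionary mapping each rater to a list of `(sample_id, rating)` pairs), the function names, and the toy data are hypothetical, and the exact rule used to cut the subgroups is only one reading of the description above.

```python
import random
from collections import defaultdict


def stratify_raters(ratings_by_rater, n_groups=4):
    """Sort raters by rating count and split them into subgroups.

    Every subgroup except the last holds at most 1/n_groups of all
    ratings; the last subgroup takes whatever remains.
    """
    raters = sorted(ratings_by_rater, key=lambda r: len(ratings_by_rater[r]))
    cap = sum(len(v) for v in ratings_by_rater.values()) / n_groups
    groups = [[] for _ in range(n_groups)]
    g, used = 0, 0
    for rater in raters:
        k = len(ratings_by_rater[rater])
        while g < n_groups - 1 and used + k > cap:
            g, used = g + 1, 0  # current subgroup is full, open the next one
        groups[g].append(rater)
        used += k
    return groups


def resample_means(ratings_by_rater, rng):
    """One bootstrap replicate: resample raters within each subgroup with
    replacement, pool their ratings, and average per music sample."""
    sums, counts = defaultdict(float), defaultdict(int)
    for group in stratify_raters(ratings_by_rater):
        if not group:
            continue
        for rater in rng.choices(group, k=len(group)):
            for sample_id, value in ratings_by_rater[rater]:
                sums[sample_id] += value
                counts[sample_id] += 1
    return {s: sums[s] / counts[s] for s in sums}


# Hypothetical data: 12 raters contributing 5, 8, 11, ... ratings of 10 samples.
rng = random.Random(0)
demo = {
    f"rater{i}": [(rng.randrange(10), rng.randint(1, 9)) for _ in range(5 + 3 * i)]
    for i in range(12)
}
groups = stratify_raters(demo)
replicate = resample_means(demo, rng)
```

Calling `resample_means` 1,000 times would give the 1,000 resampled sets of per-sample averages that the subsequent steps operate on. A music sample that no resampled rater happened to rate is simply absent from a given replicate; how such cases were handled in the original analysis is not stated in the text.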

After resampling the categorical and dimensional ratings separately across raters from each culture, we calculated separate correlations across cultures for each of 1,000 resampled datasets. These bootstrapped correlations were biased downward relative to the empirical correlations from the original data, due to the fact that the redundancy of the resamples adds noise to each judgment average. To directly estimate this downward bias, we correlated the full 1,000 × 1,841 vector of resampled average judgments with the original empirical average judgments of each category and affective scale (replicated 1,000 times), within each culture. We then divided the 1,000 bootstrapped correlations by the downward bias estimates, resulting in an unbiased sampling distribution of each correlation with 1,000 resamples. This adjustment proved to be successful in eliminating bias in repeated Monte Carlo simulations of the entire experiment (Fig. S3; also see (4) for further validation of method).

The standard error for the sampling distribution of each correlation, shown as error bars in Fig. 1, was estimated by taking the sample standard deviation across the bootstrapped correlation estimates for each category and affective scale. P-values for differences between correlations were calculated as the proportion of times one bootstrapped correlation exceeded the other across the 1,000 resamples. This one-tailed test was used to compare the signal correlations for each category to those for valence and arousal, in order to determine whether the recognition of the categories was better preserved across cultures than that for valence and arousal. Each p-value was corrected for multiple comparisons by using the Benjamini-Hochberg method (82) to control for false discoveries across the 28 categories.
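The one-tailed bootstrap comparison and the Benjamini-Hochberg correction can be illustrated with a short Python sketch. The function names and the toy numbers are hypothetical; the Benjamini-Hochberg step follows the standard step-up procedure and returns adjusted p-values (q-values).

```python
def one_tailed_bootstrap_p(boot_category, boot_reference):
    """Proportion of paired resamples in which the reference correlation
    is at least as large as the category correlation."""
    pairs = list(zip(boot_category, boot_reference))
    return sum(ref >= cat for cat, ref in pairs) / len(pairs)


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    q_values = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotone q-values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        q_values[i] = running_min
    return q_values


# Toy example: four resamples of two correlations, then four p-values.
p_demo = one_tailed_bootstrap_p([0.9, 0.8, 0.85, 0.7], [0.5, 0.82, 0.6, 0.65])
q_demo = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
```

With 1,000 resamples per comparison, the smallest nonzero p-value this test can return is 0.001, which bounds how small the corrected values can be.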

Standard errors and p-values for Spearman correlations were calculated similarly to those for Pearson correlations. Since the Spearman correlation is equivalent to a Pearson correlation between rank-transformed vectors, the ranking transform was applied to each resample prior to computing correlations and estimating bias.

Methods S6. Regression Analyses Predicting Affective Scales from Categories, and Vice Versa, Across Cultures, As Represented in Fig. 1B and S5. Figs. 1B and S5 display signal correlations between actual ratings of the affective scales and ratings of the affective scale predicted from the category ratings of the other culture.


During bootstrapping, signal correlations were computed for each of the 1,000 bootstrap-resampled datasets generated by resampling across raters (Methods S5). The bootstrapped correlations were adjusted for downward bias due to bootstrapping across raters, calculated as described in Methods S5, and adjusted for total explainable variance in judgments, as described in Methods S4. Standard errors were calculated as the sample standard deviation across bootstrap estimates. Two-tailed p-values for comparisons between estimates were calculated as the proportion of times the lesser estimate exceeded the greater estimate, multiplied by two.

Affective scale judgments were predicted from category judgments using ordinary least squares (OLS) regression, generating a matrix of weights on each category for each affective scale. The weights were applied to the category judgments from the other culture to predict the affective scale judgments in each culture. Predicted affective scale judgments were correlated with the actual average affective scale judgments in each culture. Explainable variance was calculated by applying the weights to bootstrap resampled judgment averages and then taking the sample standard deviation across bootstrap predictions, as in Methods S4. Prediction correlations were adjusted using this explainable variance estimate.
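A minimal NumPy sketch of the cross-cultural prediction step is given below. It assumes, as one reading of the description, that the weights are fitted within one culture and then applied to the other culture's category judgments; the intercept term, the function names, and the simulated data are assumptions of this example, and the explainable-variance and bias adjustments are omitted.

```python
import numpy as np


def fit_ols(categories, scales):
    """OLS weights (with intercept) mapping category judgments to scales."""
    design = np.column_stack([np.ones(len(categories)), categories])
    weights, *_ = np.linalg.lstsq(design, scales, rcond=None)
    return weights


def predict(weights, categories):
    """Apply fitted weights to a (possibly different) set of category judgments."""
    design = np.column_stack([np.ones(len(categories)), categories])
    return design @ weights


def columnwise_corr(a, b):
    """Pearson correlation between matching columns of a and b."""
    a = a - a.mean(axis=0)
    b = b - b.mean(axis=0)
    return (a * b).sum(axis=0) / np.sqrt((a**2).sum(axis=0) * (b**2).sum(axis=0))


# Hypothetical data: 200 music samples, 5 categories, 2 affective scales.
rng = np.random.default_rng(0)
latent = rng.uniform(size=(200, 5))               # shared category signal
true_w = np.array([[1, -1], [0.5, 1], [-1, 0.5], [2, 0], [0, -2]], dtype=float)
cat_us = latent + 0.05 * rng.normal(size=latent.shape)
cat_cn = latent + 0.05 * rng.normal(size=latent.shape)
scale_us = latent @ true_w + 0.1 * rng.normal(size=(200, 2))

weights = fit_ols(cat_us, scale_us)               # fit within one culture
predicted_us = predict(weights, cat_cn)           # apply to the other culture
prediction_r = columnwise_corr(predicted_us, scale_us)
```

Each entry of `prediction_r` is the raw prediction correlation for one affective scale, i.e. the quantity that would subsequently be adjusted for explainable variance.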

To generate standard errors and p-values for prediction correlations that accounted for potential non-independence across ratings of different music samples from each rater, as in Methods S5, the prediction weights were applied to each of the 1,000 bootstrap-resampled sets of category judgments generated by resampling across raters. The resulting bootstrap-resampled predictions were then correlated with the bootstrapped affective scale judgments from each culture. Downward bias of the correlation due to bootstrapping the predictions across raters was estimated by correlating the full matrix of 1,000 bootstrapped predictions with the predictions made using the original average judgments of each category (replicated 1,000 times). The bootstrap prediction correlations were subsequently adjusted both for downward bias due to bootstrapping across raters and for explainable variance.

All predictions were also computed for binarized affective scale ratings, also shown in Fig. S5, by thresholding every individual affective scale rating by its within-culture average prior to each analysis.

Methods S7. Finding Shared Dimensions of Subjective Experience Across Cultures: Principal Preserved Component Analysis (PPCA). We developed PPCA to extract the shared dimensions of reported experience (components of variance) across the same judgments made in two cultures (datasets composed of matched variables). PPCA first seeks a unit vector $\alpha_1$ that maximizes the objective function

$$\mathrm{Cov}(X\alpha_1, Y\alpha_1).$$

We call $\alpha_1$ the first principal preserved component. Subsequent components are obtained by seeking additional unit vectors $\alpha_i$ that maximize the objective function $\mathrm{Cov}(X\alpha_i, Y\alpha_i)$ subject to the constraint that $\alpha_i$ is orthogonal to the previous components, $\alpha_1, \ldots, \alpha_{i-1}$.

In the special case that $X = Y$, PPCA is equivalent to PCA, given that the latter method maximizes the objective function

$$\mathrm{Var}(X\alpha_i) = \mathrm{Cov}(X\alpha_i, X\alpha_i)$$

(substituting another $X$ for $Y$ in $\mathrm{Cov}[X\alpha_1, Y\alpha_1]$). Also note the similarity to the partial least squares correlation analysis (PLSC) objective, which seeks to find two separate bases $\alpha$ and $\beta$ to maximize

$$\mathrm{Cov}(X\alpha_i, Y\beta_i),$$

as well as the CCA objective, which seeks to maximize

$$\mathrm{Corr}(X\alpha_i, Y\beta_i).$$

However, given our aim of finding preserved dimensions of subjective experience across two cultures, PPCA derives only one basis, $\alpha$, that applies to both datasets. In PPCA, therefore, the data matrices must be commensurate: observations in both datasets must be of the same dimension (i.e. the number of rows in $X$ and $Y$ must be equal). This is certainly true in the present study, given that we collect the same judgments of each music sample in each culture.

To solve the PPCA objective and find an $\alpha_1$, we apply eigendecomposition to the sum of the halved cross-covariance matrix between datasets and its halved transpose: $\mathrm{Cov}(X,Y)/2 + \mathrm{Cov}(Y,X)/2$. We claim that the principal eigenvector of this symmetric matrix maximizes $\mathrm{Cov}(X\alpha_1, Y\alpha_1)$. To derive this, first recall a general property of cross-covariance, $\mathrm{Cov}(Xa, Yb) = b^{T}\mathrm{Cov}(X, Y)\,a$. Thus,

$$\mathrm{Cov}(X\alpha_1, Y\alpha_1) = \alpha_1^{T}\,\mathrm{Cov}(X, Y)\,\alpha_1 \qquad \text{(Property 1)}$$

In addition, because both $X\alpha_1$ and $Y\alpha_1$ are vectors, $\mathrm{Cov}(X\alpha_1, Y\alpha_1) = \mathrm{Cov}(Y\alpha_1, X\alpha_1)$. Thus,

$$\mathrm{Cov}(X\alpha_1, Y\alpha_1) = \mathrm{Cov}(X\alpha_1, Y\alpha_1)/2 + \mathrm{Cov}(Y\alpha_1, X\alpha_1)/2 \qquad \text{(Property 2)}$$

Combining these two properties, we can see that

$$\begin{aligned}
\mathrm{Cov}(X\alpha_1, Y\alpha_1) &= \mathrm{Cov}(X\alpha_1, Y\alpha_1)/2 + \mathrm{Cov}(Y\alpha_1, X\alpha_1)/2 && \text{(by Property 2)} \\
&= \alpha_1^{T}\,\mathrm{Cov}(X, Y)\,\alpha_1/2 + \alpha_1^{T}\,\mathrm{Cov}(Y, X)\,\alpha_1/2 && \text{(by Property 1)} \\
&= \alpha_1^{T}\left[\mathrm{Cov}(X, Y)/2 + \mathrm{Cov}(Y, X)/2\right]\alpha_1
\end{aligned}$$

Now, letting $R = \mathrm{Cov}(X,Y)/2 + \mathrm{Cov}(Y,X)/2$, we see that maximizing $\alpha_1^{T}R\,\alpha_1$ is equivalent to maximizing $\mathrm{Cov}(X\alpha_1, Y\alpha_1)$, the originally stated PPCA objective. (Note that if $X = Y$, we are applying eigendecomposition to $\mathrm{Var}[X\alpha_i] = \mathrm{Cov}[X\alpha_i, X\alpha_i]$, which performs PCA.) Finally, the min-max theorem dictates that the principal eigenvector of $R$ maximizes $\alpha_1^{T}R\,\alpha_1$ subject to $\alpha_1$ being a unit vector ($|\alpha_1| = 1$).

We have thus found a unit vector $\alpha_1$ that maximizes $\mathrm{Cov}(X\alpha_1, Y\alpha_1)$, the covariance between the projections of X and Y onto the first component. Based on the min-max theorem, subsequent eigenvectors $\alpha_i$ will maximize $\mathrm{Cov}(X\alpha_i, Y\alpha_i)$ subject to their orthogonality with previous components $\alpha_1$ through $\alpha_{i-1}$ and to each $\alpha_i$ also being a unit vector ($|\alpha_i| = 1$).
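The derivation above can be checked numerically. The following is a minimal sketch in Python with NumPy, not the authors' Matlab code; the function names `ppca` and `cov_of_projections` and the simulated data are illustrative assumptions. It builds the symmetrized cross-covariance matrix R, takes its eigenvectors, and compares the covariance captured by the first component against that captured by random unit vectors. It also checks the special case X = Y, in which the eigenvalues should match those of PCA.

```python
import numpy as np

def ppca(X, Y):
    """Eigendecomposition of R = Cov(X,Y)/2 + Cov(Y,X)/2 (illustrative helper).

    X and Y are (samples x features) matrices with matching shapes.
    Returns eigenvalues in descending order and the matching eigenvectors (columns).
    """
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    C = Xc.T @ Yc / (X.shape[0] - 1)      # cross-covariance between datasets
    R = (C + C.T) / 2                     # symmetrized cross-covariance
    eigvals, eigvecs = np.linalg.eigh(R)  # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]
    return eigvals[order], eigvecs[:, order]

def cov_of_projections(X, Y, a):
    """Sample covariance between the projections Xa and Ya."""
    return np.cov(X @ a, Y @ a)[0, 1]

# Simulated data (an assumption for illustration): shared signal plus independent noise.
rng = np.random.default_rng(0)
n, p = 500, 5
signal = rng.normal(size=(n, p)) * np.array([2.0, 1.5, 1.0, 0.5, 0.1])
X = signal + rng.normal(size=(n, p))
Y = signal + rng.normal(size=(n, p))

eigvals, alphas = ppca(X, Y)
best = cov_of_projections(X, Y, alphas[:, 0])

# No random unit vector should capture more covariance than the first component.
random_dirs = rng.normal(size=(1000, p))
random_dirs /= np.linalg.norm(random_dirs, axis=1, keepdims=True)
random_best = max(cov_of_projections(X, Y, a) for a in random_dirs)

# Special case X = Y: the eigenvalues coincide with the PCA eigenvalues of X.
pca_vals = np.sort(np.linalg.eigvalsh(np.cov(X, rowvar=False)))[::-1]
ppca_vals_xx, _ = ppca(X, X)
```

The sign of each eigenvector returned by `np.linalg.eigh` is arbitrary, which does not affect the covariance of the projections because the same vector is applied to both datasets.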

We note that the min-max theorem also provides that the last eigenvector, $\alpha_n$, will minimize $\mathrm{Cov}(X\alpha_n, Y\alpha_n)$, equivalent to maximizing $-\mathrm{Cov}(X\alpha_n, Y\alpha_n)$. Hence, if there are dimensions of negative covariance between the two datasets, then some eigenvectors will maximize the negative covariance.


With respect to the corresponding eigenvalues, each eigenvalue $\lambda_i$ will be equal to $\mathrm{Cov}(X\alpha_i, Y\alpha_i)$. To see this, note that:

$$
\begin{aligned}
\left[\mathrm{Cov}(X, Y)/2 + \mathrm{Cov}(Y, X)/2\right] \alpha_i &= \lambda_i \alpha_i && \text{(Eigenvalue equation)} \\
\alpha_i^T \left[\mathrm{Cov}(X, Y)/2 + \mathrm{Cov}(Y, X)/2\right] \alpha_i &= \alpha_i^T \lambda_i \alpha_i \\
\mathrm{Cov}(X\alpha_i, Y\alpha_i) &= \lambda_i \alpha_i^T \alpha_i && \text{(By Property 1)}
\end{aligned}
$$

Now $\alpha_i^T \alpha_i = 1$ because the $\alpha_i$ are orthonormal. Hence,

$$\mathrm{Cov}(X\alpha_i, Y\alpha_i) = \lambda_i$$

This also entails that there will be negative eigenvalues corresponding to negative covariance.

To ascertain whether each component was significant, we determined whether it reliably captured positive covariance in a separate (held-out) sample of ratings. We generated p-values corresponding to the null hypothesis that the out-of-sample covariance explained by each component was no greater than zero by applying PPCA in a leave-one-rater-out fashion. Specifically, we iteratively applied PPCA to extract components from the judgments of all but one of the raters and projected the held-out rater’s judgments onto the components. We then assessed the partial Pearson correlation between the component scores derived from each held-out rater’s ratings and those derived from the mean ratings from the other culture, partialing out each previous component. Finally, we tested whether these held-out, statistically independent correlation values were consistently positive for each component using a non-parametric Wilcoxon signed-rank test (85).

See Fig. S3 for results of repeated Monte Carlo simulations validating these methods. Each simulation specifies a sampling distribution that closely matches our actual data after it is projected onto some number of orthonormal components of covariance (varying from one to the maximum, 29). The results of these simulations confirm that PPCA combined with our leave-one-rater-out approach accurately recovers the number of shared components and yields conservative p- and q-values.
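The leave-one-rater-out logic described above can be sketched as follows. This is a simplified illustration in Python with NumPy rather than the authors' implementation, and it makes several assumptions that differ from the actual study: every simulated rater rates every sample, the data are simulated, and only the first component is evaluated (so there are no previous components to partial out and the partial correlation reduces to an ordinary Pearson correlation). The names `ppca` and `held_out_correlations` are hypothetical.

```python
import numpy as np

def ppca(X, Y):
    """Eigenvectors of the symmetrized cross-covariance, sorted by descending eigenvalue."""
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    C = Xc.T @ Yc / (X.shape[0] - 1)
    eigvals, eigvecs = np.linalg.eigh((C + C.T) / 2)
    order = np.argsort(eigvals)[::-1]
    return eigvals[order], eigvecs[:, order]

def held_out_correlations(ratings_a, mean_b):
    """Correlation of each held-out rater's first-component scores with culture B's scores.

    ratings_a: (n_raters, n_samples, n_features) ratings from culture A (assumed dense).
    mean_b:    (n_samples, n_features) mean ratings from culture B.
    """
    n_raters = ratings_a.shape[0]
    corrs = np.empty(n_raters)
    for r in range(n_raters):
        # Fit PPCA on the mean ratings of all raters except rater r.
        train_mean = np.delete(ratings_a, r, axis=0).mean(axis=0)
        _, alphas = ppca(train_mean, mean_b)
        a1 = alphas[:, 0]
        # Project the held-out rater and the other culture onto the same component.
        corrs[r] = np.corrcoef(ratings_a[r] @ a1, mean_b @ a1)[0, 1]
    return corrs

# Simulated ratings (assumption): a shared signal plus rater-specific noise.
rng = np.random.default_rng(0)
n_raters, n_samples, n_features = 12, 60, 4
signal = rng.normal(size=(n_samples, n_features)) * np.array([3.0, 1.0, 0.5, 0.2])
ratings_a = signal + rng.normal(size=(n_raters, n_samples, n_features))
mean_b = signal + 0.3 * rng.normal(size=(n_samples, n_features))

corrs = held_out_correlations(ratings_a, mean_b)
```

The resulting vector `corrs` holds one held-out correlation per rater. In the procedure described above, such values would then be passed to a one-sided Wilcoxon signed-rank test (for example `scipy.stats.wilcoxon`, which is not called here) to ask whether they are consistently positive.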

We note that PPCA generates conservative estimates even though traditional cross-covariance measures are suboptimal for binomial proportion data. We believe this is the case because we use leave-one-out procedures and non-parametric techniques to test the significance of each dimension; such statistical tests avoid distributional assumptions. Nevertheless, techniques specially adapted to the distribution of the data might achieve greater statistical power and less often underestimate the dimensionality reliably shared by the two datasets. Future work should therefore focus on developing variations of PPCA with different distributional assumptions.

In addition, to verify that we would obtain comparable results using a more established method, we applied canonical correlation analysis (CCA) between the US and Chinese judgments. We did so in a similar leave-one-rater-out fashion to PPCA. Specifically, we iteratively applied CCA to extract components from the judgments of all but one of the raters and projected the held-out rater’s judgments onto the components. We then assessed the partial Pearson correlation between the component scores derived from each held-out rater’s ratings and those derived from the mean ratings from the other culture, partialing out each previous component. Finally, we tested whether these held-out, statistically independent correlation values were consistently positive for each component using a non-parametric Wilcoxon signed-rank test (5).


Note that we excluded the “Neutral” category from these analyses to avoid matrix degeneracy, resulting in dimensions that can be conceived as variations from neutrality. After determining the number of significant PPCs, we generate more interpretable components by applying varimax rotation.

Methods S8. Generating Maps of the Feelings Associated with Music. To visualize the categories of subjective experience evoked by music, chromatic maps were generated of the scores of each of the 1,841 music samples within the 13-dimensional space derived from PPCA. Each map consists of spatial coordinates and colors for each music sample.

To generate the spatial coordinates of each music sample within the map, we applied a method called t-distributed stochastic neighbor embedding (t-SNE), among the most popular techniques for visualizing high dimensional data. To visualize the data along just two dimensions, t-SNE attempts to preserve shorter distances between data points (music samples) while sacrificing the accuracy of its representation of longer distances. As a result, t-SNE naturally groups together music samples that convey similar feelings and is able to capture smooth, continuous variations within the 13-dimensional space despite being limited to two dimensions. Of course, some information is lost in this process, which is why it is important to simultaneously view a second, independent channel of information, conveyed through the color assigned to each music sample.

To generate coordinates for each music sample, t-SNE was applied 100 times to the data matrix, using default Matlab settings (1000 iterations, perplexity = 30, learning rate = 500, theta = 0.5). The t-SNE map with the lowest loss (Kullback-Leibler divergence) was then subjected to more iterations of t-SNE for fine-tuning purposes (1000 more iterations under the same settings).

The color assigned to each music sample corresponds to a weighted average of the unique colors corresponding to its top two scores on the 13 categorical judgment dimensions. Of course, other information is lost in using only the top two dimensions, but this supports the visual clarity of the map, the goal of which is to capture as much information as possible about the structure of subjective experience evoked by music in a relatively accessible format. The combined color representations and structure of the map reveal the smooth gradients that traverse many categories, such as fear and sadness.
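The color assignment described above can be sketched as a small Python function. The text does not specify how the weights are computed, so this sketch assumes that the weights are the two top scores themselves, clipped at zero and normalized to sum to one. The function name `blend_top_two` and the three-color palette are illustrative, not part of the original method.

```python
import numpy as np

def blend_top_two(scores, palette):
    """Weighted average of the colors of the two highest-scoring dimensions.

    scores:  one score per dimension for a single music sample.
    palette: (n_dimensions, 3) array with one RGB color per dimension.
    Assumption: weights are the clipped, normalized top-two scores.
    """
    scores = np.asarray(scores, dtype=float)
    palette = np.asarray(palette, dtype=float)
    top_two = np.argsort(scores)[::-1][:2]          # indices of the two largest scores
    weights = np.clip(scores[top_two], 0.0, None)   # ignore negative scores
    if weights.sum() == 0:
        weights = np.ones(2)                        # fall back to an equal blend
    weights = weights / weights.sum()
    return weights @ palette[top_two]

# Toy example with three dimensions colored red, green, and blue.
palette = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
color = blend_top_two([0.6, 0.2, 0.0], palette)
```

In this toy example the first dimension receives weight 0.75 and the second 0.25, so the sample is drawn mostly in the first dimension's color with a tint of the second.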


Methods S9. Estimating Variance Captured in the Affective Scales by the 13 PPCs and the Maximal PPC Alone. We used ordinary least squares (OLS) regression to predict the affective scale judgments from the scores of each music sample on the 13 PPCs (continuous variation model). To compare these predictions to those of a discrete model, we also used OLS to predict the affective scale judgments from the maximally loading PPC, which was converted to a dummy variable (1 for the maximally loading PPC, 0 otherwise) to form a fully discrete model, and to a continuous intensity (keep the maximally loading PPC, convert others to 0) to form a discrete model with intensity. For these analyses, we averaged across all ratings from the US and China. We computed variance explained within the training sample for each model, given that with 13 dimensions and 1,841 data points, we can expect very little overfitting.

To test for a difference in variance explained between the continuous variation model and the discrete model, we used across-rater bootstrap resampling, as in Methods S4. Here, we


applied bootstrapping only to the affective scale judgments. (Applying bootstrapping to the category judgments for purposes of testing the discrete model would not be feasible here, given that the estimation of maxima across PPC loadings for each resample would violate assumptions of smoothness for the bootstrap method.) The resampled affective scale judgments were correlated with the original predictions of the continuous and discrete models to produce bootstrapped prediction correlations. The p-value for the difference in correlation was computed as the proportion of times that the continuous model prediction correlations exceeded those of each of the discrete models, multiplied by two (two-tailed), across bootstrap resamples.
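The three design matrices compared in Methods S9 can be sketched in Python with NumPy as follows. This is an illustration on simulated data, not a reproduction of the reported analysis: the outcome is generated from a linear function of all 13 scores, so the advantage of the continuous model here is true by construction. The helper `r_squared` and all variable names are assumptions.

```python
import numpy as np

def r_squared(design, y):
    """In-sample variance explained by an OLS fit with an intercept."""
    A = np.column_stack([np.ones(len(y)), design])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)   # least-squares solution
    resid = y - A @ coef
    return 1.0 - resid.var() / y.var()

# Simulated PPC scores and one affective scale (assumption for illustration).
rng = np.random.default_rng(1)
n, k = 1000, 13
scores = rng.normal(size=(n, k))
y = scores @ rng.normal(size=k) + 0.5 * rng.normal(size=n)

# Fully discrete model: dummy variable marking the maximally loading PPC.
top = scores.argmax(axis=1)
one_hot = np.zeros((n, k))
one_hot[np.arange(n), top] = 1.0

# Discrete model with intensity: keep the maximal score, set the others to 0.
intensity = one_hot * scores

r2_continuous = r_squared(scores, y)     # continuous variation model
r2_discrete = r_squared(one_hot, y)      # fully discrete model
r2_intensity = r_squared(intensity, y)   # discrete model with intensity
```

Because the dummy columns sum to the intercept column, the fully discrete design is rank deficient; `np.linalg.lstsq` handles this by returning the minimum-norm solution, which leaves the fitted values and the variance explained unaffected.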

Note that because we did not bootstrap the model predictions, our results do not account for random variability in model predictions, only for random variability in the outcome variable (the affective scale judgments). What our results do indicate is that if new affective scale judgments were collected, they would almost certainly be more highly correlated with the continuous model predictions than with the predictions of either of the discrete models (p < .001).

Methods S10. P- and Q-Values for Scores of Each Individual Music Sample on Each of the 13 PPCs. To compute p- and q-values for the scores of each individual music sample on each of the 13 PPCs, we used a Monte Carlo simulation similar to that described in Methods S2. We simulated N random 28-category judgments of each of the 100,000 music samples by drawing N samples at random, with replacement, from the actual judgments of the 1,841 music samples. We then projected the resulting proportions onto the 13 PPCs. We calculated the p-value for each score on each PPC as the proportion of times that a greater score for that PPC appeared within the null distribution. Given that different music samples were rated differing numbers of times (N), we conducted the above simulation separately for each N. We controlled the FDR using the Benjamini–Hochberg procedure (2).
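The Monte Carlo p-values and Benjamini–Hochberg q-values described in Methods S10 can be sketched as follows in Python with NumPy. The sketch is not the authors' code. It assumes that judgments are stored as integer category indices pooled across all music samples and that projection onto the components is a plain matrix product without centering. The function names, the four-category toy data, and the two-column component matrix are hypothetical.

```python
import numpy as np

def null_component_scores(pooled_choices, n_categories, components, n_ratings, n_sim, rng):
    """Null distribution of component scores for samples rated n_ratings times.

    Each simulated sample draws n_ratings judgments, with replacement, from the
    pooled judgments; category proportions are then projected onto the components.
    """
    draws = rng.choice(pooled_choices, size=(n_sim, n_ratings), replace=True)
    props = np.stack([(draws == c).mean(axis=1) for c in range(n_categories)], axis=1)
    return props @ components

def monte_carlo_p(observed_scores, null_scores):
    """Proportion of null scores that exceed the observed score, per component."""
    return (null_scores > observed_scores).mean(axis=0)

def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values (q-values)."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)
    scaled = p[order] * m / np.arange(1, m + 1)
    q_sorted = np.minimum.accumulate(scaled[::-1])[::-1]  # enforce monotonicity
    q = np.empty(m)
    q[order] = np.minimum(q_sorted, 1.0)
    return q

# Toy data (assumption): 4 categories chosen uniformly, 2 components.
rng = np.random.default_rng(0)
pooled = rng.integers(0, 4, size=5000)
components = np.eye(4)[:, :2]
null = null_component_scores(pooled, 4, components, n_ratings=10, n_sim=2000, rng=rng)
p = monte_carlo_p(np.array([0.9, 0.25]), null)
q = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
```

As in the text, the null distribution would need to be regenerated for each distinct number of ratings N, which corresponds here to calling `null_component_scores` once per value of `n_ratings`.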


SI Discussion

Q. What do we mean by categories, affective features, affective scales, and dimensions?

By categories of subjective experience, we mean the set of categories psychologists (and people more generally) typically use to refer to specific feelings: categories such as sadness, anger, amusement, awe, desire, and so on.

We use “affective features” as an umbrella term for the various conceptually broader processes that have been proposed to constitute and distinguish emotional experiences in constructivist, appraisal, and componential theories (7–11). Affective features include proposed properties of core affect, such as valence and arousal, motivational processes, such as approach-avoidance and commitment, and cognitive appraisal-based inferences, such as safety and goal relevance. Affective features are measured along scales. Hence, we use the term “scales of affect” or “affective scales” to refer to the psychometric scales used to collect ratings of affective features.

To avoid confusion, we reserve the term “dimensions” for latent dimensions that are derived from judgment data using multidimensional reliability analysis methods (here, PPCA): techniques that extract dimensions on the basis of reliability across raters, rather than, for example, variance. For this reason, we avoid referring to features such as valence and arousal as dimensions. Note that we also avoid using terms such as “discrete category”, which conflate the conceptualization and structure of subjective experience (6).

Q. What is meant by “multidimensional reliability analysis”, and how does PPCA accomplish this?

We use the phrase “multidimensional reliability analysis” to refer to methods that compute the dimensions of similarity -- or reliability -- across two independent, corresponding samples. To understand what is meant by calling PPCA a form of multidimensional reliability analysis, first consider that the method effectively reduces to reliability analysis (test-retest reliability) in the case where all features are orthonormal. For example, if there is one normalized feature, the method is precisely equivalent to test-retest reliability. If there are multiple orthonormal features, then the method essentially orders them in terms of their test-retest reliability. It then determines the cutoff for significance. If there are multiple correlated features, then the method becomes a form of multidimensional reliability analysis in the sense that it extracts dimensions and measures how well those dimensions are preserved across test and retest.

Finally, in the most general case -- that the features are neither normalized nor orthogonal -- PPCA takes into account the degree of covariance explained by each component (rather than just the correlation), which can be seen as optimizing a least squares (quadratic) loss function. It can therefore be interpreted as extracting components that could most accurately predict one dataset from another using only a direct (identity) map. Because these components relate one dataset to another using only an identity map, they reflect the degree to which the datasets are similar, i.e., the reliability of the variation in each component across datasets.
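The single-feature case mentioned in this answer can be verified with a few lines of Python. In this sketch, which uses simulated test and retest measurements as an assumption, one feature is standardized in each dataset; the symmetrized cross-covariance matrix is then 1 × 1, and its only eigenvalue equals the Pearson test-retest correlation.

```python
import numpy as np

# Simulated test and retest measurements of one feature (assumption).
rng = np.random.default_rng(2)
true_score = rng.normal(size=300)
test = true_score + rng.normal(scale=0.7, size=300)
retest = true_score + rng.normal(scale=0.7, size=300)

def standardize(v):
    """Center and scale to unit sample variance."""
    return (v - v.mean()) / v.std(ddof=1)

x, y = standardize(test), standardize(retest)

# With one feature, R = Cov(x, y)/2 + Cov(y, x)/2 is a 1 x 1 matrix.
R = np.array([[np.cov(x, y)[0, 1]]])
ppca_eigenvalue = np.linalg.eigvalsh(R)[0]

# Ordinary test-retest reliability.
pearson_r = np.corrcoef(test, retest)[0, 1]
```

Here `ppca_eigenvalue` and `pearson_r` agree up to floating-point error, since the covariance of two standardized variables is their correlation.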


Q. In what ways have traditional methods (univariate recognition accuracy, PCA / factor analysis, clustering methods, prototypical expressions, small sample sizes) been ill-equipped to interrogate the semantic space of subjective experience?

● Methods based only on univariate recognition accuracy (e.g. percent correct or interrater agreement) fail to rule out various ways in which judgments can be redundant. For example, two categories of subjective experience can be synonyms. In such cases, judgments of each of the two categories would yield significant interrater agreement, but, when analyzed using methods such as PPCA, would only provide evidence for one dimension of recognition. This issue, however, goes beyond synonyms: concepts can be equivalent to groups of other concepts, can differ in frequency of usage without differing in meaning, etc.

● Traditional dimensionality reduction methods such as principal components analysis (PCA) or factor analysis are limited in two very important ways. First, methods of testing the number of significant PCs or factors are based at least in part on correlations or covariance between judgments. However, they do not typically consider the reliability of reports of individual items -- they cannot identify whether an individual category, like fear, is reliably distinguished from every other category. This is a serious limitation in most factor analytic studies of subjective experience which, incorporating only a subset of the wide variety of terms people use to describe their feelings, cannot be presumed to include multiple judgments corresponding to every significant dimension. Second, PCA and factor analytic methods do not explicitly separate signal variance from noise variance; rather, they rely on the assumption that high variance components contain signal whereas low variance components contain noise. This assumption can be useful, but it is not always valid. For example, in fMRI studies, noise components are often high in variance (see (3)). Similarly, here, a category applied frequently but randomly to music samples could have high variance in spite of the fact that it has no signal. Likewise, two judgments that are rated together may be exhibiting noise correlations rather than signal correlations. That is, they may always align for a single rater (e.g., if the rater reports high “approach” and “valence” whenever a voice resembles their own) but there may be no consistency across raters. Where PCA and factor analysis do not explicitly separate signal and noise (except in cases where signal components are always higher variance than noise components), multidimensional reliability analysis methods such as PPCA sort dimensions based on their reliable covariance across independent or repeated measures, a measure of signal variance. (Note that averaging or concatenating datasets and then applying PCA would not separate dimensions that explain variance within one dataset from dimensions that explain covariance across datasets.) See Video S1 for illustration.

● Methods based on clustering can be useful in characterizing the distribution of reports of subjective experience within a semantic space. However, they do not directly inform the dimensionality of the semantic space, nor do they reveal continuous gradients in subjective experience, as most clustering methods impose assumptions of discreteness.

● The traditional reliance on a set of prototypical expressions and morphs between them precludes understanding the full distribution of expressions within a continuous semantic space. Each prototypical expression occupies a single point in the space, and morphs between the expressions occupy only lines within the space. Note that if expressions along these lines differed in their recognition properties from prototypical expressions,


this would not necessarily entail that categories of subjective experience are entirely discrete, given that not all categories of subjective experience share direct gradients.

● The reliance on small sample sizes (i.e., anything fewer than several hundred expressions or a few individual judgments) limits the number of significant dimensions that can be extracted (12). It also precludes understanding smooth gradients in the distribution of feelings within the space.
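The second limitation of PCA listed above, that a high-variance component need not carry signal, can be illustrated with a small simulation. This is a sketch in Python with NumPy on made-up data. The first feature carries a modest signal shared by both datasets, whereas the second is high-variance noise drawn independently in each dataset (akin to a category applied frequently but randomly). PCA on the averaged data is then compared with the eigendecomposition of the symmetrized cross-covariance used by PPCA.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 5000
shared = rng.normal(size=n)   # signal common to both datasets

# Feature 0: shared signal plus small noise. Feature 1: large independent noise.
X = np.column_stack([shared + 0.3 * rng.normal(size=n), 3.0 * rng.normal(size=n)])
Y = np.column_stack([shared + 0.3 * rng.normal(size=n), 3.0 * rng.normal(size=n)])

# PCA on the averaged datasets: ranks directions by variance.
avg = (X + Y) / 2
pca_vals, pca_vecs = np.linalg.eigh(np.cov(avg, rowvar=False))
top_pc = pca_vecs[:, -1]      # eigh sorts ascending, so the last column is the top PC

# PPCA: ranks directions by covariance preserved across datasets.
C = (X - X.mean(axis=0)).T @ (Y - Y.mean(axis=0)) / (n - 1)
ppca_vals, ppca_vecs = np.linalg.eigh((C + C.T) / 2)
top_ppc = ppca_vecs[:, -1]
```

In this simulation the top principal component loads almost entirely on the noisy second feature, while the top preserved component loads almost entirely on the first feature, the only one that is reliable across datasets.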

Q. If we can visualize the feelings associated with music along two dimensions, does this mean that they actually occupy a two-dimensional space?

Every data matrix can be mapped onto two dimensions, regardless of dimensionality. If the data points actually lie on a two-dimensional manifold, then this can in theory be done without losing information about the original positions of the data points. However, if the actual dimensionality of the data exceeds two, then we will inherently lose information about the original positions of the data points if we project them onto two dimensions.

However, in many cases, we are willing to sacrifice some information to map the data onto two dimensions, for instance if the resulting map is both more comprehensive and qualitatively easier to comprehend than other views of the data. The t-SNE method used to produce Fig. 3 is designed to preserve as much of the information about the local neighborhood of each data point as possible. The result makes it easier to grasp the gradients that link different categories.

It is worth noting that the 13 PPCs and the axes of the t-SNE map are somewhat different types of things, in that the 13 PPCs are metric components and the axes of the t-SNE are non-metric. Each PPC is a linear combination of judgments. Hence, the PPCs explicitly represent the phenomena under investigation. By contrast, the axes of the t-SNE map do not explicitly represent anything, nor are they separately interpretable. Rather, the axes together represent a unit-less, non-linear transformation of the PPC space, in which the meaning of variation along one axis in one part of the map is disassociated from the meaning of variation along the same axis in a different part of the map. In much the same way, every dataset can be mapped onto a single dimension in a manner that preserves some of the neighborhood structure of the data. For example, we could derive 20 informative clusters of data points and scatter them at arbitrary points along a single dimension. That this can be done with any dataset does not, of course, mean that all datasets are one-dimensional.

Why is t-SNE so effective at visualizing the feelings associated with the music samples? t-SNE is effective at visualizing subjective experience judgments for the same reason it is effective at visualizing many other types of high-dimensional data, such as handwritten digits (https://lvdmaaten.github.io/tsne/). Such data is “sparse”. It occupies relatively confined areas of a high-dimensional space. Thus, a flexible two-dimensional manifold (bent, curved, and skewed like a blanket to fit the data) can capture a fairly informative snapshot of where data points are located relative to one another, while disregarding relatively less common or more nuanced (but not necessarily less significant) variation.

Q. What are some of the cultural differences between China and the US?


China and the US vary greatly with respect to Hofstede’s proposed cultural dimensions. Considering four of the dimensions in Hofstede’s model (https://www.hofstede-insights.com/product/compare-countries; (14)), (a) the US is considered a highly individualistic society, whereas China is considered highly collectivistic; (b) the US is considered low in power distance, whereas China is considered high in power distance; (c) the US is considered low in long-term orientation, whereas China is considered very high in long-term orientation; and (d) the US is considered high in indulgence, whereas China is considered low in indulgence. Other work suggests that the tendency to think dialectically (i.e., tolerate potentially contradictory beliefs) is stronger in Chinese culture than in the US (15).


Fig. S1. Judgment surveys. From top to bottom, surveys used to collect music samples, affective feature judgments, and category judgments.


Fig. S2. Category judgment interrater agreement levels, affective scale judgment frequencies, and genre judgment frequencies. A. Interrater agreement levels for each music sample, for each category of subjective experience. Dots represent the proportions of times the category to the left was chosen for each music sample. Only music samples for which the category was chosen at least once are shown. 93.3% of the music samples elicited significant interrater agreement in subjective experience (q < .05), with every category being recognized with significant interrater agreement from at least one music sample. B. Judgment frequency for each affective scale across music samples. Shaded plots for each affective scale are kernel histograms of the distribution of average ratings for that affective scale across music samples. Histograms are normalized to an arbitrary height. Dots (jittered vertically) represent individual music samples. C. Judgment frequency for each genre in each culture.


Fig. S3. Simulations demonstrating that signal correlations accurately estimate the population-level correlations across cultures. To determine whether our method of computing signal correlations between cultures accurately estimates the population-level correlation in mean judgments across music samples, we ran realistic Monte Carlo simulations of our entire experiment. (Left) Analyses of simulated category ratings demonstrate that our methods yield accurate estimates of the population-level correlation in judgments. In 30 separate simulations, category ratings of each music sample in each culture were drawn at random from binomial distributions with matrix of parameters P and N. P was set to the actual proportion of times each category was selected for each music sample in each culture. N was set to the number of ratings actually obtained of each music sample in each culture. For each music sample, each rating was assigned to a “rater” at random without replacement (since each rater could rate a music sample at most once), drawing each hypothetical rater with probability given by the percentage of ratings each rater contributed in our actual experiment. Cross-cultural correlations were then calculated for each category (Methods S4-5). Plotted here are the actual population-level correlations from the simulation (x-axis) and mean signal correlation estimates (y-axis). The standard deviations of the signal correlation estimates for the categories across the 30 simulations are represented by the red bars. The green bars represent the mean of the estimated standard errors (Methods S5). We can see not only that the signal correlations accurately estimate the population-level correlation for each category, but also that the estimated standard errors are conservative. (Right) Analyses of simulated affective scale ratings demonstrate that our methods yield accurate estimates of the population-level correlation in judgments. In 30 separate simulations, affective scale ratings of each music sample in each culture were drawn from normal distributions with matrix of parameters µ and σ. µ was set to the actual mean judgment of each music sample in each culture. σ was set to the standard deviation of judgments in each culture. Ratings were rounded to the nearest 1-9 integer. For each music sample, each rating was assigned to a “rater” at random without replacement (since each rater could rate a music sample at most once), drawing each hypothetical rater with probability given by the percentage of ratings each rater contributed in our actual experiment. Cross-cultural correlations were then calculated for each affective scale (Methods S4-5). Plotted here are the actual population-level correlations from the simulation (x-axis) and mean signal correlation estimates (y-axis). The standard deviations of the signal correlation estimates for the categories across the


30 simulations are represented by the red bars. The green bars represent the mean of the estimated standard errors (Methods S5). We can see not only that the signal correlations accurately estimate the population-level correlation for each affective scale, but also that the estimated standard errors are conservative.


Fig. S4. Pearson and Spearman signal correlations between category and affective scale ratings in the US and China, before and after binarizing the affective scale ratings. (Top left) The top left bar graph is identical to Fig. 1A: it displays the Pearson correlations between mean responses by Chinese participants and mean responses by US participants, divided by the explainable variance in responses from each culture. Error bars represent standard error. (Bottom left) The plot below is comparable but displays the Spearman correlations rather than the Pearson correlations. Due to the sparseness of the category ratings (the presence of many values close to zero, which likely leads to instability in the rank transform) we would expect the Spearman correlations to be lower than the Pearson correlations for the category judgments. Although the correlations for affective feature judgments are thus differentially higher here, it is all the more noteworthy that our results from Fig. 1B and Fig. S5 indicate that the affective features from each culture are often better predicted by category judgments from the other culture than by the same affective feature judgments from the other culture. (Top and bottom right) Analogous plots where the affective scale ratings were made binary (thresholding by the mean in each culture) prior to all analysis to ascertain whether the differences in signal correlation between categories and affective scales were due to judgment format (binary vs. Likert). We can see that the differences are not due to judgment format. In fact, the differences in signal correlation between category and affective scale judgments are greater when using binary affective scale ratings.


Fig. S5. Predicting all affective feature judgments from category judgments, and vice versa, across cultures. Grey bars correspond to cross-cultural correlations in the judgment of each feature, yellow error bars correspond to US category/affective feature judgments predicted from Chinese affective feature/category judgments, and purple error bars correspond to Chinese category/affective feature judgments predicted from US affective feature/category judgments. Category judgments consistently predict affective feature judgments from the other culture as robustly as, or more robustly than, affective feature judgments themselves (top left). We verified that this was still true when we made judgments of the affective features, which were ordinal (1-9), more comparable to the multiple-choice category judgments by binarizing them (assigning a one to above-average ratings and a zero to below-average ratings) prior to all analyses (top right). However, the affective feature judgments do not generally predict category judgments from the other culture better than the category judgments themselves (bottom). (All signal correlations have been divided by explainable variance for each matrix [see Methods S6]. Standard errors were estimated by bootstrapping across raters.)


Fig. S6. Mean agreement rates for US and traditional Chinese music samples in Experiment 2, broken down by cultural group of raters. In general, raters from China had higher interrater agreement, perhaps due to greater cultural homogeneity within the sample. More intriguingly, Chinese raters converged slightly more in their ratings of traditional Chinese music samples than US raters did, with a smaller drop in interrater agreement relative to the US-contributed samples. The interaction between music origin (US vs. traditional Chinese samples) and rater cultural group (US vs. China) was significant (p = .0002, F = 13.9, 2-way analysis of variance). Error bars represent standard error.


Fig. S7. The primacy of categories in the feelings associated with Western and traditional Chinese music targeting valence and arousal in Experiment 2. (Top) Correlations in the feelings associated with US-contributed music samples across cultures. The cross-cultural signal correlation (r) for each dimension derived in Experiment 1 (orange bars) and valence/arousal (green bars) captures the degree to which each attribute is preserved across Chinese and US American listeners in the 138 US-contributed music samples gathered on the basis of valence and arousal. Many feelings were better preserved than valence or arousal. Error bars represent standard error. *p ≤ .022, q[FDR] ≤ .05, bootstrap test, raw partial correlation r ≥ .16 controlling for every other dimension (despite its high correlation, the partial correlation is low for “anxious, tense” [r = .14, p = .050, ns], after controlling for “scary, fearful” and other dimensions in this sample). (Bottom) Correlations in the feelings associated with traditional Chinese music samples, contributed by Chinese participants, across cultures. The cross-cultural signal correlation (r) for each dimension derived in Experiment 1 (orange bars) and valence/arousal (green bars) captures the degree to which each attribute is preserved across Chinese and US American listeners in the 189 Chinese traditional music samples gathered on the basis of valence and arousal. Many feelings were better preserved than valence or arousal. Reports of some feelings, e.g. “Erotic”, were less well preserved across cultures in this sample, but this is likely due to scant representation within the stimulus set. Error bars represent standard error. *p ≤ .024, q[FDR] ≤ .05, bootstrap test, raw partial correlation r ≥ .13 after controlling for every other dimension (despite its high correlation, the partial correlation is low for “amusing” [r = .10, p = .084, ns], after controlling for “joyful, cheerful” and other dimensions in this sample).
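The q[FDR] thresholds reported in this caption can be illustrated with a short sketch. It assumes that the false discovery rate correction is the Benjamini-Hochberg procedure (SI reference 2); the function name and the example p-values are hypothetical and are not taken from the study.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted q-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    q = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that q-values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        q[i] = running_min
    return q

# Hypothetical bootstrap p-values for four dimensions.
example_p = [0.01, 0.04, 0.03, 0.20]
example_q = benjamini_hochberg(example_p)
significant = [qv <= 0.05 for qv in example_q]
```

A dimension is then flagged when its q-value is at or below .05, which is why the largest flagged raw p-value (e.g. the p ≤ .022 in the caption) can sit well below .05.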


Fig. S8. Verifying that PPCA accurately estimates the number of shared dimensions. To test whether leave-one-rater-out PPCA would accurately estimate the number of preserved dimensions of subjective experience across cultures, we ran realistic Monte Carlo simulations of our experiment in which the ratings were drawn from distributions varying systematically in their underlying dimensionality. In 100 separate simulations, five each for dimensionalities between 1 and 18, category ratings of each music sample in each culture were drawn at random from binomial distributions parameterized by $\hat{X}$, the 1841 by 28 matrix of probabilities of selecting each category for each music sample, and N, the number of raters who actually rated each music sample in each country. $\hat{X}$ was computed by applying PPCA to the actual proportion of times each category was selected for each music sample in each culture, $X_{USA}$ and $X_{CHN}$, projecting $X_{USA}$ onto the first 1-18 dimensions extracted by PPCA, back-projecting these scores into the space of categories ($\alpha_{sim}^{T} X_{USA} \alpha_{sim}$), and finally normalizing the result by subtracting the minimum from each row and dividing by the sum of each row. This resulted in a fixed number of preserved dimensions (1-18) in each simulation, each repeated five times. We used the same $\hat{X}$ to simulate ratings in both cultures to maximize the similarity in ratings (and the likelihood of overestimating the number of dimensions). N was set to the number of ratings actually obtained of each music sample in each culture. For each music sample, each rating was randomly assigned to a “rater” with probability given by the percentage of ratings each rater actually contributed. Finally, we applied leave-one-rater-out PPCA to determine the p-values for 1-18 preserved dimensions (Methods S7). Plotted here are the actual numbers of preserved dimensions used to generate the data in each simulation (x-axis) versus the estimated number of dimensions (y-axis) using two different stopping rules (left: stop when p > .05, one-tailed Wilcoxon signed-rank test (85); right: Holm-Bonferroni FWER corrected p < .05, one-tailed Wilcoxon signed-rank test (85)). Our analyses use the more conservative method, corresponding to the right plot. We can see that leave-one-rater-out PPCA accurately estimates the number of preserved dimensions across cultures.
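The sampling step of this simulation (row normalization followed by binomial draws) can be sketched as follows. This is a minimal illustration, not the authors' Matlab code. The function names and the toy matrix are hypothetical, and the sketch assumes that "dividing by the sum of each row" refers to the row sum after the minimum has been subtracted. The PPCA projection and the assignment of ratings to raters are left out.

```python
import random

def normalize_rows(matrix):
    """Shift each row so its minimum is 0, then scale it to sum to 1."""
    out = []
    for row in matrix:
        lo = min(row)
        shifted = [v - lo for v in row]
        total = sum(shifted)
        if total == 0:
            raise ValueError("constant row cannot be normalized this way")
        out.append([v / total for v in shifted])
    return out

def simulate_counts(probabilities, n_raters, rng):
    """Draw a Binomial(n, p) count for every sample (row) and category (column)."""
    counts = []
    for row, n in zip(probabilities, n_raters):
        counts.append([sum(rng.random() < p for _ in range(n)) for p in row])
    return counts

# Toy stand-in for the back-projected scores: 2 samples by 3 categories.
back_projected = [[1.0, 2.0, 4.0], [0.5, 0.5, 1.5]]
x_hat = normalize_rows(back_projected)

# The same probabilities are used for both cultures, as in the caption.
rng = random.Random(0)
n_per_sample = [12, 9]  # hypothetical numbers of raters per sample
counts_usa = simulate_counts(x_hat, n_per_sample, rng)
counts_chn = simulate_counts(x_hat, n_per_sample, rng)
```

Each category is drawn independently here, so the counts in a row do not have to add up to the number of raters.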


Table S1: Category information

| Category (English) | Chinese translation | References |
| --- | --- | --- |
| Amusing | 有趣的,好笑的 | (16) |
| Angry | 气愤的 | (16–19) |
| Annoying | 烦人的,恼人的 | |
| Anxious, tense | 焦虑的,紧张的 | (16, 18) |
| Awe-inspiring, amazing | 令人敬畏的,令人惊叹的 | (16, 18, 20, 21) |
| Beautiful | 美丽的 | (16, 20, 21) |
| Bittersweet | 苦乐参半的 | (18) |
| Calm, relaxing, serene | 平静的,放松的,安逸的 | (18, 21) |
| Compassionate, sympathetic | 富于同情心的,善于感受的 | (18) |
| Dreamy | 梦幻的 | (18, 21) |
| Eerie, mysterious | 怪异的,神秘的 | |
| Energizing, pump-up | 有活力的,有能量的 | (18, 22) |
| Entrancing | 使人入迷的 | (16) |
| Erotic, desirous | 色情的,渴望的 | (16, 23) |
| Euphoric, ecstatic | 狂喜的,极度喜悦的 | (18) |
| Exciting | 兴奋的 | (16, 23) |
| Goose bumps | 起鸡皮疙瘩的 | (23–25) |
| Indignant, defiant | 愤愤不平的,反抗的 | (22) |
| Joyful, cheerful | 欢喜的,欢快的 | (16–19) |
| Nauseating, revolting | 恶心的,让人反感的 | (16) |
| Painful | 痛苦的 | (16) |
| Proud, strong | 骄傲的,强大的 | (18, 22) |
| Romantic, loving | 浪漫的 | (16) |
| Sad, depressing | 伤感的,抑郁的 | (16–19) |
| Scary, fearful | 恐怖的,可怕的 | (16, 17, 19) |
| Tender, longing | 温柔的,渴望的 | (17–19) |
| Transcendent, mystical | 超然的,神化的 | (18) |
| Triumphant, heroic | 胜利的,英勇的 | (18, 22) |


Table S2: Affective scale information

| Dimension | Q: To what extent does this make you feel (1…9) | Chinese translation | References |
| --- | --- | --- | --- |
| arousal | …Stimulated? (more subdued…more stimulated) | 振奋的?非常压制…非常振奋 | (7, 8, 19, 26–28) |
| attention | …Focused? (more unfocused…more focused) | 专心的?非常不专心…非常专心 | (8) |
| certainty | …Certain? (very uncertain…very certain) | 确定的?非常不确定…非常确定 | (8, 29) |
| commitment | …Commitment to a person? (lack of commitment to a person…strong commitment to a person) | 对一个人的承诺? 缺乏对一个人的承诺…对一个人坚定的承诺 | (30) |
| dominance | …Dominant? (more submissive…more dominant) | 强势的?非常顺从…非常强势 | (10, 28) |
| enjoyment | (Separate question) How much do you enjoy this? (not at all…very much) | 您喜欢这个音乐片段吗?完全不喜欢 一般 非常喜欢 | |
| familiarity | (Separate question) How familiar is this music? (I’ve never heard anything like this…I’ve heard this before) | 您熟悉该音乐片段吗?完全没听过类似的音乐 听过一些类似的音乐 听过该音乐片段 | (8, 10) |
| identity | …Identity with a group? (lack of group identity…strong group identity) | 认同一个群体?缺乏群体认同…强烈的群体认同 | (31) |
| obstruction | …Like you're obstructed by something? (very unobstructed…very obstructed) | 阻塞的?非常通畅…非常阻塞 | (10) |
| safety | …Safe? (very unsafe…very safe) | 安全的?非常不安全…非常安全 | (11) |
| valence | …Pleasant? (very unpleasant…very pleasant) | 愉快的?非常不愉快…非常愉快 | (7–9, 26, 28, 30, 32) |


Video S1. Illustration of how factor analytic methods based on correlations or covariance between different judgments fail to capture the reliability of reports of individual items. The scatterplots above represent hypothetical ratings gathered from two sets of participants. Each dot represents a hypothetical stimulus. Its position represents its average rating in terms of two hypothetical features. Each dot alternates between its average rating by one set of hypothetical raters (blue) versus a second, independent set of hypothetical raters (orange). In the scatterplots on the left, we can see that there is no consistency in ratings across the two sets of participants. In the middle, we can see that the consistency in ratings can be captured by one dimension, with ratings orthogonal to this dimension being inconsistent across the two sets of raters. On the right, we can see that both dimensions are required to explain the consistency in ratings across participants, given that the dots are stable in position along both axes. Methods such as PCA (first row), which extract dimensions of variance in the distribution of ratings (concatenated or averaged across datasets), will extract the same dimensions regardless of the reliability of the ratings of each feature. Thus, PCA cannot identify when an individual category, like fear, is reliably distinguished from every other rated category. By contrast, methods such as PPCA (second row), applied across the two datasets, extract dimensions that account for the reliability of ratings of each individual feature.
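The contrast drawn in this caption can be reproduced with a small two-feature toy example, corresponding to the middle panel. This is a sketch under stated assumptions, not the authors' PPCA implementation: it stands in for PPCA with the leading eigenvector of the symmetrized cross-covariance between the two rater sets, and all names and noise levels are hypothetical.

```python
import math
import random

def center(rows):
    """Subtract each column's mean from a list of (feature1, feature2) pairs."""
    n = len(rows)
    m1 = sum(r[0] for r in rows) / n
    m2 = sum(r[1] for r in rows) / n
    return [(r[0] - m1, r[1] - m2) for r in rows]

def sym_cross_cov(x, y):
    """Entries (a, b, c) of the symmetric 2x2 matrix (X'Y + Y'X) / (2n)."""
    n = len(x)
    a = sum(p[0] * q[0] for p, q in zip(x, y)) / n
    c = sum(p[1] * q[1] for p, q in zip(x, y)) / n
    b = sum(p[0] * q[1] + p[1] * q[0] for p, q in zip(x, y)) / (2 * n)
    return a, b, c

def leading_axis(a, b, c):
    """Unit eigenvector for the largest eigenvalue of [[a, b], [b, c]]."""
    theta = 0.5 * math.atan2(2 * b, a - c)
    return math.cos(theta), math.sin(theta)

def demo(seed=0, n=2000):
    rng = random.Random(seed)
    signal = [rng.gauss(0, 1) for _ in range(n)]
    # Feature 1 is reliable: both rater sets see the same signal plus small noise.
    # Feature 2 is unreliable: large noise drawn independently for each rater set.
    set_a = center([(s + rng.gauss(0, 0.1), rng.gauss(0, 3)) for s in signal])
    set_b = center([(s + rng.gauss(0, 0.1), rng.gauss(0, 3)) for s in signal])
    # PCA on the ratings averaged across rater sets: ordinary covariance.
    pooled = [((p[0] + q[0]) / 2, (p[1] + q[1]) / 2) for p, q in zip(set_a, set_b)]
    pca_axis = leading_axis(*sym_cross_cov(pooled, pooled))
    # Assumed stand-in for PPCA: covariance shared between the two rater sets.
    shared_axis = leading_axis(*sym_cross_cov(set_a, set_b))
    return pca_axis, shared_axis

pca_axis, shared_axis = demo()
```

In this toy setting, the first PCA axis follows the high-variance but unreliable feature, whereas the cross-set axis follows the feature whose ratings agree between the two sets of raters.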


SI References

1. Bribitzer-Stull M (2015) Understanding the Leitmotif: From Wagner to Hollywood Film Music (Cambridge University Press, Cambridge, United Kingdom) doi:10.1017/CBO9781316161678.
2. Benjamini Y, Hochberg Y (1995) Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Stat Soc Ser B 57(1):289–300.
3. Benjamini Y, Yu B (2013) The shuffle estimator for explainable variance in FMRI experiments. Ann Appl Stat 7(4):2007–2033.
4. Cowen AS, Laukka P, Elfenbein HA, Liu R, Keltner D (2019) The primacy of categories in the recognition of 12 emotions in speech prosody across two cultures. Nat Hum Behav. doi:10.1038/s41562-019-0533-6.
5. Wilcoxon F (1945) Individual Comparisons by Ranking Methods. Biometrics Bull 1(6):80.
6. Cowen AS, Keltner D (2018) Clarifying the Conceptualization, Dimensionality, and Structure of Emotion: Response to Barrett and Colleagues. Trends Cogn Sci 22(4):274–276.
7. Russell JA (2003) Core affect and the psychological construction of emotion. Psychol Rev 110(1):145.
8. Smith CA, Ellsworth PC (1985) Patterns of cognitive appraisal in emotion. J Pers Soc Psychol 48(4):813.
9. Barrett LF (2006) Valence is a basic building block of emotional life. J Res Pers 40:35–55.
10. Scherer KR (2009) The dynamic architecture of emotion: Evidence for the component process model. Cogn Emot 23(7):1307–1351.
11. Lazarus RS (1991) Progress on a cognitive-motivational-relational theory of emotion. Am Psychol 46(8):819.
12. MacCallum RC, Widaman KF, Zhang S, Hong S (1999) Sample size in factor analysis. Psychol Methods 4(1):84.
13. Jack RE, Sun W, Delis I, Garrod OGB, Schyns PG (2016) Four not six: Revealing culturally common facial expressions of emotion. J Exp Psychol Gen 145(6):708–730.
14. Hofstede G (2011) Dimensionalizing Cultures: The Hofstede Model in Context. Online Readings Psychol Cult 2(1). doi:10.9707/2307-0919.1014.
15. Hamamura T, Heine SJ, Paulhus DL (2008) Cultural differences in response styles: The role of dialectical thinking. Pers Individ Dif 44(4):932–942.
16. Cowen AS, Keltner D (2017) Self-report captures 27 distinct categories of emotion bridged by continuous gradients. Proc Natl Acad Sci:201702247.
17. Juslin PN, Laukka P (2004) Expression, Perception, and Induction of Musical Emotions: A Review and a Questionnaire Study of Everyday Listening. J New Music Res. doi:10.1080/0929821042000317813.
18. Zentner M, Grandjean D, Scherer KR (2008) Emotions Evoked by the Sound of Music: Characterization, Classification, and Measurement. Emotion 8(4):494–521.
19. Eerola T, Vuoskoski JK (2013) A review of music and emotion studies: approaches, emotion models, and stimuli. Music Percept 30(3):307–340.
20. Silvia PJ, Fayn K, Nusbaum EC, Beaty RE (2015) Openness to experience and awe in response to nature and music: Personality and profound aesthetic experiences. Psychol Aesthetics, Creat Arts 9(4):376–384.
21. Keltner D, Haidt J (2003) Approaching awe, a moral, spiritual and aesthetic emotion. Cogn Emot 17(2):297–314.
22. Tsitsos W (1999) Rules of Rebellion: Slamdancing, Moshing, and the American Alternative Scene. Pop Music 18(3):397–414.
23. Sloboda JA (1991) Musical structure and emotional response: Some empirical findings. Psychol Music Music Educ 19:110–120.
24. Goldstein A (1980) Thrills in response to music and other stimuli. Physiol Psychol 8(1):126–129.
25. Hodges DA (2008) Bodily responses to music. The Oxford Handbook of Music Psychology doi:10.1093/oxfordhb/9780199298457.013.0011.
26. Posner J, Russell J, Peterson B (2005) The circumplex model of affect: An integrative approach to affective neuroscience, …. Dev Psychopathol 17(3):715–734.
27. Vuoskoski JK, Eerola T (2011) Measuring Music-Induced Emotion: A Comparison of Emotion Models, Personality Biases, and Intensity of Experiences. Music Sci 15(2):159–173.
28. Mehrabian A, Russell J (1974) An approach to environmental psychology (T. Press, Cambridge).
29. Roseman Jr. I (1991) Appraisal Determinants of Discrete Emotions. Cogn Emot 5(3):161–200.
30. Frijda NH, Kuipers P, ter Schure E (1989) Relations among emotion, appraisal, and emotional action readiness. J Pers Soc Psychol. doi:10.1037/0022-3514.57.2.212.
31. Smith ER, Mackie DM (2008) Intergroup emotions. Handb Emot. doi:10.1037/14342-010.
32. Russell J (1980) A circumplex of affect. J Pers Soc Psychol 36:1152–1168.