australian numerals - supplementary...

Supplementary Materials

Quantifying Uncertainty in the Phylogenetics of Australian Numeral Systems

Zhou and Bowern

Further Information on Data

Data on numeral systems

Data for this paper have the same sources as those for Bowern and Zentz [1], and the reader is referred to that paper for details of the primary sources for numeral data. Each language was coded for the extent of the system (that is, the maximum attested numeral) and the etymological structure of the numerals 3 and 4. Bowern and Zentz included data from non-Pama-Nyungan languages; these languages were excluded from our sample because of the lack of resolution of phylogenetic relationships beyond the level of Proto-Pama-Nyungan. We have numeral extent data for 127 Pama-Nyungan languages and etymological structure data for 110 languages. Figure S1 summarizes the data. Each language was assigned a number representing its numeral extent; for example, languages coded with a ‘3’ have a numeral extent of three, and so on. Languages whose extents were eight or greater were all represented by state ‘9’. Such languages include extents of 10, 19, 20, and 100. For clarity, we refer to this state as ‘8+’. Numerals in this survey are all sequential; for example, there are no languages with a term for 4 that lack a term for 3, 2, or 1.
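The pooling of high extents into a single state can be illustrated with a short Python sketch. The function name `extent_state` is hypothetical, and the rejection of extents below 2 is an assumption based on the range of states described above; this is not the authors' coding script.

```python
def extent_state(max_numeral: int) -> str:
    """Map a language's highest attested numeral to an extent state label."""
    if max_numeral < 2:
        # Assumption: the coding scheme described in the text starts at extent 2.
        raise ValueError("extents below 2 are outside the coding scheme")
    # Extents of eight or greater are pooled into one state, referred to as '8+'.
    return "8+" if max_numeral >= 8 else str(max_numeral)


# Extents mentioned in the text, mapped to their state labels.
examples = {extent: extent_state(extent) for extent in (2, 3, 7, 10, 19, 20, 100)}
```

Under this sketch, a language with extent 3 keeps the label ‘3’, while languages with extents of 10, 19, 20, and 100 all share the label ‘8+’.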

Data coding for opacity (that is, the etymological structure of numeral systems) also come from Bowern and Zentz [1], supplemented by coding for additional languages by Bowern to ensure adequate coverage across all subgroups of Pama-Nyungan. A numeral was coded as compositional if it was composed of other numerals in the synchronic system. This accounted for all but a few forms. Most of the exceptional forms were cases where one of the numerals in the compositional system was truncated (e.g. the word for ‘2’ is kutyarra but the compositional term for ‘3’ or ‘4’ contained kutya rather than kutyarra). These were also treated as compositional, as were the few forms where only one component of the numeral could be identified. Note that none of the languages in our sample has subtractive bases (e.g. where 7 is etymologically 10-3). All compositional forms for ‘3’ and ‘4’ are additive.

Figure S1. Frequency of the various states of numeral extent and opacity data, for a total of 127 languages. Extent data is coded with states from 2 numerals to 8+ numerals (presence of two numerals = state 1, presence of 3 numerals = state 2, etc; presence of 8 or more numerals is coded as ‘9’). For the opacity data, 0,0 = compositional 3/ compositional 4; 0,1 = compositional 3/opaque 4; 1,0 = opaque 3/compositional 4; 1,1 = opaque 3/opaque 4. Languages where the maximum numeral was ‘3’ were coded as having missing data for 4 (states 5 and 6 in opacity analyses).


The languages in the sample, along with the trait values, are given in the trees in Figures S2 and S3 below.

Phylogenetic Tree

The phylogenetic consensus tree on which results are presented is based on that published in Bowern and Atkinson [2]. Some additional languages were coded for basic vocabulary cognacy using the same judgment criteria as in Bowern and Atkinson. The samples were combined with those from the 2012 paper, and a tree was compiled using the same model and parameters as those that resulted in the highest posterior likelihood scores (that is, the Stochastic Dollo model).

To quantify the uncertainty inherent in Bayesian phylogenetics, we ran our analyses using a sample of 700 trees, subsampled from the log file used to generate the consensus tree presented in Figure 1 of the main text. It is important to note that the trees were generated using basic vocabulary data, which are unrelated to the numeral extent data. This avoids biasing the reconstructions. To match the data we had for each of our analyses, the trees were subsequently pruned to the 127 languages represented, using R (ape package) [3].

Internal subgroup nodes are named according to the literature [4,5] on previously identified low-level clades in Pama-Nyungan. Most of these nodes were stable throughout the tree sample, as indicated by very high posterior probabilities on those nodes. Some of the higher-level nodes were less stable. Central Pama-Nyungan, for example, had a posterior of .54, compared to .87 for Western and 1 for the ancestor of Lower Murray and Kulin (termed Macro-Victorian in Bowern and Atkinson). See Bowern and Atkinson [2] for further discussion.


Figure S2. Trait data showing extent of numeral systems for the Pama-Nyungan languages under study.


Figure S3. Opacity Trait data; languages with numeral extent below 3 are omitted. Data refer to the opacity of 3 and 4. A hyphen - indicates that 4 is not present in the language (that is, the language has numeral extent 3); 1 indicates that the form is opaque, 0 that it is compositional.
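The paired notation used for the opacity data can be made concrete with a small Python sketch. The function name `opacity_label` and the use of `None` for an unattested ‘4’ are illustrative assumptions; only the 0/1/hyphen notation itself comes from the captions above.

```python
from typing import Optional


def opacity_label(opaque3: bool, opaque4: Optional[bool]) -> str:
    """Return the paired opacity label for the numerals '3' and '4'.

    1 marks an opaque form and 0 a compositional one. Passing None for
    `opaque4` (an assumption of this sketch) means the language has no
    term for '4', which is written with a hyphen.
    """
    first = "1" if opaque3 else "0"
    if opaque4 is None:
        second = "-"
    else:
        second = "1" if opaque4 else "0"
    return f"{first},{second}"


# A language with compositional '3' and opaque '4', and one with opaque '3' and no '4'.
labels = [opacity_label(False, True), opacity_label(True, None)]
```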

Further Information on Methods

RJMCMC procedure

RJMCMC was run using the BayesTraits (BT; [6,7]) package for analyzing trait evolution. BT has multiple modules; the relevant one here is Multistate, which can be run with ML or MCMC analysis. Multistate is used for analyzing traits that take multiple discrete states, as is the case for numeral extent data (which take states 2 through 8+).


BT allows a variety of settings to be selected for the MCMC simulations. For both numeral extent and opacity analyses, we set the number of MCMC iterations to 101,000,000, discarding the first 1,000,000 iterations as burn-in to allow the Markov chain to stabilize from a typically non-ideal starting point, and sampling every 1,000 iterations to minimize autocorrelation. By default, BT auto-adjusts the MCMC runs to give an acceptance rate of around 30%; given that 20-40% is a typical suggested acceptance rate, we kept this default parameter unmodified.
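As a quick check on these settings, the number of retained samples follows from simple arithmetic. The variable names in this Python sketch are illustrative and are not BayesTraits option names.

```python
ITERATIONS = 101_000_000  # total MCMC iterations
BURN_IN = 1_000_000       # iterations discarded at the start of the chain
SAMPLE_EVERY = 1_000      # thinning interval

# Samples kept after burn-in and thinning.
retained_samples = (ITERATIONS - BURN_IN) // SAMPLE_EVERY
print(retained_samples)  # 100000
```

This agrees with the 100,000 reconstructions used for the entropy plots.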

Entropy-sorting and -weighting allow easier simultaneous visualization of the breadth of reconstructions given by RJMCMC. To produce the entropy plots, we sorted the 100,000 reconstructions by their entropies and constructed an entropy histogram. Estimates of the entropy density were obtained by spline interpolation of the histogram. In principle we could have used kernel density estimation to smooth the histogram, but we found that spline interpolation was sufficient for our largely unimodal densities. From the sorted list of reconstructions, each run of 100 adjacent reconstructions was averaged (for a total of 1,000 displayed reconstructions) to give the plots a smoother appearance and to mitigate display artifacts resulting from limited screen resolution. The reconstructions were weighted by their entropy density value. An alternative approach would be to plot the multinomial distribution of standard deviation for each rate across the chain. We use the Shannon-Wiener approach because, in the case of the multinomial distribution, we would obtain multiple numbers characterizing one reconstruction, making it unclear how the reconstructions should be sorted. Entropy gives a single number and is an explicit measure of uncertainty.
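The sorting and averaging steps can be sketched in Python as follows. This is a minimal illustration on synthetic data, not the authors' plotting code: the reconstructions are random probability vectors, the sample is 1,000 rather than 100,000, the use of natural logarithms is an assumption, and the entropy-density weighting (spline interpolation of the histogram) is omitted.

```python
import math
import random


def shannon_entropy(probs):
    """Shannon entropy (natural log) of one reconstruction's state probabilities."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def block_average(sorted_recons, block_size):
    """Average consecutive blocks of reconstructions, state by state."""
    averaged = []
    for start in range(0, len(sorted_recons), block_size):
        block = sorted_recons[start:start + block_size]
        n_states = len(block[0])
        averaged.append(
            [sum(recon[s] for recon in block) / len(block) for s in range(n_states)]
        )
    return averaged


def random_reconstruction(rng, n_states):
    """Synthetic stand-in for one sampled reconstruction at a node."""
    weights = [rng.random() + 1e-12 for _ in range(n_states)]
    total = sum(weights)
    return [w / total for w in weights]


rng = random.Random(0)
reconstructions = [random_reconstruction(rng, 7) for _ in range(1000)]

# Sort from most certain (low entropy) to least certain (high entropy),
# then average each run of 100 adjacent reconstructions for display.
by_entropy = sorted(reconstructions, key=shannon_entropy)
display_rows = block_average(by_entropy, 100)
```

With the sample size and block size described in the text, the same procedure would yield 1,000 displayed rows, each of which remains a probability distribution over states.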

BT allows priors on trait frequencies. We ran models with estimated frequencies. Using empirically determined frequencies might have skewed the results for early tree nodes if contemporary languages show biased trait frequencies.

Numeral extent

A model that describes the evolution of numeral extent has at most 42 parameters, one for each possible forward/reverse transition. To parameterize the prior information, we used a uniform distribution between 0 and a hyperparameter that was allowed to vary between 0 and one of several ceiling values, which we set manually. We found that the smaller the ceiling, the lower the entropy for the reconstructions. Moreover, while the degree of certainty varied, the state with the highest probability remained consistent regardless of the ceiling (provided the ceiling was not too high). Figure 1 in the main paper reports data from the lowest ceiling; Figures S4 to S6 give higher ceilings (from left to right: 2x, 4x and 15x the original ceiling). Note that as the rate ceiling increases, there is more ambiguity in general, particularly with respect to whether the root trait was ‘3’ or ‘4’. At the most relaxed rate ceiling, most reconstructions were uninformative, as they were nearly uniform. However, certain nodes are still relatively confidently reconstructed, particularly Mayi, Arandic, and Yolngu. These different rate ceilings on our prior distributions represent different hypotheses about the rate of evolution of these traits: the faster the evolution rate, the more obscured the original states are.
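The parameter count and the shape of the prior can be illustrated with a short Python sketch. Two things here are assumptions rather than details given in the text: the hyperparameter is drawn uniformly between 0 and the ceiling, and the ceiling value of 4.0 is only a placeholder.

```python
import random
from itertools import permutations

STATES = ["2", "3", "4", "5", "6", "7", "8+"]

# One rate per ordered pair of distinct states: 7 * 6 = 42 transitions.
transitions = list(permutations(STATES, 2))


def sample_rates(rng, ceiling, n_rates):
    """Draw one set of rates from the two-level uniform prior.

    The upper bound is itself drawn from Uniform(0, ceiling) (assumed form
    of the hyperprior); each rate is then drawn from Uniform(0, upper).
    """
    upper = rng.uniform(0.0, ceiling)
    return upper, [rng.uniform(0.0, upper) for _ in range(n_rates)]


rng = random.Random(1)
upper, rates = sample_rates(rng, ceiling=4.0, n_rates=len(transitions))
```

Raising the ceiling in this sketch permits larger upper bounds and hence faster rates, which corresponds to the higher-ceiling analyses described above.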


Figure S4. Extent data for selected internal tree nodes, limit 4




Because every reconstruction from the Markov chain is displayed, the relative area of each color represents the mean degree of support. This is valid because we have modulated the entropy-sorted reconstructions by the entropy density, which ensures that the least frequent reconstructions are not exaggerated.

Investigating rate differences

In order to investigate the properties of different transition rates, we want to discover which rates are particularly fast or slow, both compared to other rates in the dataset and specifically compared to the converse rate (that is, qxy compared to qyx). There are several different ways in which rates can be compared. We opted for three different measures: 1) the proportion of runs where the rate is zero; 2) the proportion of runs where a given rate was fastest; and 3) the mean rank of each rate (that is, which rate was on average fastest across all runs). The first two measures are reported in the main text. We also examined the standard deviation of each rate across runs. The fastest six rates were the most stable across rank measures, and also had the lowest standard deviation. Rates involving state 7 were the most frequently deleted (perhaps due at least in part to the overall very low incidence of languages with extent 7 in our dataset). Other rates varied more extensively, and also shifted rank (sometimes quite extensively) depending on the measure used. We therefore conclude that the relative speed of the first six rates is likely to be significant, but any conclusions based on other rates are possibly artifacts of how rates were measured. Figure 2 of the main text gives measures 1 and 2; Figure S7 below gives the mean rank and Figure S8 the standard deviation of each rate.
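The three measures can be sketched in Python as follows. The rate names and values are invented for illustration, the function name `rate_summaries` is hypothetical, and ties are broken by input order, which is a simplification of this sketch rather than a documented choice.

```python
def rate_summaries(samples):
    """Summarize transition rates across retained MCMC samples.

    `samples` is a list of dicts mapping a rate name to its value in one
    sample. Returns, per rate, the proportion of samples where it is zero,
    the proportion where it is the fastest, and its mean rank (1 = fastest).
    """
    names = list(samples[0])
    n = len(samples)
    zero = {name: 0 for name in names}
    fastest = {name: 0 for name in names}
    rank_sum = {name: 0 for name in names}
    for sample in samples:
        for name in names:
            if sample[name] == 0.0:
                zero[name] += 1
        ordered = sorted(names, key=lambda name: sample[name], reverse=True)
        fastest[ordered[0]] += 1
        for rank, name in enumerate(ordered, start=1):
            rank_sum[name] += rank
    return {
        name: {
            "prop_zero": zero[name] / n,
            "prop_fastest": fastest[name] / n,
            "mean_rank": rank_sum[name] / n,
        }
        for name in names
    }


# Toy data: three rates over four samples (values are made up).
toy_samples = [
    {"q23": 0.9, "q32": 0.1, "q34": 0.0},
    {"q23": 0.5, "q32": 0.7, "q34": 0.0},
    {"q23": 0.8, "q32": 0.0, "q34": 0.2},
    {"q23": 0.6, "q32": 0.3, "q34": 0.1},
]
summary = rate_summaries(toy_samples)
```

In the toy data, q23 is fastest in three of the four samples, while q34 is zero in half of them, so the two rates sit at opposite ends of all three measures.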


Figure S7. Rank comparisons for transition rates, extent. The top panel shows the proportion of runs where each rate was the fastest, while the bottom panel gives the mean rank across runs.

Figure S8. Standard deviations of transition rates.

To investigate the statistical robustness of breaks in the data, we clustered rates and used the R package pvclust [8] to determine multiscale bootstrap values (10,000 bootstrap runs). However, this procedure was unhelpful in determining statistically significant rate classes, as almost every cluster was returned as significant at the 0.05 level. We therefore adopt the approach described here, concentrating on the fastest and slowest rates that are both ranked consistently and have the lowest variance across the MCMC chain. Doing so allows us to focus on the main trends in the data.


Numeral opacity

With numeral opacity of ‘3’ and ‘4’, we are interested in the correlatedness or dependency of their evolution. That is, we study whether the opacity status of one numeral influenced the status of the other. The application of RJMCMC to the numeral system extent data is natural and straightforward, given that there are potentially on the order of 10^38 models to explore. However, our analysis of correlated evolution of numeral opacity requires more elaboration.

To test for correlated evolution, we examine two classes of models, one that describes independent evolution and one that describes dependent evolution, which we could then compare via the Bayes factor (BF). With two binary traits, there are four possible paired states a language may assume; hence, the general coevolution model involves eight possible transitions, where simultaneous transitions in both traits are assumed to be negligibly slow (Figure 3 of main text). For independent evolution, there are four transitions.

Fitting the data to both models and comparing their posterior likelihoods via the BF would tell us whether coevolution is favored. RJMCMC can be incorporated to strengthen this analysis by allowing the Markov chain to search the universe of dependent and independent models [6]; we could then either use the BF to compare all dependent models against all independent models, or compare the ratio of visits between dependent and independent models against that expected by chance. This approach may be preferable because the eight-parameter dependent model may fail to explain the data even where another dependent model succeeds.

Here, we adopt the RJMCMC approach as detailed by Pagel and Meade, but we modify the coevolution model of opacity of ‘3’ and ‘4’ by introducing two new paired states. These are required because the opacity of ‘4’ is ternary: opaque, compositional, or not attested. We kept the opacity of ‘3’ binary because no languages have a term for ‘4’ and not for ‘3’, and languages that only have a numeral extent of ‘2’ are not of interest in this analysis. Note that while ‘2’ was a possible limit for reconstruction in the extent model, it was never returned with more than minimal frequency. The underlying method for reconstruction is the same as for the extent data, except that the number of possible models has significantly increased. We are thus interested in whether any of the three states influences one or both of the other states.
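The effect of the extra paired states on the number of transition rates can be explored with a short Python sketch. It enumerates transitions in which exactly one of the two traits changes, following the restriction described for the binary coevolution model. The count for the extended state space is an inference from applying the same restriction; the source does not state it explicitly.

```python
from itertools import product


def single_change_transitions(levels3, levels4):
    """List ordered transitions between paired states that change one trait only."""
    states = list(product(levels3, levels4))
    return [
        (source, target)
        for source in states
        for target in states
        if sum(a != b for a, b in zip(source, target)) == 1
    ]


# Two binary traits: four paired states and eight single-trait transitions.
binary_model = single_change_transitions("01", "01")

# Ternary opacity of '4' (0, 1, or '-' for not attested): six paired states.
extended_model = single_change_transitions("01", "01-")
```

Here `len(binary_model)` is 8, matching the general coevolution model described above, and `len(extended_model)` is 18 under the stated assumption.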

Log likelihood averages across three runs are given in Table S1 below. The BFs between runs range between 7.8 and 10.4, indicating strong support for the dependent model over the independent one.

Table S1. Log likelihood averages for the dependent and independent models.

Run    Dependent    Independent
A      -172.683     -177.571
B      -173.336     -177.259
C      -173.154     -177.905
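The quoted range can be checked with a few lines of Python. The sketch assumes the convention of twice the difference in log likelihood and compares every dependent run with every independent run; both are assumptions, adopted because together they reproduce the range of 7.8 to 10.4 given in the text.

```python
# Log likelihood averages from Table S1.
dependent = {"A": -172.683, "B": -173.336, "C": -173.154}
independent = {"A": -177.571, "B": -177.259, "C": -177.905}


def log_bayes_factor(loglik_dependent, loglik_independent):
    """Twice the log likelihood difference (assumed convention)."""
    return 2.0 * (loglik_dependent - loglik_independent)


# Compare every dependent run against every independent run.
pairwise = [
    log_bayes_factor(dep, ind)
    for dep in dependent.values()
    for ind in independent.values()
]
print(round(min(pairwise), 1), round(max(pairwise), 1))  # 7.8 10.4
```

The smallest value pairs dependent run B with independent run B, and the largest pairs dependent run A with independent run C.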

Figures S9-S12 below are entropy plots of the opacity data, showing two rate limits: 8 (Figures S9 and S10) and 30 (Figures S11 and S12).


Figure S9. Entropy plots for root and selected internal nodes for evolution of opacity in numeral systems, upper rate limit of 8, Dependent Model.

Figure S10. Entropy plots for root and selected internal nodes for evolution of opacity in numeral systems, upper rate limit of 8, Independent Model (that is, uncorrelated evolution between the opacity or compositionality of 3 and 4).


Figure S11. Entropy plots for root and selected internal nodes for evolution of opacity in numeral systems, upper rate limit of 30, Dependent Model.

Figure S12. Entropy plots for root and selected internal nodes for evolution of opacity in numeral systems, upper rate limit of 30, Independent Model (that is, uncorrelated evolution between the opacity or compositionality of 3 and 4).

To infer the mechanism of correlated evolution, we look at the relative sizes of the transition rates. We cannot look at the list of models most visited by RJMCMC, as was done by Pagel and Meade [6] and other work in this line, such as that by Dunn et al. [9], for several reasons. By increasing the number of opacity states, the number of models has substantially increased (from 21,146 to 5,832,742,205,056), thus requiring very large RJMCMC samples to obtain sufficient resolution. Furthermore, there may be a handful of good models that differ from each other by one or two transition rates, and would thus be expected to give similar overall results. However, while they may hold a collective plurality, on an individual basis they do not appear very commonly, so even the ‘best’ models are visited only a few times. Instead, we look at the model-averaged transition rate results, which give us more meaningful information because common evolutionary models collectively influence the transition rate results the most. Just as for the numeral extent investigation, we rank transition rates by speed (Figure S13) and present the mean rank across all runs (Figure S14). Figure 3 of the main text presents the fastest rates from Figure S13 in an alternative format.
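The two model counts quoted above can be reproduced with Bell numbers. Both figures equal B(n+1) - 1, where n is the number of transition rates: n = 8 for the binary model, and n = 18 for the extended model (the value 18 is an inference, not stated in the source). Whether this is the formula the authors used is an assumption; the Python sketch below only shows that it matches the quoted figures.

```python
def bell_numbers(n_max):
    """Return [B_0, ..., B_n_max], computed with the Bell triangle."""
    bells = [1]
    row = [1]
    for _ in range(n_max):
        # Each new row starts with the last entry of the previous row.
        new_row = [row[-1]]
        for value in row:
            new_row.append(new_row[-1] + value)
        row = new_row
        bells.append(row[0])
    return bells


def model_count(n_rates):
    """Ways to group n_rates rates into classes, with an optional zero class.

    Partition the rates plus one marker for the zero class (B_{n+1}),
    then drop the single case in which every rate is zero.
    """
    return bell_numbers(n_rates + 1)[-1] - 1


binary_models = model_count(8)     # eight rates, two binary traits
extended_models = model_count(18)  # eighteen rates (inferred count)
extent_models = model_count(42)    # forty-two rates, numeral extent
```

Here `binary_models` is 21146 and `extended_models` is 5832742205056, the two figures given in the text. `extent_models` can be compared with the order of magnitude quoted for the extent analysis.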

Figure S13. Rank comparisons for transition rates for the dependent model of the opacity of numerals 3 and 4. The red line gives the proportion of runs where that rate was the fastest; the blue line gives the proportion of the runs where the rate was deleted.


Figure S14. Rank comparisons for transition rates for the dependent model of the opacity of numerals 3 and 4. The top panel shows the proportion of runs where each rate was the fastest, while the bottom panel gives the mean rank across runs.

References

1. Bowern, C. & Zentz, J. 2012 Numeral Systems in Australian languages. Anthropological Linguistics

2. Bowern, C. & Atkinson, Q. 2012 Computational phylogenetics and the internal structure of Pama-Nyungan. Language 88, 817–845.

3. R Core Team 2014 R: A language and environment for statistical computing.

4. Dixon, R. M. W. 1980 The languages of Australia. Cambridge: Cambridge University Press.

5. Bowern, C. & Koch, H. 2004 Australian languages: Classification and the comparative method. Amsterdam / Philadelphia: John Benjamins.

6. Pagel, M. & Meade, A. 2004 A phylogenetic mixture model for detecting pattern-heterogeneity in gene sequence or character-state data. Systematic Biology 53, 571–581.

7. Pagel, M., Meade, A. & Barker, D. 2004 Bayesian estimation of ancestral character states on phylogenies. Systematic Biology 53, 673–684.

8. Suzuki, R. & Shimodaira, H. 2006 Pvclust: an R package for assessing the uncertainty in hierarchical clustering. Bioinformatics 22, 1540–1542. (doi:10.1093/bioinformatics/btl117)


9. Dunn, M., Greenhill, S. J., Levinson, S. C. & Gray, R. D. 2011 Evolved structure of language shows lineage-specific trends in word-order universals. Nature 473, 79–82. (doi:10.1038/nature09923)
