genetic, geographic, and linguistic distances europe · proc. natl. acad. sci. usa vol. 85, pp....

5
Proc. Natl. Acad. Sci. USA Vol. 85, pp. 1722-1726, March 1988 Population Biology Genetic, geographic, and linguistic distances in Europe (human variation/gene frequencies/cranial variables/language families) ROBERT R. SOKAL Department of Ecology and Evolution, State University of New York at Stony Brook, Stony Brook, NY 11794-5245 Contributed by Robert R. Sokal, November 13, 1987 ABSTRACT Genetic and taxonomic distances were com- puted for 3466 samples of human populations in Europe based on 97 allele frequencies and 10 cranial variables. Since the actual samples employed differed among the genetic systems studied, the genetic distances were computed separately for each system, as were matrices of geographic distances and of linguistic distances based on membership in the same language family or phylum. Significant matrix correlations between genetics and geography were found for the majority of sys- tems; somewhat less frequent are significant correlations between genetics and language. The effects of the two factors can be separated by means of partial matrix correlations. These show significant values for both genetics and geography, language kept constant, and genetics and language, geography kept constant, with a tendency for the former to be higher. These findings demonstrate that speakers of different language families in Europe differ genetically and that this difference remains even after geographic differentiation is allowed for. The greater effect of geography than of language may be due to the several factors that bring about spatial differentiation in human populations. A continuing problem in population biology is to estimate and explain genetic differences among the populations con- stituting a species. Such differences are generally measured as genetic distances (1-3). At the level of local populations, random differentiation has generally been invoked as a null model for explaining the observed differences among sam- pling units. Such differentiation is due to sampling errors from gene pools of limited size (genetic drift; see refs. 2, 4) and to the limited mobility of individuals within the area of study (isolation by distance; see refs. 2, 5, 6). Models have been proposed that explain the amount of differentiation in terms of distance among sampling units (7-9). Microevolu- tionary forces that complicate the realization of a straight- forward spatial differentiation model are selection and di- rected migration (as distinct from the random dispersal of individuals underlying the isolation-by-distance model). When genetic variation is studied on a continental scale, as in the present paper, another factor-history-may be responsible for some of the variation. Populations that differentiated elsewhere may undertake long-range directed migration and settle in a part of the area under study, displacing or genetically absorbing the previous residents. Such a process disturbs what might otherwise be a simple relation between genetic and geographic distance. In hu- mans, we have evidence from historical sources that such migrations have indeed taken place. Some are well known; others are inferred with greater or lesser certainty from archeological and prehistorical information. In animal and plant populations, such occurrences may be far more diffi- cult to detect. It is of interest, therefore, to examine genetic differences in humans, where some estimate of the relative time of separation of the populations concerned can be made through an estimation of linguistic distances. The justification for using language to detect relative time of separation is the customary simplifying assumption that a common language frequently signifies a common origin and a related language indicates a common origin further back in time. Such commonality of origin should be reflected by genetic relationship, despite several complicating factors. One of these is the well-documented, repeated genetic and linguistic assimilation of disparate ethnic elements within a named ethnic group of migrants, making immigrant popula- tions genetically heterogeneous. A second factor is that, even when immigrant populations were supposedly homogeneous, these rarely settled in unoccupied areas in Europe, but frequently absorbed the native populations of their settlement area, the resulting population adopting the language of either the natives or the immigrants. Both of these occurrences tend to diminish genetic-linguistic correspondence. However, since language differences themselves are likely to be barriers to free gene flow, they will enhance genetic differentiation, counteracting the earlier two forces to some extent. This paper describes the relations of genetic distances based on separate genetic systems in European populations to geographic distances of the populations studied and to linguis- tic distances based on language family affiliations. Correla- tions between such distances cannot be tested in conventional ways, because of the lack of independence of the elements of the distance matrices and also because these distance matri- ces are based on spatially autocorrelated variables (10-13). Such data require special methods of analysis treated in ref. 14. The investigation was, therefore, carried out by comput- ing correlations between these matrices using the normalized Mantel statistic, following the method of Smouse et al. (15). Three pairwise correlations between genetics, geography, and language, as well as partial correlations between genetics and geography, language kept constant, and between genetics and language, geography kept constant, were computed. The null hypotheses tested are that there is no relationship between any pair of variables under these conditions and that there is no consistent pattern of inequalities in the magnitudes of the correlations. Examining the relations between genetic, geo- graphic, and linguistic distances in European populations also answers indirectly the question of whether speakers of differ- ent language groups in Europe differ genetically. This ques- tion has also been addressed by two different approaches in my laboratory (ref. 30; R.R.S., N. L. Oden, P. Legendre, M.-J. Fortin, J. Kim, and A. Vaudor, unpublished data). MATERIALS AND METHODS For this study, I analyzed records of 97 gene frequencies at 3369 locations in Europe (6603 data points in all), as well as 10 cranial measurements available for 97 samples of recent European populations. The 107 different variables were grouped into 27 systems, most corresponding to a genetic locus. The systems employed are given below, each pre- ceded by the number assigned to it by Mourant et al. (16) and followed, in parentheses, by the number of allele frequencies and of samples employed, separated by a comma. Sources of 1722 The publication costs of this article were defrayed in part by page charge payment. This article must therefore be hereby marked "advertisement" in accordance with 18 U.S.C. §1734 solely to indicate this fact. Downloaded by guest on December 17, 2020

Upload: others

Post on 26-Aug-2020

3 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Genetic, geographic, and linguistic distances Europe · Proc. Natl. Acad. Sci. USA Vol. 85, pp. 1722-1726, March1988 Population Biology Genetic, geographic, andlinguistic distances

Proc. Natl. Acad. Sci. USAVol. 85, pp. 1722-1726, March 1988Population Biology

Genetic, geographic, and linguistic distances in Europe(human variation/gene frequencies/cranial variables/language families)

ROBERT R. SOKALDepartment of Ecology and Evolution, State University of New York at Stony Brook, Stony Brook, NY 11794-5245

Contributed by Robert R. Sokal, November 13, 1987

ABSTRACT Genetic and taxonomic distances were com-puted for 3466 samples of human populations in Europe basedon 97 allele frequencies and 10 cranial variables. Since theactual samples employed differed among the genetic systemsstudied, the genetic distances were computed separately foreach system, as were matrices of geographic distances and oflinguistic distances based on membership in the same languagefamily or phylum. Significant matrix correlations betweengenetics and geography were found for the majority of sys-tems; somewhat less frequent are significant correlationsbetween genetics and language. The effects of the two factorscan be separated by means of partial matrix correlations.These show significant values for both genetics and geography,language kept constant, and genetics and language, geographykept constant, with a tendency for the former to be higher.These findings demonstrate that speakers of different languagefamilies in Europe differ genetically and that this differenceremains even after geographic differentiation is allowed for.The greater effect of geography than of language may be dueto the several factors that bring about spatial differentiation inhuman populations.

A continuing problem in population biology is to estimateand explain genetic differences among the populations con-stituting a species. Such differences are generally measuredas genetic distances (1-3). At the level of local populations,random differentiation has generally been invoked as a nullmodel for explaining the observed differences among sam-pling units. Such differentiation is due to sampling errorsfrom gene pools of limited size (genetic drift; see refs. 2, 4)and to the limited mobility of individuals within the area ofstudy (isolation by distance; see refs. 2, 5, 6). Models havebeen proposed that explain the amount of differentiation interms of distance among sampling units (7-9). Microevolu-tionary forces that complicate the realization of a straight-forward spatial differentiation model are selection and di-rected migration (as distinct from the random dispersal ofindividuals underlying the isolation-by-distance model).When genetic variation is studied on a continental scale,

as in the present paper, another factor-history-may beresponsible for some of the variation. Populations thatdifferentiated elsewhere may undertake long-range directedmigration and settle in a part of the area under study,displacing or genetically absorbing the previous residents.Such a process disturbs what might otherwise be a simplerelation between genetic and geographic distance. In hu-mans, we have evidence from historical sources that suchmigrations have indeed taken place. Some are well known;others are inferred with greater or lesser certainty fromarcheological and prehistorical information. In animal andplant populations, such occurrences may be far more diffi-cult to detect. It is of interest, therefore, to examine geneticdifferences in humans, where some estimate of the relative

time of separation of the populations concerned can be madethrough an estimation of linguistic distances.Thejustification for using language to detect relative time of

separation is the customary simplifying assumption that acommon language frequently signifies a common origin and arelated language indicates a common origin further back intime. Such commonality of origin should be reflected bygenetic relationship, despite several complicating factors.One of these is the well-documented, repeated genetic andlinguistic assimilation of disparate ethnic elements within anamed ethnic group of migrants, making immigrant popula-tions genetically heterogeneous. A second factor is that, evenwhen immigrant populations were supposedly homogeneous,these rarely settled in unoccupied areas in Europe, butfrequently absorbed the native populations of their settlementarea, the resulting population adopting the language of eitherthe natives or the immigrants. Both of these occurrences tendto diminish genetic-linguistic correspondence. However,since language differences themselves are likely to be barriersto free gene flow, they will enhance genetic differentiation,counteracting the earlier two forces to some extent.

This paper describes the relations of genetic distancesbased on separate genetic systems in European populations togeographic distances of the populations studied and to linguis-tic distances based on language family affiliations. Correla-tions between such distances cannot be tested in conventionalways, because of the lack of independence of the elements ofthe distance matrices and also because these distance matri-ces are based on spatially autocorrelated variables (10-13).Such data require special methods of analysis treated in ref.14. The investigation was, therefore, carried out by comput-ing correlations between these matrices using the normalizedMantel statistic, following the method of Smouse et al. (15).Three pairwise correlations between genetics, geography, andlanguage, as well as partial correlations between genetics andgeography, language kept constant, and between genetics andlanguage, geography kept constant, were computed. The nullhypotheses tested are that there is no relationship betweenany pair of variables under these conditions and that there isno consistent pattern of inequalities in the magnitudes of thecorrelations. Examining the relations between genetic, geo-graphic, and linguistic distances in European populations alsoanswers indirectly the question of whether speakers of differ-ent language groups in Europe differ genetically. This ques-tion has also been addressed by two different approaches inmy laboratory (ref. 30; R.R.S., N. L. Oden, P. Legendre,M.-J. Fortin, J. Kim, and A. Vaudor, unpublished data).

MATERIALS AND METHODSFor this study, I analyzed records of 97 gene frequencies at3369 locations in Europe (6603 data points in all), as well as10 cranial measurements available for 97 samples of recentEuropean populations. The 107 different variables weregrouped into 27 systems, most corresponding to a geneticlocus. The systems employed are given below, each pre-ceded by the number assigned to it by Mourant et al. (16) andfollowed, in parentheses, by the number of allele frequenciesand of samples employed, separated by a comma. Sources of

1722

The publication costs of this article were defrayed in part by page chargepayment. This article must therefore be hereby marked "advertisement"in accordance with 18 U.S.C. §1734 solely to indicate this fact.

Dow

nloa

ded

by g

uest

on

Dec

embe

r 17

, 202

0

Page 2: Genetic, geographic, and linguistic distances Europe · Proc. Natl. Acad. Sci. USA Vol. 85, pp. 1722-1726, March1988 Population Biology Genetic, geographic, andlinguistic distances

Proc. Natl. Acad. Sci. USA 85 (1988) 1723

the data for systems 1.1 through 65 are refs. 16 and 17 and theresults of an extensive computer search of the recent litera-ture. System numbers 100-910 were assigned in my labora-tory and are also employed in related studies. Systems 100and 101-102 were obtained through the courtesy of P. Me-nozzi, A. Piazza, and L. L. Cavalli-Sforza. Systems 200 and201 come from ref. 18 and system 901-910 comes from ref. 19.The systems are 1.1 ABO (3, 870), 1.2 ABO with anti-A, -Al,and -B (4, 157), 2.5 MN (2, 194), 2.7 MN with anti-M, -N, and-S (4, 68), 3.1 P (2, 102), 4.1 Rh (2, 568), 4.13 Rh with anti-C,-D, -E, and -c (8, 82), 4.19 Rh with anti-C, -D, -E, -c, and -e (8,76), 5.1 Lu (2, 33), 6.1 K (2, 116), 6.3 K with anti-K and -k (2,39), 7.1 Se (2, 53), 8.1 Fy (2, 108), 36.1 Hp (2, 175), 37.1 Tf(3,38), 38.1 Gc (2, 112), 50.1.1 ACPI (3, 72) , 52 PGD (3, 42), 53PGMI (3, 70), 56AK (3, 64), 63 ADA (2, 53), 65 T (2, 62), 100HLA-A (7, 66), 101-102 HLA-B (14, 66), 200 Gm 1,2,5 (4, 45),201 Km (Inv) (2, 38), and 901-910 cranial variables (10variables from ref. 19; 10, 97). The numbers of localitiessampled for each separate system range from 870 for ABO to33 for Lu. The gene frequencies are based on sample sizesranging from 50 to many thousands of persons. All genefrequency estimates are from samples collected after WorldWar II. The cranial measurements are means based on at least25 skulls from populations dated between A.D. 1500 and thepresent (19). The distribution of samples over the continent ofEurope (pooled for the different systems) is shown in Fig. 1.

Since the numbers of localities and the actual samplesemployed differ among the systems studied, genetic dis-tances between all pairs of localities were computed sepa-rately for each system. Trial computations of genetic dis-tances D1 through D5 featured in ref. 3 and based on earlierwork (20, 21) yielded similar findings. Therefore, the sim-plest, DI, formulated in ref. 21, was adopted, and all resultsreported here are based on it. Its formula for a single locuswith m alleles for populations j and k, pij being the genefrequency of the ith allele in population j, is

m

D = 0.5 IPij - PiklIi=1

For the cranial variables taxonomic distances (22) werecomputed. The separate distance matrices computed for the

27 systems differ in dimension from the very large 870 x 870matrix for ABO (system 1.1) to the 33 x 33 matrix for Lu.Geographic distances were computed as great-circle dis-

tances from the geographic coordinates of the locality sam-ples, separately for each set of samples comprising onesystem. Earlier trial computations had shown that great-circle distances invariably yielded higher correlations withthe genetic and linguistic distances than did inverse-squaredgeographic distances.The European populations studied fall into 5 language phyla

and 12 language families (23). The families are listed below,preceded by their phyla (in capitals). INDO-EUROPEAN:Albanian, Baltic, Celtic, Germanic, Greek, Romance, Slavic;FINNO-UGRIC: Finnic, Ugric (Hungarian); ALTAIC:Turkic; AFRO-ASIATIC: Semitic (Maltese); BASQUE:Basque. Language family boundaries were obtained by con-sulting a number of sources (24-29, 31). Sample localitiesclose to language boundaries were investigated carefully toascertain the language actually spoken.Corresponding to the set of locality samples comprising

one system, a linguistic distance matrix was constructed,which indicates the distances between all pairs of samplessharing the same language family as 0, distances betweensamples from the same language phylum but different lan-guage families as 1, and distances between samples fromdifferent language phyla as 2. Previous experience with morerefined distances based on estimated times of separation hadnot shown appreciably different results from the trinarydistances employed here (32).The comparison of the distance matrices was carried out

by the Mantel procedure (33, 34), a quadratic assignmentmethod (14). Designating the elements of any two distancematrices as wjk and djk, respectively, the Mantel test statisticZ is computed as

Z = > WJkd.k,jk

where Yjk is summation over allj and k other thanj - k. Thenull hypothesis tested is independence of the elements of thetwo matrices-the W matrix (e.g., spatial distances) and theD matrix (e.g., genetic distances)-for the system studied.Expectations for moments of Z under this null hypothesis

FIG. 1. Map of all sample localities studied. This map records the locations of the samples of all 107 gene frequencies and cranial variablesanalyzed in this study. It provides an indication of the relative abundance of data in various regions of Europe.

Population Biology: Sokal

Dow

nloa

ded

by g

uest

on

Dec

embe

r 17

, 202

0

Page 3: Genetic, geographic, and linguistic distances Europe · Proc. Natl. Acad. Sci. USA Vol. 85, pp. 1722-1726, March1988 Population Biology Genetic, geographic, andlinguistic distances

Proc. Nati. Acad. Sci. USA 85 (1988)

have been derived in ref. 33. The distribution of Z wasshown to be asymptotically normal, leading to a straightfor-ward significance test. Alternatively, one may test the sig-nificance of the Mantel statistic by a Monte Carlo test, inwhich rows and columns of one of the two matrices arerandomly permuted, followed each time by recalculation ofZ. Proposals for normalizing Z to a coefficient ranging from-1 to +1 have been made in refs. 15, 35, 36. The firstmethod is the one adopted here and called matrix correla-tion, following ref. 22.Three different approaches have been suggested recently

for investigating the relations among three distance matrices(15, 36, 37). Two of these are employed in this study. Let thethree matrices to be compared be designated as A, B, and C.Smouse et al. (15) compute the partial correlations rAB.C andrAC.B of the matrix elements. These authors test the signifi-cance of partial correlation rABc by computing residualmatrices from the regressions of A on C and B on C, thenobtaining the distribution of the partial correlation as anormalized Mantel product of the two residual matrices,permuting rows and columns of either matrix. Dow andCheverud (37) compare matrices A and (B - C)-that is,they carry out a Mantel test between matrix A and thedifference matrix, B - C. The matrices B and C must becomparably scaled before the subtraction. This proceduretests whether rAB rAc, and, by its sign, suggests which ofthe two distance matrices B or C has the greater correlationwith distance matrix A.The results of significance tests for normalized Mantel

correlations are reported as probabilities based on the as-ymptotic approximation to normality. These are obtained ata fraction of the computation cost of permutational MonteCarlo procedures, which are very computer intensive, espe-

cially for the large matrices employed in this study. Exten-sive comparisons of the results obtained in my laboratory byboth methods for numerous data sets and for some systemsin this study yielded quite similar probabilities.

RESULTSTable 1 shows the pairwise and partial correlations for the 27systems. The great majority (22 of the 27 systems) yieldssignificant correlations between the genetic and geographicdistances (P < 0.05). The magnitudes of the significantcorrelation coefficients range from a low of 0.109 for T to ahigh of 0.588 for HLA-B. These correlations may not seemimpressive in terms of the total percent of the jointlydetermined variation. In my experience with such matrixcorrelations values above 0.4 are rarely obtained, eventhough the significance of even low coefficients may beestablished with near certainty. There clearly is a substantialand significant association for most genetic systems betweenthe genetic and geographic distances of pairs of populations.Because it has been suggested (e.g., ref. 38) that geneticdistance may be an inverse exponential function of geo-graphic distance, the correlations were recomputed with thegenetic distances transformed to natural logarithms. Suchcorrelations were never substantially greater than thosebased on the untransformed distances and frequently wereless.Only 16 of the 27 correlations between the genetic dis-

tances and the corresponding linguistic distances are signif-icant at P < 0.05. Of the five systems with nonsignificantcorrelations between genetics and geography, four (all but P)also lack significance between genetics and language.The correlations between geography and language are all

highly significant, as would be expected. Nearby popula-

Table 1. Pairwise and partial correlations

System GEN,GEO GEN,LANG GEO,LANG GEN,GEO.LANG GEN,LANG.GEO GEN,(GEO-LANG)1.1 ABO 0.337* 0.378* 0.353* 0.236* 0.294* -0.040t1.2 AIA2BO 0.310* 0.248* 0.343* 0.247* 0.159* 0.0612.5 MN 0.037 0.009 0.379* 0.037 -0.005 0.0282.7 MNS haplotypes 0.262* 0.012 0.384* 0.279* -0.100 0.251*3.1 P 0.064 0.139* 0.320* 0.021 0.125t -0.0754.1 Rh 0.117* 0.003 0.312* 0.122* -0.035 0.114*4.13 Rh haplotypes 0.162t 0.297* 0.318* 0.075 0.262* -0.1354.19 Rh haplotypes 0.047 0.013 0.427* 0.045 -0.008 0.0345.1 Lu -0.034 0.031 0.461* -0.054 0.052 -0.0656.1 K 0.173t 0.118t 0.558* 0.130t 0.026 0.0556.3 K (anti-K, -k) 0.148t 0.101 0.380* 0.119 0.049 0.0477.1 Se 0.318* -0.034 0.285* 0.342* -0.137 0.352*8.1 Fy 0.140* 0.160* 0.376* 0.088t 0.117t -0.02036.1 Hp 0.200* 0.052 0.285* 0.194* -0.006 0.148*37.1 Tf 0.188t 0.410* 0.443* 0.008 0.371* -0.222*38.1 Gc -0.009 -0.016 0.380* -0.003 -0.014 0.00750.1.1 ACPI 0.303* 0.433* 0.453* 0.133t 0.347* -0.12952 PGD 0.124t 0.134 0.500* 0.066 0.084 -0.01153 PGMI 0.396* 0.365* 0.481* 0.270* 0.217t 0.03056 AK 0.247* 0.064 0.435* 0.244* -0.050 0.183t63 ADA 0.421* 0.449* 0.708* 0.163t 0.237* -0.02965 T 0.109t 0.126t 0.531* 0.050 0.081 -0.017100 HLA-A 0.577* 0.380* 0.440* 0.493* 0.172t 0.197t101-102 HLA-B 0.588* 0.302* 0.440* 0.531* 0.060 0.286*200 Gm 1,2,5 0.392* 0.283* 0.370* 0.322* 0.161t 0.109201 Km 0.355* 0.217t 0.408* 0.299t 0.084 0.138901-9l0 cranial variables 0.211* 0.099t 0.593* 0.191* -0.034 0.111tGEN, genetics; GEO, geography; LANG, language. Pairwise correlations are identified as, for example, GEN,GEO to indicate the

correlation between genetic and geographic distances; partial correlations are identified as, for example, GEN,GEO.LANG to indicate thecorrelation of genetic and geographic distances, language distances kept constant. The last column lists the results of the Dow and Cheverud(37) procedure expressed as a correlation coefficient. Positive values indicate a greater effect on genetics by geography than by language;negative values indicate the opposite. All correlation coefficients are normalized Mantel coefficients following the method of Smouse et al. (15).to.01 < P - 0.05. t0.001 < P s 0.01. *P s 0.001.

1724 Pooulation Biology: Sokal

Dow

nloa

ded

by g

uest

on

Dec

embe

r 17

, 202

0

Page 4: Genetic, geographic, and linguistic distances Europe · Proc. Natl. Acad. Sci. USA Vol. 85, pp. 1722-1726, March1988 Population Biology Genetic, geographic, andlinguistic distances

Proc. Natl. Acad. Sci. USA 85 (1988) 1725

tions speak similar languages; distant populations differlinguistically. The separate values for the correlation coeffi-cient ranging from 0.284 to 0.708 are, of course, not inde-pendent but are samples from the parametric correlation oflanguage distance with geographic distance in Europe (r =0.322 was obtained by covering Europe with a fine grid andcorrelating the linguistic distance between each pair of gridpoints with its geographic distance).The partial correlations of genetics with geography, lan-

guage kept constant, and of genetics with language, geogra-phy kept constant, substantially reflect the values of thecorresponding pairwise correlations, although fewer (17 and11, respectively) are significant. No partial correlation issignificant for any system that was not also significant as apairwise correlation.To obtain an overall probability for rejection of the null

hypothesis of no significant correlation, I applied Fisher'stest for combining probabilities to the probabilities associ-ated with the correlation coefficients of the columns of Table1. First, I reduced the number of systems from 27 to 24, bycomputing Bonferroni probabilities for pooled estimates ofthe pairs of ABO, Rh haplotype, and K (Kell) systems that,obviously, are not independent of each other (for methodssee ref. 39). For the correlation between genetics andgeography, x2 = 392.757; df = 48; P << 0.001. A compa-rable finding was obtained for the correlations betweengenetics and language (X2 = 294.701; P << 0.001). For thetwo columns of partial correlations (genetics and geography,language kept constant, and genetics and language, geogra-phy kept constant), highly significant overall probabilitieswere again obtained (X2 = 312.397 and 198.867; P <<0.001). Thus, there is little doubt that on the whole, bothpartial correlations are greater than zero in these data.Do geographic distances or linguistic distances show

higher partial correlations with genetic distances? Of the 27systems, 17 show higher partial correlations for genetics andgeography, language kept constant, than for genetics andlanguage, geography kept constant. The remaining 10 sys-tems show the reverse relationship. Applying a sign test tothese comparisons, the preponderance of higher partialcorrelations of genetics with geography is not significant.Because these correlations are based on distance matrices, itis not possible to apply conventional significance tests to thedifference of the partial correlations. However, from mereinspection of the values, it is obvious that there is significantheterogeneity in these data so that for some systems, genet-ics and geography are more closely related, whereas forother systems, genetics and language have the higher value.The differential effects of geography and language can be

addressed directly by the technique of Dow and Cheverud(37) described earlier. The values obtained are shown in thelast column of Table 1. Not surprisingly, the signs of thesecorrelation coefficients match the inequalities in magnitudenoted earlier for the partial correlation coefficients in the twopreceding columns of Table 1. Thus, the preponderance ofcoefficients showing greater effects of geography than oflanguage cannot be shown significant by a sign test. Ofthe 27systems, 10 show significant differences between the effectsof geography and language. Two of these, ABO (system 1.1)and Tf show language distances to have a greater effect ongenetic distances than do geographic distances, but 8 of the10 attest to the greater influence of geographic distances.They are MNS haplotypes (system 2.7), Rh (system 4.1), Se,Hp, AK, HLA-A, HLA-B, and the cranial variables. The nullhypothesis that there are no significant differences betweenthe effects of geography and language over all of the systemscan be decisively rejected by Fisher's test for combiningprobabilities which yields x2 = 181.537, P << 0.001. Thus,there is little doubt that, on the whole, an inequality of the

effect of geography and language can be demonstrated inthese data.

DISCUSSIONThe results of this study show conclusively that there aregenetic differences between populations in Europe whenthese are grouped by language families. This is true evenwhen the effect of geographic distance is kept constant, as inthe partial correlations of genetics with language, geographykept constant. Such correlations can be interpreted as theeffect of language distance on genetic distance between pairsof observations found at constant geographic distance. Boththe substantial number of significant partial correlations ofthis type and the test of their overall significance support theabove conclusions. The data also show significant andpossibly higher correlation of genetic distances with geo-graphic distances in Europe. The partial correlation betweengenetics and geography, language kept constant, representsthe effect of geographic distance on genetic distance withinan area of unchanging language distances.Although the evidence is suggestive that the correlations

between genetic distances and geographic distances arehigher than those between genetic distances and languagedistances, this trend cannot be shown to be statisticallysignificant. Are there reasons for believing that such a trendshould exist? Significant genetics-language correlation willoccur whenever pairs of localities not belonging to the samelanguage family show greater genetic distances than thosethat do. This argument can be extended to locality pairs fromthe same language phylum versus those from different phyla.The absence of such correlations in some systems is ex-plained by observations made in related studies (ref. 30;R.R.S., N. L. Oden, P. Legendre, M.-J. Fortin, J. Kim, andA. Vaudor, unpublished data) that different language fami-lies differ in the allele frequencies and genetic loci thatdifferentiate them. If the particular combination of samplesfrom given language families is such that substantial geneticdifferences exist for a given system between populationsspeaking the same and different languages, then a markedcorrelation between genetics and language will result for thatsystem and probably be significant. If, on the other hand,some combinations of language families show genetic differ-ences for a given system, whereas others do not, thecorrelation will be weakened. Note that populations belong-ing to different language phyla in Europe do not necessarilyoccur at the largest spatial distances. Thus, geographicdistance and interphylum differences are not confounded inthe analysis.For the above reasons, language-genetics correlations will

not always be significant or large. But geography-geneticscorrelations are more likely to be substantial and significantbecause there are genetic differences between different areasregardless of the language they speak. Even when thecomparison is between populations belonging to the samelanguage family, if these are far apart, they are likely to differgenetically due to some combination of three factors: (i)geographic differentiation of substrate populations beforethe Indo-European, Finno-Ugric, and Turkic speakers en-tered Europe; (it) geographically distant areas were occupiedby different branches of the same language families; and (iii)subsequent differentiation.Other studies comparing genetic or morphometric dis-

tances with linguistic and geographic distances (37, 40-47)range in scale from a very small 9 km (42) to the continental(7000 km for sub-Saharan Africa; refs. 45 and 46). They varyin number of samples studied as well, from as few as 4 (47)to as many as 183 (44). Only refs. 37, 40, 41, 43, and 44 are

comparable to the present study in terms of analyticaltechniques. In all studies with adequate sample sizes, signif-

Population Biology: Sokal

Dow

nloa

ded

by g

uest

on

Dec

embe

r 17

, 202

0

Page 5: Genetic, geographic, and linguistic distances Europe · Proc. Natl. Acad. Sci. USA Vol. 85, pp. 1722-1726, March1988 Population Biology Genetic, geographic, andlinguistic distances

Proc. Natl. Acad. Sci. USA 85 (1988)

icant correlations can be shown between geographic andphenetic distances (comprising genetic, anthropometric, ordermatoglyphic distances). Similarly significant correlationscan be shown between phenetics and language. Correlationsor partial correlations with language are greater than thosewith geography in only two specific instances (37, 43), butneither difference has been shown to be significant. In fact,no difference between the effects of language and geographyon overall phenetic distances has been shown to be signifi-cant in any other study. Only in this paper are theresignificant differences established between the effects ofgeography and language for specific systems.The extent of the correlation between language distances

and geographic distances ranges widely for different studiesfrom a low of 0.115 for Kenyan tribes (differentiated else-where and relatively recently situated at their present loca-tions; ref. 43), to highs of 0.773 and 0.7% for the YanomamaAmerindians (40) and the Solomon Islanders (41), respec-tively. In European samples studied at three successive500-year intervals, the correlation increased with time from0.147 in the Early Middle Ages to 0.351 in modem times (44).

In summary, these results clearly demonstrate that speak-ers of different language families in Europe differ geneticallyand that this difference remains even after allowing forgeographic differentiation. There is a strong suggestion thatthe effect of geographic differentiation on genetic structuremay be greater than that of linguistic differentiation, but thistrend cannot be unequivocally confirmed.

I appreciate the constructive criticisms by Drs. Peter Smouse,Neal Oden, Guido Barbujani, and Rosalind Harding of an earlierversion of this manuscript. The work would have been impossiblewithout the expert computational assistance of Barbara Thomson.Rosalind Harding, Donna DiGiovanni, and Cheryl Daly helped withadditional computations and word processing. This work was sup-ported by Grant GM28262 from the National Institutes of Health.This article is contribution 649 in Ecology and Evolution from theState University of New York at Stony Brook.

1. Nei, M. (1972) Am. Nat. 106, 283-292.2. Nei, M. (1987) Molecular Evolutionary Genetics (Columbia

Univ. Press, New York).3. Wright, S. (1978) Evolution and the Genetics of Populations,

Vol. 4: Variability Within and Among Populations (Univ. ofChicago Press, Chicago).

4. Wright, S. (1969) Evolution and the Genetics of Populations,Vol. 2: The Theory of Gene Frequencies (Univ. of ChicagoPress, Chicago).

5. Rohlf, F. J. & Schnell, G. D. (1971) Am. Nat. 105, 295-324.6. Endler, J. A. (1977) Geographic Variation, Speciation, and

Clines (Princeton Univ. Press, Princeton, NJ).7. Malecot, G. (1973) in Genetic Structure of Populations, ed.

Morton, N. E. (Univ. of Hawaii Press, Honolulu), pp. 72-75.8. Morton, N. E. (1973) in Genetic Structure ofPopulations, ed.

Morton, N. E. (Univ. of Hawaii Press, Honolulu), pp. 76-79.9. Kimura, M. & Weiss, G. H. (1964) Genetics 49, 561-575.

10. Cliff, A. D. & Ord, J. K. (1981) Spatial Processes (Pion,London).

11. Sokal, R. R. & Oden, N. L. (1978) Biol. J. Linn. Soc. 10,199-228.

12. Sokal, R. R. & Wartenberg, D. E. (1981) in Dynamic SpatialModels, eds. Griffith, D. & McKinnon, R. (Sijthoffand Noord-hoff, Alphen aan den Rijn, The Netherlands), pp. 186-213.

13. Sokal, R. R. & Wartenberg, D. E. (1983) Genetics 105,219-237.

14. Hubert, L. J. (1987) Assignment Methods in Combinatorial

Data Analysis (Dekker, New York).15. Smouse, P. E., Long, J. C. & Sokal, R. R. (1986) Syst. Zool.

35, 627-632.16. Mourant, A. E., Kopec, A. C. & Domaniewska-Sobczak, K.

(1976) The Distribution of the Human Blood Groups (OxfordUniv. Press, London).

17. Tills, D., Kopdc, A. C. & Tills, R. E. (1983) The Distributionof the Human Blood Groups and Other Polymorphisms (Ox-ford Univ. Press, London), Suppl. 1.

18. Steinberg, A. G. & Cook, C. D. (1981) The Distribution of theHuman Immunoglobulin Allotypes (Oxford Univ. Press, Lon-don).

19. Schwidetzky, I. & Rosing, F. W. (1984) Homo 35, 1-49.20. Cavalli-Sforza, L. L. & Edwards, A. W. F. (1967) Evolution

21, 550-570.21. Prevosti, A., Ocana, J. & Alonso, G. (1975) Theor. Appl.

Genet. 45, 231-241.22. Sneath, P. H. A. & Sokal, R. R. (1973) Numerical Taxonomy

(Freeman, San Francisco).23. Ruhlen, M. (1987) A Guide to the World's Languages, Vol. 1:

Classification (Stanford Univ. Press, Stanford, CA).24. Moulton, W. G., Haugen, E. & Herzog, M. I. (1976) The New

Encyclopaedia Britannica (Macropaedia), (Chicago), Vol. 8,pp. 19-31.

25. Cowgill, W. (1976) The New Encyclopaedia Brittanica (Macro-paedia), (Chicago), Vol. 9, pp. 431-438.

26. Ivanov, V. V. (1976) The New Encyclopaedia Britannica(Macropaedia), (Chicago), Vol. 16, pp. 866-874.

27. Harms, R. T. (1976) The New Encyclopaedia Brittanica(Macropaedia), (Chicago), Vol. 18, pp. 1022-1032.

28. Posner, R. (1976) The New Encyclopaedia Britannica (Macro-paedia), (Chicago), Vol. 15, pp. 1025-1045.

29. Meillet, A. & Cohen, M., eds. (1952) Les Langues du Monde(Centre National de la Recherche Scientifique, Paris), maps I& VII.

30. Sokal, R. R., Oden, N. L. & Thomson, B. A. (1988) Am. J.Phys. Anthropol., in press.

31. Mather, J. Y., Speitel, H. H. & Leslie, G. W., eds. (1975) TheLinguistic Atlas of Scotland (Archon Books, Hamden, CT).

32. Sokal, R. R. & Uytterschaut, H. (1987) Am. J. Phys. Anthro-pol. 74, 21-38.

33. Mantel, N. (1967) Cancer Res. 27, 209-220.34. Sokal, R. R. (1979) Syst. Zool. 28, 227-231.35. Hubert, L. J. & Golledge, R. G. (1982) Geogr. Anal. 14,

273-278.36. Hubert, L. (1985) Psychometrika 50, 449-467.37. Dow, M. M. & Cheverud, J. M. (1985) Am. J. Phys. Anthro-

pol. 68, 367-373.38. Piazza, A. & Menozzi, P. (1983) in Numerical Taxonomy, ed.

Felsenstein, J. (Springer, Berlin), pp. 444-450.39. Sokal, R. R. & Rohlf, F. J. (1981) Biometry (Freeman, San

Francisco), 2nd Ed.40. Sokal, R. R., Smouse, P. E. & Neel, J. V. (1986) Genetics 114,

259-281.41. Dow, M. M., Cheverud, J. M. & Friedlaender, J. S. (1987)

Am. J. Phys. Anthropol. 72, 343-353.42. Smouse, P. E. & Wood, J. W. (1987) in Mammalian Dispersal

Patterns: The Effects of Social Structure on Population Ge-netics, eds. Chepko-Sade, B. D. & Halpin, Z. (Univ. ofChicago Press, Chicago), pp. 211-214.

43. Sokal, R. R. ,& Winkler, E.-M. (1987) Hum. Biol. 59, 147-164.44. Sokal, R. R., Uytterschaut, H., Rosing, F. W. & Schwi-

detzky, I. (1987) Am. J. Phys. Anthropol. 74, 1-20.45. Vecchi, F. & Passarello, P. (1977-1979) Riv. Antropol. 60,

187-194.46. Rosing, F. W. (1984-1985) Riv. Antropol. 63, 259-262.47. Parsons, P. A. & White, N. G. (1973) in The Human Biology of

Aborigines in Cape York, ed. Kirk, R. L. (Australian Instituteof Aboriginal Studies, Canberra), Australian Aboriginal Stud-ies No. 44, pp. 81-94.

1726 Population Biology: Sokal

Dow

nloa

ded

by g

uest

on

Dec

embe

r 17

, 202

0