admixture, population structure and f-statistics - genetics · genetics | investigation admixture,...
TRANSCRIPT
GENETICS | INVESTIGATION
Admixture, Population Structure and F-statisticsBenjamin M Peter1
1Department of Human Genetics, University of Chicago, Chicago IL USA
ABSTRACT Many questions about human genetic history can be addressed by examining the patterns of shared geneticvariation between sets of populations. A useful methodological framework for this purpose are F-statistics, that measureshared genetic drift between sets of two, three and four populations, and can be used to test simple and complexhypotheses about admixture between populations. This paper provides context from phylogenetic and populationgenetic theory. I review how F-statistics can be interpreted as branch lengths or paths, and derive new interpretationsusing coalescent theory. I further show that the admixture tests can be interpreted as testing general properties ofphylogenies, allowing extension of some ideas applications to arbitrary phylogenetic trees. The new results are usedto investigate the behavior of the statistics under different models of population structure, and show how populationsubstructure complicates inference. The results lead to simplified estimators in many cases, and I recommend to replaceF3 with the average number of pairwise differences for estimating population divergence.
KEYWORDS admixture; gene flow; phylogenetics; population genetics; phylogenetic network
1
2For humans, whole-genome genotype data is now available for 3
individuals from hundreds of populations (Lazaridis et al. 4
2014; Yunusbayev et al. 2015), opening up the possibility to ask 5
more detailed and complex questions about our history (Pickrell 6
and Reich 2014; Schraiber and Akey 2015), and stimulating the 7
development of new tools for the analysis of the joint history of 8
these populations (Reich et al. 2009; Patterson et al. 2012; Pickrell 9
and Pritchard 2012; Lipson et al. 2013; Ralph and Coop 2013; 10
Hellenthal et al. 2014). A simple and intuitive framework for this 11
purpose that has quickly gained in popularity are the F-statistics, 12
introduced by Reich et al. (2009), and summarized in Patterson 13
et al. (2012). In that framework, inference is based on “shared 14
genetic drift” between sets of populations, under the premise 15
that shared drift implies a shared evolutionary history. Tools 16
based on this framework have quickly become widely used in 17
the study of human genetic history, both for ancient and modern 18
DNA (Green et al. 2010; Reich et al. 2012; Lazaridis et al. 2014; 19
Haak et al. 2015; Allentoft et al. 2015). 20
Some care is required with terminology, as the F-statistics 21
sensu Reich et al. (2009) are distinct, but closely related to 22
Wright’s fixation indices (Wright 1931; Reich et al. 2009), which 23
are also often referred to as F-statistics. Furthermore, it is nec- 24
Copyright © 2016 by the Genetics Society of Americadoi: 10.1534/genetics.XXX.XXXXXXManuscript compiled: Wednesday 3rd February, 2016%[email protected]
essary to distinguish between statistics (quantities calculated 25
from data) and the underlying parameters (which are part of the 26
model)(Weir and Cockerham 1984). 27
In this paper, I will mostly discuss model parameters, and I 28
will therefore refer to them as drift indices. The term F-statistics 29
will be used when referring to the general framework introduced 30
by Reich et al. (2009), and Wright’s statistics will be referred to 31
as FST or f . 32
Most applications of the F-statistic-framework can be phrased 33
in terms of the following six questions: 34
1. (Treeness test): Are populations related in a tree-like fash- 35
ion? (Reich et al. 2009) 36
2. (Admixture test): Is a particular population descended from 37
multiple ancestral populations? (Reich et al. 2009) 38
3. (Admixture proportions): What are the contributions from 39
different populations to a focal population (Green et al. 2010; 40
Haak et al. 2015). 41
4. (Number of founders): How many founder populations are 42
there for a certain region? (Reich et al. 2012; Lazaridis et al. 43
2014) 44
5. (Complex demography): How can mixtures and splits of 45
population explain demography? (Patterson et al. 2012; 46
Lipson et al. 2013) 47
Genetics, Vol. XXX, XXXX–XXXX February 2016 1
Genetics: Early Online, published on February 8, 2016 as 10.1534/genetics.115.183913
Copyright 2016.
A
B
Figure 1 Population phylogeny and admixture graph. A: Apopulation phylogeny with branches corresponding to F2
(green), F3 (yellow) and F(B)4 (blue). B: An admixture graph
extends population phylogenies by allowing gene flow (red,full line) and admixture events (red, dotted).
6. (Closest relative): What is the closest relative to a contem- 48
porary or ancient population (Raghavan et al. 2014) 49
The demographic models under which these questions are 50
addressed, and that motivated the drift indices, are called popula- 51
tion phylogenies and admixture graphs. The population phylogeny 52
(or population tree), is a model where populations are related in 53
a tree-like fashion (Figure 1A), and it frequently serves as the null 54
model for admixture tests. The branch lengths in the population 55
phylogeny correspond to how much genetic drift occurred, so 56
that a branch that is subtended by two different populations can 57
be interpreted as the “shared” genetic drift between these pop- 58
ulations. The alternative model is an admixture graph (Figure 59
1B), which extends the population phylogeny by allowing edges 60
that represent population mergers or a significant exchange of 61
migrants. 62
Under a population phylogeny, the three F-statistics pro- 63
posed by Reich et al. (2009), labelled F2, F3 and F4 have interpre- 64
tations as branch lengths (Fig. 1A) between two, three and four 65
taxa, respectively: Assume populations are labelled as P1, P2, . . . , 66
then 67
• F2(P1, P2) corresponds to the path on the phylogeny from 68
P1 to P2. 69
• F3(PX ; P1, P2) represents the length of the external branch 70
from PX to the (unique) internal vertex connecting all three 71
populations. Thus, the first parameter of F3 has an unique 72
role, whereas the other two can be switched arbitrarily. 73
• F(B)4 (P1, P2; P3, P4) represents the internal branch from the 74
internal vertex connecting P1 and P2 to the vertex connect- 75
ing P3 and P4 (blue). 76
If the arguments are permuted, some F-statistics will have no 77
corresponding internal branch. In particular, it can be shown 78
that in a population phylogeny, one F4-index will be zero, imply- 79
ing that the corresponding internal branch is missing. This is the 80
property that is used in the admixture test. For clarity, I add the 81
superscript F(B)4 if I need to emphasize the interpretation of F4 82
as a branch length, and F(T)4 to emphasize the interpretation as a 83
test-statistic. For details, see the subsection on F4 in the results 84
section. 85
In an admixture graph, there is no longer a single branch 86
length corresponding to each F-statistic, and interpretations are 87
more complex. However, F-statistics can still be thought of 88
as the proportion of genetic drift shared between populations 89
(Reich et al. 2009). The basic idea exploited in addressing all 90
six questions outlined above is that under a tree model, branch 91
lengths, and thus the drift indices, must satisfy some constraints 92
(Buneman 1971; Semple and Steel 2003; Reich et al. 2009). The 93
two most relevant constraints are that i) in a tree, all branches 94
have positive lengths (tested using the F3-admixture test) and ii) 95
in a tree with four leaves, there is at most one internal branch 96
(tested using the F4-admixture test). 97
The goal of this paper is to give a broad overview on the 98
theory, ideas and applications of F-statistics. Our starting point 99
is a brief review on how genetic drift is quantified in general, 100
and how it is measured using F2. I then propose an alterna- 101
tive definition of F2 that allows us to simplify applications, and 102
study them under a wide range of population structure models. 103
I then review some basic properties of distance-based phylo- 104
gentic trees, show how the admixture tests are interpreted in 105
this context, and evaluate their behavior. Many of the results 106
that are highlighted here are implicit in classical (Wahlund 1928; 107
Wright 1931; Cavalli-Sforza and Edwards 1967; Felsenstein 1973; 108
Cavalli-Sforza and Piazza 1975; Felsenstein 1981; Slatkin 1991; 109
Excoffier et al. 1992) and more recent work (Patterson et al. 2012; 110
Pickrell and Pritchard 2012; Lipson et al. 2013), but often not 111
explicitly stated, or given in a different context. 112
Results & Discussion 113
The next sections will discuss the F-statistics, introducing dif- 114
ferent interpretations and giving derivations for some useful 115
expressions. Longer derivations are deferred to the Methods 116
section. A graphical summary of the three interpretations of 117
the statistics is given in Figure 2, and the main formulas are 118
summarized in Table 1. 119
Throughout this paper, populations are labelled as 120
P1, P2, . . . , Pi, . . . . Often, PX will denote a an admixed popu- 121
lation. The allele frequency pi is defined as the proportion of 122
individuals in Pi that carry a particular allele at a biallelic locus, 123
and throughout this paper I assume that all individuals are hap- 124
loid. However, all results hold if instead of haploid individuals, 125
an arbitrary allele of a diploid individual is used. I focus on 126
genetic drift only, and ignore the effects of mutation, selection 127
and other evolutionary forces. 128
2 Benjamin M Peter et al
F2(P1, P2) F3(PX ; P1, P2) F4(P1, P2, P3, P4)
Definition E[(p1 − p2)2] E(pX − p1)(pX − p2) E(p1 − p2)(p3 − p4)
F2 -12(
F2(P1, PX) + F2(P2, PX) −F2(P1, P2)
) 12(
F2(P1, P4) + F2(P2, P3) −F2(P1, P3)− F2(P2, P4)
)Coalescencetimes 2ET12 −ET11 −ET22 ET1X + ET2X −ET12 −ETXX ET14 + ET23 −ET13 −ET24
Variance Var(p1 − p2)Var(pX) + Cov(p1, p2) −Cov(p1, pX)−Cov(p2, pX)
Cov((p1 − p2), (p3 − p4))
Branch length 2Bc −Bd 2Bc −BdBc − Bd or as admixture test: B′d −Bd
Table 1 Summary of equations. A constant of proportionality is omitted for coalescence times and branch lengths. Derivations forF2 are given in the main text, F3 and F4 are a simple result of combining Equation 5 with 10b and 14b. Bc and Bd correspond to theaverage length of the internal branch in a gene genealogy concordant and discordant with the population assignment respectively(see section Gene tree branch lengths).
Figure 2 Interpretation of F-statistics. F-statistics can be interpreted i) as branch lengths in a population phylogeny (PanelsA,E,I,M), the overlap of paths in an admixture graph (Panels B,F,J,N, see also Figure S1), and in terms of the internal branches ofgene-genealogies (see Figures 4, S2 and S3). For gene trees consistent with the population tree, the internal branch contributes pos-itively (Panels C,G,K), and for discordant branches, internal branches contribute negatively (Panels D,H) or zero (Panel L). F4 hastwo possible interpretations; depending on how the arguments are permuted relative to the tree topology, it may either reflect the
length of the internal branch (third row, F(B)4 ), or a test statistic that is zero under a population phylogeny (last row, F(T)
4 ). For theadmixture test, the two possible gene trees contribute to the statistic with different sign, highlighting the similarity to the D-statisticGreen et al. (2010) and its expectation of zero in a symmetric model.
Admixture, Population Structure and F-statistics 3
Measuring genetic drift – F2 129
The purpose of F2 is simply to measure how much genetic drift 130
occurred between two populations, i.e. to measure genetic dis- 131
similarity. For populations P1 and P2, F2 is defined as (Reich et al. 132
2009) 133
F2(P1, P2) = F2(p1, p2) = E(p1 − p2)2. (1)
The expectation is with respect to the evolutionary process, 134
but in practice F2 is estimated from hundreds of thousands of 135
loci across the genome (Patterson et al. 2012), which are assumed 136
to be non-independent replicates of the evolutionary history of 137
the populations. 138
Why is F2 a useful measure of genetic drift? As it is infeasible 139
to observe changes in allele frequency directly, the effect of drift 140
is assessed indirectly, through its impact on genetic diversity. 141
Most commonly, genetic drift is quantified in terms of i) the 142
variance in allele frequency, ii) heterozygosity, iii) probability 143
of identity by descent iv) correlation (or covariance) between 144
individuals and v) the probability of coalescence (two lineages 145
having a common ancestor). In the next sections I show how 146
F2 relates to these quantities in the the cases of a single popula- 147
tion changing through time and a pair of populations that are 148
partially isolated. 149
Single population I assume a single population, measured at 150
two time points (t0 and t), and label the two samples P0 and Pt. 151
Then F2(P0, Pt) can be interpreted in terms of the variances of 152
allele frequencies (Figure 3A, see Equation 21) 153
F2(P0, Pt) = Var(pt)−Var(p0) = Var(pt − p0), (2a)
the expected decrease in heterozygosity Ht, between the two 154
sample times (Figure 3B, see Equation 27): 155
F2(P0, Pt) =EH0 −EHt
2, (2b)
and in terms of the inbreeding coefficient f , which can be in- 156
terpreted as the probability of two individuals in Pt descending 157
from the same ancestor in P0, or, equivalently, the probability 158
that two samples from Pt coalesce before t0. (Figure 3C, see 159
Equation 26): 160
F2(P0, Pt) =12
f EH0, (2c)
Equations 2a-2c can be rearranged to make the connectionbetween other measures of genetic drift and F2 more explicit:
EHt = EH0 − 2F2(P0, Pt) (3a)
= EH0 − 2(Var(pt)−Var(p0)) (3b)
= EH0(1− f ) (3c)
Pairs of populations Equations 3b and 3c describing the decay 161
of heterozygosity are – of course – well known by population 162
geneticists, having been established by Wright (1931). In struc- 163
tured populations, very similar relationships exist when the 164
number of heterozygotes expected from the overall allele fre- 165
quency, Hobs is compared with the number of heterozygotes 166
present due to differences in allele frequencies between popula- 167
tions Hexp (Wahlund 1928; Wright 1931). 168
In fact, already Wahlund showed that for a population madeup of two subpopulations with equal proportions, the propor-tion of heterozygotes is reduced by
Hobs = Hexp − 2(p1 − p2)2
A
B
C
Figure 3 Measures of genetic drift in a single population. In-terpretations of F2 in terms of (A) the increase in allele fre-quency variance, (B) the decrease in heterozygosity and (C) f ,which can be interpreted as probability of coalescence of twolineages, or the probability that they are identical by descent.
from which it is straightforward (see section Two populations inMethods) to derive
F2(P1, P2) =EHexp −EHobs
2(4a)
= Var(p1 − p2) (4b)
=12
FSTEHobs. (4c)
This last equation served as the original motivation of F2 (Reich 169
et al. 2009), where it was first defined as a numerator of FST . 170
Justification for F2 The preceding arguments show how the us- 171
age of F2 for both single and structured populations can be justi- 172
fied by the similar effects on heterozygosity and allele frequency 173
variance F2 measures. However, what is the benefit of using F2 174
instead of the established inbreeding coefficient f and fixation 175
index FST? Recall that Wright motivated f and FST as correlation 176
coefficients between alleles (Wright 1921, 1931). Correlation co- 177
efficients have the advantage that they are easy to interpret, as, 178
e.g. FST = 0 implies panmixia and FST = 1 implies complete 179
divergence between subpopulations. In contrast, F2 depends 180
on allele frequencies and is highest for intermediate frequency 181
alleles. However, F2 has an interpretation as a covariance, making 182
it simpler and mathematically more convenient to work with. In 183
particular, variances and covariances are frequently partitioned 184
into components due to different effects using techniques such 185
as analysis of variance and analysis of covariance (e.g. Excoffier 186
et al. 1992). 187
F2 as branch length Reich et al. (2009) and Patterson et al. (2012) 188
proposed to partition “drift” (as previously established, char- 189
acterized by allele frequency variance, or decrease in heterozy- 190
gosity) between different populations into contribution on the 191
different branches of a population phylogeny. This model has 192
4 Benjamin M Peter et al
been studied by Cavalli-Sforza and Edwards (1967) and Felsen- 193
stein (1973) in the context of a Brownian motion process. In this 194
model, drift on independent branches is assumed to be indepen- 195
dent, meaning that the variances can simply be added. This is 196
what is referred to as the additivity property of F2 (Patterson et al. 197
2012). 198
To illustrate the additivity property, consider two populationsP1 and P2 that split recently from a common ancestral populationP0 (Figure 2A). In this case, p1 and p2 are assumed to be inde-pendent conditional on p0, and therefore Cov(p1, p2) = Var(p0).Then, using 2a and 4b,
F2(P1, P2) = Var(p1 − p2) = Var(p1) + Var(p2)− 2Cov(p1, p2)
= Var(p1) + Var(p2)− 2Var(p0)
= F2(P1, P0) + F2(P2, P0).
Alternative proofs of this statement and more detailed reasoning 199
behind the additivity assumption can be found in Cavalli-Sforza 200
and Edwards (1967); Felsenstein (1973); Reich et al. (2009) or 201
Patterson et al. (2012). 202
Lineages are not independent in an admixture graph, and 203
so this approach cannot be used.Reich et al. (2009) approached 204
this by conditioning on the possible population trees that are 205
consistent with an admixture scenario. In particular, they pro- 206
posed a framework of counting the possible paths through the 207
graph (Reich et al. 2009; Patterson et al. 2012). An example of 208
this representation for F2 in a simple admixture graph is given 209
in Figure S1, with the result summarized in Figure 2B. Detailed 210
motivation behind this visualization approach is given in Ap- 211
pendix 2 of Patterson et al. (2012). In brief, the reasoning is as 212
follows: Recall that F2(P1, P2) = E(p1 − p2)(p1 − p2), and inter- 213
pret the two terms in parentheses as two paths between P1 and 214
P2, and F2 as the overlap of these two paths. In a population 215
phylogeny, there is only one possible path, and the two paths 216
are always the same, therefore F2 is the sum of the length of all 217
the branches connecting the two populations. However, if there 218
is admixture, as in Figure 2B, both paths choose independently 219
which admixture edge they follow. With probability α they will 220
go left, and with probability β = 1 − α they go right. Thus, 221
F2 can be interpreted by enumerating all possible choices for 222
the two paths, resulting in three possible combinations of paths 223
on the trees (Figure S1), and the branches included will differ 224
depending on which path is chosen, so that the final F2 is made 225
up average of the path overlap in the topologies, weighted by 226
the probabilities of the topologies. 227
However, one drawback of this approach is that it scales 228
quadratically with the number of admixture events, making cal- 229
culations cumbersome when the number of admixture events is 230
large. More importantly, this approach is restricted to panmictic 231
subpopulations, and cannot be used when the population model 232
cannot be represented as a weighted average of trees. 233
Gene tree interpretation For this reason, I propose to redefine 234
F2 using coalescent theory (Wakeley 2009). Instead of allele 235
frequencies on a fixed admixture graph, coalescent theory tracks 236
the ancestors of a sample of individuals, tracing their history 237
back to their most recent common ancestor. The resulting tree is 238
called a gene tree (or coalescent tree). Gene trees vary between 239
loci, and will often have a different topology from the population 240
phylogeny, but they are nevertheless highly informative about 241
a population’s history. Moreover, expected coalescence times 242
and expected branch lengths are easily calculated under a wide 243
array of neutral demographic models. 244
In a seminal paper, Slatkin (1991) showed how FST can beinterpreted in terms of the expected coalescence times of genetrees:
FST =ETB −ETW
ETB,
where ETB and ETW are the expected coalescence times of two 245
lineages sampled in two different and the same population, 246
respectively. 247
Unsurprisingly, given the close relationship between F2 and 248
FST , an analogous expression exists for F2(P1, P2): Define T12 as 249
the coalescence time for one lineage each sampled from popula- 250
tions P1 and P2, and T11, T22 as the expected coalescence times 251
for two samples from P1 and P2, respectively. Furthermore, let 252
θ = 4Nµ be the standard mutation parameter. Then F2 can be 253
written as 254
F2(P1, P2) = θ
(ET12 −
ET11 + ET222
). (5)
The derivation of this result is given in the section Derivation of 255
F2 for gene trees. 256
Unlike FST , the mutation parameter θ does not cancel. How- 257
ever, for most applications, the absolute magnitude of F2 is of 258
little interest, since only the sign of the statistics are used for 259
most tests. In other applications F-statistics with presumably 260
the same θ (Reich et al. 2009) are compared. In these cases, θ 261
can be regarded as a constant of proportionality, and will not 262
change the theoretical properties of the F-statistics. It will, how- 263
ever, influence statistical properties, as a larger θ implies more 264
mutations and hence more data. 265
For estimation, the average number of pairwise differences 266
πij is a commonly used estimator for θTij (Tajima 1983). Thus, a 267
natural estimator for F2 is 268
F2(P1, P2) = π12 −π11 + π22
2. (6)
It is straightforward to verify (Section Equivalence of estima- 269
tors) that this estimator of F2 is numerically equivalent to the 270
unbiased estimator proposed by Reich et al. (2009) in terms of 271
the sample allele frequency pi and the sample size ni: 272
F2(P1, P2) = ( p1 − p2)2 − p1(1− p1)
n1 − 1− p2(1− p2)
n2 − 1. (7)
However, it is worth pointing out that the modeling assump- 273
tions are different: The original definition only considered loci 274
that were segregating in an ancestral population; loci not seg- 275
regating there were discarded. Since ancestral populations are 276
usually unsampled, this is often replaced by ascertainment in an 277
outgroup (Patterson et al. 2012; Lipson et al. 2013). In contrast, 278
Equation 6 assumes that all markers are used, which is the more 279
natural interpretation for sequence data. 280
Gene tree branch lengths An important feature of Equation 5 is 281
that it only depends on the expected coalescence times between 282
pairs of lineages. Thus, the behavior of F2 can be fully charac- 283
terized by considering a sample of size four, with two random 284
individuals taken from each population. This is all that is needed 285
to study the joint distribution of T12, T11 and T22, and hence F2. 286
By linearity of expectation, larger samples can be accommodated 287
by summing the expectations over all possible quartets. 288
For a sample of size four with two pairs, there are only two 289
possible unrooted tree topologies. One, where the lineages 290
from the same population are more closely related to each other 291
Admixture, Population Structure and F-statistics 5
C. Discordant genealogy
B. Concordant genealogy
A. Equation
Figure 4 Schematic explanation how F2 behaves conditionedon gene tree. Blue terms and branches correspond to positivecontributions, whereas red branches and terms are subtracted.Labels represent individuals randomly sampled from thatpopulation. External branches cancel out, so only the internalbranches have non-zero contribution to F2. In the concordantgenealogy (Panel B), the contribution is positive (with weight2), and in the discordant genealogy (Panel C), it is negative(with weight 1). The mutation rate as constant of proportional-ity is omitted.
(called concordant topology, T (2)c ) and one where lineages from 292
different populations coalesce first (which I will refer to as discor- 293
danat topology T (2)d ). The superscripts refers to the topologies 294
being for F2, and I will discard them in cases where no ambiguity 295
arises. 296
Conditioning on the topology yields
F2(P1, P2) =E[F2(P1, P2)|T
]=P(Tc)E[F2(P1, P2)|Tc] + P(Td)E[F2(P1, P2)|Td].
Figure 4 contains graphical representations for 297
E[F2(P1, P2)|Tc] (Panel B) and E[F2(P1, P2)|Td] (Panel C), 298
respectively. 299
In this representation, T12 corresponds to a path from a ran- 300
dom individual from P1 to a random individual from P2, and 301
T11 represents the path between the two samples from P1. 302
For Tc the internal branch is always included in T12, but never 303
in T11 or T22. External branches, on the other hand, are included 304
with 50% probability in T12 on any path through the tree. T11 305
and T22, on the other hand, only consist of external branches, 306
and the lengths of the external branches cancel. 307
On the other hand, for Td, the internal branch is always in- 308
cluded in T11 and T22, but only half the time in T12. Thus, they 309
contribute negatively to F2, but only with half the magnitude of 310
Tc. As for Tc, each T contains exactly two external branches, and 311
so they will cancel out. 312
An interesting way to represent F2 is therefore in terms of the 313
internal branches over all possible gene genealogies. Denote the 314
unconditional average length of the internal branch of Tc as Bc, 315
and the average length of the internal branch in Td as Bd. Then, 316
F2 can be written in terms of these branch lengths as 317
F2(P1, P2) = θ(2Bc − 1Bd
), (8)
resulting in the representation given in Figure 2C-D. 318
As a brief sanity check, consider the case of a population 319
without structure. In this case, the branch length is independent 320
of the topology and Td is twice as likely as Tc and hence Bd = 321
2Bc, from which it follows that F2 will be zero, as expected in a 322
randomly mating population 323
This argument can be transformed from branch lengths to ob- 324
served mutations by recalling that mutations occur on a branch 325
at a rate proportional to its length. F2 is increased by doubletons 326
that support the assignment of populations (i.e. the two lineages 327
from the same population have the same allele), but reduced by 328
doubletons shared by individuals from different populations. 329
All other mutations have a contribution of zero. 330
Testing treeness 331
Many applications consider tens or even hundreds of popula- 332
tions simultaneously (Patterson et al. 2012; Pickrell and Pritchard 333
2012; Haak et al. 2015; Yunusbayev et al. 2015), with the goal to in- 334
fer where and between which populations admixture occurred. 335
Using F-statistics, the approach is to interpret F2(P1, P2) as a 336
measure of dissimilarity between P1 and P2, as a large F2-value 337
implies that populations are highly diverged. Thus, the strat- 338
egy is to calculate all pairwise F2 indices between populations, 339
combine them into a dissimilarity matrix, and ask if that matrix is 340
consistent with a tree. 341
One way to approach this question is by using phylogenetic 342
theory: Many classical algorithms have been proposed that use 343
a measure of dissimilarity to generate a tree (Fitch et al. 1967; 344
Saitou and Nei 1987; Semple and Steel 2003; Felsenstein 2004), 345
and what properties a general dissimilarity matrix needs to have 346
in order to be consistent with a tree (Buneman 1971; Cavalli- 347
Sforza and Piazza 1975), in which case the matrix is also called a 348
tree metric (Semple and Steel 2003). Thus, testing for admixture 349
can be thought of as testing treeness. 350
For a dissimilarity matrix to be consistent with a tree, there 351
are two central properties it needs to satisfy: First, the length of 352
all branches has to be positive. This is strictly not necessary for 353
phylogenetic trees, and some algorithms may return negative 354
branch lengths (e.g. Saitou and Nei 1987); however, since in our 355
case branches have an interpretation of genetic drift, negative 356
genetic drift is biologically nonsensical, and therefore negative 357
branches should be interpreted as a violation of the modeling 358
assumptions, and hence treeness. 359
The second property of a tree metric important in the presentcontext is a bit more involved: A dissimilarity matrix (written interms of F2) is consistent with a tree if for any four populationsPi, Pj, Pk and Pl ,
F2(Pi, Pj) + F2(Pk, Pl) ≤
max(
F2(Pi, Pk) + F2(Pj,Pl), F2(Pi, Pl) + F2(Pj, Pk)
)(9)
that is, if the sums of pairs of distances are compared, two of 360
these sums will be the same, and no smaller than the third. 361
This theorem, due to Buneman (1971, 1974) is called the four- 362
point condition or sometimes, more modestly, the “fundamental 363
theorem of phylogenetics”. A proof can be found in chapter 7 of 364
Semple and Steel (2003). 365
Informally, this statement can be understood by noticing that 366
on a tree, two of the pairs of distances will include the internal 367
branch, whereas the third one will not, and therefore be shorter. 368
Thus, the four-point condition can be colloquially rephrased as 369
“any four taxa tree has at most one internal branch”. 370
Why are these properties useful? It turns out that the ad- 371
mixture tests based on F-statistics can be interpreted as tests 372
of these properties: The F3-test can be interpreted as a test for 373
6 Benjamin M Peter et al
the positivity of a branch; and the F4 as a test of the four-point 374
condition. Thus, the working of the two test statistics can be 375
interpreted in terms of fundamental properties of phylogenetic 376
trees, with the immediate consequence that they may be applied 377
as treeness-tests for arbitrary dissimilarity matrices. 378
An early test of treeness, based on a likelihood ratio, was 379
proposed by Cavalli-Sforza and Piazza (1975): They compared 380
the likelihood of the observed F2-matrix matrix to that induced 381
by the best fitting tree (assuming Brownian motion), rejecting 382
the null hypothesis if the tree-likelihood is much lower than 383
that of the empirical matrix. In practice, however, finding the 384
best-fitting tree is a challenging problem, especially for large 385
trees Felsenstein (2004) and so the likelihood test proved to be 386
difficult to apply. From that perspective, the F3 and F4-tests 387
provide a convenient alternative: Since treeness implies that all 388
subsets of taxa are also trees, the ingenious idea of Reich et al. 389
(2009) was that rejection of treeness for subtrees of size three 390
(for F3) and four (for F4) is sufficient to reject treeness for the 391
entire tree. Furthermore, tests on these subsets also pinpoint the 392
populations involved in the non-tree-like history. 393
F3 394
In the previous section, I showed how F2 can be interpreted 395
as a branch length, as an overlap of paths, or in terms of gene 396
trees (Figure 2). Furthermore, I gave expressions in terms of 397
coalescence times, allele frequency variances and internal branch 398
lengths of gene trees. In this section, I give analogous results for 399
F3. 400
Reich et al. (2009) defined F3 as: 401
F3(PX ; P1, P2) = F3(pX ; p1, p2) = E(pX − p1)(pX − p2) (10a)
with the goal to test whether PX is admixed. Recalling the path 402
interpretation detailed in Patterson et al. (2012), F3 can be inter- 403
preted as the shared portion of the paths from PX to P1 with the 404
path from PX to P1. In a population phylogeny (Figure 2E) this 405
corresponds to the branch between PX and the internal node. 406
Equivalently, F3 can also be written in terms of F2 (Reich et al. 407
2009): 408
F3(PX ; P1, P2) =12
(F2(PX , P1) + F2(PX , P2)− F2(P1, P2)
).
(10b)If F2 in Equation 10b is generalized to an arbitrary tree metric, 409
Equation 10b is known as the Gromov product (Semple and 410
Steel 2003) in phylogenetics. The Gromov product is a com- 411
monly used operation in classical phylogenetic algorithms to 412
calculate the length of the portion of a branch shared between P1 413
and P2 (Fitch et al. 1967; Felsenstein 1973; Saitou and Nei 1987); 414
consistent with the notion that F3 is the length of an external 415
branch in a population phylogeny. 416
In an admixture graph, there is no longer a single external 417
branch; instead all possible trees have to be considered, and F3 418
is the (weighted) average of paths through the admixture graph 419
(Figure 2F). 420
Combining Equations 5 and 10b gives an expression of F3 in 421
terms of expected coalescence times: 422
F3(PX ; P1, P2) =θ
2(ET1X + ET2X −ET12 −ETXX) . (10c)
Similarly, an expression in terms of variances is obtained by 423
combining Equation 2a with 10b: 424
F3(PX ; P1, P2) =
Var(pX) + Cov(p1, p2)−Cov(p1, pX)−Cov(p2, pX).(10d)
which was also noted by Pickrell and Pritchard (2012). 425
Outgroup-F3 statistics A simple application of the interpreta- 426
tion of F3 as a shared branch length are the “outgroup”-F3- 427
statistics proposed by Raghavan et al. (2014). For an unknown 428
population PU , they wanted to find the most closely related pop- 429
ulation from a panel of k extant populations {Pi, i = 1, 2, . . . k}. 430
They did this by calculating F3(PO; PU , Pi), where PO is an out- 431
group population that was assumed widely diverged from PU 432
and all populations in the panel. This measures the shared drift 433
(or shared branch) of PU with the populations from the panel, 434
and high F3-values imply close relatedness. 435
However, using Equation 10c, the outgroup-F3-statistic can 436
be written as 437
F3(PO; PU , Pi) ∝ ETUO + ETiO −ETUi −ETOO. (11)
Out of these four terms, ETUO and ETOO do not depend on Pi. 438
Furthermore, if PO is truly an outgroup, then all ETiO should 439
be the same, as pairs of individuals from the panel population 440
and the outgroup can only coalesce once they are in the joint an- 441
cestral population. Therefore, only the term ETUi is expected to 442
vary between different panel populations, suggesting that using 443
the number of pairwise differences, πUi, is largely equivalent to 444
using F3(PO; PU , Pi). I confirm this in Figure 5A by calculating 445
outgroup-F3 and πiU for a set of increasingly divergent popu- 446
lations, with each population having its own size, sample size 447
and sequencing error probability. Linear regression confirms 448
the visual picture that πiU has a higher correlation with diver- 449
gence time (R2 = 0.90) than F3 (R2 = 0.73). Hence, the number 450
of pairwise differences may be a better metric for population 451
divergence than F3. 452
F3 admixture test However, F3 is motivated and primarily used 453
as an admixture test (Reich et al. 2009). In this context, the null 454
hypothesis is that F3 is non-negative, i.e. the null hypothesis 455
is that the data is generated from a phylogenetic tree that has 456
positive edge lengths. If this is not the case, the null hypothesis 457
is reject in favor of the more complex admixture graph. From 458
Figure 2F it may be seen that drift on the path on the internal 459
branches (red) contributes negatively to F3. If these branches are 460
long enough compared to the branch after the admixture event 461
(blue), then F3 will be negative. For the simplest scenario where 462
PX is admixed between P1 and P2, Reich et al. (2009) provided 463
a condition when this is the case (Equation 20 in Supplement 464
2 of Reich et al. 2009). However, since this condition involves 465
F-statistics with internal, unobserved populations, it cannot 466
be used in practical applications. A more useful condition is 467
obtained using Equation 10c. 468
In the simplest admixture model, an ancestral population 469
splits into P1 and P2 at time tr. At time t1, the populations mix to 470
form PX , such that with probability α, individuals in PX descend 471
from individuals from P1, and with probability (1− α), they 472
descend from P2 (See Fig. 7 for an illustration). In this case, 473
F3(PX ; P1, P2) is negative if 474
1(1− cx)
t1tr
< 2α(1− α), (12)
Admixture, Population Structure and F-statistics 7
−0.0
02
0.0
00
0.0
02
0.0
04
0.0
06
0.0
08
Figure 5 Simulation results. A: Outgroup-F3 statistics (yellow) and πiU(white) for a panel of populations with linearly increasingdivergence time. Both statistics are scaled to have the same range, with the first divergence between the most closely related popu-lations set to zero. F3 is inverted, so that it increases with distance. B: Simulated (boxplots) and predicted(blue) F3-statistics under asimple admixture model. C: Comparison of F4-ratio (yellow triangles, Equation 18) and ratio of differences(black circles, (Equation20).
where cx is the probability two individuals sampled in PX have acommon ancestor before t1. For a randomly mating populationwith changing size N(t),
cx = 1− exp(−∫ t1
0
1N(s)
ds)
.
Thus, the power F3 to detect admixture is large 475
1. if the admixture proportion is close to fifty percent 476
2. if the ratio between the times of the original split to the time 477
of secondary contact is large, and 478
3. if the probability of coalescence before the admixture event 479
in PX is small, i.e. the size of PX is large. 480
A more general condition for negativity of F3 is obtained 481
by considering the internal branches of the possible gene tree 482
topologies, analogously to that given for F2 in the section Gene 483
tree branch lengths. Since Equation 10c includes ETXX , only two 484
individuals from PX are needed, and one each from P1 and P2 485
to study the joint distribution of all terms in (10c). The minimal 486
case therefore contains again just four samples (Figure S2). 487
Furthermore, P1 and P2 are exchangeable, and thus there are 488
again just two distinct gene genealogies, a concordant one T (3)c 489
where the two lineages from PX are most closely related, and a 490
discordant genealogy T (3)d where the lineages from PX merge 491
first with the other two lineages. A similar argument as that 492
for F2 shows (presented in Figure S2) that F3 can be written as a 493
function of just the internal branches in the topologies: 494
F3(PX ; P1, P2) = θ(2Bc −Bd
), (13)
where Bc and Bd are the lengths of the internal branches in T (3)c 495
and T (3)d , respectively, and similar to F2, concordant branches 496
have twice the weight of discordant ones. Again, the case of 497
all individuals coming from a single populations serves as a 498
sanity check: In this case Td is twice as likely as Tc, and all 499
branches are expected to have the same length, resulting in F3 500
being zero. However, for F3 to be negative, notice that Bd needs 501
to be more than two times longer than Bc. Since mutations 502
are proportional to Bd and Bc, F3 can be interpreted as a test 503
whether mutations that agree with the population tree are more 504
than twice as common than mutations that disagree with it. 505
I performed a small simulation study to test the accuracy of 506
Equation 12. Parameters were chosen such that F3 has a negative 507
expectation for α > 0.05, and I find that the predicted F3 fit very 508
well with the simulations (Figure 5B). 509
F4 510
The second admixture statistic, F4, is defined as (Reich et al. 511
2009): 512
F4(P1, P2; P3, P4) = F4(p1, p2; p3, p4) = E[(p1 − p2)(p3 − p4)].(14a)
Similarly to F3, F4 can be written as a linear combination of F2: 513
F4(P1, P2; P3, P4) =
12
(F2(P1, P4) + F2(P2, P3)− F2(P1, P3)− F2(P2, P4)
).
(14b)
which leads to 514
F4(P1, P2; P3, P4) =θ
2
(ET14 + ET23 −ET13 − BET24
). (14c)
8 Benjamin M Peter et al
As four populations are involved, there are 4! = 24 possi-ble ways of arranging the populations in Equation 14a. How-ever, there are four possible permutations of arguments that willlead to identical values, leaving only six unique F4-values forany four populations. Furthermore, these six values come inpairs that have the same absolute value, and a different sign,(i. e. F4(P1, P2; P3, P4) = −F4(P1, P2; P4, P3)), leaving only threeunique absolute values, which correspond to the tree possibletree topologies. Out of these three, one F4 can be written as thesum of the other two, leaving just two independent possibilities:
F4(P1, P2; P3, P4) + F4(P1, P3; P2, P4) = F4(P1, P4; P2, P3).
As for F3, Equation 14b can be generalized by replacing F2 515
with an arbitrary tree metric. In this case, Equation 14b is known 516
as a tree split (Buneman 1971), as it measures the length of the 517
overlap of the branch lengths between the two pairs. As there 518
are two independent F4-indices for a fixed tree, there are two 519
different interpretations for the F4-indices. Consider the tree 520
from Fig. 1A: F4(P1, P2; P3, P4) can be interpreted as the overlap 521
between the paths from P1 to P2 and from P3 to P4. However, 522
these paths do not overlap in Fig. 1A, and therefore F4 = 0. 523
This is how F4 is used as a test statistic. On the other hand, 524
F4(P1, P3; P2, P4) measures the overlap between the paths from 525
P1 to P3 and from P2 to P4), which is the internal branch in Fig. 526
1A, and will be positive. 527
It is cumbersome that the interpretation of F4 depends onthe ordering of its arguments. In order to make our intentionclear, instead of switching the arguments around for the twointerpretations, I introduce the superscripts (T) (for test) and(B) (for branch length):
F(T)4 (P1, P2; P3, P4) = F4(P1, P2; P3, P4) (15a)
F(B)4 (P1, P2; P3, P4) = F4(P1, P3; P2, P4) (15b)
Tree splits, and hence F4, are closely related to the four-point 528
condition (Buneman 1971, 1974), which, informally, states that a 529
(sub-)tree with four populations will have at most one internal 530
branch. Thus, if data is consistent with a tree, F(B)4 will be the 531
length of that branch, and F(T)4 will be zero. In Figure 2, the third 532
row (Panels I-L) correspond to the internal branch, and the last 533
row (Panels M-P) to the “zero”-branch. 534
Thus, in the context of testing for admixture, testing that F4 535
is zero is equivalent to checking whether there is in fact only 536
a single internal branch. If that is not the case, the population 537
phylogeny is rejected. 538
Gene trees Evaluating F4 in terms of gene trees and their inter- 539
nal branches, there are three different gene tree topologies that 540
have to be considered, whose interpretation depends depending 541
on whether the branch length or test-statistic interpretation are 542
considered. 543
For the branch length (F(B)4 ), the gene tree corresponding to 544
the population tree has a positive contribution to F4, and the 545
other two possible trees have a zero and negative contribution, 546
respectively (Figure S3). Since the gene tree corresponding to 547
the population tree is expected to be most frequent, F4 will be 548
positive, and can be written as 549
F(B)4 = θ(Bc −Bd). (16)
This equation is slightly different than those for F2 and F3, where 550
the coefficient for the discordant genealogy was half that for the 551
concordant genealogy. Note, however, that F4 only includes one 552
of the two discordant genealogies. Under a tree, both discordant 553
genealogies are equally likely (Durand et al. 2011), and thus the 554
expectation of F4 will be the same. 555
In contrast, for the admixture test statistic (F(T)4 ), the contribu- 556
tion of the concordant genealogy will be zero, and the discordant 557
genealogies will contribute with coefficients −1 and +1, respec- 558
tively. Under the population phylogeny, these two gene trees 559
will be equally likely (Wakeley 2009), and thus the expectation 560
of F4 as a test statistic 561
F(T)4 = θ(Bd −B′d) (17)
is zero under the null hypothesis. Furthermore, the statistic isclosely related to the ABBA-BABA or D-statistic also used totest for admixture (Green et al. 2010; Durand et al. 2011), whichincludes a normalization term and conditions on alleles beingderived. In our notation the expectation of D is ,
E[D] =B′d −Bd
B′d + Bd
thus, if F(T)4 has an expectation of zero, so will D. 562
F4 as a branch length 563
Rank test Two major applications of F4 use its interpretation 564
as a branch length. First, the rank of a matrix of all F4-statistics 565
is used to obtain a lower bound on the number of admixture 566
events required to explain data (Reich et al. 2012). The principal 567
idea of this approach is that the number of internal branches 568
in a genealogy is bounded to be at most n− 3 in an unrooted 569
tree. Since each F4 is a sum of the length of tree branches, all F4- 570
indices should be sums of n− 3 branches, or n− 3 independent 571
components. This implies that the rank of the matrix (see e.g. 572
Section 4 in McCullagh 2009) is at most n − 3, if the data is 573
consistent with a tree. However, admixture events may increase 574
the rank of the matrix, as they add additional internal branches 575
(Reich et al. 2012). Therefore, if the rank of the matrix is r, the 576
number of admixture events is at least r− n + 3. 577
One issue is that the full F4-matrix has size (n2) × (n
2), and 578
may thus become rather large. Furthermore, in many cases 579
only admixture events in a certain part of the phylogeny are 580
of interest. To estimate the minimum number of admixture 581
events on a particular branch of the phylogeny, Reich et al. (2012) 582
proposed to find two sets of test populations S1 and S2, and 583
two reference populations for each set R1 and R2 that are pre- 584
sumed unadmixed (see Figure 6A). Assuming a phylogeny, all 585
F(B)4 (S1, R1; S2, R2) will measure the length of the same branch, 586
and all F(T)4 (S1, R1; S2, R2) should be zero. Since each admixture 587
event introduces at most one additional branch, the rank of the 588
resulting matrix will increase by at most one, and the rank of 589
either the matrix of all F(T)4 , or the matrix of all F(B)
4 may reveal 590
the number of branches of that form. 591
Admixture proportion The second application is by comparing 592
branches between closely related populations to obtain an esti- 593
mate of mixture proportion, or how much two focal populations 594
correspond to an admixed population. (Green et al. 2010): 595
α =F4(PO, PI ; PX , P1)
F4(PO, PI ; P2, P1)(18)
Here, PX , is the population whose admixture proportion is esti- 596
mated, P1 and P2 are the potential contributors, where I assume 597
Admixture, Population Structure and F-statistics 9
B CA
Figure 6 Applications of F4: A: Visualization of rank test to estimate the number of admixture events. F4(S1, R1, S2, R2) measuresa branch absent from the phylogeny and should be zero for all populations from S1 and S2. B: Model underlying admixture ratioestimate Green et al. (2010). PX splits up, and the mean coalescence time of PX with PI gives the admixture proportion. C: If themodel is violated, α measures where on the internal branch in the underlying genealogy PX (on average) merges
that they contribute with proportions α and 1− α, respectively. 598
and PO, PI are reference populations with no direct contribution 599
to PX (see Figure 6B). PI has to be more closely related to one of 600
P1 or P2 than the other, and PO is an outgroup. 601
The canonical way (Patterson et al. 2012) to interpret this ratio 602
is as follows : the denominator is the branch length from the 603
common ancestor population from PI and P1 to the common 604
ancestor of PI with P2. (Figure 6C, yellow line), The numerator 605
has a similar interpretation as an internal branch (red dotted 606
line). In an admixture scenarios, (Figure 6B), this is not unique, 607
and is replaced by a linear combination of lineages merging 608
at the common ancestor of PI and P1 (with probability α), and 609
lineages merging at the common ancestor of PI with P2 (with 610
probability 1− α). 611
Thus, a more general interpretation is that α measures how 612
much closer the common ancestor of PX and PI is to the common 613
ancestor of PI and P1 and the common ancestor of PI and P2, 614
indicated by the gray dotted line in Figure 6B. This quantity is de- 615
fined also when the assumptions underlying the admixture test 616
are violated, and if the assumptions are not carefully checked, 617
might lead to misinterpretations of the data. In particular, α is 618
well-defined in cases where no admixture occurred, or in cases 619
where either of P1 and P2 did not experience any admixture. 620
Furthermore, it is evident from Figure 6 that if all populations 621
are sampled at the same time, ETOX = ETO1 = ETO2 = ETOI , 622
and therefore, 623
α =ETI1 −ETIXETI1 −ETI2
. (19)
Thus, 624
α =πI1 − πIXπI1 − πI2
(20)
is another estimator for α that can be used even if no outgroup is 625
available. I compare Equations 18 and 20 for varying admixture 626
proportions in Figure 5C using the mean absolute error in the 627
admixture proportion. Both estimators perform very well, but 628
(20) performs slightly better in cases where the admixture pro- 629
portion is low. However, in most cases this minor improvement 630
possibly does not negate the drawback that Equation 20 is only 631
applicable when populations are sampled at the same time. 632
It is an area of recent development how these estimates can be 633
extended to more populations. A simple approach is to assume 634
a fixed series of admixture events, in which case admixture 635
proportions for each event can be extracted from a series of 636
F4-ratios (Lazaridis et al. 2014, , SI 13). A more sophisticated 637
approach estimates mixture weights using the rank of the F4- 638
matrix, as discussed in the Rank test section (Haak et al. 2015, SI 639
10). Then, it is possible to estimate mixture proportions, using 640
a model similar to that introduced in the program structure 641
(Pritchard et al. 2000), by obtaining a low-rank approximation 642
for the F4-matrix. 643
Population structure models 644
For practical purposes, it is useful to know how the admixture 645
tests perform under demographic models different from popula- 646
tion phylogenies and admixture graphs, and in which cases the 647
assumptions made for the tests are problematic. In other words, 648
under which demographic models is population structure dis- 649
tinguishable from a tree? Equation 5 enables the derivation of 650
expectations for F3 and F4 under a wide variety of models of 651
population structure (Figure 7). The simplest case is that of a 652
single panmictic population. In that case, all F-statistics have 653
an expectation of zero, consistent with the assumption that no 654
structure and therefore no population phylogeny exists. Under 655
island models, F4 is also zero, and F3 is inversely proportional 656
to the migration rate. Results are similar under a hierarchical 657
island model, except that the number of demes has a small ef- 658
fect. This corresponds to a population phylogeny that is star-like 659
and has no internal branches, which is explained by the strong 660
symmetry of the island model. Thus, looking at different F3 and 661
F4-statistics may be a simple heuristic to see if data is broadly 662
consistent with an island model; if F3-values vary a lot between 663
populations, or if F4 is substantially different from zero, an is- 664
land model might be a poor choice. When looking at a finite 665
stepping stone model, F3 and F4 are both non-zero, highlighting 666
that F4 (and the ABBA-BABA-D-statistic) is susceptible to mi- 667
gration between any pair of populations. Thus, for applications, 668
F4 should only be used as an admixture test if there is good 669
evidence that gene flow between some pairs of the populations 670
was severely restricted. A hierarchical stepping stone model, 671
where demes are combined into populations, is the only case I 672
studied (besides the admixture graph) where F3 can be negative. 673
This effect indicates that admixture and population structure 674
models may be the two sides of the same coin: admixture is a 675
10 Benjamin M Peter et al
(temporary) reduction in gene flow between individuals from 676
the same population. Finally, for a simple serial founder model 677
without migration, I find that F3 measures the time between 678
subsequent founder events. 679
Estimation and Testing 680
In this paper, I focused almost exclusively on the theoretical 681
properties of the F-statistics, glancing over the statistical prob- 682
lems of how they are estimated. Many procedures are imple- 683
mented in the software package ADMIXTOOLS and described in 684
the accompanying paper (Patterson et al. 2012). Alternatively, 685
the software package treemix (Pickrell and Pritchard 2012) con- 686
tains lightwight alternatives for calculating F3 and F4 statistics. 687
Both use a block-jackknife approach to estimate standard errors, 688
taking linkage between markers into account. 689
Conclusions 690
There are three main ways to interpret F-statistics: In the sim- 691
plest case, they represent branches in a population phylogeny. In 692
the case of an admixture graph, the idea of shared drift in terms 693
of paths is most convenient. Finally, the expressions in terms 694
of coalescence times and the lengths of the internal branches of 695
gene genealogies are useful for more complex scenarios. This 696
last interpretation makes the connection to the ABBA-BABA- 697
statistic explicit, and allows the investigation of the behavior of 698
the F-statistics under arbitrary demographic models. 699
If drift indices exist for two, three and four populations, 700
should there be corresponding quantities for five or more popu- 701
lations (e.g. Pease and Hahn 2015)? Two of the interpretations 702
speak against this possibility: First, a population phylogeny can 703
be fully characterized by internal and external branches, and it 704
is not clear how a five-population statistic could be written as a 705
meaningful branch length. Second, all F-statistics can be written 706
in terms of four-individual trees, but this is not possible for five 707
samples. This seems to suggest that there may not exist a five- 708
population statistic as general as the three F-statistics I discussed 709
here, but they will still be useful for questions pertaining to a 710
specific demographic model. 711
A well-known drawback of F3 is that it may have a positive 712
expectation under some admixture scenarios (Patterson et al. 713
2012). Here, I showed that F3 is positive if and only if the branch 714
supporting the population tree is longer than the two branches 715
discordant with the population tree. Note that this is (possibly) 716
distinct from the probabilities of tree topologies, although the 717
average branch length of the internal branch in a topology, and 718
the probability of that topology are frequently strongly corre- 719
lated. Thus, negative F3-values indicate that individuals from 720
the admixed population are likely to coalesce with individuals 721
from the two other populations, before they coalesce with other 722
individuals from their own population! 723
Overall, when F3 is applicable, it is remarkably robust to 724
population structure, requiring rather strong substructure to 725
yield false-positives. Thus, it is a very striking finding that in 726
many applications to humans, negative F3-values are commonly 727
found (Patterson et al. 2012), indicating that for most human 728
populations, the majority of markers support a discordant gene 729
tree, which suggests that population structure and admixture are 730
widespread, and that population phylogenies are poorly suited 731
to describe human evolution. 732
Ancient population structure was proposed as possible con- 733
founder for the D and F4-statistics (Green et al. 2010). Here, I 734
show that non-symmetric population structure such as in step- 735
ping stone models can lead to non-zero F4-values, showing that 736
both ancestral and persisting population structure may result in 737
false-positives when assumptions are violated. 738
Furthermore, I showed that F2 can be seen as a special case 739
of a tree-metric, and that using F-statistics is equivalent to using 740
phylogenetic theory to test hypotheses about simple phyloge- 741
netic networks (Huson et al. 2010). From this perspective, it 742
is worth re-raising the issue pointed out by Felsenstein (1973), 743
how and when allele-frequency data should be transformed 744
for within-species phylogenetic inference.While F2 has become 745
a de facto standard, different transformations of allele frequen- 746
cies might be useful in some cases, as both F3 and F4 can be 747
interpreted as tests for treeness for arbitrary tree metrics. 748
This relationship provides ample opportunities for interac- 749
tion between these currently diverged fields: Theory (Huson 750
and Bryant 2006; Huson et al. 2010) and algorithms for finding 751
phylogenetic networks such as Neighbor-Net (Bryant and Moul- 752
ton 2004) may provide a useful alternative to tools specifically 753
developed for allele frequencies and F-statistics (Patterson et al. 754
2012; Pickrell and Pritchard 2012; Lipson et al. 2013), particularly 755
in complex cases. On the other hand, the tests and different 756
interpretations described here may be useful to test for treeness 757
in other phylogenetic applications, and the complex history of 758
humans may provide motivation to further develop the theory 759
of phylogenetic networks, and stress its usefulness for within- 760
species demographic analyses. 761
Acknowledgements 762
I would like to thank Heejung Shim, Rasmus Nielsen, John 763
Novembre and all members of the Novembre lab for helpful 764
comments and discussions. I am further grateful for comments 765
from Nick Patterson and an anonymous reviewer. BMP is sup- 766
ported by a Swiss NSF early postdoc mobility fellowship. Ad- 767
ditional funding for this work was provided by NIH grant R01 768
HG007089 to John Novembre. 769
Methods 770
Equivalence of drift interpretations 771
First, I show that F2 can be interpreted as the difference in vari- 772
ance of allele frequencies (Figure 3A): 773
As in the Results section, let Pi denote a population with 774
allele frequency pi and sample size ni, and let us assume P0 775
represents the population at time t0, and Pt the population at 776
time t, with t0 < t. 777
F2(Pt, P0) = E[(pt − p0)2]
= Var(pt − p0) + E[(pt − p0)]2
= Var(pt) + Var(p0)− 2Cov(p0, pt)
= Var(pt) + Var(p0)− 2Cov(p0, p0 + (pt − p0))
= Var(pt)−Var(p0) (21)
Here, I used E[pt − p0] = Cov(p0, pt − p0) = 0 to obtain lines 778
three and five. It is worth noting that this result holds for any 779
model of genetic drift where the expected allele frequency is the 780
current allele frequency and increments are independent. For 781
example, this interpretation of F2 holds also if genetic drift is 782
modeled as a Brownian motion (Cavalli-Sforza and Edwards 783
1967). 784
Admixture, Population Structure and F-statistics 11
Figure 7 Expectations for F3 and F4 under select models. The constant factor θ/2 is omitted. Derivations can be found in the sectionDerivation of F under select models
12 Benjamin M Peter et al
A heterozygosity model The interpretation of F2 in terms of the 785
decay in heterozygosity and identity by descent can be derived 786
elegantly using the duality between the diffusion process and 787
the coalescent: Let again t0 < t. 788
Then, 789
Ept
[pnt
t |p0, nt]= En0
[pn0
0 |p0, nt]
. (22)
This equation is due to Tavaré (1984), who also provided the 790
following intuition: Given nt individuals are sampled at time 791
t. Let E denote the event that all individuals carry allele x, 792
conditional on allele x having frequency p0 at time t0. There 793
are two components to this: First, the frequency will change 794
between t0 and t, and then all nt sampled individuals need to 795
carry x. 796
In a diffusion framework, 797
P(E) =∫ 1
0ynt P(pt = y|p0, nt)dy = E[pnt
t |p0, nt]. (23)
On the other hand, one may argue using the coalescent: For 798
E to occur, all nt samples need to carry the x allele. At time t0, 799
they had n0 ancestral lineages, who all carry x with probability 800
p0. Therefore, 801
P(E) =n0
∑i=1
pi0P (n0 = i|p0, nt) = E
[pn0
0 |p0, nt]
. (24)
Equating (23) and (24) yields Equation 22. 802
In the present case, the only relevant cases are nt = 1, 2, since:
E[
p1t |p0; nt = 1
]= p0 (25a)
E[
p2t |p0; nt = 2
]= p0 f + p2
0(1− f ), (25b)
where f is the probability that two lineage sampled at time t 803
coalesce before time t0. 804
This yields an expression for F2 by conditioning on the allelefrequency p0,
E[(p0 − pt)
2|p0
]= E
[p2
0|p0
]−E [2pt p0|p0] + E
[p2
t |p0
]= p2
0 − 2p20 + p0 f + p2
0(1− f )= f p0(1− p0)
=12
f H0.
Where H0 = 2p0(1− p0) is the heterozygosity. Integrating over 805
P(p0) yields: 806
F2(P0, Pt) =12
f EH0 (26)
and it can be seen that F2 increases as a function of f (Figure 3C). 807
This equation can also be interpreted in terms of probabilities of 808
identity by descent: f is the probability that two individuals are 809
identical by descent in Pt given their ancestors were not identical 810
by descent in P0 (Wright 1931), and EH0 is the probability two 811
individuals are not identical in P0. 812
Furthermore, EHt = (1− f )EH0 (Equation 3.4 in Wakeley 813
(2009)) and therefore 814
EH0 −EHt = EH0(1− (1− f )
)= 2F2(Pt, P0). (27)
which shows that F2 measures the decay of heterozygosity (Fig- 815
ure 3A). A similar argument was used by Lipson et al. (2013) to 816
estimate ancestral heterozygosities and to linearize F2. 817
Two populations F2 in terms of the difference in expected and 818
observed heterozygosity follows directly from the result from 819
Wahlund (1928), which was obtained by considering the geno- 820
types of all possible matings in the two subpopulations (Ta- 821
ble 3 in Wahlund 1928), and the variance case follows di- 822
rectly because Var(p1− p2) = E(p1− p2)2−
[E(p1− p2)
]2, but 823
E(p1 − p2) = 0. Lastly, F2 relates to FST via the definition of F2 824
as a variance in the definition of FST : FST =2(p1−p2)2
EHexp . 825
Covariance interpretation To see how F2 can be interpreted as acovariance between two individuals from the same population,define Xi and Xj as indicator variables that two individuals fromthe same population sample have the A allele, which has fre-quency p1 in one, and p2 in the other population. If individualsare equally likely to come from either population,
EXi = EXj =12
p1 +12
p2
EXiXj =12
p21 +
12
p22
Cov(Xi, Xj) =EXiXj −EXiEXj
=14(p1 − p2)
2 =14
F2(P1, P2) (28)
Derivation of F2 for gene trees 826
To derive equation (5), I start by considering F2 for two sam- 827
ples of size one, express F2 for arbitrary sample sizes in terms 828
of individual-level F2, and obtain a sample-size independent 829
expression by letting the sample size n go to infinity. 830
In this framework, I assume that mutation is rare such that 831
there is at most one mutation at any locus. In a sample of size 832
two, let Ii be an indicator random variable that individual i has 833
a particular allele. For two individuals, F2(I1, I2) = 1 implies 834
I1 = I2, whereas F2(I1, I2) = 0 implies I1 6= I2. Thus, F2(I1, I2) 835
is another indicator random variable with parameter equal to 836
the probability that a mutation happened on the tree branch 837
between I1 and I2. 838
Now, instead of a single individual I1, consider a sampleof n1 individuals: P1 = {I1,1 I1,2, . . . , I1,n1}. The sample allelefrequency is p1 = n−1
1 ∑i I1,i. And the sample-F2 is
F2 ( p1, I2) =F2
(1
n1
n1
∑i=1
I1,i, I2
)= E
(1
n1
n1
∑i=1
I1,i − I2
)2
=E
[1
n21
∑ I21,i +
2n2
1∑ I1,i I1,j −
2n1
∑ I1,i I2 + I22
]
=E
[1
n1∑ I2
1,i −2
n1∑ I1,i I2 +
n1n1
I22
]+ E
[2
n21
∑ I1,i I1,j −n1 − 1
n21
∑ I21,i
]The first three terms can be grouped into n1 terms of the form 839
F2(I1,i, I2), and the last two terms can be grouped into (n12 ) terms 840
of the form F2(I1,i, I1,j), one for each possible pair of samples in 841
P1. 842
Therefore, 843
F2 ( p1, I2) =1
n1∑
iF2(I1,i, I2)−
1n2
1∑i<j
F2(I1,i, I1,j) (29)
where the second sum is over all pairs in P1. This equation is 844
equivalent to Equation 22 in Felsenstein (1973). 845
Admixture, Population Structure and F-statistics 13
As F2( p1, p2) = F2( p2, p1), I can switch the labels, and obtain 846
the same expression for a second population P2 = {I2,i, i = 847
0, . . . , n2} Taking the average over all I2,j yields 848
F2 ( p1, p2) =1
n1∑
iF2(I1,i, I2,j)
− 1n2
1∑i<j
F2(I1,i, I1,j)−1
n22
∑i<j
F2(I2,i, I2,j).(30)
Thus, I can write F2 between the two populations as the average 849
number of differences between individuals from different pop- 850
ulations, minus some terms including differences within each 851
sample. 852
Equation 30 is quite general, making no assumptions on 853
where samples are placed on a tree. In a coalescence framework, 854
it is useful to make the assumptions that all individuals from the 855
same population have the same branch length distribution, i.e. 856
F2(Ix1,i, Iy1,j) = F2(Ix2,i, Iy2,j) = for all pairs of samples (x1, x2) 857
and (y1, y2) from populations Pi and Pj. Secondly, I assume that 858
all samples correspond to the leaves of the tree, so that I can esti- 859
mate branch lengths in terms of the time to a common ancestor 860
Tij. Finally, I assume that mutations occur at a constant rate of 861
θ/2 on each branch. Taken together, these assumptions imply 862
that F2(Ii,k, Ij,l) = θETij for all individuals from populations 863
Pi, Pj. this simplifies (30) to 864
F2( p1, p2) = θ×(
ET12−12
(1− 1
n1
)ET11−
12
(1− 1
n2
)ET22
)(31)
which, for the cases of n = 1, 2 was also derived by Petkova 865
et al. (2014). In some applications, F2 might be calculated only 866
for segregating site in a large sample. As the expected number 867
of segregating sites is θ2 Ttot (with Ttot denoting the total tree 868
length), taking the limit where θ → 0 is meaningful (Slatkin 869
1991; Petkova et al. 2014): 870
F2( p1, p2) =2
Ttot
×(
ET12 −12
(1− 1
n1
)ET11 −
12
(1− 1
n2
)ET22
).
(32)
In either of these equations, 2Ttot
or θ act as a constant of pro- 871
portionality that is the same for all statistics calculated from the 872
same data. Since interest is focused on the relative magnitude 873
of F2, or whether a sum of F2-values is different from zero, this 874
constant has no impact on inference. 875
Furthermore, a population-level quantity is obtained by tak- 876
ing the limit when the number of individuals n1 and n2 go to 877
infinity: 878
F2(P1, P2) = limn1,n2→∞
F2( p1, p2). (33)
This yields Equation 5. Using θ as the constant of proportionality 879
yields 880
E[2π12 − π11 − π22] = 2F2(P1, P2), (34)
leading to the estimator given in Equation 6. 881
Equivalence of estimators The estimator in Eq. 34 is equivalent 882
to that given by Reich et al. (2009): 883
F2(P1, P2) =π12 − π11/2− π22/2
=[p1(1− p2) + p2(1− p1)
]− p1(1− p1)
n1n1 − 1
− p2(1− p2)n2
n2 − 1
= p1
(1− 1− 1
n1
)+ p2
(1− 1− 1
n2
)− 2p1 p2
+ p21
(1− 1
n1 − 1
)− p2
2
(1− 1
n2 − 1
)=( p1 − p2)
2 − p1(1− p1)
n1 − 1− p2(1− q2)
n2 − 1
which is Equation 10 in the Appendix of Reich et al. (2009). 884
Four-point-condition and F4 885
Using F2 as a tree-metric, the four-point condition (Buneman 886
1971) can be written as 887
F2(P1, P2) + F2(P3, P4) ≤
min[
F2(P1, P3) + F2(P2, P4), F2(P1, P4) + F2(P2, P3)
] (35)
for any permutations of the samples. This implies that two of 888
the sums need to be the same, and larger than the third. The 889
claim is that if the four-point condition holds, at least one of 890
the F4-values will be zero, and the others will have the same 891
absolute value. 892
Without loss of generality, assume that
F2(P1, P2) + F2(P3, P4) ≤ F2(P1, P3) + F2(P2, P4)
F2(P1, P3) + F2(P2, P4) = F2(P1, P4) + F2(P2, P3)
Simply plugging this into the three possible F4-equations yields
F4(P1, P2; P3, P4) = 0
F4(P1, P3; P2, P4) = kF4(P1, P4; P2, P3) = −k
where k = F2(P1, P3) + F2(P2, P4)− F2(P1, P2) + F2(P3, P4). 893
It is worth noting that the converse is false. If
F2(P1, P2) + F2(P3, P4) > F2(P1, P3) + F2(P2, P4)
F2(P1, P3) + F2(P2, P4) = F2(P1, P4) + F2(P2, P3)
the four-point condition is violated, but F4(P1, P2; P3, P4) is still 894
zero, and the other two F4-values have the same magnitude. 895
Derivation of F under select models 896
Here, I use Equation 5 together with Equations 10b and 14b to 897
derive expectations for F3 and F4 under some simple models. 898
Panmixia In a randomly mating population (with arbitrary 899
population size changes), P1 and P2 are taken from the same 900
pool of individuals and therefore T12 = T11 = T22, EF2 = EF3 = 901
EF4 = 0. 902
14 Benjamin M Peter et al
Island models A (finite) island model has D subpopulations of 903
size 1 each. Migration occurs at rate M between subpopulations. 904
It can be shown (Strobeck 1987) that ET11 = ET22 = D. ET12 905
satisfies the recursion 906
ET12 =1
(D− 1)M+
D− 2D− 1
ET12 +1
D− 1ET11. (36)
with solution ET12 = 1 + M−1. This results in the equation in 907
Figure 7. The derivation of coalescence times for the hierarchical 908
island models is marginally more complicated, but similar. It is 909
given in Slatkin and Voelm (1991). 910
Admixture models These are the model for which the F-statistics were originally developed. Many details, applications,and the origin of the path representation are found in Pattersonet al. (2012). For simplicity, I look at the simplest possible treewith four populations, where PX is admixed from P1 and P2 withcontributions α and β = (1− α), respectively. I assume that allpopulations have the same size, and that this size is one. Then,
F3(PX ; P1, P2) ∝ET1X + ET2X −ET12 −ETXX
=(αt1 + βtr + 1) + (αtr + βt1 + 1)− tr − 1
− α21 − (1− α)21− 2α(1− α)[(1− cx)tr + 1]
=t1 − 2α(1− α)(1− cx)tr (37)
Here, cx is the probability that the two lineages from PX coalesce 911
before the admixture event. 912
Thus, F3 is negative if 913
t1(1− cx)tr
< 2α(1− α), (38)
which is more likely if α is large, the admixture is recent and the 914
overall coalescent is far in the past. 915
For F4, omitting the within-population coalescence time of 1:
F4(P1PX ; P2, P3) =ET12 + ET3X −ET13 −ET2X
=tr + αtr + βt23 − tr − αtr − βt2X
=β(t2 − t1)
Stepping-stone models For the stepping stone models, I haveto solve the recursions of the Markov chains describing thelocation of all lineages in a sample of size 2. For the standardstepping stone model, I assumed there were four demes, all ofwhich exchange migrants at rate M. This results in a MarkovChain with the following five states: i) lineages in same deme ii)lineages in demes 1 and 2, iii) lineages in demes 1 and 3, lineagesin demes 1 and 4 and v) lineages in demes 2 and 3. Note thatthe symmetry of this system allows collapsing some states. Thetransition matrix for this system is
1 0 0 0 0
2M 1− 3M M 0 0
0 M 1− 3M M M
0 0 2M 1− 2M 0
2M 0 2M 0 1− 4M
(39)
Once lineages are in the same deme, the system terminates 916
as the time to coalescence time is independent of the deme in 917
isotropic migration models (Strobeck 1987), and cancels from 918
the F-statistics. 919
Therefore, the vector v of the expected time until two lineages 920
are in the same deme is found using standard Markov Chain 921
theory by solving v = (I − T)−1)1, where T is the transition 922
matrix involving only the transitive states in the Markov chain 923
(all but the first state), and 1 is a vector of ones. 924
Finding the expected coalescence time involves solving a 925
system of 5 equations. The terms involved in calculating the 926
F-statistics (Table 1) are the entries in v corresponding to these 927
states. 928
The hierarchical case is similar, except there are 6 demes and10 equations. Representing states as lineages being in demes(same), (1,2), (1,3), (1,4), (1,5), (1,6), (2,3), (2,4), (2,5), (3,4).
1 0 0 0 0 0 0 0 0 0
2M 1− 3M M 0 0 0 0 0 0 0
0 M 1− 3M M 0 0 M 0 0 0
0 0 M 1− 3M M 0 0 M 0 0
0 0 0 M 1− 3M M 0 0 M 0
0 0 0 0 2M 1− 2M 0 0 0 0
2M 0 M 0 0 0 1− 4M M 0 0
0 0 0 M 0 0 M 1− 4M M M
0 0 0 0 2M 0 0 2M 1− 4M 0
2M 0 0 0 0 0 0 2M 0 1− 4M
As in the non-hierarchical case, solving this system yields all 929
pairwise coalescence times. Then, all I have to do is average the 930
coalescence times over all possibilities. E.g. 931
ET1X =v2 + v3 + v6 + v7
4. (40)
For F4, I assume that demes 1 and 2 are in P1, demes 3 and 4 in 932
PX and demes 5 and 6 correspond to P2 and P3, respectively. 933
Range expansion model I use a serial founder model with nomigration (Peter and Slatkin 2015), where I assume that theexpansion is recent enough such that the effect of migrationafter the expansion finished can be ignored. Under that model, Iassume that samples P1 and P2 are taken from demes D1 and D2,with D1 closer to the origin of the expansion, and populationswith high ids even further away from the expansion origin. ThenET12 = t1 + ET11, where Et1 is the time required for a lineagesampled further away in the expansion to end up in D1. (Notethat t1 only depends on the deme that is closer to the origin).Thus, for three demes,
F3(P2; P1, P3) ∝ ET12 −ET13 + ET23 −ET22
∝ ET11 + t1 −ET11 − t1 + ET22 + t2 −ET22
∝ t2.
and
F(T)4 (P1, P2; P3, P4) ∝ ET13 −ET14 + ET24 −ET23
∝ ET11 + t1 −ET11 − t1 + ET22 + t2 −ET22 − t2
= 0.
More interesting is
F(B)4 (P1, P2; P3, P4) ∝ ET12 −ET14 + ET34 −ET23
∝ ET11 + t1 −ET11 − t1 + ET33 + t3 −ET22 − t2
∝ ET33 + t3 −ET22 − t2.
Simulations 934
Simulations were performed using ms (Hudson 2002). Specific 935
commands used are 936
ms 466 100 -t 100 -r 10 100000 -I 12 22 6 61 49 937
57 33 43 34 40 84 13 24 -en 0 2 7.2 -en 0 3 .2 -en 0 938
Admixture, Population Structure and F-statistics 15
4 .4 -en 0 5 .2 -en 0 6 4.4 -en 0 7 3.2 -en 0 8 4.8 939
-en 0 9 0.2 -en 0 10 3.2 -en 0 11 0.2 -en 0 12 0.7 940
-ej 0.01 2 1 -ej 0.02 3 1 -ej 0.04 4 1 -ej 0.06 5 1 941
-ej 0.08 6 1 -ej 0.10 7 1 -ej 0.12 8 1 -ej 0.14 9 1 942
-ej 0.16 10 1 -ej 0.18 11 1 -ej 0.3 12 1 943
for the outgroup-F3-statistic (Figure 5A). Sample sizes and 944
population sizes were picked randomly, but kept the same over 945
all one hundred replicates. Additionally, I randomly assigned 946
each population an error rate uniformly between zero and 0.05. 947
Errors were introduced by adding additional singletons and 948
flipping alleles at that rate. 949
For Figure 5B, the command was 950
ms 301 100 -t 10 -I 4 100 100 100 1 -es 0.001 2 951
$ALPHA -ej 0.03 2 1 -ej 0.03 5 3 -ej 0.3 3 1 -ej 952
0.31 4 1 953
with the admixture proportion $ALPHA set to increments of 954
0.025 from 0 to 0.5, with 200 data sets generated per $ALPHA. 955
Lastly, data for Figure 5C was simulated using 956
ms 501 100 -t 50 -r 50 10000 -I 6 100 100 100 100 957
100 1 -es 0.001 3 $ALPHA -ej 0.03 3 2 -ej 0.03 7 4 958
-ej 0.1 2 1 -ej 0.2 4 1 -ej 0.3 5 1 -ej 0.31 6 1 959
Here, the admixture proportion $ALPHA was varied in incre- 960
ments of 0.1 from 0 to 1, again with 200 data sets generated per 961
$ALPHA. 962
F3 and F4-statistics were calculated using the implementation 963
from Pickrell and Pritchard (2012). 964
Literature Cited 965
Allentoft, M. E., M. Sikora, K.-G. Sjögren, S. Rasmussen, 966
M. Rasmussen, J. Stenderup, P. B. Damgaard, H. Schroeder, 967
T. Ahlström, L. Vinner, A.-S. Malaspinas, A. Margaryan, 968
T. Higham, D. Chivall, N. Lynnerup, L. Harvig, J. Baron, P. D. 969
Casa, P. Dabrowski, P. R. Duffy, A. V. Ebel, A. Epimakhov, 970
K. Frei, M. Furmanek, T. Gralak, A. Gromov, S. Gronkiewicz, 971
G. Grupe, T. Hajdu, R. Jarysz, V. Khartanovich, A. Khokhlov, 972
V. Kiss, J. Kolár, A. Kriiska, I. Lasak, C. Longhi, G. McG- 973
lynn, A. Merkevicius, I. Merkyte, M. Metspalu, R. Mkrtchyan, 974
V. Moiseyev, L. Paja, G. Pálfi, D. Pokutta, Ł. Pospieszny, T. D. 975
Price, L. Saag, M. Sablin, N. Shishlina, V. Smrcka, V. I. Soenov, 976
V. Szeverényi, G. Tóth, S. V. Trifanova, L. Varul, M. Vicze, 977
L. Yepiskoposyan, V. Zhitenev, L. Orlando, T. Sicheritz-Pontén, 978
S. Brunak, R. Nielsen, K. Kristiansen, and E. Willerslev, 2015 979
Population genomics of Bronze Age Eurasia. Nature 522: 167– 980
172. 981
Bryant, D. and V. Moulton, 2004 Neighbor-Net: An Agglomera- 982
tive Method for the Construction of Phylogenetic Networks. 983
Molecular Biology and Evolution 21: 255–265. 984
Buneman, P., 1971 The recovery of trees from measures of dis- 985
similarity. Mathematics in the Archaeological and Historical 986
Sciences . 987
Buneman, P., 1974 A note on the metric properties of trees. Jour- 988
nal of Combinatorial Theory, Series B 17: 48–50. 989
Cavalli-Sforza, L. L. and A. W. F. Edwards, 1967 Phylogenetic 990
Analysis: Models and Estimation Procedures. Evolution 21: 991
550–570, ArticleType: research-article / Full publication date: 992
Sep., 1967 / Copyright © 1967 Society for the Study of Evolu- 993
tion. 994
Cavalli-Sforza, L. L. and A. Piazza, 1975 Analysis of evolution: 995
Evolutionary rates, independence and treeness. Theoretical 996
Population Biology 8: 127–165. 997
Durand, E., N. Patterson, D. Reich, and M. Slatkin, 2011 Testing 998
for ancient admixture between closely related populations. 999
Molecular Biology and Evolution 28: 2239–2252. 1000
Excoffier, L., P. E. Smouse, and J. M. Quattro, 1992 Analysis of 1001
Molecular Variance Inferred From Metric Distances Among 1002
DNA Haplotypes: Application to Human Mitochondrial DNA 1003
Restriction Data. Genetics 131: 479–491. 1004
Felsenstein, J., 1973 Maximum-likelihood estimation of evolu- 1005
tionary trees from continuous characters. American Journal of 1006
Human Genetics 25: 471–492. 1007
Felsenstein, J., 1981 Evolutionary Trees From Gene Frequencies 1008
and Quantitative Characters: Finding Maximum Likelihood 1009
Estimates. Evolution 35: 1229–1242. 1010
Felsenstein, J., 2004 Inferring phytogenies. Sunderland, Mas- 1011
sachusetts: Sinauer Associates . 1012
Fitch, W. M., E. Margoliash, and others, 1967 Construction of 1013
phylogenetic trees. Science 155: 279–284. 1014
Green, R., J. Krause, A. Briggs, T. Maricic, U. Stenzel, M. Kircher, 1015
N. Patterson, H. Li, W. Zhai, M. Fritz, et al., 2010 A draft 1016
sequence of the Neandertal genome. science 328: 710. 1017
Haak, W., I. Lazaridis, N. Patterson, N. Rohland, S. Mallick, 1018
B. Llamas, G. Brandt, S. Nordenfelt, E. Harney, K. Steward- 1019
son, Q. Fu, A. Mittnik, E. Bánffy, C. Economou, M. Francken, 1020
S. Friederich, R. G. Pena, F. Hallgren, V. Khartanovich, 1021
A. Khokhlov, M. Kunst, P. Kuznetsov, H. Meller, O. Mochalov, 1022
V. Moiseyev, N. Nicklisch, S. L. Pichler, R. Risch, M. A. 1023
Rojo Guerra, C. Roth, A. Szécsényi-Nagy, J. Wahl, M. Meyer, 1024
J. Krause, D. Brown, D. Anthony, A. Cooper, K. W. Alt, and 1025
D. Reich, 2015 Massive migration from the steppe was a source 1026
for Indo-European languages in Europe. Nature 522: 207–211. 1027
Hellenthal, G., G. B. J. Busby, G. Band, J. F. Wilson, C. Capelli, 1028
D. Falush, and S. Myers, 2014 A Genetic Atlas of Human 1029
Admixture History. Science 343: 747–751. 1030
Hudson, R. R., 2002 Generating samples under a Wright-Fisher 1031
neutral model of genetic variation. Bioinformatics (Oxford, 1032
England) 18: 337–338. 1033
Huson, D. H. and D. Bryant, 2006 Application of phylogenetic 1034
networks in evolutionary studies. Molecular biology and evo- 1035
lution 23: 254–267. 1036
Huson, D. H., R. Rupp, and C. Scornavacca, 2010 Phylogenetic 1037
networks: concepts, algorithms and applications. Cambridge Uni- 1038
versity Press. 1039
Lazaridis, I., N. Patterson, A. Mittnik, G. Renaud, S. Mallick, 1040
K. Kirsanow, P. H. Sudmant, J. G. Schraiber, S. Castellano, 1041
M. Lipson, and others, 2014 Ancient human genomes suggest 1042
three ancestral populations for present-day Europeans. Nature 1043
513: 409–413. 1044
Lipson, M., P.-R. Loh, A. Levin, D. Reich, N. Patterson, and 1045
B. Berger, 2013 Efficient Moment-Based Inference of Admix- 1046
ture Parameters and Sources of Gene Flow. Molecular Biology 1047
and Evolution 30: 1788–1802. 1048
McCullagh, P., 2009 Marginal likelihood for distance matrices. 1049
Statistica Sinica 19: 631. 1050
Patterson, N. J., P. Moorjani, Y. Luo, S. Mallick, N. Rohland, 1051
Y. Zhan, T. Genschoreck, T. Webster, and D. Reich, 2012 1052
Ancient Admixture in Human History. Genetics p. genet- 1053
ics.112.145037. 1054
Pease, J. B. and M. W. Hahn, 2015 Detection and Polarization of 1055
Introgression in a Five-Taxon Phylogeny. Systematic Biology 1056
64: 651–662. 1057
Peter, B. M. and M. Slatkin, 2015 The effective founder effect in 1058
a spatially expanding population. Evolution 69: 721–734. 1059
16 Benjamin M Peter et al
Petkova, D., J. Novembre, and M. Stephens, 2014 Visualizing spa- 1060
tial population structure with estimated effective migration 1061
surfaces. bioRxiv p. 011809. 1062
Pickrell, J. K. and J. K. Pritchard, 2012 Inference of Population 1063
Splits and Mixtures from Genome-Wide Allele Frequency 1064
Data. PLoS Genet 8: e1002967. 1065
Pickrell, J. K. and D. Reich, 2014 Toward a new history and 1066
geography of human genes informed by ancient DNA. Trends 1067
in Genetics 30: 377–389. 1068
Pritchard, J. K., M. Stephens, and P. Donnelly, 2000 Inference of 1069
population structure using multilocus genotype data. Genetics 1070
155: 945–959. 1071
Raghavan, M., P. Skoglund, K. E. Graf, M. Metspalu, A. Albrecht- 1072
sen, I. Moltke, S. Rasmussen, T. W. Stafford Jr, L. Orlando, 1073
E. Metspalu, and others, 2014 Upper Palaeolithic Siberian 1074
genome reveals dual ancestry of Native Americans. Nature 1075
505: 87–91. 1076
Ralph, P. and G. Coop, 2013 The Geography of Recent Genetic 1077
Ancestry across Europe. PLoS Biol 11: e1001555. 1078
Reich, D., N. Patterson, D. Campbell, A. Tandon, S. Mazieres, 1079
N. Ray, M. V. Parra, W. Rojas, C. Duque, N. Mesa, L. F. 1080
García, O. Triana, S. Blair, A. Maestre, J. C. Dib, C. M. 1081
Bravi, G. Bailliet, D. Corach, T. Hünemeier, M. C. Bor- 1082
tolini, F. M. Salzano, M. L. Petzl-Erler, V. Acuña-Alonzo, 1083
C. Aguilar-Salinas, S. Canizales-Quinteros, T. Tusié-Luna, 1084
L. Riba, M. Rodríguez-Cruz, M. Lopez-Alarcón, R. Coral- 1085
Vazquez, T. Canto-Cetina, I. Silva-Zolezzi, J. C. Fernandez- 1086
Lopez, A. V. Contreras, G. Jimenez-Sanchez, M. J. Gómez- 1087
Vázquez, J. Molina, Á. Carracedo, A. Salas, C. Gallo, G. Poletti, 1088
D. B. Witonsky, G. Alkorta-Aranburu, R. I. Sukernik, L. Os- 1089
ipova, S. A. Fedorova, R. Vasquez, M. Villena, C. Moreau, 1090
R. Barrantes, D. Pauls, L. Excoffier, G. Bedoya, F. Rothham- 1091
mer, J.-M. Dugoujon, G. Larrouy, W. Klitz, D. Labuda, J. Kidd, 1092
K. Kidd, A. Di Rienzo, N. B. Freimer, A. L. Price, and A. Ruiz- 1093
Linares, 2012 Reconstructing Native American population 1094
history. Nature 488: 370–374. 1095
Reich, D., K. Thangaraj, N. Patterson, A. L. Price, and L. Singh, 1096
2009 Reconstructing Indian population history. Nature 461: 1097
489–494. 1098
Saitou, N. and M. Nei, 1987 The neighbor-joining method: a 1099
new method for reconstructing phylogenetic trees. Molecular 1100
Biology and Evolution 4: 406–425. 1101
Schraiber, J. G. and J. M. Akey, 2015 Methods and models for 1102
unravelling human evolutionary history. Nature Reviews Ge- 1103
netics . 1104
Semple, C. and M. A. Steel, 2003 Phylogenetics. Oxford University 1105
Press. 1106
Slatkin, M., 1991 Inbreeding coefficients and coalescence times. 1107
Genetic Research 58: 167–175. 1108
Slatkin, M. and L. Voelm, 1991 FST in a Hierarchical Island 1109
Model. Genetics 127: 627–629. 1110
Strobeck, C., 1987 Average number of nucleotide differences in 1111
a sample from a single subpopulation: a test for population 1112
subdivision. Genetics 117: 149–153. 1113
Tajima, F., 1983 Evolutionary Relationship of Dna Sequences in 1114
Finite Populations. Genetics 105: 437–460. 1115
Tavaré, S., 1984 Line-of-descent and genealogical processes, and 1116
their applications in population genetics models. Theoretical 1117
Population Biology 26: 119–164. 1118
Wahlund, S., 1928 Zusammensetzung Von Populationen 1119
Und Korrelationserscheinungen Vom Standpunkt Der 1120
Vererbungslehre Aus Betrachtet. Hereditas 11: 65–106. 1121
Wakeley, J., 2009 Coalescent theory: an introduction. Roberts & Co. 1122
Publishers. 1123
Weir, B. S. and C. C. Cockerham, 1984 Estimating F-Statistics for 1124
the Analysis of Population Structure. Evolution 38: 1358–1370, 1125
ArticleType: research-article / Full publication date: Nov., 1126
1984 / Copyright © 1984 Society for the Study of Evolution. 1127
Wright, S., 1921 Systems of mating. Genetics 6: 111–178. 1128
Wright, S., 1931 Evolution in Mendelian populations. Genetics 1129
16: 97–159. 1130
Yunusbayev, B., M. Metspalu, E. Metspalu, A. Valeev, S. Litvi- 1131
nov, R. Valiev, V. Akhmetova, E. Balanovska, O. Balanovsky, 1132
S. Turdikulova, D. Dalimova, P. Nymadawa, A. Bahmanimehr, 1133
H. Sahakyan, K. Tambets, S. Fedorova, N. Barashkov, I. Khidiy- 1134
atova, E. Mihailov, R. Khusainova, L. Damba, M. Derenko, 1135
B. Malyarchuk, L. Osipova, M. Voevoda, L. Yepiskoposyan, 1136
T. Kivisild, E. Khusnutdinova, and R. Villems, 2015 The Ge- 1137
netic Legacy of the Expansion of Turkic-Speaking Nomads 1138
across Eurasia. PLoS Genet 11: e1005068. 1139
Admixture, Population Structure and F-statistics 17
Figure S1 Path interpretation of F2: F2 is interpreted as the covariance of two possible paths from P1 to P2, which I color green andblue, respectively. Only branches that are taken by both paths contribute to the covariance. With probability α, a path takes the leftadmixture edge, and with probability β = 1− α, the right one. I then condition on which admixture edge the paths follow: In thefirst tree on the right-hand side, both paths take the right admixture edge (with probability α2, in the second and third tree they takedifferent or the right path, respectively. The result is summarized as the weighted sum of branches in the left-hand side tree. For amore detailed explanation, see Patterson et al. (2012).
C. Discordant genealogy
B. Concordant genealogy
A. Equation
Figure S2 Schematic explanation how F3 behaves conditioned on gene tree. Blue terms and branches correspond to positive con-tributions, whereas red branches and terms are subtracted. Labels represent individuals randomly sampled from that population.The external branches cancel out, so only the internal branches have non-zero contribution to F3. In the concordant genealogy(Panel B), the contribution is positive (with weight 2), and in the discordant genealogy (Panel C), it is negative (with weight 1). Themutation rate as constant of proportionality is omitted.
18 Benjamin M Peter et al
C. Discordant genealogy (BABA)
B. Concordant genealogy
A. Equation
D. Discordant genealogy (ABBA)
Figure S3 Schematic explanation how F4 behaves conditioned on gene tree. Blue terms and branches correspond to positive con-tributions, whereas red branches and terms are subtracted. Labels represent individuals randomly sampled from that population.All branches cancel out in the concordant genealogy (Panel B), and that the two discordant genealogies contribute with weight +2and -2, respectively The mutation rate as constant of proportionality is omitted.
Admixture, Population Structure and F-statistics 19