separating population structure from recent evolutionary history

Post on 14-Jan-2016


Separating Population Structure from Recent Evolutionary History

Problem: Spatial Patterns Inferred Earlier Represent An Equilibrium Between Recurrent Evolutionary Forces Such as Gene Flow and Drift.

E.g., $F_{st} \approx \frac{1}{4N_{ev}m + 1}$

But the Same Pattern Can Arise From Recent Historical Events That Have Not Had Time to Reach Equilibrium.
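A minimal sketch of the equilibrium expectation, assuming the relation on this slide is the standard island-model approximation F_st ≈ 1/(4 N_ev m + 1), where N_ev is the variance effective size and m is the migration rate per generation. The function name and the parameter values are illustrative and do not come from the slides.

```python
def equilibrium_fst(n_ev: float, m: float) -> float:
    """Equilibrium F_st under the island-model approximation 1 / (4 * N_ev * m + 1)."""
    if n_ev <= 0 or not (0.0 <= m <= 1.0):
        raise ValueError("n_ev must be positive and m must lie in [0, 1]")
    return 1.0 / (4.0 * n_ev * m + 1.0)


# Hypothetical values: one effective migrant per generation (N_ev * m = 1).
fst_one_migrant = equilibrium_fst(n_ev=1000, m=0.001)   # 1 / (4 + 1) = 0.2
# Ten effective migrants per generation (N_ev * m = 10).
fst_ten_migrants = equilibrium_fst(n_ev=1000, m=0.01)   # 1 / (40 + 1)
```

The same F_st value can be produced by many combinations of N_ev and m, and, as the slide notes, by a recent historical event that has not reached equilibrium at all, so the formula should only be inverted when equilibrium is a defensible assumption.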

To Examine Historical Events & Non-Equilibrium States, Need to Study Genetic Variation in Both Space & Time

Directly: Sample Populations From the Past

Indirectly: Reconstruct Variation Through Time

Direct Study: mtDNA in the Woolly Mammoth

Debruyne et al. 2008. Out of America: Ancient DNA Evidence for a New World Origin of Late Quaternary Woolly Mammoths. Curr. Biol. 18:1320-1326.


Indirect Studies

Recall that $D_t = D_0(1-r)^t$

Therefore, Multi-locus or Multi-site Polymorphic Data Contains Historical Information, and This Retention Is For Long Periods of Time When r Is Small.
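To make the retention argument concrete, here is a small sketch of the decay of linkage disequilibrium, D_t = D_0(1 - r)^t, together with the number of generations needed for D to fall to a given fraction of its starting value. The function names and the recombination rates used are illustrative assumptions.

```python
import math


def linkage_disequilibrium(d0: float, r: float, t: int) -> float:
    """D after t generations of random mating: D_t = D_0 * (1 - r) ** t."""
    return d0 * (1.0 - r) ** t


def generations_to_fraction(r: float, fraction: float) -> float:
    """Generations until D falls to `fraction` of D_0: t = ln(fraction) / ln(1 - r)."""
    if not (0.0 < r < 1.0) or not (0.0 < fraction < 1.0):
        raise ValueError("r and fraction must lie strictly between 0 and 1")
    return math.log(fraction) / math.log(1.0 - r)


# Unlinked loci (r = 0.5) lose half of D every generation.
half_life_unlinked = generations_to_fraction(r=0.5, fraction=0.5)
# Tightly linked sites (hypothetical r = 0.0001) retain D for thousands of generations.
half_life_tight = generations_to_fraction(r=0.0001, fraction=0.5)
```

With r = 0.0001 the half-life is roughly 6,900 generations, which is why multi-site haplotypes with little recombination carry historical information over long periods.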

Attempts to Reconstruct History Depend Upon Multiple Loci or Upon Multi-Site Haplotypes.

Multiple Loci: Principal Component Analysis of Genetic Data

This procedure has long been used in human genetics to extract multi-locus information about gene flow patterns (e.g., Cavalli-Sforza & Ammerman, 1984).

Multiple Loci: Principal Component Analysis of Genetic Data

Novembre et al. Nature 31 Aug 2008. Based on 197,146 loci in 1,387 individuals.


Overlay of the steepest slope values (upper 5%)

Microsatellite survey of naked mole rats in Meru National Park, Kenya (Jon Hess)

Haplotypes

One Method Is To Look At the Spatial Distribution of Globally Rare, Tip Haplotypes (Although They May be Locally Common)

Coalescent Theory Implies Such Haplotypes Are Recent, And Therefore Are Not In Equilibrium And Have Limited Spatial Distributions

Therefore, Globally Rare, Tip Haplotypes Provide A Straightforward Method of Observing The Movements of Genes Through Space Over Short and Recent Time Periods.

Schroeder, K. B. et al. Mol Biol Evol 2009 26:995-1016

Geographic distribution of the Asian and American populations genotyped for the microsatellite D9S1120

“Private” 9-repeat allele


Visual genotypes, clustered by population, for individuals either homozygous or heterozygous for the 9-repeat allele

Implies that this “private allele” is identical by descent in all Western Beringians and Native Americans, which in turn implies that Native Americans Descended (at least in part) From These Western Beringian Populations.

Method for estimating the TMRCA of copies of an allele from the number of recombination events on its shared haplotypic background

Under the different best models, the mean TMRCA of the 9-repeat allele ranged from 293 generations to 1,596 generations; using a generation time of 25 years resulted in a TMRCA of 7,325-39,900 years ago. Averaging over all of our best models, the mean TMRCA is 513 generations ago or about 12,825 years ago. The 95% confidence intervals for all of the best models produced ages for the MRCA of the 9-repeat allele that range from 144 to 1,951 generations ago, or approximately 3,600-48,775 years ago.
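The year figures quoted for the 9-repeat allele follow directly from multiplying generations by the stated 25-year generation time. The short sketch below reproduces that arithmetic; the dictionary labels are descriptive names chosen here, not terms from Schroeder et al.

```python
GENERATION_TIME_YEARS = 25  # generation time stated on the slide


def generations_to_years(generations: int, generation_time: int = GENERATION_TIME_YEARS) -> int:
    """Convert a TMRCA in generations to years."""
    return generations * generation_time


tmrca_generations = {
    "lowest model mean": 293,
    "highest model mean": 1596,
    "average over best models": 513,
    "lower 95% bound": 144,
    "upper 95% bound": 1951,
}
tmrca_years = {label: generations_to_years(g) for label, g in tmrca_generations.items()}
```

Any other assumed generation time rescales every date by the same factor, so the width of the interval relative to its mean is unchanged.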

Schematics of the demographic models used for the coalescent simulations: (A) population split with two equal-size descendant populations (Asia and America), (B) population split with NAs/NAm equal to 0.15 at TAs/Am, and (C) population split with NAs/NAm equal to 0.02 at TAs/Am, followed by population growth such that NAs/NAm equals 0.15 at T0. Models D and E are the same as models B and C, respectively, but include population substructure in Asia and in America.

Haplotype Trees

Are Biologically Meaningful Only When Recombination Is Absent Or Rare

Gives Some Information About Temporal Ordering of Mutational Variation, Both the Rare and the Common Mutations

Not Limited to Recent Events, But Can Go Back Further In Time (But Not Beyond the Most Recent Common Ancestral DNA Molecule)

A Haplotype Tree Should Never Be Equated To A Tree of Populations. It Is Only The Tree of The Genetic Variation For That DNA Region.

There Is Information About Population History in the Haplotype Tree, But It Must Be Extracted Carefully.

Haplotype Trees ≠ Species or Population Trees

It is dangerous to equate a haplotype tree to a species tree.

It is NEVER justified to equate a haplotype tree to a tree of populations within a species because the problem of lineage sorting is greater and the time between events is shorter. Moreover, a population tree need not exist at all.

Nested Clade Analysis

• Converts Haplotype Trees Into A Nested Statistical Design

• Other Data (Phenotypic or Geographical) Are Then Overlaid Upon The Nested Design

• Statistical Tests Are Performed To Detect Significant Associations Between the Data and The Haplotype Tree

DOES NOT EQUATE THE HAPLOTYPE TREE TO A POPULATION TREE!

NCPA Distance Measures

[Figure: map of sample locations used for the NCPA distance measures]

A Haplotype Tree In Elephants

[Figure: haplotype tree with sampling locations Tsavo, Amboseli, Sengwa, Hwange, Victoria Falls and Matetsi]

Within 1-Step Clades                                  Within Total Tree
Haplotype    No. in sample   Dc          Dn           1-Step Clade   Dc          Dn
1            35              1021L***    1027L***
2            20              81S***      657S***
3            1               0           601          1-1            884         1173L***
Old-Young                    944L***     373L***
4            11              959L***     832L***
5            16              114         249S*
6            3               0           156S*        1-2            460S***     768S***
Old-Young                    862L***     598L***
7            27              47          47
8            1               0           126
9            1               0           68           1-3            49S***      759S**
Old-Young                    47          -50          Old-Young      626L***     409L***
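The Dc and Dn columns can be illustrated with a short sketch. It assumes the usual NCPA definitions: Dc (clade distance) is the average distance of a clade's sampled individuals from that clade's own geographic center, and Dn (nested clade distance) is their average distance from the geographic center of the clade they are nested in. The count-weighted mean latitude/longitude used as the center, the haversine distance, and the site names and coordinates are simplifying assumptions for illustration, not the elephant data and not the implementation of any NCPA program.

```python
import math


def haversine_km(p, q):
    """Great-circle distance in km between two (lat, lon) points given in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*p, *q))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(a))


def center(counts, sites):
    """Count-weighted mean latitude/longitude (a simple stand-in for the geographic center)."""
    n = sum(counts.values())
    lat = sum(c * sites[s][0] for s, c in counts.items()) / n
    lon = sum(c * sites[s][1] for s, c in counts.items()) / n
    return (lat, lon)


def mean_distance(counts, sites, point):
    """Average distance of the sampled individuals from `point`."""
    n = sum(counts.values())
    return sum(c * haversine_km(sites[s], point) for s, c in counts.items()) / n


def clade_distances(clade_counts, nesting_counts, sites):
    """Return (Dc, Dn) for a clade nested inside a higher-level clade."""
    dc = mean_distance(clade_counts, sites, center(clade_counts, sites))
    dn = mean_distance(clade_counts, sites, center(nesting_counts, sites))
    return dc, dn


# Hypothetical sites on the equator and hypothetical sample counts.
sites = {"site_A": (0.0, 30.0), "site_B": (0.0, 40.0)}
tip_clade = {"site_A": 5}                      # found at a single site
nesting_clade = {"site_A": 5, "site_B": 5}     # the clade that contains it
dc, dn = clade_distances(tip_clade, nesting_clade, sites)
```

Here the tip clade has Dc = 0 because it occurs at one site, while its Dn is large because the nesting clade is centered five degrees of longitude away. In NCPA the significance of such values (the S and L flags in the table) is judged by permutation, which this sketch does not attempt.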

Only When Statistical Significance Is Achieved Is The Biological Significance Interpreted With Explicit, a priori Criteria

• For Example, Under Isolation By Distance, It Takes Many Generations For A New Haplotype To Spread Across Many Demes.

• Therefore, Expect Older Haplotypes To Be More Widespread Than Younger Haplotypes.

• Younger Haplotypes Tend To Have Geographical Ranges Nested Within the Ranges of Their Ancestral Haplotypes.

A Haplotype Tree In Elephants

[Figure: the elephant haplotype tree (Tsavo, Amboseli, Sengwa, Hwange, Victoria Falls, Matetsi) with "Gene flow with IBD" marked at four places]

Historical Events Also Leave Lasting Patterns in Haplotype Trees.

For Example, When A Population Expands Into a New Area, Even Haplotypes Recently Created by Mutation Can Become Geographically Widespread, and Haplotypes Created By Mutation After the Expansion Can Be Located Far From the Geographical Center of Their Ancestral Haplotype.

Range Expansion

[Figure: range expansion from Past to Present across Area A, Area B and Area C]

Nested Clade Analysis of the Chub (Leuciscus cephalus): Range Expansion (from Durand et al. 1999)

[Figure labels: Older Clade, Younger Clade, 2-1, SPE]

Historical Events Also Leave Lasting Patterns in Haplotype Trees.

For Example, When A Population Is Fragmented or Otherwise Effectively Isolated, Haplotypes That Arise After The Fragmentation/Isolation Event Cannot Spread to Other Geographical Areas, and With Increasing Time, More Mutations Can Accumulate, Resulting In Larger Than Average Branch Lengths Between Clades in Different Isolates.

Fragmentation

[Figure: recent and old fragmentation across Area A, Area B and Area C]

Fragmentation between Ambystoma tigrinum tigrinum (Clade 4-2) and A. t. mavortium (Clade 4-1)

The Nested Design Means That Inferences Are Robust To Topological Variation Induced by the Evolutionary Stochasticity of the Coalescent Process

African Elephants (Roca, A. L., N. Georgiadis, and S. J. O'Brien. 2005. Cytonuclear genomic dissociation in African elephant species. Nature Genetics 37:96-100.)

Savanna Elephant, Forest Elephant

Fragmentation Inferences From NCA

All 5 DNA regions had a different topology with respect to the 3 elephant taxa (only BGN gave the “species tree”); yet NCPA inferred a fragmentation event between forest and savanna elephants in all 5 DNA regions.

Highly Significant Fragmentation Events Found In All Five Haplotype Trees

[Figure: haplotype trees for Y-DNA, mtDNA, BGN, PLP and PHKA2; legend: Past Fragmentation; Past Fragmentation Followed By Range Expansion and Secondary Contact]

Nested Clade Phylogeographic Analysis

Recurrent Gene Flow, Range Expansion and Fragmentation Could All Have Occurred at Different Times and/or Places.

NCPA Therefore Looks For Multiple Patterns, Not Just One

The Relative Temporal Ordering of Events in a Nested Series of Clades Is Also Inferred by NCPA

Inferences from mtDNA haplotype tree of Ambystoma tigrinum from NCPA and supplemental test for secondary contact (Mol. Ecol. 10: 779-791, 2001)

[Figure: inferences marked on the nested clades: Fragmentation, Secondary Contact, Range Expansion (twice), Isolation by Distance (twice)]

By Analyzing Haplotype Trees for mtDNA, Y-DNA, X-linked DNA and Autosomal DNA, One Can Sample A Wide Variety of Time Scales and Both Male and Female Mediated Gene Flow and Historical Events

By Analyzing Multiple Haplotype Trees, One Can Statistically Correct For The Evolutionary Stochasticity of The Coalescent Process For Any One Genomic Region

Inference Errors in Nested Clade Analysis

Inference Requires That An Appropriate Mutation Occurred At the Right Time and Right Place: Therefore, Some Events and Processes Are Missed With A Particular DNA Region.

Selection and Evolutionary Stochasticity Can Distort The Distribution of Haplotypes in Space and Time, Thereby Leading to False Positive Inferences.

These errors can be minimized by studying multiple loci and requiring each inference (type, place and time) to be cross-validated by two or more loci.

Multilocus Nested Clade Analysis

• Perform Single Locus NCPA on n loci.

• Discard any inferences made only by a single locus.

• Group together all the inferences made by 2 or more loci that are concordant by type of inference and geographical location.

• Test the null hypothesis that all inferences of an event that are concordant by event type and location are a single event.

• Because gene flow is a recurrent process, inferences of gene flow between two regions are not necessarily concordant in time, but one can test the null hypothesis that there was no gene flow between two regions in an interval of time, say t1 to t2, given multiple inferences of gene flow between the two regions.

ALL RETAINED INFERENCES HAVE BEEN CROSS-VALIDATED ACROSS LOCI AND HAVE EXPLICIT, QUANTIFIED STATISTICAL SUPPORT.
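The grouping and discarding steps can be sketched as a simple filter. The record layout (locus, inference type, location) and the example records are hypothetical, and the temporal concordance tests in the later steps are not covered here.

```python
from collections import defaultdict


def cross_validated(inferences, min_loci=2):
    """Group single-locus inferences by (type, location) and keep only the
    groups supported by at least `min_loci` distinct loci."""
    groups = defaultdict(set)
    for locus, kind, location in inferences:
        groups[(kind, location)].add(locus)
    return {key: sorted(loci) for key, loci in groups.items() if len(loci) >= min_loci}


# Hypothetical single-locus NCPA output: (locus, inference type, location).
inferences = [
    ("locus1", "fragmentation", "region X / region Y"),
    ("locus2", "fragmentation", "region X / region Y"),
    ("locus3", "range expansion", "region Z"),
]
retained = cross_validated(inferences)
```

Only the fragmentation inference survives, because the range expansion is supported by a single locus. Each retained group would then go on to the test of whether its estimated times are compatible with a single event.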

Using Theory Developed by Tajima (1983) and Kimura (1970), The Distribution Of The Inference Time Is:

$$f(t_i \mid T_i, k_i) = \frac{t_i^{k_i}\, e^{-t_i (1+k_i)/T_i}}{\left[T_i/(1+k_i)\right]^{1+k_i}\, \Gamma(1+k_i)}$$

where $k_i$ is the average pairwise nucleotide diversity among the haplotypes in DNA region i in the youngest monophyletic clade that contributed in a statistically significant fashion to the NCPA inference of interest, and $T_i$ is the age obtained by the Takahata et al. molecular clock estimator (or perhaps some other method) for this inference from DNA region i.
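A minimal numerical sketch of the inference-time distribution, assuming it is a gamma density with shape 1 + k_i and scale T_i/(1 + k_i), which is the form of the integrand in the gene-flow test later in the deck. Under that assumption the mean is T_i and the variance is T_i^2/(1 + k_i). The values of T and k below are hypothetical, and the Simpson integrator is only a convenience for checking the density.

```python
import math


def inference_time_pdf(t: float, T: float, k: float) -> float:
    """Gamma density with shape 1 + k and scale T / (1 + k), evaluated at t."""
    if t <= 0.0:
        return 0.0
    shape = 1.0 + k
    scale = T / shape
    log_pdf = ((shape - 1.0) * math.log(t) - t / scale
               - shape * math.log(scale) - math.lgamma(shape))
    return math.exp(log_pdf)


def simpson(f, a: float, b: float, n: int = 2000) -> float:
    """Composite Simpson rule on [a, b] with an even number of panels n."""
    h = (b - a) / n
    total = f(a) + f(b)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * f(a + i * h)
    return total * h / 3.0


# Hypothetical inference: T = 1.0 (say, million years) and k = 3 pairwise differences.
T, k = 1.0, 3.0
total_mass = simpson(lambda t: inference_time_pdf(t, T, k), 0.0, 20.0)
mean_time = simpson(lambda t: t * inference_time_pdf(t, T, k), 0.0, 20.0)
```

Larger k narrows the distribution around T, which is why loci with more variation in the relevant clade give more precise dates.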

Estimated Times To Common Ancestor (Method of Takahata et al. 2001)

$D_h$: Nucleotide Difference Within Humans

$D_{hc}$: Nucleotide Difference Between Humans & Chimps

Human-Chimp Divergence: 6 Million Years Ago

$\mathrm{TMRCA} = 12\,D_h / D_{hc}$
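The estimator is a one-line calculation. The sketch below reads the factor 12 as twice the 6-million-year human-chimp divergence time, on the assumption that the human-chimp difference accumulated along both lineages since the split; the example values of Dh and Dhc are hypothetical.

```python
DIVERGENCE_MYA = 6.0  # human-chimp divergence time shown on the slide


def tmrca_mya(d_h: float, d_hc: float, divergence_mya: float = DIVERGENCE_MYA) -> float:
    """TMRCA in millions of years: 2 * divergence * Dh / Dhc (12 * Dh / Dhc for a 6 Myr split)."""
    if d_hc <= 0.0:
        raise ValueError("d_hc must be positive")
    return 2.0 * divergence_mya * d_h / d_hc


# Hypothetical per-site differences: 0.1% within humans, 1.2% between humans and chimps.
example_tmrca = tmrca_mya(d_h=0.001, d_hc=0.012)
```

With these made-up values the estimate is about 1 million years. A different calibration date rescales every estimate proportionally through the `divergence_mya` argument.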

A Likelihood Ratio Test of The Hypothesis That The Estimated Times of An Event From j Loci Are The Same


Fragmentation Inferences From NCA

Null Hypothesis: there was a single fragmentation event between forest and savanna elephants.

Log-likelihood ratio test = 1.497 with 4 degrees of freedom, p = 0.8272. Accept Null Hypothesis, with T = 4.2 MYA.
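The quoted p-value can be checked from the test statistic alone, assuming the statistic is referred to a chi-square distribution with the stated 4 degrees of freedom. For even degrees of freedom the upper-tail probability has a closed form, so no statistics library is needed; the function name is an illustrative choice.

```python
import math


def chi2_sf_even_df(x: float, df: int) -> float:
    """Upper-tail probability of a chi-square variable with even df:
    exp(-x/2) * sum_{i=0}^{df/2 - 1} (x/2)^i / i!"""
    if df <= 0 or df % 2:
        raise ValueError("this closed form needs a positive, even df")
    half = x / 2.0
    return math.exp(-half) * sum(half ** i / math.factorial(i) for i in range(df // 2))


# Elephant fragmentation test from the slide: statistic 1.497 with 4 degrees of freedom.
p_elephants = chi2_sf_even_df(1.497, 4)
```

The result rounds to 0.8272, matching the slide, so the five loci give no evidence against a single fragmentation event.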

There are at least 2 lineages of African Elephants.


Performed Nested Clade Analyses on 25 DNA Regions in Humans

• Mitochondrial DNA (Ingman et al. Nature 408, 708-713, 2000; Sykes et al. American Journal of Human Genetics 57, 1463-1475, 1995; Torroni et al. American Journal of Human Genetics 53, 563-590, 1993, American Journal of Human Genetics 53, 591-608, 1993).

• Y-DNA (Hammer et al. Molecular Biology and Evolution 15, 427-441, 1998)

• 11 X-Linked Regions (Balciuniene et al. 2001; Garrigan et al. 2005; Hammer et al. 2004; Harris & Hey 1999, 2001; Kaessmann et al. 1999; Nachman et al. 2004; Saunders et al. 2002; Verrelli et al. 2002; Yu et al. 2002)

• 12 Autosomal Genes (Bamshad et al. 2002; Harding et al. 1997; Hollox et al. 2001; Jin et al. 1999; Koda et al. 2001; Rana et al. 1999; Rogers et al. 2000; Toomajian and Kreitman 2002; Wooding et al. 2002; Zhang & Rosenberg 2000).

The log likelihood ratio test rejects the null hypothesis that all 15 events are temporally concordant with a probability value of 3.89 × 10^-15.

P = 0.95

P = 0.51

P = 0.62

Three Out-of-Africa Events, All Defined By Three or More Loci With A High Degree of Temporal Homogeneity But With Highly Significant Heterogeneity Between The Three Events

There Were At Least Three Out-of-Africa Expansion Events Over the Last 2 Million Years

Inferences of Gene Flow That Are Concordant Geographically Are NOT Necessarily Concordant Temporally Because Gene Flow is a Recurrent Process. However, We Can Test The Null Hypothesis of NO GENE FLOW Between Two Geographical Regions Over a Specified Time Interval.

Test Of The Null Hypothesis of NO GENE FLOW Between Two Geographical Regions Over a Specified Time Interval l to u:

$$P_i[l,u] = 1 - \int_l^u \frac{t_i^{k_i}\, e^{-t_i (1+k_i)/T_i}}{\left[T_i/(1+k_i)\right]^{1+k_i}\, \Gamma(1+k_i)}\, dt_i$$

$$\mathrm{LRT}([l,u]) = -2 \ln \prod_{i=1}^{j} P_i[l,u]$$
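A hedged numerical sketch of this test, assuming each gene-flow inference i has a gamma-distributed time with shape 1 + k_i and scale T_i/(1 + k_i), and that the statistic is minus twice the log of the product, over the j inferences, of the probability mass lying outside the interval [l, u]. The (T, k) pairs and the intervals are hypothetical, and the reference distribution for the statistic is not computed here.

```python
import math


def gamma_pdf(t: float, T: float, k: float) -> float:
    """Gamma density with shape 1 + k and scale T / (1 + k)."""
    if t <= 0.0:
        return 0.0
    shape = 1.0 + k
    scale = T / shape
    return math.exp((shape - 1.0) * math.log(t) - t / scale
                    - shape * math.log(scale) - math.lgamma(shape))


def prob_outside(l: float, u: float, T: float, k: float, steps: int = 2000) -> float:
    """1 minus the integral of the density over [l, u] (composite Simpson rule, even steps)."""
    h = (u - l) / steps
    total = gamma_pdf(l, T, k) + gamma_pdf(u, T, k)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * gamma_pdf(l + i * h, T, k)
    return 1.0 - total * h / 3.0


def lrt_no_gene_flow(l: float, u: float, inferences) -> float:
    """-2 * ln of the product of outside-interval probabilities over all inferences."""
    stat = 0.0
    for T, k in inferences:
        p = prob_outside(l, u, T, k)
        if p <= 0.0:
            raise ValueError("interval contains essentially all of the probability mass")
        stat += -2.0 * math.log(p)
    return stat


# Hypothetical gene-flow inferences as (T, k) pairs, with times in millions of years.
inferences = [(0.5, 2.0), (1.0, 3.0), (1.5, 4.0)]
lrt_wide = lrt_no_gene_flow(0.1, 2.0, inferences)
lrt_narrow = lrt_no_gene_flow(0.5, 1.0, inferences)
```

The statistic grows as the interval covers more of the inferred gene-flow times, which is the sense in which many overlapping inferences make prolonged isolation hard to sustain.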

Gamma Distributions For 19 African/Eurasian Gene Flow Inferences With Isolation By Distance

Extensive overlap implies cross-validation, with the exception of MX1, the only locus with most of its probability mass in the Pliocene.

The lack of clusters implies there were no prolonged breaks in gene flow throughout the Pleistocene.

Testing The Null Hypothesis of No African/Eurasian Gene Flow Throughout the Pleistocene

The Null hypothesis of isolation (no gene flow) in this time interval is rejected with p < 10^-8

All of The Cross Validated Inferences Integrate Well Into A Single Overview of The Emergence of Humans.

Coalescent Simulations

• Set of Fully Specified Phylogeographic Hypotheses

• Simulate Coalescent Process Many Times Under Each Hypothesis (Virtual Current Generation)

• Draw Simulated Sample of Same Size as Real Sample

• Statistics on Simulated Sample

• Statistics from Real Sample (Real Current Generation)

• Compare Relative Fits of The Simulated Statistics Under Each Model to The Observed Statistics
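The simulation workflow can be sketched with a toy example. It assumes a standard constant-size Kingman coalescent, uses the number of segregating sites as the only summary statistic, and lets the two hypotheses differ only in the scaled mutation rate theta. The hypothesis names, theta values, observed statistic and fit measure (mean absolute deviation) are all illustrative choices, far simpler than a real phylogeographic model.

```python
import random


def simulate_segregating_sites(n: int, theta: float, rng: random.Random) -> int:
    """One coalescent replicate: segregating sites in a sample of n sequences."""
    total_length = 0.0
    for lineages in range(n, 1, -1):
        # Waiting time with this many lineages is exponential with rate C(lineages, 2).
        total_length += lineages * rng.expovariate(lineages * (lineages - 1) / 2.0)
    # Mutations are Poisson with mean theta / 2 * total tree length; count
    # unit-rate exponential arrivals that fall before that mean.
    budget = theta / 2.0 * total_length
    count, clock = 0, rng.expovariate(1.0)
    while clock < budget:
        count += 1
        clock += rng.expovariate(1.0)
    return count


def relative_fit(observed: int, n: int, thetas: dict, replicates: int = 2000, seed: int = 1) -> dict:
    """Mean absolute deviation between simulated and observed statistics per hypothesis."""
    rng = random.Random(seed)
    fits = {}
    for name, theta in thetas.items():
        sims = [simulate_segregating_sites(n, theta, rng) for _ in range(replicates)]
        fits[name] = sum(abs(s - observed) for s in sims) / replicates
    return fits


# Hypothetical data: 6 segregating sites observed in a sample of 10 sequences.
fits = relative_fit(observed=6, n=10, thetas={"H1": 2.0, "H2": 10.0})
best = min(fits, key=fits.get)
```

The output only ranks the hypotheses that were simulated. A smaller deviation for one of them says nothing about whether either is true, which is the limitation of relative fit that the following slides discuss.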

Strong Vs. Weak Inference

• Falsification is the strongest inference possible in science, so this is called “strong inference.”

• Inference in NCPA is based upon the falsification of null hypotheses.

• Weak inference refers to the relative fit of a non-exhaustive set of alternatives.

• It is rare that an exhaustive set of every conceivable phylogeographic alternative can be simulated, so the coalescent simulation approach results in weak inference.

• Weak inference can give high relative support to a false hypothesis when all the alternatives are also false.

E.g., Fagundes et al. (PNAS 104:17614-17619, 2007) Tested 3 Models of Human Evolution via Simulation

Templeton (Yearbook of Physical Anthropology 48:33-59, 2005) Falsified All Three Models, With AFREG Rejected with p < 10^-17

These Results Are NOT Contradictory!

E.g., Fagundes et al. (PNAS 104:17614-17619, 2007) Tested 3 Models of Human Evolution via Simulation

Eswaran et al. (J. Human Evol. 49:1-18, 2005) Tested AFREG vs. A Model of Isolation By Distance and Strongly Rejected AFREG.

These Results Are NOT Contradictory!

[Figure: model diagrams, one with regions Africa, S. Europe, S. Asia and one with regions Africa, S. Europe, N. Europe, S. Asia, N. Asia, Pacific, Americas]

Interpretive Criteria

• Simulations assign “probabilities” to complex models as a whole, making it impossible to interpret the biological reason for a low probability.

• In contrast, NCPA allows individual components to be tested, making the biological interpretation clear.

Reject the Null hypothesis of no admixture with p < 10^-17

The Null hypothesis of isolation (no gene flow) in the minimal time interval proposed by Fagundes et al. is rejected with p = 1.6 × 10^-6 by testing with multilocus NCPA.

• Although Fagundes et al. (2007) interpreted the rejection of their assimilation model as a rejection of admixture, the confounded nature of simulation inference means that such an interpretation has no logical validity.

• NCPA allows individual components to be tested, making it clear that the part of their assimilation model that is wrong is NOT admixture, but rather the assumption of prior isolation of archaic Africans and Eurasians.

Coherent Inference

• Coherence is a property referring to nested and composite hypotheses.

• The meaning of coherence is most easily illustrated with nested hypotheses:

[Figure: hypothesis A drawn as a subset nested within hypothesis B]

One measure of fit is the probability of the hypotheses. Because A is a nested subset of B, Prob.(B) ≥ Prob.(A). This relationship is “coherent”.

If one assigned Prob.(A) > Prob.(B), this is mathematically impossible and is said to be “incoherent”.
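The coherence requirement can be written as a simple check on any set of probabilities assigned to hypotheses. The sketch below flags every nested pair in which the nested hypothesis receives more probability than the hypothesis that contains it. The function name, the hypothesis labels and the numbers are hypothetical and are not taken from any published analysis.

```python
def incoherent_pairs(probabilities: dict, nested_in: list) -> list:
    """Return the (nested, nesting) pairs where the nested hypothesis is
    assigned a strictly larger probability than the one that contains it."""
    return [(a, b) for a, b in nested_in if probabilities[a] > probabilities[b]]


# "A" is a special case of "B", so coherence requires Prob(B) >= Prob(A).
nesting = [("A", "B")]
coherent_assignment = {"A": 0.2, "B": 0.7}     # hypothetical numbers
incoherent_assignment = {"A": 0.8, "B": 0.1}   # hypothetical numbers

ok = incoherent_pairs(coherent_assignment, nesting)
violations = incoherent_pairs(incoherent_assignment, nesting)
```

An empty list means the assignment respects every stated nesting relation; any returned pair identifies a probability ordering that no valid probability measure could produce.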

E.g., Fagundes et al. (PNAS 104:17614-17619, 2007)

The “assimilation” model (B) allows the possibility of admixture between Africans and Eurasians, measured by the parameter M that can vary between 0 and 1. Note, M=0 corresponds to replacement, so the replacement model (A) is a proper subset of the assimilation model.

Note the probabilities assigned to A and B.

The ABC method is INCOHERENT!

Why Is ABC INCOHERENT?

There is no correction for dimensionality of the different hypotheses (indexed by i); and

The denominator treats all hypotheses as mutually exclusive events.

Equation 9 From Beaumont, M. A., W. Y. Zhang, and D. J. Balding. 2002. Approximate Bayesian computation in population genetics. Genetics 162:2025-2035.

E.g., Fagundes et al. (PNAS 104:17614-17619, 2007)

Equation 9 states that Prob(A or B or C) = P(A) + P(B) + P(C)

[Figure: A, B and C drawn as mutually exclusive sets]

[Figure: A nested within B, with B and C overlapping]

Prob(A or B or C) = P(B) + P(C) - P(B & C)

Hence, the fundamental equation of ABC is mathematically incoherent for nested and/or composite hypotheses.
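The difference between the two formulas can be verified on a small finite example. The sketch below builds hypothetical events on twelve equally likely outcomes, with A nested inside B and B overlapping C, and compares the true probability of the union with the simple sum and with the inclusion-exclusion form. The sets are invented for illustration only.

```python
from fractions import Fraction

outcomes = set(range(12))   # hypothetical, equally likely outcomes
A = {0, 1}
B = {0, 1, 2, 3, 4}         # A is nested in B
C = {4, 5, 6}               # C overlaps B


def prob(event: set) -> Fraction:
    """Probability of an event under equally likely outcomes."""
    return Fraction(len(event), len(outcomes))


union = prob(A | B | C)                              # true Prob(A or B or C)
naive_sum = prob(A) + prob(B) + prob(C)              # valid only if mutually exclusive
nested_formula = prob(B) + prob(C) - prob(B & C)     # correct when A is inside B
```

The simple sum gives 5/6 while the true probability of the union is 7/12, which the inclusion-exclusion form reproduces exactly. Treating nested or overlapping hypotheses as mutually exclusive therefore overstates the total probability.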

Other Methods of Evaluating Hypotheses in the Coalescent Simulation Approach are Incoherent

• Bayes Factors are known to be incoherent (Lavine, M., and M. J. Schervish. 1999. Bayes Factors: What They Are and What They Are Not. The American Statistician 53:119-122).

• Mesquite and all other programs that treat all phylogeographic hypotheses as mutually exclusive alternatives are incoherent.

• Coalescent Simulations Can Only Be Used to Test Single Parameter Models Against Their Complement (e.g., FST > 0 vs. FST = 0).

Statistical Phylogeography

• Multilocus NCPA provides a robust, flexible testing framework.

• Simulations have multiple statistical flaws and cannot be used to test composite phylogeographic hypotheses.

• NCPA defines the general model but does not yield insight into details.

• Once the general model framework has been inferred by NCPA, simulations can be used to estimate the underlying parameters.


Statistical Phylogeography


NCPA and simulation approaches are not so much alternative techniques as they are complementary, and potentially synergistic, techniques. Both add to the statistical toolkit of intraspecific phylogeographers, and both should be used when appropriate.
