27/02/13
Phylogenetics 101
Eddie Holmes
Sydney Emerging Infections and Biosecurity Institute, School of Biological Sciences and Sydney Medical School,
The University of Sydney, Australia Fogarty International Center, National Institutes of Health, USA
Charles Darwin (February 12th 1809 - April 19th 1882)
On the Origin of Species (published 24th November 1859)
History of Virology
• 1798: Jenner starts vaccination
• 1885: Pasteur makes rabies vaccine
• 1898: Viruses discovered
• 1900: Vector of yellow fever found
• 1918: Global influenza pandemic
• 1920s-30s: Many viruses discovered
History of ‘Darwinism’
• 1809: Darwin born
• 1831-1836: Voyage of ‘The Beagle’
• 1859: ‘On the Origin of Species’
• 1871: ‘Descent of Man’
• 1882: Darwin dies
• 1900: Mendel’s work rediscovered
• 1920s-30s: Neo-Darwinian synthesis
Yellow Fever
Evolutionary History of Yellow Fever Virus
• Strong correspondence between the timing and direction of the spread of yellow fever virus and the slave trade from West Africa to the Americas
Influenza virus H1N1/09pdm
• Likely emergence in the first 7 weeks of 2009
• Extensive spatial mixing, e.g. multiple entries into Asia and Europe
• Relatively rapid epidemic doubling time: 17.0 days (95% HPD: 12.9, 24.2 days)
Viral dynamics in time and space
[Figure: viral shedding and immunity plotted against time, across four sampling scales]
• Intra-host: intra-host spatial dynamics; viral mutation and immune escape.
• Transmission network: dynamics of neutral and adaptive viral diversity across infection chains; characterization of transmission and social networks.
• Population: measuring population-level spatial coupling and the impact of viral immune escape on herd immunity.
• Metapopulation/regional/global: measuring metapopulation and global dynamics; global spread and fate of escape variants and other mutants.
The Phylodynamics Approach
[Figure panels: Measles, HIV-1]
Epidemiological dynamics are written into the branching structure of phylogenetic trees
Phylodynamics and the Coalescent
This approach can be extended to examine patterns of viral spread in space (i.e. phylogeography)
[Figure: coalescent genealogy of samples A-E across discrete generations, with Ne(t) plotted against time]
• Given sequence data that is time-structured, we can use Bayesian coalescent approaches to estimate values of:
  – evolutionary parameters
    • substitution rate
    • times to common ancestry (TMRCA)
  – demographic history: Ne(t)
Bayesian skyline plot
Useful Textbooks & Software
Books:
• Page RDM & Holmes EC. (1998). Molecular Evolution: A Phylogenetic Approach. Blackwell Science Ltd, Oxford.
• Lemey P, Salemi M & Vandamme A-M. (2009). The Phylogenetic Handbook, 2nd Edition. Cambridge University Press.
Computer Software:
• BEAST (Bayesian Evolutionary Analysis Sampling Trees) - http://beast.bio.ed.ac.uk/
• MEGA (Molecular Evolutionary Genetics Analysis) - http://megasoftware.net/
• MrBayes (Bayesian inference of phylogeny) - http://mrbayes.csit.fsu.edu/
• PhyML (Maximum likelihood phylogenetics) - http://www.atgc-montpellier.fr/phyml/
• HyPhy/DATAMONKEY (Selection, recombination & hypothesis testing) - http://datamonkey.org/
• RDP3 (Recombination detection program) - darwin.uvigo.es/rdp/rdp.html
Topics in Evolutionary Inference
• Estimating genetic distances between sequences
• Inferring phylogenetic trees
• Detecting recombination events
• The inference of selection pressures (particularly detecting positive selection)
• Estimating rates of evolutionary change
• Inferring demographic history (population dynamics)
• Phylogeography
Estimating Genetic Distances Between Sequences
Estimating Genetic Distance
SIVcpz  ATGGGTGCGA GAGCGTCAGT TCTAACAGGG GGAAAATTAG ATCGCTGGGA
HIV-1   ATGGGTGCGA GAGCGTCAGT ATTAAGCGGG GGAGAATTAG ATCGATGGGA

SIVcpz  AAAAGTTCGG CTTAGGCCCG GGGGAAGAAA AAGATATATG ATGAAACATT
HIV-1   AAAAATTCGG TTAAGGCCAG GGGGAAAGAA AAAATATAAA TTAAAACATA

SIVcpz  TAGTATGGGC AAGCAGGGAG CTGGAAAGAT TCGCATGTGA CCCCGGGCTA
HIV-1   TAGTATGGGC AAGCAGGGAG CTAGAACGAT TCGCAGTTAA TCCTGGCCTG

SIVcpz  ATGGAAAGTA AGGAAGGATG TACTAAATTG TTACAACAAT TAGAGCCAGC
HIV-1   TTAGAAACAT CAGAAGGCTG TAGACAAATA CTGGGACAGC TACAACCATC

SIVcpz  TCTCAAAACA GGCTCAGAAG GACTGCGGTC CTTGTTTAAC ACTCTGGCAG
HIV-1   CCTTCAGACA GGATCAGAAG AACTTAGATC ATTATATAAT ACAGTAGCAA

SIVcpz  TACTGTGGTG CATACATAGT GACATCACTG TAGAAGACAC ACAGAAAGCT
HIV-1   CCCTCTATTG TGTGCATCAA AGGATAGAGA TAAAAGACAC CAAGGAAGCT

SIVcpz  CTAGAACAGC TAAAGCGGCA TCATGGAGAA CAACAGAGCA AAACTGAAAG
HIV-1   TTAGACAAGA TAGAG--GAA -----GAGCA AAACAAAAGT AA---GAAAA

SIVcpz  TAACTCAGGA AGCCGTGAAG GGGGAGCCAG TCAAGGCGCT AGTGCCTCTG
HIV-1   AAGCACAGCA AGC-----AG CAGCTGACA- -CAGGACAC- AG--CAGC--

SIVcpz  CTGGCATTAG TGGAAATTAC
HIV-1   CAGG--TCAG CCAAAATTAC
Multiple Substitutions at a Single Site - Hidden Information
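[Figure: base changes at a single site. Example 1 shows the bases A, A, C and T; Example 2 shows the bases T, A, C and A]
• Example 1: only 1 mutation is counted when 2 have occurred
• Example 2: 0 mutations are counted when 3 have occurred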
The Problem of Multiple Substitution
• When % divergence is low, observed distance (p) is a good estimator of genetic distance (d)
• When % divergence is high, p underestimates d and a “correction statistic” is required, i.e. a model of DNA substitution
[Figure: % divergence (axis ticks 25, 50, 75) against time, comparing actual and observed divergence; the gap between the two curves is the hidden information]
Models of DNA Substitution
i. The probability of substitution between bases (e.g. A to C, C to T…)
ii. The probability of substitution along a sequence (different sites/regions evolve at different rates)
• Models of DNA sequence evolution are required to recover the missing information through correcting for multiple substitutions.
Models of DNA Substitution 1 (Jukes-Cantor, 1969)
• Assumptions:
  i. All bases evolve independently
  ii. All bases are at equal frequency
  iii. Each base can change with equal probability (α)
  iv. Mutations arise according to a Poisson distribution (rare and independent events)
• From this the number of substitutions per site (d) can be estimated by:
  $d = -\frac{3}{4}\ln\left(1 - \frac{4}{3}P\right)$
  where P is the proportion of observed nucleotide differences between 2 sequences.
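The Jukes-Cantor correction can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used by any particular package: the function names `p_distance` and `jukes_cantor` are hypothetical, and skipping alignment columns that contain a gap is an assumption (programs differ in how they treat gaps).

```python
import math


def p_distance(seq1: str, seq2: str) -> float:
    """Proportion of differing sites (P), ignoring columns that contain a gap."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    # Assumption: any column with '-' in either sequence is skipped.
    pairs = [(a, b) for a, b in zip(seq1.upper(), seq2.upper())
             if a != "-" and b != "-"]
    if not pairs:
        raise ValueError("no comparable sites")
    return sum(a != b for a, b in pairs) / len(pairs)


def jukes_cantor(p: float) -> float:
    """Jukes-Cantor distance d = -3/4 ln(1 - 4P/3)."""
    if not 0.0 <= p < 0.75:
        raise ValueError("P must lie in [0, 0.75) for the correction to be defined")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


if __name__ == "__main__":
    # First block of the SIVcpz / HIV-1 alignment from the slide (spaces removed).
    siv = "ATGGGTGCGAGAGCGTCAGTTCTAACAGGGGGAAAATTAGATCGCTGGGA"
    hiv = "ATGGGTGCGAGAGCGTCAGTATTAAGCGGGGGAGAATTAGATCGATGGGA"
    p = p_distance(siv, hiv)
    print(f"P = {p:.3f}, Jukes-Cantor d = {jukes_cantor(p):.3f}")
```

The correction is undefined once P reaches 0.75, the divergence expected between two random sequences with equal base frequencies, which is why the function rejects such values.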
[Figure: the four bases A, T, C and G, with all six connecting arrows labelled α]
All substitutions occur at the same rate (α)
Is this model too simple for real data?
[Figure: the four bases A, T, C and G, with two arrows labelled α and four labelled β]
Transitions (α) and transversions (β) occur at different rates
Models of DNA Substitution 2 (Kimura 2-parameter, 1980)
• Assumptions:
  i. All bases evolve independently
  ii. All bases are at equal frequency
  iii. Transitions and transversions occur with different probabilities (α and β)
  iv. The Jukes-Cantor model is applied to transitions and transversions independently
• From this the expected number of substitutions per site (d) can be estimated by:
  $d = -\frac{1}{2}\ln\left[(1 - 2P - Q)\sqrt{1 - 2Q}\right]$
  where P is the proportion of observed transitions and Q the proportion of observed transversions between 2 sequences
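As a rough sketch of how the Kimura 2-parameter formula is applied, the Python below counts transitions and transversions between two aligned sequences and then evaluates d. The function names are hypothetical, and ignoring any column that is not A, C, G or T in both sequences is an assumption made for this illustration.

```python
import math

# Transitions are purine<->purine (A<->G) or pyrimidine<->pyrimidine (C<->T).
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def ts_tv_proportions(seq1: str, seq2: str) -> tuple:
    """Return (P, Q): proportions of transitions and transversions."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    n_sites = n_ts = n_tv = 0
    for a, b in zip(seq1.upper(), seq2.upper()):
        # Assumption: gaps and ambiguity codes are skipped.
        if a not in "ACGT" or b not in "ACGT":
            continue
        n_sites += 1
        if a == b:
            continue
        if (a, b) in TRANSITIONS:
            n_ts += 1
        else:
            n_tv += 1
    if n_sites == 0:
        raise ValueError("no comparable sites")
    return n_ts / n_sites, n_tv / n_sites


def kimura_2p(p_ts: float, q_tv: float) -> float:
    """Kimura 2-parameter distance d = -1/2 ln[(1 - 2P - Q) sqrt(1 - 2Q)]."""
    term1 = 1.0 - 2.0 * p_ts - q_tv
    term2 = 1.0 - 2.0 * q_tv
    if term1 <= 0.0 or term2 <= 0.0:
        raise ValueError("sequences too divergent for the K2P correction")
    return -0.5 * math.log(term1 * math.sqrt(term2))
```

When one third of the observed differences are transitions (P = p/3, Q = 2p/3), the two log terms coincide and the result reduces to the Jukes-Cantor distance for the same p, which is a useful sanity check.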
Models of DNA Substitution
Simplest (few parameters)
1. Base frequencies are equal and all substitutions are equally likely (Jukes-Cantor)
2. Base frequencies are equal but transitions and transversions occur at different rates (Kimura 2-parameter)
3. Unequal base frequencies and transitions and transversions occur at different rates (Hasegawa-Kishino-Yano)
4. Unequal base frequencies and all substitution types occur at different rates (General Reversible Model)
Most complex (many parameters)
All these models can be tested using the program jMODELTEST (darwin.uvigo.es/software/jmodeltest.html)
Models of DNA Substitution
i. The probability of substitution between bases (e.g. A to C, C to T…)
ii. The probability of substitution along a sequence (different sites/regions evolve at different rates)
A Gamma Distribution Can be Used to Model Among-Site Rate Heterogeneity
Little among-site rate variation
Frequent among-site rate variation
Estimates of a Shape Parameter of Among Site Rate Variation
Gene                   α
Prolactin              1.37
Albumin                1.05
C-myc                  0.47
Cytochrome b (mtDNA)   0.44
Insulin                0.40
D-loop (mtDNA)         0.17
12S rRNA (mtDNA)       0.16
• Viruses are usually characterized by extensive among-site rate variation (α < 1).
• Giving a different rate to each codon position also works well for viruses
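To get a feel for what the shape parameter α means, the sketch below draws relative site rates from a gamma distribution with mean 1 (shape α, scale 1/α) and reports how many sites evolve very slowly. It is only an illustration: the function names, the sample size and the "slow" threshold of 0.1 are all assumptions, and phylogenetic programs typically use a discretised gamma rather than random draws.

```python
import random


def sample_site_rates(alpha: float, n_sites: int, seed: int = 0) -> list:
    """Draw relative rates from a gamma distribution with mean 1."""
    rng = random.Random(seed)
    # random.gammavariate(shape, scale): mean = shape * scale = 1.
    return [rng.gammavariate(alpha, 1.0 / alpha) for _ in range(n_sites)]


def fraction_slow(rates: list, threshold: float = 0.1) -> float:
    """Fraction of sites whose relative rate falls below the threshold."""
    return sum(r < threshold for r in rates) / len(rates)


if __name__ == "__main__":
    # Shape parameters taken from the table on the slide.
    for gene, alpha in [("Prolactin", 1.37), ("Insulin", 0.40),
                        ("12S rRNA (mtDNA)", 0.16)]:
        rates = sample_site_rates(alpha, 20000)
        mean = sum(rates) / len(rates)
        print(f"{gene}: alpha={alpha}, mean rate={mean:.2f}, "
              f"fraction of slow sites={fraction_slow(rates):.2f}")
```

With small α most sites are nearly invariant and a few evolve very quickly; with large α the rates cluster around the mean.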
Estimating Genetic Distance: SIVcpz vs HIVlai
• Uncorrected (p-distance) = 0.406
• Jukes-Cantor = 0.586
• Kimura 2-parameter = 0.602
• Hasegawa-Kishino-Yano = 0.611
• General reversible = 0.620
• General reversible + gamma = 1.017
Other Models
• Allowing a different rate of nucleotide substitution for each codon position in a coding sequence (SRD06; tends to work better than gamma distributions in RNA viruses)
• Allowing different sets of nucleotides to change along different lineages (“covarion” model), e.g. sites that are variable in bacteria might be conserved in eukaryotes
• Accounting for the non-independence of nucleotides (caused by protein and RNA secondary structures)
Inferring Phylogenetic Trees
Important Problems in Molecular Phylogenetic Analysis
• Is there a tree at all (e.g. recombination)?
• Many possible trees:
  - For 10 taxa there are 2 × 10^6 unrooted trees
  - For 50 taxa there are 3 × 10^74 unrooted trees
  - Hence the need for efficient and powerful search algorithms
• Choosing the right model of nucleotide substitution
• Rate variation among lineages (causes “long branch attraction”). Need a representative sample of taxa.
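The tree counts quoted above follow from the standard result that n taxa have (2n - 5)!! = 3 × 5 × ... × (2n - 5) distinct unrooted, strictly bifurcating trees. A short sketch (the function name is hypothetical) shows how quickly the number grows:

```python
def count_unrooted_trees(n_taxa: int) -> int:
    """Number of unrooted bifurcating trees for n taxa: (2n - 5)!!."""
    if n_taxa < 3:
        raise ValueError("need at least 3 taxa for an unrooted tree")
    count = 1
    # Multiply the odd numbers 3, 5, ..., 2n - 5.
    for k in range(3, 2 * n_taxa - 4, 2):
        count *= k
    return count


if __name__ == "__main__":
    for n in (4, 5, 10, 20, 50):
        print(f"{n} taxa: {count_unrooted_trees(n):.3e} unrooted trees")
```

Because Python integers are arbitrary precision, the count for 50 taxa is computed exactly rather than as a floating-point approximation.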
Why Having a Representative Sample of Taxa is Important
Long branch attraction
• Small tree: long branches drawn together (convergent sites pull branches together)
• Large tree: long branches far apart (convergent sites distributed across tree)
[Figure legend: symbols mark convergent sites and informative sites]
Tree-Building Methods
• No explicit model of sequence evolution:
  - Parsimony (application of the parsimony principle)
• Explicit model of sequence evolution:
  - Distance (pairwise comparison of sequences)
  - Maximum likelihood and Bayesian (statistical approach)
Methods for Inferring Phylogenetic Trees
• Parsimony (PAUP*): Find the tree with the minimum number of mutations between sequences (i.e. choose the tree with the least convergent evolution)
• Neighbor-Joining (PAUP*, MEGA): Estimate genetic distances between sequences and cluster these distances into a tree that minimises genetic distance over the whole tree
• Maximum Likelihood (PhyML, PAUP*, GARLI, RAxML, MEGA): Find the tree (and branch lengths) under which the observed sequence data are most probable, given a particular model of molecular evolution
• Bayesian (BEAST, MrBayes): Similar to likelihood, but incorporates information about the prior distribution of parameters. Also returns a (posterior) distribution of trees
Distance Methods
๏ Advantages:
  - Allows the use of an explicit model of evolution
  - Very fast
  - Simple
๏ Disadvantages:
  - Only produces one tree with no indication of its quality
  - Reduces all sequence information into a single distance value
  - Dependent on the evolutionary model used (preferably this model should be estimated from the data)
Optimality Methods
๏ Parsimony
  - Fast
  - Not statistically consistent with most models of evolution
  - “The” method for morphological data
๏ Maximum Likelihood
  - Requires explicit statement of evolutionary model
  - Slow
  - Statistically consistent
  - Most commonly used with molecular data
Maximum Likelihood in Phylogenetics
• Best described by Joe Felsenstein
‣ Felsenstein, J. (1981). Evolutionary trees from DNA sequences: a maximum likelihood approach. J. Mol. Evol. 17, 368-376
• Considered the most statistically valid approach to molecular phylogenetics along with the closely related Bayesian methods
• Allows us to incorporate extremely detailed models of molecular evolution
Likelihood
• Likelihood is a quantity proportional to the probability of observing an outcome/data/event X given a hypothesis H:
  P(X | H) or P(X | p)
• If the hypothesis is described by parameters p, we then talk about the likelihood
  L(p | X)
  that is, the likelihood of the parameters given the data.
• In this case the hypothesis is a tree + branch lengths and the data are the sequences
[Photo: R.A. Fisher (looking suitably grumpy)]
Assumes an explicit model of molecular evolution, such as those described previously
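A toy case makes the idea concrete. For just two sequences the "tree" is a single branch of length d, and under the Jukes-Cantor model the likelihood of d depends only on how many sites differ. The sketch below (hypothetical function names, with a simple grid search standing in for a proper optimiser) finds the d that maximises the log-likelihood; it should land close to the analytical Jukes-Cantor distance.

```python
import math


def jc_log_likelihood(d: float, n_sites: int, n_diff: int) -> float:
    """Log-likelihood of branch length d for two sequences under Jukes-Cantor."""
    # Probability that a site differs after d substitutions per site.
    p_diff = 0.75 * (1.0 - math.exp(-4.0 * d / 3.0))
    n_same = n_sites - n_diff
    # 1/4 is the equal frequency of the base in the first sequence;
    # each of the 3 alternative bases is equally likely (p_diff / 3).
    return (n_diff * math.log(0.25 * p_diff / 3.0)
            + n_same * math.log(0.25 * (1.0 - p_diff)))


def ml_distance(n_sites: int, n_diff: int,
                step: float = 0.001, d_max: float = 3.0) -> float:
    """Grid search for the branch length with the highest likelihood."""
    grid = [step * i for i in range(1, int(d_max / step) + 1)]
    return max(grid, key=lambda d: jc_log_likelihood(d, n_sites, n_diff))


if __name__ == "__main__":
    # Assumed example: 1000 aligned sites, 406 of which differ (P = 0.406).
    d_hat = ml_distance(1000, 406)
    print(f"maximum likelihood estimate of d: {d_hat:.3f}")
```

Real phylogenetic likelihood calculations sum over unobserved ancestral states at every internal node of the tree; this two-sequence case avoids that step entirely.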
Bayesian Phylogenetics
๏ Using Bayesian statistics, you search for a set of plausible trees instead of a single best tree
๏ In this method, the “space” that you search in is limited by prior information
๏ The posterior distribution of trees can be translated to a probability of any branching event
  - Allows estimate of uncertainty!
  - BUT incorporates prior beliefs
Searching Through ‘Tree Space’
Searching Through Tree Space
๏ There are two ways in which we can search through tree space to find the best tree for our data:
  – Branch-and-bound: finds the optimal tree by implicitly checking all possible trees (cutting off paths in the search tree that cannot possibly lead to optimal trees)
  – Heuristic: searches by randomly perturbing the tree; does not check all trees and cannot guarantee to find the optimal one(s). Most commonly used.
(exhaustive searching is only possible for very small data sets)
Heuristic searching
[Figure: likelihood plotted across trees, showing the global maximum likelihood tree, a local optimum, and two starting trees of the heuristic search]
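The dependence of a heuristic search on its starting point can be shown with a deliberately simplified sketch. Here "tree space" is reduced to a one-dimensional list of made-up log-likelihood scores and the search is a greedy hill climb to the better neighbour; real programs instead rearrange tree topologies (and often perturb them randomly), so everything below, including the scores and the function name, is a hypothetical stand-in.

```python
def hill_climb(landscape: list, start: int) -> int:
    """Greedy search: move to the better neighbour until none improves."""
    current = start
    while True:
        neighbours = [i for i in (current - 1, current + 1)
                      if 0 <= i < len(landscape)]
        if not neighbours:
            return current
        best = max(neighbours, key=lambda i: landscape[i])
        if landscape[best] <= landscape[current]:
            return current  # no neighbour is better: an optimum (maybe local)
        current = best


if __name__ == "__main__":
    # Hypothetical log-likelihood scores for ten neighbouring "trees".
    scores = [-120, -110, -105, -112, -118, -109, -101, -95, -99, -104]
    for start in (0, 4):
        end = hill_climb(scores, start)
        print(f"start at tree {start}: stops at tree {end} "
              f"(log-likelihood {scores[end]})")
```

Starting from the left the search stops on the lower peak, while starting from the middle it reaches the highest one, which is why heuristic searches are commonly repeated from several starting trees.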
Bootstrapping (How Robust is a Tree?)
Non-Parametric Bootstrap
• Statistical technique that uses random resampling of data to determine sampling error.
• Characters are resampled with replacement to create many replicate data sets. A tree is then inferred from each replicate.
• Agreement among the resulting trees is summarized with a consensus tree. The frequencies of occurrence of groups (bootstrap proportions) are a measure of support for those groups.
Parametric Bootstrap (Monte Carlo simulation)
• Compare the likelihoods of competing trees on the data.
• Simulate replicate sequences using the parameters (including the tree) obtained for the worse tree (null hypothesis).
• Compare the likelihoods of the trees for each replicate data set as before to create a null distribution.
Non-Parametric Bootstrapping
[Figure: a small data matrix with columns numbered 1-6, and replicate data sets 1 ... 1000 drawn from it]
Resample with replacement multiple times
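The resampling step of the non-parametric bootstrap can be sketched as follows. This covers only the creation of one replicate data set; inferring a tree from each replicate and building the consensus are left to a phylogenetics package. The function name and the dictionary representation of the alignment are assumptions made for this illustration.

```python
import random


def bootstrap_alignment(alignment: dict, rng: random.Random) -> dict:
    """Resample alignment columns (characters) with replacement."""
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) != 1:
        raise ValueError("all sequences must have the same aligned length")
    n_sites = lengths.pop()
    # Draw n_sites column indices; some columns repeat, others are left out.
    columns = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {name: "".join(seq[i] for i in columns)
            for name, seq in alignment.items()}


if __name__ == "__main__":
    # Hypothetical toy alignment of three taxa.
    aln = {"taxon1": "ACGTAC", "taxon2": "ACGTTT", "taxon3": "GGGTAC"}
    rng = random.Random(42)
    for replicate in range(3):
        print(replicate + 1, bootstrap_alignment(aln, rng))
```

The same column indices are applied to every sequence, so each replicate column is a complete original character rather than a mixture of sites.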
Detecting Recombination
Recombination & Reassortment
• The Problems:
  - Generates new genetic configurations
  - Complicates our attempts to infer phylogenetic history and other evolutionary processes (e.g. positive selection)
• The Solutions:
  - Find recombinants and remove them from the data set (usual plan)
  - Incorporate recombinants into an explicit evolutionary model (far harder)
• “Topological incongruence”, where different gene regions (or genes) produce different phylogenetic trees, is the strongest signal for recombination (although conservative)
Methods for Recombination Detection
• Measure level of linkage disequilibrium:
  - LDhat, D’
• Look for changes in patterns of sequence similarity (often pairwise):
  - GENECONV, RDP, Max Chi-Square, SimPlot, SiScan, TOPAL
• Look for incongruent phylogenetic trees:
  - BOOTSCAN, 3SEQ, LARD, PLATO, LIKEWIND
• Look for “networked” evolution:
  - SplitsTree, NeighborNet
• Look for excessive convergent evolution:
  - Homoplasy test, PIST
• See http://www.bioinf.manchester.ac.uk/recombination/programs.shtml for a more complete list
• Many of these methods are available in the Recombination Detection Program (RDP3): http://darwin.uvigo.es/rdp/rdp.html
Sliding Window Diversity Plots can Graphically Show Recombination (e.g. “SimPlot”)
• Magiorkinis et al. Gene 349, 165-171 (2005).
Hepatitis B virus
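The idea behind a sliding-window plot can be sketched with plain percentage identity. The code below is not SimPlot's algorithm, only a minimal stand-in: it slides a window along a query sequence and reports its identity to a reference, so that a recombinant shows a switch in which reference it most resembles. The function name, window size and toy sequences are all assumptions.

```python
def window_identity(query: str, reference: str, window: int, step: int) -> list:
    """Return (start, identity) for each window along two aligned sequences."""
    if len(query) != len(reference):
        raise ValueError("sequences must be aligned to the same length")
    results = []
    for start in range(0, len(query) - window + 1, step):
        q = query[start:start + window]
        r = reference[start:start + window]
        matches = sum(a == b for a, b in zip(q, r))
        results.append((start, matches / window))
    return results


if __name__ == "__main__":
    # Hypothetical parental sequences and a recombinant with a mid-point breakpoint.
    ref_a = "ACGT" * 10
    ref_b = "TGCA" * 10
    query = ref_a[:20] + ref_b[20:]
    to_a = window_identity(query, ref_a, window=10, step=5)
    to_b = window_identity(query, ref_b, window=10, step=5)
    for (start, id_a), (_, id_b) in zip(to_a, to_b):
        print(f"window at {start:2d}: identity to A = {id_a:.1f}, to B = {id_b:.1f}")
```

Windows that straddle the breakpoint give intermediate values for both references, which is what produces the crossing lines in this kind of plot.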
Detecting Recombination: Looking for Incongruent Trees
• Different genes produce different trees
[Figure: different trees relating A, B and C for gene region 1 and gene region 2]
Maximum likelihood break-point
• Programme “LARD” (a maximum likelihood approach)
• Compute likelihood of each possible breakpoint in the alignment
• Identify breakpoint with the highest likelihood in the alignment
• Compare recombination likelihood to that with no recombination
• Assess significance with Monte Carlo simulation
Although reassortment is commonplace in influenza virus, the occurrence of homologous recombination is more controversial
Analyzing Natural Selection
Ways of Measuring Selection Pressures (Especially Detecting Positive Selection)
• Phylogenetic methods: Identify cases of strong parallel or convergent evolution
• Population genetic methods:
  (i) Look for regional reductions in genetic diversity, usually using SNPs (commonly used with genomic data)
  (ii) Compare estimates of effective population size obtained using different measures of genetic diversity (e.g. the H statistic of Fay & Wu)
  (iii) Estimate the speed of allele fixation compared to neutrality
• Combined phylogenetic and population genetic methods: Compare the relative numbers of nonsynonymous (dN) and synonymous (dS) substitutions per site
Detecting Positive Selection by Examining Patterns/Rates of Fixation
• Bhatt S, Holmes EC & Pybus OG. (2011). The genomic rate of molecular adaptation of the human influenza A virus. Mol.Biol.Evol. 28, 2443-2451.
Measuring Selection Pressures
• Compare the ratio of nonsynonymous (dN) to synonymous (dS) substitutions per site (dN/dS):
         Ser Met Leu Gly Gly
  Seq 1: TCA ATG TTA GGG GGA
           †   * †     † **
  Seq 2: TCG ATA CTA GGT ATA
         Ser Ile Leu Gly Ile
  † Synonymous substitution
  * Nonsynonymous substitution
  dN/dS < 1.0 = purifying selection
  dN/dS ~ 1.0 = neutral evolution
  dN/dS > 1.0 = positive selection
• Cases where dN/dS > 1 are evidence for positive selection because the rate of fixation of nonsynonymous changes (dN) is greater than the neutral mutation rate (dS), which is impossible under genetic drift
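The distinction between synonymous and nonsynonymous changes can be checked mechanically by translating each codon with the standard genetic code. The sketch below only labels codon differences; it does not estimate dN or dS, which also require counting synonymous and nonsynonymous sites (as in the Nei & Gojobori method). The function name is hypothetical.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def classify_codon_changes(seq1: str, seq2: str) -> list:
    """Label each codon pair as identical, synonymous or nonsynonymous."""
    if len(seq1) != len(seq2) or len(seq1) % 3 != 0:
        raise ValueError("sequences must be aligned, in frame, and of equal length")
    calls = []
    for pos in range(0, len(seq1), 3):
        codon1 = seq1[pos:pos + 3].upper()
        codon2 = seq2[pos:pos + 3].upper()
        if codon1 == codon2:
            calls.append("identical")
        elif CODON_TABLE[codon1] == CODON_TABLE[codon2]:
            calls.append("synonymous")
        else:
            calls.append("nonsynonymous")
    return calls


if __name__ == "__main__":
    # The two sequences from the slide.
    seq1 = "TCAATGTTAGGGGGA"
    seq2 = "TCGATACTAGGTATA"
    print(classify_codon_changes(seq1, seq2))
    # -> ['synonymous', 'nonsynonymous', 'synonymous', 'synonymous', 'nonsynonymous']
```

Note that the last codon (GGA to ATA) differs at two positions but is counted here as a single nonsynonymous codon change, whereas the slide marks it with two asterisks.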
Analysing Selection Pressures in Genes Using dN/dS
• Pairwise methods:
  (i) Compute dS and dN in each pair of sequences and then compute the mean across all pairs
  (ii) Various methods, including:
    - Nei & Gojobori 1986 (distance matrix method)
    - Li et al. 1985 (distance matrix method)
    - Yang et al. 2000 (maximum likelihood method)
  (iii) Problems of pseudo-replication, sometimes use poor substitution models, and lack of power (many false-negatives)
• Site-by-site (and branch) methods:
  (i) Incorporate phylogenetic relationships of sequences (i.e. estimate dN/dS along a tree)
  (ii) Allow variable selection pressures among codons and realistic models of nucleotide substitution
  (iii) Can employ parsimony, likelihood or Bayesian methods
  (iv) Have now been extended to account for directional selection (DEPS)
  (v) Tendency for false-positive results, especially in branch-site methods
Datamonkey http://www.datamonkey.org/
• Online version of the more powerful HyPhy package
• Contains multiple (and continually updated) programs for the analysis of selection pressures (and recombination)
[Screenshot: Datamonkey.org home page, 27/02/13. The server lists the following analyses:]
• Find individual sites under diversifying/purifying selection
• Find individual sites under other types of selection
• Find individual lineages under diversifying selection
• Tests for alignment-wide evidence of selection
• Detect epistasis/co-evolution
• Screen for recombination
• Perform model selection
• Reconstruct ancestral sequences
News items shown on the page:
• February 2013: the FUBAR method paper was published in Mol Biol and Evol; FUBAR is described as much faster (can process 1000 sequences in < 10 mins) and statistically more robust than REL (or PAML)
• May 2012: the MEME manuscript was accepted by PLoS Genetics; MEME is the recommended method for identifying sites under selection and, unlike most other methods, can find signatures of episodic selection even when the majority of lineages are subject to purifying selection
Variable Selection Pressures in RNA Viruses
Intra-Host Evolution of HIV-1
Tip of the V3 loop (part of the envelope protein of HIV-1) - diversity in a single patient
SIHIGPGRAFYTTGE
SIPIGPGRAFYTTGQ
SIHIGPGGAFYTTGQ
SIHIGPGRAFYTTGD
SIPIGPGRAFYTTGD
GIHIGPGSAFYATGD
SIHIGPGRAFYTTGG
SIHIGPGRAVYTTGQ
GIHIGPGSAFYATGG
GIHIGPGRAVYTTEQ
RIHIGPGRAVYTTEQ
GIHIGPGSAFYATGR
RIYIGPGRAVYTTEQ
GIHIGPGSAVYATGG
RIYIGPGSAVYTTEQ
GIHIGPGSAFYATGG
RIGIGPGRSVYTAEQ
GIHIGPGSAVYATGD
GIHIGPGRAFYATGD
GIHIGPGRAVYTTGD
RIYIGPGRAVYTTDQ
• The HIV-1 envelope protein is under very strong positive selection to help the virus escape from the human immune response (the V3 loop contains epitopes for neutralising antibodies and cytotoxic T-lymphocytes (CTLs)).
• V3 loop dN/dS = 13.182 (Nielsen & Yang. Genetics 148, 929. 1998).