phylogenetics 101 - mismsmisms.net/.../uploads/2016/08/holmes_phylogenetics101.pdfmolecular...

27/02/13


Phylogenetics 101

Eddie Holmes

Sydney Emerging Infections and Biosecurity Institute, School of Biological Sciences and Sydney Medical School, The University of Sydney, Australia

Fogarty International Center, National Institutes of Health, USA

Charles Darwin (February 12th 1809 - April 19th 1882)

On the Origin of Species (published 24th November 1859)


History of Virology

• 1798: Jenner starts vaccination
• 1885: Pasteur makes rabies vaccine
• 1898: Viruses discovered
• 1900: Vector of yellow fever found
• 1918: Global influenza pandemic
• 1920s-30s: Many viruses discovered

History of ‘Darwinism’

• 1809: Darwin born
• 1831-1836: Voyage of ‘The Beagle’
• 1859: ‘On the Origin of Species’
• 1871: ‘Descent of Man’
• 1882: Darwin dies
• 1900: Mendel’s work rediscovered
• 1920s-30s: Neo-Darwinian synthesis

Yellow Fever

Evolutionary History of Yellow Fever Virus

• Strong correspondence between the timing and direction of the spread of yellow fever virus and the slave trade from West Africa to the Americas

Influenza virus H1N1/09pdm

• Likely emergence in the first 7 weeks of 2009
• Extensive spatial mixing, e.g. multiple entries into Asia and Europe
• Relatively rapid epidemic doubling time: 17.0 days (95% HPD: 12.9, 24.2 days)
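A doubling time can be converted to an exponential growth rate if one assumes simple exponential growth, N(t) = N0·exp(rt), so that r = ln 2 / doubling time. The Python sketch below applies that assumption to the figures quoted above; the function name is illustrative, and transforming the HPD limits this way is only indicative of the uncertainty in r.

```python
import math


def growth_rate_from_doubling_time(doubling_time_days: float) -> float:
    """Exponential growth rate r (per day), assuming N(t) = N0 * exp(r * t)."""
    if doubling_time_days <= 0:
        raise ValueError("doubling time must be positive")
    return math.log(2) / doubling_time_days


# Point estimate and 95% HPD limits of the doubling time quoted on the slide.
rate_mean = growth_rate_from_doubling_time(17.0)
rate_low = growth_rate_from_doubling_time(24.2)   # longest doubling time, slowest growth
rate_high = growth_rate_from_doubling_time(12.9)  # shortest doubling time, fastest growth

print(f"r = {rate_mean:.4f} per day (range {rate_low:.4f} to {rate_high:.4f})")
```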


Viral dynamics in time and space

[Figure labels: Time, Viral shedding, Immunity, Sampling scale]

• Intra-host: intra-host spatial dynamics; viral mutation and immune escape.
• Transmission network: dynamics of neutral and adaptive viral diversity across infection chains. Characterization of transmission and social networks.
• Population: measuring population-level spatial coupling, impact of viral immune escape on herd immunity.
• Metapopulation/regional/global: measuring metapopulation and global dynamics. Global spread and fate of escape variants and other mutants.

The Phylodynamics Approach

[Figure panels: Measles, HIV-1]

Epidemiological dynamics are written into the branching structure of phylogenetic trees

Phylodynamics and the Coalescent

This approach can be extended to examine patterns of viral spread in space (i.e. phylogeography)

[Figure: genealogy of samples A, B, C, D and E traced through discrete generations; axes: time, Ne(t)]

• Given sequence data that are time-structured, we can use Bayesian coalescent approaches to estimate values of:
  – evolutionary parameters
    • substitution rate
    • times to the most recent common ancestor (TMRCA)
  – demographic history: Ne(t)

Bayesian skyline plot


Useful Textbooks & Software

Books:
• Page RDM & Holmes EC. (1998). Molecular Evolution: A Phylogenetic Approach. Blackwell Science Ltd, Oxford.
• Lemey P, Salemi M & Vandamme A-M. (2009). The Phylogenetic Handbook, 2nd Edition. Cambridge University Press.

Computer Software:
• BEAST (Bayesian Evolutionary Analysis Sampling Trees) - http://beast.bio.ed.ac.uk/
• MEGA (Molecular Evolutionary Genetics Analysis) - http://megasoftware.net/
• MrBayes (Bayesian inference of phylogeny) - http://mrbayes.csit.fsu.edu/
• PhyML (Maximum likelihood phylogenetics) - http://www.atgc-montpellier.fr/phyml/
• HyPhy/DATAMONKEY (Selection, recombination & hypothesis testing) - http://datamonkey.org/
• RDP3 (Recombination detection program) - darwin.uvigo.es/rdp/rdp.html

Topics in Evolutionary Inference

• Estimating genetic distances between sequences
• Inferring phylogenetic trees
• Detecting recombination events
• The inference of selection pressures (particularly detecting positive selection)
• Estimating rates of evolutionary change
• Inferring demographic history (population dynamics)
• Phylogeography


Estimating Genetic Distances Between Sequences

Estimating Genetic Distance

SIVcpz ATGGGTGCGA GAGCGTCAGT TCTAACAGGG GGAAAATTAG ATCGCTGGGA
HIV-1  ATGGGTGCGA GAGCGTCAGT ATTAAGCGGG GGAGAATTAG ATCGATGGGA

SIVcpz AAAAGTTCGG CTTAGGCCCG GGGGAAGAAA AAGATATATG ATGAAACATT
HIV-1  AAAAATTCGG TTAAGGCCAG GGGGAAAGAA AAAATATAAA TTAAAACATA

SIVcpz TAGTATGGGC AAGCAGGGAG CTGGAAAGAT TCGCATGTGA CCCCGGGCTA
HIV-1  TAGTATGGGC AAGCAGGGAG CTAGAACGAT TCGCAGTTAA TCCTGGCCTG

SIVcpz ATGGAAAGTA AGGAAGGATG TACTAAATTG TTACAACAAT TAGAGCCAGC
HIV-1  TTAGAAACAT CAGAAGGCTG TAGACAAATA CTGGGACAGC TACAACCATC

SIVcpz TCTCAAAACA GGCTCAGAAG GACTGCGGTC CTTGTTTAAC ACTCTGGCAG
HIV-1  CCTTCAGACA GGATCAGAAG AACTTAGATC ATTATATAAT ACAGTAGCAA

SIVcpz TACTGTGGTG CATACATAGT GACATCACTG TAGAAGACAC ACAGAAAGCT
HIV-1  CCCTCTATTG TGTGCATCAA AGGATAGAGA TAAAAGACAC CAAGGAAGCT

SIVcpz CTAGAACAGC TAAAGCGGCA TCATGGAGAA CAACAGAGCA AAACTGAAAG
HIV-1  TTAGACAAGA TAGAG--GAA -----GAGCA AAACAAAAGT AA---GAAAA

SIVcpz TAACTCAGGA AGCCGTGAAG GGGGAGCCAG TCAAGGCGCT AGTGCCTCTG
HIV-1  AAGCACAGCA AGC-----AG CAGCTGACA- -CAGGACAC- AG--CAGC--

SIVcpz CTGGCATTAG TGGAAATTAC
HIV-1  CAGG--TCAG CCAAAATTAC


Multiple Substitutions at a Single Site - Hidden Information

• Example 1 (bases shown in the figure: A, A, C, T): only 1 mutation is counted when 2 have occurred
• Example 2 (bases shown in the figure: T, A, C, A): 0 mutations are counted when 3 have occurred

The Problem of Multiple Substitution

• When % divergence is low, the observed distance (p) is a good estimator of genetic distance (d)
• When % divergence is high, p underestimates d and a “correction statistic” is required, i.e. a model of DNA substitution

[Figure: % divergence (y-axis ticks: 25, 50, 75) against time; curves labelled Actual and Observed; region labelled Hidden information]


Models of DNA Substitution

i. The probability of substitution between bases (e.g. A to C, C to T…)
ii. The probability of substitution along a sequence (different sites/regions evolve at different rates)

• Models of DNA sequence evolution are required to recover the missing information through correcting for multiple substitutions.

Models of DNA Substitution 1 (Jukes-Cantor, 1969)

• Assumptions:
  i. All bases evolve independently
  ii. All bases are at equal frequency
  iii. Each base can change with equal probability (α)
  iv. Mutations arise according to a Poisson distribution (rare and independent events)
• From this the number of substitutions per site (d) can be estimated by:

$d = -\frac{3}{4}\ln\left(1 - \frac{4}{3}P\right)$

where P is the proportion of observed nucleotide differences between 2 sequences.
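The correction can be sketched in a few lines of Python. This is a minimal illustration, not the method used by any particular package: the function names are invented here, and columns containing a gap ("-") are simply skipped, which is one common convention among several. The example uses the first 30 aligned columns of the SIVcpz / HIV-1 alignment shown earlier.

```python
import math


def p_distance(seq_a: str, seq_b: str) -> float:
    """Proportion of differing sites, skipping columns with a gap in either sequence."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = differences = 0
    for base_a, base_b in zip(seq_a.upper(), seq_b.upper()):
        if base_a == "-" or base_b == "-":
            continue
        compared += 1
        if base_a != base_b:
            differences += 1
    if compared == 0:
        raise ValueError("no ungapped sites to compare")
    return differences / compared


def jukes_cantor_distance(p: float) -> float:
    """d = -(3/4) ln(1 - (4/3) p); undefined once p reaches 0.75 (saturation)."""
    if not 0 <= p < 0.75:
        raise ValueError("p must lie in [0, 0.75)")
    return -0.75 * math.log(1 - 4 * p / 3)


# First 30 aligned columns of the SIVcpz / HIV-1 alignment (spaces removed).
sivcpz = "ATGGGTGCGAGAGCGTCAGTTCTAACAGGG"
hiv1 = "ATGGGTGCGAGAGCGTCAGTATTAAGCGGG"

p = p_distance(sivcpz, hiv1)
d = jukes_cantor_distance(p)
print(f"p = {p:.4f}, Jukes-Cantor d = {d:.4f}")
```

Applied to p = 0.406, the uncorrected SIVcpz vs HIVlai distance reported later in these slides, the same function returns roughly 0.58, close to the Jukes-Cantor value of 0.586 quoted there (the small gap is consistent with rounding of p).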


[Figure: substitution paths among A, T, C and G, each labelled α]

All substitutions occur at the same rate (α)

Is this model too simple for real data?

[Figure: substitution paths among A, T, C and G, labelled α or β]

Transitions (α) and transversions (β) occur at different rates


Models of DNA Substitution 2 (Kimura 2-parameter, 1980)

• Assumptions:
  i. All bases evolve independently
  ii. All bases are at equal frequency
  iii. Transitions and transversions occur with different probabilities (α and β)
  iv. The Jukes-Cantor model is applied to transitions and transversions independently
• From this the expected number of substitutions per site (d) can be estimated by:

$d = -\frac{1}{2}\ln\left[(1 - 2P - Q)\sqrt{1 - 2Q}\right]$

where P is the proportion of observed transitions and Q the proportion of observed transversions between 2 sequences.
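As a hedged illustration of the Kimura 2-parameter correction, the sketch below counts transitions (A↔G, C↔T) and transversions between two aligned sequences and applies the formula. The function name is an assumption of this example, and columns containing anything other than A, C, G or T are skipped.

```python
import math

# Transitions are purine<->purine (A, G) or pyrimidine<->pyrimidine (C, T) changes.
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def k2p_distance(seq_a: str, seq_b: str) -> float:
    """Kimura 2-parameter distance: d = -(1/2) ln[(1 - 2P - Q) sqrt(1 - 2Q)]."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = n_transitions = n_transversions = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue  # skip gaps and ambiguity codes
        compared += 1
        if a == b:
            continue
        if (a, b) in TRANSITIONS:
            n_transitions += 1
        else:
            n_transversions += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    prop_ts = n_transitions / compared    # P in the formula
    prop_tv = n_transversions / compared  # Q in the formula
    if 1 - 2 * prop_ts - prop_tv <= 0 or 1 - 2 * prop_tv <= 0:
        raise ValueError("sequences too divergent for the K2P correction")
    return -0.5 * math.log((1 - 2 * prop_ts - prop_tv) * math.sqrt(1 - 2 * prop_tv))


# First 30 aligned columns of the SIVcpz / HIV-1 alignment (spaces removed):
# 1 transition and 3 transversions in 30 sites.
sivcpz = "ATGGGTGCGAGAGCGTCAGTTCTAACAGGG"
hiv1 = "ATGGGTGCGAGAGCGTCAGTATTAAGCGGG"
k2p = k2p_distance(sivcpz, hiv1)
print(f"K2P d = {k2p:.4f}")
```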

Models of DNA Substitution

Ordered from simplest (few parameters) to most complex (many parameters):

1. Base frequencies are equal and all substitutions are equally likely (Jukes-Cantor)
2. Base frequencies are equal but transitions and transversions occur at different rates (Kimura 2-parameter)
3. Unequal base frequencies and transitions and transversions occur at different rates (Hasegawa-Kishino-Yano)
4. Unequal base frequencies and all substitution types occur at different rates (General Reversible Model)

All these models can be tested using the program jMODELTEST (darwin.uvigo.es/software/jmodeltest.html)


Models of DNA Substitution

i. The probability of substitution between bases (e.g. A to C, C to T…)
ii. The probability of substitution along a sequence (different sites/regions evolve at different rates)

A Gamma Distribution Can be Used to Model Among-Site Rate Heterogeneity

Little among-site rate variation

Frequent among-site rate variation


Estimates of a Shape Parameter of Among Site Rate Variation

Gene                    α
Prolactin               1.37
Albumin                 1.05
C-myc                   0.47
Cytochrome b (mtDNA)    0.44
Insulin                 0.40
D-loop (mtDNA)          0.17
12S rRNA (mtDNA)        0.16

• Viruses are usually characterized by extensive among-site rate variation (α < 1).
• Giving a different rate to each codon position also works well for viruses
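To see why a small shape parameter matters, the sketch below uses the gamma-corrected form of the Jukes-Cantor distance, d = (3/4) α [(1 - 4p/3)^(-1/α) - 1]. Jukes-Cantor is chosen here only for brevity; it is an assumption of this example and not the general reversible + gamma model used for the distances reported in these slides. As α grows large the result approaches the uncorrected Jukes-Cantor distance; as α falls below 1 the estimated distance rises sharply.

```python
import math


def jc_distance(p: float) -> float:
    """Jukes-Cantor distance with equal rates at all sites."""
    return -0.75 * math.log(1 - 4 * p / 3)


def jc_gamma_distance(p: float, alpha: float) -> float:
    """Jukes-Cantor distance with gamma-distributed rates among sites (shape alpha)."""
    if not 0 <= p < 0.75:
        raise ValueError("p must lie in [0, 0.75)")
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    return 0.75 * alpha * ((1 - 4 * p / 3) ** (-1 / alpha) - 1)


p_observed = 0.406  # uncorrected SIVcpz vs HIVlai distance quoted in these slides
for alpha in (0.5, 1.0, 2.0, 1e6):
    print(f"alpha = {alpha:g}: d = {jc_gamma_distance(p_observed, alpha):.3f}")
print(f"no rate variation: d = {jc_distance(p_observed):.3f}")
```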

Estimating Genetic Distance: SIVcpz vs HIVlai

• Uncorrected (p-distance) = 0.406
• Jukes-Cantor = 0.586
• Kimura 2-parameter = 0.602
• Hasegawa-Kishino-Yano = 0.611
• General reversible = 0.620
• General reversible + gamma = 1.017


Other Models

• Allowing a different rate of nucleotide substitution for each codon position in a coding sequence (SRD06; tends to work better than gamma distributions in RNA viruses)
• Allowing different sets of nucleotides to change along different lineages (“covarion” model), e.g. sites that are variable in bacteria might be conserved in eukaryotes
• Accounting for the non-independence of nucleotides (caused by protein and RNA secondary structures)

Inferring Phylogenetic Trees


Important Problems in Molecular Phylogenetic Analysis

• Is there a tree at all (e.g. recombination)?
• Many possible trees:
  - For 10 taxa there are 2 × 10^6 unrooted trees
  - For 50 taxa there are 3 × 10^74 unrooted trees
  - This calls for efficient and powerful search algorithms
• Choosing the right model of nucleotide substitution
• Rate variation among lineages (causes “long branch attraction”). Need a representative sample of taxa.
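The tree counts above follow from the standard result that n taxa admit (2n - 5)!! = 1 × 3 × 5 × … × (2n - 5) distinct unrooted bifurcating trees. A short Python sketch (the function name is illustrative) reproduces the orders of magnitude quoted on the slide:

```python
def num_unrooted_trees(n_taxa: int) -> int:
    """Number of unrooted bifurcating trees for n_taxa labelled tips: (2n - 5)!!."""
    if n_taxa < 3:
        raise ValueError("need at least 3 taxa")
    count = 1
    # Multiply the odd numbers 3, 5, ..., 2n - 5 (empty product for n = 3).
    for odd in range(3, 2 * n_taxa - 4, 2):
        count *= odd
    return count


for n in (4, 10, 50):
    print(f"{n} taxa: {num_unrooted_trees(n):.3e} unrooted trees")
```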

Why Having a Representative Sample of Taxa is Important

[Figure: long branch attraction; symbols mark convergent sites and informative sites]

• Small tree: long branches drawn together (convergent sites pull branches together)
• Large tree: long branches far apart (convergent sites distributed across tree)


Tree-Building Methods

• No explicit model of sequence evolution:
  - Parsimony (application of the parsimony principle)
• Explicit model of sequence evolution:
  - Distance (pairwise comparison of sequences)
  - Maximum likelihood and Bayesian (statistical approach)

Methods for Inferring Phylogenetic Trees

• Parsimony (PAUP*): find the tree with the minimum number of mutations between sequences (i.e. choose the tree with the least convergent evolution)
• Neighbor-Joining (PAUP*, MEGA): estimate genetic distances between sequences and cluster these distances into a tree that minimises genetic distance over the whole tree
• Maximum Likelihood (PhyML, PAUP*, GARLI, RAxML, MEGA): determine the probability of a tree (and branch lengths) given a particular model of molecular evolution and the observed sequence data
• Bayesian (BEAST, MrBayes): similar to likelihood but where there is information about the prior distribution of parameters. Also returns a (posterior) distribution of trees
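To make the parsimony criterion concrete, the sketch below scores a single alignment column on a fixed tree with Fitch's algorithm, which gives the minimum number of changes that tree requires for that site. It is an illustration only: the nested-tuple tree representation, the tip names w to z and the site pattern are assumptions of this example, and real programs sum this score over all sites while also searching over trees.

```python
def fitch_score(tree, states):
    """Minimum number of changes for one site on a rooted binary tree.

    tree   -- a tip name (str) or a 2-tuple of subtrees
    states -- dict mapping each tip name to its base at this site
    """
    def visit(node):
        if isinstance(node, str):
            return {states[node]}, 0
        left, right = node
        left_set, left_cost = visit(left)
        right_set, right_cost = visit(right)
        shared = left_set & right_set
        if shared:
            # Children agree on at least one state: no extra change needed here.
            return shared, left_cost + right_cost
        # Children disagree: one change is required on a branch below this node.
        return left_set | right_set, left_cost + right_cost + 1

    return visit(tree)[1]


# Hypothetical site pattern for four taxa.
site = {"w": "A", "x": "A", "y": "T", "z": "T"}
tree_1 = (("w", "x"), ("y", "z"))  # groups taxa sharing a base
tree_2 = (("w", "y"), ("x", "z"))  # splits them up

print(fitch_score(tree_1, site), fitch_score(tree_2, site))
```

Under parsimony the first tree is preferred for this site because it needs fewer changes.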


Distance Methods

๏ Advantages:
  - Allows the use of an explicit model of evolution
  - Very fast
  - Simple
๏ Disadvantages:
  - Only produces one tree with no indication of its quality
  - Reduces all sequence information into a single distance value
  - Dependent on the evolutionary model used (preferably this model should be estimated from the data)

Optimality Methods

๏ Parsimony
  - Fast
  - Not statistically consistent with most models of evolution
  - “The” method for morphological data
๏ Maximum Likelihood
  - Requires explicit statement of evolutionary model
  - Slow
  - Statistically consistent
  - Most commonly used with molecular data


Maximum Likelihood in Phylogenetics

•  Best described by Joe Felsenstein

‣  Felsenstein, J. (1981). Evolutionary trees from DNA sequences: a maximum likelihood approach. J. Mol. Evol. 17, 368-376

•  Considered the most statistically valid approach to molecular phylogenetics along with the closely related Bayesian methods

•  Allows us to incorporate extremely detailed models of molecular evolution

Likelihood

• Likelihood is a quantity proportional to the probability of observing an outcome/data/event X given a hypothesis H: P ( X | H ) or P ( X | p )
• Viewed as a function of the parameters p, we then talk about the likelihood L ( p | X ), that is, the likelihood of the parameters given the data.
• In this case the hypothesis is a tree + branch lengths and the data are the sequences

[Photo: R.A. Fisher (looking suitably grumpy)]

Assumes an explicit model of molecular evolution, such as those described previously
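The simplest possible case is a "tree" of two sequences joined by one branch of length d. The sketch below, which assumes the Jukes-Cantor model and independent sites, writes out the log-likelihood of d and maximises it by a simple ternary search; the function names and the search bounds are choices made for this illustration. The maximum coincides with the Jukes-Cantor distance formula given earlier, which is that model's maximum likelihood estimate.

```python
import math


def jc_log_likelihood(d: float, n_sites: int, n_diff: int) -> float:
    """Log-likelihood of branch length d for two sequences under Jukes-Cantor."""
    decay = math.exp(-4.0 * d / 3.0)
    p_same = 0.25 + 0.75 * decay   # P(base unchanged after distance d)
    p_other = 0.25 - 0.25 * decay  # P(one specific different base after distance d)
    n_same = n_sites - n_diff
    # Each site contributes (1/4) * transition probability (equal base frequencies).
    return (n_sites * math.log(0.25)
            + n_same * math.log(p_same)
            + n_diff * math.log(p_other))


def ml_distance(n_sites: int, n_diff: int, lo: float = 1e-9, hi: float = 10.0) -> float:
    """Maximise the log-likelihood over d by ternary search (the curve has one peak)."""
    for _ in range(200):
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if jc_log_likelihood(m1, n_sites, n_diff) < jc_log_likelihood(m2, n_sites, n_diff):
            lo = m1
        else:
            hi = m2
    return (lo + hi) / 2


# Hypothetical data: 406 differences in 1000 aligned sites (p = 0.406).
d_hat = ml_distance(1000, 406)
d_formula = -0.75 * math.log(1 - 4 * 0.406 / 3)
print(f"ML estimate = {d_hat:.4f}, Jukes-Cantor formula = {d_formula:.4f}")
```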


Bayesian Phylogenetics

๏ Using Bayesian statistics, you search for a set of plausible trees instead of a single best tree
๏ In this method, the “space” that you search in is limited by prior information
๏ The posterior distribution of trees can be translated to a probability of any branching event
  - Allows estimate of uncertainty!
  - BUT incorporates prior beliefs

Searching Through ‘Tree Space’


Searching Through Tree Space

๏ There are two ways in which we can search through tree space to find the best tree for our data:
  – Branch-and-bound: finds the optimal tree by implicitly checking all possible trees (cutting off paths in the search tree that cannot possibly lead to optimal trees)
  – Heuristic: searches by randomly perturbing the tree, does not check all trees and cannot guarantee to find the optimal one(s). Most commonly used.

(exhaustive searching is only possible for very small data sets)

Heuristic searching

[Figure: likelihood (y-axis) across trees (x-axis), marking the global maximum likelihood tree, a local optimum, and two different starting trees of the heuristic search]


Bootstrapping (How Robust is a Tree?)

Non-Parametric Bootstrap
• Statistical technique that uses random resampling of data to determine sampling error.
• Characters are resampled with replacement to create many replicate data sets. A tree is then inferred from each replicate.
• Agreement among the resulting trees is summarized with a consensus tree. The frequencies of occurrence of groups (bootstrap proportions) are a measure of support for those groups.

Parametric Bootstrap (Monte Carlo simulation)
• Compare the likelihoods of competing trees on the data.
• Simulate replicate sequences using the parameters (including the tree) obtained for the worse tree (null hypothesis).
• Compare the likelihoods of the trees for each replicate data set as before to create a null distribution.

Non-Parametric Bootstrapping

[Figure: the columns (sites 1 to 6) of an alignment are resampled with replacement multiple times to give replicate data sets 1 ... 1000]
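The resampling step of the non-parametric bootstrap can be sketched as follows. The toy alignment, taxon names and function names are hypothetical; only the column resampling is shown, and inferring a tree from each replicate and building the consensus are left to a phylogenetics package.

```python
import random


def bootstrap_alignment(alignment: dict, rng: random.Random) -> dict:
    """Draw alignment columns with replacement to build one replicate of equal length."""
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) != 1:
        raise ValueError("all sequences must have the same aligned length")
    n_sites = lengths.pop()
    # The same sampled column indices are applied to every sequence.
    columns = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {name: "".join(seq[i] for i in columns) for name, seq in alignment.items()}


def bootstrap_replicates(alignment: dict, n_replicates: int, seed: int = 0) -> list:
    """Generate n_replicates resampled alignments from a seeded generator."""
    rng = random.Random(seed)
    return [bootstrap_alignment(alignment, rng) for _ in range(n_replicates)]


# Hypothetical 4-taxon, 6-site alignment.
toy_alignment = {
    "taxon1": "ACCTGG",
    "taxon2": "ACCTGG",
    "taxon3": "ACCTAC",
    "taxon4": "ATGGAA",
}
replicates = bootstrap_replicates(toy_alignment, n_replicates=1000, seed=1)
print(replicates[0])
```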


Detecting Recombination

Recombination & Reassortment

• The Problems:
  - Generates new genetic configurations
  - Complicates our attempts to infer phylogenetic history and other evolutionary processes (e.g. positive selection)
• The Solutions:
  - Find recombinants and remove them from the data set (usual plan)
  - Incorporate recombinants into an explicit evolutionary model (far harder)

• “Topological incongruence”, where different gene regions (or genes) produce different phylogenetic trees, is the strongest signal for recombination (although conservative)


Methods for Recombination Detection

• Measure level of linkage disequilibrium:
  - LDhat, D’
• Look for changes in patterns of sequence similarity (often pairwise):
  - GENECONV, RDP, Max Chi-Square, SimPlot, SiScan, TOPAL
• Look for incongruent phylogenetic trees:
  - BOOTSCAN, 3SEQ, LARD, PLATO, LIKEWIND
• Look for “networked” evolution:
  - SplitsTree, NeighborNet
• Look for excessive convergent evolution:
  - Homoplasy test, PIST
• See http://www.bioinf.manchester.ac.uk/recombination/programs.shtml for a more complete list
• Many of these methods are available in the Recombination Detection Program (RDP3): http://darwin.uvigo.es/rdp/rdp.html

Sliding Window Diversity Plots can Graphically Show Recombination (e.g. “SimPlot”)

• Magiorkinis et al. Gene 349, 165-171 (2005).

Hepatitis B virus
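The idea behind a sliding-window plot can be sketched with plain percent identity between a query and one reference sequence. This is a simplification made for illustration: the function name, window size and step are assumptions, and it does not reproduce SimPlot's own calculations. A recombinant would typically show an abrupt change in similarity to a given reference at the breakpoint.

```python
def sliding_window_identity(query: str, reference: str, window: int = 200, step: int = 20) -> list:
    """Return (window midpoint, fraction of identical sites) for each window."""
    if len(query) != len(reference):
        raise ValueError("sequences must be aligned to the same length")
    if window <= 0 or step <= 0:
        raise ValueError("window and step must be positive")
    results = []
    for start in range(0, len(query) - window + 1, step):
        pairs = zip(query[start:start + window], reference[start:start + window])
        matches = sum(1 for a, b in pairs if a == b)
        results.append((start + window // 2, matches / window))
    return results


# Toy example: the query matches the reference only in its first half.
toy_query = "A" * 10 + "C" * 10
toy_reference = "A" * 20
profile = sliding_window_identity(toy_query, toy_reference, window=10, step=10)
print(profile)
```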


Detecting Recombination: Looking for Incongruent Trees

• Different genes produce different trees

[Figure: trees for taxa A, B and C inferred from gene region 1 and gene region 2]

Maximum likelihood break-point

• Programme “LARD” (a maximum likelihood approach)
• Compute likelihood of each possible breakpoint in the alignment
• Identify breakpoint with the highest likelihood in the alignment
• Compare recombination likelihood to that with no recombination
• Assess significance with Monte Carlo simulation

Although reassortment is commonplace in influenza virus, the occurrence of homologous recombination is more controversial

Analyzing Natural Selection


Ways of Measuring Selection Pressures (Especially Detecting Positive Selection)

• Phylogenetic methods: identify cases of strong parallel or convergent evolution
• Population genetic methods:
  (i) Look for regional reductions in genetic diversity, usually using SNPs (commonly used with genomic data)
  (ii) Compare estimates of effective population size obtained using different measures of genetic diversity (e.g. the H statistic of Fay & Wu)
  (iii) Estimate the speed of allele fixation compared to neutrality
• Combined phylogenetic and population genetic methods: compare the relative numbers of nonsynonymous (dN) and synonymous (dS) substitutions per site

Detecting Positive Selection by Examining Patterns/Rates of Fixation

• Bhatt S, Holmes EC & Pybus OG. (2011). The genomic rate of molecular adaptation of the human influenza A virus. Mol.Biol.Evol. 28, 2443-2451.


Measuring Selection Pressures

• Compare the ratio of synonymous (dS) and nonsynonymous (dN) substitutions per site (dN/dS):

       Ser Met Leu Gly Gly
Seq 1: TCA ATG TTA GGG GGA
         †   * †     † **
Seq 2: TCG ATA CTA GGT ATA
       Ser Ile Leu Gly Ile

† Synonymous substitution
* Nonsynonymous substitution

dN/dS < 1.0 = purifying selection
dN/dS ~ 1.0 = neutral evolution
dN/dS > 1.0 = positive selection

• Cases where dN/dS > 1 are evidence for positive selection because the rate of fixation of nonsynonymous changes (dN) is greater than the neutral mutation rate (dS), which is impossible under genetic drift
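The labelling of the codon differences in the two-sequence example can be checked with a small Python sketch that translates each codon with the standard genetic code and marks a differing pair as synonymous when both codons encode the same amino acid. The function names are illustrative. This is only the first step of a dN/dS analysis: estimating dN and dS also requires counting synonymous and nonsynonymous sites and correcting for multiple hits, which this sketch does not do.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... ('*' = stop).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def split_codons(seq: str) -> list:
    """Remove spaces and split a coding sequence into codons."""
    seq = seq.replace(" ", "").upper()
    if len(seq) % 3 != 0:
        raise ValueError("sequence length must be a multiple of 3")
    return [seq[i:i + 3] for i in range(0, len(seq), 3)]


def translate(seq: str) -> str:
    """Translate a coding sequence to one-letter amino acids."""
    return "".join(CODON_TABLE[codon] for codon in split_codons(seq))


def classify_codon_changes(seq_1: str, seq_2: str) -> list:
    """Label each aligned codon pair as identical, synonymous or nonsynonymous."""
    codons_1, codons_2 = split_codons(seq_1), split_codons(seq_2)
    if len(codons_1) != len(codons_2):
        raise ValueError("sequences must contain the same number of codons")
    labels = []
    for c1, c2 in zip(codons_1, codons_2):
        if c1 == c2:
            labels.append("identical")
        elif CODON_TABLE[c1] == CODON_TABLE[c2]:
            labels.append("synonymous")
        else:
            labels.append("nonsynonymous")
    return labels


seq_1 = "TCA ATG TTA GGG GGA"
seq_2 = "TCG ATA CTA GGT ATA"
print(translate(seq_1), translate(seq_2))
print(classify_codon_changes(seq_1, seq_2))
```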

Analysing Selection Pressures in Genes Using dN/dS

• Pairwise methods:
  (i) Compute dS and dN in each pair of sequences and then compute the mean across all pairs
  (ii) Various methods, including:
    - Nei & Gojobori 1986 (distance matrix method)
    - Li et al. 1985 (distance matrix method)
    - Yang et al. 2000 (maximum likelihood method)
  (iii) Problems of pseudo-replication, sometimes use poor substitution models, and lack of power (many false-negatives)
• Site-by-site (and branch) methods:
  (i) Incorporate phylogenetic relationships of sequences (i.e. estimate dN/dS along a tree)
  (ii) Allow variable selection pressures among codons and realistic models of nucleotide substitution
  (iii) Can employ parsimony, likelihood or Bayesian methods
  (iv) Has now been extended to account for directional selection (DEPS)
  (v) Tendency for false-positive results, especially in branch-site methods


Datamonkey http://www.datamonkey.org/

• Online version of the more powerful HyPhy package
• Contains multiple (and continually updated) programs for the analysis of selection pressures (and recombination)

[Screenshot: Adaptive Evolution Server @ Datamonkey.org, home page as of 2/27/13]

News
• February 19th, 2013. The FUBAR method paper has now been published in Mol Biol and Evol. Give this much faster (can process 1000 sequences < 10 mins) and statistically more robust method than REL (or PAML) a try.
• May 3rd, 2012. The MEME manuscript has been accepted by PLoS Genetics. MEME is our recommended method for identifying sites under selection. Unlike most other methods, MEME can find signatures of episodic selection, even when the majority of lineages are subject to purifying selection.

Welcome to the free public server for comparative analysis of sequence alignments using state-of-the-art statistical models. This service is brought to you by the viral evolution group at the School of Medicine of the University of California, San Diego.

Datamonkey.org can help you answer the following questions:
• Find individual sites under diversifying/purifying selection
• Find individual sites under other types of selection
• Find individual lineages under diversifying selection
• Tests for alignment-wide evidence of selection
• Detect epistasis/co-evolution
• Screen for recombination
• Perform model selection
• Reconstruct ancestral sequences

Variable Selection Pressures in RNA Viruses


Intra-Host Evolution of HIV-1

Tip of the V3 loop (part of the envelope protein of HIV-1) - diversity in a single patient

SIHIGPGRAFYTTGE
SIPIGPGRAFYTTGQ
SIHIGPGGAFYTTGQ
SIHIGPGRAFYTTGD
SIPIGPGRAFYTTGD
GIHIGPGSAFYATGD
SIHIGPGRAFYTTGG
SIHIGPGRAVYTTGQ
GIHIGPGSAFYATGG
GIHIGPGRAVYTTEQ
RIHIGPGRAVYTTEQ
GIHIGPGSAFYATGR
RIYIGPGRAVYTTEQ
GIHIGPGSAVYATGG
RIYIGPGSAVYTTEQ
GIHIGPGSAFYATGG
RIGIGPGRSVYTAEQ
GIHIGPGSAVYATGD
GIHIGPGRAFYATGD
GIHIGPGRAVYTTGD
RIYIGPGRAVYTTDQ

• The HIV-1 envelope protein is under very strong positive selection to help the virus escape from the human immune response (the V3 loop contains epitopes for neutralising antibodies and cytotoxic T-lymphocytes (CTLs)).
• V3 loop dN/dS = 13.182 (Nielsen & Yang. Genetics 148, 929. 1998).
