Phylogenetic Inference

• Involves an attempt to estimate the evolutionary history of a collection of organisms (taxa) or a family of genes

• Two major components
– Estimation of the evolutionary tree (branching order)
– Using estimated trees (phylogenies) as an analytical framework for further evolutionary study

• Traditional role: systematics and classification

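Example 1: Closest living relatives of humans

[Figure: two alternative trees for humans, chimpanzees, bonobos, gorillas, and orangutans, each drawn against a time axis in millions of years ago (MYA); the tick labels are 0 and 15-30 on one tree and 0 and 14 on the other. One tree shows the pre-molecular view (morphology); the other shows the emerging picture from mtDNA, most nuclear genes, and DNA/DNA hybridization.]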

Example 2: Who are whales related to?

Morphological data suggest that whales are a “sister clade” to extant artiodactylans, but molecular data suggest strongly that whales and hippos are more closely related to each other than hippos are to other artiodactylans

Morphology

Mt and nuclear DNA sequences, SINEs, LINEs

Other interesting applications

Forensics: transmission of HIV by a Florida dentist

Phylogenetic tree of HIV sequences from the DENTIST, his patients, and local HIV-infected people:

[Figure: tree whose tips are HIV sequences from the dentist (two sequences), Patients A through G, and local controls 2, 3, 9, and 35. Groups of patients are annotated "Yes" or "No".]

Yes: The HIV sequences from these patients fall within the clade of HIV sequences found in the dentist.

From Ou et al. (1992) and Page & Holmes (1998), redrawn by Caro-Beth Stewart

Other interesting applications

Studying dynamics of microbial communities:

Sequence 16S rDNA to identify and quantify microbes in soil before and after pesticide exposure. Many of the microbes are previously unknown, so their gene sequences are studied phylogenetically to follow changes in community composition.

Known sequences from database

Novel microbial sequences

Other interesting applications

Predicting evolution of influenza viruses

Lineages with many mutations in one set of positively selected codons were usually the ones which led to successful strains in subsequent seasons

Other interesting applications

Predicting functions of uncharacterized genes

Use “character-mapping” to infer functions based on parsimonious reconstructions

Many situations where similarity-based methods are inadequate, e.g.:

Other interesting applications

• Drug discovery: predicting natural ligands for cell surface receptors that are potential drug targets (e.g., G-protein coupled receptors)

G-protein-coupled receptors are a pharmacologically important protein family with approximately 450 genes identified to date. Pathways involving these receptors are the targets of hundreds of drugs, including antihistamines, neuroleptics, antidepressants, and antihypertensives. The functions of many of these proteins are unknown, and determining ligands and signaling pathways is time-consuming and expensive. This difficulty motivates the search for a computational method which can predict ligand and second messenger with high reliability. Classifying this family of proteins helps us classify drugs, a technique which might be called "evolutionary pharmacology”… A computational method based on evolutionary tree reconstruction and employing an accepted-mutation stepmatrix can predict the ligand selectivities and intracellular signaling pathways of uncharacterized receptors, given only the amino acid sequence of the receptor. This dramatically increases the efficiency of functional characterization of new receptors. (http://www.cis.upenn.edu/~krice/receptor.html)

• Vaccine development—engineer vaccines to confer immunity against multiple virus populations by targeting their inferred common ancestors

Common Phylogenetic Tree Terminology

[Figure: tree with five terminal nodes (A, B, C, D, E), labeled as follows.]

• Ancestral node or ROOT of the tree
• Internal nodes or divergence points (represent hypothetical ancestors of the taxa)
• Branches (edges) and lineages
• Terminal nodes (A-E): represent the TAXA (genes, populations, species, etc.) used to infer the phylogeny

The goal of phylogeny inference is to resolve the branching orders of lineages in evolutionary trees:

[Figure: the same five taxa (A-E) drawn as three trees, with one node marked "Polytomy or multifurcation" and another marked "A bifurcation".]

• Completely unresolved or "star" phylogeny
• Partially resolved phylogeny
• Fully resolved, bifurcating phylogeny (binary tree)

C-B Stewart, NHGRI lecture, 12/5/00

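Three possible unrooted trees for four taxa (A, B, C, D)

Tree 1: A and B on one side of the internal branch, C and D on the other
Tree 2: A and C on one side of the internal branch, B and D on the other
Tree 3: A and D on one side of the internal branch, B and C on the other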

Phylogenetic tree building (or inference) methods are aimed at discovering which of the possible unrooted trees is "correct". We would like this to be the "true" biological tree, that is, one that accurately represents the evolutionary history of the taxa. However, we must settle for discovering the optimal tree for the phylogenetic method of choice (there is no guarantee that optimality = truth).

The number of unrooted trees increases faster than exponentially with the number of taxa

(2N - 5)!! = # unrooted trees for N taxa

Here !! is the double factorial, e.g. 7!! = 7 x 5 x 3 x 1 = 105.

[Figure: example unrooted trees for 3, 4, 5, and 6 taxa.]

Inferring evolutionary relationships between the taxa requires rooting the tree:

To root a tree mentally, imagine that the tree is made of string. Grab the string at the root and tug on it until the ends of the string (the taxa) fall opposite the root.

[Figure: an unrooted tree of taxa A, B, C, and D with the root position marked, next to the resulting rooted tree with tips in the order A, B, C, D.]

Note that in this rooted tree, taxon A is no more closely related to taxon B than it is to C or D.

Now, try it again with the root at another position:

[Figure: the same unrooted tree with the root at a different position, next to the resulting rooted tree with tips in the order C, D, A, B.]

Note that in this rooted tree, taxon A is most closely related to taxon B, and together they are equally distantly related to taxa C and D.

An unrooted, four-taxon tree theoretically can be rooted in five different places to produce five different rooted trees

[Figure: unrooted tree 1 (A and B on one side, C and D on the other) with root positions numbered 1 to 5, and the five resulting rooted trees, drawn with tips in the orders B A C D (rooted tree 1a, root 1), A B C D (1b, root 2), A B C D (1c, root 3), C D A B (1d, root 4), and D C A B (1e, root 5).]

These trees show five different evolutionary relationships among the taxa

[Figure: rooted tree 1a redrawn several times with its branches rotated, so that the tips appear in different orders.]

All of these rearrangements show the same evolutionary relationships between the taxa

There are two major ways to root trees:

By outgroup: Uses taxa (the "outgroup") that are known to fall outside of the group of interest (the "ingroup"). Requires some prior knowledge about the relationships among the taxa. The outgroup can either be species (e.g., birds to root a mammalian tree) or previous gene duplicates (e.g., α-globins to root β-globins).

By midpoint or distance: Roots the tree at the midway point between the two most distant taxa in the tree, as determined by branch lengths. Assumes that the taxa are evolving in a clock-like manner. This assumption is built into some of the distance-based tree building methods.

[Figure: a tree of taxa A, B, C, and D with branch lengths 10, 2, 3, 5, and 2; one taxon is labeled as the outgroup.]

d(A,D) = 10 + 3 + 5 = 18
Midpoint = 18 / 2 = 9
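The midpoint calculation can be sketched in a few lines of Python. This is only an illustration: the slide gives d(A,D) = 10 + 3 + 5, but the pairing of A with B and of C with D, the internal node names `n1` and `n2`, and the function names are assumptions made for the example.

```python
from itertools import combinations

# Hypothetical tree consistent with the slide's branch lengths (10, 2, 3, 5, 2).
# Only d(A, D) = 10 + 3 + 5 is given; the placement of B and C is assumed.
EDGES = {("A", "n1"): 10.0, ("B", "n1"): 2.0, ("n1", "n2"): 3.0,
         ("C", "n2"): 2.0, ("D", "n2"): 5.0}


def build_adjacency(edges):
    """Turn an edge list into a symmetric adjacency map with branch lengths."""
    adj = {}
    for (u, v), length in edges.items():
        adj.setdefault(u, {})[v] = length
        adj.setdefault(v, {})[u] = length
    return adj


def path_between(adj, start, goal):
    """Return the unique path between two nodes of a tree as a list of nodes."""
    stack = [(start, [start])]
    while stack:
        node, path = stack.pop()
        if node == goal:
            return path
        for nxt in adj[node]:
            if nxt not in path:
                stack.append((nxt, path + [nxt]))
    raise ValueError("nodes are not connected")


def path_length(adj, path):
    return sum(adj[a][b] for a, b in zip(path, path[1:]))


def midpoint_root(edges):
    """Find the two most distant tips and the branch holding the midpoint."""
    adj = build_adjacency(edges)
    tips = sorted(n for n in adj if len(adj[n]) == 1)
    pair = max(combinations(tips, 2),
               key=lambda p: path_length(adj, path_between(adj, *p)))
    path = path_between(adj, *pair)
    total = path_length(adj, path)
    walked = 0.0
    for a, b in zip(path, path[1:]):
        if walked + adj[a][b] >= total / 2:
            # The root sits on branch (a, b), this far from node a.
            return pair, total, (a, b), total / 2 - walked
        walked += adj[a][b]
    raise RuntimeError("midpoint not found")


pair, total, branch, offset = midpoint_root(EDGES)
print(pair, total, branch, offset)
```

Under these assumptions the midpoint falls on the branch leading to A, 9 units from the tip.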

Each unrooted tree theoretically can be rooted anywhere along any of its branches, so the number of rooted trees is the number of unrooted trees multiplied by the number of branches (# roots):

# Taxa    # Unrooted trees    x    # Roots    =    # Rooted trees
3         1                        3               3
4         3                        5               15
5         15                       7               105
6         105                      9               945
7         945                      11              10,395
8         10,395                   13              135,135
9         135,135                  15              2,027,025
...
30        ~8.69 x 10^36            57              ~4.95 x 10^38

(2N - 3)!! = # rooted trees for N taxa

[Figure: example unrooted trees for 4, 5, and 6 taxa.]
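The counts follow directly from the double-factorial formulas. The sketch below (function names are ours, chosen for illustration) recomputes them and checks that rooted = unrooted x roots for every N.

```python
def double_factorial(n: int) -> int:
    """n!! = n * (n - 2) * (n - 4) * ... down to 1 or 2; equals 1 for n <= 1."""
    result = 1
    while n > 1:
        result *= n
        n -= 2
    return result


def num_unrooted_trees(n_taxa: int) -> int:
    """(2N - 5)!! fully resolved unrooted trees for N >= 3 taxa."""
    if n_taxa < 3:
        raise ValueError("need at least 3 taxa")
    return double_factorial(2 * n_taxa - 5)


def num_roots(n_taxa: int) -> int:
    """An unrooted binary tree has 2N - 3 branches, each a possible root position."""
    return 2 * n_taxa - 3


def num_rooted_trees(n_taxa: int) -> int:
    """(2N - 3)!! fully resolved rooted trees for N taxa."""
    return double_factorial(2 * n_taxa - 3)


for n in list(range(3, 10)) + [30]:
    print(n, num_unrooted_trees(n), num_roots(n), num_rooted_trees(n))
```

Python integers are arbitrary precision, so the 30-taxon values are computed exactly rather than approximated.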

Types of data used in phylogenetic inference:

Character-based methods: Use the aligned characters, such as DNA or protein sequences, directly during tree inference.

Taxa         Characters
Species A    ATGGCTATTCTTATAGTACG
Species B    ATCGCTAGTCTTATATTACA
Species C    TTCACTAGACCTGTGGTCCA
Species D    TTGACCAGACCTGTGGTCCG
Species E    TTGACCAGTTCTCTAGTTCG

Distance-based methods: Transform the sequence data into pairwise distances (dissimilarities), and then use the matrix during tree building.

             A       B       C       D       E
Species A    ----    0.20    0.50    0.45    0.40
Species B    0.23    ----    0.40    0.55    0.50
Species C    0.87    0.59    ----    0.15    0.40
Species D    0.73    1.12    0.17    ----    0.25
Species E    0.59    0.89    0.61    0.31    ----

Above the diagonal (Example 1): uncorrected "p" distance (= observed proportion of sites that differ)
Below the diagonal (Example 2): Kimura 2-parameter distance (estimate of the true number of substitutions between taxa)
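Both kinds of distance can be recomputed from the five example sequences. The sketch below uses the standard Kimura 2-parameter formula, d = -(1/2) ln[(1 - 2P - Q) sqrt(1 - 2Q)], where P and Q are the observed proportions of transitions and transversions; the function names are illustrative.

```python
import math

SEQS = {
    "A": "ATGGCTATTCTTATAGTACG",
    "B": "ATCGCTAGTCTTATATTACA",
    "C": "TTCACTAGACCTGTGGTCCA",
    "D": "TTGACCAGACCTGTGGTCCG",
    "E": "TTGACCAGTTCTCTAGTTCG",
}
PURINES = {"A", "G"}


def p_distance(s1: str, s2: str) -> float:
    """Uncorrected distance: proportion of aligned sites that differ."""
    if len(s1) != len(s2):
        raise ValueError("sequences must be aligned to the same length")
    return sum(a != b for a, b in zip(s1, s2)) / len(s1)


def k2p_distance(s1: str, s2: str) -> float:
    """Kimura 2-parameter distance from transition and transversion proportions."""
    transitions = transversions = 0
    for a, b in zip(s1, s2):
        if a == b:
            continue
        # Purine<->purine or pyrimidine<->pyrimidine is a transition.
        if (a in PURINES) == (b in PURINES):
            transitions += 1
        else:
            transversions += 1
    p = transitions / len(s1)
    q = transversions / len(s1)
    return -0.5 * math.log((1 - 2 * p - q) * math.sqrt(1 - 2 * q))


names = sorted(SEQS)
for x in names:
    row = []
    for y in names:
        if x == y:
            row.append("----")
        elif x < y:
            row.append(f"{p_distance(SEQS[x], SEQS[y]):.2f}")    # above diagonal
        else:
            row.append(f"{k2p_distance(SEQS[x], SEQS[y]):.2f}")  # below diagonal
    print(x, *row)
```

The corrected distance is always at least as large as the p distance, and the gap widens as sequences diverge because multiple substitutions at the same site go unobserved.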

Similarity vs. Evolutionary Relationship:

Similarity and relationship are not the same thing, even though evolutionary relationship is inferred from certain types of similarity.

Similar: having likeness or resemblance (an observation)

Related: genetically connected (a historical fact)

Two taxa can be most similar without being most closely related:

[Figure: tree of Taxon A, Taxon B, Taxon C, and Taxon D with branch lengths 1, 1, 1, 6, 3, and 5.]

C is more similar in sequence to A (d = 3) than to B (d = 7), but C and B are most closely related (that is, C and B shared a common ancestor more recently than either did with A).

Character-based methods can tease apart types of similarity and theoretically find the true evolutionary tree. Similarity = relationship only if certain conditions are met (if the distances are 'ultrametric').

Types of Similarity

Observed similarity between two entities can be due to:

Evolutionary relationship:
• Shared ancestral characters ('symplesiomorphies')
• Shared derived characters ('synapomorphies')

Homoplasy (independent evolution of the same character):
• Convergent events (in either related or unrelated entities)
• Parallel events (in related entities)
• Reversals (in related entities)

[Figure: small trees with nucleotide states (C, G, T) at tips and nodes illustrating these cases.]

METRIC DISTANCES between any two or three taxa (a, b, and c) have the following properties:

Property 1: d(a, b) ≥ 0 (non-negativity)

Property 2: d(a, b) = d(b, a) (symmetry)

Property 3: d(a, b) = 0 if and only if a = b (distinctness)

and...

Property 4: d(a, c) ≤ d(a, b) + d(b, c) (triangle inequality)

[Figure: triangle with vertices a, b, and c and side lengths 6, 9, and 5.]

ULTRAMETRIC DISTANCES must satisfy the previous four conditions, plus:

Property 5: d(a, b) ≤ maximum [d(a, c), d(b, c)]

This implies that the two largest distances are equal, so that they define an isosceles triangle:

[Figure: triangle with vertices a, b, and c and side lengths 4, 6, and 6.]

If distances are ultrametric, then the sequences are evolving in a perfectly clock-like manner, so the distances can be used in UPGMA trees and for the most precise calculations of divergence dates.

Similarity = Relationship if the distances are ultrametric!

[Figure: rooted tree of a, b, and c with branch lengths 2, 2, 2, and 4.]
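The five properties are easy to test mechanically. The sketch below is illustrative only: the helper names are ours, and the assignment of the side lengths 6, 9, 5 to particular pairs of taxa is an assumption (the figure does not survive in the text).

```python
from itertools import permutations


def dist(d, x, y):
    """Look up a distance stored once per unordered pair (symmetry by construction)."""
    return 0.0 if x == y else d[frozenset((x, y))]


def is_metric(d, taxa, tol=1e-9):
    """Check non-negativity/distinctness and the triangle inequality."""
    for x, y in permutations(taxa, 2):
        if dist(d, x, y) <= 0:  # distinct taxa must be a positive distance apart
            return False
    return all(dist(d, a, c) <= dist(d, a, b) + dist(d, b, c) + tol
               for a, b, c in permutations(taxa, 3))


def is_ultrametric(d, taxa, tol=1e-9):
    """Metric plus the three-point condition d(a,b) <= max(d(a,c), d(b,c))."""
    return is_metric(d, taxa, tol) and all(
        dist(d, a, b) <= max(dist(d, a, c), dist(d, b, c)) + tol
        for a, b, c in permutations(taxa, 3))


# Side lengths from the two triangles; which pair gets which length is assumed.
metric_only = {frozenset("ab"): 6, frozenset("ac"): 9, frozenset("bc"): 5}
ultrametric = {frozenset("ab"): 4, frozenset("ac"): 6, frozenset("bc"): 6}

print(is_metric(metric_only, "abc"), is_ultrametric(metric_only, "abc"))
print(is_metric(ultrametric, "abc"), is_ultrametric(ultrametric, "abc"))
```

For the 4, 6, 6 triangle the two largest distances are equal, which is exactly the isosceles condition; for 6, 9, 5 the triangle inequality holds but the three-point condition does not.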

General strategy for estimating a phylogeny

1. Get data

2. Select an optimality criterion (e.g., parsimony, least-squares distance, maximum likelihood)

3. Choose a search strategy (e.g., stepwise addition with branch swapping, branch-and-bound)

4. Evaluate optimality criterion for each tree visited during search, always keeping track of best tree(s) found

Parsimony (optimality criterion)

• In general: choose the tree requiring the fewest number of (possibly weighted) character-state changes (= steps)

• Assume character independence; can calculate length required by each character and sum over characters to get total tree length

Parsimony variants used for molecular data

• Fitch parsimony (unordered/nonadditive): Each change counts 1 step, regardless of the nature of this change

• Transversion parsimony: changes between a purine (A or G) and a pyrimidine (C or T) (“transversions”) count 1, changes between two purines or between two pyrimidines (“transitions”) count 0

• Generalized parsimony: User specifies cost of each type of change

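[Figure: the nucleotides A, C, G, and T with a legend distinguishing changes that cost 1 step from changes that cost 3 steps.]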

Calculating tree lengths under parsimony using "brute force"

• For each character:
– Consider every possible ancestral state reconstruction
– Count total cost required for each of these reconstructions
– Sum over all characters

Cost matrices (rows and columns in the order A, C, G, T):

equal =        tv4 =
0 1 1 1        0 4 1 4
1 0 1 1        4 0 4 1
1 1 0 1        1 4 0 4
1 1 1 0        4 1 4 0

Example: one character with tip states G and A (joined at internal node 1) and C and C (joined at internal node 2). Each sum lists the five branch costs in the order G to node 1, A to node 1, node 1 to node 2, node 2 to C, node 2 to C.

Node 1    Node 2    equal              tv4
A         A         1+0+0+1+1 = 3      1+0+0+4+4 = 9
C         A         1+1+1+1+1 = 5      4+4+4+4+4 = 20
G         A         0+1+1+1+1 = 4      0+1+1+4+4 = 10
T         A         1+1+1+1+1 = 5      4+4+4+4+4 = 20
A         C         1+0+1+0+0 = 2      1+0+4+0+0 = 5
C         C         1+1+0+0+0 = 2      4+4+0+0+0 = 8
G         C         0+1+1+0+0 = 2      0+1+4+0+0 = 5
T         C         1+1+1+0+0 = 3      4+4+1+0+0 = 9
A         G         1+0+1+1+1 = 4      1+0+1+4+4 = 10
C         G         1+1+1+1+1 = 5      4+4+4+4+4 = 20
G         G         0+1+0+1+1 = 3      0+1+0+4+4 = 9
T         G         1+1+1+1+1 = 5      4+4+4+4+4 = 20
A         T         1+0+1+1+1 = 4      1+0+4+1+1 = 7
C         T         1+1+1+1+1 = 5      4+4+1+1+1 = 11
G         T         0+1+1+1+1 = 4      0+1+4+1+1 = 7
T         T         1+1+0+1+1 = 4      4+4+0+1+1 = 10

The minimum over all 16 reconstructions is 2 steps under equal costs and 5 under tv4.
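The brute-force enumeration for this example can be written directly: try all 4 x 4 assignments to the two internal nodes and keep the cheapest. The function and variable names below are illustrative; the tip states (G and A on one side, C and C on the other) and the two cost schemes are those of the example.

```python
from itertools import product

STATES = "ACGT"
PURINES = {"A", "G"}


def equal_cost(x: str, y: str) -> int:
    """Every change costs one step (Fitch parsimony)."""
    return 0 if x == y else 1


def tv4_cost(x: str, y: str) -> int:
    """Transitions cost 1, transversions cost 4."""
    if x == y:
        return 0
    return 1 if (x in PURINES) == (y in PURINES) else 4


def brute_force_length(tips_node1, tips_node2, cost):
    """Minimum total cost over all state assignments to the two internal nodes.

    Returns the minimum length and the list of (node1, node2) reconstructions
    that achieve it.
    """
    best_length, best_recons = None, []
    for n1, n2 in product(STATES, repeat=2):
        total = (sum(cost(n1, t) for t in tips_node1)
                 + cost(n1, n2)
                 + sum(cost(n2, t) for t in tips_node2))
        if best_length is None or total < best_length:
            best_length, best_recons = total, [(n1, n2)]
        elif total == best_length:
            best_recons.append((n1, n2))
    return best_length, best_recons


print(brute_force_length("GA", "CC", equal_cost))
print(brute_force_length("GA", "CC", tv4_cost))
```

With k internal nodes this costs 4^k evaluations per character, which is why it is only practical as a check on faster algorithms.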

Calculating tree lengths using dynamic programming

• Analogous to pairwise alignment: determine implications of each possible state assignment at one level (node) for length at next level (parent node)

Example: tips W = G, Y = A, X = C, Z = C. Node 1 joins W and Y, node 2 joins X and Z, and node 3 joins nodes 1 and 2. Costs come from the tv4 matrix (transitions cost 1, transversions cost 4). Every cost vector is listed in the order A, C, G, T.

Tips (0 for the observed state, ∞ for every other state):
W (G):  ∞  ∞  0  ∞
Y (A):  0  ∞  ∞  ∞
X (C):  ∞  0  ∞  ∞
Z (C):  ∞  0  ∞  ∞

Node 1 (children W and Y):
A: min(∞,∞,1,∞) + min(0,∞,∞,∞) = 1 + 0 = 1
C: min(∞,∞,4,∞) + min(4,∞,∞,∞) = 4 + 4 = 8
G: min(∞,∞,0,∞) + min(1,∞,∞,∞) = 0 + 1 = 1
T: min(∞,∞,4,∞) + min(4,∞,∞,∞) = 4 + 4 = 8
Cost vector: 1 8 1 8

Node 2 (children X and Z):
A: min(∞,4,∞,∞) + min(∞,4,∞,∞) = 4 + 4 = 8
C: min(∞,0,∞,∞) + min(∞,0,∞,∞) = 0 + 0 = 0
G: min(∞,4,∞,∞) + min(∞,4,∞,∞) = 4 + 4 = 8
T: min(∞,1,∞,∞) + min(∞,1,∞,∞) = 1 + 1 = 2
Cost vector: 8 0 8 2

Node 3 (children node 1 and node 2):
A: min(1,12,2,12) + min(8,4,9,6) = 1 + 4 = 5
C: min(5,8,5,9) + min(12,0,12,3) = 5 + 0 = 5
G: min(2,12,1,12) + min(9,4,8,6) = 1 + 4 = 5
T: min(5,9,5,8) + min(12,1,12,2) = 5 + 1 = 6
Cost vector: 5 5 5 6, so the tree length is 5
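The dynamic-programming pass over this example can be sketched as follows. The nested-tuple tree encoding and the function names are illustrative choices, not part of the original slides; the tv4 cost matrix and the tip states (W = G, Y = A, X = C, Z = C) are those of the worked example.

```python
STATES = "ACGT"
INF = float("inf")

# tv4 step matrix, rows and columns in the order A, C, G, T.
TV4 = [[0, 4, 1, 4],
       [4, 0, 4, 1],
       [1, 4, 0, 4],
       [4, 1, 4, 0]]


def tip_vector(state: str):
    """Cost 0 for the observed state, infinity for every other state."""
    return [0 if s == state else INF for s in STATES]


def combine(left, right, cost):
    """Cost vector of a parent node from the cost vectors of its two children."""
    return [
        min(cost[i][j] + left[j] for j in range(4))
        + min(cost[i][j] + right[j] for j in range(4))
        for i in range(4)
    ]


def cost_vector(tree, cost):
    """Post-order traversal; a tree is a state string (tip) or a (left, right) pair."""
    if isinstance(tree, str):
        return tip_vector(tree)
    left, right = tree
    return combine(cost_vector(left, cost), cost_vector(right, cost), cost)


node1 = cost_vector(("G", "A"), TV4)               # W and Y
node2 = cost_vector(("C", "C"), TV4)               # X and Z
root = cost_vector((("G", "A"), ("C", "C")), TV4)  # node 3
print(node1, node2, root, min(root))
```

Each node is visited once and each visit does a fixed amount of work per pair of states, so the cost grows linearly with the number of taxa rather than exponentially with the number of internal nodes.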

Faster algorithms for special cases

• Farris (1970) algorithm for ordered characters
• Fitch (1971) algorithm for unordered characters

Fitch algorithm:

• Assign "state sets" to terminal taxa based on observed data, and initialize tree length to 0
• Traverse tree from tips to root; for each node consider state sets of two immediate descendants (children)
– If child state sets have a nonempty intersection, new state set equals this intersection
– Otherwise, make new state set equal to the union of the two child state sets, and add 1 to the tree length

Example of tree length calculation using Fitch optimization

Tips: W = {G}:0, Y = {A}:0, X = {C}:0, Z = {C}:0
Node 1 (children W and Y): {A,G}:1
Node 2 (children X and Z): {C}:0
Node 3 (children node 1 and node 2): {A,C,G}:2
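A minimal sketch of the Fitch pass on the same example, assuming a nested-tuple encoding in which each tip is written as its observed state (the encoding and the function name are ours):

```python
def fitch(tree):
    """Return (state_set, length) for a tree given as nested (left, right) tuples.

    A tip is a one-letter string holding its observed state.
    """
    if isinstance(tree, str):
        return {tree}, 0
    left_set, left_len = fitch(tree[0])
    right_set, right_len = fitch(tree[1])
    common = left_set & right_set
    if common:
        # Nonempty intersection: no extra step is needed at this node.
        return common, left_len + right_len
    # Empty intersection: take the union and count one step.
    return left_set | right_set, left_len + right_len + 1


print(fitch(("G", "A")))                 # node 1
print(fitch(("C", "C")))                 # node 2
print(fitch((("G", "A"), ("C", "C"))))   # node 3, the whole tree
```

The length of 2 agrees with the minimum found for equal costs by exhaustive enumeration of ancestral states.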

Searching for trees

• Generation of all possible trees

1. Generate all 3 trees for first 4 taxa:

[Figure: the 3-taxon tree for A, B, and C, with taxon D added to each of its three branches to give the three 4-taxon trees.]

2. Generate all 15 trees for first 5 taxa:

[Figure: one of the 4-taxon trees with taxon E added to each of its five branches, giving five 5-taxon trees.]

(likewise for each of the other two 4-taxon trees)

3. Full search tree:

[Figure: the full search tree, from the 3-taxon tree for A, B, and C through the three 4-taxon trees to all fifteen 5-taxon trees.]
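The generation scheme (add each new taxon to every branch of every tree built so far) can be sketched as below. As an implementation convenience, and not something the slides prescribe, each unrooted tree is drawn as a rooted tree hanging from the first taxon, so that "every branch" becomes "every edge of the nested tuple, including the one above it".

```python
def insert_leaf(tree, leaf):
    """Yield every tree obtained by attaching `leaf` to one branch of `tree`."""
    yield (tree, leaf)  # the branch directly above this subtree
    if isinstance(tree, tuple):
        left, right = tree
        for t in insert_leaf(left, leaf):
            yield (t, right)
        for t in insert_leaf(right, leaf):
            yield (left, t)


def all_unrooted_trees(taxa):
    """All fully resolved unrooted trees, each drawn hanging from the first taxon."""
    first, second, third, *rest = taxa
    subtrees = [(second, third)]  # the single 3-taxon tree
    for leaf in rest:
        subtrees = [t for sub in subtrees for t in insert_leaf(sub, leaf)]
    return [(first, sub) for sub in subtrees]


def canonical(tree):
    """Order-independent string form, used only to confirm the trees are distinct."""
    if isinstance(tree, str):
        return tree
    return "(" + ",".join(sorted(canonical(child) for child in tree)) + ")"


for taxa in ("ABCD", "ABCDE", "ABCDEF"):
    trees = all_unrooted_trees(taxa)
    print(len(taxa), len(trees), len({canonical(t) for t in trees}))
```

A tree with k taxa has 2k - 3 branches, so each level of the search tree multiplies the number of trees by the next odd number (3, 5, 7, ...).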

Branch and bound algorithm:

[Figure: the search tree for taxa A-E, annotated with tree lengths, cross-bars, asterisks, and circled visit numbers.]

The branch-and-bound algorithm for exact solution of the problem of finding an optimal parsimony tree. The search tree is the same as for exhaustive search, with tree lengths for a hypothetical data set shown in boldface type. If a tree lying at a node of this search tree has a length that exceeds the current upper bound on the optimal tree length, this path of the search tree is terminated (indicated by a cross-bar), and the algorithm backtracks and takes the next available path. When a tip of the search tree is reached (i.e., when we arrive at a tree containing the full set of taxa), the tree is either optimal (and hence retained) or suboptimal (and rejected). When all paths leading from the initial 3-taxon tree have been explored, the algorithm terminates, and all most-parsimonious trees will have been identified. Asterisks indicate points at which the current upper bound is reduced. Circled numbers represent the order in which phylogenetic trees are visited in the search tree. See text for additional explanation.
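A compact sketch of branch and bound under Fitch parsimony follows. It relies on the fact that adding a taxon can never shorten a tree, so any partial tree already longer than the best complete tree found so far can be abandoned. The alignment is a small hypothetical data set, the function names are ours, and unrooted trees are drawn hanging from the first taxon purely as an implementation convenience.

```python
# Hypothetical aligned sequences, used only to exercise the search.
ALIGNMENT = {
    "A": "ATGGCTATTCTTATAGTACG",
    "B": "ATCGCTAGTCTTATATTACA",
    "C": "TTCACTAGACCTGTGGTCCA",
    "D": "TTGACCAGACCTGTGGTCCG",
    "E": "TTGACCAGTTCTCTAGTTCG",
}


def insert_leaf(tree, leaf):
    """Yield every tree obtained by attaching `leaf` to one branch of `tree`."""
    yield (tree, leaf)
    if isinstance(tree, tuple):
        left, right = tree
        for t in insert_leaf(left, leaf):
            yield (t, right)
        for t in insert_leaf(right, leaf):
            yield (left, t)


def fitch_length(tree, column):
    """Fitch length of one character; `column` maps taxon name to state."""
    def walk(node):
        if isinstance(node, str):
            return {column[node]}, 0
        left_set, left_len = walk(node[0])
        right_set, right_len = walk(node[1])
        common = left_set & right_set
        if common:
            return common, left_len + right_len
        return left_set | right_set, left_len + right_len + 1
    return walk(tree)[1]


def tree_length(tree, alignment):
    """Sum of Fitch lengths over all sites (characters are independent)."""
    n_sites = len(next(iter(alignment.values())))
    return sum(
        fitch_length(tree, {name: seq[j] for name, seq in alignment.items()})
        for j in range(n_sites)
    )


def branch_and_bound(alignment):
    taxa = sorted(alignment)
    state = {"bound": float("inf"), "trees": [], "visited": 0}

    def extend(subtree, remaining):
        tree = (taxa[0], subtree)
        state["visited"] += 1
        length = tree_length(tree, alignment)
        if length > state["bound"]:
            return  # prune: adding taxa cannot make this path shorter
        if not remaining:
            if length < state["bound"]:
                state["bound"], state["trees"] = length, []  # bound is reduced
            state["trees"].append(tree)
            return
        for bigger in insert_leaf(subtree, remaining[0]):
            extend(bigger, remaining[1:])

    extend((taxa[1], taxa[2]), taxa[3:])
    return state["bound"], state["trees"], state["visited"]


best_length, best_trees, visited = branch_and_bound(ALIGNMENT)
print(best_length, len(best_trees), visited)
```

Ties with the current bound are kept rather than pruned, so every most-parsimonious tree is returned. Real implementations usually start from a good heuristic bound and order the taxa to prune earlier; neither refinement is attempted here.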

Heuristic search methods

[Figure: the part of the search tree visited by stepwise addition, annotated with tree lengths and circled evaluation numbers.]

A greedy stepwise-addition search applied to the example used for branch-and-bound. The best 4-taxon tree is determined by evaluating the lengths of the three trees obtained by joining taxon D to tree 1 containing only the first three taxa. Taxa E and F are then connected to the five and seven possible locations, respectively, on trees 4 and 9, with only the shortest trees found during each step being used for the next step. In this example, the 233-step tree obtained is not a global optimum. Circled numbers indicate the order in which phylogenetic trees are evaluated in the stepwise-addition search.

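Heuristic search methods continued

Nearest neighbor interchange:

[Figure: a 6-taxon tree (taxa 1-6) and the rearrangements obtained by swapping subtrees across one of its internal branches.]

All possible NNIs on 6-taxon tree:

[Figure: the 6-taxon tree together with the trees reachable from it by a single nearest neighbor interchange.]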

Subtree pruning regrafting:

[Figure: a 6-taxon tree (taxa 1-6) with pruning points x, y, and z marked; for each pruned subtree, the candidate regrafting branches are labeled a, b, c, and d.]

Trees resulting from SPR:

[Figure: the rearranged trees, labeled by pruning point and regrafting branch: x.a, x.b, x.c, x.d, y.a, y.b, z.a, z.b, z.c, z.d.]

Tree bisection-reconnection:

[Figure: a 6-taxon tree (taxa 1-6) bisected at branch x, y, or z into two subtrees; branches of the two subtrees (labeled r, s, t and u, v, w) are candidates for reconnection, and an example reconnected tree is shown for each bisection.]

Reconnection distances:

[Figure: the subtree branches annotated with reconnection distances of 0, 1, or 2.]

Star-decomposition search

[Figure: star decomposition on five taxa (1-5) shown in three steps (Step 1, Step 2, Step 3), starting from the star tree and joining taxa until the tree is resolved.]

Other search strategies

• These "hill-climbing" methods work well for up to 20-30 taxa. For larger numbers of taxa, they are highly prone to entrapment in local optima. Therefore, additional strategies may be necessary:
– Random restart (random trees, stepwise addition with random addition sequences)
– Other optimization (meta)heuristics: iterated local search (restart after random perturbations); simulated annealing and other stochastic optimization methods
– Genetic algorithms and other population-based approaches

Overview of maximum likelihood as used in phylogenetics

• Overall goal: Find a tree topology (and associated parameter estimates) that maximizes the probability of obtaining the observed data, given a model of evolution

Likelihood(hypothesis) ∝ Prob(data | hypothesis)

Likelihood(tree, model) = k Prob(observed sequences | tree, model)

[not Prob(tree | data, model)]

Computing the likelihood of a single tree

       1        j       N
(1)    C…GGACA…C…GTTTA…C
(2)    C…AGACA…C…CTCTA…C
(3)    C…GGATA…A…GTTAA…C
(4)    C…GGATA…G…CCTAG…C

[Figure: tree with tips (1), (2), (3), and (4) carrying the site-j states C, C, A, and G, and internal nodes (5) and (6).]

Likelihood at site j = Prob(tips C, C, A, G; internal nodes A, A) + Prob(tips C, C, A, G; internal nodes A, C) + … + Prob(tips C, C, A, G; internal nodes T, T)

That is, the site likelihood sums over all 16 assignments of states to the two internal nodes. But use Felsenstein (1981) pruning algorithm

$$L = L_1 L_2 \cdots L_N = \prod_{j=1}^{N} L_j$$

$$\ln L = \ln L_1 + \ln L_2 + \cdots + \ln L_N = \sum_{j=1}^{N} \ln L_j$$
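The following sketch computes one site likelihood two ways, by the explicit sum over internal-node states and by pruning, under the Jukes-Cantor model. Several things here are assumptions made for illustration: the topology (node 5 joins tips 1 and 2; node 6 joins node 5 and tips 3 and 4), the branch lengths, the use of Jukes-Cantor, and all function names.

```python
import math
from itertools import product

STATES = "ACGT"


def jc_prob(i: str, j: str, v: float) -> float:
    """Jukes-Cantor probability of ending in state j after branch length v
    (expected substitutions per site), starting from state i."""
    e = math.exp(-4.0 * v / 3.0)
    return 0.25 + 0.75 * e if i == j else 0.25 - 0.25 * e


def conditional(node, site):
    """Pruning step: likelihood of the data below `node` for each state at `node`.

    A tip is a name (str); an internal node is a list of (child, branch_length).
    """
    if isinstance(node, str):
        return {s: 1.0 if s == site[node] else 0.0 for s in STATES}
    result = {s: 1.0 for s in STATES}
    for child, v in node:
        below = conditional(child, site)
        for s in STATES:
            result[s] *= sum(jc_prob(s, x, v) * below[x] for x in STATES)
    return result


def site_likelihood(root, site):
    """Average the root conditionals over the equilibrium frequencies (1/4 each)."""
    cond = conditional(root, site)
    return sum(0.25 * cond[s] for s in STATES)


# Assumed topology and branch lengths; the slide gives only the tip states.
NODE5 = [("1", 0.1), ("2", 0.2)]
NODE6 = [(NODE5, 0.05), ("3", 0.3), ("4", 0.4)]
SITE_J = {"1": "C", "2": "C", "3": "A", "4": "G"}

pruned = site_likelihood(NODE6, SITE_J)

# Explicit sum over all 16 assignments to internal nodes (5) and (6).
explicit = sum(
    0.25 * jc_prob(x6, x5, 0.05)
    * jc_prob(x5, "C", 0.1) * jc_prob(x5, "C", 0.2)
    * jc_prob(x6, "A", 0.3) * jc_prob(x6, "G", 0.4)
    for x5, x6 in product(STATES, repeat=2)
)
print(pruned, explicit, math.log(pruned))
```

The two numbers agree because pruning only reorders the sum, moving each inner summation as far down the tree as possible. The log-likelihood of a full alignment is then the sum of `math.log(site_likelihood(...))` over sites.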

Finding the maximum-likelihood tree (in principle)

• Evaluate the likelihood of each possible tree for a given collection of taxa.

• Choose the tree topology which maximizes the likelihood over all possible trees.

Probability calculations require…

• An explicit model of substitution that specifies change probabilities for a given branch length:

$$
Q = \begin{pmatrix}
\pi_A r_{AA} & \pi_C r_{AC} & \pi_G r_{AG} & \pi_T r_{AT} \\
\pi_A r_{CA} & \pi_C r_{CC} & \pi_G r_{CG} & \pi_T r_{CT} \\
\pi_A r_{GA} & \pi_C r_{GC} & \pi_G r_{GG} & \pi_T r_{GT} \\
\pi_A r_{TA} & \pi_C r_{TC} & \pi_G r_{TG} & \pi_T r_{TT}
\end{pmatrix}
$$

$$P(\nu) = e^{Q\nu}$$

Models:
Jukes-Cantor
Kimura 2-parameter
Hasegawa-Kishino-Yano (HKY)
Felsenstein 1981, 1984
General time-reversible

• An estimate of optimal branch lengths in units of expected amount of change (ν = rate x time)
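To make the matrix exponential concrete, the sketch below builds a rate matrix with off-diagonal entries π_j r_ij and exponentiates it with a plain Taylor series plus scaling and squaring. Two conventions are assumptions on our part: the diagonal is set so that each row sums to zero, and Q is rescaled so that branch lengths are in expected substitutions per site. The example rates and frequencies are the general time-reversible values used elsewhere in these slides; function names are illustrative.

```python
import math

STATES = "ACGT"


def build_q(rates, freqs):
    """Q_ij = pi_j * r_ij off the diagonal; rows sum to 0; mean rate scaled to 1."""
    q = [[0.0] * 4 for _ in range(4)]
    for i, a in enumerate(STATES):
        for j, b in enumerate(STATES):
            if i != j:
                q[i][j] = freqs[b] * rates[frozenset((a, b))]
        q[i][i] = -sum(q[i])
    mean_rate = -sum(freqs[a] * q[i][i] for i, a in enumerate(STATES))
    return [[x / mean_rate for x in row] for row in q]


def mat_mul(a, b):
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_exp(m, squarings=10, terms=12):
    """exp(m) by Taylor series on m / 2**squarings, followed by repeated squaring."""
    n = len(m)
    scaled = [[x / (2 ** squarings) for x in row] for row in m]
    result = [[float(i == j) for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms + 1):
        term = [[x / k for x in row] for row in mat_mul(term, scaled)]
        result = [[result[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    for _ in range(squarings):
        result = mat_mul(result, result)
    return result


def transition_matrix(q, nu):
    """P(nu) = exp(Q * nu)."""
    return mat_exp([[x * nu for x in row] for row in q])


GTR_RATES = {frozenset("AC"): 5, frozenset("AG"): 6, frozenset("AT"): 2,
             frozenset("CG"): 3, frozenset("CT"): 8, frozenset("GT"): 1}
GTR_FREQS = {"A": 0.1, "C": 0.2, "G": 0.3, "T": 0.4}
JC_RATES = {pair: 1 for pair in GTR_RATES}
JC_FREQS = {s: 0.25 for s in STATES}

p_gtr = transition_matrix(build_q(GTR_RATES, GTR_FREQS), 0.8)
p_jc = transition_matrix(build_q(JC_RATES, JC_FREQS), 0.3)
print([round(sum(row), 6) for row in p_gtr])
print(p_jc[0][0], 0.25 + 0.75 * math.exp(-4 * 0.3 / 3))
```

For Jukes-Cantor the numerical result can be checked against the closed form 1/4 + 3/4 e^{-4ν/3}; production code would normally use an eigendecomposition or a library routine instead of a hand-rolled series.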

A Family of Reversible Substitution Models

[Figure: hierarchy of models running from GTR down to JC, with each step labeled by the restriction that yields the simpler model.]

Models in the hierarchy:
GTR (general time-reversible)
SYM
TrN (Tamura-Nei)
HKY85 (Hasegawa-Kishino-Yano)
F84 (Felsenstein)
F81 (Felsenstein)
K3ST (Kimura 3-subst. type)
K2P (Kimura 2-parameter)
JC (Jukes-Cantor)

Restrictions labeling the steps:
• Equal base frequencies
• 3 substitution types (transitions, 2 transversion classes)
• 3 substitution types (transversions, 2 transition classes)
• 2 substitution types (transitions vs. transversions)
• Single substitution type

E.g., transition probabilities for HKY and F84:

$$
P_{ij}(\nu) =
\begin{cases}
\pi_j + \pi_j \left( \dfrac{1}{\Pi_j} - 1 \right) e^{-\mu\nu} + \left( \dfrac{\Pi_j - \pi_j}{\Pi_j} \right) e^{-\mu\nu A} & (i = j) \\[2ex]
\pi_j + \pi_j \left( \dfrac{1}{\Pi_j} - 1 \right) e^{-\mu\nu} - \left( \dfrac{\pi_j}{\Pi_j} \right) e^{-\mu\nu A} & (i \neq j,\ \text{transition}) \\[2ex]
\pi_j \left( 1 - e^{-\mu\nu} \right) & (i \neq j,\ \text{transversion})
\end{cases}
$$
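The three cases can be coded directly and sanity-checked. The slide does not define Π_j or A; the sketch assumes the usual conventions, namely that Π_j is the summed frequency of the purines (if j is a purine) or of the pyrimidines (if j is a pyrimidine), and that for HKY A = 1 + Π_j(κ - 1) with κ the transition/transversion rate ratio. The function name and parameter values are illustrative.

```python
import math

PURINES = ("A", "G")
PYRIMIDINES = ("C", "T")
STATES = "ACGT"


def hky_probability(i, j, nu, freqs, kappa, mu=1.0):
    """P_ij(nu) for HKY, following the three-case formula.

    Assumptions (not stated on the slide): Pi_j sums the frequencies of j's
    class (purines or pyrimidines) and A = 1 + Pi_j * (kappa - 1).
    """
    group = PURINES if j in PURINES else PYRIMIDINES
    big_pi = sum(freqs[s] for s in group)
    a = 1.0 + big_pi * (kappa - 1.0)
    e1 = math.exp(-mu * nu)
    e2 = math.exp(-mu * nu * a)
    pj = freqs[j]
    if i == j:
        return pj + pj * (1.0 / big_pi - 1.0) * e1 + ((big_pi - pj) / big_pi) * e2
    if i in group:  # i and j in the same class: a transition
        return pj + pj * (1.0 / big_pi - 1.0) * e1 - (pj / big_pi) * e2
    return pj * (1.0 - e1)  # transversion


FREQS = {"A": 0.1, "C": 0.2, "G": 0.3, "T": 0.4}
for i in STATES:
    row = [hky_probability(i, j, 0.5, FREQS, kappa=2.0) for j in STATES]
    print(i, [round(x, 4) for x in row], round(sum(row), 6))
```

Each row sums to 1, the matrix reduces to the identity at ν = 0, and with equal base frequencies and κ = 1 the diagonal collapses to the Jukes-Cantor form 1/4 + 3/4 e^{-μν}.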

The Relevance of Branch Lengths

[Figure: the site pattern C C A A A A A A A A shown at the tips of a tree, with states A and C at internal nodes; the tree is repeated across successive builds of the slide.]

Concerns about statistical properties and suitability of models (assumptions)

Consistency

If an estimator converges to the true value of a parameter as the amount of data increases toward infinity, the estimator is consistent.

Two levels of maximization

• Nei (1987)
– "…the likelihood computed in this method is conditional for each topology, so it is not clear whether or not the topology showing the highest likelihood has the highest probability of being the true topology…"

• Yang (1996)
– "Literally it is a maximum maximum likelihood method… The failure to recognize the complexity of the problem has caused much controversy … Felsenstein (1973, 1978) referred to the regularity conditions of Wald (1949) for a proof of …consistency. These conditions would include the continuity and differentiability of the likelihood function with respect to the topology parameter. These concepts are not defined."

"Likelihood" is consistent.

• Two proofs:
– Chang (1996) in Mathematical Biosciences
– Rogers (1997) in Systematic Biology

These proofs establish that the probability that the true tree has a higher likelihood than any other possible tree approaches one as the number of sites (characters) increases toward infinity. Chang called his proof a "customized variant of the fundamental consistency result of Wald."

When does maximum likelihood work better than parsimony?

• When you're in the "Felsenstein Zone"

[Figure: four-taxon tree of A, B, C, and D (Felsenstein, 1978).]

In the Felsenstein Zone

Substitution rates:

     A    C    G    T
A    -    5    6    2
C    5    -    3    8
G    6    3    -    1
T    2    8    1    -

Base frequencies: A=0.1 C=0.2 G=0.3 T=0.4

[Figure: four-taxon tree of A, B, C, and D with branch lengths 0.1, 0.1, 0.1, 0.8, and 0.8.]

[Plot: proportion correct (0 to 1) against sequence length (0 to 10,000) for parsimony and ML-GTR.]

The long-branch attraction (LBA) problem

The true phylogeny of 1, 2, 3 and 4

[Figure: four-taxon tree of taxa 1, 2, 3, and 4. In every pattern below, taxa 2 and 3 both have state A.]

Pattern type                      Taxon 1    Taxon 4
I = Uninformative (constant)      A          A          (zero changes required on any tree)
II = Uninformative                A          G          (one change required on any tree)
III = Uninformative               C          G          (two changes required on any tree)
IV = Misinformative               G          G          (two changes required on true tree)

… but this tree needs only one step

[Figure: tree that groups taxa 1 and 4 (both G) apart from taxa 2 and 3 (both A).]

When do both methods fail?

• When there is insufficient phylogenetic signal...

[Figure: four-taxon tree of taxa 1, 2, 3, and 4.]

When does parsimony work "better" than maximum likelihood?

• When you're in the Inverse-Felsenstein ("Farris") zone

[Figure: four-taxon tree of A, B, C, and D (Siddall, 1998).]

Siddall (1998) parameter space

[Figure: four-taxon tree with branches labeled a and b, and a plot of pa against pb (each from 0 to 0.75) divided into regions labeled "Both methods do poorly", "Parsimony has higher accuracy than likelihood", and "Both methods do well".]

Parsimony vs. likelihood in the Inverse-Felsenstein Zone

[Plot: accuracy (0 to 1) against sequence length (20 to 100,000) for parsimony and ML/JC; the inset tree has branch lengths of 15%, 67.5%, and 67.5% (expected differences/site).]

Why does parsimony do so well in the Inverse-Felsenstein zone?

[Figure: example site patterns (states A, C, and G) on the four-taxon tree, labeled "True synapomorphy" and "Apparent synapomorphies actually due to misinterpreted homoplasy".]

[Plot: proportion of parsimony-informative sites for which ancestral states are correctly reconstructed and interpreted as synapomorphies, shown over two branch-length parameters (each from 0 to 0.7) for a tree with branches labeled x and y.]

[Plot: proportion of parsimony-informative sites that are interpreted as synapomorphies but are actually misinterpreted homoplasies, shown over the same two parameters (axes labeled p and q).]

Parsimony vs. likelihood in the Felsenstein Zone

[Plot: accuracy (0 to 1) against sequence length (20 to 100,000) for parsimony and ML/JC; the inset tree has branch lengths of 15%, 67.5%, and 67.5% (expected differences/site).]

From the Farris Zone to the Felsenstein Zone

[Figure: four-taxon trees (taxa A, B, C, D) spanning the Farris zone to the Felsenstein zone]

External branches = 0.5 or 0.05 substitutions/site, Jukes-Cantor model of nucleotide substitution

Simulation results:

[Figure: two panels, Parsimony and Likelihood (ML/JC): accuracy (0 to 1.0) vs. length of internal branch d (0.05 down to 0 in the Farris zone, 0 up to 0.05 in the Felsenstein zone) for 100, 1,000, and 10,000 sites]
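A simulation of this kind can be sketched in a few lines. The version below is a minimal illustration, not the code behind the slide: it assumes a true tree ((A,B),(C,D)), puts the 0.5 branches on A and C for the Felsenstein zone and on A and B for the Farris zone, and uses an internal branch of 0.01 with 5,000 sites. Those choices, the random seed, and all function names are assumptions made for the example.

```python
import math
import random

BASES = "ACGT"

def jc_change_prob(d):
    """Probability that a site differs across a branch of length d under Jukes-Cantor."""
    return 0.75 * (1.0 - math.exp(-4.0 * d / 3.0))

def evolve(state, d, rng):
    """Evolve one site along a branch of length d (substitutions/site)."""
    if rng.random() < jc_change_prob(d):
        return rng.choice([b for b in BASES if b != state])
    return state

def informative_counts(lengths, internal, n_sites, rng):
    """Simulate the true tree ((A,B),(C,D)) and count the parsimony-informative
    sites that support each of the three possible four-taxon topologies."""
    counts = {"AB|CD": 0, "AC|BD": 0, "AD|BC": 0}
    for _ in range(n_sites):
        x = rng.choice(BASES)          # state at the node joining A and B
        y = evolve(x, internal, rng)   # state at the node joining C and D
        sa = evolve(x, lengths["A"], rng)
        sb = evolve(x, lengths["B"], rng)
        sc = evolve(y, lengths["C"], rng)
        sd = evolve(y, lengths["D"], rng)
        if sa == sb and sc == sd and sa != sc:
            counts["AB|CD"] += 1
        elif sa == sc and sb == sd and sa != sb:
            counts["AC|BD"] += 1
        elif sa == sd and sb == sc and sa != sb:
            counts["AD|BC"] += 1
    return counts

def parsimony_choice(counts):
    """With four taxa, the shortest tree is the one backed by the most informative sites."""
    return max(counts, key=counts.get)

rng = random.Random(1)  # arbitrary seed for reproducibility
felsenstein = {"A": 0.5, "B": 0.05, "C": 0.5, "D": 0.05}  # long branches are not sisters
farris = {"A": 0.5, "B": 0.5, "C": 0.05, "D": 0.05}       # long branches are sisters
fz = informative_counts(felsenstein, 0.01, 5000, rng)
iz = informative_counts(farris, 0.01, 5000, rng)
print("Felsenstein zone:", fz, "->", parsimony_choice(fz))
print("Farris zone:     ", iz, "->", parsimony_choice(iz))
```

Under these settings the parallel changes on the two long branches dominate the informative sites. They mislead parsimony when the long branches are not sisters and favor the true tree when they are, which is the misinterpreted-homoplasy effect described earlier.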

Maximum likelihood models are oversimplifications of reality. If I assume the wrong model, won’t my results be meaningless?

• Not necessarily (maximum likelihood is pretty robust)

Returning to earlier example...

Substitution rates:

     A   C   G   T
A    -   5   6   2
C    5   -   3   8
G    6   3   -   1
T    2   8   1   -

Base frequencies: A=0.1 C=0.2 G=0.3 T=0.4

[Figure: four-taxon tree (A, B, C, D) with branch lengths 0.1, 0.1, 0.1, 0.8, and 0.8]
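To make the example concrete, the sketch below turns the substitution rates and base frequencies above into an instantaneous rate matrix. It assumes the table entries are symmetric relative rates in the general time-reversible (GTR) sense, so that the rate from i to j is the table entry times the frequency of j, and it rescales the matrix to one expected substitution per unit time. The function name is hypothetical.

```python
BASES = "ACGT"

# Relative substitution rates and base frequencies from the example above.
RATES = {("A", "C"): 5.0, ("A", "G"): 6.0, ("A", "T"): 2.0,
         ("C", "G"): 3.0, ("C", "T"): 8.0, ("G", "T"): 1.0}
FREQS = {"A": 0.1, "C": 0.2, "G": 0.3, "T": 0.4}

def gtr_rate_matrix(rates, freqs):
    """Build a GTR rate matrix Q with q_ij = r_ij * pi_j (assumed parameterization)."""
    q = {i: {j: 0.0 for j in BASES} for i in BASES}
    for i in BASES:
        for j in BASES:
            if i != j:
                r_ij = rates[(i, j)] if (i, j) in rates else rates[(j, i)]
                q[i][j] = r_ij * freqs[j]
        # Diagonal entries make each row sum to zero.
        q[i][i] = -sum(q[i][j] for j in BASES if j != i)
    # Rescale so that branch lengths are in expected substitutions per site.
    mean_rate = -sum(freqs[i] * q[i][i] for i in BASES)
    return {i: {j: q[i][j] / mean_rate for j in BASES} for i in BASES}

Q = gtr_rate_matrix(RATES, FREQS)
for i in BASES:
    print(i, [round(Q[i][j], 3) for j in BASES])
```

A simpler model such as JC or HKY constrains this matrix (equal rates, or only a transition/transversion distinction), which is the sense in which it is "violated" when data are generated under these rates.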

Performance of ML when its model is violated (one example)

[Figure: accuracy (0 to 1) vs. sequence length (100, 1000, 10000) for parsimony, ML-JC, ML-K2P, ML-HKY, and ML-GTR]

Performance of ML when its model is violated (another example)

Modeling among-site rate variation with a gamma distribution...

[Figure: gamma distributions of relative rates; frequency (0 to 0.08) vs. rate (0 to 2) for shape parameter α = 0.5, 2, 50, and 200]

...can also estimate a proportion of “invariable” sites (pinv)
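The effect of the shape parameter can be explored numerically. The sketch below assumes the usual convention that the gamma distribution is scaled to a mean rate of 1, and that variable-site rates are divided by (1 - pinv) so the overall mean stays at 1 when invariable sites are added. Both conventions and the function names are assumptions for illustration, not taken from the slides.

```python
import math
import random

def gamma_rate_pdf(r, alpha):
    """Density at rate r > 0 of a gamma distribution with shape alpha and mean 1.
    Computed in log space so that large alpha (e.g. 200) does not overflow."""
    log_pdf = (alpha * math.log(alpha) + (alpha - 1.0) * math.log(r)
               - alpha * r - math.lgamma(alpha))
    return math.exp(log_pdf)

def sample_site_rates(n_sites, alpha, p_inv, rng):
    """Draw per-site relative rates: zero with probability p_inv, otherwise
    gamma-distributed and rescaled so the mean over all sites is about 1."""
    rates = []
    for _ in range(n_sites):
        if rng.random() < p_inv:
            rates.append(0.0)  # invariable site
        else:
            # random.gammavariate takes (shape, scale); scale 1/alpha gives mean 1.
            rates.append(rng.gammavariate(alpha, 1.0 / alpha) / (1.0 - p_inv))
    return rates

for alpha in (0.5, 2.0, 50.0, 200.0):
    densities = [round(gamma_rate_pdf(r, alpha), 3) for r in (0.1, 0.5, 1.0, 1.5)]
    print("alpha =", alpha, "density at r = 0.1, 0.5, 1.0, 1.5:", densities)

rng = random.Random(7)  # arbitrary seed
rates = sample_site_rates(20000, 0.5, 0.5, rng)
print("mean rate:", sum(rates) / len(rates))
print("fraction invariable:", rates.count(0.0) / len(rates))
```

Small α concentrates most sites near rate zero with a long tail of fast sites, while large α approaches equal rates across sites.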


[Figure: grid of panels, one row per tree and one column per rate-heterogeneity setting (α = 0.5, pinv = 0.5; α = 1.0, pinv = 0.5; α = 1.0, pinv = 0.2): proportion correct (0 to 1) vs. sequence length (100, 1000, 10000, 100000) for GTRig, HKYig, GTRg, HKYg, GTRi, HKYi, GTRer, HKYer, and parsimony]

[Figure: three panels (α = 1.0, pinv = 0.5), “MODERATE” Felsenstein zone, “MODERATE” Inverse-Felsenstein zone, and “MODERATE” equal branch lengths: proportion correct (0 to 1) vs. sequence length (100, 1000, 10000, 100000) for JCer, JC+G, JC+I, JC+I+G, GTRer, GTR+G, GTR+I, GTR+I+G, and parsimony]



Extension to more taxa...

[Figure: proportion correct (0 to 1) vs. sequence length (200, 1000, 10000) for HKY+I+Γ, HKY+Γ, HKY+I, HKYer, and parsimony]

Distance methods

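[Figure: unrooted four-taxon tree; A and B on terminal branches v1 and v2, C and D on terminal branches v4 and v5, joined by internal branch v3]

"Input" distance matrix:

     A     B     C     D
A    -     dAB   dAC   dAD
B    dBA   -     dBC   dBD
C    dCA   dCB   -     dCD
D    dDA   dDB   dDC   -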

Distances are "additive" if, e.g.:

pAB = v1 + v2 = dAB

pAC = v1 + v3 + v4 = dAC

pAD = v1 + v3 + v5 = dAD

pBC = v2 + v3 + v4 = dBC

pBD = v2 + v3 + v5 = dBD

pCD = v4 + v5 = dCD
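The additivity relations above can be checked mechanically. This sketch computes the path-length distances from the five branch lengths and tests the four-point condition (of the three pairwise sums, the two largest must be equal), which holds for any additive set of four-taxon distances. The branch-length values and function names are made up for illustration.

```python
def path_distances(v1, v2, v3, v4, v5):
    """Path-length distances p_ij on the tree ((A,B),(C,D)) with
    terminal branches v1, v2, v4, v5 and internal branch v3."""
    return {
        ("A", "B"): v1 + v2,
        ("A", "C"): v1 + v3 + v4,
        ("A", "D"): v1 + v3 + v5,
        ("B", "C"): v2 + v3 + v4,
        ("B", "D"): v2 + v3 + v5,
        ("C", "D"): v4 + v5,
    }

def four_point_condition(d, tol=1e-9):
    """True if the two largest of the three pairwise sums are equal (within tol)."""
    sums = sorted([
        d[("A", "B")] + d[("C", "D")],
        d[("A", "C")] + d[("B", "D")],
        d[("A", "D")] + d[("B", "C")],
    ])
    return abs(sums[2] - sums[1]) <= tol

additive = path_distances(0.1, 0.2, 0.05, 0.3, 0.15)  # hypothetical branch lengths
print(four_point_condition(additive))                 # distances read off a tree

noisy = dict(additive)
noisy[("A", "C")] += 0.1                              # perturb one observed distance
print(four_point_condition(noisy))
```

Distances estimated from real sequences behave like the perturbed case, which is why an optimality criterion is needed.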

Distances in general will not be additive, so choose the optimal tree according to one of the following criteria (objective functions):

"Goodness-of-fit": minimize $\sum_{i<j} w_{ij} \, |p_{ij} - d_{ij}|^{r}$

Typically, r = 2 (least-squares) and $w_{ij} = 1/d_{ij}^{2}$ ("Fitch-Margoliash" method)

"Minimum evolution": minimize $\sum_{k=1}^{\#\text{branches}} |v_k|$ or $\sum_{k=1}^{\#\text{branches}} v_k$
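Both criteria are simple to evaluate once branch lengths have been fitted to a candidate tree. The sketch below scores one hypothetical tree; the observed distances d, path-length distances p, and branch lengths are invented numbers, and the function names are assumptions for the example. It takes the minimum-evolution sum either over absolute branch lengths or over raw (possibly negative) ones.

```python
def goodness_of_fit(p, d, r=2, weight=lambda dij: 1.0 / dij ** 2):
    """Weighted fit between path-length distances p and observed distances d.
    Defaults (r = 2, w_ij = 1/d_ij^2) correspond to the Fitch-Margoliash weighting."""
    return sum(weight(d[pair]) * abs(p[pair] - d[pair]) ** r for pair in d)

def minimum_evolution(branch_lengths, absolute=True):
    """Tree length: sum of |v_k| (absolute=True) or of raw v_k."""
    return sum(abs(v) if absolute else v for v in branch_lengths)

# Hypothetical observed distances and the path lengths implied by one fitted tree.
d = {"AB": 0.30, "AC": 0.50, "AD": 0.30, "BC": 0.50, "BD": 0.40, "CD": 0.50}
p = {"AB": 0.30, "AC": 0.45, "AD": 0.30, "BC": 0.55, "BD": 0.40, "CD": 0.45}

fm_score = goodness_of_fit(p, d)
unweighted = goodness_of_fit(p, d, weight=lambda dij: 1.0)
tree_length = minimum_evolution([0.1, 0.2, 0.05, 0.3, 0.15])
print(fm_score, unweighted, tree_length)
```

A tree search would repeat this scoring over candidate topologies and keep the tree with the smallest value of the chosen criterion.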

Neighbor joining:

A fast approximation to full searching under the minimum-evolution criterion using star-decomposition with iteratively updated branch lengths

Uses the relationship:

dAX = (dAB + dAC - dBC)/2

(etc.)

[Figure: three-taxon star tree with tips A, B, and C joined at internal node X]
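The relationship above, together with its two analogues covered by "(etc.)", gives all three branch lengths of a three-taxon star from the pairwise distances. The sketch below is a direct transcription; the function name and the example distances are hypothetical.

```python
def three_point_branch_lengths(d_ab, d_ac, d_bc):
    """Branch lengths from tips A, B, C to the internal node X of a star tree."""
    d_ax = (d_ab + d_ac - d_bc) / 2.0
    d_bx = (d_ab + d_bc - d_ac) / 2.0
    d_cx = (d_ac + d_bc - d_ab) / 2.0
    return d_ax, d_bx, d_cx

d_ax, d_bx, d_cx = three_point_branch_lengths(0.3, 0.5, 0.6)
print(d_ax, d_bx, d_cx)
# The fitted branches reproduce the input distances exactly:
print(d_ax + d_bx, d_ax + d_cx, d_bx + d_cx)
```

Neighbor joining applies this kind of calculation repeatedly, each time treating a group of already-joined taxa as a single composite tip.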

Bayesian Inference in Phylogenetics

• Uses Bayes formula:

$$\Pr(q \mid D) = \frac{\Pr(D \mid q)\,\Pr(q)}{\Pr(D)} \propto \Pr(D \mid q)\,\Pr(q) \propto L(q)\,\Pr(q)$$

• Calculation involves integrating over all tree topologies and model-parameter values, subject to assumed prior distribution on parameters


• To approximate this posterior density (complicated multidimensional integral) we use Markov chain Monte Carlo (MCMC)
  – Simulated Markov chain in which transition probabilities are assigned such that the stationary distribution of the chain is the posterior density of interest
  – E.g., Metropolis-Hastings algorithm: Accept a proposed move from one state q to another state q* with probability min(r,1) where

$$r = \frac{\Pr(q^{*} \mid D)\,\Pr(q \mid q^{*})}{\Pr(q \mid D)\,\Pr(q^{*} \mid q)}$$

  – Sample chain at regular intervals to approximate posterior distribution
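The acceptance rule can be illustrated on a toy problem. In the sketch below the "states" are just three tree topologies with made-up unnormalized posterior weights (likelihood times prior), and the proposal picks one of the other two topologies uniformly. A real phylogenetic sampler would also propose branch lengths and model parameters; everything named here is an assumption for the example.

```python
import random
from collections import Counter

def metropolis_hastings(unnorm_posterior, propose, proposal_prob,
                        start, n_steps, thin, rng):
    """Run a Metropolis-Hastings chain and return samples taken every `thin` steps.
    proposal_prob(new, old) is the probability of proposing `new` from `old`."""
    state = start
    samples = []
    for step in range(1, n_steps + 1):
        candidate = propose(state, rng)
        # r = [Pr(q*|D) Pr(q|q*)] / [Pr(q|D) Pr(q*|q)]; Pr(D) cancels in the
        # ratio, so an unnormalized posterior is sufficient.
        r = (unnorm_posterior(candidate) * proposal_prob(state, candidate)) / (
            unnorm_posterior(state) * proposal_prob(candidate, state))
        if rng.random() < min(r, 1.0):
            state = candidate
        if step % thin == 0:
            samples.append(state)
    return samples

# Hypothetical unnormalized posterior weights for three topologies.
WEIGHTS = {"((A,B),(C,D))": 6.0, "((A,C),(B,D))": 3.0, "((A,D),(B,C))": 1.0}

def propose(state, rng):
    """Propose one of the two other topologies with equal probability."""
    return rng.choice([t for t in WEIGHTS if t != state])

def proposal_prob(new, old):
    return 0.5 if new != old else 0.0

rng = random.Random(42)  # arbitrary seed
samples = metropolis_hastings(WEIGHTS.get, propose, proposal_prob,
                              "((A,D),(B,C))", 60000, 10, rng)
freqs = {t: n / len(samples) for t, n in Counter(samples).items()}
print(freqs)
```

The sampled frequencies should approach the normalized weights (0.6, 0.3, 0.1), even though the chain never computes the normalizing constant Pr(D).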


