
Maximum Parsimony

Probabilistic Models of Evolution

Distance Based Methods

Lecture 12

© Shlomo Moran, Ilan Gronau

2

Maximum Parsimony

A Character-based reconstruction method

Input:

h sequences (one per species), all of length k.

Goal:

Find a tree whose leaves are labeled by the input sequences, and an assignment of sequences to internal nodes, such that the total number of substitutions is minimized.

3

Parsimony score

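[Figure: two trees on the leaf sequences AGA, GGA, AAA and AAG, with sequences assigned to the internal nodes. In the first tree the edges carry 1, 1 and 1 substitutions: Parsimony score = 3. In the second tree they carry 1, 1 and 2 substitutions: Parsimony score = 4.]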

The parsimony score of a leaf-labeled tree T is the minimum possible number of mutations over all assignments of sequences to internal vertices of T.
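As a minimal sketch of what is being counted, the snippet below sums the substitutions along the edges of a tree whose internal nodes are already labeled. The node names `u`, `v`, `a`..`d`, the topology, and the helper names are assumptions made for illustration; the leaf sequences are borrowed from the example above. The parsimony score is the minimum of this quantity over all internal labelings.

```python
def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def labeled_tree_cost(edges, labels):
    """Total number of substitutions over all edges of a fully labeled tree."""
    return sum(hamming(labels[u], labels[v]) for u, v in edges)


# Hypothetical tree: internal nodes u and v, leaves a, b (below u) and c, d (below v).
edges = [("u", "a"), ("u", "b"), ("u", "v"), ("v", "c"), ("v", "d")]
labels = {"a": "AAG", "b": "AAA", "c": "GGA", "d": "AGA", "u": "AAA", "v": "AGA"}
cost = labeled_tree_cost(edges, labels)
```

For this particular labeling the three edges u-a, u-v and v-c each carry one substitution.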

4

Parsimony Based Reconstruction

We have here both the small and big problems:

1. The small problem: find the parsimony score for a given leaf labeled tree.

2. The big problem: Find a tree whose leaves are labeled by the input sequences, with the minimum possible parsimony score.

3. We will see efficient algorithms for (1). (2) is hard.

5

Fitch Algorithm: Maximum Parsimony for a Given Tree

Input: A rooted binary leaf labeled tree.

Output: Most parsimonious assignment of states to internal vertices

Work on each position independently. Make one pass from the leaves to the root, and another pass from the root to the leaves.

[Figure: an example tree with leaves A, A, C, T, A and candidate states such as A/T and A/C at internal vertices.]

7

Fitch’s Algorithm – Phase 1

Do a post-order (from leaves to root) traversal of the tree and assign to each vertex a set of possible states. Each leaf has a unique possible state, given by the input.

The set of possible states $R_i$ of an internal node $i$ with children $j$ and $k$ is given by:

$$R_i = \begin{cases} R_j \cap R_k & \text{if } R_j \cap R_k \neq \emptyset \\ R_j \cup R_k & \text{otherwise} \end{cases}$$

8


Claim (to be proved soon): the number of substitutions in an optimal solution = the number of union operations.

[Figure: an example tree with the state sets computed in Phase 1 at each vertex.]
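The bottom-up pass can be sketched as follows for a single character. The nested-tuple tree encoding, the function name `fitch_up`, and the example tree are assumptions made for this illustration; they are not taken from the slides.

```python
def fitch_up(tree):
    """Phase 1 of Fitch for one character.

    A leaf is a one-letter string; an internal node is a pair (left, right).
    Returns (state_set, number_of_union_operations) for the subtree.
    """
    if isinstance(tree, str):
        return {tree}, 0
    left, right = tree
    r_j, unions_j = fitch_up(left)
    r_k, unions_k = fitch_up(right)
    common = r_j & r_k
    if common:
        # Non-empty intersection: no extra substitution is needed here.
        return common, unions_j + unions_k
    # Empty intersection: take the union and count one substitution.
    return r_j | r_k, unions_j + unions_k + 1


# Hypothetical rooted binary tree with leaves C, T, A, G, C.
tree = (("C", "T"), (("A", "G"), "C"))
root_states, unions = fitch_up(tree)
```

Here the union is taken at three vertices, and the root set is the intersection of {C, T} and {A, G, C}.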

9


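Fitch’s Algorithm – Phase 2

Do a pre-order (from root to leaves) traversal of the tree.

The state of the root is an arbitrary $r_{root} \in R_{root}$.

The state $r_j$ of an internal node $j$ with parent $i$ is selected as follows:

$$r_j = \begin{cases} r_i & \text{if } r_i \in R_j \\ \text{an arbitrary state in } R_j & \text{otherwise} \end{cases}$$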

10


[Figure: the example tree with the states selected in Phase 2; the root is assigned T.]

The algorithm could also select C as the assignment to the root. All other assignments cannot be changed.

Complexity: O(nk) per character, where n is the number of leaves and k is the number of states. For m characters the complexity is O(nmk).
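Both passes can be combined in a short sketch for one character. The tree encoding (nested pairs), the names `fitch` and `changes`, the example tree, and the tie-breaking rule (pick the alphabetically smallest state whenever the choice is arbitrary) are all assumptions made for this illustration.

```python
def fitch(tree):
    """Two-pass Fitch for one character.

    Leaves are one-letter strings; internal nodes are pairs (left, right).
    Returns (cost, labeled) where each internal node of `labeled` is a
    triple (state, labeled_left, labeled_right) and leaves stay strings.
    """
    def up(node):
        # Returns (state_set, unions, left_info, right_info).
        if isinstance(node, str):
            return ({node}, 0, None, None)
        j, k = up(node[0]), up(node[1])
        common = j[0] & k[0]
        if common:
            return (common, j[1] + k[1], j, k)
        return (j[0] | k[0], j[1] + k[1] + 1, j, k)

    def down(info, parent_state):
        states, _, j, k = info
        # Keep the parent's state when allowed; otherwise any state of the
        # set works (here: the smallest letter, an arbitrary choice).
        state = parent_state if parent_state in states else min(states)
        if j is None:
            return state
        return (state, down(j, state), down(k, state))

    root = up(tree)
    return root[1], down(root, None)


def changes(labeled):
    """Count the substitutions along the edges of a labeled tree."""
    if isinstance(labeled, str):
        return 0
    state, left, right = labeled
    total = 0
    for child in (left, right):
        child_state = child if isinstance(child, str) else child[0]
        total += (child_state != state) + changes(child)
    return total


tree = (("C", "T"), (("A", "G"), "C"))  # hypothetical example
cost, labeled = fitch(tree)
```

Counting the substitutions of the returned labeling with `changes` gives the same number as the union count from the first pass, which is what the claim on the Phase 1 slide asserts.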

11

Proof of Fitch’s Algorithm

We’ll show that Fitch minimizes the parsimony score of the leaf-labeled input tree.

Definitions:

For a leaf-labeled tree T, let T* be an optimal assignment of labels to the internal nodes of T, and let T*(v) be the label it assigns to internal node v.

Let Tv be the subtree of T rooted at v.

12

Claim: Let Ri be the set of states kept by the 1st phase at vertex i. Then s ∈ Ri iff there exists an optimal assignment Ti* of Ti with Ti*(i) = s.

Proof: By induction on the tree height h. Basis: h=1.

I. If both children have the same state – zero changes.

II. Otherwise – exactly one change.

[Figure: the two base cases. Children A, A give the set {A}; children A, B give the set {A, B}.]

13

• Induction step: Assume correctness for height h and prove for h+1. Let p1 and p2 be the optimal costs of the subtrees of i’s children.

• If the intersection of i’s children lists is not empty, then the optimal score is p1+p2 and it can be achieved by labeling i with any member in the intersection, and only in this way.

• Otherwise, the optimal score is p1+p2+1, and it can be achieved by labeling i with any member in the union of the lists, and only in this way.

[Figure: children lists {A,B} and {C,D} give the union {A,B,C,D}; children lists {A,B} and {B,C} give the intersection {B}.]
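The claim can be sanity-checked on small trees by comparing the union count with an exhaustive search over all internal labelings. This is only a sketch: the tree encoding, the alphabet `ACGT`, and the function names are assumptions, and a check on a few trees is of course not a proof.

```python
from itertools import product

ALPHABET = "ACGT"


def fitch_cost(tree):
    """(state_set, union_count) from the bottom-up pass."""
    if isinstance(tree, str):
        return {tree}, 0
    (r_j, c_j), (r_k, c_k) = fitch_cost(tree[0]), fitch_cost(tree[1])
    common = r_j & r_k
    return (common, c_j + c_k) if common else (r_j | r_k, c_j + c_k + 1)


def count_internal(tree):
    if isinstance(tree, str):
        return 0
    return 1 + count_internal(tree[0]) + count_internal(tree[1])


def cost_with(tree, states):
    """Cost when internal nodes take `states` in pre-order.

    Returns (state_at_this_node, cost, unused_states).
    """
    if isinstance(tree, str):
        return tree, 0, states
    here, rest = states[0], states[1:]
    left_state, left_cost, rest = cost_with(tree[0], rest)
    right_state, right_cost, rest = cost_with(tree[1], rest)
    cost = left_cost + right_cost + (left_state != here) + (right_state != here)
    return here, cost, rest


def brute_force_cost(tree):
    """Minimum number of substitutions over all internal labelings."""
    n = count_internal(tree)
    return min(cost_with(tree, s)[1] for s in product(ALPHABET, repeat=n))


examples = [
    (("C", "T"), (("A", "G"), "C")),
    (("A", "C"), ("C", "G")),
    (("A", "A"), ("A", "A")),
]
agree = all(fitch_cost(t)[1] == brute_force_cost(t) for t in examples)
```

The exhaustive search is exponential in the number of internal nodes, so it is only usable for toy trees like these.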

14

Weighted Maximum Parsimony

Some mutations may be more probable than others. Hence, a natural generalization of the Maximum Parsimony problem is Weighted Parsimony. You’ll see it in the tutorial.

Is Maximum Parsimony A Reliable Criterion?

The motivation for the Perfect Phylogeny and Maximum Parsimony methods comes from models where the characters are “significant”, and hence the number of observed mutations is likely to be as small as possible.

When the characters are DNA sequences, common models of evolution assume that mutations are random events.

A natural question is whether maximum parsimony is a good method for reconstructing phylogenies in such models. Next we formulate and discuss this question.

19

Probabilistic Models of Evolution

A simple (yet quite common) model of evolution, called Jukes-Cantor (JC), assumes:

1. Mutations at different “sites” are i.i.d. (independent and identically distributed).

2. On each edge, all mutations have the same probability.

Other models usually assume 1, but give different probabilities to different types of mutations.

20

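The JC model: each edge (u,v) corresponds to a probabilistic mutation matrix Puv, with rows indexed by the letter at u and columns by the letter at v:

$$P_{uv} = \begin{array}{c|cccc} & A & G & C & T \\ \hline A & 1-3p & p & p & p \\ G & p & 1-3p & p & p \\ C & p & p & 1-3p & p \\ T & p & p & p & 1-3p \end{array}$$

p depends on the “length” of the edge.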

21

A “Model Tree”

A model tree in the JC model is an evolution tree which evolves according to the JC model. Formally, it consists of:

1. A directed tree T=(V,E).

2. A distribution of DNA letters at the root.

3. An assignment of JC transition matrices to the edges of T.

The JC model (and other common models) assume that the distribution at the root is uniform: Each letter occurs with probability 0.25. This distribution is preserved in all other vertices of the tree.
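A small sketch can make the last statement concrete: build a JC matrix for one edge and push the uniform root distribution through it. The function names and the value p = 0.1 are assumptions chosen for illustration; the only constraint used is that the matrix is stochastic, which requires 0 ≤ p ≤ 1/3.

```python
LETTERS = "AGCT"


def jc_matrix(p):
    """4x4 Jukes-Cantor matrix: p off the diagonal, 1 - 3p on it."""
    if not 0 <= p <= 1 / 3:
        raise ValueError("p must lie in [0, 1/3]")
    return [[1 - 3 * p if a == b else p for b in LETTERS] for a in LETTERS]


def push_forward(dist, matrix):
    """Distribution at v, given the distribution at u and the matrix of (u, v)."""
    return [sum(dist[a] * matrix[a][b] for a in range(4)) for b in range(4)]


P = jc_matrix(0.1)          # hypothetical edge with p = 0.1
uniform = [0.25] * 4
after = push_forward(uniform, P)
```

Each row of `P` sums to 1, and `after` is again uniform up to floating-point rounding, because every column of a JC matrix also sums to 1.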

22

23

A “model quartet” in the JC model

[Figure: a rooted quartet with leaves A, B, C, D.]

Each edge may have a different mutation probability.

Consistency of Reconstruction Algorithms

A tree reconstruction method (like maximum parsimony) is said to be “consistent” for a probabilistic model of evolution, if the following holds for any phylogenetic tree which fits the model:

When the sequence length goes to ∞, the reconstructed tree is w.h.p. the true tree.

For the maximum parsimony method, this is equivalent to:

The true tree is w.h.p. a most parsimonious tree.

24

25

Of specific interest: reconstructing quartets

[Figure: an unrooted quartet with leaves A, B on one side of the middle edge and C, D on the other.]

Correct reconstruction of (undirected) quartets is equivalent to finding the split defined by the middle edge, (A,B;C,D)

26

Example: Checking Consistency of Maximum Parsimony on Quartet Reconstruction

[Figure: a sequence CCCGGAGCTTCTG…ACAA (500 DNA bases) is evolved along the edges of the model quartet with leaves A, B, C, D.]

Phase 1: Simulate evolution on the given quartet

27

[Figure: the sequences obtained at the leaves D, C, B and A, each of the form CCCGGAGCTTCTG…ACAA.]

Phase 2: Find a most parsimonious tree for the sequences at the leaves.

28

MP is consistent for the given model tree if w.h.p. the most parsimonious tree gives the correct split.

[Figure: two rooted quartets on leaves A, B, C, D whose edges have lengths t or 10t; one of them is captioned “Most Parsimonious Tree”.]

As we will see next, Maximum Parsimony is not consistent for certain quartets.
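The two-phase experiment (simulate, then reconstruct) can be sketched as below. Everything concrete here is an assumption made for illustration: the model quartet is ((A,B),(C,D)) with long edges leading to A and C and short edges elsewhere, the per-letter substitution probabilities are 0.2 (long) and 0.01 (short), the sequence length is 2000, and the random seed is fixed.

```python
import random

LETTERS = "ACGT"
SPLITS = {
    "AB|CD": ((0, 1), (2, 3)),
    "AC|BD": ((0, 2), (1, 3)),
    "AD|BC": ((0, 3), (1, 2)),
}


def mutate(state, p, rng):
    """One JC edge: each specific other letter is reached with probability p."""
    if rng.random() < 3 * p:
        return rng.choice([x for x in LETTERS if x != state])
    return state


def simulate_site(p_long, p_short, rng):
    """States at leaves (A, B, C, D) for one site of the model quartet."""
    u = rng.choice(LETTERS)        # uniform state at the internal node above A, B
    v = mutate(u, p_short, rng)    # short middle edge
    return (mutate(u, p_long, rng), mutate(u, p_short, rng),
            mutate(v, p_long, rng), mutate(v, p_short, rng))


def quartet_score(site, split):
    """Fitch score of one site for a quartet split (root on the middle edge)."""
    (i, j), (k, l) = split
    left = {site[i]} & {site[j]} or {site[i], site[j]}
    right = {site[k]} & {site[l]} or {site[k], site[l]}
    return (site[i] != site[j]) + (site[k] != site[l]) + (not left & right)


def most_parsimonious(sites):
    totals = {name: sum(quartet_score(s, sp) for s in sites)
              for name, sp in SPLITS.items()}
    return min(totals, key=totals.get), totals


rng = random.Random(0)
sites = [simulate_site(0.2, 0.01, rng) for _ in range(2000)]
best, totals = most_parsimonious(sites)
```

With these parameters the true split is AB|CD, yet sites where the two long-edge leaves A and C happen to agree far outnumber sites supporting the true split, so parsimony prefers AC|BD.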

30

Inconsistency of Maximum Parsimony

Maximum Parsimony is not consistent for the JC and other similar probabilistic models of evolution of DNA.

In such models there are some scenarios of evolution, in which the most parsimonious tree is w.h.p. different from the true tree. We illustrate this on quartets.

A quartet on 4 species has 3 possible topologies (splits): (1,2;3,4), (1,3;2,4) and (1,4;2,3).

31

A quartet which is unlikely to be reconstructed by maximum parsimony

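Consider the following model quartet, where the probability of a substitution is proportional to the edge length.

[Figure: a model quartet with leaves 1, 2, 3, 4; the character at the origin is A, and leaves 2 and 3 also carry A.]

In this tree, the characters at leaves 2 and 3 are w.h.p. the same as at the origin, while those at leaves 1 and 4 are more likely to be different.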

32

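Parsimony may be useless/misleading for reconstructing the true tree

[Figure: the same model quartet; leaves 2 and 3 carry A.]

Assume the (likely) scenario where leaves 2 and 3 are the same. There are 4 patterns of substitution for leaves 1 and 4:

I. A, A (neither leaf changed)

II. A, G (one leaf changed)

III. C, G (both changed, to different letters)

IV. G, G (both changed, to the same letter)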

33

Case I: all topologies get the same parsimony score

[Figure: the model quartet and the three topologies, with A at all four leaves.]

Score=0 for each of the three topologies.

34

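Case II: all topologies get the same score

[Figure: the model quartet and the three topologies; one of the leaves 1, 4 carries G and the other three leaves carry A.]

Each of the three topologies needs exactly one substitution.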

35

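Case III: again, all topologies get the same score

[Figure: the model quartet and the three topologies; leaves 1 and 4 carry G and C, leaves 2 and 3 carry A.]

Each of the three topologies needs two substitutions.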

36

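Case IV: the most parsimonious topology is wrong

[Figure: the model quartet and the three topologies; leaves 1 and 4 both carry C, leaves 2 and 3 carry A.]

The topology (1,4;2,3) needs one substitution, while the other two topologies need two.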
37

Parsimony is useful only in the least likely cases

[Figure: the model quartet in a scenario where leaves 2 and 3 carry different characters.]

For the most parsimonious tree to be the correct tree, it is necessary that leaves 2 and 3 have different characters, which is less likely than all the other cases.
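The case analysis can be reproduced with a few lines of code. The sketch below scores one site on each of the three quartet topologies, with leaves 2 and 3 fixed to A. The function names and the concrete letters used for leaves 1 and 4 in each case are assumptions for illustration.

```python
def fitch_pair(a, b):
    """Fitch step for two sibling leaves: (state set, number of unions)."""
    return ({a}, 0) if a == b else ({a, b}, 1)


def split_score(x, y, z, w):
    """Parsimony score of one site for the quartet split (x, y | z, w)."""
    left, c1 = fitch_pair(x, y)
    right, c2 = fitch_pair(z, w)
    return c1 + c2 + (0 if left & right else 1)


def topology_scores(s1, s2, s3, s4):
    """Scores of the three quartet topologies for leaf states s1..s4."""
    return {
        "12|34": split_score(s1, s2, s3, s4),
        "13|24": split_score(s1, s3, s2, s4),
        "14|23": split_score(s1, s4, s2, s3),
    }


# States of leaves (1, 4) in each case; leaves 2 and 3 always carry A.
cases = {"I": ("A", "A"), "II": ("A", "G"), "III": ("C", "G"), "IV": ("G", "G")}
results = {name: topology_scores(s1, "A", "A", s4)
           for name, (s1, s4) in cases.items()}
```

Only in case IV do the three scores differ, and there the lowest score goes to the topology that groups leaves 1 and 4.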

38

Another problem with Maximum Parsimony (and other Character-Based Algorithms): Efficiency

There are no known efficient algorithms for solving the “big” problem for maximum parsimony or perfect phylogeny (both are known to be NP-hard).

Mainly for this reason, the most used approaches for solving the big problem are distance based methods.

39

Distance-based Methods for Constructing Phylogenies

This approach attempts to overcome the two weaknesses of maximum parsimony:

1. It starts by estimating inter-taxa distances from a well-defined statistical model of evolution (distances correspond to the probability of changes).

2. It provides efficient algorithms for the big problem.

Basic idea: The differences between species (usually represented by sequences of characters) are transformed to numerical distances, and an edge weighted tree realizing these distances is constructed.

40

Distance-Based Reconstruction

• Compute distances between all taxon-pairs

• Find a tree (edge-weighted) best-describing the distances

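[Figure: a lower-triangular distance matrix D on six taxa, and an edge-weighted tree realizing it.]

D =
   0
   3   0
   9   8   0
  15  14  18   0
  17  16  20  22   0
  16  15  19  21   9   0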

42

Data → Distances → Trees

1. Modeling question: given the data (e.g. DNA sequences of the taxa), how do we define distances between taxa?

2. Algorithmic question: Decide if the distances define a tree (ultrametric or additive – to be defined later), and if so, construct that tree.

3. In reality, the computed distances are noisy. So we need the algorithm to return a tree which approximates the distances of the input data.

In the following we shall study items 2 and 1, and briefly discuss item 3.

43

Ultrametric and Tree Metric

A distance metric on a set M of L objects is a function $d: M \times M \to \mathbb{R}$ (represented by a symmetric matrix) satisfying:

• d(i,i)=0, and for i≠j, d(i,j)>0.

• d(i,j)=d(j,i).

• For all i,j,k it holds that d(i,k) ≤ d(i,j)+d(j,k).

A metric is ultrametric if it corresponds to distances between leaves of a tree which admits a molecular clock. It is a tree metric, or additive, if it corresponds to distances between nodes in a weighted tree.
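The three metric conditions translate directly into a check on a matrix. A minimal sketch, assuming the matrix is given as a list of lists; the function name `is_metric` and the two small example matrices are illustrative only.

```python
from itertools import permutations


def is_metric(d):
    """Check the metric axioms for a square matrix given as a list of lists."""
    n = len(d)
    for i in range(n):
        if d[i][i] != 0:
            return False
        for j in range(n):
            # Positivity off the diagonal, and symmetry.
            if i != j and (d[i][j] <= 0 or d[i][j] != d[j][i]):
                return False
    # Triangle inequality for every ordered triple of distinct indices.
    return all(d[i][k] <= d[i][j] + d[j][k]
               for i, j, k in permutations(range(n), 3))


good = [[0, 3, 9], [3, 0, 8], [9, 8, 0]]
bad = [[0, 1, 5], [1, 0, 1], [5, 1, 0]]   # 5 > 1 + 1 violates the triangle inequality
```

The check takes O(L^3) time for an L×L matrix, which is fine for illustration.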

44

1st model: Molecular Clock Ultrametric Trees

The molecular clock model assumes a constant rate of evolution. Namely, the distances from any extinct taxon (internal vertex) to all its current descendants are identical. A rooted tree satisfying this property is called ultrametric.

45

Ultrametric trees

Definition: An ultrametric tree is a rooted weighted tree all of whose leaves are at the same depth.

Basic property: Define the height of the leaves to be 0. Then edge weights can be represented by the heights of internal vertices.

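[Figure: an ultrametric tree on the leaves A, E, D, C, B (all at height 0), drawn once with edge weights 3, 3, 3, 3, 2, 5, 5, 3 and once with internal-vertex heights 3, 3, 5 and 8.]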

46

Least Common Ancestor and distances in Ultrametric Tree

Let LCA(i,j) denote the least common ancestor of leaves i and j. Let height(LCA(i, j)) be its distance from the leaves, and dist(i,j) be the distance from i to j.

Observation: For any pair of leaves i, j in an ultrametric tree:

height(LCA(i,j)) = 0.5 dist(i,j).

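     A  B  C  D  E
A    0  8  8  5  3
B       0  3  8  8
C          0  8  8
D             0  5
E                0

(The entries are the values height(LCA(i,j)) in the tree below.)

[Figure: the ultrametric tree on leaves A, E, D, C, B with internal-vertex heights 3, 5, 3 and 8.]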
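The observation can be checked numerically on the five-leaf example. In the sketch below the tree shape is read off the matrix above and is therefore an assumption, as are the encoding (an internal node is a triple of height, left, right; a leaf is a one-letter string) and all function names. An edge weight is the height difference between its endpoints.

```python
def leaves(node):
    return [node] if isinstance(node, str) else leaves(node[1]) + leaves(node[2])


def height_of(node):
    return 0 if isinstance(node, str) else node[0]


def lca_heights(node, table=None):
    """Map each unordered leaf pair to the height of its least common ancestor."""
    if table is None:
        table = {}
    if isinstance(node, str):
        return table
    height, left, right = node
    for a in leaves(left):
        for b in leaves(right):
            table[frozenset((a, b))] = height
    lca_heights(left, table)
    lca_heights(right, table)
    return table


def descent(node, leaf):
    """Sum of edge weights from node down to leaf, or None if leaf is not below."""
    if isinstance(node, str):
        return 0 if node == leaf else None
    for child in node[1:]:
        below = descent(child, leaf)
        if below is not None:
            return (node[0] - height_of(child)) + below
    return None


def dist(node, a, b):
    """Path length between two distinct leaves a and b."""
    for child in node[1:]:
        if descent(child, a) is not None and descent(child, b) is not None:
            return dist(child, a, b)
    return descent(node, a) + descent(node, b)   # node is LCA(a, b)


tree = (8, (5, (3, "A", "E"), "D"), (3, "B", "C"))
table = lca_heights(tree)
halves_match = all(dist(tree, *sorted(pair)) == 2 * h for pair, h in table.items())
```

Every pairwise path length is twice the height of the corresponding LCA, since both leaves lie at height 0 and the path climbs to the LCA and back down.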

47

Ultrametric Matrices

Definition: A distance matrix* U of dimension L×L is ultrametric iff for each 3 indices i, j, k:

U(i,j) ≤ max {U(i,k),U(j,k)}.

Example (three indices):

     j  k
i    9  6
j       9

Theorem: The following conditions are equivalent for an L×L distance matrix U:

1. U is an ultrametric matrix.

2. There is an ultrametric tree with L leaves such that for each pair of leaves i,j:

U(i,j) = height(LCA(i,j)) = ½ dist(i,j).

* Recall: a distance matrix is a symmetric matrix with positive off-diagonal entries and zero diagonal entries, which satisfies the triangle inequality.
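The three-point condition of the definition is easy to test directly. A minimal sketch, assuming the input is already a distance matrix given as a list of lists; the function name is illustrative, and the matrix `U` is the five-taxon example (order A, B, C, D, E) filled in symmetrically.

```python
from itertools import permutations


def is_ultrametric(u):
    """Three-point condition: U(i,j) <= max(U(i,k), U(j,k)) for distinct i, j, k.

    Assumes u is already a distance matrix (symmetric, zero diagonal).
    """
    n = len(u)
    return all(u[i][j] <= max(u[i][k], u[j][k])
               for i, j, k in permutations(range(n), 3))


U = [
    [0, 8, 8, 5, 3],
    [8, 0, 3, 8, 8],
    [8, 3, 0, 8, 8],
    [5, 8, 8, 0, 5],
    [3, 8, 8, 5, 0],
]
not_ultrametric = [[0, 3, 9], [3, 0, 8], [9, 8, 0]]   # 9 > max(3, 8)
```

Equivalently, in every triple the two largest of the three values are equal, as in the 9, 6, 9 example above.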

48

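Ultrametric tree ⇒ Ultrametric matrix

There is an ultrametric tree s.t. U(i,j)=½dist(i,j) ⇒ U is an ultrametric matrix: by properties of Least Common Ancestors in trees.

[Figure: three leaves i, j, k of an ultrametric tree, where LCA(k,i) = LCA(j,i) is at least as high as LCA(k,j).]

U(k,i) = U(j,i) ≥ U(k,j)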

49

Ultrametric matrix ⇒ Ultrametric tree:

The proof is based on the two observations below.

Definition: Let U be an L×L matrix, and let S ⊆ {1,...,L}.

U[S] is the submatrix of U consisting of the rows and columns with indices from S.

Observation 1: U is ultrametric iff for every S ⊆ {1,...,L}, U[S] is ultrametric.

Observation 2: If U is ultrametric and max_{i,j} U(i,j)=M, then M appears in every row of U.

     j  k
i    ?  ?
j       M

One of the “?” must be M.

50

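Ultrametric matrix ⇒ Ultrametric tree: Proof by induction

U is an ultrametric matrix ⇒ U has an ultrametric tree: By induction on L, the size of U.

Basis: L=1: T is a leaf.

L=2: T is a tree with two leaves.

[Figure: for L=1, the matrix (0) and a single leaf i; for L=2, the matrix with U(i,j)=9 and a tree whose root, labeled 9, has the leaves i and j as children.]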

51

Induction step

Induction step: L>2.

Use the 1st row to split the set {1,…,L} into two subsets:

S1 = {i: U(1,i) = M},

S2 = {1,…,L} − S1

(note: 0<|Si|<L)

Example: for the first row

     1  2  3  4  5
1    0  8  2  8  5

we get M=8, S1={2,4}, S2={1,3,5}.

52

Induction step

By Observation 1, U1=U[S1] and U2=U[S2] are ultrametric.

Let M1 (resp. M2) be the maximal entry in U1 (resp. U2).

Note that M1 ≤ M and M2 < M (M2 is the 2nd largest element in row 1; if M2=0 then T2 is a leaf).

By induction there are ultrametric trees T1 and T2 for U1 and U2.

Join T1 and T2 into T with a root as shown.

[Figure: a new root labeled M whose two children are the roots of T2 (labeled M2) and T1 (labeled M1).]

53

Proof (end)

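Need to prove: T is an ultrametric tree for U,

i.e., U(i,j) is the label of the LCA of i and j in T.

If i and j are in the same subtree, this holds by induction.

Else the label of LCA(i,j) is M (since they are in different subtrees, their LCA is the root).

Also, [U(1,i)=M and U(1,j)≠M] ⇒ U(i,j)=M. Indeed, M = U(1,i) ≤ max{U(1,j), U(i,j)} and U(1,j) < M, so U(i,j) = M.

[Figure: the tree T with root labeled M and subtrees T2 (root M2) and T1 (root M1); leaves i and j lie in different subtrees.]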
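The induction in the proof is constructive, so it can be turned into a short recursive procedure. The sketch below follows the same split by the first row; the function names and the output encoding (a leaf is its index, an internal node is a triple of label, T2, T1) are assumptions, and the input is assumed to be a valid ultrametric distance matrix. Indices are 0-based here, unlike the 1-based slides.

```python
def build_ultrametric_tree(u, indices=None):
    """Recursive construction following the induction step of the proof.

    Returns a leaf index, or a triple (M, T2, T1) where M labels the root.
    Assumes u is an ultrametric distance matrix (positive off-diagonal).
    """
    if indices is None:
        indices = list(range(len(u)))
    if len(indices) == 1:
        return indices[0]
    first = indices[0]
    # By Observation 2, the maximum of the submatrix appears in this row.
    m = max(u[first][i] for i in indices)
    s1 = [i for i in indices if u[first][i] == m]
    s2 = [i for i in indices if u[first][i] != m]   # contains `first`
    return (m, build_ultrametric_tree(u, s2), build_ultrametric_tree(u, s1))


def leaf_set(t):
    return leaf_set(t[1]) | leaf_set(t[2]) if isinstance(t, tuple) else {t}


def lca_label(t, i, j):
    """Label of the least common ancestor of two distinct leaves i and j."""
    for child in t[1:]:
        if {i, j} <= leaf_set(child):
            return lca_label(child, i, j)
    return t[0]


# Five-taxon example matrix, order A, B, C, D, E (indices 0..4).
U = [
    [0, 8, 8, 5, 3],
    [8, 0, 3, 8, 8],
    [8, 3, 0, 8, 8],
    [5, 8, 8, 0, 5],
    [3, 8, 8, 5, 0],
]
tree = build_ultrametric_tree(U)
labels_match = all(lca_label(tree, i, j) == U[i][j]
                   for i in range(5) for j in range(i + 1, 5))
```

On this matrix the first split is S1 = {B, C} and S2 = {A, D, E}, and the LCA labels of the resulting tree reproduce every entry of U. When M1 = M the sketch simply produces a child with the same label as its parent, which corresponds to a zero-weight edge.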
