Maximum Parsimony Probabilistic Models of Evolutions Distance Based Methods Lecture 12 © Shlomo Moran, Ilan Gronau

Post on 14-Dec-2015

Maximum Parsimony

Probabilistic Models of Evolutions

Distance Based Methods

Lecture 12

© Shlomo Moran, Ilan Gronau

Maximum Parsimony

A Character-based reconstruction method

Input: h sequences (one per species), all of length k.

Goal: Find a tree whose leaves are labeled by the input sequences, and an assignment of sequences to internal nodes, such that the total number of substitutions is minimized.

Parsimony score

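[Figure: two trees whose leaves are labeled by the sequences AAA, AAG, AGA and GGA, with sequences assigned to the internal vertices. In the first tree three edges carry one substitution each (parsimony score = 3); in the second tree the edges carry 1, 1 and 2 substitutions (parsimony score = 4).]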

The parsimony score of a leaf-labeled tree T is the minimum possible number of mutations over all assignments of sequences to internal vertices of T.

Parsimony Based Reconstruction

We have here both the small and big problems:

1. The small problem: find the parsimony score for a given leaf labeled tree.

2. The big problem: Find a tree whose leaves are labeled by the input sequences, with the minimum possible parsimony score.

We will see efficient algorithms for (1); (2) is computationally hard (NP-hard).

Fitch Algorithm: Maximum Parsimony for a Given Tree

Input: A rooted binary leaf labeled tree.

Output: Most parsimonious assignment of states to internal vertices

Work on each position independently. Make one pass from the leaves to the root, and another pass from the root to the leaves.

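[Figure: example tree with leaf states A, A, C, T, A; internal vertices are annotated with state sets such as A/C and A/T.]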

Fitch’s Algorithm – Phase 1

Do a post-order (from leaves to root) traversal of tree, assign to each vertex a set of possible states. Each leaf has a unique possible state, given by the input.

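The set of possible states R_i of an internal node i with children j and k is given by:

$$R_i = \begin{cases} R_j \cap R_k & \text{if } R_j \cap R_k \neq \emptyset \\ R_j \cup R_k & \text{otherwise} \end{cases}$$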
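The bottom-up rule above can be sketched in Python for a single character. This is a minimal illustration, not the lecture's own code: the dictionary encoding of the tree, the function name `fitch_phase1`, and the example tree are all assumptions made for the sketch.

```python
def fitch_phase1(tree, leaf_state):
    """Bottom-up pass of Fitch's algorithm for one character.

    tree: dict mapping each internal node to its (left, right) children
          (an assumed encoding, chosen for this sketch).
    leaf_state: dict mapping each leaf to its observed state.
    Returns (R, unions): the state set of every node and the number of
    union operations performed.
    """
    R = {leaf: {s} for leaf, s in leaf_state.items()}
    unions = 0

    def visit(node):
        nonlocal unions
        if node in R:                  # leaf: its set is given by the input
            return R[node]
        left, right = tree[node]
        a, b = visit(left), visit(right)
        common = a & b
        if common:                     # non-empty intersection
            R[node] = common
        else:                          # empty intersection: take the union
            R[node] = a | b
            unions += 1
        return R[node]

    # The root is the only internal node that is nobody's child.
    children = {c for pair in tree.values() for c in pair}
    root = next(n for n in tree if n not in children)
    visit(root)
    return R, unions


# Hypothetical example tree with leaf states A, A, C, T, A.
tree = {"root": ("x", "y"), "x": ("l1", "l2"), "y": ("l3", "z"), "z": ("l4", "l5")}
leaf_state = {"l1": "A", "l2": "A", "l3": "C", "l4": "T", "l5": "A"}
R, unions = fitch_phase1(tree, leaf_state)
print(R["root"], unions)
```

On this example the two union operations happen at `z` and `y`, which matches the two substitutions needed when every internal vertex is labeled A.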

Fitch’s Algorithm – Phase 1

Claim (to be proved soon): # of substitutions in optimal solution = # of union operations

[Figure: example tree with the Phase 1 state sets written at its vertices.]

Fitch’s Algorithm – Phase 2

Do a pre-order (from root to leaves) traversal of the tree.

The state of the root is an arbitrary r_root ∈ R_root.

The state r_j of an internal node j with parent i is selected as follows:

$$r_j = \begin{cases} r_i & \text{if } r_i \in R_j \\ \text{an arbitrary state in } R_j & \text{otherwise} \end{cases}$$

Fitch’s Algorithm – Phase 2

[Figure: the example tree with one state selected at each vertex in Phase 2.]

The algorithm could also select C as the assignment to the root. All other assignments cannot be changed.

Complexity: O(nk), where n is the number of leaves and k is the number of states. For m characters the complexity is O(nmk).
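Both phases can be put together in a short Python sketch for one character. The tree encoding, the function name `fitch`, and the use of `min` as the "arbitrary" choice are assumptions made here so that the example is deterministic; any other member of the set would do.

```python
def fitch(tree, root, leaf_state):
    """Two-pass Fitch for one character; returns (state, score).

    tree: dict mapping each internal node to its two children (assumed encoding).
    """
    R, score = {}, 0

    def up(v):                           # Phase 1: post-order pass
        nonlocal score
        if v in leaf_state:
            R[v] = {leaf_state[v]}
        else:
            a, b = (up(c) for c in tree[v])
            R[v] = (a & b) or (a | b)    # intersection if non-empty, else union
            score += not (a & b)         # count one union operation
        return R[v]

    state = {}

    def down(v, parent_state):           # Phase 2: pre-order pass
        if parent_state in R[v]:
            state[v] = parent_state      # keep the parent's state when allowed
        else:
            state[v] = min(R[v])         # arbitrary (here: smallest) state of R[v]
        for c in tree.get(v, ()):
            down(c, state[v])

    up(root)
    down(root, None)                     # the root has no parent state
    return state, score


# Hypothetical example tree with leaf states A, A, C, T, A.
tree = {"root": ("x", "y"), "x": ("l1", "l2"), "y": ("l3", "z"), "z": ("l4", "l5")}
leaf_state = {"l1": "A", "l2": "A", "l3": "C", "l4": "T", "l5": "A"}
state, score = fitch(tree, "root", leaf_state)

# Number of edges whose endpoints received different states.
changes = sum(state[v] != state[c] for v in tree for c in tree[v])
print(score, changes)
```

Each vertex is visited once per pass and does set operations on at most k states, which is where the O(nk) bound per character comes from.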

Proof of Fitch’s Algorithm

We’ll show that Fitch minimizes the parsimony score of the leaf-labeled input tree.

Definitions:

For a leaf-labeled tree T, let T* be an optimal assignment of labels to the internal nodes of T, and let T*(v) be the label it assigns to internal node v.

Let Tv be the subtree rooted at v.

Claim: Let Ri be the set of states kept at the 1st phase at vertex i. Then s ∈ Ri iff there exists an optimal assignment Ti* with Ti*(i) = s.

Proof: By induction on the tree height h. Basis: h=1

I. If both children have the same state – zero change.

II. Otherwise – exactly one change.

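[Figure: two trees of height 1. Children A and A give the parent set {A}; children A and B give the parent set {A, B}.]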

• Induction step: Assume correctness for height h and prove for h+1. Let p1 and p2 be the optimal costs of the subtrees of i’s children.

• If the intersection of i’s children lists is not empty, then the optimal score is p1+p2 and it can be achieved by labeling i with any member in the intersection, and only in this way.

• Otherwise, the optimal score is p1+p2+1, and it can be achieved by labeling i with any member in the union of the lists, and only in this way.
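The two cases of the induction step can be summarized as a recurrence. The symbol S(i), for the optimal score of the subtree rooted at i, is notation introduced here for illustration; j and k are the children of i, so S(j) and S(k) play the roles of p1 and p2.

```latex
S(i) = S(j) + S(k) +
\begin{cases}
  0 & \text{if } R_j \cap R_k \neq \emptyset \\
  1 & \text{otherwise}
\end{cases},
\qquad S(\text{leaf}) = 0 .
```

Unrolling the recurrence from the root adds 1 exactly once per union operation, which gives the claim that the number of substitutions in an optimal solution equals the number of union operations.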

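[Figure: children with sets {A,B} and {C,D} give the parent set {A,B,C,D}; children with sets {A,B} and {B,C} give the parent set {B}.]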

Weighted Maximum Parsimony

Some mutations may be more probable than others. Hence, a natural generalization of the Maximum Parsimony problem is Weighted Parsimony. You’ll see it in the tutorial.

Is Maximum Parsimony A Reliable Criterion?

The motivation for the Perfect Phylogeny and Maximum Parsimony methods comes from models where the characters are “significant”, and hence the number of observed mutations is likely to be as small as possible.

When the characters are DNA sequences, common models of evolution assume that mutations are random events.

A natural question is whether maximum parsimony is a good method for reconstructing phylogenies in such models. Next we formulate and discuss this question.

Probabilistic Models of Evolution

A simple (yet quite common) model of evolution, called Jukes-Cantor (JC), assumes:

1. Mutations at different “sites” are i.i.d. (independent and identically distributed).

2. On each edge, all mutations have the same probability.

Other models usually assume 1, but give different probabilities to different types of mutations.

The JC model: each edge (u,v) corresponds to a probabilistic mutation matrix Puv.

Puv (rows: letter at u; columns: letter at v) =

      A      G      C      T
A   1-3p    p      p      p
G    p     1-3p    p      p
C    p      p     1-3p    p
T    p      p      p     1-3p

p depends on the “length” of the edge.

A “Model Tree”

A model tree in the JC model is an evolution tree which evolves according to the JC model. Formally, it consists of:

1. A directed tree T=(V,E).

2. A distribution of DNA letters at the root.

3. An assignment of JC transition matrices to the edges of T.

The JC model (and other common models) assumes that the distribution at the root is uniform: each letter occurs with probability 0.25. This distribution is preserved at all other vertices of the tree.
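A small Python sketch can check the last statement: pushing the uniform distribution through a JC matrix returns the uniform distribution. The helper names `jc_matrix` and `push` and the value p = 1/10 are assumptions chosen for illustration; exact fractions are used to avoid rounding.

```python
from fractions import Fraction

LETTERS = "AGCT"


def jc_matrix(p):
    """4x4 JC matrix: p off the diagonal, 1 - 3p on the diagonal."""
    return [[1 - 3 * p if a == b else p for b in LETTERS] for a in LETTERS]


def push(dist, P):
    """Distribution at v, given the distribution at u and the matrix P_uv."""
    return [sum(dist[a] * P[a][b] for a in range(4)) for b in range(4)]


p = Fraction(1, 10)              # illustrative edge parameter (assumption)
P = jc_matrix(p)
uniform = [Fraction(1, 4)] * 4

print([sum(row) for row in P])   # every row is a probability distribution
print(push(uniform, P))          # the uniform distribution is preserved
```

The same check works for any 0 ≤ p ≤ 1/3, because every column of the matrix also sums to 1.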

A “model quartet” in the JC model

[Figure: a rooted quartet tree with leaves A, B, C, D.]

Each edge may have a different mutation probability.

Consistency of Reconstruction Algorithms

A tree reconstruction method (like maximum parsimony) is said to be “consistent” for a probabilistic model of evolution, if the following holds for any phylogenetic tree which fits the model:

When the sequence length goes to ∞, the reconstructed tree is w.h.p. (with high probability) the true tree.

For the maximum parsimony method, this is equivalent to:

The true tree is w.h.p. a most parsimonious tree.

Of specific interest: reconstructing quartets

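[Figure: an undirected quartet tree with leaves A, B on one side of the middle edge and C, D on the other.]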

Correct reconstruction of (undirected) quartets is equivalent to finding the split defined by the middle edge, (A,B;C,D)

Example: Checking Consistency of Maximum Parsimony on Quartet Reconstruction

[Figure: the model quartet on leaves A, B, C, D, with a DNA sequence of 500 bases (CCCGGAGCTTCTG…ACAA) evolving along its edges.]

Phase 1: Simulate evolution on the given quartet

[Figure: the four simulated sequences at the leaves A, B, C and D.]

Phase 2: Find a most parsimonious tree for the sequences at the leaves.
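The two-phase experiment can be sketched in Python. Everything specific below is an assumption made for illustration: the leaf names 1 to 4, the true split (1,2;3,4), the rooting in the middle of the internal edge, the edge parameters `p_long` and `p_short`, the number of sites, and the random seed. Parsimony is scored per site with the Fitch rule on each of the three quartet topologies.

```python
import random

LETTERS = "ACGT"


def mutate(state, p, rng):
    """One JC edge: each of the 3 other letters is reached with probability p."""
    if rng.random() < 3 * p:
        return rng.choice([x for x in LETTERS if x != state])
    return state


def simulate_site(p_long, p_short, rng):
    """One site on a model quartet whose true split is (1,2;3,4)."""
    root = rng.choice(LETTERS)            # uniform root distribution
    u = mutate(root, p_short, rng)        # parent of leaves 1 and 2
    w = mutate(root, p_short, rng)        # parent of leaves 3 and 4
    return (mutate(u, p_long, rng), mutate(u, p_short, rng),
            mutate(w, p_short, rng), mutate(w, p_long, rng))


def site_score(x, y, z, t):
    """Fitch score of one site on the quartet with split (x,y;z,t)."""
    a = {x} & {y} or {x, y}
    b = {z} & {t} or {z, t}
    return (x != y) + (z != t) + (not a & b)


def parsimony_totals(sites):
    """Total parsimony score of each of the 3 quartet topologies."""
    totals = {"12|34": 0, "13|24": 0, "14|23": 0}
    for s1, s2, s3, s4 in sites:
        totals["12|34"] += site_score(s1, s2, s3, s4)
        totals["13|24"] += site_score(s1, s3, s2, s4)
        totals["14|23"] += site_score(s1, s4, s2, s3)
    return totals


# Phase 1: simulate evolution; Phase 2: pick a most parsimonious topology.
rng = random.Random(0)
sites = [simulate_site(p_long=0.2, p_short=0.005, rng=rng) for _ in range(2000)]
totals = parsimony_totals(sites)
best = min(totals, key=totals.get)
print(totals, best)
```

With long edges to leaves 1 and 4 and short edges elsewhere, the split (1,4;2,3) is expected to get the lowest total even though the sequences were generated on (1,2;3,4). Making the edge parameters more similar should let the true split win again.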

MP is consistent for the given model tree if, w.h.p., the most parsimonious tree gives the correct split.

[Figure: the rooted model quartet on leaves A, B, C, D, with edge lengths t and 10t, shown next to the most parsimonious tree.]

As we will see next, Maximum Parsimony is not consistent for certain quartets.

Inconsistency of Maximum Parsimony

Maximum Parsimony is not consistent for the JC and other similar probabilistic models of evolution of DNA.

In such models there are some scenarios of evolution, in which the most parsimonious tree is w.h.p. different from the true tree. We illustrate this on quartets.

A quartet on 4 species has 3 possible topologies (splits): (1,2;3,4), (1,3;2,4) and (1,4;2,3).

A quartet which is unlikely to be reconstructed by maximum parsimony

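Consider the following model quartet, where the probability of a substitution is proportional to the edge length.

[Figure: model quartet with state A at the origin; leaves 1 and 4 hang on long edges, leaves 2 and 3 on short edges.]

In this tree, the characters at leaves 2 and 3 are w.h.p. the same as at the origin, while the characters at leaves 1 and 4 are more likely to differ from it.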

Parsimony may be useless/misleading for reconstructing the true tree

Assume the (likely) scenario where leaves 2 and 3 are the same. There are 4 patterns of substitution for leaves 1, 4:

I. (A, A)   II. (A, G)   III. (C, G)   IV. (G, G)

Case I: all topologies get the same parsimony score

Leaf characters: 1 = A, 2 = A, 3 = A, 4 = A.

Split (1,2;3,4): Score = 0
Split (1,3;2,4): Score = 0
Split (1,4;2,3): Score = 0

Case II: all topologies get the same score

Leaf characters: 1 = G, 2 = A, 3 = A, 4 = A.

Split (1,2;3,4): Score = 1
Split (1,3;2,4): Score = 1
Split (1,4;2,3): Score = 1

Case III: all topologies get the same score

Leaf characters: 1 = G, 2 = A, 3 = A, 4 = C.

Split (1,2;3,4): Score = 2
Split (1,3;2,4): Score = 2
Split (1,4;2,3): Score = 2

Case IV: the most parsimonious topology is wrong

Leaf characters: 1 = C, 2 = A, 3 = A, 4 = C.

Split (1,2;3,4): Score = 2
Split (1,3;2,4): Score = 2
Split (1,4;2,3): Score = 1

Parsimony is useful only in the least likely cases

[Figure: the model quartet in a scenario where leaves 2 and 3 carry different characters.]

For the most parsimonious tree to be the correct tree, it is necessary that leaves 2 and 3 have different characters, which is less likely than all the other cases.
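The scores of the four cases can be re-derived with a few lines of Python. This is a sketch: the function names are hypothetical, and the concrete letters are one representative of each pattern, with leaves 2 and 3 fixed to A.

```python
def fitch_quartet(x, y, z, t):
    """Parsimony score of one character on the quartet with split (x,y;z,t)."""
    left = {x} & {y} or {x, y}       # Fitch set of the parent of x and y
    right = {z} & {t} or {z, t}      # Fitch set of the parent of z and t
    return (x != y) + (z != t) + (not left & right)


def scores(c1, c2, c3, c4):
    """Scores of the splits (1,2;3,4), (1,3;2,4), (1,4;2,3), in that order."""
    return (fitch_quartet(c1, c2, c3, c4),
            fitch_quartet(c1, c3, c2, c4),
            fitch_quartet(c1, c4, c2, c3))


# One representative per case: characters at leaves 1, 2, 3, 4.
cases = {
    "I":   ("A", "A", "A", "A"),
    "II":  ("G", "A", "A", "A"),
    "III": ("G", "A", "A", "C"),
    "IV":  ("C", "A", "A", "C"),
}
for name, leaves in cases.items():
    print(name, scores(*leaves))
```

Only case IV separates the topologies, and it favors the split (1,4;2,3).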

Another problem with Maximum Parsimony (and other Character Based Algorithms): Efficiency

No efficient algorithms are known for solving the “big” problem for maximum parsimony/perfect phylogeny (both are known to be NP-hard).

Mainly for this reason, the most used approaches for solving the big problem are distance based methods.

Distance-based Methods for Constructing Phylogenies

This approach attempts to overcome the two weaknesses of maximum parsimony:

1. It starts by estimating inter-taxa distances from a well-defined statistical model of evolution (distances correspond to probabilities of change).

2. It provides efficient algorithms for the big problem.

Basic idea: The differences between species (usually represented by sequences of characters) are transformed to numerical distances, and an edge weighted tree realizing these distances is constructed.

Distance-Based Reconstruction

• Compute distances between all taxon-pairs

• Find a tree (edge-weighted) best-describing the distances

Distance matrix D (lower triangle):

0
3 0
9 8 0
15 14 18 0
17 16 20 22 0
16 15 19 21 9 0

[Figure: an edge-weighted tree describing these distances.]

Data → Distances → Trees

1. Modeling question: given the data (e.g., DNA sequences of the taxa), how do we define distances between taxa?

2. Algorithmic question: Decide if the distances define a tree (ultrametric or additive – to be defined later), and if so, construct that tree.

3. In reality, the computed distances are noisy. So we need the algorithm to return a tree which approximates the distances of the input data.

In the following we shall study items 2 and 1, and briefly discuss item 3.

Ultrametric and Tree Metric

A distance metric on a set M of L objects is a function d: M × M → R (represented by a symmetric matrix) satisfying:

• d(i,i) = 0, and for i ≠ j, d(i,j) > 0.

• d(i,j) = d(j,i).

• For all i, j, k it holds that d(i,k) ≤ d(i,j) + d(j,k).

A metric is ultrametric if it corresponds to distances between leaves of a tree which admits a molecular clock. It is a tree metric, or additive, if it corresponds to distances between nodes in a weighted tree.

1st model: Molecular Clock Ultrametric Trees

The molecular clock assumes a constant rate of evolution. Namely, the distances from any extinct taxon (internal vertex) to all its current descendants are identical. A rooted tree satisfying this property is called ultrametric.

Ultrametric trees

Definition: An ultrametric tree is a rooted weighted tree all of whose leaves are at the same depth.

Basic property: Define the height of the leaves to be 0. Then edge weights can be represented by the heights of internal vertices.

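[Figure: an ultrametric tree on the leaves A, E, D, C, B (leaf height 0). Internal-vertex heights: 3 for the parent of A and E, 5 for the vertex joining D with (A, E), 3 for the parent of C and B, and 8 for the root. Edge weights: 3 on each of the edges to A, E, C and B; 5 on the edge to D; 2, 3 and 5 on the internal edges.]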

Least Common Ancestor and distances in Ultrametric Tree

Let LCA(i,j) denote the least common ancestor of leaves i and j. Let height(LCA(i, j)) be its distance from the leaves, and dist(i,j) be the distance from i to j.

Observation: For any pair of leaves i, j in an ultrametric tree:

height(LCA(i,j)) = 0.5 dist(i,j).

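Values of height(LCA(i,j)) for the example tree with internal-vertex heights 3, 3, 5 and 8 (upper triangle):

    A  B  C  D  E
A   0  8  8  5  3
B      0  3  8  8
C         0  8  8
D            0  5
E               0

[Figure: the ultrametric tree on leaves A, E, D, C, B with internal-vertex heights 3, 5, 3 and 8.]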

Ultrametric Matrices

Definition: A distance matrix* U of dimension L×L is ultrametric iff for every three indices i, j, k:

U(i,j) ≤ max {U(i,k), U(j,k)}.

Example: U(i,j) = 9, U(i,k) = 6, U(j,k) = 9.

Theorem: The following conditions are equivalent for an L×L distance matrix U:

1. U is an ultrametric matrix.

2. There is an ultrametric tree with L leaves such that for each pair of leaves i,j:

U(i,j) = height(LCA(i,j)) = ½ dist(i,j).

* Recall: a distance matrix is a symmetric matrix with positive off-diagonal entries and zero diagonal entries, which satisfies the triangle inequality.
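The three-index condition of the definition translates directly into a brute-force check. The sketch below is illustrative: the function name `is_ultrametric` is hypothetical, the first matrix is the 5×5 example on leaves A to E from the earlier slide, and the second is a made-up metric that violates the condition.

```python
from itertools import permutations


def is_ultrametric(U):
    """Check U(i,j) <= max(U(i,k), U(j,k)) for every three distinct indices."""
    n = len(U)
    return all(U[i][j] <= max(U[i][k], U[j][k])
               for i, j, k in permutations(range(n), 3))


# Rows and columns in the order A, B, C, D, E.
U = [
    [0, 8, 8, 5, 3],
    [8, 0, 3, 8, 8],
    [8, 3, 0, 8, 8],
    [5, 8, 8, 0, 5],
    [3, 8, 8, 5, 0],
]
# A metric (3 + 4 >= 5) that is not ultrametric (5 > max(3, 4)).
V = [
    [0, 3, 5],
    [3, 0, 4],
    [5, 4, 0],
]
print(is_ultrametric(U), is_ultrametric(V))
```

The check takes O(L³) time, which is fine for small examples.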

Ultrametric tree ⇒ Ultrametric matrix

There is an ultrametric tree s.t. U(i,j) = ½ dist(i,j) ⇒ U is an ultrametric matrix: by properties of least common ancestors in trees.

[Figure: three leaves i, j, k, where LCA(k,j) lies below LCA(k,i) = LCA(j,i).]

U(k,i) = U(j,i) ≥ U(k,j)

Ultrametric matrix ⇒ Ultrametric tree:

The proof is based on the following two observations.

Definition: Let U be an L×L matrix, and let S ⊆ {1,...,L}.

U[S] is the submatrix of U consisting of the rows and columns with indices from S.

Observation 1: U is ultrametric iff for every S ⊆ {1,...,L}, U[S] is ultrametric.

Observation 2: If U is ultrametric and max_{i,j} U(i,j) = M, then M appears in every row of U.

Illustration for Observation 2: if U(j,k) = M, then for any row i, one of U(i,j), U(i,k) must be M.

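Ultrametric matrix ⇒ Ultrametric tree: Proof by induction

U is an ultrametric matrix ⇒ U has an ultrametric tree: by induction on L, the size of U.

Basis:

L = 1: T is a leaf.

L = 2: T is a tree with two leaves.

[Figure: for L = 1, the matrix (0) and a single leaf i; for L = 2, a matrix with U(i,j) = 9 and a tree whose root, labeled 9, has the leaves i and j as children.]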

Induction step

Induction step: L>2.

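Use the 1st row to split the set {1,…,L} into two subsets:

S1 = {i : U(1,i) = M},

S2 = {1,…,L} − S1,

where M is the maximal entry of U (by Observation 2 it appears in row 1). Note: 0 < |Si| < L.

Example: if row 1 of U is

    1 2 3 4 5
1   0 8 2 8 5

then S1 = {2,4} and S2 = {1,3,5}.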

Induction step

By Observation 1, U1= U[S1] and U2= U[S2] are ultrametric.

Let M1 (M2) be the maximal entries in U1 (U2 resp.).

Note that M1 ≤ M and M2 < M (M2 is the 2nd largest value in row 1; if M2 = 0 then T2 is a leaf).

By induction there are ultrametric trees T1 and T2 for U1 and U2.

Join T1 and T2 to T with a root as shown.

[Figure: a new root labeled M whose two subtrees are T1, with root label M1, and T2, with root label M2.]

Proof (end)

Need to prove: T is an ultrametric tree for U, i.e., U(i,j) is the label of the LCA of i and j in T.

If i and j are in the same subtree, this holds by induction.

Else the label of LCA(i,j) is M (since they are in different subtrees).

Also, [U(1,i) = M and U(1,j) ≠ M] ⇒ U(i,j) = M, since U(1,i) ≤ max {U(1,j), U(i,j)} and U(1,j) < M.

[Figure: the joined tree T, with i a leaf of T1 and j a leaf of T2.]
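The induction proof is constructive, so it can be turned into a short recursive procedure. The Python sketch below follows the proof's split by the first row. The nested-tuple tree encoding and the names `build_ultrametric_tree`, `leaves` and `lca_heights` are assumptions made for illustration, indices are 0-based, and the input is assumed to be an ultrametric matrix.

```python
def build_ultrametric_tree(U, labels=None):
    """Build a tree from an ultrametric matrix U, as in the induction proof.

    Returns the bare index for a leaf and (height, subtree1, subtree2)
    for an internal vertex.
    """
    if labels is None:
        labels = list(range(len(U)))
    if len(labels) == 1:                      # basis: a single leaf
        return labels[0]
    first = labels[0]
    # By Observation 2 the maximum of U[labels] appears in the first row.
    M = max(U[first][j] for j in labels)
    S1 = [j for j in labels if U[first][j] == M]
    S2 = [j for j in labels if U[first][j] != M]   # contains `first`
    return (M, build_ultrametric_tree(U, S1), build_ultrametric_tree(U, S2))


def leaves(t):
    """All leaf indices below t."""
    return leaves(t[1]) + leaves(t[2]) if isinstance(t, tuple) else [t]


def lca_heights(t, out=None):
    """Map every pair of leaves to the height label of its LCA."""
    out = {} if out is None else out
    if isinstance(t, tuple):
        h, a, b = t
        for i in leaves(a):
            for j in leaves(b):
                out[i, j] = out[j, i] = h
        lca_heights(a, out)
        lca_heights(b, out)
    return out


# The 5x5 example matrix, rows and columns in the order A, B, C, D, E.
U = [
    [0, 8, 8, 5, 3],
    [8, 0, 3, 8, 8],
    [8, 3, 0, 8, 8],
    [5, 8, 8, 0, 5],
    [3, 8, 8, 5, 0],
]
tree = build_ultrametric_tree(U)
heights = lca_heights(tree)
print(tree)
```

On the example, the root gets label 8 and separates {B, C} from {A, D, E}, and the LCA label of every pair of leaves equals the corresponding matrix entry.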