Page 1: Lecture 3. Phylogeny methods: Branch and bound, …evolution.gs.washington.edu/gs541/2010/lecture3.pdf · Lecture 3. Phylogeny methods: Branch and bound, distance methods Joe Felsenstein

Lecture 3. Phylogeny methods: Branch and bound, distance methods

Joe Felsenstein

Department of Genome Sciences and Department of Biology

Lecture 3. Phylogeny methods: Branch and bound, distance methods – p.1/25


Greedy search by sequential addition

[Figure: trees on species A to E built up one species at a time, with the candidate placements at each step labelled by their scores.]

Greedy search by addition of species in a fixed order (A, B, C, D, E) in the best place each time.
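The greedy search above can be sketched in a few lines. This is only an illustration under stated assumptions: Fitch parsimony is assumed as the score, trees are encoded as nested tuples rooted at the first species, and the four-site data set and all function names are made up for the example rather than taken from the lecture.

```python
def fitch(tree, site):
    """Return (possible states, changes) for one site on a rooted binary tree."""
    if isinstance(tree, str):                 # a tip: its observed base, no changes
        return {site[tree]}, 0
    (ls, lc), (rs, rc) = fitch(tree[0], site), fitch(tree[1], site)
    if ls & rs:                               # children share a state: no extra change
        return ls & rs, lc + rc
    return ls | rs, lc + rc + 1               # children disagree: one more change


def score(tree, seqs):
    """Parsimony score: Fitch changes summed over all sites."""
    n_sites = len(next(iter(seqs.values())))
    return sum(fitch(tree, {sp: s[i] for sp, s in seqs.items()})[1]
               for i in range(n_sites))


def placements(subtree, new):
    """Yield each tree made by attaching tip `new` to one branch of `subtree`."""
    yield (subtree, new)                      # on the branch above this subtree
    if not isinstance(subtree, str):
        left, right = subtree
        for t in placements(left, new):
            yield (t, right)
        for t in placements(right, new):
            yield (left, t)


def sequential_addition(order, seqs):
    """Add species in the given order, each at its best-scoring position."""
    first, second, third = order[:3]
    subtree = (second, third)                 # the only 3-species tree is (first, (second, third))
    for sp in order[3:]:
        subtree = min(placements(subtree, sp),
                      key=lambda t: score((first, t), seqs))
    tree = (first, subtree)
    return tree, score(tree, seqs)


# Hypothetical toy data: sites 1 and 3 group A with B, site 2 groups D with E.
seqs = {"A": "GAGA", "B": "GAGA", "C": "AAAC", "D": "ATAA", "E": "ATAA"}
tree, best = sequential_addition(["A", "B", "C", "D", "E"], seqs)
print(tree, best)
```

Adding the k-th species tries 2k − 5 placements (3 for D, 5 for E), so the greedy search scores far fewer trees than an exhaustive one, at the price of possibly missing the best tree.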


Goloboff’s time-saving trick

[Figure: tree diagrams whose subtrees are labelled by the species they contain (A, B−G, A−G, H−K, L, M−R, H−R, S−U, V−Z).]

Goloboff’s economy in computing scores of rearranged trees. Once the “views” have been computed, they can be taken to represent subtrees, without going inside those subtrees.


Star decomposition

[Figure: a series of trees on species A to F, starting from a star tree.]

“Star decomposition” search for the best tree can happen in multiple ways.


Disk-covering

[Figure: tree on species A to F with branch lengths between 0.02 and 0.1.]

“Disk covering”: assembly of a tree from overlapping estimated subtrees.


Shortest Hamiltonian path problem

[Figure: four panels, (a) to (d).]


Search tree for this problem

[Figure: search tree for orderings of 10 points. From “start”, each level adds one more point (“add 1”, “add 2”, “add 3”, ...). The tips shown are complete orderings: (1,2,3,4,5,6,7,8,9,10), (1,2,3,4,5,6,7,8,10,9), (1,2,3,4,5,6,7,9,8,10), (1,2,3,4,5,6,7,9,10,8), (1,2,3,4,5,6,7,10,8,9), (1,2,3,4,5,6,7,10,9,8), etc.]
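A search tree of this kind can be walked depth-first, abandoning any partial ordering that is already at least as long as the best complete path found so far. The sketch below is a minimal illustration, assuming points in the plane with Euclidean distances; the function name and the test points are hypothetical.

```python
import math


def shortest_hamiltonian_path(points):
    """Branch and bound over orderings of `points`; returns (length, order)."""
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]
    best = [math.inf, None]                   # best length and ordering so far

    def extend(path, length, remaining):
        if length >= best[0]:                 # bound: this branch cannot beat the best
            return
        if not remaining:                     # every point placed: a new best path
            best[0], best[1] = length, path[:]
            return
        # Trying the nearest unvisited point first tends to give good bounds early.
        for nxt in sorted(remaining, key=lambda j: dist[path[-1]][j]):
            path.append(nxt)
            extend(path, length + dist[path[-2]][nxt], remaining - {nxt})
            path.pop()

    for start in range(n):
        extend([start], 0.0, frozenset(range(n)) - {start})
    return best[0], best[1]


# Hypothetical example: four points on a line, given out of order.
length, order = shortest_hamiltonian_path([(0, 0), (3, 0), (1, 0), (2, 0)])
print(length, order)
```

The result is still guaranteed to be optimal, because a branch is discarded only when no completion of it could be shorter than a path already in hand.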


Search tree of trees

[Figure: search tree of trees. The three-species tree (A, B, C) branches into the four-species trees obtained by adding D, and each of those branches into the five-species trees obtained by adding E.]
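The size of this search tree grows quickly: the k-th species can be attached to any of the 2k − 5 branches of the tree that is already there, so the number of unrooted bifurcating trees is 1 × 3 × 5 × ... × (2n − 5). A small sketch (the function name is illustrative):

```python
def num_unrooted_trees(n_species):
    """Number of unrooted bifurcating trees: 1 * 3 * 5 * ... * (2n - 5)."""
    count = 1
    for k in range(4, n_species + 1):
        count *= 2 * k - 5      # the k-th species can go on any of 2k - 5 branches
    return count


for n in (3, 4, 5, 10, 20):
    print(n, num_unrooted_trees(n))
```

For 3, 4 and 5 species this gives 1, 3 and 15 trees, and for 10 species already 2,027,025.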


Same, with parsimony scores in place of trees

[Figure: the search tree of trees with each tree replaced by its parsimony score. The scores shown are 3; then 7, 8 and 9; then fifteen values between 8 and 11.]


Polynomial time and exponential time

[Figure: log-log plot of time (from $10^0$ to $10^6$) against problem size (from 1 to 100) for the two functions $6n^3 + 4n - 3$ and $e^{0.5n}$.]

How does the time taken by an algorithm depend on the size of the problem? If the dependence is polynomial (even with big coefficients), then for a big enough case the algorithm is faster than one whose time depends on the size exponentially.
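To see where the exponential overtakes the polynomial, one can simply tabulate both. The sketch below assumes the two curves on the slide are $6n^3 + 4n - 3$ and $e^{0.5n}$ (the polynomial is a reading of a garbled label); the function names are illustrative.

```python
import math


def poly_time(n):
    """Assumed polynomial running time from the plot."""
    return 6 * n**3 + 4 * n - 3


def exp_time(n):
    """Exponential running time from the plot."""
    return math.exp(0.5 * n)


# First problem size at which the exponential algorithm is the slower one.
crossover = next(n for n in range(1, 1000) if exp_time(n) > poly_time(n))
print(crossover)
for n in (10, 20, 40, 80):
    print(n, poly_time(n), round(exp_time(n)))
```

Under these assumptions the exponential algorithm is the faster one only for problem sizes below 23, and the gap widens rapidly beyond that.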


NP completeness and NP hardness

[Figure: diagram of the classes P, NP, NP Complete and NP Hard, annotated “is P = NP?” and “does this part exist?”]

(This diagram is not quite correct; see the diagrams on the Wikipedia page for “NP-hard”.)

P = problems that can be solved by a polynomial-time algorithm.

NP complete = problems for which a proposed solution can be checked in polynomial time, and for which it can be proven that if one of them is in P, all are.

NP hard = problems at least as hard as the NP-complete ones: a polynomial-time algorithm for any one of them would give one for every NP-complete problem. An NP-hard problem need not have solutions that can be checked in polynomial time.


Distance methods

These have been attractive, particularly to mathematical scientists who love geometry. This has its good and bad effects.

1. Take the sequences in all pairs.

2. For each pair compute a distance. (As we will see, this is best thought of as the length of the 2-species tree for those species.)

3. Try to find the tree which best fits the table of distances.
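Steps 1 and 2 can be sketched as follows. As a stand-in for a proper distance, this example uses the uncorrected fraction of sites that differ; the sequences and names are made up for illustration.

```python
from itertools import combinations


def p_distance(s1, s2):
    """Fraction of aligned sites at which two sequences differ."""
    return sum(a != b for a, b in zip(s1, s2)) / len(s1)


# Hypothetical aligned sequences of equal length.
seqs = {"A": "ACGTACGTAC", "B": "ACGTACGTTT", "C": "ACCTACGATC"}

# Step 1: all pairs. Step 2: one distance per pair.
table = {(x, y): p_distance(seqs[x], seqs[y]) for x, y in combinations(seqs, 2)}
for pair, d in table.items():
    print(pair, d)
```

Step 3, finding the tree that best fits `table`, is the hard part and is what the rest of the lecture addresses.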


A phylogeny with branch lengths

[Figure: unrooted tree with tips A, B, C, D, E and branch lengths 0.10, 0.07, 0.05, 0.08, 0.03, 0.06, 0.05.]

and the pairwise distances it predicts

        A      B      C      D      E
A       0    0.23   0.16   0.20   0.17
B     0.23     0    0.23   0.17   0.24
C     0.16   0.23     0    0.20   0.11
D     0.20   0.17   0.20     0    0.21
E     0.17   0.24   0.11   0.21     0
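Each predicted distance is just the sum of the branch lengths on the path between two tips. The sketch below assumes the branch lengths are assigned as ((B:0.10, D:0.07):0.05, A:0.08, (C:0.05, E:0.06):0.03), an assignment inferred from the numbers on the slide; the interior node names `m`, `x`, `y` and the function names are hypothetical.

```python
from itertools import combinations

# Assumed tree: A hangs off the central node m, (B, D) join at x, (C, E) join at y.
edges = {
    ("A", "m"): 0.08,
    ("B", "x"): 0.10, ("D", "x"): 0.07, ("x", "m"): 0.05,
    ("C", "y"): 0.05, ("E", "y"): 0.06, ("y", "m"): 0.03,
}


def neighbours(node):
    """Yield (adjacent node, branch length) for every branch touching `node`."""
    for (a, b), length in edges.items():
        if a == node:
            yield b, length
        elif b == node:
            yield a, length


def path_length(start, goal, came_from=None):
    """Sum of branch lengths on the unique path from `start` to `goal`."""
    if start == goal:
        return 0.0
    for nxt, length in neighbours(start):
        if nxt != came_from:                  # never walk back along the same branch
            rest = path_length(nxt, goal, start)
            if rest is not None:
                return length + rest
    return None                               # goal is not in this direction


predicted = {(x, y): round(path_length(x, y), 2)
             for x, y in combinations("ABCDE", 2)}
for pair, d in predicted.items():
    print(pair, d)
```

For example, the A to B path runs over branches of length 0.08, 0.05 and 0.10, giving 0.23.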


A phylogeny with branch lengths

[Figure: the same tree on A, B, C, D, E with its seven branch lengths labelled v1 to v7.]


Least squares trees

Least squares methods minimize

$$Q = \sum_{i=1}^{n} \sum_{j \neq i} w_{ij} (D_{ij} - d_{ij})^2$$

over all trees, using the distances $d_{ij}$ that they predict. Cavalli-Sforza and Edwards suggested $w_{ij} = 1$; Fitch and Margoliash suggested $w_{ij} = 1/D_{ij}^2$.
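As a small numerical illustration of the criterion, the sketch below evaluates Q under both weightings. The observed and predicted distances are hypothetical, and the `power` parameter is only a convenience for switching between the two weightings.

```python
def least_squares_q(observed, predicted, power=0):
    """Q with weights w_ij = 1 / D_ij**power.

    power=0 gives w_ij = 1 (Cavalli-Sforza and Edwards),
    power=2 gives w_ij = 1 / D_ij**2 (Fitch and Margoliash).
    Each unordered pair appears twice in the double sum over i and j != i.
    """
    return 2 * sum((D - predicted[pair]) ** 2 / D ** power
                   for pair, D in observed.items())


# Hypothetical observed distances D and tree-predicted distances d.
observed = {("A", "B"): 0.25, ("A", "C"): 0.15, ("B", "C"): 0.20}
predicted = {("A", "B"): 0.23, ("A", "C"): 0.16, ("B", "C"): 0.23}

q_unweighted = least_squares_q(observed, predicted, power=0)
q_fitch_margoliash = least_squares_q(observed, predicted, power=2)
print(q_unweighted, q_fitch_margoliash)
```

With the Fitch and Margoliash weights, the same absolute error counts for more on a small distance than on a large one.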


Statistical assumptions of least squares trees

The implicit assumption is that distances are (independently?) Normally distributed with expectation $d_{ij}$ and variance proportional to $1/w_{ij}$:

$$D_{ij} \sim N(d_{ij}, K/w_{ij})$$

Thus the different weightings correspond to different assumptions about the error in the distances. Also, there is assumed to be no covariance of distances.

In fact, the distances will covary, since a change in an interior branch of the tree increases (or decreases) all distances whose paths go through that branch.

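Matrix approach to fitting branch lengths

If we stack the distances up into a column vector $\mathbf{D}$, we can solve the least squares equation (obtained by taking derivatives of the quadratic form $Q$):

$$\mathbf{D}^T = (D_{12}, D_{13}, D_{14}, D_{15}, D_{23}, D_{24}, D_{25}, D_{34}, D_{35}, D_{45})$$

$$\mathbf{X}^T \mathbf{D} = (\mathbf{X}^T \mathbf{X})\, \mathbf{v}$$

where the “design matrix” $\mathbf{X}$ for the given tree topology has 1’s whenever a given branch lies on the path between those two species. Here is the design matrix for the tree we just saw. The columns are the branches $v_1$ to $v_7$, and each row is labelled with the distance it corresponds to.

$$
\mathbf{X} =
\begin{bmatrix}
1 & 1 & 0 & 0 & 0 & 0 & 1 \\
1 & 0 & 1 & 0 & 0 & 1 & 0 \\
1 & 0 & 0 & 1 & 0 & 0 & 1 \\
1 & 0 & 0 & 0 & 1 & 1 & 0 \\
0 & 1 & 1 & 0 & 0 & 1 & 1 \\
0 & 1 & 0 & 1 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 & 1 & 1 & 1 \\
0 & 0 & 1 & 1 & 0 & 1 & 1 \\
0 & 0 & 1 & 0 & 1 & 0 & 0 \\
0 & 0 & 0 & 1 & 1 & 1 & 1
\end{bmatrix}
\quad
\begin{matrix}
D_{12} \\ D_{13} \\ D_{14} \\ D_{15} \\ D_{23} \\ D_{24} \\ D_{25} \\ D_{34} \\ D_{35} \\ D_{45}
\end{matrix}
$$

[Figure: the tree on A, B, C, D, E with its branches labelled v1 to v7.]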
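The normal equations can be solved directly. The sketch below rebuilds the design matrix from the tree, forms $X^T X$ and $X^T D$, and solves for the branch lengths with a small Gauss-Jordan routine so that no external library is needed. The distance vector is an assumption: it holds the path lengths on the tree ((B:0.10, D:0.07):0.05, A:0.08, (C:0.05, E:0.06):0.03), and all names are illustrative.

```python
from itertools import combinations

species = ["A", "B", "C", "D", "E"]
# Branches (0-based) on the path from each tip to the central node, read off the
# design matrix: v1..v5 are tip branches, v6 joins (C, E) and v7 joins (B, D).
to_centre = {"A": {0}, "B": {1, 6}, "C": {2, 5}, "D": {3, 6}, "E": {4, 5}}
pairs = list(combinations(species, 2))

# A branch lies between two tips if it is on exactly one of their paths to the centre.
X = [[1 if k in to_centre[a] ^ to_centre[b] else 0 for k in range(7)]
     for a, b in pairs]

# Assumed distances, in the order D12, D13, D14, D15, D23, D24, D25, D34, D35, D45.
D = [0.23, 0.16, 0.20, 0.17, 0.23, 0.17, 0.24, 0.20, 0.11, 0.21]


def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [x - factor * y for x, y in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]


XtX = [[sum(row[i] * row[j] for row in X) for j in range(7)] for i in range(7)]
XtD = [sum(row[i] * d for row, d in zip(X, D)) for i in range(7)]
v = solve(XtX, XtD)
print([round(x, 4) for x in v])
```

Because these distances fit the tree exactly, the solution reproduces the branch lengths; with real data the same equations give the least squares estimates, which can turn out negative for a poorly fitting topology.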


The Jukes-Cantor model for DNA

[Figure: the four bases A, G, C and T, with every pair of bases connected by an arrow labelled u/3.]

Derivation of the probability of change

1. Imagine events occurring at rate $\frac{4}{3}u$ per unit time which replace a base by one of the 4 bases chosen at random.

2. Persuade yourself that this is no different in outcome from events at rate $u$ per unit time that replace it by one of the other 3 chosen at random.

3. The probability a branch has none of these (first kind of) events if it is of length $t$ is $\exp(-\frac{4}{3} u t)$. (Think of the zero term of a Poisson distribution.)

4. If it does have one or more of these events, you end up with one of the 4 bases chosen at random.

5. Therefore the probability of a net change is:

$$\frac{3}{4}\left(1 - e^{-\frac{4}{3} u t}\right)$$
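The argument translates directly into a short function: the chance of at least one event, times the 3/4 chance that the base drawn differs from the original. The function name is illustrative.

```python
import math


def prob_change(u, t):
    """Jukes-Cantor probability that a site differs after a branch of length t."""
    p_no_event = math.exp(-4.0 / 3.0 * u * t)   # zero term of the Poisson distribution
    return (1.0 - p_no_event) * 3.0 / 4.0       # an event picks a different base 3 times in 4


for ut in (0.0, 0.1, 0.5, 1.0, 5.0):
    print(ut, round(prob_change(1.0, ut), 4))
```

The probability starts at 0, rises almost linearly for short branches, and levels off at 3/4 for long ones.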


The distance for the Jukes-Cantor model

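[Figure: expected per-site differences (vertical axis, 0 to 1) plotted against branch length. The curve levels off at 0.75; the point where 0.49 differences per site corresponds to a branch length of 0.7945 is marked.]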
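Reading the curve backwards, from an observed fraction of differences to a branch length, means inverting the Jukes-Cantor formula $p = \frac{3}{4}(1 - e^{-\frac{4}{3} u t})$ to get $u t = -\frac{3}{4} \ln(1 - \frac{4}{3} p)$. A minimal sketch (the function name is illustrative):

```python
import math


def jc_distance(p):
    """Jukes-Cantor branch length (u t) for an observed fraction of differences p."""
    if not 0.0 <= p < 0.75:
        raise ValueError("the Jukes-Cantor distance is undefined for p >= 0.75")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


print(round(jc_distance(0.49), 4))
```

For p = 0.49 this gives about 0.7945, the point marked on the plot. As p approaches 0.75 the distance grows without bound, so small sampling errors in p translate into large errors in the distance.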


If you don’t correct for “multiple hits”

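[Figure: two three-species trees on A, B and C. One has branches of length 0.20 and 0.20 to A and C and 0.00 to B; the other has branches of length 0.155 and 0.155 to A and C and 0.0206 to B.]

Left: the true tree. Right: a tree fitting the uncorrected distances.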
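The numbers in the figure can be reproduced with a short calculation. The sketch below assumes the true tree is the one with branches 0.20 to A, 0.20 to C and 0.00 to B, computes the expected uncorrected differences under the Jukes-Cantor model, and fits a three-species tree to them exactly. The names are illustrative.

```python
import math


def expected_difference(branch_length):
    """Expected fraction of differing sites under Jukes-Cantor for a path of this length."""
    return 0.75 * (1.0 - math.exp(-4.0 * branch_length / 3.0))


# Assumed true tree: three branches meeting at one interior node.
true = {"A": 0.20, "B": 0.00, "C": 0.20}

# Uncorrected distances: the path between two tips is the sum of their branches.
D = {(x, y): expected_difference(true[x] + true[y])
     for x, y in (("A", "B"), ("A", "C"), ("B", "C"))}

# With three species the tree fits any distances exactly.
fitted = {
    "A": (D["A", "B"] + D["A", "C"] - D["B", "C"]) / 2,
    "B": (D["A", "B"] + D["B", "C"] - D["A", "C"]) / 2,
    "C": (D["A", "C"] + D["B", "C"] - D["A", "B"]) / 2,
}
for tip, length in fitted.items():
    print(tip, round(length, 4))
```

The long A to C distance is underestimated more than the shorter A to B and B to C distances, so the fitted tree gives B a positive branch (about 0.02) that the true tree does not have, and shortens the branches to A and C to about 0.155.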


References, page 1

Maddison, D. R. 1991. The discovery and importance of multiple islands of most-parsimonious trees. Systematic Zoology 40: 315-328. [Discusses heuristic search strategy involving ties, multiple starts]

Farris, J. S. 1970. Methods for computing Wagner trees. Systematic Zoology 19: 83-92. [Early parsimony algorithms paper is one of first to mention sequential addition strategy]

Saitou, N., and M. Nei. 1987. The neighbor-joining method: a new method for reconstructing phylogenetic trees. Molecular Biology and Evolution 4: 406-425. [First mention of star-decomposition search for best trees, sort of]

Strimmer, K., and A. von Haeseler. 1996. Quartet puzzling: a quartet maximum likelihood method for reconstructing tree topologies. Molecular Biology and Evolution 13: 964-969. [Assembles trees out of quartets]

Huson, D., S. Nettles, L. Parida, T. Warnow, and S. Yooseph. 1998. The disk-covering method for tree reconstruction. pp. 62-75 in Proceedings of “Algorithms and Experiments” (ALEX98), Trento, Italy, Feb. 9-11, 1998, ed. R. Battiti and A. A. Bertossi. [“Disk-covering method” for long stringy trees]

References, page 2

Foulds, L. R. and R. L. Graham. 1982. The Steiner problem in phylogeny is NP-complete. Advances in Applied Mathematics 3: 43-49. [Parsimony is NP-hard]

Graham, R. L. and L. R. Foulds. 1982. Unlikelihood that minimal phylogenies for a realistic biological study can be constructed in reasonable computational time. Mathematical Biosciences 60: 133-142. [ ... and more]

Hendy, M. D. and D. Penny. 1982. Branch and bound algorithms to determine minimal evolutionary trees. Mathematical Biosciences 59: 277-290. [Introduced branch-and-bound for phylogenies]

Felsenstein, J. 2004. Inferring Phylogenies. Sinauer Associates, Sunderland, Massachusetts. [For this lecture the material is chapters 4 and 5]

Semple, C. and M. Steel. 2003. Phylogenetics. Oxford University Press, Oxford. [Also covers search strategies]

References, page 3

Felsenstein, J. 1984. Distance methods for inferring phylogenies: a justification. Evolution 38: 16-24. [Argument for statistical interpretation of distance methods]

Farris, J. S. 1985. Distance data revisited. Cladistics 1: 67-85. [Reply to my 1984 paper]

Felsenstein, J. 1986. Distance methods: reply to Farris. Cladistics 2: 130-143. [Reply to Farris 1985]

Farris, J. S. 1986. Distances and statistics. Cladistics 2: 144-157. [Debate was cut off after this]

References, page 4

Bryant, D., and P. Waddell. 1998. Rapid evaluation of least-squares and minimum-evolution criteria on phylogenetic trees. Molecular Biology and Evolution 15: 1346-1359. [Quicker least squares distance trees]

Felsenstein, J. 2004. Inferring Phylogenies. Sinauer Associates, Sunderland, Massachusetts. [See chapter 11]

Semple, C. and M. Steel. 2003. Phylogenetics. Oxford University Press, Oxford. [See pp. 145-160]

Yang, Z. 2007. Computational Molecular Evolution. Oxford University Press, Oxford. [See pages 89-93]