Benjamin Loyle 2004 Cse 397

Solving Phylogenetic Trees

Benjamin Loyle

March 16, 2004

Cse 397 : Intro to MBIO

Table of Contents

Problem & Term Definitions
A DCM*-NJ Solution
Performance Measurements
Possible Improvements

Phylogeny

[Figure: phylogenetic tree relating Orangutan, Gorilla, Chimpanzee, and Human. From the Tree of Life website, University of Arizona.]

DNA Sequence Evolution

[Figure: DNA sequences diverging along a tree, with a time axis marked -3 mil yrs, -2 mil yrs, -1 mil yrs, and today. Sequences shown: AAGACTT, TGGACTT, AAGGCCT, AGGGCAT, TAGCCCT, AGCACTT, TAGCCCA, TAGACTT, AGCGCTT, AGCACAA.]

Problem Definition

The Tree of Life
• Connecting all living organisms
• All encompassing
• Find evolution from simple beginnings
Even smaller relations are tough
• Impossible
Infer possible ancestral history.

So what….

Genome sequencing provides the entire map of a species, so why link them?

We can understand evolution
Viable drug testing and design
Predict the function of genes
Influenza evolution

Why is that a problem?

Over 8 million organisms
The optimization problems behind current methods are NP-hard
Computing a few hundred species takes years
Error is a very large factor

What do we want?

Input
• A collection of nodes, such as taxa or protein strings, to compare in a tree
Output
• A topological link to compare those nodes to each other
When do we want it? FAST!

Preparing the input

Create a distance matrix
• Sum up all of the known distances into a matrix of size n x n
• n is the number of nodes or taxa
• Found with sequence comparison

Distance Matrix

Take 5 separate DNA strings

A : GATCCATGA
B : GATCTATGC
C : GTCCCATTT
D : AATCCGATC
E : TCTCGATAG

The distance between A and B is 2
The distance between A and C is 4

This is subjective, based on what your criteria are.
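
One criterion that reproduces the two distances quoted above is the Hamming distance, which counts the positions where two equal-length strings differ. The sketch below assumes that criterion; the names `hamming`, `distance_matrix`, and `SEQS` are illustrative and not from the slides.

```python
def hamming(a, b):
    """Count the positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


# The five example strings from the slide.
SEQS = {
    "A": "GATCCATGA",
    "B": "GATCTATGC",
    "C": "GTCCCATTT",
    "D": "AATCCGATC",
    "E": "TCTCGATAG",
}


def distance_matrix(seqs):
    """Return {(name1, name2): distance} for every ordered pair of names."""
    names = sorted(seqs)
    return {(p, q): hamming(seqs[p], seqs[q]) for p in names for q in names}


matrix = distance_matrix(SEQS)
print(matrix[("A", "B")], matrix[("A", "C")])  # 2 4
```

Other criteria, such as distances corrected under an evolutionary model, would give different numbers for the same strings.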

Distance Matrix

Let's start with an example matrix

      A    B    C    D    E
A     0   63   94  111   67
B          0   79   96   16
C               0   47   83
D                    0  100
E                         0

Let's make it simple (constrain the input)

Let's keep the distance between nodes within a certain limit
• From F -> G
• F and G have the largest distance; they are the most dissimilar of any nodes.
• This is called the diameter of the tree
Let's keep the length of the input (length of the strings) polynomial.

ERROR?!?!!?

All trees are inferred, so how do you ever know if you’re right?
How accurate do we have to be?
We can create data sets to test the trees that we create and assume that the method will then work in the real world

Data Sets

JC Model
• Sites evolve independently
• Sites change with the same probability
• Changes are single-character changes, i.e. A -> G or T -> C
• The number of changes on an edge e is a Poisson random variable with expectation $\lambda(e)$

More Data Sets

K2P Model
• Based on the JC Model
• Allows a different probability for transitions than for transversions
• It’s more likely for A and G to switch and for C and T to switch (transitions)
• Normally set to twice as likely

Data Use

Using these data sets we can create our own evolution of data.

Start with one “ancestor” and create evolutions

Plug the evolutions back and see if you get what you started with
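
As a rough illustration of "start with one ancestor and create evolutions", the sketch below mutates a sequence in the JC style: every site changes independently, and a changed site moves to each of the other three bases with equal chance. The function names, the per-edge parameter `lam`, and the example value 0.2 are assumptions made for illustration; the conversion from the expected number of changes to an observed-difference probability is the standard Jukes-Cantor formula.

```python
import math
import random

BASES = "ACGT"


def jc_change_probability(lam):
    """Probability that a site looks different after an edge with
    lam expected substitutions per site (Jukes-Cantor)."""
    return 0.75 * (1.0 - math.exp(-4.0 * lam / 3.0))


def evolve(seq, p, rng):
    """Copy seq; each site independently changes with probability p
    to one of the three other bases, chosen uniformly."""
    out = []
    for base in seq:
        if rng.random() < p:
            out.append(rng.choice([b for b in BASES if b != base]))
        else:
            out.append(base)
    return "".join(out)


rng = random.Random(0)           # fixed seed so runs are repeatable
ancestor = "AAGACTT"             # example sequence, chosen for illustration
p = jc_change_probability(0.2)   # 0.2 is an arbitrary example edge length
children = [evolve(ancestor, p, rng) for _ in range(2)]
print(ancestor, children)
```

Applying `evolve` again to each child builds a deeper tree whose true shape is known, which is what makes the later accuracy check possible.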

Aspects of Trees

Topology
• The method in which nodes are connected to each other
• “Are we really connected to apes directly, or just linked long before we could be considered mammals?”
Distance
• The sum of the weighted edges to reach one node from another

What can distance tell us?

The distance between nodes IS the evolutionary distance between the nodes

The distance between an ancestor and a leaf (present-day object) can be interpreted as an estimate of the number of evolutionary ‘steps’ that occurred.

Current Techniques

Maximum Parsimony
• Minimize the total number of evolutionary events
• Find the tree that has a minimum amount of changes from ancestors
Maximum Likelihood
• Probability based
• Which tree is most probable to occur based on current data

More Techniques

Neighbor Joining
• Repeatedly joins pairs of leaves (or subtrees) by rules of numerical optimization
• It shrinks the distance matrix by considering two ‘neighbors’ as one node

Learning Neighbor Joining

Why will become apparent later on, but let's learn how to do Neighbor Joining (NJ)

      A    B    C    D    E
A     0    3    3    4    3
B          0    3    3    4
C               0    3    3
D                    0    3
E                         0

NJ Part 1

First start with a “star tree”

[Figure: star tree with leaves A, B, C, D, E all attached to one central node]

NJ Part 2

Combine the closest two nodes (from the distance matrix)
• In our case it is node A and B at distance 3

[Figure: tree on leaves A, B, C, D, E]

NJ Part 3

Repeat this until you have added n-2 nodes (3)
• n-2 internal nodes make it a binary tree, so we only have to include one more node.

[Figure: tree on leaves A, B, C, D, E]
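
The three NJ parts above can be sketched in code. One caveat: the slides describe the join step as picking the closest two nodes, whereas the standard Saitou-Nei formulation picks the pair that minimizes a corrected score, Q(i,j) = (n-2)·d(i,j) - r(i) - r(j), where r is a node's total distance to all others. The sketch below uses that standard criterion. It returns only the topology (as nested tuples) and omits branch lengths. The function name and the dictionary layout are assumptions, and the input is the 63/94/111 example matrix shown earlier in the deck.

```python
from itertools import combinations


def neighbor_joining(names, d):
    """names: leaf labels; d: {frozenset({a, b}): distance}.
    Returns the unrooted topology as a 3-tuple of nested tuples."""
    dist = dict(d)
    nodes = list(names)
    while len(nodes) > 3:
        n = len(nodes)
        # Total distance from each node to all other current nodes.
        total = {a: sum(dist[frozenset((a, b))] for b in nodes if b != a)
                 for a in nodes}

        def q(pair):
            a, b = pair
            return (n - 2) * dist[frozenset(pair)] - total[a] - total[b]

        i, j = min(combinations(nodes, 2), key=q)
        merged = (i, j)
        # Shrink the matrix: the joined pair becomes a single node.
        for k in nodes:
            if k != i and k != j:
                dist[frozenset((merged, k))] = (
                    dist[frozenset((i, k))] + dist[frozenset((j, k))]
                    - dist[frozenset((i, j))]) / 2
        nodes = [k for k in nodes if k != i and k != j] + [merged]
    return tuple(nodes)


NAMES = ["A", "B", "C", "D", "E"]
# Upper triangle of the example matrix, row by row (diagonal omitted).
UPPER = [[63, 94, 111, 67], [79, 96, 16], [47, 83], [100]]
D = {frozenset((NAMES[i], NAMES[i + k])): v
     for i, row in enumerate(UPPER) for k, v in enumerate(row, start=1)}

print(neighbor_joining(NAMES, D))  # ('B', 'E', ('A', ('C', 'D')))
```

Read as an unrooted tree, the result pairs B with E and C with D, with A attached between the two pairs. The first join is C and D even though B and E are the closest pair (distance 16), which shows how the corrected score differs from plain closeness.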

Are we done?

ML and MP, even in heuristic form, take too long for large data sets

NJ has poor topological accuracy, especially for large-diameter trees

We need something that works for large-diameter trees and can be run fast.

Here’s what we want

Our Goal
An “Absolute Fast Converging” (afc) Method
• A method $\Phi$ is afc if, for all positive $f$, $g$, $\varepsilon$, on the model $M$, there is a polynomial $p$ such that, for all $(T, \{\lambda(e)\})$ in the set $M_{f,g}$, on a set $S$ of $n$ sequences of length at least $p(n)$ generated on $T$, we have $\Pr[\Phi(S) = T] > 1 - \varepsilon$.
• Simply: let's make it in polynomial time within a degree of error.

A DCM*-NJ Solution

Two-phase construction of a final phylogenetic tree given a distance matrix d.

Phase 1 : Create a set of plausible trees for the distance matrix

Phase 2 : Find the best fitting tree

Phase 1

For each q in {dij}, compute a tree tq

Let T = { tq : q in {dij} }

Finding tq

Step 1: Compute Thresh(d,q)
Step 2: Triangulate Thresh(d,q)
Step 3: Compute a NJ Tree for all maximal cliques
Step 4: Merge the subtrees into a supertree

What does that mean?

Breaking the problem up
• Create a threshold of diameters to break the problem into a bunch of smaller-diameter trees (cliques)
• Apply NJ to those cliques
• Merge them back

Finding tq (terms)

Threshold Graph
• Thresh(d,q) is the threshold graph where (i,j) is an edge if and only if dij <= q.

Threshold

Let's bring back our distance matrix and create a threshold graph with q equal to d15, the distance between A and E. So q = 67.

Distance Matrix

Our old example matrix

      A    B    C    D    E
A     0   63   94  111   67
B          0   79   96   16
C               0   47   83
D                    0  100
E                         0

With q = d15 = 67

[Figure: threshold graph on nodes A, B, C, D, E with edges A-B (63), A-E (67), B-E (16), and C-D (47)]
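
The threshold graph for the example matrix can be computed directly from the definition of Thresh(d,q). This is a minimal sketch; `to_dict`, `thresh`, and the dictionary-of-pairs layout are illustrative names, not part of DCM*-NJ itself.

```python
from itertools import combinations

NAMES = ["A", "B", "C", "D", "E"]
# Upper triangle of the example matrix, row by row (diagonal omitted).
UPPER = [[63, 94, 111, 67], [79, 96, 16], [47, 83], [100]]


def to_dict(names, upper):
    """Map each unordered pair of names to its distance."""
    d = {}
    for i, row in enumerate(upper):
        for offset, value in enumerate(row, start=1):
            d[frozenset((names[i], names[i + offset]))] = value
    return d


def thresh(names, d, q):
    """Edges (i, j) of the threshold graph: all pairs with d_ij <= q."""
    return [(a, b) for a, b in combinations(names, 2)
            if d[frozenset((a, b))] <= q]


d = to_dict(NAMES, UPPER)
print(thresh(NAMES, d, 67))
# [('A', 'B'), ('A', 'E'), ('B', 'E'), ('C', 'D')]

# Phase 1 tries every distinct entry of the matrix as a threshold q.
candidates = sorted(set(d.values()))
print(len(candidates))  # 10
```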

Triangulating

A graph is triangulated if any cycle with four or more vertices has a chord
• That is, an edge joining two nonconsecutive vertices of the cycle.
Our example is already triangulated, but let's look at another

Triangulating

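[Figure: a four-cycle on W, X, Y, Z whose four sides each have length 5; the two diagonals have lengths 10 and 15]

Let's say this is for q = 5.
The edges of length 10 and 15 would not be in the graph.
To triangulate this graph, you add the edge of length 10.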
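
One way to arrive at "add the edge of length 10" is a greedy elimination heuristic: repeatedly remove the vertex whose neighbours can be made into a clique most cheaply, adding the missing edges as you go. The sketch below is one such heuristic, not necessarily the one used by DCM*-NJ. Which diagonal carries the 10 is not recoverable from the slide, so the assignment W-Z = 10 and X-Y = 15 is an assumption, as are all the names.

```python
from itertools import combinations


def greedy_triangulate(vertices, edges, dist):
    """Return the fill edges added by a greedy elimination ordering.
    dist maps frozenset({a, b}) to the matrix distance of that pair."""
    adj = {v: set() for v in vertices}
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    remaining = set(vertices)
    fill = []

    def missing_pairs(v):
        # Pairs of v's remaining neighbours that are not yet adjacent.
        nbrs = sorted(adj[v] & remaining)
        return [(a, b) for a, b in combinations(nbrs, 2) if b not in adj[a]]

    def cost(v):
        # Largest distance among the edges that eliminating v would add.
        return max((dist[frozenset(p)] for p in missing_pairs(v)), default=0)

    while remaining:
        v = min(sorted(remaining), key=cost)
        for a, b in missing_pairs(v):
            adj[a].add(b)
            adj[b].add(a)
            fill.append((a, b))
        remaining.remove(v)
    return fill


square = [("W", "X"), ("X", "Z"), ("Z", "Y"), ("Y", "W")]
dist = {frozenset(e): 5 for e in square}
dist[frozenset(("W", "Z"))] = 10   # assumed diagonal
dist[frozenset(("X", "Y"))] = 15   # assumed diagonal
print(greedy_triangulate("WXYZ", square, dist))  # [('W', 'Z')]
```

Adding the returned fill edges to the original graph always gives a triangulated graph, although not necessarily a minimum one.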

Maximal Cliques

A clique that cannot be enlarged by the addition of another vertex.

Recall our original threshold graph which is triangulated:

Triangulated Threshold Graph

Our old graph

[Figure: threshold graph on nodes A, B, C, D, E with edges A-B (63), A-E (67), B-E (16), and C-D (47)]

Clique

Our maximal cliques would be:

{A, B, E}

{C, D}
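
These two cliques can be checked mechanically. The sketch below uses the basic Bron-Kerbosch enumeration, which works on any graph; it does not exploit the fact that the graph is triangulated, so it should be read as a correctness check rather than as the polynomial-time method. The function name and edge list format are illustrative.

```python
def maximal_cliques(vertices, edges):
    """Enumerate maximal cliques with the basic Bron-Kerbosch recursion."""
    adj = {v: set() for v in vertices}
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    found = []

    def expand(clique, candidates, excluded):
        if not candidates and not excluded:
            found.append(frozenset(clique))  # nothing can extend this clique
            return
        for v in sorted(candidates):
            expand(clique | {v}, candidates & adj[v], excluded & adj[v])
            candidates = candidates - {v}
            excluded = excluded | {v}

    expand(set(), set(vertices), set())
    return found


# Edges of the threshold graph for q = 67.
edges = [("A", "B"), ("A", "E"), ("B", "E"), ("C", "D")]
for clique in maximal_cliques("ABCDE", edges):
    print(sorted(clique))
# ['A', 'B', 'E']
# ['C', 'D']
```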

Create Trees for the Cliques

We have two maximal cliques, so we make two trees: {A, B, E} and {C, D}
How do we make these trees? Remember NJ?

Tree {A, B, E} and {C, D}

[Figure: one tree on leaves A, B, E and one tree on leaves C, D]

Merge your separate trees together.

Create one Supertree
• This is done by creating a minimum set of edges in the trees and calling that the “backbone”
• This is its own doctoral thesis, so let's do a little hand waving

That sounds like NP-hard!

Computing the threshold graph is polynomial
Minimally triangulating is NP-hard, but a triangulation can be obtained in polynomial time using a greedy heuristic without too much loss in performance.
Finding maximal cliques is polynomial only if the input graph is triangulated (which it is!).
If all previous steps are done, creating a supertree can be done in polynomial time as well.

Where are we now?

We now have a finalized phylogeny created from smaller trees in our matrix joined together
Remember we started from all possible sizes of smaller trees, so we have one such phylogeny tq for each threshold q.

Phase 2

Which one is right?
Found using the SQS (Short Quartet Support) method
• Let T be one of the trees made in Phase 1
• Break the data into sets of four taxa: {A, B, C, D} {A, C, D, E} {A, B, D, E}… etc
• Reduce the larger tree to only hold “one set”
• These are called Quartets

SQS - A Guide

Q(T) is the set of trees induced by T on each set of four leaves.

Let Qw (different Q) be a set of quartets with diameter less than or equal to w

Find the maximum w for which every quartet in Qw agrees with the tree

This w is the “support” of that tree

SQS - Rephrased

Qw is the set of quartet trees which have a diameter <= w

Support of T is the max w where Qw is a subset of Q(T)
• Support is our “quality measure”
• What exactly are we measuring?
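
To make "quartets with diameter <= w" concrete, the sketch below lists every four-taxon subset of the example matrix together with its diameter (the largest pairwise distance inside the subset) and filters by w. It covers only the leaf sets of Qw: working out each quartet's tree shape and comparing it with Q(T) is not shown. All names are illustrative.

```python
from itertools import combinations

NAMES = "ABCDE"
# Upper triangle of the example matrix, row by row (diagonal omitted).
UPPER = [[63, 94, 111, 67], [79, 96, 16], [47, 83], [100]]
d = {frozenset((NAMES[i], NAMES[i + k])): v
     for i, row in enumerate(UPPER) for k, v in enumerate(row, start=1)}


def diameter(quartet):
    """Largest pairwise distance among the four taxa."""
    return max(d[frozenset(p)] for p in combinations(quartet, 2))


def short_quartets(w):
    """Leaf sets of the quartets whose diameter is at most w."""
    return [q for q in combinations(NAMES, 4) if diameter(q) <= w]


for q in combinations(NAMES, 4):
    print("".join(q), diameter(q))
# ABCD 111
# ABCE 94
# ABDE 111
# ACDE 111
# BCDE 100

print(short_quartets(100))
# [('A', 'B', 'C', 'E'), ('B', 'C', 'D', 'E')]
```

Raising w adds quartets to Qw, so the support of a tree is the largest w reached before Qw first contains a quartet the tree disagrees with.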

Qw =

[Figure: Qw drawn as a set of quartet trees on leaves taken from A, B, C, D, E]

SQS Method

Return the tree whose support is the maximum.
• If more than one such tree exists, return the tree found first.
• This is the tree with the smallest original diameter (remember from Phase 1)

How do we know we’re right?

Compare it to the data set we created
Look at Robinson-Foulds accuracy
• Remove one edge in the tree we’ve created. The leaves now fall into two separate trees.
• Is there any way to create the same two sets of leaves by removing one edge in the tree from our data set?
• If no, add a ‘point’ of error.
• Repeat this for all edges
When the value is not zero, the trees are not identical
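
The edge-by-edge comparison above amounts to comparing the leaf splits of the two trees. The sketch below counts the splits of the inferred tree that the true tree lacks, following the one-sided count described on the slide; the full Robinson-Foulds distance also adds the splits missing in the other direction. Trees are written as nested tuples of leaf names, which is a representation chosen for illustration, as are the two example trees.

```python
def leaf_set(tree):
    """All leaf names below a node (a leaf is a plain string)."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def splits(tree):
    """Non-trivial leaf splits, one per internal edge. Each split is
    stored as the side that does NOT contain the smallest leaf name."""
    everything = leaf_set(tree)
    anchor = min(everything)
    found = set()

    def visit(node):
        if isinstance(node, str):
            return
        side = leaf_set(node)
        other = everything - side
        if len(side) >= 2 and len(other) >= 2:
            found.add(other if anchor in side else side)
        for child in node:
            visit(child)

    visit(tree)
    return found


def rf_errors(inferred, true):
    """Number of edges in `inferred` whose split is absent from `true`."""
    return len(splits(inferred) - splits(true))


true_tree = (("A", "B"), "C", ("D", "E"))
inferred = (("A", "C"), "B", ("D", "E"))
print(rf_errors(inferred, true_tree))   # 1
print(rf_errors(true_tree, true_tree))  # 0
```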

Performance of DCM*-NJ

Outperforms NJ method at sequence lengths above 4000 and with more taxa.

[Figure: error rate (0 to 0.8) versus number of taxa (0 to 1600) for NJ and DCM-NJ]

Improvements

Improvement possibilities lie in Phase 2
Include a test of Maximum Parsimony (MP)
• Try to minimize the overall size of the tree
Test using statistical evidence
• Maximum Likelihood (ML)

Performance gains

Simply changing Phase 2 has massive gains in accuracy!

DCM-NJ+MP and DCM-NJ+ML are VERY accurate for data sets with sequence lengths greater than 4000 and are NOT NP-hard.

DCM-NJ+MP finished its analysis on a 107-taxon tree in under three minutes.

Comparing Improvements

[Figure: error rate (0 to 0.8) versus number of leaves (0 to 1600) for DCM-NJ+SQS, NJ, DCM-NJ+MP, and HGT-FP]