Benjamin Loyle 2004 Cse 397
Solving Phylogenetic Trees
Benjamin Loyle
March 16, 2004
Cse 397 : Intro to MBIO
Table of Contents
Problem & Term Definitions
A DCM*-NJ Solution
Performance Measurements
Possible Improvements
Phylogeny
[Figure: tree relating Orangutan, Gorilla, Chimpanzee, and Human. From the Tree of Life website, University of Arizona]
DNA Sequence Evolution
[Figure: DNA sequences (AAGACTT, TGGACTT, AAGGCCT, AGGGCAT, TAGCCCT, AGCACTT, TAGCCCA, TAGACTT, AGCGCTT, AGCACAA) placed on a tree against a timeline running from -3 million years to today]
Problem Definition
The Tree of Life
• Connecting all living organisms
• All encompassing
• Find evolution from simple beginnings
• Even smaller relations are tough
• Impossible
Infer possible ancestral history.
So what….
Genome sequencing provides the entire map of a species, so why link them?
• We can understand evolution
• Viable drug testing and design
• Predict the function of genes
• Influenza evolution
Why is that a problem?
Over 8 million organisms
The optimization problems behind current solutions are NP-hard
Computing a few hundred species takes years
Error is a very large factor
What do we want?
Input
• A collection of nodes, such as taxa or protein strings, to compare in a tree
Output
• A topological link to compare those nodes to each other
When do we want it?
• FAST!
Preparing the input
Create a distance matrix
• Sum up all of the known distances into a matrix sized n x n
• n is the number of nodes or taxa
• Found with sequence comparison
Distance Matrix
Take 5 separate DNA strings
A : GATCCATGA
B : GATCTATGC
C : GTCCCATTT
D : AATCCGATC
E : TCTCGATAG
The distance between A and B is 2
The distance between A and C is 4
This is subjective based on what your criteria are.
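The two distances quoted above are what you get by counting mismatched positions (the Hamming distance). The sketch below assumes that criterion; the slide only says the choice is subjective, and the names `hamming` and `distance_matrix` are made up for illustration.

```python
# Assumption: distance = number of mismatched positions (Hamming distance).
SEQUENCES = {
    "A": "GATCCATGA",
    "B": "GATCTATGC",
    "C": "GTCCCATTT",
    "D": "AATCCGATC",
    "E": "TCTCGATAG",
}


def hamming(s, t):
    """Count the positions at which two equal-length strings differ."""
    if len(s) != len(t):
        raise ValueError("sequences must have equal length")
    return sum(1 for x, y in zip(s, t) if x != y)


def distance_matrix(seqs):
    """Return a nested dict d[i][j] holding every pairwise distance."""
    return {i: {j: hamming(si, sj) for j, sj in seqs.items()}
            for i, si in seqs.items()}


d = distance_matrix(SEQUENCES)
print(d["A"]["B"], d["A"]["C"])  # prints "2 4", matching the slide
```

A different criterion (for example one that corrects for repeated changes at the same site) would give a different matrix from the same strings.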
Distance Matrix
Let's start with an example matrix
      A    B    C    D    E
A     0   63   94  111   67
B          0   79   96   16
C               0   47   83
D                    0  100
E                         0
Let's make it simple (constrain the input)
Let's keep the distance between nodes within a certain limit
• From F -> G
• F and G have the largest distance; they are the most dissimilar of any nodes.
• This is called the diameter of the tree
Let's keep the length of the input (length of the strings) polynomial.
ERROR?!?!!?
All trees are inferred, so how do you ever know if you’re right?
How accurate do we have to be?
• We can create data sets to test trees that we create and assume that it will then work in the real world
Data Sets
JC (Jukes-Cantor) Model
• Sites evolve independently
• Sites change with the same probability
• Changes are single character changes, i.e. A -> G or T -> C
• The number of changes on an edge e is a Poisson variable with expectation λ(e)
More Data Sets
K2P (Kimura 2-parameter) Model
• Based on JC Model
• Allows transitions and transversions to have different probabilities
• It’s more likely for A and G to switch and for C and T to switch (these are the transitions)
• Transitions are normally set to twice as likely
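As a rough illustration of the K2P idea, the sketch below assigns relative weights to the three possible substitutions of a base. It reads "twice as likely" as each transition having twice the weight of each transversion; that reading, the parameter name `kappa`, and the function names are assumptions, not something the slides specify.

```python
BASES = "ACGT"
# Transitions swap purines (A <-> G) or pyrimidines (C <-> T).
TRANSITION_PARTNER = {"A": "G", "G": "A", "C": "T", "T": "C"}


def is_transition(x, y):
    """True if changing base x into base y is a transition."""
    return TRANSITION_PARTNER[x] == y


def k2p_weights(base, kappa=2.0):
    """Relative weights of the three possible substitutions of `base`.

    Assumption: a transition weighs `kappa` times one transversion.
    """
    return {b: (kappa if is_transition(base, b) else 1.0)
            for b in BASES if b != base}


print(k2p_weights("A"))
```

Setting `kappa=1.0` makes all three substitutions equally likely, which is the JC behaviour.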
Data Use
Using these data sets we can create our own evolution of data.
Start with one “ancestor” and create evolutions
Plug the evolutions back and see if you get what you started with
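A minimal sketch of creating such a data set: start from one ancestor and mutate it down a known tree. It simplifies the JC model by giving every edge the same per-site change probability `p_change`; the tree shape, the ancestor string, and all names here are hypothetical.

```python
import random

BASES = "ACGT"


def evolve(seq, p_change, rng):
    """Copy seq along one edge; each site independently changes with
    probability p_change to one of the other three bases."""
    out = []
    for base in seq:
        if rng.random() < p_change:
            base = rng.choice([b for b in BASES if b != base])
        out.append(base)
    return "".join(out)


def simulate(tree, root_seq, p_change, rng):
    """Evolve root_seq down a tree given as nested tuples of leaf names.

    Returns a dict mapping each leaf name to its simulated sequence.
    """
    leaves = {}

    def walk(node, seq):
        if isinstance(node, str):
            leaves[node] = seq
            return
        for child in node:
            walk(child, evolve(seq, p_change, rng))

    walk(tree, root_seq)
    return leaves


rng = random.Random(0)                       # fixed seed, repeatable run
true_tree = (("A", "B"), ("C", ("D", "E")))  # hypothetical "true" tree
leaves = simulate(true_tree, "AAGACTT" * 10, 0.05, rng)
for name, seq in sorted(leaves.items()):
    print(name, seq)
```

The leaf sequences can then be fed to a tree-building method and the result compared against `true_tree`.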
Aspects of Trees
Topology
• The method in which nodes are connected to each other
• “Are we really connected to apes directly, or just linked long before we could be considered mammals?”
Distance
• The sum of the weighted edges to reach one node from another
What can distance tell us?
The distance between nodes IS the evolutionary distance between the nodes
The distance between an ancestor and a leaf (a present-day object) can be interpreted as an estimate of the number of evolutionary ‘steps’ that occurred.
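To make "sum of the weighted edges" concrete, here is a small sketch that walks the unique path between two nodes of a tree and adds up the edge weights. The example tree, its weights, and the internal node names `u` and `v` are invented for illustration.

```python
def build_adjacency(edges):
    """Turn (node, node, weight) triples into a symmetric adjacency dict."""
    adj = {}
    for u, v, w in edges:
        adj.setdefault(u, {})[v] = w
        adj.setdefault(v, {})[u] = w
    return adj


def path_distance(adj, start, goal):
    """Sum of edge weights on the unique path between two tree nodes."""
    stack = [(start, None, 0)]
    while stack:
        node, parent, dist = stack.pop()
        if node == goal:
            return dist
        for nbr, w in adj[node].items():
            if nbr != parent:  # never walk back along the edge we came from
                stack.append((nbr, node, dist + w))
    raise ValueError("nodes are not connected")


# Hypothetical tree: leaves A, B hang off u; leaves C, D hang off v.
adj = build_adjacency([("A", "u", 1), ("B", "u", 2), ("u", "v", 3),
                       ("C", "v", 4), ("D", "v", 5)])
print(path_distance(adj, "A", "C"))  # 1 + 3 + 4 = 8
```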
Current Techniques
Maximum Parsimony
• Minimize the total number of evolutionary events
• Find the tree that has a minimum amount of changes from ancestors
Maximum Likelihood
• Probability based
• Which tree is most probable to occur based on current data
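For Maximum Parsimony, counting the changes on one fixed tree is the easy part; searching over all trees is what makes it expensive. The sketch below scores a fixed rooted binary tree with Fitch's algorithm, which is one standard way to do that count and is not named in the slides. The tree encoding and the toy sequences are assumptions.

```python
def fitch_score(tree, states):
    """Minimum number of changes for one site on a rooted binary tree.

    tree: nested 2-tuples whose leaves are names (strings).
    states: dict mapping each leaf name to its character at this site.
    """
    changes = 0

    def walk(node):
        nonlocal changes
        if isinstance(node, str):
            return {states[node]}
        left, right = (walk(child) for child in node)
        common = left & right
        if common:
            return common
        changes += 1          # children disagree: one change is needed
        return left | right

    walk(tree)
    return changes


def parsimony_score(tree, seqs):
    """Add up the per-site Fitch scores over aligned sequences."""
    length = len(next(iter(seqs.values())))
    return sum(fitch_score(tree, {k: s[i] for k, s in seqs.items()})
               for i in range(length))


seqs = {"A": "AAG", "B": "AAG", "C": "GAT", "D": "GAT"}  # toy data
print(parsimony_score((("A", "B"), ("C", "D")), seqs))   # 2
print(parsimony_score((("A", "C"), ("B", "D")), seqs))   # 4
```

The first tree groups the identical sequences together and needs fewer changes, so parsimony prefers it.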
More Techniques
Neighbor Joining
• Repeatedly joins pairs of leaves (or subtrees) by rules of numerical optimization
• It shrinks the distance matrix by considering two ‘neighbors’ as one node
Learning Neighbor Joining
The reason will become apparent later on, but let's learn how to do Neighbor Joining (NJ)
     A   B   C   D   E
A    0   3   3   4   3
B        0   3   3   4
C            0   3   3
D                0   3
E                    0
NJ Part 1
First start with a “star tree”
[Figure: star tree with leaves A, B, C, D, E]
NJ Part 2
Combine the closest two nodes (from distance matrix)
• In our case it is node A and B at distance 3
[Figure: tree on leaves A, B, C, D, E after joining A and B]
NJ Part 3
Repeat this until you have added n-2 internal nodes (3 in our case)
• n-2 will make it a binary tree, so we only have to include one more node.
[Figure: final binary tree on leaves A, B, C, D, E]
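The three parts above can be sketched in code. Note one difference: the slides pick the closest pair directly, whereas this sketch uses the usual textbook NJ selection rule, which picks the pair minimizing Q(i,j) = (n-2)·d(i,j) - r_i - r_j, where r_i is the row sum for i. On the example matrix both rules pick A and B first. The helper `symmetric` and the internal node names `u0`, `u1`, ... are invented for illustration.

```python
def symmetric(names, upper):
    """Build a full dict-of-dicts from the rows of an upper triangle."""
    d = {i: {i: 0.0} for i in names}
    for a, row in enumerate(upper):
        for b, value in enumerate(row, start=a + 1):
            d[names[a]][names[b]] = d[names[b]][names[a]] = float(value)
    return d


def neighbor_joining(names, dist):
    """Return the edges (node, node, length) of an unrooted NJ tree."""
    d = {i: dict(dist[i]) for i in names}   # work on a copy
    nodes = list(names)
    edges = []
    while len(nodes) > 2:
        n = len(nodes)
        r = {i: sum(d[i][k] for k in nodes) for i in nodes}
        # Choose the pair with the smallest Q value (first one wins ties).
        best = None
        for a in range(n):
            for b in range(a + 1, n):
                i, j = nodes[a], nodes[b]
                q = (n - 2) * d[i][j] - r[i] - r[j]
                if best is None or q < best[0]:
                    best = (q, i, j)
        _, i, j = best
        u = f"u{len(edges) // 2}"            # new internal node
        li = 0.5 * d[i][j] + (r[i] - r[j]) / (2 * (n - 2))
        edges += [(i, u, li), (j, u, d[i][j] - li)]
        # Shrink the matrix: i and j are replaced by their parent u.
        d[u] = {u: 0.0}
        for k in nodes:
            if k not in (i, j):
                d[u][k] = d[k][u] = 0.5 * (d[i][k] + d[j][k] - d[i][j])
        nodes = [k for k in nodes if k not in (i, j)] + [u]
    edges.append((nodes[0], nodes[1], d[nodes[0]][nodes[1]]))
    return edges


names = ["A", "B", "C", "D", "E"]
d = symmetric(names, [[3, 3, 4, 3], [3, 3, 4], [3, 3], [3]])
for edge in neighbor_joining(names, d):
    print(edge)
```

For n leaves the loop runs n-2 times, so it creates n-2 internal nodes and 2n-3 edges, which is 3 internal nodes and 7 edges for the five taxa here.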
Are we done?
ML and MP, even in heuristic form, take too long for large data sets
NJ has poor topological accuracy, especially for large diameter trees
We need something that works for large diameter trees and can be run fast.
Here’s what we want
Our Goal
An “Absolute Fast Converging” Method
A method $\Phi$ is afc if, for all positive $f$, $g$, $\varepsilon$ on the model $M$, there is a polynomial $p$ such that, for all $(T, \{\lambda(e)\})$ in the set $M_{f,g}$, on a set $S$ of $n$ sequences of length at least $p(n)$ generated on $T$, we have $\Pr[\Phi(S) = T] > 1 - \varepsilon$.
• Simply: let's recover the true tree from polynomial-length sequences, within a chosen degree of error.
A DCM* - NJ Solution
Two-phase construction of a final phylogenetic tree given a distance matrix d.
Phase 1 : Create a set of plausible trees for the distance matrix
Phase 2 : Find the best fitting tree
Phase 1
For each q in {dij}, compute a tree tq
Let T = { tq : q in {dij} }
Finding tq
Step 1: Compute Thresh(d,q)
Step 2: Triangulate Thresh(d,q)
Step 3: Compute a NJ Tree for all maximal cliques
Step 4: Merge the subtrees into a supertree
What does that mean?
Breaking the problem up
• Create a threshold of diameters to break the problem into a bunch of smaller diameter trees (cliques)
• Apply NJ to those cliques
• Merge them back
Finding tq (terms)
Threshold Graph
• Thresh(d,q) is the threshold graph where (i,j) is an edge if and only if dij <= q.
Threshold
Let's bring back our distance matrix and create a threshold with q equal to d15, the distance between A and E
So q = 67
Distance Matrix
Our old example matrix
      A    B    C    D    E
A     0   63   94  111   67
B          0   79   96   16
C               0   47   83
D                    0  100
E                         0
With q = d15 = 67
[Figure: threshold graph on A, B, C, D, E with edges A-B (63), A-E (67), B-E (16), and C-D (47)]
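The threshold graph for q = 67 can be reproduced from the example matrix with a few lines. This is a sketch; `symmetric` and `threshold_graph` are hypothetical helper names.

```python
from itertools import combinations


def symmetric(names, upper):
    """Build a full dict-of-dicts from the rows of an upper triangle."""
    d = {i: {i: 0} for i in names}
    for a, row in enumerate(upper):
        for b, value in enumerate(row, start=a + 1):
            d[names[a]][names[b]] = d[names[b]][names[a]] = value
    return d


def threshold_graph(names, d, q):
    """Adjacency sets of Thresh(d, q): i-j is an edge iff d[i][j] <= q."""
    adj = {i: set() for i in names}
    for i, j in combinations(names, 2):
        if d[i][j] <= q:
            adj[i].add(j)
            adj[j].add(i)
    return adj


names = ["A", "B", "C", "D", "E"]
d = symmetric(names, [[63, 94, 111, 67], [79, 96, 16], [47, 83], [100]])
graph = threshold_graph(names, d, d["A"]["E"])   # q = d15 = 67
for node in names:
    print(node, sorted(graph[node]))
```

Phase 1 repeats this for every distinct value q that appears in the matrix.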
Triangulating
A graph is triangulated if any cycle with four or more vertices has a chord
• That is, an edge joining two nonconsecutive vertices of the cycle.
Our example is already triangulated, but let's look at another
Triangulating
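[Figure: square on vertices W, X, Y, Z whose four sides have length 5 and whose two diagonals have lengths 10 and 15]
Let's say this is for q = 5
10 and 15 would not be in the graph
To triangulate this graph you add the edge of length 10.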
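One way to test whether a graph is triangulated is to keep removing a vertex whose neighbours are all connected to each other (a simplicial vertex); the graph is triangulated exactly when this empties it. The sketch below applies that check to the square. The slide does not say which diagonal has length 10, so W-Z is an arbitrary choice here.

```python
def is_triangulated(adj):
    """True if every cycle of four or more vertices has a chord."""
    adj = {v: set(nbrs) for v, nbrs in adj.items()}   # work on a copy
    while adj:
        # A simplicial vertex is one whose neighbours form a clique.
        simplicial = next(
            (v for v, nbrs in adj.items()
             if all(b in adj[a] for a in nbrs for b in nbrs if a != b)),
            None)
        if simplicial is None:
            return False
        for nbr in adj.pop(simplicial):
            adj[nbr].discard(simplicial)
    return True


# The square at q = 5: only the four sides are present.
square = {"W": {"X", "Y"}, "X": {"W", "Z"},
          "Y": {"W", "Z"}, "Z": {"X", "Y"}}
print(is_triangulated(square))        # False: W-X-Z-Y-W has no chord

# Add one diagonal as a chord (assumed here to be W-Z).
fixed = {v: set(n) for v, n in square.items()}
fixed["W"].add("Z")
fixed["Z"].add("W")
print(is_triangulated(fixed))         # True
```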
Maximal Cliques
A clique that cannot be enlarged by the addition of another vertex.
Recall our original threshold graph which is triangulated:
Triangulated Threshold Graph
Our old Graph
[Figure: threshold graph on A, B, C, D, E with edges A-B (63), A-E (67), B-E (16), and C-D (47)]
Clique
Our maximal cliques would be:
{A, B, E}
{C, D}
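These two cliques can be checked mechanically. The sketch below uses the basic Bron-Kerbosch recursion, a general-purpose method chosen for brevity; it is not the specialized polynomial-time procedure for triangulated graphs that the method relies on.

```python
def maximal_cliques(adj):
    """List every maximal clique of a graph given as adjacency sets."""
    cliques = []

    def expand(r, p, x):
        # r: current clique, p: candidates, x: already-explored vertices
        if not p and not x:
            cliques.append(frozenset(r))
            return
        for v in list(p):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return cliques


# Threshold graph for q = 67 from the example matrix.
graph = {"A": {"B", "E"}, "B": {"A", "E"}, "C": {"D"},
         "D": {"C"}, "E": {"A", "B"}}
for clique in maximal_cliques(graph):
    print(sorted(clique))
```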
Create Trees for the Cliques
We have two maximal cliques, so we make two trees: {A, B, E} and {C, D}
How do we make these trees?
• Remember NJ?
Tree {A, B, E} and {C,D}
[Figure: one tree on leaves A, B, E and one tree on leaves C, D]
Merge your separate trees together.
Create one Supertree
• This is done by creating a minimum set of edges in the trees and calling that the “backbone”
• This is its own doctoral thesis, so let's do a little hand waving
That sounds like NP-hard!
Computing the threshold graph is polynomial
Minimally triangulating is NP-hard, but a triangulation can be obtained in polynomial time using a greedy heuristic without too much loss in performance.
Finding maximal cliques is polynomial only if the input graph is triangulated (which it is!).
If all previous steps are done, creating a supertree can be done in polynomial time as well.
Where are we now?
We now have a finalized phylogeny for each threshold q, created from smaller trees in our matrix joined together
Remember we started from all possible sizes of smaller trees.
Phase 2
Which one is right?
Found using the SQS (Short Quartet Support) method
• Let T be a tree in S (made in Phase 1)
• Break the data into sets of four taxa, e.g. {A, B, C, D} {A, C, D, E} {A, B, D, E}… etc
• Reduce the larger tree to only hold “one set”
• These are called Quartets
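Listing the sets of four taxa is a one-liner; the sketch below enumerates them for the five example taxa. With n taxa there are n-choose-4 such sets, so the count grows like n^4.

```python
from itertools import combinations
from math import comb

taxa = ["A", "B", "C", "D", "E"]
quartets = list(combinations(taxa, 4))   # every set of four taxa

for quartet in quartets:
    print(quartet)
print(len(quartets), comb(len(taxa), 4))  # both are 5
```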
SQS - A Guide
Q(T) is the set of trees induced by T on each set of four leaves.
Let Qw (a different Q) be the set of quartets with diameter less than or equal to w
Find the maximum w for which every quartet in Qw also appears in Q(T)
This w is the “support” of that tree
SQS - Rephrased
Qw is the set of quartet trees which have a diameter <= w
Support of T is the max w where Qw is a subset of Q(T)
• Support is our “quality measure”
• What exactly are we measuring?
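A sketch of building Qw from the example distance matrix. The slides do not say how each short quartet gets its tree; this sketch assumes the four-point method, which pairs the taxa so that the sum of the two within-pair distances is smallest. All function names are hypothetical.

```python
from itertools import combinations


def symmetric(names, upper):
    """Build a full dict-of-dicts from the rows of an upper triangle."""
    d = {i: {i: 0} for i in names}
    for a, row in enumerate(upper):
        for b, value in enumerate(row, start=a + 1):
            d[names[a]][names[b]] = d[names[b]][names[a]] = value
    return d


def diameter(d, quartet):
    """Largest pairwise distance among the four taxa."""
    return max(d[i][j] for i, j in combinations(quartet, 2))


def four_point_split(d, quartet):
    """Pick the pairing xy|zw with the smallest d(x,y) + d(z,w).

    Assumption: the four-point method defines the quartet tree.
    """
    a, b, c, e = quartet
    pairings = [((a, b), (c, e)), ((a, c), (b, e)), ((a, e), (b, c))]
    return min(pairings,
               key=lambda p: d[p[0][0]][p[0][1]] + d[p[1][0]][p[1][1]])


def short_quartets(names, d, w):
    """Qw: the quartet tree of every four-taxon set with diameter <= w."""
    return {q: four_point_split(d, q)
            for q in combinations(names, 4) if diameter(d, q) <= w}


names = ["A", "B", "C", "D", "E"]
d = symmetric(names, [[63, 94, 111, 67], [79, 96, 16], [47, 83], [100]])
for quartet, split in short_quartets(names, d, 100).items():
    print(quartet, split)
```

The support of a candidate tree T would then be the largest w for which every split in Qw is also induced by T; computing Q(T) from a tree is left out of this sketch.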
[Figure: Qw shown as a set of example quartet trees on leaves drawn from A, B, C, D, E]
SQS Method
Return the tree in which the support of that tree is the maximum.
• If more than one such tree exists, return the tree found first.
• This is the tree with the smallest original diameter (remember from Phase 1)
How do we know we’re right?
Compare it to the data set we created
Look at Robinson-Foulds accuracy
• Remove one edge in the tree we’ve created. We now have two trees
• Is there any way to create the same two sets of leaves by removing one edge in the true tree from our data set?
• If no, add a ‘point’ of error.
• Repeat this for all edges
• When the value is not zero then the trees are not identical
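The edge-by-edge comparison above can be sketched by recording, for each internal edge, which leaves end up on one side when it is removed. The two example trees and every name below are hypothetical. The count follows the slide (edges of our tree with no match in the true tree); the usual Robinson-Foulds distance also counts the unmatched edges of the true tree.

```python
def tree(edges):
    """Unweighted tree as adjacency sets, built from (node, node) pairs."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def splits(adj):
    """One leaf set per internal edge: the side not containing the anchor."""
    leaves = frozenset(v for v, nbrs in adj.items() if len(nbrs) == 1)
    anchor = min(leaves)          # root the walk at a fixed leaf
    found = set()

    def leaves_below(node, parent):
        below = frozenset([node]) if node in leaves else frozenset()
        for nbr in adj[node]:
            if nbr != parent:
                below |= leaves_below(nbr, node)
        # Skip edges that only cut off a single leaf (they match trivially).
        if parent is not None and 1 < len(below) < len(leaves) - 1:
            found.add(below)
        return below

    leaves_below(anchor, None)
    return found


def rf_errors(ours, truth):
    """Number of internal edges of `ours` with no matching edge in `truth`."""
    return len(splits(ours) - splits(truth))


truth = tree([("A", "u"), ("B", "u"), ("u", "v"), ("C", "v"),
              ("v", "w"), ("D", "w"), ("E", "w")])
ours = tree([("A", "u"), ("C", "u"), ("u", "v"), ("B", "v"),
             ("v", "w"), ("D", "w"), ("E", "w")])
print(rf_errors(ours, truth))    # 1: grouping A with C is not in the truth
print(rf_errors(truth, truth))   # 0: identical trees
```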
Performance of DCM * - NJ
Outperforms NJ method at sequence lengths above 4000 and with more taxa.
[Figure: error rate (0 to 0.8) versus number of taxa (0 to 1600) for NJ and DCM-NJ]
Improvements
Improvement possibilities lie in Phase 2
Include test of Maximum Parsimony (MP)
• Try to minimize the overall size of the tree
Test using statistical evidence
• Maximum Likelihood (ML)
Performance gains
Simply changing Phase 2 has massive gains in accuracy!
DCM-NJ+MP and DCM-NJ+ML are VERY accurate for data sets greater than 4000 and are NOT NP-hard.
DCM-NJ+MP finished its analysis on a 107 taxon tree in under three minutes.
Comparing Improvements
[Figure: error rate (0 to 0.8) versus number of leaves (0 to 1600) for DCM-NJ+SQS, NJ, DCM-NJ+MP, and HGT-FP]