cs 394c: computational biology algorithms

24
CS 394C: Computational Biology Algorithms Tandy Warnow Department of Computer Sciences University of Texas at Austin

Upload: hoyt-ramirez

Post on 02-Jan-2016

30 views

Category:

Documents


3 download

DESCRIPTION

CS 394C: Computational Biology Algorithms. Tandy Warnow Department of Computer Sciences University of Texas at Austin. -3 mil yrs. AAGACTT. AAGACTT. -2 mil yrs. AAG G C C T. AAGGCCT. AAGGCCT. AAGGCCT. TGGACTT. TGGACTT. TGGACTT. TG GACTT. -1 mil yrs. A G GGC A T. AGGGCAT. AGGGCAT. - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: CS 394C: Computational Biology Algorithms

CS 394C: Computational Biology Algorithms

Tandy WarnowDepartment of Computer Sciences

University of Texas at Austin

Page 2: CS 394C: Computational Biology Algorithms

DNA Sequence Evolution

AAGACTT

TGGACTTAAGGCCT

-3 mil yrs

-2 mil yrs

-1 mil yrs

today

AGGGCAT TAGCCCT AGCACTT

AAGGCCT TGGACTT

TAGCCCA TAGACTT AGCGCTTAGCACAAAGGGCAT

AGGGCAT TAGCCCT AGCACTT

AAGACTT

TGGACTTAAGGCCT

AGGGCAT TAGCCCT AGCACTT

AAGGCCT TGGACTT

AGCGCTTAGCACAATAGACTTTAGCCCAAGGGCAT

Page 3: CS 394C: Computational Biology Algorithms

Molecular Systematics

TAGCCCA TAGACTT TGCACAA TGCGCTTAGGGCAT

U V W X Y

U

V W

X

Y

Page 4: CS 394C: Computational Biology Algorithms

Phylogeny estimation methods

• Distance-based (Neighbor joining, NQM, and others): mostly statistically consistent and polynomial time

• Maximum parsimony and maximum compatibility: NP-hard and not statistically consistent

• Maximum likelihood: NP-hard and usually statistically consistent (if solved exactly)

• Bayesian Methods: statistically consistent if run long enough

Page 5: CS 394C: Computational Biology Algorithms

Distance-based methods

• Theorem: Let (T,) be a Cavender-Farris model tree, with additive matrix [(i,j)]. Let >0 be given. The sequence length that suffices for accuracy with probability at least 1- of NJ (neighbor joining) and the Naïve Quartet Method is

O(log n e(O(max (i,j))).

Page 6: CS 394C: Computational Biology Algorithms

Neighbor joining (although statistically consistent) has poor performance on large diameter trees

[Nakhleh et al. ISMB 2001]

Simulation study based upon fixed edge lengths, K2P model of evolution, sequence lengths fixed to 1000 nucleotides.

Error rates reflect proportion of incorrect edges in inferred trees.

NJ

0 400 800 16001200No. Taxa

0

0.2

0.4

0.6

0.8

Err

or R

ate

Page 7: CS 394C: Computational Biology Algorithms

Maximum Parsimony

• Input: Set S of n aligned sequences of length k

• Output: A phylogenetic tree T– leaf-labeled by sequences in S– additional sequences of length k labeling the

internal nodes of T

such that is minimized. ∑∈ )(),(

),(TEji

jiH

Page 8: CS 394C: Computational Biology Algorithms

Maximum parsimony (example)

• Input: Four sequences– ACT– ACA– GTT– GTA

• Question: which of the three trees has the best MP scores?

Page 9: CS 394C: Computational Biology Algorithms

Maximum Parsimony

ACT

GTT ACA

GTA ACA ACT

GTAGTT

ACT

ACA

GTT

GTA

Page 10: CS 394C: Computational Biology Algorithms

Maximum Parsimony

ACT

GTT

GTT GTA

ACA

GTA

12

2

MP score = 5

ACA ACT

GTAGTT

ACA ACT

3 1 3

MP score = 7

ACT

ACA

GTT

GTAACA GTA

1 2 1

MP score = 4

Optimal MP tree

Page 11: CS 394C: Computational Biology Algorithms

Maximum Parsimony

ACT

ACA

GTT

GTAACA GTA

1 2 1

MP score = 4

Finding the optimal MP tree is NP-hard

Optimal labeling can be computed in polynomial time using Dynamic Programming

Page 12: CS 394C: Computational Biology Algorithms

Solving NP-hard problems exactly is … unlikely

• Number of (unrooted) binary trees on n leaves is (2n-5)!!

• If each tree on 1000 taxa could be analyzed in 0.001 seconds, we would find the best tree in

2890 millennia

#leaves #trees

4 3

5 15

6 105

7 945

8 10395

9 135135

10 2027025

20 2.2 x 1020

100 4.5 x 10190

1000 2.7 x 102900

Page 13: CS 394C: Computational Biology Algorithms

1. Hill-climbing heuristics (which can get stuck in local optima)2. Randomized algorithms for getting out of local optima3. Approximation algorithms for MP (based upon Steiner Tree approximation

algorithms) -- however, the approx. ratio that is needed is probably 1.01 or smaller!

Approaches for “solving” MP and ML(and other NP-hard problems in phylogeny)

Phylogenetic trees

Cost

Global optimum

Local optimum

Page 14: CS 394C: Computational Biology Algorithms

Problems with techniques for MP and ML

0

0.02

0.04

0.06

0.08

0.1

0.12

0.14

0.16

0.18

0.2

0 4 8 12 16 20 24

Hours

Average MP score above

optimal, shown as a percentage of

the optimal

Shown here is the performance of a TNT heuristic maximum parsimony analysis on a real dataset of almost 14,000 sequences. (“Optimal” here means best score to date, using any method for any amount of time.) Acceptable error is below 0.01%.

Performance of TNT with time

Page 15: CS 394C: Computational Biology Algorithms

MP and Cavender-Farris

• Consider a tree (AB,CD) with two very long branches leading to A and C, and all other branches very short.

• MP will be statistically inconsistent (and “positively misleading”) on this tree.

Page 16: CS 394C: Computational Biology Algorithms

Problems with existing phylogeny reconstruction methods

• Polynomial time methods (generally based upon distances) have poor accuracy with large diameter datasets.

• Heuristics for NP-hard optimization problems take too long (months to reach acceptable local optima).

Page 17: CS 394C: Computational Biology Algorithms

Warnow et al.: Meta-algorithms for phylogenetics

• Basic technique: determine the conditions under which a phylogeny reconstruction method does well (or poorly), and design a divide-and-conquer strategy (specific to the method) to improve its performance

• Warnow et al. developed a class of divide-and-conquer methods, collectively called DCMs (Disk-Covering Methods). These are based upon chordal graph theory to give fast decompositions and provable performance guarantees.

Page 18: CS 394C: Computational Biology Algorithms

Disk-Covering Method (DCM)

Page 19: CS 394C: Computational Biology Algorithms

Improving phylogeny reconstruction methods using DCMs

• Improving the theoretical convergence rate and performance of polynomial time distance-based methods using DCM1

• Speeding up heuristics for NP-hard optimization problems (Maximum Parsimony and Maximum Likelihood) using Rec-I-DCM3

Page 20: CS 394C: Computational Biology Algorithms

DCM1 Warnow, St. John, and Moret, SODA 2001

• A two-phase procedure which reduces the sequence length requirement of methods. The DCM phase produces a collection of trees, and the SQS phase picks the “best” tree.

• The “base method” is applied to subsets of the original dataset. When the base method is NJ, you get DCM1-NJ.

DCM SQSExponentiallyconvergingmethod

Absolute fast convergingmethod

Page 21: CS 394C: Computational Biology Algorithms

DCM1-boosting distance-based methods[Nakhleh et al. ISMB 2001]

•Theorem: DCM1-NJ converges to the true tree from polynomial length sequences

NJ

DCM1-NJ

0 400 800 16001200No. Taxa

0

0.2

0.4

0.6

0.8

Err

or R

ate

Page 22: CS 394C: Computational Biology Algorithms

Rec-I-DCM3 significantly improves performance (Roshan et al. CSB 2004)

Comparison of TNT to Rec-I-DCM3(TNT) on one large dataset.Similar improvements obtained for RAxML (maximum likelihood).

0

0.02

0.04

0.06

0.08

0.1

0.12

0.14

0.16

0.18

0.2

0 4 8 12 16 20 24

Hours

Average MP score above

optimal, shown as a percentage of

the optimal

Current best techniques

DCM boosted version of best techniques

Page 23: CS 394C: Computational Biology Algorithms

Summary (so far)

• Optimization problems in biology are almost all NP-hard, and heuristics may run for months before finding local optima.

• The challenge here is to find better heuristics, since exact solutions are very unlikely to ever be achievable on large datasets.

Page 24: CS 394C: Computational Biology Algorithms

Summary

• NP-hard optimization problems abound in phylogeny reconstruction, and in computational biology in general, and need very accurate solutions

• Many real problems have beautiful and natural combinatorial and graph-theoretic formulations