efficient inference in phylogenetic indel treesbouchard/pub/poster-nips09.pdf · known to perform...

1
Efficient Inference in Phylogenetic InDel Trees Alexandre Bouchard-Côté Michael I. Jordan Dan Klein Computer Science Division University of California at Berkeley • What is a multi-alignment? A C T A C a : A C G b : T A C C c : A C T A C a : A C G b : T A C C c : - - - - - - • What are guide trees and ancestral reconstructions? a b c d A C T A C a : A C G b : T A C C c : ? ? ? .. .. d : • Assuming we have an evolutionary InDel model, both tasks can be cast as an inference problem (High-level) graphical model: application: homology modeling application: the origins of life evolutionary parameters string-valued r.v. evolution (InDel process) A C T A C A C G G Sub Sub Sub Ins Ins Del Obtaining multi-alignments: A C G b : A C T A C a : A C T C d : T A C C c : T A C r : • Solution: always condition on the data ...ACCGTGCGTGCT... ...ACCCTGCGGTGCT... ...ACGCTGCGTGAT... ...ACGTTGCGTGAT ... ...ACGCTGTTGTGAT ... ...ACGGTGCTGTGCT.. • How should "vertical slices" be defined? Deterministic functions of a random graphical model • What is the state of the sampler? A C T A C A C G G Sub Sub Sub Ins Ins Del State augmentation: alignments } } derivation resample "thin vertical slices" a d r A C T A C r : A C G d : T A C C a : r : d : a : observed characters selected characters anchor Legend Indexed by a span on an observed word (anchor) r : d : a : anchor G A C A G A C C A G A C C • Connected component of the anchor? r : d : a : Not irreducible Non-contiguous lineages • Contiguous version? r : d : a : Not reversible !! • A correct definition: r : d : a : r : d : a : A C G b : A C T A C a : A C T C d : T A C C c : T A C r : G A C G C T A C C G A C C C A C G A C T A C A C G C T A C C G A C • Next: how to sample? Cylindric proposals: use a DP (SSR) (AR) [1] [2] [2] a b c d • Approach 1: heuristic A C T A C a : A C G b : T A C C c : ? ? ? ? ? ? d : ? ? ? ? ? ? C C A A A T C T G ... • Approach 2: naive Gibbs sampling character-valued! ...ACCGTGCGATGCT... ...ACCCTGCGGTGCT... ...ACGCTGCGGTGAT... ...ACGTTGCGGTGAT ... ...ACGCTGTTGTGAT ... ...ACGGTGCTGTGCT.. Problem I: random walk behavior A C G C T C T C C T C T A C C A C G A C T C T A C C T A C T A C C A C G A C T C T C C T A C T A C C A C G A C T C T C C T C T A C C Problem II: each step is expensive (cubic in the sequence length) [8] • Experiments 1. Reconstruction - synthetic - 4 characters types - 124010 character tokens - tree has 7 nodes - all internal nodes held-out - evaluated on root reconstruction - loss: Levenshtein distance 0 1 2 3 4 5 0 100 200 300 Time Error SSR AR 2. Multi-alignment 0 25 50 75 100 3 4 5 6 7 8 Depth CS SSR AR 70 80 90 100 3 4 5 6 7 8 Depth SP SSR AR System CS SP SSR (Handel) 0.63 0.77 AR (this paper) 0.77 0.86 - annotated protein data - source: BAliBASE - loss: Sum-of-Pairs (SP) Column-Score (CS) The taller the tree, the larger the gap: References [1] N. Galtier, N. Tourasse, and M. Gouy. A nonhyperthermophilic common ancestor to extant life forms. Science, 283:220–221, 1999. [2] I. Holmes and W. J. Bruno. Evolutionary HMM: a bayesian approach to multiple alignment. Bioinformatics, 17:803–820, 2001. [3] J. Felsenstein. Inferring Phylogenies. Sinauer Associates, 2003. [4] Z. Yang and B. Rannala. Bayesian phylogenetic inference using DNA sequences: A Markov Chain Monte Carlo Method. Mol. Biol. Evol., 1997. [5] B. Mau and M. A. Newton. Phylogenetic inference for binary data on dendrograms using markov chain monte carlo. J. of Comp. and Graphical Statistics, 1997. [6] S. Li, D. K. Pearl, and H. Doss. Phylogenetic tree construction using Markov Chain Monte Carlo. J. of the American Statistical Association, 2000. [7] W. J. Bruno, N. D. Socci, and A. L. Halpern. Weighted neighbor joining: A likelihood-based approach to distance-based phylogeny reconstruction. Mol. Biol. Evol., 17:189–197, 2000. [8] D. G. Higgins and P. M Sharp. Clustal: a package for performing multiple sequence alignment on a microcomputer. Gene, 73:237–244, 1988. [9] J. L. Thorne, H. Kishino, and J. Felsenstein. Inching toward reality: an improved likelihood model of sequence evolution. J. of Mol. Evol., 34:3–16, 1992. [10] J. L. Thorne, H. Kishino, and J. Felsenstein. An evolutionary model for maximum likelihood [10] J. L. Thorne, H. Kishino, and J. Felsenstein. An evolutionary model for maximum likelihood alignment of DNA sequences. J. of Mol. Evol., 33:114–124, 1991. [11] G. A Lunter, I. Mikl´ os, Y. S. Song, and J. Hein. An efficient algorithm for statistical multiple alignment on arbitrary phylogenetic trees. J. Comp. Biol., 10:869–889, 2003. [12] A. Bouchard-Cˆ ot´ e, P. Liang, T. L. Griffiths, and D. Klein. A probabilistic approach to language change. In EMNLP 07, 2007. [13] P. Diaconis, S. Holmes, and R. M. Neal. Analysis of a non-reversible Markov chain sampler. Technical report, Cornell University, 1997. [14] S. Needleman and C. Wunsch. A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. of Mol. Biol., 48:443–453, 1970. [15] K. St. John, T. Warnow, B. M. E. Moret, and L. Vawter. Performance study of phylogenetic methods: (unweighted) quartet methods and neighbor-joining. J. of Algorithms, 48:173–193, 2003. [16] J. Thompson, F. Plewniak, and O. Poch. Balibase: A benchmark alignments database for the evaluation of multiple sequence alignment programs. Bioinformatics, 15:87–88, 1999. [17] C. B. Do, M. S. P. Mahabhashyam, M. Brudno, and S. Batzoglou. Probcons: Probabilistic consistency-based multiple sequence alignment. Genome Research, 15:330–340, 2005. gc2_cavpo 1 .ktisktkgaprm PDVYTLPPSRD alc_mouse 1 ..tiakvtvntfp PQVHLLPPPSE 1yuh 1 .tkltvlgqpkss PSVTL FPPSSE 1mim 1 gtkleikr.tvaa PSVFIF PPSDE 1nld 1 .ttvtvssastta PSVYPL APVSG gc2_cavpo 49 asnrvvsekpeykntppieda... alc_mouse 49 lhgn....eelspesylveplkep 1yuh 49 kvdgtpvtegmettqpskqs.... 1mim 49 kvdnalqsgnsqesvteqdsk... 1nld 49 nsg..slssgvhtfpavlqs.... gc2_cavpo 92 YTCSVMHealhnhvtqkaisrsp alc_mouse 95 YSCMVGHealpmnftqktidrls 1yuh 91 YSCQVTHeg..htve.kslsra. 1mim 92 YACEVT Hqglsspvt.ksfnrg. 1nld 87 ITCNVA Hpasstkvdkkieprg. [16]

Upload: others

Post on 24-Feb-2021

1 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Efficient Inference in Phylogenetic InDel Treesbouchard/pub/Poster-nips09.pdf · known to perform better than Handel on this dataset [7, 4], they leverage more sophisticated features

Efficient Inference in Phylogenetic InDel TreesAlexandre Bouchard-Côté Michael I. Jordan Dan Klein

Computer Science Division University of California at Berkeley

• What is a multi-alignment?

AC

TA

C

a :

AC

G

b :

T ACC c :

AC T A Ca :

AC Gb :

TA C Cc : -- - -

-

-

• What are guide trees and ancestral reconstructions?

a

b

c

d

AC T A Ca :

AC Gb :

TA C Cc :

?? ? .. ..d :

• Assuming we have an evolutionary InDel model, both tasks can be cast as an inference problem

(High-level) graphical model:

application: homology modeling

application: the origins of life

evolutionary parameters

string-valued r.v.

evolution (InDel process)

AC T A C

AC G G

Sub Sub Sub

Ins Ins

Del

Obtaining multi-alignments:

AC Gb :AC T A Ca :

AC T Cd :

TA C Cc :

TA Cr :

• Solution: always condition on the data

...ACCGTGCGTGCT...

...ACCCTGCGGTGCT...

...ACGCTGCGTGAT...

...ACGTTGCGTGAT... ...ACGCTGTTGTGAT... ...ACGGTGCTGTGCT..

• How should "vertical slices" be defined?

Deterministic functions of a random graphical model

• What is the state of the sampler?

AC T A C

AC G G

Sub Sub Sub

Ins Ins

Del

State augmentation: alignments

} }derivation

resample "thin vertical slices"

a

d

r AC T A Cr :

AC Gd :

TA C Ca :

r :

d :

a :

observed charactersselected charactersanchor

Legend

Indexed by a span on an observed word (anchor)

r :

d :

a :

anchor

AC GAC T A C

AC G CTA C C

GA CC

C

AC GAC T A C

AC G CTA C C

GA CC

C

AC GAC T A C

AC G CTA C C

GA CC

C

• Connected component of the anchor?

r :

d :

a :

• Not irreducible

• Non-contiguous lineages

• Contiguous version?

r :

d :

a :

• Not reversible

!!

• A correct definition:

r :

d :

a :

r :

d :

a :

AC Gb :AC T A Ca :

AC T Cd :TA C Cc :

TA Cr :

AC GAC T A C

AC G CTA C C

GA CC

C

AC GAC T A C

AC G CTA C C

GA C

• Next: how to sample?

Cylindric proposals: use a DP

(SSR)

(AR)

[1]

[2]

[2]

a

b

c

d

• Approach 1: heuristic

AC T A Ca :

AC Gb :

TA C Cc : ?? ? ?

?

?

d : ? ?? ? ? ?

C C A A A T C T G

...

• Approach 2: naive Gibbs samplingcharacter-valued!

...ACCGTGCGATGCT...

...ACCCTGCGGTGCT...

...ACGCTGCGGTGAT...

...ACGTTGCGGTGAT... ...ACGCTGTTGTGAT... ...ACGGTGCTGTGCT..

Problem I: random walk behavior

AC G

C T CT C C

T C

TA C C AC G

AC T CTA C C

TA C

TA C C

AC G

AC T CT C C

TA C

TA C CAC G

AC T CT C C

T C

TA C C

Problem II: each step is expensive (cubic in the sequence length)

[8] • Experiments

1. Reconstruction

- synthetic- 4 characters types- 124010 character tokens- tree has 7 nodes- all internal nodes held-out- evaluated on root reconstruction- loss: Levenshtein distance

0

1

2

3

4

5

0 100 200 300

Time

Error

SSR AR

2. Multi-alignment

0

25

50

75

100

3 4 5 6 7 8

Depth

CS

SSR AR

70

80

90

100

3 4 5 6 7 8

Depth

SP

SSR AR

System CS SPSSR (Handel) 0.63 0.77AR (this paper) 0.77 0.86

Table 1: XXXXXX—clustalw

0

1

2

3

4

5

0 5000 10000 15000 20000 25000 30000

Time

Error

SSR AR

(a) Fig 1a

0

1

2

3

4

5

0 2000 4000 6000 8000 10000 12000 14000 16000 18000 20000

Time

Error

Max dev = 2 Max dev = MAX

(b) Fig 1b

Figure 1: XXXXX

System CS SPSSR (Handel) 0.63 0.77SSR+AR 0.77 0.86

Table 1: XXXXXX—clustalw

The multi-alignment system in the litterature that most resembles our lineage sampler is Handel. It models ina probabilistic fashion an InDel evolutionary derivation along a tree above the observed proteins and producesa multi-alignment as described above. The key difference with our approach is that their inference algorithmis based on SSR rather than the lineage sampling moves that we advocate in this paper.

The BAliBASE multi-alignment dataset [13] is a standard benchmark for this type of comparison and wasthe dataset used to assess the performance of Handel in its original paper [8]. While other approaches areknown to perform better than Handel on this dataset [7, 4], they leverage more sophisticated features such asaffine gap penalty and hydrophobic core modelling. While these features could be incorporated in our model,we leave this for future work since the topic of this paper is focused on inference. This way we can get ameaningful comparison with Handel.

We built evolutionary trees based on the tkfalign and weighbor [2] tools provided in the Handel package andran both Handel and our algorithm on the same sequences and evolutionary trees.

5 Future Directions

References

[1] Alexandre Bouchard, Percy Liang, Thomas L. Griffiths, and Dan Klein. A probabilistic approach tolanguage change. In EMNLP 07, 2007.

[2] William J. Bruno, Nicholas D. Socci, and Aaron L. Halpern. Weighted neighbor joining: A likelihood-based approach to distance-based phylogeny reconstruction. Mol. Biol. Evol., 17:189–197, 2000.

[3] P. Diaconis, S. Holmes, and R.M. Neal. Analysis of a non-reversible markov chain sampler. Technicalreport, Cornell University, 1997.

[4] C.B. Do, M.S.P. Mahabhashyam, M. Brudno, and S. Batzoglou. Probcons: Probabilistic consistency-based multiple sequence alignment. Genome Research, 15:330–340, 2005.

[5] J. Felsenstein. Inferring Phylogenies. Sinauer Associates, 2003.

6

0

25

50

75

100

3 4 5 6 7 870

80

90

100

3 4 5 6 7 8

Figure 4: CS and SP XXXCITE are recall metrics XXX

sequence alignment can be extracted from the infered derivation D as follow: deem the amino acids x, y ∈ Saligned iff y ∈ A0(D,x).

The state-of-the-art for multiple sequence alignment systems based on an evolutionary model is Handel [2].It is based on TKF91 and produces a multiple sequence alignment as described above. The key differencewith our approach is that their inference algorithm is based on SSR rather than the AR move that we advocatein this paper.

While other heuristic approaches are known to perform better than Handel on this dataset [7, 16], they are notbased on explicit evolutionary models. They perform better because they leverage more sophisticated featuressuch as affine gap penalty and hydrophobic core modelling. While these features can be incorporated in ourmodel, we leave this for future work since the topic of this paper is inference.

We built evolutionary trees using weighbor [17]. We ran each system for the same time on the sequencesin the ref1 directory of BAliBASE v.1. Decoding for this experiment was done by picking the sample withhighest likelihood. We report in Figure 4, left, the CS and SP Scores, the two standard metrics for this task.Both are recall measures on the subset of the alignments that were labeled, called the core blocks, see [16]for the details. For both metrics, our approach performs better.

In order to investigate where the advantage comes from, we did another multiple alignment experiment plot-ting this time performance after a fixed time as a function of depth of the trees. If the random walk argumentpresented in the introduction held, we would expect the advantage of AR over SSR to increase as the treegets taller. This prediction is confirmed as illustrated in Figure 4, middle, right. For short trees, the twoalgorithms perform equally, SSR beating AR slightly for trees with three nodes, which is not surprising sinceSSR actually performs exact inference in this tiny topology configuration. However, as trees get taller, thetask becomes more difficult, and only AR manages to maintain good performances.

5 Future Directions

TODO: integrating affine gap, hydrophobic core modelling, CRF models

References[1] N. Galtier, N. Tourasse, and M. Gouy. A nonhyperthermophilic common ancestor to extant life forms. Science,

283:220–221, 1999.[2] I. Holmes and W. J. Bruno. Evolutionary hmm: a bayesian approach to multiple alignment. Bioinformatics, 17:803–

820, 2001.

7

- annotated protein data- source: BAliBASE- loss: Sum-of-Pairs (SP) Column-Score (CS)

The taller the tree, the larger the gap:

References[1] N. Galtier, N. Tourasse, and M. Gouy. A nonhyperthermophilic common ancestor to extant

life forms. Science, 283:220–221, 1999.[2] I. Holmes and W. J. Bruno. Evolutionary HMM: a bayesian approach to multiple alignment.

Bioinformatics, 17:803–820, 2001.[3] J. Felsenstein. Inferring Phylogenies. Sinauer Associates, 2003.[4] Z. Yang and B. Rannala. Bayesian phylogenetic inference using DNA sequences: A Markov

Chain Monte Carlo Method. Mol. Biol. Evol., 1997.[5] B. Mau and M. A. Newton. Phylogenetic inference for binary data on dendrograms using

markov chain monte carlo. J. of Comp. and Graphical Statistics, 1997.[6] S. Li, D. K. Pearl, and H. Doss. Phylogenetic tree construction using Markov Chain Monte

Carlo. J. of the American Statistical Association, 2000.[7] W. J. Bruno, N. D. Socci, and A. L. Halpern. Weighted neighbor joining: A likelihood-based

approach to distance-based phylogeny reconstruction. Mol. Biol. Evol., 17:189–197, 2000.[8] D. G. Higgins and P. M Sharp. Clustal: a package for performing multiple sequence alignment

on a microcomputer. Gene, 73:237–244, 1988.[9] J. L. Thorne, H. Kishino, and J. Felsenstein. Inching toward reality: an improved likelihood

model of sequence evolution. J. of Mol. Evol., 34:3–16, 1992.[10] J. L. Thorne, H. Kishino, and J. Felsenstein. An evolutionary model for maximum likelihood

alignment of DNA sequences. J. of Mol. Evol., 33:114–124, 1991.[11] G. A Lunter, I. Miklos, Y. S. Song, and J. Hein. An efficient algorithm for statistical multiple

alignment on arbitrary phylogenetic trees. J. Comp. Biol., 10:869–889, 2003.[12] A. Bouchard-Cote, P. Liang, T. L. Griffiths, and D. Klein. A probabilistic approach to language

change. In EMNLP 07, 2007.[13] P. Diaconis, S. Holmes, and R. M. Neal. Analysis of a non-reversible Markov chain sampler.

Technical report, Cornell University, 1997.[14] S. Needleman and C. Wunsch. A general method applicable to the search for similarities in the

amino acid sequence of two proteins. J. of Mol. Biol., 48:443–453, 1970.[15] K. St. John, T. Warnow, B. M. E. Moret, and L. Vawter. Performance study of phylogenetic

methods: (unweighted) quartet methods and neighbor-joining. J. of Algorithms, 48:173–193,2003.

[16] J. Thompson, F. Plewniak, and O. Poch. Balibase: A benchmark alignments database for theevaluation of multiple sequence alignment programs. Bioinformatics, 15:87–88, 1999.

[17] C. B. Do, M. S. P. Mahabhashyam, M. Brudno, and S. Batzoglou. Probcons: Probabilisticconsistency-based multiple sequence alignment. Genome Research, 15:330–340, 2005.

8

References[1] N. Galtier, N. Tourasse, and M. Gouy. A nonhyperthermophilic common ancestor to extant

life forms. Science, 283:220–221, 1999.[2] I. Holmes and W. J. Bruno. Evolutionary HMM: a bayesian approach to multiple alignment.

Bioinformatics, 17:803–820, 2001.[3] J. Felsenstein. Inferring Phylogenies. Sinauer Associates, 2003.[4] Z. Yang and B. Rannala. Bayesian phylogenetic inference using DNA sequences: A Markov

Chain Monte Carlo Method. Mol. Biol. Evol., 1997.[5] B. Mau and M. A. Newton. Phylogenetic inference for binary data on dendrograms using

markov chain monte carlo. J. of Comp. and Graphical Statistics, 1997.[6] S. Li, D. K. Pearl, and H. Doss. Phylogenetic tree construction using Markov Chain Monte

Carlo. J. of the American Statistical Association, 2000.[7] W. J. Bruno, N. D. Socci, and A. L. Halpern. Weighted neighbor joining: A likelihood-based

approach to distance-based phylogeny reconstruction. Mol. Biol. Evol., 17:189–197, 2000.[8] D. G. Higgins and P. M Sharp. Clustal: a package for performing multiple sequence alignment

on a microcomputer. Gene, 73:237–244, 1988.[9] J. L. Thorne, H. Kishino, and J. Felsenstein. Inching toward reality: an improved likelihood

model of sequence evolution. J. of Mol. Evol., 34:3–16, 1992.[10] J. L. Thorne, H. Kishino, and J. Felsenstein. An evolutionary model for maximum likelihood

alignment of DNA sequences. J. of Mol. Evol., 33:114–124, 1991.[11] G. A Lunter, I. Miklos, Y. S. Song, and J. Hein. An efficient algorithm for statistical multiple

alignment on arbitrary phylogenetic trees. J. Comp. Biol., 10:869–889, 2003.[12] A. Bouchard-Cote, P. Liang, T. L. Griffiths, and D. Klein. A probabilistic approach to language

change. In EMNLP 07, 2007.[13] P. Diaconis, S. Holmes, and R. M. Neal. Analysis of a non-reversible Markov chain sampler.

Technical report, Cornell University, 1997.[14] S. Needleman and C. Wunsch. A general method applicable to the search for similarities in the

amino acid sequence of two proteins. J. of Mol. Biol., 48:443–453, 1970.[15] K. St. John, T. Warnow, B. M. E. Moret, and L. Vawter. Performance study of phylogenetic

methods: (unweighted) quartet methods and neighbor-joining. J. of Algorithms, 48:173–193,2003.

[16] J. Thompson, F. Plewniak, and O. Poch. Balibase: A benchmark alignments database for theevaluation of multiple sequence alignment programs. Bioinformatics, 15:87–88, 1999.

[17] C. B. Do, M. S. P. Mahabhashyam, M. Brudno, and S. Batzoglou. Probcons: Probabilisticconsistency-based multiple sequence alignment. Genome Research, 15:330–340, 2005.

8

gc2_cavpo 1 .ktisktkgaprmPDVYTLPPSRDalc_mouse 1 ..tiakvtvntfpPQVHLLPPPSE1yuh 1 .tkltvlgqpkssPSVTLFPPSSE1mim 1 gtkleikr.tvaaPSVFIFPPSDE1nld 1 .ttvtvssasttaPSVYPLAPVSG

gc2_cavpo 49 asnrvvsekpeykntppieda...alc_mouse 49 lhgn....eelspesylveplkep1yuh 49 kvdgtpvtegmettqpskqs....1mim 49 kvdnalqsgnsqesvteqdsk...1nld 49 nsg..slssgvhtfpavlqs....

gc2_cavpo 92 YTCSVMHealhnhvtqkaisrspalc_mouse 95 YSCMVGHealpmnftqktidrls1yuh 91 YSCQVTHeg..htve.kslsra.1mim 92 YACEVTHqglsspvt.ksfnrg.1nld 87 ITCNVAHpasstkvdkkieprg.

[16]