efficient inference in phylogenetic indel treesbouchard/pub/poster-nips09.pdf · known to perform...
TRANSCRIPT
Efficient Inference in Phylogenetic InDel TreesAlexandre Bouchard-Côté Michael I. Jordan Dan Klein
Computer Science Division University of California at Berkeley
• What is a multi-alignment?
AC
TA
C
a :
AC
G
b :
T ACC c :
AC T A Ca :
AC Gb :
TA C Cc : -- - -
-
-
• What are guide trees and ancestral reconstructions?
a
b
c
d
AC T A Ca :
AC Gb :
TA C Cc :
?? ? .. ..d :
• Assuming we have an evolutionary InDel model, both tasks can be cast as an inference problem
(High-level) graphical model:
application: homology modeling
application: the origins of life
evolutionary parameters
string-valued r.v.
evolution (InDel process)
AC T A C
AC G G
Sub Sub Sub
Ins Ins
Del
Obtaining multi-alignments:
AC Gb :AC T A Ca :
AC T Cd :
TA C Cc :
TA Cr :
• Solution: always condition on the data
...ACCGTGCGTGCT...
...ACCCTGCGGTGCT...
...ACGCTGCGTGAT...
...ACGTTGCGTGAT... ...ACGCTGTTGTGAT... ...ACGGTGCTGTGCT..
• How should "vertical slices" be defined?
Deterministic functions of a random graphical model
• What is the state of the sampler?
AC T A C
AC G G
Sub Sub Sub
Ins Ins
Del
State augmentation: alignments
} }derivation
resample "thin vertical slices"
a
d
r AC T A Cr :
AC Gd :
TA C Ca :
r :
d :
a :
observed charactersselected charactersanchor
Legend
Indexed by a span on an observed word (anchor)
r :
d :
a :
anchor
AC GAC T A C
AC G CTA C C
GA CC
C
AC GAC T A C
AC G CTA C C
GA CC
C
AC GAC T A C
AC G CTA C C
GA CC
C
• Connected component of the anchor?
r :
d :
a :
• Not irreducible
• Non-contiguous lineages
• Contiguous version?
r :
d :
a :
• Not reversible
!!
• A correct definition:
r :
d :
a :
r :
d :
a :
AC Gb :AC T A Ca :
AC T Cd :TA C Cc :
TA Cr :
AC GAC T A C
AC G CTA C C
GA CC
C
AC GAC T A C
AC G CTA C C
GA C
• Next: how to sample?
Cylindric proposals: use a DP
(SSR)
(AR)
[1]
[2]
[2]
a
b
c
d
• Approach 1: heuristic
AC T A Ca :
AC Gb :
TA C Cc : ?? ? ?
?
?
d : ? ?? ? ? ?
C C A A A T C T G
...
• Approach 2: naive Gibbs samplingcharacter-valued!
...ACCGTGCGATGCT...
...ACCCTGCGGTGCT...
...ACGCTGCGGTGAT...
...ACGTTGCGGTGAT... ...ACGCTGTTGTGAT... ...ACGGTGCTGTGCT..
Problem I: random walk behavior
AC G
C T CT C C
T C
TA C C AC G
AC T CTA C C
TA C
TA C C
AC G
AC T CT C C
TA C
TA C CAC G
AC T CT C C
T C
TA C C
Problem II: each step is expensive (cubic in the sequence length)
[8] • Experiments
1. Reconstruction
- synthetic- 4 characters types- 124010 character tokens- tree has 7 nodes- all internal nodes held-out- evaluated on root reconstruction- loss: Levenshtein distance
0
1
2
3
4
5
0 100 200 300
Time
Error
SSR AR
2. Multi-alignment
0
25
50
75
100
3 4 5 6 7 8
Depth
CS
SSR AR
70
80
90
100
3 4 5 6 7 8
Depth
SP
SSR AR
System CS SPSSR (Handel) 0.63 0.77AR (this paper) 0.77 0.86
Table 1: XXXXXX—clustalw
0
1
2
3
4
5
0 5000 10000 15000 20000 25000 30000
Time
Error
SSR AR
(a) Fig 1a
0
1
2
3
4
5
0 2000 4000 6000 8000 10000 12000 14000 16000 18000 20000
Time
Error
Max dev = 2 Max dev = MAX
(b) Fig 1b
Figure 1: XXXXX
System CS SPSSR (Handel) 0.63 0.77SSR+AR 0.77 0.86
Table 1: XXXXXX—clustalw
The multi-alignment system in the litterature that most resembles our lineage sampler is Handel. It models ina probabilistic fashion an InDel evolutionary derivation along a tree above the observed proteins and producesa multi-alignment as described above. The key difference with our approach is that their inference algorithmis based on SSR rather than the lineage sampling moves that we advocate in this paper.
The BAliBASE multi-alignment dataset [13] is a standard benchmark for this type of comparison and wasthe dataset used to assess the performance of Handel in its original paper [8]. While other approaches areknown to perform better than Handel on this dataset [7, 4], they leverage more sophisticated features such asaffine gap penalty and hydrophobic core modelling. While these features could be incorporated in our model,we leave this for future work since the topic of this paper is focused on inference. This way we can get ameaningful comparison with Handel.
We built evolutionary trees based on the tkfalign and weighbor [2] tools provided in the Handel package andran both Handel and our algorithm on the same sequences and evolutionary trees.
5 Future Directions
References
[1] Alexandre Bouchard, Percy Liang, Thomas L. Griffiths, and Dan Klein. A probabilistic approach tolanguage change. In EMNLP 07, 2007.
[2] William J. Bruno, Nicholas D. Socci, and Aaron L. Halpern. Weighted neighbor joining: A likelihood-based approach to distance-based phylogeny reconstruction. Mol. Biol. Evol., 17:189–197, 2000.
[3] P. Diaconis, S. Holmes, and R.M. Neal. Analysis of a non-reversible markov chain sampler. Technicalreport, Cornell University, 1997.
[4] C.B. Do, M.S.P. Mahabhashyam, M. Brudno, and S. Batzoglou. Probcons: Probabilistic consistency-based multiple sequence alignment. Genome Research, 15:330–340, 2005.
[5] J. Felsenstein. Inferring Phylogenies. Sinauer Associates, 2003.
6
0
25
50
75
100
3 4 5 6 7 870
80
90
100
3 4 5 6 7 8
Figure 4: CS and SP XXXCITE are recall metrics XXX
sequence alignment can be extracted from the infered derivation D as follow: deem the amino acids x, y ∈ Saligned iff y ∈ A0(D,x).
The state-of-the-art for multiple sequence alignment systems based on an evolutionary model is Handel [2].It is based on TKF91 and produces a multiple sequence alignment as described above. The key differencewith our approach is that their inference algorithm is based on SSR rather than the AR move that we advocatein this paper.
While other heuristic approaches are known to perform better than Handel on this dataset [7, 16], they are notbased on explicit evolutionary models. They perform better because they leverage more sophisticated featuressuch as affine gap penalty and hydrophobic core modelling. While these features can be incorporated in ourmodel, we leave this for future work since the topic of this paper is inference.
We built evolutionary trees using weighbor [17]. We ran each system for the same time on the sequencesin the ref1 directory of BAliBASE v.1. Decoding for this experiment was done by picking the sample withhighest likelihood. We report in Figure 4, left, the CS and SP Scores, the two standard metrics for this task.Both are recall measures on the subset of the alignments that were labeled, called the core blocks, see [16]for the details. For both metrics, our approach performs better.
In order to investigate where the advantage comes from, we did another multiple alignment experiment plot-ting this time performance after a fixed time as a function of depth of the trees. If the random walk argumentpresented in the introduction held, we would expect the advantage of AR over SSR to increase as the treegets taller. This prediction is confirmed as illustrated in Figure 4, middle, right. For short trees, the twoalgorithms perform equally, SSR beating AR slightly for trees with three nodes, which is not surprising sinceSSR actually performs exact inference in this tiny topology configuration. However, as trees get taller, thetask becomes more difficult, and only AR manages to maintain good performances.
5 Future Directions
TODO: integrating affine gap, hydrophobic core modelling, CRF models
References[1] N. Galtier, N. Tourasse, and M. Gouy. A nonhyperthermophilic common ancestor to extant life forms. Science,
283:220–221, 1999.[2] I. Holmes and W. J. Bruno. Evolutionary hmm: a bayesian approach to multiple alignment. Bioinformatics, 17:803–
820, 2001.
7
- annotated protein data- source: BAliBASE- loss: Sum-of-Pairs (SP) Column-Score (CS)
The taller the tree, the larger the gap:
References[1] N. Galtier, N. Tourasse, and M. Gouy. A nonhyperthermophilic common ancestor to extant
life forms. Science, 283:220–221, 1999.[2] I. Holmes and W. J. Bruno. Evolutionary HMM: a bayesian approach to multiple alignment.
Bioinformatics, 17:803–820, 2001.[3] J. Felsenstein. Inferring Phylogenies. Sinauer Associates, 2003.[4] Z. Yang and B. Rannala. Bayesian phylogenetic inference using DNA sequences: A Markov
Chain Monte Carlo Method. Mol. Biol. Evol., 1997.[5] B. Mau and M. A. Newton. Phylogenetic inference for binary data on dendrograms using
markov chain monte carlo. J. of Comp. and Graphical Statistics, 1997.[6] S. Li, D. K. Pearl, and H. Doss. Phylogenetic tree construction using Markov Chain Monte
Carlo. J. of the American Statistical Association, 2000.[7] W. J. Bruno, N. D. Socci, and A. L. Halpern. Weighted neighbor joining: A likelihood-based
approach to distance-based phylogeny reconstruction. Mol. Biol. Evol., 17:189–197, 2000.[8] D. G. Higgins and P. M Sharp. Clustal: a package for performing multiple sequence alignment
on a microcomputer. Gene, 73:237–244, 1988.[9] J. L. Thorne, H. Kishino, and J. Felsenstein. Inching toward reality: an improved likelihood
model of sequence evolution. J. of Mol. Evol., 34:3–16, 1992.[10] J. L. Thorne, H. Kishino, and J. Felsenstein. An evolutionary model for maximum likelihood
alignment of DNA sequences. J. of Mol. Evol., 33:114–124, 1991.[11] G. A Lunter, I. Miklos, Y. S. Song, and J. Hein. An efficient algorithm for statistical multiple
alignment on arbitrary phylogenetic trees. J. Comp. Biol., 10:869–889, 2003.[12] A. Bouchard-Cote, P. Liang, T. L. Griffiths, and D. Klein. A probabilistic approach to language
change. In EMNLP 07, 2007.[13] P. Diaconis, S. Holmes, and R. M. Neal. Analysis of a non-reversible Markov chain sampler.
Technical report, Cornell University, 1997.[14] S. Needleman and C. Wunsch. A general method applicable to the search for similarities in the
amino acid sequence of two proteins. J. of Mol. Biol., 48:443–453, 1970.[15] K. St. John, T. Warnow, B. M. E. Moret, and L. Vawter. Performance study of phylogenetic
methods: (unweighted) quartet methods and neighbor-joining. J. of Algorithms, 48:173–193,2003.
[16] J. Thompson, F. Plewniak, and O. Poch. Balibase: A benchmark alignments database for theevaluation of multiple sequence alignment programs. Bioinformatics, 15:87–88, 1999.
[17] C. B. Do, M. S. P. Mahabhashyam, M. Brudno, and S. Batzoglou. Probcons: Probabilisticconsistency-based multiple sequence alignment. Genome Research, 15:330–340, 2005.
8
References[1] N. Galtier, N. Tourasse, and M. Gouy. A nonhyperthermophilic common ancestor to extant
life forms. Science, 283:220–221, 1999.[2] I. Holmes and W. J. Bruno. Evolutionary HMM: a bayesian approach to multiple alignment.
Bioinformatics, 17:803–820, 2001.[3] J. Felsenstein. Inferring Phylogenies. Sinauer Associates, 2003.[4] Z. Yang and B. Rannala. Bayesian phylogenetic inference using DNA sequences: A Markov
Chain Monte Carlo Method. Mol. Biol. Evol., 1997.[5] B. Mau and M. A. Newton. Phylogenetic inference for binary data on dendrograms using
markov chain monte carlo. J. of Comp. and Graphical Statistics, 1997.[6] S. Li, D. K. Pearl, and H. Doss. Phylogenetic tree construction using Markov Chain Monte
Carlo. J. of the American Statistical Association, 2000.[7] W. J. Bruno, N. D. Socci, and A. L. Halpern. Weighted neighbor joining: A likelihood-based
approach to distance-based phylogeny reconstruction. Mol. Biol. Evol., 17:189–197, 2000.[8] D. G. Higgins and P. M Sharp. Clustal: a package for performing multiple sequence alignment
on a microcomputer. Gene, 73:237–244, 1988.[9] J. L. Thorne, H. Kishino, and J. Felsenstein. Inching toward reality: an improved likelihood
model of sequence evolution. J. of Mol. Evol., 34:3–16, 1992.[10] J. L. Thorne, H. Kishino, and J. Felsenstein. An evolutionary model for maximum likelihood
alignment of DNA sequences. J. of Mol. Evol., 33:114–124, 1991.[11] G. A Lunter, I. Miklos, Y. S. Song, and J. Hein. An efficient algorithm for statistical multiple
alignment on arbitrary phylogenetic trees. J. Comp. Biol., 10:869–889, 2003.[12] A. Bouchard-Cote, P. Liang, T. L. Griffiths, and D. Klein. A probabilistic approach to language
change. In EMNLP 07, 2007.[13] P. Diaconis, S. Holmes, and R. M. Neal. Analysis of a non-reversible Markov chain sampler.
Technical report, Cornell University, 1997.[14] S. Needleman and C. Wunsch. A general method applicable to the search for similarities in the
amino acid sequence of two proteins. J. of Mol. Biol., 48:443–453, 1970.[15] K. St. John, T. Warnow, B. M. E. Moret, and L. Vawter. Performance study of phylogenetic
methods: (unweighted) quartet methods and neighbor-joining. J. of Algorithms, 48:173–193,2003.
[16] J. Thompson, F. Plewniak, and O. Poch. Balibase: A benchmark alignments database for theevaluation of multiple sequence alignment programs. Bioinformatics, 15:87–88, 1999.
[17] C. B. Do, M. S. P. Mahabhashyam, M. Brudno, and S. Batzoglou. Probcons: Probabilisticconsistency-based multiple sequence alignment. Genome Research, 15:330–340, 2005.
8
gc2_cavpo 1 .ktisktkgaprmPDVYTLPPSRDalc_mouse 1 ..tiakvtvntfpPQVHLLPPPSE1yuh 1 .tkltvlgqpkssPSVTLFPPSSE1mim 1 gtkleikr.tvaaPSVFIFPPSDE1nld 1 .ttvtvssasttaPSVYPLAPVSG
gc2_cavpo 49 asnrvvsekpeykntppieda...alc_mouse 49 lhgn....eelspesylveplkep1yuh 49 kvdgtpvtegmettqpskqs....1mim 49 kvdnalqsgnsqesvteqdsk...1nld 49 nsg..slssgvhtfpavlqs....
gc2_cavpo 92 YTCSVMHealhnhvtqkaisrspalc_mouse 95 YSCMVGHealpmnftqktidrls1yuh 91 YSCQVTHeg..htve.kslsra.1mim 92 YACEVTHqglsspvt.ksfnrg.1nld 87 ITCNVAHpasstkvdkkieprg.
[16]