![Page 1: Computing Triplet and Quartet Distances Between Trees Gerth Stølting Brodal, Morten Kragelund Holt, Jens Johansen Aarhus University Rolf Fagerberg University](https://reader036.vdocuments.net/reader036/viewer/2022062421/56649d015503460f949d38b6/html5/thumbnails/1.jpg)
Computing Triplet and Quartet Distances Between Trees
Gerth Stølting Brodal, Morten Kragelund Holt, Jens Johansen (Aarhus University)
Rolf Fagerberg (University of Southern Denmark)
Thomas Mailund, Christian N. S. Pedersen, Andreas Sand (Aarhus University, Bioinformatics Research Center)
Computer Science Institute, Charles University, Prague, Czech Republic, 18 September 2014
Work presented at SODA 2013 and ALENEX 2014
![Page 2: Computing Triplet and Quartet Distances Between Trees Gerth Stølting Brodal, Morten Kragelund Holt, Jens Johansen Aarhus University Rolf Fagerberg University](https://reader036.vdocuments.net/reader036/viewer/2022062421/56649d015503460f949d38b6/html5/thumbnails/2.jpg)
Outline
Evolutionary trees – rooted vs. unrooted, binary vs. arbitrary degree
Tree distances – Robinson-Foulds, triplet, quartet
Results and previous work – triplet, quartet distances
Algorithms – triplet (quartet)
Experimental results (ALENEX 2014)
![Page 3: Computing Triplet and Quartet Distances Between Trees Gerth Stølting Brodal, Morten Kragelund Holt, Jens Johansen Aarhus University Rolf Fagerberg University](https://reader036.vdocuments.net/reader036/viewer/2022062421/56649d015503460f949d38b6/html5/thumbnails/3.jpg)
Evolutionary Tree
[Figure: rooted evolutionary tree with leaves Bonobo, Chimpanzee, Human, Neanderthal, Gorilla and Orangutan; the vertical axis is time]
![Page 4: Computing Triplet and Quartet Distances Between Trees Gerth Stølting Brodal, Morten Kragelund Holt, Jens Johansen Aarhus University Rolf Fagerberg University](https://reader036.vdocuments.net/reader036/viewer/2022062421/56649d015503460f949d38b6/html5/thumbnails/4.jpg)
Unrooted Evolutionary Tree
The dominant modern approach to studying evolution is DNA analysis
![Page 5: Computing Triplet and Quartet Distances Between Trees Gerth Stølting Brodal, Morten Kragelund Holt, Jens Johansen Aarhus University Rolf Fagerberg University](https://reader036.vdocuments.net/reader036/viewer/2022062421/56649d015503460f949d38b6/html5/thumbnails/5.jpg)
Constructing Evolutionary Trees – Binary or Arbitrary Degrees?
Sequence data → distance matrix (n × n) → tree
Neighbor Joining (Saitou, Nei 1987) [ $O(n^3)$ Saitou, Nei 1987 ]: binary trees (despite no evidence in distance data)
Refined Buneman Trees (Moulton, Steel 1999) [ $O(n^3)$ Brodal et al. 2003 ]: arbitrary degree (compromise; good support for all edges)
Buneman Trees (Buneman 1971) [ $O(n^3)$ Berry, Bryant 1999 ]: arbitrary degrees (strong support for all edges; few branches)
![Page 6: Computing Triplet and Quartet Distances Between Trees Gerth Stølting Brodal, Morten Kragelund Holt, Jens Johansen Aarhus University Rolf Fagerberg University](https://reader036.vdocuments.net/reader036/viewer/2022062421/56649d015503460f949d38b6/html5/thumbnails/6.jpg)
Data Analysis vs. Expert Trees – Binary vs. Arbitrary Degrees?
Linguistic expert classification (Aryon Rodrigues)
Neighbor Joining on linguistic data
Cultural Phylogenetics of the Tupi Language Family in Lowland South America. R. S. Walker, S. Wichmann, T. Mailund, C. J. Atkisson. PLoS One. 7(4), 2012.
![Page 7: Computing Triplet and Quartet Distances Between Trees Gerth Stølting Brodal, Morten Kragelund Holt, Jens Johansen Aarhus University Rolf Fagerberg University](https://reader036.vdocuments.net/reader036/viewer/2022062421/56649d015503460f949d38b6/html5/thumbnails/7.jpg)
Evolutionary Tree Comparison
[Figure: unrooted trees T1 and T2 on leaves 1–8; one marked edge induces the split 1357|2468]
Common splits: 1357|2468, 13567|248
Only T1: 35|124678, 48|123567
Only T2: 57|123468
Robinson-Foulds distance = # non-common splits = 2 + 1 = 3
[Day 1985] O(n) time algorithm using 2 x DFS + radix sort
D. F. Robinson and L. R. Foulds. Comparison of weighted labeled trees. In Combinatorial mathematics, VI, Lecture Notes in Mathematics, pages 119–126. Springer, 1979.
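The split counts on this slide can be reproduced with a short sketch. It compares the non-trivial splits listed above as sets; the helper names `parse_split` and `robinson_foulds` are made up for illustration, and this is a plain set comparison, not Day's O(n) algorithm.

```python
def parse_split(text):
    """Turn '35|124678' into a canonical split: the side that does not contain leaf 1."""
    left, right = (frozenset(side) for side in text.split("|"))
    return right if "1" in left else left


def robinson_foulds(splits_a, splits_b):
    """Number of splits that occur in exactly one of the two trees."""
    a = {parse_split(s) for s in splits_a}
    b = {parse_split(s) for s in splits_b}
    return len(a ^ b)


# Non-trivial splits of T1 and T2 as listed on the slide (single-digit leaf labels).
t1 = ["1357|2468", "13567|248", "35|124678", "48|123567"]
t2 = ["1357|2468", "13567|248", "57|123468"]
print(robinson_foulds(t1, t2))  # 3
```

Canonicalizing each split to the side without leaf 1 makes the comparison independent of how a split is written.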
![Page 8: Computing Triplet and Quartet Distances Between Trees Gerth Stølting Brodal, Morten Kragelund Holt, Jens Johansen Aarhus University Rolf Fagerberg University](https://reader036.vdocuments.net/reader036/viewer/2022062421/56649d015503460f949d38b6/html5/thumbnails/8.jpg)
Robinson-Foulds Distance (unrooted trees)
[Figure: unrooted trees T1 and T2 on leaves 1–8 that differ only in the position of leaf 8]
Common splits: (none)
Only T1: 12567|348, 1257|3468, 157|23468, 57|123468
Only T2: 125678|34, 12578|346, 1578|2346, 578|12346, 78|123456
RF-dist(T1, T2) = 4 + 5 = 9
RF-dist(T1\{8}, T2\{8}) = 0
Robinson-Foulds very sensitive to outliers
D. F. Robinson and L. R. Foulds. Comparison of weighted labeled trees. In Combinatorial mathematics, VI, Lecture Notes in Mathematics, pages 119–126. Springer, 1979.
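The outlier effect can be checked numerically. The sketch below takes the nine splits listed on this slide, optionally deletes a set of leaves from both trees, discards splits that become trivial, and counts the non-common ones. The function names `restrict` and `rf` are illustrative assumptions, not part of any tool mentioned in the talk.

```python
def restrict(splits, removed):
    """Delete the `removed` leaves from every split and drop splits that become trivial."""
    result = set()
    for text in splits:
        left, right = (frozenset(side) - removed for side in text.split("|"))
        if len(left) >= 2 and len(right) >= 2:
            # A split is an unordered pair of sides.
            result.add(frozenset((left, right)))
    return result


def rf(splits_a, splits_b, removed=frozenset()):
    """Robinson-Foulds distance after deleting the `removed` leaves from both trees."""
    return len(restrict(splits_a, removed) ^ restrict(splits_b, removed))


# Splits as listed on the slide (single-digit leaf labels).
t1 = ["12567|348", "1257|3468", "157|23468", "57|123468"]
t2 = ["125678|34", "12578|346", "1578|2346", "578|12346", "78|123456"]
print(rf(t1, t2))                  # 9
print(rf(t1, t2, frozenset("8")))  # 0
```

Moving the single leaf 8 changes every internal split, which is why the distance jumps from 0 to 9.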
![Page 9: Computing Triplet and Quartet Distances Between Trees Gerth Stølting Brodal, Morten Kragelund Holt, Jens Johansen Aarhus University Rolf Fagerberg University](https://reader036.vdocuments.net/reader036/viewer/2022062421/56649d015503460f949d38b6/html5/thumbnails/9.jpg)
Quartet Distance (unrooted trees)
Consider all quartets, i.e. topologies of subsets of 4 leaves {i,j,k,l}
resolved: ij|kl
unresolved: ijkl (only non-binary trees)
[Figure: unrooted trees T1 and T2 on leaves 1–5]
Quartet {1,2,3,4}: T1 = 14|23, T2 = 14|23
Quartet {1,2,3,5}: T1 = 13|25, T2 = 15|23
Quartet {1,2,4,5}: T1 = 14|25, T2 = 1245
Quartet {1,3,4,5}: T1 = 14|35, T2 = 1345
Quartet {2,3,4,5}: T1 = 25|34, T2 = 23|45
Quartet-dist(T1, T2) = $\binom{n}{4}$ − # common quartets = 5 − 1 = 4
G. Estabrook, F. McMorris, and C. Meacham. Comparison of undirected phylogenetic trees based on subtrees of four evolutionary units. Systematic Zoology, 34:193-200, 1985.
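A brute-force check of the quartet example is sketched below. It assumes the splits implied by the table on this slide (T1 has the internal splits 14|235 and 25|134, T2 only 23|145) and represents each split by one of its sides. The function names are hypothetical, and the enumeration takes time proportional to the number of quartets, unlike the algorithms discussed later.

```python
from itertools import combinations


def quartet_topology(splits, quartet):
    """Return the pairing {{i, j}, {k, l}} induced by some split, or None if unresolved."""
    for side in splits:
        inside = quartet & side
        if len(inside) == 2:
            return frozenset((inside, quartet - inside))
    return None


def quartet_distance(leaves, splits_a, splits_b):
    """Count the quartets whose topology differs between the two trees."""
    return sum(
        quartet_topology(splits_a, frozenset(q)) != quartet_topology(splits_b, frozenset(q))
        for q in combinations(sorted(leaves), 4)
    )


leaves = {1, 2, 3, 4, 5}
t1 = [frozenset({1, 4}), frozenset({2, 5})]  # assumed from the T1 column of the table
t2 = [frozenset({2, 3})]                     # assumed from the T2 column of the table
print(quartet_distance(leaves, t1, t2))  # 4
```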
![Page 10: Computing Triplet and Quartet Distances Between Trees Gerth Stølting Brodal, Morten Kragelund Holt, Jens Johansen Aarhus University Rolf Fagerberg University](https://reader036.vdocuments.net/reader036/viewer/2022062421/56649d015503460f949d38b6/html5/thumbnails/10.jpg)
Triplet Distance (rooted trees)
Consider all triplets, i.e. topologies of subsets of 3 leaves {i,j,k}
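resolved: k|ij
unresolved: ijk (only non-binary trees)
[Figure: rooted trees T1 and T2 on leaves 1–5]
Triplet {1,2,3}: T1 = 2|13, T2 = 2|13
Triplet {1,2,4}: T1 = 1|24, T2 = 4|12
Triplet {1,2,5}: T1 = 1|25, T2 = 5|12
Triplet {1,3,4}: T1 = 4|13, T2 = 4|13
Triplet {1,3,5}: T1 = 5|13, T2 = 5|13
Triplet {1,4,5}: T1 = 1|45, T2 = 1|45
Triplet {2,3,4}: T1 = 3|24, T2 = 4|23
Triplet {2,3,5}: T1 = 3|25, T2 = 5|23
Triplet {2,4,5}: T1 = 5|24, T2 = 2|45
Triplet {3,4,5}: T1 = 3|45, T2 = 3|45
Triplet-dist(T1, T2) = $\binom{n}{3}$ − # common triplets = 10 − 5 = 5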
D. E. Critchlow, D. K. Pearl, C. L. Qian: The triples distance for rooted bifurcating phylogenetic trees. Systematic Biology, 45(3):323-334, 1996.
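The triplet example can be verified by brute force. The sketch assumes rooted trees written as nested tuples; the two trees below are reconstructed from the triplet table on this slide, so their exact shape is an assumption. A triplet is resolved as k|ij when the lowest common ancestor of i and j is strictly deeper than the other two pairwise ancestors.

```python
from itertools import combinations


def leaf_paths(tree, prefix=(), paths=None):
    """Map every leaf to the sequence of child indices leading to it from the root."""
    paths = {} if paths is None else paths
    if isinstance(tree, tuple):
        for index, child in enumerate(tree):
            leaf_paths(child, prefix + (index,), paths)
    else:
        paths[tree] = prefix
    return paths


def lca_depth(p, q):
    """Depth of the lowest common ancestor = length of the common path prefix."""
    depth = 0
    for a, b in zip(p, q):
        if a != b:
            break
        depth += 1
    return depth


def triplet_topology(paths, triple):
    """Return the pair {i, j} of the resolved triplet k|ij, or None if unresolved."""
    depths = {frozenset(pair): lca_depth(paths[pair[0]], paths[pair[1]])
              for pair in combinations(triple, 2)}
    deepest = max(depths.values())
    closest = [pair for pair, d in depths.items() if d == deepest]
    return closest[0] if len(closest) == 1 else None


def triplet_distance(tree_a, tree_b):
    """Count the triplets whose topology differs between the two trees."""
    pa, pb = leaf_paths(tree_a), leaf_paths(tree_b)
    return sum(triplet_topology(pa, t) != triplet_topology(pb, t)
               for t in combinations(sorted(pa), 3))


t1 = ((1, 3), ((2, 4), 5))   # assumed shape, consistent with the T1 column
t2 = (((1, 3), 2), (4, 5))   # assumed shape, consistent with the T2 column
print(triplet_distance(t1, t2))  # 5
```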
![Page 11: Computing Triplet and Quartet Distances Between Trees Gerth Stølting Brodal, Morten Kragelund Holt, Jens Johansen Aarhus University Rolf Fagerberg University](https://reader036.vdocuments.net/reader036/viewer/2022062421/56649d015503460f949d38b6/html5/thumbnails/11.jpg)
Computational Results
Rooted triplet distance, binary trees: $O(n^2)$ CPQ 1996; $O(n \log^2 n)$ SBFPM 2013; $O(n \log n)$ [SODA 2013]
Unrooted quartet distance, binary trees: $O(n^3)$ D 1985; $O(n^2)$ BTKL 2000; $O(n \log^2 n)$ BFP 2001; $O(n \log n)$ BFP 2003
Rooted triplet distance, arbitrary degrees: $O(n^2)$ BDF 2011; $O(n \log n)$ [SODA 2013]
Unrooted quartet distance, arbitrary degrees: $O(d^9 n \log n)$ SPMBF 2007; $O(n^{2.688})$ NKMP 2011; $O(d\,n \log n)$ [SODA 2013][ALENEX 2014]
[Figure: example rooted and unrooted trees, binary and of arbitrary degree]
![Page 12: Computing Triplet and Quartet Distances Between Trees Gerth Stølting Brodal, Morten Kragelund Holt, Jens Johansen Aarhus University Rolf Fagerberg University](https://reader036.vdocuments.net/reader036/viewer/2022062421/56649d015503460f949d38b6/html5/thumbnails/12.jpg)
Distance Computation
Classification of each triplet by its state in T1 and in T2:
T1 resolved, T2 resolved: A (agree) or B (disagree)
T1 resolved, T2 unresolved: C
T1 unresolved, T2 resolved: D
T1 unresolved, T2 unresolved: E
Triplet-dist(T1, T2) = B + C + D = $\binom{n}{3}$ − A − E
Sufficient to compute A and E. D + E and C + E count the triplets unresolved in one tree. (For binary trees C, D and E are all zero.)
Running times shown on the slide: O(n) triplets & quartets; O(n·log n) triplets & quartets; O(n·log n) triplets; O(d·n·log n) quartets
![Page 13: Computing Triplet and Quartet Distances Between Trees Gerth Stølting Brodal, Morten Kragelund Holt, Jens Johansen Aarhus University Rolf Fagerberg University](https://reader036.vdocuments.net/reader036/viewer/2022062421/56649d015503460f949d38b6/html5/thumbnails/13.jpg)
Parameterized Triplet & Quartet Distances
B + α·(C + D), 0 ≤ α ≤ 1
BDF 2011 $O(n^2)$ for triplet, NKMP 2011 $O(n^{2.688})$ for quartet
[SODA 2013/ALENEX 2014] O(n·log n) and O(d·n·log n), respectively
Classification of each triplet by its state in T1 and in T2:
T1 resolved, T2 resolved: A (agree) or B (disagree)
T1 resolved, T2 unresolved: C
T1 unresolved, T2 resolved: D
T1 unresolved, T2 unresolved: E
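The five classes A to E and the parameterized distance can be illustrated on a small made-up example. The sketch below classifies every triplet of two rooted trees given as nested tuples; the trees, the function names and the choice α = 0.5 are assumptions for illustration only, and the enumeration is brute force rather than the O(n·log n) algorithm.

```python
from collections import Counter
from itertools import combinations
from math import comb


def leaves(tree):
    """Set of leaf labels below a nested-tuple tree."""
    return {tree} if not isinstance(tree, tuple) else set().union(*map(leaves, tree))


def topology(tree, triple):
    """Pair {i, j} of the resolved triplet k|ij in `tree`, or None if unresolved."""
    for child in tree:
        inside = leaves(child) & triple
        if len(inside) == 3:
            return topology(child, triple)  # all three leaves sit below this child
        if len(inside) == 2:
            return frozenset(inside)        # two leaves split off from the third
    return None                             # three different children: unresolved


def classify(tree_a, tree_b):
    """Count the triplets in each of the classes A, B, C, D and E."""
    counts = Counter()
    for t in combinations(sorted(leaves(tree_a)), 3):
        ta, tb = topology(tree_a, frozenset(t)), topology(tree_b, frozenset(t))
        if ta is None and tb is None:
            counts["E"] += 1
        elif ta is None:
            counts["D"] += 1
        elif tb is None:
            counts["C"] += 1
        else:
            counts["A" if ta == tb else "B"] += 1
    return counts


def parameterized_distance(counts, alpha):
    """B + alpha * (C + D) for 0 <= alpha <= 1."""
    return counts["B"] + alpha * (counts["C"] + counts["D"])


t1 = ((1, 2, 3), 4, 5)       # hypothetical non-binary trees
t2 = (((1, 2), 4), 3, 5)
c = classify(t1, t2)
print(dict(c), parameterized_distance(c, 0.5))
print(c["B"] + c["C"] + c["D"] == comb(5, 3) - c["A"] - c["E"])  # True
```

With α = 1 the parameterized distance equals B + C + D, and with α = 0 only disagreeing resolved triplets are counted.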
![Page 14: Computing Triplet and Quartet Distances Between Trees Gerth Stølting Brodal, Morten Kragelund Holt, Jens Johansen Aarhus University Rolf Fagerberg University](https://reader036.vdocuments.net/reader036/viewer/2022062421/56649d015503460f949d38b6/html5/thumbnails/14.jpg)
Counting Unresolved Triplets in One Tree
[Figure: node v whose d child subtrees contain $n_1, n_2, n_3, \dots, n_d$ leaves]
Triplets anchored at v:
$$\sum_{v} \sum_{i<j<k} n_i \cdot n_j \cdot n_k$$
Quartets anchored at v (root tree arbitrarily):
$$\sum_{v} \Big( \sum_{i<j<k<l} n_i \cdot n_j \cdot n_k \cdot n_l + \big(n - \sum_{l} n_l\big) \sum_{i<j<k} n_i \cdot n_j \cdot n_k \Big)$$
Computable in O(n) time using DFS + dynamic programming
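The triplet sum above can be evaluated in one traversal. The sketch below is a minimal illustration that assumes nested tuples as the tree representation; at every node it maintains the elementary symmetric sums of the child sizes so that the triple sum over i < j < k costs time linear in the degree. The function name is hypothetical.

```python
def count_unresolved_triplets(tree):
    """Return (#leaves, #unresolved triplets) for a rooted nested-tuple tree."""
    if not isinstance(tree, tuple):
        return 1, 0
    e1 = e2 = e3 = 0   # symmetric sums of orders 1, 2 and 3 over the child sizes seen so far
    below = 0          # unresolved triplets anchored at descendants of this node
    for child in tree:
        size, count = count_unresolved_triplets(child)
        below += count
        e3 += e2 * size   # update the highest order first so each child is used once
        e2 += e1 * size
        e1 += size
    return e1, below + e3


print(count_unresolved_triplets(((1, 2, 3), 4, 5)))      # (5, 4)
print(count_unresolved_triplets(((1, 3), ((2, 4), 5))))  # (5, 0)
```

Every node does work proportional to its number of children, so the whole traversal is linear in the tree size. The quartet sum can be handled the same way with one more symmetric sum and the number of leaves outside the subtree of v. The recursion follows the tree depth, so very deep trees would need an explicit stack.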
![Page 15: Computing Triplet and Quartet Distances Between Trees Gerth Stølting Brodal, Morten Kragelund Holt, Jens Johansen Aarhus University Rolf Fagerberg University](https://reader036.vdocuments.net/reader036/viewer/2022062421/56649d015503460f949d38b6/html5/thumbnails/15.jpg)
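Counting Agreeing Triplets (Basic Idea)
[Figure: node v of T1 with children 1, ..., i, ..., j, ..., d; leaves below child i have color i, all other leaves have color 0. Node w of T2 with a child c]
$$\sum_{v \in T_1} \sum_{w \in T_2} \sum_{c} \sum_{1 \le i \le d} \binom{n_i^c}{2} \left( n^w - n^c - n_i^w + n_i^c \right)$$
where $n_i^w$ is the number of leaves of color i below w and
$$n^w = \sum_{1 \le i \le d} n_i^w$$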
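The counting formula can be evaluated directly, without any of the speed-ups that follow. The sketch below assumes nested-tuple trees and recolors from scratch for every node v of T1, so it is far slower than the algorithm in the talk; it only illustrates what each term counts. The names `agreeing_triplets`, `per_child` and `n_w_by_color` are hypothetical.

```python
from collections import Counter
from math import comb


def leaves(tree):
    """List of leaf labels below a nested-tuple tree."""
    return [tree] if not isinstance(tree, tuple) else [x for c in tree for x in leaves(c)]


def internal_nodes(tree):
    """Yield every internal node (tuple) of the tree."""
    if isinstance(tree, tuple):
        yield tree
        for child in tree:
            yield from internal_nodes(child)


def agreeing_triplets(t1, t2):
    """Count triplets that are resolved in both trees with the same topology."""
    total = 0
    for v in internal_nodes(t1):
        # Leaves below the i-th child of v get color i; all other leaves keep color 0.
        color = {leaf: i for i, child in enumerate(v, start=1) for leaf in leaves(child)}
        for w in internal_nodes(t2):
            # per_child[c][i] = number of leaves of color i below child c of w
            per_child = [Counter(color[x] for x in leaves(c) if x in color) for c in w]
            n_w_by_color = sum(per_child, Counter())
            n_w = sum(n_w_by_color.values())
            for counts in per_child:
                n_c = sum(counts.values())
                for i, n_ic in counts.items():
                    # Two leaves of color i below c, third leaf of another
                    # non-zero color below a different child of w.
                    total += comb(n_ic, 2) * (n_w - n_c - n_w_by_color[i] + n_ic)
    return total


t1 = ((1, 3), ((2, 4), 5))   # assumed trees, consistent with the triplet-distance slide
t2 = (((1, 3), 2), (4, 5))
print(agreeing_triplets(t1, t2))  # 5
```

Each agreeing triplet k|ij is counted exactly once: at the node v of T1 and the node w of T2 that are the lowest common ancestors of its three leaves.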
![Page 16: Computing Triplet and Quartet Distances Between Trees Gerth Stølting Brodal, Morten Kragelund Holt, Jens Johansen Aarhus University Rolf Fagerberg University](https://reader036.vdocuments.net/reader036/viewer/2022062421/56649d015503460f949d38b6/html5/thumbnails/16.jpg)
Efficient Computation
Limit recolorings in T1 (and T2) to O(n·log n)
[Figure: recursion at node v of T1: count T2 contribution (precondition), recolor, recolor, recurse, recolor & recurse]
Reduce recoloring cost in T2 from $O(n^2)$ to $O(n \cdot \log^2 n)$
[Figure: example tree on nodes 1–9 and its decomposition]
Reduce recoloring cost in T2 from $O(n \cdot \log^2 n)$ to $O(n \cdot \log n)$
[Figure: T2 (arbitrary height and degree) and H(T2) (binary, height O(log n))]
Contract T2 and reconstruct H(T2) during recursion
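The O(n·log n) bound on recolorings rests on a smaller-half argument: if the leaves below the largest child of each node keep their color and only the leaves below the other children are recolored, every leaf is touched at most log2 n times, because the subtree size at least doubles each time a leaf sits below a non-largest child. The sketch below measures this proxy for the recoloring work on nested-tuple trees. The function name is hypothetical and constant factors of the actual algorithm are ignored.

```python
from math import log2


def light_leaf_count(tree):
    """Return (#leaves, total number of leaves below non-largest children over all nodes)."""
    if not isinstance(tree, tuple):
        return 1, 0
    sizes, cost = [], 0
    for child in tree:
        size, child_cost = light_leaf_count(child)
        sizes.append(size)
        cost += child_cost
    # Every child except one largest child contributes its leaves at this node.
    return sum(sizes), cost + sum(sizes) - max(sizes)


balanced = (((1, 2), (3, 4)), ((5, 6), (7, 8)))
caterpillar = (((((((1, 2), 3), 4), 5), 6), 7), 8)
for tree in (balanced, caterpillar):
    n, cost = light_leaf_count(tree)
    print(n, cost, cost <= n * log2(n))
```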
![Page 17: Computing Triplet and Quartet Distances Between Trees Gerth Stølting Brodal, Morten Kragelund Holt, Jens Johansen Aarhus University Rolf Fagerberg University](https://reader036.vdocuments.net/reader036/viewer/2022062421/56649d015503460f949d38b6/html5/thumbnails/17.jpg)
Counting Agreeing Triplets (II)
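[Figure: node v of T1 with children 1, ..., i, ..., j, ..., d and remaining leaves colored 0; a node in H(T2) is a component composition in T2, here of components C1 and C2]
Contribution to agreeing triplets at node in H(T2):
$$\sum_{1 \le i \le d} \binom{n_i^{C_1}}{2} \left( n_*^{C_2} - n_i^{C_2} \right) + \sum_{1 \le i \le d} \left( n_*^{C_1} - n_i^{C_1} \right) n_{(ii)}^{C_2} + \sum_{1 \le i \le d} n_i^{C_1} \cdot n_{i \uparrow *}^{C_2}$$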
![Page 18: Computing Triplet and Quartet Distances Between Trees Gerth Stølting Brodal, Morten Kragelund Holt, Jens Johansen Aarhus University Rolf Fagerberg University](https://reader036.vdocuments.net/reader036/viewer/2022062421/56649d015503460f949d38b6/html5/thumbnails/18.jpg)
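From $O(n \cdot \log^2 n)$ to $O(n \cdot \log n)$
Update O(1) counters for all colors through node
[Figure: node v of T1 with $n_v$ leaves and children 1, ..., i, ..., j, ..., d with $n_i$ leaves each; node w in H(T2)]
Compressed version of T2 of size $O(n_v)$
Colored path lengths:
$$\sum_{2 \le i \le d} n_i \cdot \log \frac{|T_2|}{n_i} = \sum_{2 \le i \le d} n_i \cdot \log \frac{n_v}{n_i}$$
Total cost for updating counters:
$$\sum_{\text{leaf } l \in T_1} \; \sum_{\substack{\text{ancestor } a(j) \\ \text{not heavy child}}} \log \frac{n_{a(j+1)}}{n_{a(j)}} = n \cdot \log n$$
[Figure: path l = a(0), a(1), a(2), a(3), a(4), a(5) from a leaf towards the root of T1]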
![Page 19: Computing Triplet and Quartet Distances Between Trees Gerth Stølting Brodal, Morten Kragelund Holt, Jens Johansen Aarhus University Rolf Fagerberg University](https://reader036.vdocuments.net/reader036/viewer/2022062421/56649d015503460f949d38b6/html5/thumbnails/19.jpg)
Counting Quartets...
Bottleneck in computing disagreeing resolved-resolved quartets
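[Figure: node v of T1 with children 1, ..., i, ..., j, ..., d and remaining leaves colored 0; components G1 and G2 in T2, each containing leaves of colors i and j]
$$\sum_{1 \le i < d} \; \sum_{i < j \le d} n_{(ij)}^{G_1} \cdot n_{(ij)}^{G_2}$$
double sum ⇒ factor d in time
Root T1 and T2 arbitrarily
Keep up to $7d^2 + 97d + 29$ different counters per node in H(T2)...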
![Page 20: Computing Triplet and Quartet Distances Between Trees Gerth Stølting Brodal, Morten Kragelund Holt, Jens Johansen Aarhus University Rolf Fagerberg University](https://reader036.vdocuments.net/reader036/viewer/2022062421/56649d015503460f949d38b6/html5/thumbnails/20.jpg)
Distance Computation
Classification of each triplet by its state in T1 and in T2:
T1 resolved, T2 resolved: A (agree) or B (disagree)
T1 resolved, T2 unresolved: C
T1 unresolved, T2 resolved: D
T1 unresolved, T2 unresolved: E
Triplet-dist(T1, T2) = B + C + D = $\binom{n}{3}$ − A − E
Sufficient to compute A and E
Running times shown on the slide: O(n·log n) triplets & quartets; O(n·log n) triplets; O(d·n·log n) quartets
![Page 21: Computing Triplet and Quartet Distances Between Trees Gerth Stølting Brodal, Morten Kragelund Holt, Jens Johansen Aarhus University Rolf Fagerberg University](https://reader036.vdocuments.net/reader036/viewer/2022062421/56649d015503460f949d38b6/html5/thumbnails/21.jpg)
ALENEX 2014: Implementation
(M.Sc. thesis Morten Kragelund Holt and Jens Johansen)
First implementation for triplets for arbitrary degree
Space usage 10 KB per node for quartet (binary trees): can handle 1,000,000 leaves
64 bit integers, except 128 bit integers for values > $n^3$: quartet distance of up to 2,000,000 leaves
Worst-case #counters per node in HDT(T2)
Triplet, binary: time O(n log n), 6 counters
Triplet, arbitrary degree: time O(n log n), 4d + 2 counters
Quartet, binary: time O(n log n), 40 counters
Quartet, arbitrary degree: time O(max(d1, d2) n log n), $2d^2 + 79d + 22$ counters (B, with T1 ↔ T2 swap)
Quartet, arbitrary degree: time O(min(d1, d2) n log n), $7d^2 + 97d + 29$ counters (B, no swap) and $d^2 + 12d + 12$ counters (E, no swap)
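The choice of integer widths can be sanity-checked with a little arithmetic. The sketch below assumes signed two's-complement integers (an assumption, the slide does not say) and compares the bits needed for $n^3$ and for the number of quartets $\binom{n}{4}$ at the leaf counts mentioned above.

```python
from math import comb


def bits_needed(value):
    """Bits of a signed two's-complement integer that can hold a non-negative value."""
    return value.bit_length() + 1


for n in (1_000_000, 2_000_000):
    quartets = comb(n, 4)  # number of quartets, the largest possible count
    print(n, bits_needed(n ** 3), bits_needed(quartets))
```

For both sizes $n^3$ still fits in 64 bits, while the number of quartets does not, which matches the rule of switching to 128 bit integers only for values above $n^3$.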
![Page 22: Computing Triplet and Quartet Distances Between Trees Gerth Stølting Brodal, Morten Kragelund Holt, Jens Johansen Aarhus University Rolf Fagerberg University](https://reader036.vdocuments.net/reader036/viewer/2022062421/56649d015503460f949d38b6/html5/thumbnails/22.jpg)
Experimental Results
Quartet Distance – Binary Trees
[Plot legend: [SODA 2013], MP 2004, NKMP 2011]
[ALENEX 2014] are the first O(n·log n) implementations
MP 2004: overhead from working with polynomials
![Page 23: Computing Triplet and Quartet Distances Between Trees Gerth Stølting Brodal, Morten Kragelund Holt, Jens Johansen Aarhus University Rolf Fagerberg University](https://reader036.vdocuments.net/reader036/viewer/2022062421/56649d015503460f949d38b6/html5/thumbnails/23.jpg)
Experimental Results
Quartet Distance – High Degree Trees
[Plots for d = 256 and d = 1024; legend: NKMP 2011, [SODA 2013] max]
[ALENEX 2014] is the first n·poly(log n, d) implementation
![Page 24: Computing Triplet and Quartet Distances Between Trees Gerth Stølting Brodal, Morten Kragelund Holt, Jens Johansen Aarhus University Rolf Fagerberg University](https://reader036.vdocuments.net/reader036/viewer/2022062421/56649d015503460f949d38b6/html5/thumbnails/24.jpg)
Experimental Results
Triplet Distance – Binary Trees
[Plot legend: [SODA 2013], SBFPM 2013]
[ALENEX 2014] is the first O(n·log n) implementation
SBFPM 2013: only binary trees, no contractions
![Page 25: Computing Triplet and Quartet Distances Between Trees Gerth Stølting Brodal, Morten Kragelund Holt, Jens Johansen Aarhus University Rolf Fagerberg University](https://reader036.vdocuments.net/reader036/viewer/2022062421/56649d015503460f949d38b6/html5/thumbnails/25.jpg)
Experimental Results
Triplet Distance – High Degree Trees
[Plot legend: [SBFPM 2013]; [SODA 2013], d = 2; [SODA 2013], d = 256; [SODA 2013], d = 1024]
[ALENEX 2014] first implementation
Triplet distance appears hardest for binary trees
![Page 26: Computing Triplet and Quartet Distances Between Trees Gerth Stølting Brodal, Morten Kragelund Holt, Jens Johansen Aarhus University Rolf Fagerberg University](https://reader036.vdocuments.net/reader036/viewer/2022062421/56649d015503460f949d38b6/html5/thumbnails/26.jpg)
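Summary
Rooted triplet distance, binary trees: $O(n^2)$ CPQ 1996; $O(n \log^2 n)$ SBFPM 2013; $O(n \log n)$ [SODA 2013]
Unrooted quartet distance, binary trees: $O(n^3)$ D 1985; $O(n^2)$ BTKL 2000; $O(n \log^2 n)$ BFP 2001; $O(n \log n)$ BFP 2003; $O(n \log n)$ [SODA 2013]
Rooted triplet distance, arbitrary degrees: $O(n^2)$ BDF 2011; $O(n \log n)$ [SODA 2013]
Unrooted quartet distance, arbitrary degrees: $O(d^9 n \log n)$ SPMBF 2007; $O(n^{2.688})$ NKMP 2011; $O(d\,n \log n)$ [SODA 2013][ALENEX 2014]
d = min(d1, d2), where di is the maximal degree of any node in Ti
[Figure: example rooted and unrooted trees, binary and of arbitrary degree]
Open: O(n·log n)? o(n·log n)?
[Legend: highlighted entries = fastest implementation for large n]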
![Page 27: Computing Triplet and Quartet Distances Between Trees Gerth Stølting Brodal, Morten Kragelund Holt, Jens Johansen Aarhus University Rolf Fagerberg University](https://reader036.vdocuments.net/reader036/viewer/2022062421/56649d015503460f949d38b6/html5/thumbnails/27.jpg)
References
On the Scalability of Computing Triplet and Quartet Distances. M.K. Holt, J. Johansen, G.S. Brodal. ALENEX 2014.
Algorithms for Computing the Triplet and Quartet Distances for Binary and General Trees. A. Sand, M.K. Holt, J. Johansen, R. Fagerberg, G.S. Brodal, C.N.S. Pedersen, T. Mailund. Biology - Special Issue on Developments in Bioinformatic Algorithms, 2013.
A practical $O(n \log^2 n)$ time algorithm for computing the triplet distance on binary trees. A. Sand, G.S. Brodal, R. Fagerberg, C.N.S. Pedersen, T. Mailund. BMC Bioinformatics 2013.
Efficient Algorithms for Computing the Triplet and Quartet Distance Between Trees of Arbitrary Degree. G.S. Brodal, R. Fagerberg, C.N.S. Pedersen, T. Mailund, A. Sand. SODA 2013.
A sub-cubic time algorithm for computing the quartet distance between two general trees. J. Nielsen, A. K. Kristensen, T. Mailund, C.N.S. Pedersen. Algorithms in Molecular Biology 2011.
Computing the Quartet Distance Between Evolutionary Trees of Bounded Degree. M. Stissing, C.N.S. Pedersen, T. Mailund, G.S. Brodal, R. Fagerberg. APBC 2007.
QDist - Quartet Distance between Evolutionary Trees. T. Mailund and C.N.S. Pedersen. Bioinformatics 2004.
Computing the Quartet Distance Between Evolutionary Trees in Time O(n log n). G.S. Brodal, R. Fagerberg, C.N.S. Pedersen. Algorithmica 2004.
Computing the Quartet Distance Between Evolutionary Trees in Time $O(n \log^2 n)$. G.S. Brodal, R. Fagerberg, C.N.S. Pedersen. ISAAC 2001.
birc.au.dk/software