The Saitou&Nei Neighbor Joining Algorithm
©Shlomo Moran & Ilan Gronau
Recall: Distance-Based Reconstruction
• Input: distances between all taxon-pairs
• Output: a tree (edge-weighted) best-describing the distances
D (lower triangle):
0
3 0
9 8 0
15 14 18 0
17 16 20 22 0
16 15 19 21 9 0
[Figure: an edge-weighted tree over the six taxa that realizes D]
Requirements from Distance-Based Tree-Reconstruction Algorithms
1. Consistency: If the input metric is a tree metric, the returned tree should be the (unique) tree which fits this metric.
2. Efficiency: poly-time, preferably no more than O(n^3), where n is the number of leaves (i.e., the distance matrix is n✕n).
3. Robustness: if the input matrix is “close” to a tree metric, the algorithm should return the corresponding tree.
Definition: A tree metric (or additive distances) is a set of distances which can be realized by a weighted tree.
A natural family of algorithms which satisfy 1 and 2 is called “Neighbor Joining”, presented next. Then we present one such algorithm which is known to be robust in practice.
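To make "realized by a weighted tree" concrete, here is a minimal Python sketch that computes all leaf-to-leaf distances of a small weighted tree. The tree, its vertex names (`u`, `v`, `w` are internal vertices) and the helper `leaf_distances` are illustrative assumptions, not taken from the slides.

```python
def leaf_distances(edges, leaves):
    """All leaf-to-leaf path lengths in a weighted tree given as (u, v, weight) triples."""
    adj = {}
    for u, v, w in edges:
        adj.setdefault(u, []).append((v, w))
        adj.setdefault(v, []).append((u, w))

    def distances_from(src):
        # Depth-first traversal; in a tree every vertex is reached by a unique path.
        dist = {src: 0}
        stack = [src]
        while stack:
            u = stack.pop()
            for v, w in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + w
                    stack.append(v)
        return dist

    D = {}
    for a in leaves:
        from_a = distances_from(a)
        D[a] = {b: from_a[b] for b in leaves}
    return D


# Hypothetical 5-leaf tree: cherries (a, b) and (d, e), with c attached in the middle.
EDGES = [("a", "u", 2), ("b", "u", 3), ("u", "v", 3), ("c", "v", 4),
         ("v", "w", 2), ("d", "w", 2), ("e", "w", 1)]
LEAVES = ["a", "b", "c", "d", "e"]

D = leaf_distances(EDGES, LEAVES)
print(D["a"]["b"], D["c"]["e"])  # 5 7
```

The resulting matrix D is a tree metric by construction; a consistent algorithm given D as input should return this tree.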
The Neighbor Joining Tree-Reconstruction Scheme
Start with an n✕n distance matrix D over a set S of n taxa (or vertices, or leaves).
1. Use D to select a pair of neighboring leaves (cherries) i,j.
2. Define a new vertex v as the parent of the cherries i,j.
3. Compute a reduced (n-1)✕(n-1) distance matrix D’ over S’ = S \ {i,j} ∪ {v}.
Important: the distances from v to the other vertices in S’ must be computed so that D’ is a distance matrix of the reduced tree T’, obtained by pruning i,j from T.
[Figure: D is reduced to D’ by replacing the rows and columns of i and j with a single row and column for v]
The Neighbor Joining Tree-Reconstruction Scheme (cont)
4. Apply the method recursively on the reduced matrix D’, to get
the reduced tree T’.
5. In T’, add i,j as children of v (and possibly update edge
lengths).
Recursion base: when there are only two objects, return a tree with 2 leaves.
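The scheme does not fix how the distances from v are computed. One standard choice, assumed here and not stated on the slides, is D’(v,k) = (D(i,k) + D(j,k) - D(i,j)) / 2, which is the exact distance from v to k whenever D is a tree metric and i,j are cherries. A minimal Python sketch of the reduction step under that assumption (the function name, the tuple label for v and the example matrix are hypothetical):

```python
def reduce_matrix(D, i, j):
    """Replace leaves i and j by their parent v; return (v, reduced matrix)."""
    v = (i, j)  # hypothetical labelling: name the parent after its two children
    rest = [k for k in D if k != i and k != j]
    # Distances among the untouched taxa are copied unchanged.
    reduced = {a: {b: D[a][b] for b in rest} for a in rest}
    reduced[v] = {v: 0}
    for k in rest:
        # Assumed reduction formula; exact for additive D when i, j are cherries.
        d_vk = (D[i][k] + D[j][k] - D[i][j]) / 2
        reduced[v][k] = d_vk
        reduced[k][v] = d_vk
    return v, reduced


# Illustrative additive matrix over five taxa (a, b are cherries).
LABELS = ["a", "b", "c", "d", "e"]
ROWS = [[0, 5, 9, 9, 8],
        [5, 0, 10, 10, 9],
        [9, 10, 0, 8, 7],
        [9, 10, 8, 0, 3],
        [8, 9, 7, 3, 0]]
D = {x: dict(zip(LABELS, row)) for x, row in zip(LABELS, ROWS)}

v, D_reduced = reduce_matrix(D, "a", "b")
print(D_reduced[v]["c"], D_reduced[v]["d"], D_reduced[v]["e"])  # 7.0 7.0 6.0
```

Applying the scheme recursively amounts to calling such a reduction n-2 times, each time on the pair chosen by the cherry-selection rule.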
[Figure: the reduced tree T’ reconstructed from D’, with i and j re-attached as children of v]
Question: how can we find cherries?
Least Common Ancestor Depth
Let i,j be leaves in T, and let r ≠ i,j be a vertex in T.
LCAr(i,j) is the Least Common Ancestor of i and j when r is viewed as a root.
If r is fixed we just write LCA(i,j).
dT(r,LCA(i,j)) is the “depth of LCAr(i,j)”.
[Figure: tree rooted at r with leaves i,j; the path from r to LCA(i,j) has length dT(r,LCA(i,j))]
Cherries maximize the LCA Depth
Let T be a weighted tree, with a root r. For leaves i,j ≠ r, let L(i,j) = dT(r,LCA(i,j)).
Then if
$$L(i_0,j_0) = \max_{i \ne j}\{L(i,j)\}$$
then $i_0$ and $j_0$ are cherries.
[Figure: tree rooted at r in which the deepest LCA is a vertex v whose children are the leaves i,j]
This property can be used to select cherry pairs.
The “Saitou&Nei” NJ algorithm uses a variant of this property.
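For a tree metric, the LCA depth can be read directly from the distances, since D(r,i) + D(r,j) - D(i,j) = 2·dT(r,LCA(i,j)). The following Python sketch uses that identity to pick the pair with the deepest LCA for a fixed root leaf r. The function names and the example matrix are illustrative assumptions.

```python
from itertools import combinations


def lca_depth(D, r, i, j):
    """L(i,j) = d_T(r, LCA_r(i,j)), computed from distances only (D assumed additive)."""
    return (D[r][i] + D[r][j] - D[i][j]) / 2


def deepest_pair(D, r):
    """Pair of taxa (other than r) whose LCA is deepest when r is viewed as the root."""
    others = [x for x in range(len(D)) if x != r]
    return max(combinations(others, 2), key=lambda p: lca_depth(D, r, p[0], p[1]))


# Illustrative additive matrix over taxa 0..4; (0, 1) and (3, 4) are cherries.
D = [[0, 5, 9, 9, 8],
     [5, 0, 10, 10, 9],
     [9, 10, 0, 8, 7],
     [9, 10, 8, 0, 3],
     [8, 9, 7, 3, 0]]

print(lca_depth(D, 4, 0, 1))  # 6.0
print(deepest_pair(D, 4))     # (0, 1)
```

With taxon 4 as the root, the pair (0, 1) has LCA depth 6, larger than for any other pair, and it is indeed a cherry of the underlying tree.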
Saitou & Nei’s Neighbor Joining Algorithm (1987)
Select i,j which maximize the sum
$$Q(i,j) = \sum_{r \ne i} D(r,i) + \sum_{r \ne j} D(r,j) - (n-2)\,D(i,j)$$
~13,000 citations (Science Citation Index)
Implemented in numerous phylogenetic packages
Fastest implementation - Θ(n^3)
Usually referred to as “the NJ algorithm”
Identified by its neighbor selection criterion
Saitou & Nei’s neighbor-selection criterion
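As an illustration of the selection criterion, the sketch below evaluates Q(i,j) = Σ_r D(r,i) + Σ_r D(r,j) - (n-2)·D(i,j) for every pair and returns a maximizing pair. It uses the maximization form of these slides (many references state the negated quantity and minimize it). The function names and the example matrix are assumptions made for the example.

```python
from itertools import combinations


def q_values(D):
    """Q(i,j) for all pairs i < j of an n x n distance matrix (list of lists)."""
    n = len(D)
    row_sums = [sum(row) for row in D]  # sum_r D(r,i); the diagonal contributes 0
    return {(i, j): row_sums[i] + row_sums[j] - (n - 2) * D[i][j]
            for i, j in combinations(range(n), 2)}


def select_pair(D):
    """A pair maximizing Q (first one found in case of ties)."""
    Q = q_values(D)
    return max(Q, key=Q.get)


# Illustrative additive matrix over taxa 0..4; (0, 1) and (3, 4) are cherries.
D = [[0, 5, 9, 9, 8],
     [5, 0, 10, 10, 9],
     [9, 10, 0, 8, 7],
     [9, 10, 8, 0, 3],
     [8, 9, 7, 3, 0]]

Q = q_values(D)
print(Q[(0, 1)], Q[(3, 4)], Q[(2, 3)])  # 50 48 40
print(select_pair(D))                   # (0, 1)
```

Precomputing the row sums once makes each Q(i,j) an O(1) evaluation, so all values are obtained in Θ(n^2) time.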
Consistency of Saitou&Nei method
Theorem (Saitou&Nei): Assume all edge weights of T are positive. If Q(i,j) = max_{i’,j’} Q(i’,j’), then i and j are cherries in the tree.
Proof: in the following slides.
1st step in the proof: Express Saitou&Nei selection criterion in terms of LCA distances
Saitou & Nei’s selection criterion: select i,j which maximize
$$Q(i,j) = \sum_{r \ne i} D(r,i) + \sum_{r \ne j} D(r,j) - (n-2)\,D(i,j) = 2\Big[\sum_{r \ne i,j} d_T(r,\mathrm{LCA}_r(i,j)) + D(i,j)\Big]$$
Intuition: NJ “tries” to select taxon-pairs whose LCA is deepest on average.
The addition of D(i,j) is needed to make the formula consistent.
Next we prove the above equality.
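Proof of equality in previous slide
$$\begin{aligned}
Q(i,j) &= \sum_{r \ne i} D(r,i) + \sum_{r \ne j} D(r,j) - (n-2)\,D(i,j) \\
&= D(i,j) + \sum_{r \ne i,j} D(i,r) + D(j,i) + \sum_{r \ne i,j} D(j,r) - (n-2)\,D(i,j) \\
&= \sum_{r \ne i,j} \big[D(i,r) + D(j,r) - D(i,j)\big] + 2D(i,j) \\
&= 2\Big[D(i,j) + \sum_{r \ne i,j} d_T(r,\mathrm{LCA}_r(i,j))\Big]
\end{aligned}$$
The last step uses $D(i,r) + D(j,r) - D(i,j) = 2\,d_T(r,\mathrm{LCA}_r(i,j))$ for each of the n-2 vertices r ≠ i,j.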
2nd step in proof: Consistency of Saitou&Nei Neighbor Selection
We need to show that a pair of leaves i,j which maximize
$$Q'(i,j) = Q(i,j)/2 = D(i,j) + \sum_{r \ne i,j} d_T(r,\mathrm{LCA}_r(i,j))$$
must be cherries. First we express Q’ as a sum of edge weights.
For a vertex i and an edge e: Ni(e) = |{r ∈ S : e is on path(i,r)}|. Then:
$$Q'(i,j) = D(i,j) + \sum_{r \ne i,j} d_T(r,\mathrm{LCA}_r(i,j)) = \sum_{e \in path(i,j)} w(e) + \sum_{e \notin path(i,j)} N_i(e)\,w(e)$$
Note: If e’ is a “leaf edge”, then w(e’) is added exactly once to Q’(i,j).
[Figure: leaves i,j joined by path(i,j); an edge e off the path leads to the rest of T, which contains r]
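The edge-weight expression for Q’ can be checked numerically on a small tree. The Python sketch below computes Q’(i,j) twice: once by counting, for each edge e off path(i,j), the number Ni(e) of leaves r whose path from i uses e, and once from the distance-based definition. The tree and all function names are illustrative assumptions.

```python
def build_adj(edges):
    adj = {}
    for u, v, _ in edges:
        adj.setdefault(u, []).append(v)
        adj.setdefault(v, []).append(u)
    return adj


def path_edges(adj, src, dst):
    """Set of edges (as frozensets) on the unique tree path from src to dst."""
    parent = {src: None}
    stack = [src]
    while stack:
        u = stack.pop()
        for v in adj[u]:
            if v not in parent:
                parent[v] = u
                stack.append(v)
    edges, node = set(), dst
    while parent[node] is not None:
        edges.add(frozenset((node, parent[node])))
        node = parent[node]
    return edges


def q_prime_by_edges(edges, leaves, i, j):
    """Sum of w(e) over path(i,j) plus N_i(e) * w(e) over all other edges."""
    adj = build_adj(edges)
    on_path = path_edges(adj, i, j)
    paths_from_i = [path_edges(adj, i, r) for r in leaves]
    total = 0
    for u, v, w in edges:
        e = frozenset((u, v))
        if e in on_path:
            total += w
        else:
            n_i = sum(e in p for p in paths_from_i)  # N_i(e)
            total += n_i * w
    return total


def q_prime_by_distances(edges, leaves, i, j):
    """D(i,j) plus the LCA depths d_T(r, LCA_r(i,j)) summed over leaves r != i, j."""
    adj = build_adj(edges)
    weight = {frozenset((u, v)): w for u, v, w in edges}

    def d(a, b):
        return sum(weight[e] for e in path_edges(adj, a, b))

    depths = sum((d(i, r) + d(j, r) - d(i, j)) / 2 for r in leaves if r not in (i, j))
    return d(i, j) + depths


# Hypothetical 5-leaf tree with internal vertices u, v, w.
EDGES = [("a", "u", 2), ("b", "u", 3), ("u", "v", 3), ("c", "v", 4),
         ("v", "w", 2), ("d", "w", 2), ("e", "w", 1)]
LEAVES = ["a", "b", "c", "d", "e"]

print(q_prime_by_edges(EDGES, LEAVES, "a", "b"))      # 25
print(q_prime_by_distances(EDGES, LEAVES, "a", "b"))  # 25.0
```

On this tree the cherries (a, b) and (d, e) get Q’ values 25 and 24, while the non-cherry pair (a, c) gets only 19.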
Consistency of Saitou&Nei (cont)
Let (see the figure below):
• path(i,j) = (i,...,k,j).
• T1 = the subtree rooted at k. Assume WLOG that T1 has at most n/2 leaves.
• T2 = T \ T1.
[Figure: leaves i and j connected through k; T1 is the subtree rooted at k and contains the cherries i’,j’; T2 is the rest of T]
Assume for contradiction that Q’(i,j) is maximized for i,j which are not cherries.
Let i’,j’ be any two cherries in T1. We will show that Q’(i’,j’) > Q’(i,j).
Consistency of Saitou&Nei (cont)
Proof that Q’(i’,j’) > Q’(i,j):
$$Q'(i,j) = \sum_{e \in p(i,j)} w(e) + \sum_{e \notin p(i,j)} N_i(e)\,w(e)$$
$$Q'(i',j') = \sum_{e \in p(i',j')} w(e) + \sum_{e \notin p(i',j')} N_{i'}(e)\,w(e)$$
Each leaf edge e adds w(e) both to Q’(i,j) and to Q’(i’,j’), so we can ignore the contribution of leaf edges to both Q’(i,j) and Q’(i’,j’).
[Figure: as in the previous slide, with i, j, k, T1, T2 and the cherries i’,j’ in T1]
Consistency of Saitou&Nei (end)
Contribution of internal edges to Q’(i,j) and to Q’(i’,j’):
Location of internal edge e | # times w(e) is added to Q’(i,j) | # times w(e) is added to Q’(i’,j’)
e ∈ path(i,j) | 1 | Ni’(e) ≥ 2
e ∈ path(i’,j) | Ni(e) < n/2 | Ni’(e) ≥ n/2
e ∈ T \ path(i,i’) | Ni(e) | = Ni’(e)
Since there is at least one internal edge e in path(i,j), and all edge weights are positive, Q’(i’,j’) > Q’(i,j). QED
[Figure: as in the previous slide, with i, j, k, T1, T2 and the cherries i’,j’ in T1]
Complexity of Saitou&Nei NJ Algorithm
Initialization: Θ(n^2) to compute Q(i,j) for all i,j ∈ L.
Each Iteration: O(n^2) to find the maximal Q(i,j), and to update the values of Q(x,y).
Total: O(n^3)
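The whole procedure can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: it recovers the topology only (edge lengths are omitted), it recomputes the row sums in every iteration instead of updating Q incrementally, and it assumes the reduction D’(v,k) = (D(i,k) + D(j,k) - D(i,j)) / 2, which does not appear on the slides. Each of the n-2 iterations costs O(n^2), giving O(n^3) in total.

```python
from itertools import combinations


def neighbor_joining(D, labels):
    """Return the NJ topology as nested tuples (edge lengths omitted)."""
    nodes = list(labels)
    # Dict-of-dicts copy keyed by label, so the input matrix is left untouched.
    dist = {a: {b: D[x][y] for y, b in enumerate(labels)}
            for x, a in enumerate(labels)}
    while len(nodes) > 2:
        n = len(nodes)
        # O(n^2): row sums over the current set of nodes.
        row = {a: sum(dist[a][b] for b in nodes) for a in nodes}
        # O(n^2): pair maximizing Q(i,j) = row[i] + row[j] - (n-2) * D(i,j).
        i, j = max(combinations(nodes, 2),
                   key=lambda p: row[p[0]] + row[p[1]] - (n - 2) * dist[p[0]][p[1]])
        v = (i, j)  # the new parent vertex is labelled by its two children
        rest = [k for k in nodes if k != i and k != j]
        dist[v] = {v: 0}
        for k in rest:
            # Assumed reduction formula (exact for additive D).
            d_vk = (dist[i][k] + dist[j][k] - dist[i][j]) / 2
            dist[v][k] = d_vk
            dist[k][v] = d_vk
        nodes = rest + [v]
    # Recursion base: two remaining objects form the final edge.
    return (nodes[0], nodes[1])


# Illustrative additive matrix; the underlying tree has cherries (a, b) and (d, e).
LABELS = ["a", "b", "c", "d", "e"]
D = [[0, 5, 9, 9, 8],
     [5, 0, 10, 10, 9],
     [9, 10, 0, 8, 7],
     [9, 10, 8, 0, 3],
     [8, 9, 7, 3, 0]]

print(neighbor_joining(D, LABELS))
```

The returned nesting is an arbitrary rooting of an unrooted tree; on this input both cherries, (a, b) and (d, e), appear as joined pairs.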