Lecture 13

Maximal Accurate Forests From Distance Matrix

Cavender-Farris-Neyman 2-state model

Definition 1:

Let T be a fixed rooted tree with leaves labeled 1,…,n. The Cavender-Farris-Neyman 2-state model makes the following assumptions:

1. The possible states for each site are 0 and 1.

2. Along every edge e of the tree, with probability Θ(e) the child copies its value from its parent, and with probability 1 − Θ(e) it picks a value uniformly at random from {0,1}.

3. The sites evolve independently and identically distributed (i.i.d.) down the tree from the root.

Let Θ(u,v) be the probability that, on the way from node u to node v, the value of the site is copied on every edge. If the path from u to v is e1, e2, …, ep, then, since the edges act independently:

Θ(u,v) = ∏_{i=1}^{p} Θ(ei)

Let p(u,v) be the probability that the value at v differs from the value at u. If at least one edge on the path randomizes, the value at v is uniform in {0,1}, so it differs from the value at u with probability ½. Hence:

p(u,v) = ½(1 − Θ(u,v))
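The relation between Θ(u,v) and p(u,v) can be checked numerically. The sketch below is only an illustration: the function names are hypothetical, and it assumes each edge is represented by its 2x2 transition matrix. It multiplies the edge matrices along a path and compares the resulting flip probability with ½(1 − ∏ Θ(ei)).

```python
def edge_matrix(theta):
    """Transition matrix of one edge with copy probability theta (hypothetical helper)."""
    stay = theta + (1 - theta) / 2  # copied, or randomized to the same value
    flip = (1 - theta) / 2          # randomized to the other value
    return [[stay, flip], [flip, stay]]


def matmul(a, b):
    """Product of two 2x2 matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


def path_flip_probability(thetas):
    """Probability that the end of the path differs from its start."""
    m = [[1.0, 0.0], [0.0, 1.0]]
    for theta in thetas:
        m = matmul(m, edge_matrix(theta))
    return m[0][1]


def closed_form(thetas):
    """p(u,v) = 1/2 * (1 - product of the edge copy probabilities)."""
    prod = 1.0
    for theta in thetas:
        prod *= theta
    return 0.5 * (1 - prod)


# Example path with three edges; both computations should agree.
example = [0.9, 0.8, 0.5]
difference = abs(path_flip_probability(example) - closed_form(example))
```

Exact matrix products are used instead of random simulation so that the comparison is deterministic.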

Definition of distance

• The input to our problem is a sequence of k site values for each species (DNA).

• The goal is to reconstruct the evolutionary tree.

• So we must define a distance on the sequences that will “match” some tree metric:

d(u,v) = − log Θ(u,v)
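A short worked step shows why this choice can match a tree metric. It uses only the product formula for Θ(u,v) along the path e1, …, ep; the per-edge length d(e) := − log Θ(e) is notation introduced here for illustration, and it assumes 0 < Θ(e) ≤ 1 on every edge.

```latex
d(u,v) = -\log \Theta(u,v)
       = -\log \prod_{i=1}^{p} \Theta(e_i)
       = \sum_{i=1}^{p} \bigl(-\log \Theta(e_i)\bigr)
       = \sum_{i=1}^{p} d(e_i),
\qquad d(e_i) \ge 0 \ \text{because}\ 0 < \Theta(e_i) \le 1 .
```

So the distance between two nodes is the sum of non-negative edge lengths along the path joining them, which is exactly the additive (tree metric) property.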

How to compute distance from sequences?

• Given k samples of the 2-state model process at the leaves, xu(t) for each leaf u and index t from 1 to k, we can estimate Θ(u,v) by 1 − 2p’(u,v), where:

p’(u,v) = (1/k) · |{ t : 1 ≤ t ≤ k, xu(t) ≠ xv(t) }|

• Example on the board.
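The estimator can be sketched in a few lines. This is a minimal illustration, not the lecture's code: the function name is hypothetical, sequences are assumed to be equal-length lists of 0/1 values, and returning infinity when the estimated Θ is not positive is an assumption made here so that the logarithm stays defined.

```python
import math


def estimate_distance(xu, xv):
    """Estimate d(u,v) = -log Theta(u,v) from two 0/1 sequences of length k."""
    if len(xu) != len(xv) or not xu:
        raise ValueError("sequences must be non-empty and of equal length")
    k = len(xu)
    # p'(u,v): fraction of sites at which the two sequences differ.
    p_hat = sum(a != b for a, b in zip(xu, xv)) / k
    # Theta(u,v) is estimated by 1 - 2 p'(u,v).
    theta_hat = 1 - 2 * p_hat
    if theta_hat <= 0:
        # Assumption: treat a non-positive estimate as "too far to measure".
        return math.inf
    return -math.log(theta_hat)


# Two sequences of k = 8 sites that differ at exactly one site.
xu = [0, 1, 1, 0, 1, 0, 0, 1]
xv = [0, 1, 0, 0, 1, 0, 0, 1]
d_uv = estimate_distance(xu, xv)  # p' = 1/8, Theta estimate = 0.75
```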

Definition of the problem

Let T be an edge-weighted, unrooted binary tree. We define:

• L(T): the set of leaves of T.

• For any subset X of L(T), T|X denotes the restriction of T to X.

• For leaves x,y, let P(x,y) denote the path from x to y in T.

• Two subsets L1 and L2 of L(T) are edge-sharing if there exist x,y in L1 and w,z in L2 such that P(x,y) and P(w,z) have a common edge.
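The edge-sharing relation can be made concrete when the tree is known. The sketch below is only illustrative: the adjacency-dict representation and all function names are assumptions, not part of the lecture. It uses the fact that the union of the paths P(x,y) over x,y in a set equals the union of the paths from one fixed member to all the others.

```python
from collections import deque


def path_edges(adj, x, y):
    """Edges on the unique path from x to y in a tree given as {node: set of neighbours}."""
    parent = {x: None}
    queue = deque([x])
    while queue:
        node = queue.popleft()
        if node == y:
            break
        for nb in adj[node]:
            if nb not in parent:
                parent[nb] = node
                queue.append(nb)
    edges = set()
    node = y
    while parent[node] is not None:
        edges.add(frozenset((node, parent[node])))
        node = parent[node]
    return edges


def spanned_edges(adj, leaves):
    """All edges lying on some path between two members of `leaves`."""
    leaves = list(leaves)
    edges = set()
    for x in leaves[1:]:
        edges |= path_edges(adj, leaves[0], x)
    return edges


def edge_sharing(adj, l1, l2):
    """True if some path inside l1 and some path inside l2 have a common edge."""
    return bool(spanned_edges(adj, l1) & spanned_edges(adj, l2))


# Quartet tree ab|cd with internal nodes p and q.
QUARTET = {
    "a": {"p"}, "b": {"p"}, "c": {"q"}, "d": {"q"},
    "p": {"a", "b", "q"}, "q": {"c", "d", "p"},
}
```

In this example `edge_sharing(QUARTET, {"a", "b"}, {"c", "d"})` is false, while `{"a", "c"}` and `{"b", "d"}` are edge-sharing because both paths cross the middle edge between p and q.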

Edge sharing graph

Given a distance matrix D’:

• For v in L(T), let L(v) denote a subset of L(T) such that if D’(v,y) < D’(v,x) and x is in L(v), then y is in L(v).

• For a subset U of L(T), let ε(U) be the graph with nodes {L(x) | x in U} and edges determined by the edge-sharing relation.

• Let SL(v) be the union of L(v) with all neighbours of L(v) in ε(U).

Local (ε,M) distortion

Let T be an edge-weighted binary tree and let D be the associated additive matrix. Suppose 0 < ε < M.

We say that D’ : L(T) × L(T) → R+ is a local (ε,M) distortion for a subset U of L(T) if:

1. D’ is a distance matrix.

2. D’(x,y) = ∞ implies D(x,y) > M, for all x,y in U.

3. D’(x,y) < M implies |D(x,y) − D’(x,y)| < ε, for all x,y in U.
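When both D and D’ are available (for example in a simulation), the three conditions can be tested directly. This is a sketch under stated assumptions: the function name is hypothetical, matrices are lists of lists indexed by leaf number, and condition 1 is approximated by checking symmetry, non-negativity and a zero diagonal on U only.

```python
import math


def is_local_distortion(d, d_prime, subset, eps, m):
    """Check the local (eps, m) distortion conditions for D' against D on `subset`."""
    if not 0 < eps < m:
        raise ValueError("need 0 < eps < m")
    for x in subset:
        # Condition 1 (assumed reading of "distance matrix"): zero diagonal.
        if d_prime[x][x] != 0:
            return False
        for y in subset:
            est, true = d_prime[x][y], d[x][y]
            # Condition 1 continued: non-negative and symmetric.
            if est < 0 or est != d_prime[y][x]:
                return False
            # Condition 2: an infinite estimate is allowed only for far pairs.
            if math.isinf(est) and not true > m:
                return False
            # Condition 3: short estimates must be accurate to within eps.
            if est < m and not abs(true - est) < eps:
                return False
    return True


INF = math.inf
D_TRUE = [[0, 2, 10], [2, 0, 10], [10, 10, 0]]
D_EST = [[0, 2.1, INF], [2.1, 0, INF], [INF, INF, 0]]
ok = is_local_distortion(D_TRUE, D_EST, {0, 1, 2}, eps=0.5, m=5)
```

Here the close pair (0,1) is estimated to within 0.1, and the pairs involving leaf 2 are reported as ∞ while their true distance exceeds M, so all three conditions hold.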

Local distortion decomposition

Let T be an edge-weighted binary tree and let D be the associated additive matrix.

Suppose L(T) = C1 ∪ … ∪ Cα such that T|Ci and T|Cj are edge-disjoint for all 1 ≤ i < j ≤ α. For each 1 ≤ i ≤ α, let 0 < εi < Mi be given.

We say that C = {(Ci, εi, Mi) : 1 ≤ i ≤ α} is a local distortion decomposition of D’ if D’ is a local (εi, Mi) distortion for Ci, for each i from 1 to α.

Constructive distortion decomposition

Furthermore, let fi be the weight of the lightest edge in T|Ci, and suppose that for all i:

• εi < fi/2 and Mi > 7εi

• L(v) is the ball of radius (Mi − 7εi)/6 about v, for each v in Ci

• the graphs ε(Ci) are the connected components of ε(L(T))

Then we say that C is constructive.
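The balls used in this definition depend only on D’, so they can be computed without knowing the tree. The sketch below is illustrative: the function name is hypothetical, a single pair (ε, M) is used for all leaves, and including points at distance exactly equal to the radius is an assumption.

```python
def balls(d_prime, eps, m):
    """Return {v: L(v)} where L(v) is the ball of radius (m - 7*eps)/6 about v in D'."""
    if not (eps > 0 and m > 7 * eps):
        raise ValueError("need eps > 0 and m > 7 * eps")
    radius = (m - 7 * eps) / 6
    n = len(d_prime)
    # Assumption: the ball is closed (distance <= radius).
    return {v: {y for y in range(n) if d_prime[v][y] <= radius}
            for v in range(n)}


# Three leaves: 0 and 1 are close, 2 is far from both.
D_PRIME = [[0, 1, 5], [1, 0, 5], [5, 5, 0]]
L = balls(D_PRIME, eps=0.1, m=7)  # radius = (7 - 0.7) / 6 = 1.05
```

Any ball defined this way satisfies the closure property required of L(v): if x is in the ball and y is closer to v than x, then y is in the ball too.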

Theorem: Let T be an edge-weighted binary tree and let D be the associated additive matrix. Suppose D’ is an (ε,M) distortion of D for L(T) with ε < f/2 and M > 7ε, where f and g are the weights of the lightest and heaviest edges of T, respectively. Let ε(L(T)) be the edge-sharing graph of the balls of radius (M − 7ε)/6 around the leaves.

Then the connected components of ε(L(T)) form a constructive distortion decomposition, and their number is O(2^(−(M−ε)/2g) · n).

Again, what problem do we want to solve?

• Input: a matrix D’ that has a local distortion decomposition with respect to some unknown additive matrix D.

• Output: an approximation of the true tree topology by a forest with as few trees as possible.

Algorithm 1

Algorithm 2

Algorithm 3

• Intuition: an algorithm that receives all the sets L(*), returns which of the L(*) are edge-sharing with one another, and builds the trees T|SL(v).

• For every pair L(v), L(u), the algorithm checks whether they are edge-sharing. It does so by checking whether a certain quartet exists across the two sets:

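Algorithm 3 description (continued)

• If the quartet returned by Algorithm 2 comes out as depicted on the right, then L(u) and L(v) are edge-sharing.

[Figure: quartet configuration, with the two leaf sets labeled L(u) and L(v)]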

Algorithm 3 description (continued)

• Once all edge-sharing checks are done, build the tree T|SL(v) for every SL(v), using a method such as NJ.

Algorithm 4 (Component reconstruction)

INPUT: SL(·) trees of a connected component C of ε(S)

OUTPUT: T|C

Let v1, ..., vr be a perfect elimination order of the leaves of a component C of ε(S)
(by Lemma 1, C is triangulated).

for 1 ≤ i ≤ r do
    Let Xi = SL(vi) ∩ {vi, ..., vr}
    Get ti = T|(Xi ∪ {vi}) by restricting T|SL(vi)
end for

Set Tr = tr

for i = r − 1 down to 1 do
    Ti := strict consensus merger of ti and Ti+1
end for

return T1
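The first step of Algorithm 4 needs a perfect elimination order, which exists exactly when the graph is triangulated (chordal). The sketch below shows one simple way to find such an order by repeatedly removing a simplicial vertex, that is, a vertex whose remaining neighbours are pairwise adjacent. It is an illustration only: the function name and the adjacency-dict input are assumptions, this naive version is not tuned for speed, and the strict consensus merger step is not covered.

```python
def perfect_elimination_order(adj):
    """Return a perfect elimination order of the graph {vertex: set of neighbours},
    or None if the graph is not triangulated (chordal)."""
    remaining = {v: set(nbs) for v, nbs in adj.items()}
    order = []
    while remaining:
        simplicial = None
        for v in sorted(remaining):
            nbs = remaining[v]
            # v is simplicial if its remaining neighbours form a clique.
            if all(b in remaining[a] for a in nbs for b in nbs if a != b):
                simplicial = v
                break
        if simplicial is None:
            return None  # no simplicial vertex left: the graph is not chordal
        order.append(simplicial)
        # Remove the vertex and detach it from its neighbours.
        for nb in remaining.pop(simplicial):
            remaining[nb].discard(simplicial)
    return order


# A triangle 1-2-3 with a pendant vertex 4 attached to 3 (chordal).
CHORDAL = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
# A 4-cycle without a chord (not chordal).
CYCLE = {1: {2, 4}, 2: {1, 3}, 3: {2, 4}, 4: {1, 3}}
```

On the chordal example the function returns the order 1, 2, 3, 4; on the 4-cycle it returns `None` because no vertex is simplicial.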

Proof of Algorithm 4

Suppose that T|SL(v) is accurate for every leaf v of C in ε(S).

Then for every i ≤ r, Ti = T|{vi, ..., vr}.

Proof:

For i = r the claim clearly holds. Assume it holds for i+1, that is, Ti+1 = T|{vi+1, ..., vr}.

L(ti) ∩ L(Ti+1) = Xi, where Xi is the set of leaves of Z, the backbone of the merger of ti and Ti+1.

Proof of Algorithm 4 (continued)

• We show that there is no collision in the merge of ti and Ti+1.

• Suppose there is a collision: e is an edge of Z, and both vi and a subtree T’ of Ti+1 are attached to the edge e.

• Clearly L(T’) is contained in {vi+1, ..., vr} − Xi (because the leaves of Z are Xi, and T’ is not a subtree of Z).

• Suppose the edge e represents a path P with endpoints a,b. Let T0 be the subtree of T that contains all the internal nodes and edges of P, as well as all the subtrees attached to these nodes.

• vi is a leaf of T0, because it is attached to a node on the path P.

• The leaves of T’ are contained in the leaves of T0, because T’ is a subtree attached to a node on the path P.

• The leaves of T0 and Xi are disjoint, because P represents an edge in the graph Z.

• ε(L(T0)) is path-connected (proof to follow).

• Let Q be a path in ε(L(T0)) from L(vi) to a node in L(T’), where x is the first node on the path that lies in L(T’).

• There is a perfect elimination order such that we can remove edges along this path and conclude that the edge (vi,x) exists. This implies that x belongs to SL(vi), and therefore x belongs to Xi, contradicting the fact that L(T’) is disjoint from Xi.

Time complexity

• The complexity of a single run of Algorithm 3 is O(n).

• The complexity of a single run of Algorithm 4 is O(n^2).

• The total time complexity of all runs of Algorithm 3 is O(n^3) (because Algorithm 3 runs n^2 times).

• There are n trees, so Algorithm 4 is called at most n times. Hence the total complexity of all runs of Algorithm 4 is O(n^3).

• Hence the total complexity is O(n^3).

Sequence length

• If the sequence length of each species is O(poly(log(n))), then Algorithm 2 works well almost always.
