Post on 20-Jan-2016

Page 1: Approximate Labelled Subtree Homeomorphism

Approximate Labelled Subtree Homeomorphism

Based on:

“Approximate Labelled Subtree Homeomorphism”, R. Y. Pinter, O. Rokhlenko, D. Tsur, M. Ziv-Ukelson

“Alignment of Metabolic Pathways”, R. Y. Pinter, O. Rokhlenko, E. Yeger-Lotem, M. Ziv-Ukelson

Page 2: Approximate Labelled Subtree Homeomorphism

The general Idea

Biological Problem

Converting it into a computer science problem

Finding the solution

Translating the solution back into biological terms

Page 3: Approximate Labelled Subtree Homeomorphism

Metabolism

Page 4: Approximate Labelled Subtree Homeomorphism

Signal transduction

[Figure: signal transduction diagram. Node labels: Ag Stimuli, Thnp, IL-12, IL-12R, Stat 4, T-Bet, Th1, IL-2, TNF-, IFN-, Proliferation]

Page 5: Approximate Labelled Subtree Homeomorphism

Why pathways?

Metabolic and regulatory pathways have biological importance.

These pathways are evolutionarily conserved.

Page 6: Approximate Labelled Subtree Homeomorphism

What do we want to do?

Compare one metabolic pathway of a certain organism against the same metabolic pathway in other organisms.

Compare a metabolic pathway against other metabolic pathways in the same organism.

Page 7: Approximate Labelled Subtree Homeomorphism

How do we do (it)?

Page 8: Approximate Labelled Subtree Homeomorphism

The subtree homeomorphism problem:

Given a pattern tree P and a text tree T, find a subtree of T which is isomorphic to P, up to the deletion of degree-2 nodes, or decide that there is no such subtree.

Degree-2 nodes can be deleted from the text tree.

Page 9: Approximate Labelled Subtree Homeomorphism

Graph homeomorphism

[Figure: a text graph and a pattern graph with colored nodes]

Page 10: Approximate Labelled Subtree Homeomorphism

Graph homeomorphism

[Figure: text and pattern]

Page 11: Approximate Labelled Subtree Homeomorphism

Graph homeomorphism

[Figure: text and pattern, compared by both labels (similarity) and topology]

Page 12: Approximate Labelled Subtree Homeomorphism

Back to 2nd semester…

An unrooted tree is an undirected, acyclic, connected graph $T = (V_T, E_T)$.

A rooted tree is a triple $T_r = (V_T, E_T, r)$, where $(V_T, E_T)$ is an unrooted tree and $r$ is some vertex in $V_T$, called the root. The root determines the direction of all the edges in the graph.

A multi-source tree is an acyclic, directed graph whose underlying undirected graph is a tree.

Page 13: Approximate Labelled Subtree Homeomorphism

Back to 2nd semester…

A tree is said to be ordered if the relative order of the subtrees at each node is fixed. Otherwise the tree is unordered.

[Figure: an example marked as a problem for “ordered” trees]

Page 14: Approximate Labelled Subtree Homeomorphism

What are we allowed to do?

We take into account both label similarity and topology.

We are permitted to delete vertices from the text tree.

We are NOT permitted to delete vertices from the pattern tree.

Page 15: Approximate Labelled Subtree Homeomorphism

What we’re “gonna” see today (m and n are the numbers of vertices in the pattern and in the text):

Rooted unordered: $O(m^2 n + mn \log n)$

Unrooted unordered: $O(m^2 n + mn \log n)$

Directed multi-source unordered: $O(m^2 n + mn \log n)$

Rooted ordered: $O(mn)$

Page 16: Approximate Labelled Subtree Homeomorphism

Some definitions:

Let Δ denote a predefined node-to-node similarity score table.

Let D denote a predefined score for deleting a node from a tree (usually a penalty).

A mapping M from T1 to T2 is a partial one-to-one map from the nodes of T1 to the nodes of T2 that preserves the ancestor relations of the nodes.

Page 17: Approximate Labelled Subtree Homeomorphism

Our problem:

Let M be a mapping from T1 to T2. The Labelled Subtree Homeomorphic Similarity Score of M[T1,T2] is:

$LSH(M[T_1,T_2]) = D \cdot (|T_1| - |T_2|) + \sum_{(u,v) \in M} \Delta[u,v]$

Given two undirected labelled trees P and T, we want to find a mapping M and a subtree t of T such that LSH(M[t,P]) is maximal.
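The score of one given mapping can be evaluated directly from the formula. The following is a minimal sketch, not the authors' code: the function name `lsh_score`, the dictionary layout of Δ, and the toy values are assumptions made for illustration.

```python
def lsh_score(mapping, delta, deletion_score, size_t1, size_t2):
    """Score a mapping M from T1 to T2 as D * (|T1| - |T2|) + sum of delta[u, v].

    mapping: iterable of (u, v) pairs with u in T1 and v in T2.
    delta: dict keyed by (u, v) holding the node-to-node similarity.
    deletion_score: D, the score for deleting one node (usually negative).
    """
    similarity = sum(delta[(u, v)] for u, v in mapping)
    return deletion_score * (size_t1 - size_t2) + similarity


# Toy example (hypothetical values): a 3-node text subtree mapped onto a
# 2-node pattern, so exactly one text node is deleted at a penalty of -1.
delta = {("w", "b"): 3, ("u", "a"): 10}
score = lsh_score([("w", "b"), ("u", "a")], delta, -1, 3, 2)
```

Here `score` is 3 + 10 - 1 = 12. Finding the mapping and the subtree that maximize this value is the actual problem; the sketch only scores a mapping that is already known.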

Page 18: Approximate Labelled Subtree Homeomorphism

Scoring

[Figure: text and pattern trees annotated with per-node scores (-1, +2, -2, …) and three example alignments with scores 2, 2 and 5]

Page 19: Approximate Labelled Subtree Homeomorphism

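Dynamic programming

[Figure: pattern P rooted at u with children x1, x2; text T rooted at v with children y1, y2, y3]

Score table (rows: nodes of T, columns: nodes of P):

        x1     x2     …     u     …
y1      w11    w12    …           w1m
y2      w21    w22    …           w2m
y3      w31    w32    …           w3m
v       wn1    wn2    …           wnm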

Page 20: Approximate Labelled Subtree Homeomorphism

RScore[v,u] is the maximum of two terms:

The node-to-node similarity value Δ[v,u] plus the sum of the weights of the matched edges in the maximal assignment over G. This term is only computed if c(u) ≤ c(v) (otherwise: -∞).

The weight RScore[yi,u] for the comparison of u and the best-scoring child yi of v, updated with the penalty for deleting v.

c(u) is the number of children of u.
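The two terms can be combined into a single recurrence. The block below is a sketch in the notation of these slides, under the following assumptions: $x_1,\ldots,x_{c(u)}$ are the children of the pattern node $u$, $y_1,\ldots,y_{c(v)}$ are the children of the text node $v$, $\delta$ is the deletion penalty (usually negative), and $M$ ranges over the matchings in $G$ that match every child of $u$.

```latex
RScore[v,u] = \max
\begin{cases}
  \Delta[v,u] + \max_{M} \sum_{(x_i, y_j) \in M} RScore[y_j, x_i]
      & \text{if } c(u) \le c(v), \text{ otherwise } -\infty \\[4pt]
  \max_{1 \le j \le c(v)} RScore[y_j, u] + \delta
      & \text{(delete } v \text{ and continue with its best child)}
\end{cases}
```

When $u$ and $v$ are both leaves, the sum in the first case is empty and the second case is $-\infty$, so the recurrence reduces to $RScore[v,u] = \Delta[v,u]$.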

Page 21: Approximate Labelled Subtree Homeomorphism

RScore[v,u] - example

[Figure: pattern with nodes a, b; text with nodes u, v, w]

Score matrix Δ (deletion = -1):

       a      b
u     10     10
v      5     -2
w      3      3

RScore:

       a                b
u     10               -∞
v      9 (deletion)     8
w      8 (deletion)    12

Page 22: Approximate Labelled Subtree Homeomorphism

The assignment problem

Let G = (V = X ∪ Y, E) be a bipartite graph with weights w(x,y) for all edges. The assignment problem is to compute a matching M (a list of monogamous pairs) such that:

The size of M is maximal among all the matchings.

Among all the matchings of maximal size, the sum of the weights is maximal.
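For very small graphs the assignment problem can be solved by simply trying every option. The following brute-force sketch is an illustration only, not one of the algorithms used in the talk; it assumes a complete bipartite graph given as a weight matrix with |X| ≤ |Y|, and the name `best_assignment` is made up for this example.

```python
from itertools import permutations


def best_assignment(weights):
    """Brute-force assignment on a complete bipartite graph with |X| <= |Y|.

    weights[i][j] = w(x_i, y_j). Returns (total_weight, [(i, j), ...]) for a
    matching of maximal size (every x_i is matched) with maximal total weight.
    """
    k = len(weights)
    l = len(weights[0]) if k else 0
    best_total, best_pairs = float("-inf"), []
    # Each ordered choice of k distinct nodes of Y is one maximal-size matching.
    for ys in permutations(range(l), k):
        total = sum(weights[i][j] for i, j in enumerate(ys))
        if total > best_total:
            best_total, best_pairs = total, list(enumerate(ys))
    return best_total, best_pairs


total, pairs = best_assignment([[3, 1, 2],
                                [2, 4, 6]])
```

In this toy instance x_0 is matched to y_0 and x_1 to y_2, for a total weight of 9. The running time is exponential, which is why a polynomial-time algorithm is needed in practice.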

Page 23: Approximate Labelled Subtree Homeomorphism

Solving the assignment problem

Reduction from the assignment problem to the min-cost max-flow problem. We’ll construct G’, which contains G = (V,E) with the following changes:

Two more vertices: s and t.

Edges from s to X and from Y to t, with w(s,x) = 0 and w(y,t) = 0.

The cost of each of the other edges (those in E) is –w(x,y).

The capacity of all edges is 1.

What is min-cost max-flow? Among all the maximal flows, we’ll choose the cheapest.
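The reduction can be sketched in code as follows. This is a minimal illustration that builds G’ exactly as described and then uses successive cheapest augmenting paths found with Bellman-Ford, which is a simple textbook method and not the faster algorithms cited in the talk. The function name and the weight-matrix input (a complete bipartite graph) are assumptions.

```python
def assignment_by_min_cost_flow(weights):
    """Max-weight assignment on a complete bipartite graph via min-cost max-flow.

    weights[i][j] = w(x_i, y_j). Returns (total_weight, matched (i, j) pairs).
    """
    k = len(weights)
    l = len(weights[0]) if k else 0
    n = k + l + 2
    s, t = 0, n - 1                      # node ids: s, x_1..x_k, y_1..y_l, t
    edges = []                           # edge e is [target, capacity, cost]; e ^ 1 is its reverse
    adj = [[] for _ in range(n)]

    def add_edge(a, b, cost):
        adj[a].append(len(edges))
        edges.append([b, 1, cost])       # forward edge, capacity 1
        adj[b].append(len(edges))
        edges.append([a, 0, -cost])      # residual edge, capacity 0

    for i in range(k):
        add_edge(s, 1 + i, 0)                            # w(s, x) = 0
        for j in range(l):
            add_edge(1 + i, 1 + k + j, -weights[i][j])   # cost = -w(x, y)
    for j in range(l):
        add_edge(1 + k + j, t, 0)                        # w(y, t) = 0

    while True:
        # Bellman-Ford finds the cheapest augmenting path in the residual graph.
        dist = [float("inf")] * n
        via = [-1] * n
        dist[s] = 0
        for _ in range(n - 1):
            for a in range(n):
                if dist[a] == float("inf"):
                    continue
                for e in adj[a]:
                    b, cap, cost = edges[e]
                    if cap > 0 and dist[a] + cost < dist[b]:
                        dist[b], via[b] = dist[a] + cost, e
        if dist[t] == float("inf"):
            break                        # no augmenting path left: the flow is maximal
        node = t
        while node != s:                 # push one unit of flow along the path
            e = via[node]
            edges[e][1] -= 1
            edges[e ^ 1][1] += 1
            node = edges[e ^ 1][0]

    # A saturated forward edge x_i -> y_j means that x_i is matched to y_j.
    matching = [(i, edges[e][0] - 1 - k)
                for i in range(k) for e in adj[1 + i]
                if e % 2 == 0 and edges[e][1] == 0]
    return sum(weights[i][j] for i, j in matching), matching
```

Because the edge costs are the negated weights, the cheapest maximal flow corresponds to the heaviest matching of maximal size.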

Page 24: Approximate Labelled Subtree Homeomorphism

From assignment to matching

[Figure: the children x1, x2 of u and y1, y2, y3 of v drawn as a bipartite graph, with a source s and a sink t added]

Page 25: Approximate Labelled Subtree Homeomorphism

Time complexity analysis

Edmonds and Karp’s algorithm: $O(EV \log V)$

Fredman and Tarjan: $O(VE + V^2 \log V)$ (independent of the edge costs)

Gabow and Tarjan: $O(\sqrt{V} E \log(VC))$, where the input costs are integers in the range $[-C, \ldots, C]$ (the similarity assumption)

Page 26: Approximate Labelled Subtree Homeomorphism

Reminder…

Page 27: Approximate Labelled Subtree Homeomorphism

What did we have so far?

• Motivation
• “Advanced” homeomorphism: labels and topology
• Scoring and deletion
• Dynamic programming
• Matching
• Questions?

Page 28: Approximate Labelled Subtree Homeomorphism

The algorithm for rooted unordered trees:

Input: Rooted trees T = (VT,ET,r) and P = (VP,EP,r’).

Output: The root of the subtree t of T which has the highest similarity score to P (and is homeomorphic to P).

Page 29: Approximate Labelled Subtree Homeomorphism

for each node u of P in postorder do
    for each node v of T in postorder do
        if u is a leaf then
            if v is a leaf then
                RScore(v,u) = Δ[v,u]                  (node-to-node score)
            else
                RScore(v,u) = ComputeScores(v,u)      (dynamic programming)
            end if
        else
            if Level(u) > Level(v) then
                RScore(v,u) = -∞                      (would require deleting from the pattern)
            else
                RScore(v,u) = ComputeScores(v,u)
            end if
        end if
    end for
end for

Page 30: Approximate Labelled Subtree Homeomorphism

Procedure ComputeScores(v,u)

Let k denote the out-degree of node u and l denote the out-degree of node v.

if k > l then
    AssignmentScore(G) = -∞
else
    Construct a bipartite graph G with node bipartition X and Y such that:
        X is the set of children {x1,…,xk} of u,
        Y is the set of children {y1,…,yl} of v,
        node xi ∈ X is connected to node yj ∈ Y via an edge whose weight w(xi,yj) is set to RScore(yj,xi).
    AssignmentScore(G) = max over matchings M of ∑ (i,j) ∈ M RScore(yj,xi)
end if

Find, among all children of v, the node whose ALSH score with u is highest:
    BestChild(v,u) = max j=1 to l RScore(yj,u)

return max {Δ[v,u] + AssignmentScore(G), BestChild(v,u) + δ}    (δ is the deletion penalty)
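The main loop and ComputeScores can be put together in a short runnable sketch. It is a simplified illustration, not the authors' implementation: the assignment step uses brute force over permutations instead of a min-cost flow algorithm, so the stated time bounds do not apply, and the explicit Level test is omitted because those entries come out as -∞ anyway. Trees are given as dictionaries from a node to its list of children. The example values follow the RScore example slide; the tree shapes (text chain w → v → u, pattern b → a) are an assumption that is consistent with the numbers on that slide.

```python
from itertools import permutations

NEG_INF = float("-inf")


def postorder(children, root):
    """Return the nodes of a rooted tree with every child before its parent."""
    order, stack = [], [root]
    while stack:
        node = stack.pop()
        order.append(node)
        stack.extend(children.get(node, []))
    return order[::-1]


def assignment_score(xs, ys, rscore):
    """Best total RScore when every child in xs gets a distinct child in ys."""
    if len(xs) > len(ys):
        return NEG_INF
    return max(sum(rscore[(y, x)] for x, y in zip(xs, chosen))
               for chosen in permutations(ys, len(xs)))


def rooted_alsh(t_children, t_root, p_children, p_root, delta, deletion):
    """Return (best root v in T, its score against P, the full RScore table)."""
    rscore = {}
    t_order = postorder(t_children, t_root)
    for u in postorder(p_children, p_root):
        xs = p_children.get(u, [])
        for v in t_order:
            ys = t_children.get(v, [])
            # Term 1: match v with u and assign the children of u to children of v.
            matched = delta[(v, u)] + assignment_score(xs, ys, rscore)
            # Term 2: delete v and continue with its best-scoring child.
            deleted = max((rscore[(y, u)] for y in ys), default=NEG_INF) + deletion
            rscore[(v, u)] = max(matched, deleted)
    best = max(t_order, key=lambda v: rscore[(v, p_root)])
    return best, rscore[(best, p_root)], rscore


delta = {("u", "a"): 10, ("u", "b"): 10, ("v", "a"): 5,
         ("v", "b"): -2, ("w", "a"): 3, ("w", "b"): 3}
best, score, table = rooted_alsh({"w": ["v"], "v": ["u"]}, "w",
                                 {"b": ["a"]}, "b", delta, -1)
```

With these inputs the best root is w with score 12: b is matched to w, v is deleted at a cost of -1, and a is matched to u.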

Page 31: Approximate Labelled Subtree Homeomorphism

Time complexity analysis

Observation 1:

$\sum_{u=1}^{m} c(u) = m - 1$, where m is the number of vertices in the pattern.

$\sum_{v=1}^{n} c(v) = n - 1$, where n is the number of vertices in the text.

Page 32: Approximate Labelled Subtree Homeomorphism

Time complexity analysis

The weighted assignment is computed once for each pair u, v with u ∈ P, v ∈ T.

In the bipartite graph there are c(v) + c(u) nodes and c(v)c(u) edges.

Based on Fredman and Tarjan, the time complexity is:

$O\left(\sum_{u=1}^{m} \sum_{v=1}^{n} \left(c(u)^2 c(v) + c(u) c(v) \log c(v)\right)\right)$

= (Observation 1)

$O\left(\sum_{u=1}^{m} \left(c(u)^2 n + c(u)\, n \log n\right)\right)$

= (Observation 1)

$O(m^2 n + mn \log n)$

Page 33: Approximate Labelled Subtree Homeomorphism
Page 34: Approximate Labelled Subtree Homeomorphism

Unrooted unordered trees:

The problem: each vertex in both the text tree and the pattern tree can be the root.

The naïve solution: choose an arbitrary node r of T to get a rooted tree. Next, for each u ∈ P, compute the rooted ALSH between Pu (P rooted at u) and Tr (T rooted at r).

Time complexity: $O(m^3 n + m^2 n \log n)$

Page 35: Approximate Labelled Subtree Homeomorphism

2nd try:

Select an arbitrary node r in T as the root.

For each internal node in T (in postorder) and for each node in P, solve an “improved” matching problem.


Page 38: Approximate Labelled Subtree Homeomorphism

General idea for keeping the time complexity

Find the best match between the neighbors $\{x_1,\ldots,x_{d(u)}\}$ of u∈P and the children $\{y_1,\ldots,y_{c(v)}\}$ of v∈T.

After computing the best match and removing a node xi (which acts as the parent of u), there is a way to find the optimal matching between $\{x_1,\ldots,x_{d(u)}\} \setminus \{x_i\}$ and $\{y_1,\ldots,y_{c(v)}\}$ in $O(d(u)c(v) + c(v) \log c(v))$.

The total time complexity for computing all assignments between v and u: $O(d(u)^2 c(v) + d(u)c(v) \log c(v))$

Page 39: Approximate Labelled Subtree Homeomorphism

Time complexity

Observation 2: The sum of vertex degrees in an unrooted tree P is

$\sum_{u=1}^{m} d(u) = 2m - 2$

(We studied this in Combinatorics.)
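Observation 2 holds because a tree on m vertices has m - 1 edges and every edge contributes to the degree of two vertices. A quick numerical check on a small example (the helper `degree_sum` and the edge list are made up for illustration):

```python
def degree_sum(edges):
    """Return (sum of vertex degrees, number of vertices) for an edge list."""
    degree = {}
    for a, b in edges:
        degree[a] = degree.get(a, 0) + 1
        degree[b] = degree.get(b, 0) + 1
    return sum(degree.values()), len(degree)


# An unrooted tree with m = 5 vertices and m - 1 = 4 edges.
total, m = degree_sum([(1, 2), (1, 3), (3, 4), (3, 5)])
```

Here `total` is 8, which equals 2m - 2 for m = 5.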

Page 40: Approximate Labelled Subtree Homeomorphism

Time complexity – continued…

$O\left(\sum_{u=1}^{m} \sum_{v=1}^{n} \left(d(u)^2 c(v) + d(u) c(v) \log c(v)\right)\right)$

= (Observation 1)

$O\left(\sum_{u=1}^{m} \left(d(u)^2 n + d(u)\, n \log n\right)\right)$

= (Observation 2)

$O(m^2 n + mn \log n)$

Page 41: Approximate Labelled Subtree Homeomorphism

Up the tree…

For each vertex v∈T, u∈P and xi∈neighbors(u), UScore[v,u,xi] is the maximal LSH score between the subtree $p_{u,x_i}$ of P (the subtree of P whose root is u and whose root’s parent is xi) and a corresponding homeomorphic subtree of $t_{v,r}$, if one exists.

Otherwise, UScore[v,u,xi] is set to -∞.

Page 42: Approximate Labelled Subtree Homeomorphism

UScore[v,u,xi] is the maximum of two terms:

The node-to-node similarity value Δ[v,u] plus the sum of the weights of the matched edges in the maximal assignment over Gi. This term is only computed if d(u) - 1 ≤ c(v) (otherwise: -∞).

The weight UScore[yj,u,xi] for the comparison of u and the best-scoring child yj of v, updated with the penalty for deleting v.

d(u) is the degree of u.

Page 43: Approximate Labelled Subtree Homeomorphism

And if ‘u’ is the root…

We have to compute an additional entry, UScore[v,u,Φ]. This entry represents the fact that u might be the root of P.

The root of P will be the node u such that UScore[v,u,Φ] is maximal.

Page 44: Approximate Labelled Subtree Homeomorphism

Multi-source graphs

DAG = Directed Acyclic Graph.

A multi-source tree is a DAG whose underlying structure is an unrooted, unordered tree.

Page 45: Approximate Labelled Subtree Homeomorphism

Multi-source graph - example

[Figure: pattern (nodes r’, u) and text (nodes r, v); here UScore[v,u,r’] = -∞]

Page 46: Approximate Labelled Subtree Homeomorphism

Multi-source graphs & alignment

We’ll use the algorithm for unrooted unordered trees.

We’ll filter out subtree alignments that map together edges of conflicting direction.

We’ll split the bipartite graph G = (X ∪ Y, E) into two different graphs: one corresponding to the matching of incoming-edge neighbors of u and v, and the other to the matching of outgoing-edge neighbors.

Page 47: Approximate Labelled Subtree Homeomorphism

ALSH for ordered rooted trees

Page 48: Approximate Labelled Subtree Homeomorphism

Solving ALSH for ordered rooted trees

This is the maximum weighted matching problem on ordered bipartite graphs, where no edges are allowed to cross.

Given a pattern string X, a source string Y, and a character-to-character similarity table $\Delta[\Sigma_X, \Sigma_Y]$, find among all |X|-sized subsequences of Y the subsequence Q which is most similar to X, that is, the one for which the sum $\sum_{i=1}^{|X|} \Delta[Q_i, X_i]$ is maximized.
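The string formulation has a standard dynamic programming solution. The sketch below is one way to write it, assuming Δ is given as a dictionary keyed by (character of Y, character of X); the function name and the toy values are hypothetical. Skipping a character of Y costs nothing, while a character of X can never be skipped.

```python
NEG_INF = float("-inf")


def ordered_assignment(xs, ys, delta):
    """Best sum of delta[(q_i, x_i)] over all len(xs)-sized subsequences q of ys.

    best[i][j] is the best score for matching xs[:i] inside ys[:j].
    """
    k, l = len(xs), len(ys)
    # Row 0: an empty pattern prefix scores 0 against any prefix of ys.
    # Column 0 (i > 0): a non-empty pattern prefix cannot fit an empty source.
    best = [[0] * (l + 1)] + [[NEG_INF] * (l + 1) for _ in range(k)]
    for i in range(1, k + 1):
        for j in range(1, l + 1):
            skip = best[i][j - 1]                                   # leave ys[j-1] unmatched
            use = best[i - 1][j - 1] + delta[(ys[j - 1], xs[i - 1])]  # match ys[j-1] with xs[i-1]
            best[i][j] = max(skip, use)
    return best[k][l]


delta = {("y1", "x1"): 1, ("y2", "x1"): 5, ("y3", "x1"): 2,
         ("y1", "x2"): 4, ("y2", "x2"): 3, ("y3", "x2"): 6}
score = ordered_assignment(["x1", "x2"], ["y1", "y2", "y3"], delta)
```

The best non-crossing choice here matches x1 with y2 and x2 with y3, so `score` is 11. The two nested loops take O(|X||Y|) time.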

Page 49: Approximate Labelled Subtree Homeomorphism

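String alignment

[Figure: dynamic programming table for the pattern children x1, x2 and the text children y1, y2, y3. The boundary cells for the text children are initialized to 0 (this is NOT the deletion penalty) and the boundary cells for the pattern children to -∞ (we can’t delete nodes from the pattern tree).]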

Page 50: Approximate Labelled Subtree Homeomorphism

Time complexity for rooted ordered

For each node pair (v∈T, u∈P), the time complexity of the assignment is O(c(u)c(v)) (dynamic programming).

$\sum_{u=1}^{m} \sum_{v=1}^{n} O(c(v)\, c(u)) = \sum_{v=1}^{n} O(m\, c(v)) = O(mn)$ (by Observation 1)

Page 51: Approximate Labelled Subtree Homeomorphism

The tool: MetaPathwayHunter

Page 52: Approximate Labelled Subtree Homeomorphism

What can it do?

A pathway against a pathway: the 5 best alignments.

A pathway against a directory of pathways: the 5 best alignments for each pathway in the directory (sorted by score).

Page 53: Approximate Labelled Subtree Homeomorphism

Two extreme cases of deletion penalty

Assuming the similarity score is non-positive (≤ 0)

• Deletion penalty 0: always worth deleting

• Deletion penalty -∞ : never worth deleting

Page 54: Approximate Labelled Subtree Homeomorphism

Deletion penalty 0

What does it mean?

Page 55: Approximate Labelled Subtree Homeomorphism

Deletion penalty -∞

What does it mean?

Page 56: Approximate Labelled Subtree Homeomorphism

About the similarity score

MetaPathwayHunter uses the EC (Enzyme Commission) classification.

An EC number consists of four numbers that categorize the type of the catalyzed chemical reaction (e.g. 1.2.5.23).

For an enzyme class h, C(h) denotes the number of enzymes whose classes are included under h.

For two enzymes ei and ej, if their lowest common upper class is hij, then the similarity between them is $-\log_2 C(h_{ij})$.

Page 57: Approximate Labelled Subtree Homeomorphism

Similarity score - example

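$\Delta[1.1.2.1, 1.1.2.14] = -\log_2 C(1.1.2.-) = -\log_2(14) = -3.81$

$\Delta[1.1.2.1, 1.1.3.1] = -\log_2 C(1.1.-.-) = -\log_2(20) = -4.32$

[Figure: EC hierarchy tree. The class 1.1.-.- has the subclasses 1.1.2.- (containing the enzymes 1.1.2.1 and 1.1.2.14) and 1.1.3.- (containing the enzymes 1.1.3.6 and 1.1.3.1). The class nodes are not enzymes.]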
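The two example values can be reproduced with a few lines of code. This is a sketch only: the function names are made up, and the class sizes (14 and 20) are simply the counts quoted in the example rather than values looked up in the EC database. It covers only pairs whose lowest common class appears in the `class_sizes` table.

```python
import math


def lowest_common_class(ec_a, ec_b):
    """Return the lowest common upper class of two EC numbers, e.g. '1.1.2.-'."""
    common = []
    for part_a, part_b in zip(ec_a.split("."), ec_b.split(".")):
        if part_a != part_b:
            break
        common.append(part_a)
    # Pad with '-' so that the result always has four fields.
    return ".".join(common + ["-"] * (4 - len(common)))


def similarity(ec_a, ec_b, class_sizes):
    """Similarity of two enzymes: -log2 of the size of their lowest common class."""
    return -math.log2(class_sizes[lowest_common_class(ec_a, ec_b)])


class_sizes = {"1.1.2.-": 14, "1.1.-.-": 20}  # counts taken from the example
close = similarity("1.1.2.1", "1.1.2.14", class_sizes)
far = similarity("1.1.2.1", "1.1.3.1", class_sizes)
```

Rounded to two decimals, `close` is -3.81 and `far` is -4.32: the larger the lowest common class, the lower the similarity.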

Page 58: Approximate Labelled Subtree Homeomorphism

Is the result statistically significant?

Statistical significance is based on the p-value.

The p-value of an alignment (with score s) is calculated by aligning the same query against 100 random pathway graphs and counting the fraction of graphs containing an alignment that receives a score of s or higher.

A random pathway is a graph containing the same set of nodes and the same number of edges for each node, with the nodes randomly switched.
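The counting step can be sketched as below. Generating the random pathway graphs and aligning the query against them is not shown; `random_scores` is a hypothetical list holding the best alignment score obtained on each random graph.

```python
def empirical_p_value(score, random_scores):
    """Fraction of random pathway graphs whose alignment scores `score` or higher."""
    hits = sum(1 for s in random_scores if s >= score)
    return hits / len(random_scores)


# Toy example with 4 random graphs instead of 100 (made-up scores).
p = empirical_p_value(-4.28, [-10, -4.0, -6, -4.28])
```

Two of the four made-up scores are at least -4.28, so `p` is 0.5 in this toy example.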

Page 59: Approximate Labelled Subtree Homeomorphism

Inter species alignment

113 E. coli pathways and 151 S. cerevisiae pathways.

610 pathway pairs had at least one statistically significant alignment between them.

63% of the E. coli pathways and 66% of the S. cerevisiae pathways had at least one statistically significantly aligned pair-mate from the other species.

Page 60: Approximate Labelled Subtree Homeomorphism

Inter species alignment

E. coli & S. cerevisiae: phenylalanine, tyrosine and tryptophan pathway (score: -4.28), from [1]

Page 61: Approximate Labelled Subtree Homeomorphism

Inter species alignment

What is the single mismatch?

In E. coli: the enzyme uses NAD+

In S. cerevisiae: the enzyme uses NADP+

These two enzymes don’t have a significant sequence similarity. == Two functional orthologs.

Page 62: Approximate Labelled Subtree Homeomorphism

A meta-pathway query

• E. coli allantoin degradation (score = 0)

• S. cerevisiae ureide degradation (score = 0)

Page 63: Approximate Labelled Subtree Homeomorphism

Summary

• Biological motivation
• Homeomorphism
• Scoring and deleting
• From assignment to matching
• The algorithm for rooted unordered trees
• How to keep the time complexity for unrooted unordered trees

Page 64: Approximate Labelled Subtree Homeomorphism

Summary

• How to deal with Multi-source graphs

• The algorithm for rooted ordered trees

• The MetaPathwayHunter and its properties

• Results of alignments

Page 65: Approximate Labelled Subtree Homeomorphism

THE

END