Constraint Programming and Biology: RNA secondary structure
Agostino Dovier
Dept. Math and Computer Science, Univ. of Udine, Italy
ACP Summer School in Constraint Programming, Wrocław, September 2012
Agostino Dovier (DIMI, UDINE Univ.) CP and Biology Wrocław, September 2012 1 / 15
RNA secondary structure prediction DNA and RNA
The central dogma
RNA is a sequence of nucleotides (A, C, G, U) that (often) is just an intermediary between DNA and proteins.
The 3D structure of RNA depends largely on interactions between pairs of nucleotides (base pairing).
The secondary structure is the set of its base pairings.
RNA secondary structure prediction Definitions
Mathematically
An RNA sequence $\vec{s} = s_1 s_2 \cdots s_n$ is a string in $\{A, C, G, U\}^*$.
An RNA secondary structure is a (partial) injective function $P \subseteq \{1, \dots, n\}^2$ such that $(i, j) \in P \rightarrow i < j$ (or, alternatively, such that $(i, j) \in P \leftrightarrow (j, i) \in P$).
One might also require from the beginning that $(i, j) \in P$ only if $(s_i, s_j) \in \{(A,U), (U,A), (C,G), (G,C), (U,G), (G,U)\}$.
We are interested in a structure that maximizes the number of pairings (and/or minimizes a more complex energy function).
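The definition above can be checked mechanically. The following is a minimal Python sketch, not part of the original slides: the function name `is_secondary_structure` is hypothetical, positions are 1-based as in the definition, and "injective" is read here as "every position takes part in at most one pair", which is an assumption.

```python
# Allowed base pairs, as listed in the definition above.
CAN_PAIR = {("A", "U"), ("U", "A"), ("C", "G"), ("G", "C"), ("U", "G"), ("G", "U")}

def is_secondary_structure(s, P):
    """Return True if P (a set of 1-based pairs (i, j)) is a valid structure for s.

    Assumption: each position may occur in at most one pair.
    """
    n = len(s)
    used = set()
    for i, j in P:
        if not (1 <= i < j <= n):                # (i, j) in P implies i < j
            return False
        if (s[i - 1], s[j - 1]) not in CAN_PAIR:  # only compatible bases pair
            return False
        if i in used or j in used:                # a position pairs at most once
            return False
        used.update((i, j))
    return True
```

For instance, on the sequence `"GCAU"` the set `{(1, 2), (3, 4)}` passes the check, while `{(1, 3)}` is rejected because G and A cannot pair.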
Spatial constraints (pseudo knot)
If $i < \ell < j$ and $(i, j) \in P$, and ($(\ell, k) \in P$ or $(k, \ell) \in P$), then $i < k < j$.
NO! [Figure: a structure that violates this constraint (a pseudo-knot).]
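As an illustration of the spatial constraint, here is a small Python sketch that tests whether a set of pairs is free of pseudo-knots. The function name `is_pseudoknot_free` is hypothetical, and the check is a direct, unoptimized transcription of the implication stated on the slide.

```python
def is_pseudoknot_free(P):
    """Check: if i < l < j, (i, j) in P and l is paired with k, then i < k < j."""
    partner = {}
    for i, j in P:
        partner[i] = j
        partner[j] = i
    for i, j in P:
        lo, hi = min(i, j), max(i, j)
        for l in range(lo + 1, hi):          # every position strictly inside (lo, hi)
            k = partner.get(l)
            if k is not None and not (lo < k < hi):
                return False                 # l pairs outside the enclosing pair
    return True
```

Nested pairs such as `{(1, 4), (2, 3)}` and side-by-side pairs such as `{(1, 2), (3, 4)}` satisfy the constraint; crossing pairs such as `{(1, 3), (2, 4)}` do not.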
Results
The pseudo-knot constraint is sensible. With it, there are polynomial-time algorithms (mfold: dynamic programming, http://mfold.rna.albany.edu).
Without the pseudo-knot constraint the problem is NP-complete.
Actually, what problem?
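To make the polynomial-time claim concrete, the sketch below maximizes the number of pairings under the pseudo-knot constraint with a simple cubic-time dynamic program. This is not mfold's algorithm (mfold minimizes a thermodynamic energy); it is a swapped-in, simplified pair-counting recurrence, and the name `max_pairs` is hypothetical. No minimum loop length is enforced.

```python
CAN_PAIR = {("A", "U"), ("U", "A"), ("C", "G"), ("G", "C"), ("U", "G"), ("G", "U")}

def max_pairs(s):
    """Maximum number of non-crossing base pairs in s (0-based indices internally)."""
    n = len(s)
    if n == 0:
        return 0
    # best[i][j] = max number of pairs using only positions i..j
    best = [[0] * n for _ in range(n)]
    for span in range(1, n):
        for i in range(n - span):
            j = i + span
            val = best[i + 1][j]                      # case 1: i stays unpaired
            for k in range(i + 1, j + 1):             # case 2: i pairs with k
                if (s[i], s[k]) in CAN_PAIR:
                    inside = best[i + 1][k - 1] if k - 1 >= i + 1 else 0
                    after = best[k + 1][j] if k + 1 <= j else 0
                    val = max(val, 1 + inside + after)
            best[i][j] = val
    return best[0][n - 1]
```

Because position `i` either stays unpaired or pairs with some `k`, and the pairs inside `(i, k)` are solved independently from those after `k`, no crossing pair can ever be produced.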
NP Completeness (Lyngsø and Pedersen, 2000)
Let $\vec{s} = s_1 \cdots s_n$ be an RNA sequence, and $P$ a secondary structure. Then
$$E(\vec{s}, P) = \sum_{(i,j) \in P,\ i < j} E(\vec{s}, i, j, P)$$
where $E(\vec{s}, i, j, P)$ depends on $s_i$ and $s_j$ and, moreover, on the $s_z$ such that $(i+1, z) \in P$ or $(j-1, z) \in P$.
NP Completeness (Lyngsø and Pedersen, 2000)
In the NP-completeness proof they first assume an infinite set of complementary bases (e.g., $(A_1, U_1), (A_2, U_2), (A_3, U_3), \dots$) and define $E$ as follows:
$$E(\vec{s}, i, j, P) = \begin{cases} -1 & \text{if } s_i \text{ and } s_j \text{ are complementary symbols and } \\ & (\forall z \in \{1, \dots, i-1, j+1, \dots, n\})\ (\{(i+1, z), (z, i+1), (j-1, z), (z, j-1)\} \cap P = \emptyset) \\ 0 & \text{otherwise} \end{cases}$$
[Figure: example configurations annotated with $E(\vec{s}, i, j, P) = 0, -1, 0, -1$.]
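The energy function above can be evaluated directly. The following Python sketch is an illustration only: the function names are hypothetical, and symbols of the infinite alphabet are represented as tuples such as `("A", 3)` and `("U", 3)`, which is an assumption about the encoding and not something the slides prescribe.

```python
def complementary(a, b):
    """Symbols are (letter, index) tuples; A_k is complementary to U_k only."""
    return a[1] == b[1] and {a[0], b[0]} == {"A", "U"}

def pair_energy(s, i, j, P):
    """E(s, i, j, P) for a pair (i, j), i < j, with 1-based positions; P is a set."""
    if not complementary(s[i - 1], s[j - 1]):
        return 0
    n = len(s)
    outside = list(range(1, i)) + list(range(j + 1, n + 1))
    for z in outside:
        # i+1 or j-1 paired with a position outside [i, j] cancels the bonus
        if {(i + 1, z), (z, i + 1), (j - 1, z), (z, j - 1)} & P:
            return 0
    return -1

def energy(s, P):
    """E(s, P): sum of the pair energies over all (i, j) in P with i < j."""
    P = set(P)
    return sum(pair_energy(s, i, j, P) for (i, j) in P if i < j)
```

With `s = [("A", 1), ("A", 2), ("U", 2), ("U", 1)]`, the nested structure `{(1, 4), (2, 3)}` gets energy -2, since neither pair has a neighbour paired outside its interval.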
NP Completeness (Lyngsø and Pedersen, 2000)
To prove NP-hardness they start from 3SAT with the further requirement that each literal occurs at most twice (for a variable $X$, you can have $X$ zero, one, or two times and $\neg X$ zero, one, or two times). Proving that this restricted problem is NP-complete is left as an exercise.
For a clause $c_i = \ell_1 \vee \ell_2 \vee \ell_3$ they introduce a gadget:
$$C_i = c_{i,1}\,(\ell_1)_{1/2}\,c_{i,1}\,c_{i,2}\,(\ell_2)_{1/2}\,c_{i,1}\,c_{i,2}\,(\ell_3)_{1/2}\,c_{i,2}$$
(the index is 1 or 2 according to the leftmost/rightmost occurrence of that literal).
For a variable $X_i$ that occurs twice positively and twice negatively, they introduce a gadget (a substring of it in case of fewer occurrences):
$$V_i = v_i\,(X_i)_2\,(X_i)_1\,v_i\,v_i\,(\neg X_i)_2\,(\neg X_i)_1\,v_i$$
The encoding of a formula $c_1 \wedge \cdots \wedge c_m$ on variables $X_1, \dots, X_n$ is
$$C_1 \cdots C_m V_1 \cdots V_n$$
NP Completeness: Main idea [figure]
NP Completeness (Lyngsø and Pedersen, 2000)
The encoding of a formula $\varphi = c_1 \wedge \cdots \wedge c_m$ on variables $X_1, \dots, X_n$ is
$$C_1 \cdots C_m V_1 \cdots V_n$$
They prove that $\varphi$ is satisfiable iff there is a secondary structure with energy $-(3m+n)$ [nice exercise].
Then they complete the proof without the hypothesis of an infinite alphabet (this is a nice read, but too long to explain in this course).
A simple CLP encoding
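Input: $s_1, \dots, s_n$
Variables: Pairs $= [P_1, \dots, P_n]$, where $P_i$ is the position paired with $i$, or 0 if $i$ is unpaired.
Let $S_x = \{ i \in \{1, \dots, n\} \mid s_i = x \}$.
If $s_i = A$, then $\mathrm{dom}(P_i) = \{0\} \cup S_U$.
If $s_i = C$, then $\mathrm{dom}(P_i) = \{0\} \cup S_G$.
If $s_i = G$, then $\mathrm{dom}(P_i) = \{0\} \cup S_C \cup S_U$.
If $s_i = U$, then $\mathrm{dom}(P_i) = \{0\} \cup S_A \cup S_G$.
For $i = 1, \dots, n$: if $P_i > 0$ then $P_{P_i} = i$; if $P_i = 0$ there is no constraint. With $P = P_i$ and $I = i$, this can be stated compactly as:
element(P + 1, [I|Pairs], I)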
Pseudo-knots: if $P_i > 0$ then $(P_{i+1} \in [i+3 .. P_{P_i} - 1]) \vee (P_{i+1} = 0)$.
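The domain definitions and the symmetry constraint of this encoding can be mimicked in plain Python, as a rough sketch and without any constraint solver. The names `domains` and `symmetric` are hypothetical; `symmetric` only checks a complete assignment, following the `element(P + 1, [I|Pairs], I)` formulation with 1-based list positions. The pseudo-knot clause is not covered here.

```python
# For each base, the bases it may pair with (A-U, C-G, G-U).
PARTNERS = {"A": "U", "C": "G", "G": "CU", "U": "AG"}

def domains(s):
    """dom(P_i) = {0} plus the 1-based positions whose base can pair with s_i."""
    return [
        {0} | {j for j, b in enumerate(s, start=1) if b in PARTNERS[a]}
        for a in s
    ]

def symmetric(pairs):
    """Check element(P_i + 1, [i | Pairs], i) for every i (1-based)."""
    for i, p in enumerate(pairs, start=1):
        extended = [i] + list(pairs)   # the list [I | Pairs]
        if extended[p] != i:           # 1-based index p + 1 is 0-based index p
            return False
    return True
```

When `P_i = 0` the lookup hits the prepended `i` and the check holds trivially; when `P_i = k > 0` it reads `P_k` and requires it to equal `i`, which is exactly the intended symmetry.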
A simple CLP encoding
As cost function we want either to maximize contacts or (as done by Dahl-Bavarian, WCB05) to find a solution close to the statistics, namely 35% for AU, 53% for CG, 12% for GU.
Let $NC = n - \#\text{contacts}$.
We therefore minimize a weighted sum of the form
$$c_1 \frac{NC}{n} + c_2 \frac{\#(AU) - 0.35\,(n - NC)}{n} + c_3 \frac{\#(CG) - 0.53\,(n - NC)}{n}$$
($c_1, c_2, c_3$ are constants that can be changed. The denominator $n$ can be omitted for minimization.)
Let us see some executions of RNA_alignment.pl.
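The weighted sum can be evaluated for a given assignment as in the sketch below. Several things here are assumptions rather than facts from the slides: the function name `cost` is hypothetical, `#contacts`, `#(AU)` and `#(CG)` are all counted as numbers of pairs, the default weights are arbitrary, and the deviation terms are kept signed exactly as the formula is written. This is not the code of RNA_alignment.pl.

```python
def cost(s, pairs, c1=1.0, c2=1.0, c3=1.0):
    """Weighted cost of an assignment pairs = [P_1, ..., P_n] (0 means unpaired)."""
    n = len(s)
    # Count each contact once, from its left end (P_i > i).
    contacts = [(i, p) for i, p in enumerate(pairs, start=1) if p > i]
    au = sum(1 for i, j in contacts if {s[i - 1], s[j - 1]} == {"A", "U"})
    cg = sum(1 for i, j in contacts if {s[i - 1], s[j - 1]} == {"C", "G"})
    nc = n - len(contacts)   # NC = n - #contacts
    k = n - nc               # n - NC, i.e. the number of contacts
    return (c1 * nc + c2 * (au - 0.35 * k) + c3 * (cg - 0.53 * k)) / n
```

Changing `c1`, `c2`, `c3` shifts the balance between having many contacts and matching the AU/CG statistics.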
(Some) References
M. Zuker and P. Stiegler. Optimal computer folding of large RNA sequences using thermodynamics and auxiliary information. Nucleic Acids Research, 9(1):133–148, 1981.
R.B. Lyngsø and C.N.S. Pedersen. RNA Pseudoknot Prediction in Energy-Based Models. J. of Computational Biology 7(3/4), 2000.
G. Blin, G. Fertin, I. Rusu, and C. Sinoquet. Extending the hardness of RNA secondary structure comparison. LNCS 4614, pp. 140–151, 2007.
M. Bauer, G.W. Klau, and K. Reinert. Accurate multiple sequence-structure alignment of RNA sequences using combinatorial optimization. BMC Bioinformatics, 8, 2007.
M. Bavarian and V. Dahl. Constraint Based Methods for Biological Sequence Analysis. J. Universal Computer Science 12(11):1500–1520, 2006 (also in WCB 05).
A. Dal Palù, M. Möhl, S. Will. A Propagator for Maximum Weight String Alignment with Arbitrary Pairwise Dependencies. CP 2010: 167–175 (also in WCB 10).