
Constraint Programming and Biology: RNA secondary structure

Agostino Dovier

Dept. Math and Computer Science, Univ. of Udine, Italy

ACP Summer School in Constraint Programming, Wrocław, September 2012

Agostino Dovier (DIMI, UDINE Univ.) CP and Biology Wrocław, September 2012 1 / 15

RNA secondary structure prediction DNA and RNA

The central dogma

RNA is a sequence of nucleotides (A, C, G, U) that (often) is just an intermediary between DNA and proteins.
The 3D structure of RNA depends largely on interactions between pairs of nucleotides (base pairing).
The secondary structure is the set of its base pairings.


RNA secondary structure prediction Definitions

Mathematically

An RNA sequence $\vec{s} = s_1 s_2 \cdots s_n$ is a string in $\{A, C, G, U\}^*$.

An RNA secondary structure is a (partial) injective function $P \subseteq \{1, \ldots, n\}^2$ such that $(i, j) \in P \rightarrow i < j$ (or, alternatively, such that $(i, j) \in P \leftrightarrow (j, i) \in P$).

One might also require from the beginning that $(i, j) \in P$ only if $(s_i, s_j) \in \{(A,U), (U,A), (C,G), (G,C), (U,G), (G,U)\}$.

We are interested in a structure that maximizes the number of pairings (and/or minimizes a more complex energy function).
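The definition above can be sketched as a small validity check. This is only an illustration under stated assumptions: the function name `is_secondary_structure` is hypothetical, positions are 1-based as on the slide, and "partial injective function" is read as "each position takes part in at most one pair".

```python
# Allowed pairs from the slide: Watson-Crick (A-U, C-G) plus wobble (G-U).
ALLOWED = {("A", "U"), ("U", "A"), ("C", "G"), ("G", "C"), ("U", "G"), ("G", "U")}


def is_secondary_structure(s, pairs):
    """Check that `pairs` (1-based (i, j) tuples) is a valid pairing for `s`."""
    n = len(s)
    used = set()
    for i, j in pairs:
        if not (1 <= i < j <= n):
            return False  # out of range, or not ordered as i < j
        if i in used or j in used:
            return False  # assumption: a position is in at most one pair
        if (s[i - 1], s[j - 1]) not in ALLOWED:
            return False  # these two bases cannot pair
        used.update((i, j))
    return True
```

For example, on the sequence `"GCGC"` the set `[(1, 4), (2, 3)]` passes the check, while `[(1, 2), (1, 4)]` is rejected because position 1 is used twice.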


Spatial constraints (pseudo knot)

If $i < \ell < j$ and $(i, j) \in P$, and ($(\ell, k) \in P$ or $(k, \ell) \in P$), then $i < k < j$.


[Figure: a pseudo-knot (crossing base pairs), which violates the constraint. NO!]
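The spatial constraint says that pairs must be nested or disjoint, never crossing. A minimal sketch of that test follows; the name `is_pseudoknot_free` is hypothetical, and the code assumes each position occurs in at most one pair.

```python
def is_pseudoknot_free(pairs):
    """Return True if no two pairs (i, j) and (k, l) cross, i.e. i < k < j < l."""
    normalized = [tuple(sorted(p)) for p in pairs]
    for i, j in normalized:
        for k, l in normalized:
            # (k, l) starts strictly inside (i, j) but ends outside it: a crossing.
            if i < k < j < l:
                return False
    return True
```

The double loop visits every ordered combination of two pairs, so both orientations of a crossing are caught. It is quadratic in the number of pairs, which is fine for an illustration.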


Results

The pseudo-knot constraint is sensible. With it, there are polynomial-time algorithms (e.g. mfold, which uses dynamic programming: http://mfold.rna.albany.edu).
Without the pseudo-knot constraint the problem is NP-complete.

Actually, what problem?
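For the simplest version of the problem, maximizing the number of pairings under the pseudo-knot constraint, the polynomial-time claim can be illustrated with a Nussinov-style dynamic program. This is not mfold, which minimizes a thermodynamic energy. It is a simpler relative, sketched here with a hypothetical function name and with no minimum loop length assumed.

```python
ALLOWED = {("A", "U"), ("U", "A"), ("C", "G"), ("G", "C"), ("U", "G"), ("G", "U")}


def max_pairs(s):
    """Maximum number of nested (pseudo-knot-free) base pairs in `s`, in O(n^3)."""
    n = len(s)
    if n == 0:
        return 0
    # best[i][j] = max number of pairs using only positions i..j (0-based, inclusive).
    best = [[0] * n for _ in range(n)]
    for length in range(2, n + 1):
        for i in range(0, n - length + 1):
            j = i + length - 1
            # Case 1: position j stays unpaired.
            value = best[i][j - 1]
            # Case 2: position j pairs with some k in i..j-1; the two sides
            # of k are then independent subproblems (no crossing allowed).
            for k in range(i, j):
                if (s[k], s[j]) in ALLOWED:
                    left = best[i][k - 1] if k > i else 0
                    inside = best[k + 1][j - 1] if k + 1 <= j - 1 else 0
                    value = max(value, left + 1 + inside)
            best[i][j] = value
    return best[0][n - 1]
```

The independence of the two subproblems in case 2 is exactly what the pseudo-knot constraint buys: once crossing pairs are allowed, this decomposition no longer holds.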


NP Completeness (Lyngsø and Pedersen, 2000)

Let $\vec{s} = s_1 \cdots s_n$ be an RNA sequence, and $P$ a secondary structure. Then

\[ E(\vec{s}, P) = \sum_{(i,j) \in P,\; i < j} E(\vec{s}, i, j, P) \]

where $E(\vec{s}, i, j, P)$ depends on $s_i$ and $s_j$ and, moreover, on the $s_z$ such that $(i + 1, z) \in P$ or $(j - 1, z) \in P$.


In the NP-completeness proof they first assume an infinite set of complementary bases (e.g., $(A_1, U_1), (A_2, U_2), (A_3, U_3), \ldots$) and define $E$ as follows:

\[
E(\vec{s}, i, j, P) =
\begin{cases}
-1 & \text{if } s_i \text{ and } s_j \text{ are complementary symbols and} \\
   & (\forall z \in \{1, \ldots, i-1, j+1, \ldots, n\})\,(\{(i+1, z), (z, i+1), (j-1, z), (z, j-1)\} \cap P = \emptyset) \\
0 & \text{otherwise}
\end{cases}
\]

[Figure: example base pairs labelled with their energy $E(\vec{s}, i, j, P)$, either 0 or -1.]
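The energy definition can be read operationally: a pair contributes -1 only when its bases are complementary and neither of the two positions just inside it is paired with something outside it. The sketch below is an illustration, not the authors' code. The helper `complementary` and its symbol format (`"A3"` pairs only with `"U3"`) are assumptions made to mimic the infinite alphabet, and each position is assumed to occur in at most one pair.

```python
def complementary(x, y):
    """Hypothetical infinite alphabet: 'A3' pairs with 'U3' only (same index)."""
    return {x[0], y[0]} == {"A", "U"} and x[1:] == y[1:]


def energy(s, pairs):
    """E(s, P) as the sum of E(s, i, j, P) over the pairs; positions are 1-based."""
    partner = {}
    for i, j in pairs:
        partner[i] = j
        partner[j] = i

    def local(i, j):
        if not complementary(s[i - 1], s[j - 1]):
            return 0
        # Positions i+1 and j-1 must not be paired with any z < i or z > j.
        for inner in (i + 1, j - 1):
            z = partner.get(inner)
            if z is not None and (z < i or z > j):
                return 0
        return -1

    return sum(local(min(i, j), max(i, j)) for i, j in pairs)
```

With `s = ["A1", "A2", "U2", "U1"]`, the nested structure `[(1, 4), (2, 3)]` reaches energy -2, whereas on `["A1", "A2", "U1", "U2"]` the crossing structure `[(1, 3), (2, 4)]` gets 0 because each pair is disturbed by the other.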


To prove NP-hardness they start from 3SAT with the further requirement that each literal occurs at most twice (for a variable $X$, you can have $X$ zero, one, or two times and $\neg X$ zero, one, or two times). Prove that this restricted problem is NP-complete (exercise).

For a clause $c_i = \ell_1 \vee \ell_2 \vee \ell_3$ they introduce a gadget:

\[ C_i = c_{i,1}\,(\ell_1)_{1/2}\,c_{i,1}\,c_{i,2}\,(\ell_2)_{1/2}\,c_{i,1}\,c_{i,2}\,(\ell_3)_{1/2}\,c_{i,2} \]

(index 1/2 according to the leftmost/rightmost occurrence of that literal).

For a variable $X_i$ that occurs twice positively and twice negatively, they introduce a gadget (a substring of it in the case of fewer occurrences):

\[ V_i = v_i\,(X_i)_2\,(X_i)_1\,v_i\,v_i\,(\neg X_i)_2\,(\neg X_i)_1\,v_i \]

The encoding of a formula $c_1 \wedge \cdots \wedge c_m$ on variables $X_1, \ldots, X_n$ is

\[ C_1 \cdots C_m V_1 \cdots V_n \]

NP Completeness: Main idea

[Figure: main idea of the reduction.]

NP Completeness (Lyngsø and Pedersen, 2000)

The encoding of a formula $\varphi = c_1 \wedge \cdots \wedge c_m$ on variables $X_1, \ldots, X_n$ is

\[ C_1 \cdots C_m V_1 \cdots V_n \]

They prove that $\varphi$ is satisfiable iff there is a secondary structure with energy $-(3m + n)$ [nice exercise].

Then they complete the proof without the hypothesis of an infinite alphabet (this is a nice read, but too long to explain in this course).


A simple CLP encoding

Input: $s_1, \ldots, s_n$

Variables: Pairs $= [P_1, \ldots, P_n]$.

Let $S_x = \{ i \in \{1, \ldots, n\} \mid s_i = x \}$.
If $s_i = A$, then $\mathrm{dom}(P_i) = \{0\} \cup S_U$.
If $s_i = C$, then $\mathrm{dom}(P_i) = \{0\} \cup S_G$.
If $s_i = G$, then $\mathrm{dom}(P_i) = \{0\} \cup S_C \cup S_U$.
If $s_i = U$, then $\mathrm{dom}(P_i) = \{0\} \cup S_A \cup S_G$.

For $i = 1, \ldots, n$: if $P_i > 0$ then $P_{P_i} = i$; if $P_i = 0$ there is no constraint. This can be stated compactly (with P standing for $P_i$ and I for $i$) as:

element(P + 1, [I|Pairs], I)

Pseudo-knots: if $P_i > 0$ then $(P_{i+1} \in [i + 3 .. P_{P_i} - 1]) \vee (P_{i+1} = 0)$.
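To make the domains and the symmetry constraint concrete, here is a small Python sketch that mirrors them without a constraint solver. The names `domains` and `is_consistent` are hypothetical, and the code only checks a complete assignment (it performs no propagation or search, and it does not cover the pseudo-knot constraint).

```python
# Which bases each base may pair with, as in the domain definitions above.
PARTNERS = {"A": "U", "C": "G", "G": "CU", "U": "AG"}


def domains(s):
    """dom(P_i) for i = 1..n: 0 (unpaired) plus every position with a compatible base."""
    n = len(s)
    return [
        {0} | {j for j in range(1, n + 1) if s[j - 1] in PARTNERS[s[i - 1]]}
        for i in range(1, n + 1)
    ]


def is_consistent(s, pairs_list):
    """Check a full assignment [P_1, ..., P_n] against the domains and P_{P_i} = i."""
    doms = domains(s)
    for i, p in enumerate(pairs_list, start=1):
        if p not in doms[i - 1]:
            return False
        # Mirror of element(P + 1, [I | Pairs], I): slot 0 of the extended
        # list holds i itself, so the test is trivially true when p == 0.
        extended = [i] + pairs_list
        if extended[p] != i:
            return False
    return True
```

Prepending `i` to the list is the same trick as `[I|Pairs]` in the `element` constraint: it turns the conditional "if $P_i > 0$" into a single unconditional lookup.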


As cost function we want either to maximize contacts or (as done by Bavarian and Dahl, WCB 05) to find a solution close to the statistics, namely 35% for AU, 53% for CG, 12% for GU.

Let $NC = n - \#\mathrm{contacts}$. We therefore minimize a weighted sum of the form

\[ c_1 \frac{NC}{n} + c_2 \frac{\#(AU) - 0.35\,(n - NC)}{n} + c_3 \frac{\#(CG) - 0.53\,(n - NC)}{n} \]

($c_1, c_2, c_3$ are constants that can be changed; the denominator $n$ can be omitted for minimization).

Let us see some executions of RNA_alignment.pl.
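The weighted sum can be evaluated directly for a candidate structure. The sketch below follows the formula term by term. It assumes that #contacts, #(AU) and #(CG) all count pairs (the slide does not say whether pairs or paired positions are meant), that $n > 0$, and the function name `weighted_cost` is hypothetical.

```python
def weighted_cost(s, pairs, c1=1.0, c2=1.0, c3=1.0):
    """Evaluate the weighted sum for `pairs` (1-based (i, j) tuples) on sequence `s`."""
    n = len(s)
    contacts = len(pairs)  # assumption: one contact per base pair
    nc = n - contacts
    au = sum(1 for i, j in pairs if {s[i - 1], s[j - 1]} == {"A", "U"})
    cg = sum(1 for i, j in pairs if {s[i - 1], s[j - 1]} == {"C", "G"})
    return (
        c1 * nc
        + c2 * (au - 0.35 * (n - nc))
        + c3 * (cg - 0.53 * (n - nc))
    ) / n
```

For `"GAUC"` with pairs `[(1, 4), (2, 3)]` and unit weights, the three terms are 2, 1 - 0.7 and 1 - 1.06, which gives about 0.56 after dividing by $n = 4$.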


(Some) References

M. Zuker and P. Stiegler. Optimal computer folding of large RNA sequences using thermodynamics and auxiliary information. Nucleic Acids Research, 9(1):133–148, 1981.

R.B. Lyngsø and C.N.S. Pedersen. RNA Pseudoknot prediction in Energy-Based Models. J. of Computational Biology, 7(3/4), 2000.

G. Blin, G. Fertin, I. Rusu, and C. Sinoquet. Extending the hardness of RNA secondary structure comparison. LNCS 4614, pp. 140–151, 2007.

M. Bauer, G.W. Klau, and K. Reinert. Accurate multiple sequence-structure alignment of RNA sequences using combinatorial optimization. BMC Bioinformatics, 8, 2007.

M. Bavarian and V. Dahl. Constraint Based Methods for Biological Sequence Analysis. J. Universal Computer Science, 12(11):1500–1520, 2006 (also in WCB 05).

A. Dal Palù, M. Möhl, and S. Will. A Propagator for Maximum Weight String Alignment with Arbitrary Pairwise Dependencies. CP 2010, pp. 167–175 (also in WCB 10).
