
Upload: roger-riley

Post on 23-Dec-2015


Page 1: RNA Secondary Structure Prediction Introduction RNA is a single-stranded chain of the nucleotides A, C, G, and U. The string of nucleotides specifies the

RNA Secondary Structure Prediction

Introduction

•RNA is a single-stranded chain of the nucleotides A, C, G, and U. The string of nucleotides specifies the linear structure of the RNA strand.

•When RNA folds, complementary nucleotides form base pairs (CG and AU).

•The tertiary (3 dimensional) structure is too complicated for us to calculate.

•We calculate only secondary structures, lists of base pairs.

•Knowing the base pairs tells a lot about the 3 dimensional structure.

Chemical Structure of RNA

• Four base types.

• Distinguishable ends.

Partial Tertiary Structure

• One illustration

Yet Another Tertiary Structure

• Found via google

Our Final Tertiary Picture

• Very complex

A Partial RNA Secondary Structure

Pure Secondary Structure

Our Basic Model

• RNA linear structure: $R = r_1 r_2 \ldots r_n$, with each $r_i$ from {A, C, G, U}.

• RNA secondary structure: a set of pairs $(r_i, r_j)$ such that $0 < i < j < n+1$.

• Goal: secondary structures with minimum free energy.

Implementing Model Restrictions

• No knots: exclude pairs $(r_i, r_j)$ and $(r_k, r_l)$ such that $i < k < j < l$. Real RNA does contain knots, so this is a simplification.

– Reason: program loop structure.

• No “close” base pairs: $j - i > t$ for some $t > 0$.

– Reason: high free energy.

• Complementary base pairs only: A-U, C-G.

– Reason: high free energy otherwise.
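The three restrictions above can be checked mechanically. The following is a minimal sketch, not part of the original slides: the function name `is_valid_structure`, the default `t=3`, and the extra rule that each base appears in at most one pair are assumptions made for illustration.

```python
# Complementary pairs allowed by the model (A-U and C-G, in either order).
COMPLEMENTARY = {("A", "U"), ("U", "A"), ("C", "G"), ("G", "C")}


def is_valid_structure(rna, pairs, t=3):
    """Check a list of 1-based pairs (i, j) against the model restrictions.

    t is the minimum-separation parameter (hypothetical default of 3).
    """
    n = len(rna)
    used = set()
    for i, j in pairs:
        if not (0 < i < j < n + 1):          # indices must satisfy 0 < i < j < n+1
            return False
        if j - i <= t:                       # "close" base pair
            return False
        if (rna[i - 1], rna[j - 1]) not in COMPLEMENTARY:
            return False
        if i in used or j in used:           # assumption: one partner per base
            return False
        used.update((i, j))
    # No knots: reject any two pairs with i < k < j < l.
    for i, j in pairs:
        for k, l in pairs:
            if i < k < j < l:
                return False
    return True
```

For example, the stem `[(1, 10), (2, 9), (3, 8)]` on `"GGGAAAUCCC"` passes, while the crossing pairs `[(1, 3), (2, 4)]` on `"GACU"` fail the knot test even with `t=0`.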

Our Two Algorithms

• Independent base pairs – quite easy, but inaccurate.

• Calculate loops’ free energy – best we can do for today’s class.

Independent Base Pair Algorithm

• Assumption: Independent base pairs.

• Advantage 1: Simpler calculations.

• Advantage 2: Illustrates ideas for a much more accurate algorithm.

• Disadvantage: Unrealistic answers.

Independent Base Pairs: What Makes It “Easy”?

• Assumption: The energy of each base pair is independent of all of the other pairs and the loop structure.

• Consequence: Total free energy is the sum of all of the base pair free energies.
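Under the independence assumption, scoring a structure is a plain sum. A small sketch of that consequence follows; the energy values in `PAIR_ENERGY` are hypothetical numbers chosen for illustration, not values from the slides.

```python
import math

# Hypothetical pair energies: negative means stabilizing.
PAIR_ENERGY = {("C", "G"): -3.0, ("G", "C"): -3.0,
               ("A", "U"): -2.0, ("U", "A"): -2.0}


def pair_energy(x, y):
    """a(x, y): table value for allowed pairs, infinity otherwise."""
    return PAIR_ENERGY.get((x, y), math.inf)


def total_energy(rna, pairs):
    """Total free energy = sum of the individual base pair energies."""
    return sum(pair_energy(rna[i - 1], rna[j - 1]) for i, j in pairs)
```

With these assumed values, three nested G-C pairs contribute three times the G-C energy, independent of how they are arranged.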

Independent Base Pairs: Basic Approach

• Use solutions for smaller strings to determine solutions for larger strings.

• This is precisely the kind of decoupling required for dynamic programming algorithms to work.

Independent Base Pairs: Notation

• $a(r_i, r_j)$ – the free energy of a base pair joining $r_i$ and $r_j$.

• $S_{i,j}$ – the secondary structure of the RNA strand from base $r_i$ to base $r_j$, i.e., the set of base pairs between $r_i$ and $r_j$ inclusive.

• $E(S_{i,j})$ – the free energy associated with the secondary structure $S_{i,j}$.

• We define $a(r_i, r_j)$ to be large when the model constraints are violated, so that such pairs are never chosen.

Independent Base Pairs: Calculating Free Energy

• Consider the RNA strand from position $i$ to $j$.

• Consider whether $r_j$ is paired.

• If $r_j$ is paired, $E(S_{i,j}) = E(S_{i,k-1}) + a(r_k, r_j) + E(S_{k+1,j-1})$ for some $i - 1 < k < j$, where $r_k$ is the partner of $r_j$.

• If $r_j$ isn’t paired, then $E(S_{i,j}) = E(S_{i,j-1})$.

Independent Base Pairs - Algorithm

• We search for intervals with minimum free energy.

• For each interval, the free energy is given by this formula:

$$E(S_{i,j}) = \min\Bigl( E(S_{i+1,j-1}) + a(r_i, r_j),\; \min_{i-1<k<j+1} \bigl[ E(S_{i,k-1}) + a(r_k, r_j) + E(S_{k+1,j-1}) \bigr] \Bigr)$$

• The free energy of the RNA strand is $E(S_{1,n})$.
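A minimal sketch of the table fill, written with the two cases separated: $r_j$ unpaired, or $r_j$ paired with some $r_k$, $i \le k < j$. The energy values, the threshold `t=3`, and the names `pair_energy` and `fill_table` are assumptions for illustration only.

```python
import math

# Hypothetical pair energies (illustrative values, not from the slides).
PAIR_ENERGY = {("C", "G"): -3.0, ("G", "C"): -3.0,
               ("A", "U"): -2.0, ("U", "A"): -2.0}


def pair_energy(x, y, dist, t=3):
    """a(r_k, r_j): infinite when a model constraint is violated."""
    if dist <= t:                                   # "close" base pair
        return math.inf
    return PAIR_ENERGY.get((x, y), math.inf)        # non-complementary


def fill_table(rna, t=3):
    """Return E with E[i][j] = minimum free energy of S_{i,j} (1-based)."""
    n = len(rna)
    # Empty and single-base intervals keep energy 0.
    E = [[0.0] * (n + 2) for _ in range(n + 2)]
    for length in range(2, n + 1):                  # smaller intervals first
        for i in range(1, n - length + 2):
            j = i + length - 1
            best = E[i][j - 1]                      # r_j is not paired
            for k in range(i, j):                   # r_j is paired with r_k
                cand = (E[i][k - 1]
                        + pair_energy(rna[k - 1], rna[j - 1], j - k, t)
                        + E[k + 1][j - 1])
                best = min(best, cand)
            E[i][j] = best
    return E
```

Usage: `fill_table(rna)[1][len(rna)]` is the energy of the whole strand, $E(S_{1,n})$.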

Independent Base Pairs: Question 1

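• How does this formula deal with the case where $r_j$ isn’t paired with any base?

• It is a special case of $E(S_{i,k-1}) + a(r_k, r_j) + E(S_{k+1,j-1})$, $i - 1 < k < j + 1$.

• The special case with $k = j$: assuming the convention that $a(r_j, r_j) = 0$ and that the empty interval $S_{j+1,j-1}$ has energy 0, the term reduces to $E(S_{i,j-1})$.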

Independent Base Pairs: Question 2

• What is the high level algorithm flow?

1. Advance from smaller to larger intervals, calculating free energy costs.

2. Trace back the path that corresponds to the minimum free energy.

Independent Base Pairs: Question 3

• In what orders can the intervals’ free energy costs be evaluated?

1. Major = lower, minor = upper bound

2. Major = upper, minor = lower bound

3. Diagonally

4. Any order (eg, random) that respects the partial order induced by inclusion
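The orders listed above can be written as interval generators and checked against the inclusion constraint. This is a sketch: the function names and the specific loop directions (ascending or descending) are assumptions, since the slides name the orders without fixing directions.

```python
def order_by_lower(n):
    """Major = lower bound i (descending), minor = upper bound j (ascending)."""
    return [(i, j) for i in range(n, 0, -1) for j in range(i, n + 1)]


def order_by_upper(n):
    """Major = upper bound j (ascending), minor = lower bound i (descending)."""
    return [(i, j) for j in range(1, n + 1) for i in range(j, 0, -1)]


def order_diagonal(n):
    """Diagonal by diagonal: all intervals of length 1, then 2, and so on."""
    return [(i, i + d) for d in range(n) for i in range(1, n - d + 1)]


def respects_inclusion(order):
    """True if each interval comes after every interval strictly inside it."""
    pos = {interval: p for p, interval in enumerate(order)}
    return all(pos[(a, b)] < pos[(i, j)]
               for (i, j) in order
               for a in range(i, j + 1)
               for b in range(a, j + 1)
               if (a, b) != (i, j))
```

An order that scans both bounds upward, such as `[(i, j) for i in range(1, n + 1) for j in range(i, n + 1)]`, fails the check because interval (1, 2) is visited before (2, 2).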

Independent Base Pairs: Question 4

• What are the time and storage requirements of this algorithm?

• Express your answer in terms of the number of bases in the RNA strand.

• Since the number of intervals is quadratic, the storage requirements are quadratic.

• Since the time requirement for each interval is linear, total time is cubic.
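One way to see the cubic bound, assuming the work for interval $(i,j)$ is proportional to its length $j - i + 1$ (a simplification made here for illustration), is to count intervals by length $\ell$:

$$\sum_{1 \le i \le j \le n} (j - i + 1) = \sum_{\ell=1}^{n} \ell\,(n - \ell + 1) = \frac{n(n+1)(n+2)}{6} = O(n^3)$$

The table holds one entry per interval, and there are $n(n+1)/2 = O(n^2)$ intervals, which gives the quadratic storage.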

Independent Base Pairs: Question 5

• Why not simply calculate free energies as they are needed? Why store them at all?

• Because the recursive calls would turn our polynomial algorithm into an exponential algorithm.
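The effect of storing results can be observed by counting evaluations. The sketch below mimics only the call pattern of the recurrence (no energies are computed); the function names are hypothetical.

```python
from functools import lru_cache


def count_calls_naive(n):
    """Evaluations needed if every E(S_{i,j}) is recomputed on demand."""
    calls = 0

    def visit(i, j):
        nonlocal calls
        calls += 1
        if j <= i:                      # empty or single-base interval
            return
        visit(i, j - 1)                 # r_j not paired
        for k in range(i, j):           # r_j paired with r_k
            visit(i, k - 1)
            visit(k + 1, j - 1)

    visit(1, n)
    return calls


def count_calls_memo(n):
    """Evaluations needed when each interval's result is stored once."""
    calls = 0

    @lru_cache(maxsize=None)
    def visit(i, j):
        nonlocal calls
        calls += 1                      # counted only on a cache miss
        if j <= i:
            return 0
        visit(i, j - 1)
        for k in range(i, j):
            visit(i, k - 1)
            visit(k + 1, j - 1)
        return 0

    visit(1, n)
    return calls
```

Comparing `count_calls_naive(n)` with `count_calls_memo(n)` for growing `n` shows the first growing much faster than the second, which stays bounded by the number of distinct intervals.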

Independent Base Pairs: Question 6

• How does traceback work for this algorithm?

1. Recalculate which subinterval yields the minimum free energy, or

2. Save traceback paths while the table is filled.
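A sketch of the first traceback option (recalculation) follows. The energy values, the threshold `t=3`, and the names `a` and `fold` are assumptions for illustration. The exact equality tests are safe here only because the assumed energies are small integers stored as floats.

```python
import math

# Hypothetical pair energies (illustrative values, not from the slides).
PAIR_ENERGY = {("C", "G"): -3.0, ("G", "C"): -3.0,
               ("A", "U"): -2.0, ("U", "A"): -2.0}


def a(rna, k, j, t=3):
    """Energy of pairing r_k with r_j; infinite if a constraint is violated."""
    if j - k <= t:
        return math.inf
    return PAIR_ENERGY.get((rna[k - 1], rna[j - 1]), math.inf)


def fold(rna, t=3):
    """Return (minimum energy, sorted list of 1-based base pairs)."""
    n = len(rna)
    E = [[0.0] * (n + 2) for _ in range(n + 2)]
    for j in range(1, n + 1):                  # major = upper bound
        for i in range(j - 1, 0, -1):          # minor = lower bound
            best = E[i][j - 1]
            for k in range(i, j):
                best = min(best, E[i][k - 1] + a(rna, k, j, t) + E[k + 1][j - 1])
            E[i][j] = best
    # Traceback: recompute which case produced each stored minimum.
    pairs, stack = [], [(1, n)]
    while stack:
        i, j = stack.pop()
        if j <= i:
            continue
        if E[i][j] == E[i][j - 1]:             # r_j not paired
            stack.append((i, j - 1))
            continue
        for k in range(i, j):                  # find the partner r_k of r_j
            if E[i][j] == E[i][k - 1] + a(rna, k, j, t) + E[k + 1][j - 1]:
                pairs.append((k, j))
                stack.append((i, k - 1))
                stack.append((k + 1, j - 1))
                break
    return E[1][n], sorted(pairs)
```

The second option would instead record the winning `k` (or "unpaired") for each interval during the fill, trading extra storage for a traceback that needs no recomputation.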

Loop Free Energy Algorithm

• A base pair’s free energy is not independent of all other base pairs.

• An RNA molecule’s free energy actually depends on its loop structure.

• What do we mean by loops?

Types of Loops

• Each base pair (ri,rj) encloses a loop:

1. Hairpin loop

2. Bulge on i or j

3. Interior loop

4. Helical region

Hairpin Loop

• There are no base pairs $(r_k, r_l)$ for $i < k < l < j$.

Bulge on i or j

• Bulge on i:

– $(r_i, r_j)$ and $(r_k, r_{j-1})$ are base pairs with $k > i + 1$.

– $r_{i+1}$ is not paired.

• The bulge on j is symmetric.

Interior loop

• $(r_i, r_j)$ and $(r_k, r_l)$ are base pairs with $i + 1 < k < l < j - 1$.

• $r_{i+1}$ and $r_{j-1}$ are not in base pairs.

Helical region

• $(r_i, r_j)$ and $(r_{i+1}, r_{j-1})$ are base pairs.
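The four loop definitions can be turned into a small classifier. This is a sketch under stated assumptions: the input is a knot-free structure that contains the pair (i, j), the function name `classify_loop` is hypothetical, and loops enclosing more than one inner stem are reported as "other" because the slides do not define them.

```python
def classify_loop(pairs, i, j):
    """Name the loop enclosed by base pair (i, j); indices are 1-based."""
    partner = {}
    for k, l in pairs:
        partner[k] = l
        partner[l] = k
    # Hairpin: no base pair (k, l) with i < k < l < j.
    if not any(i < k < l < j for k, l in pairs):
        return "hairpin loop"
    # First paired base after i and its partner give the outermost inner pair.
    k = next(p for p in range(i + 1, j) if p in partner)
    l = partner[k]
    # Any paired base between l and j means a second inner stem.
    if any(p in partner for p in range(l + 1, j)):
        return "other"
    if k == i + 1 and l == j - 1:
        return "helical region"
    if k > i + 1 and l == j - 1:
        return "bulge on i"
    if k == i + 1 and l < j - 1:
        return "bulge on j"
    return "interior loop"
```

For the stem `[(1, 10), (2, 9), (3, 8)]`, the pairs (1, 10) and (2, 9) enclose helical regions and (3, 8) encloses a hairpin loop.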

Free energy analysis

• $E(S_{i,j}) = E(S_{i+1,j})$ when $r_i$ isn’t paired.

• $E(S_{i,j}) = E(S_{i,j-1})$ when $r_j$ isn’t paired.

• $E(S_{i,j}) = \min_k \bigl( E(S_{i,k}) + E(S_{k+1,j}) \bigr)$ for $i < k < j$, with $k$ between the partners of $r_i$ and $r_j$, when $r_i$ and $r_j$ are both paired but not to each other.

• $E(S_{i,j}) = E(L_{i,j})$, where $L_{i,j}$ is the loop energy, when $r_i$ and $r_j$ are paired to each other.

Free Energy Functions

• $a(r_i, r_j)$ – Free energy of base pair $(r_i, r_j)$.

• $H(k)$ – Destabilizing free energy of a hairpin loop with size $k$.

• $R$ – Stabilizing free energy of adjacent base pairs (helical region).

• $B(k)$ – Destabilizing free energy of a bulge of size $k$.

• $I(k)$ – Destabilizing free energy of an interior loop of size $k$.

Loop Energy Formulas

• $H(j - i - 1)$ – for a hairpin loop

• $R + E(S_{i+1,j-1})$ – for a helical region

• $B(k) + E(S_{i+k+1,j-1})$ – for a bulge on i

• $B(k) + E(S_{i+1,j-k-1})$ – for a bulge on j

• $I(k_1 + k_2) + E(S_{i+k_1+1,j-k_2-1})$ – for an interior loop

Free Energy Calculation for Interval (i,j)

• Minimize over

1. Case where $(r_i, r_j)$ is not a pair.

2. Case where $(r_i, r_j)$ is a pair.

– Add $a(r_i, r_j)$ to the loop energy formulas.

– Minimize over $k$, $k_1$, and $k_2$.
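The paired case can be sketched as a minimization over the loop energy formulas, applied literally as written on the slides. Everything concrete here is an assumption for illustration: the function name `paired_energy`, the requirement that the inner interval hold at least two bases, and the energy functions `H`, `B`, `I` and constant `R` supplied by the caller.

```python
def paired_energy(E, i, j, a_ij, H, B, I, R):
    """Best energy of interval (i, j) when (r_i, r_j) is a pair.

    E is a table of stored interval energies with E[x][y] = E(S_{x,y}).
    H, B, I are callables for hairpin, bulge and interior loop energies;
    R is the helical-region energy; a_ij is a(r_i, r_j).
    """
    candidates = [H(j - i - 1)]                          # hairpin loop
    if j - i >= 3:
        candidates.append(R + E[i + 1][j - 1])           # helical region
    for k in range(1, j - i - 2):                        # bulge of size k
        candidates.append(B(k) + E[i + k + 1][j - 1])    # bulge on i
        candidates.append(B(k) + E[i + 1][j - k - 1])    # bulge on j
    for k1 in range(1, j - i - 3):                       # interior loop:
        for k2 in range(1, j - i - 2 - k1):              # quadratic in j - i
            candidates.append(I(k1 + k2) + E[i + k1 + 1][j - k2 - 1])
    return a_ij + min(candidates)
```

The full value for the interval would then be the minimum of this result and the cases where $(r_i, r_j)$ is not a pair. The nested `k1`, `k2` loops are the part whose cost is quadratic in the interval size.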

What is the Apparent Complexity?

• The interior loop calculations are given by $I(k_1 + k_2) + E(S_{i+k_1+1,j-k_2-1})$.

• The number of interior loop possibilities is quadratic in the interval size.

• The number of intervals is quadratic in the size of the problem.

• The complexity appears to grow as $n^4$.

What is the Actual Complexity?

• Overall reduction from $n^4$ to $n^3$ is possible.

• Per-interval reduction from $n^2$ to linear.

• Store the minimum free energy $V_{i,j,k}$ where the interval $(i,j)$ contains an interior loop of size $k$.

Multiple Solutions

• Care must be taken to define the issues.

• Multiple solutions can be obtained by adding flexibility to the traceback logic.

• The number of solutions can grow exponentially.

References

• M. Zuker, “The use of dynamic programming in RNA secondary structure prediction”. In M. S. Waterman, editor, Mathematical Methods for DNA Sequences. Boca Raton, FL: CRC Press, 1989.

• J. Setubal and J. Meidanis, Ch. 8.1, Introduction to Computational Molecular Biology. Pacific Grove, CA: Brooks/Cole Publishing Co., 1997.