BIOL335: RNA bioinformatics


RNA bioinformatics

Paul Gardner

April 2, 2015

Paul Gardner RNA bioinformatics

Main questions

- How can we predict RNA structure?


Why do we care about RNA?

- RNA is important for translation and gene regulation.

  - 2/3 of the ribosome is RNA. Ribosomal function is preserved even after amino-acid residues are deleted from the active site!

- Current estimates indicate that the number of ncRNA genes is comparable to the number of protein-coding genes.

[Figure: flow from DNA (mDNA, uDNA, rDNA, tDNA) through pre-mRNA and mRNA to nascent protein and localised protein, via transcription, splicing, translation and transport. Labelled RNA components: spliceosome, ribosome, tRNA, RNase P, RNase MRP, snoRNP, SRP, tmRNA, RISC (miRNA).]

RNA: why is this stuff interesting?

- The RNA world was an essential step towards modern protein-DNA based life (under current reasonable models).

- Which came first, DNA or protein?

  - RNA has catalytic potential (like protein) and carries hereditary information (like DNA).

Image by James W. Brown, www.mbio.ncsu.edu/JWB/soup.html


RNA interference

Image lifted from: http://en.wikipedia.org/wiki/RNA_interference


RNA: structure

[Figure: tRNA structure in three panels. A: Primary Structure. B: Secondary Structure. C: Tertiary Structure. Labelled elements: Acceptor Stem, D Loop, Anticodon Loop, Variable Loop, TΨC Loop. Positions are numbered from 5 to 75 in steps of 5.]

Primary sequence (5' to 3'): GCGGAUUUAGCUCAGDDGGGAGAGCGCCAGACUGAAYA.CUGGAGGUCCUGUGT.CGAUCCACAGAAUUCGCACCA

RNA: base-pairing

- Canonical (Watson-Crick) base-pairs: C · G, A · U.

- Non-canonical (wobble) base-pair: G · U.

- Note: other non-canonical base-pairs do occur, but these are "rare" and generally re-defined as "tertiary" interactions.

- Central dogma of structural biology: structure is important for function.

- Images lifted from: http://en.wikipedia.org/wiki/Base_pair


RNA: base-pairing

Images lifted from: http://eternawiki.org/wiki/index.php5/Base_Pair


RNA: base-pairing

bpC     C:G     U:A     U:G     G:A     C:A     U:C     A:A     C:C     G:G     U:U     Total
WC      49.8%   14.4%   0.01%   1.2%    0.1%    0.5%    -       -       -       -       66.1%
Wb      0.06%   0.06%   7.1%    -       0.2%    -       0.3%    0.5%    0.2%    0.9%    9.6%
Other   0.8%    5.8%    1.5%    9.4%    2.3%    0.6%    2.6%    0.5%    0.7%    0.3%    24.3%
Total   50.7%   20.3%   8.7%    10.6%   2.6%    1.0%    2.9%    1.0%    0.9%    1.3%    100.0%

(WC: Watson-Crick; Wb: wobble.)

Just 71.3% of rRNA contacts are canonical or G:U wobble!

Lee & Gutell (2004) Diversity of base-pair conformations and their occurrence in rRNA structure and RNA structural motifs. J Mol Biol.
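The 71.3% figure appears to be the sum of three table entries: the Watson-Crick C:G and U:A contacts plus the wobble U:G contacts. A quick check under that assumption:

```latex
\[
  \underbrace{49.8\%}_{\text{WC, C:G}}
  + \underbrace{14.4\%}_{\text{WC, U:A}}
  + \underbrace{7.1\%}_{\text{Wb, U:G}}
  = 71.3\%
\]
```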


RNA stacking

Laurberg et al. (2008) Structural basis for translation termination on the 70S ribosome. Nature.

Image lifted from: http://rna.ucsc.edu/pdbrestraints/index.html


RNA: number of structures

$A_N$ is the number of possible sequences of length $N$.

$A_N \sim 4^N$

$S_N$ is the number of possible secondary structures of length $N$.

$S_0 = S_1 = 1$

$S_{N+1} = S_N + \sum_{j=1}^{N} S_{j-1} S_{N-j+1}$

$S_N \sim 1.8^N$

Hofacker et al. (1998) Combinatorics of RNA Secondary Structures, Discrete Applied Mathematics.
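The growth of the structure count can be explored numerically. The sketch below is an illustration under stated assumptions, not necessarily the slide's exact formula: it uses a common counting convention in which the last base is either unpaired or pairs with base j, and a pair must enclose at least `min_loop` bases. The function name `count_structures` and the parameter `min_loop` are hypothetical.

```python
def count_structures(length, min_loop=3):
    """Count pseudoknot-free secondary structures on `length` bases.

    Assumption: a base-pair must enclose at least `min_loop` bases.
    """
    counts = [1]  # one structure (the empty one) on zero bases
    for n in range(1, length + 1):
        total = counts[n - 1]  # case 1: base n is unpaired
        # case 2: base n pairs with base j, enclosing n - 1 - j bases
        for j in range(1, n - min_loop):
            total += counts[j - 1] * counts[n - 1 - j]
        counts.append(total)
    return counts[length]


if __name__ == "__main__":
    previous = count_structures(29)
    current = count_structures(30)
    # Ratio of successive counts: a rough empirical growth rate.
    print(current / previous)
```

Comparing the ratio of successive counts for increasing lengths gives a rough feel for the exponential base; the value depends on the constraints assumed (here only the minimum loop size).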


How can we make a secondary structure prediction algorithm?

- Maximize the number of base-pairs in an RNA sequence?

Nussinov et al. (1978) Algorithms for loop matching, SIAM J. Appl. Math.


Structure prediction: Nussinov

Nussinov et al. (1978) Algorithms for loop matching, SIAM J. Appl. Math.

Image from: Eddy SR (2004) How do RNA folding algorithms work? Nature Biotechnology.


Structure prediction: Nussinov

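- Maximize the number of base-pairs in an RNA sequence.

$\mathrm{Seq} = s_1 s_2 \cdots s_n$

$N_{i,j} = 0, \quad \forall\, j - i < 3.$

$N_{i,j} = \max \begin{cases} N_{i+1,j-1} + \rho(i,j), & i, j \text{ pair} \\ N_{i+1,j}, & i \text{ unpaired} \\ N_{i,j-1}, & j \text{ unpaired} \\ \max_{i<k<j} \left[ N_{i,k} + N_{k+1,j} \right], & \text{bifurcation} \end{cases}$

- $O(n^3)$ in CPU, $O(n^2)$ in memory.

- $\rho(i,j) = 1$ if $s_i$ and $s_j$ are complementary, otherwise $\rho(i,j) = 0$.

- $N_{1,n} = BP_{max}$.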

Nussinov et al. (1978) Algorithms for loop matching, SIAM J. Appl. Math.
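The recursion above can be sketched as a short dynamic programme. This is a minimal illustration, not the authors' implementation: the function name `nussinov_max_pairs` is hypothetical, and treating G·U wobble pairs as "complementary" alongside the Watson-Crick pairs is an assumption.

```python
# Assumption: "complementary" means Watson-Crick pairs plus the G-U wobble.
COMPLEMENTARY = {("A", "U"), ("U", "A"), ("G", "C"),
                 ("C", "G"), ("G", "U"), ("U", "G")}


def nussinov_max_pairs(seq, min_sep=3):
    """Return the maximum number of base-pairs, i.e. N[1][n] in the slide.

    N[i][j] stays 0 whenever j - i < min_sep, as in the recursion above.
    """
    n = len(seq)
    if n == 0:
        return 0
    N = [[0] * n for _ in range(n)]
    # Fill by increasing span so every sub-interval is already solved.
    for span in range(min_sep, n):
        for i in range(n - span):
            j = i + span
            rho = 1 if (seq[i], seq[j]) in COMPLEMENTARY else 0
            best = max(N[i + 1][j - 1] + rho,  # i and j pair
                       N[i + 1][j],            # i unpaired
                       N[i][j - 1])            # j unpaired
            for k in range(i + 1, j):          # bifurcation
                best = max(best, N[i][k] + N[k + 1][j])
            N[i][j] = best
    return N[0][n - 1]


if __name__ == "__main__":
    print(nussinov_max_pairs("GGGAAACCC"))
```

The three nested loops (span, start position, bifurcation point) give the cubic running time, and the table `N` gives the quadratic memory use quoted on the slide.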


Structure prediction: Nussinov

- There are a few problems with this approach:

  - The solution to Nussinov is frequently not unique. For example, the 77 nucleotide long tRNA-His has 22 base-pairs in the phylogenetic structure, yet there are 149,126 structures with the maximal number of 26 base-pairs!

  - The method ignores stacking interactions.

Fontana (2002) Modelling ‘evo-devo’ with RNA. BioEssays.


Structure prediction: Zuker

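- Nearest neighbour model

- Modified Nussinov algorithm to find minimal free energy (most stable) structures

[Figure: hairpin with stacked base-pairs A·U, C·G, U·A and G·C, giving three stacks S1, S2 and S3, closed by a loop L.]

$\text{Free Energy} = L + S_1 + S_2 + S_3 = 5.00 - 2.11 - 2.35 - 2.24 = -1.70 \text{ kcal/mol}$

$\Delta G_{stack} = \Delta H_{37,stack} - T \Delta S_{37,stack}$

$\Delta G_{loop} = -T \Delta S_{37,loop}$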

Tinoco et al. (1971) Estimation of secondary structure in RNA. Nature.
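The arithmetic on this slide is a plain sum of one loop term and three stacking terms. The snippet below reproduces it with the values copied from the slide; the function name `hairpin_free_energy` is hypothetical and the sketch only covers this single-hairpin case.

```python
# Values copied from the slide, in kcal/mol.
LOOP_ENERGY = 5.00                      # destabilising loop term L
STACK_ENERGIES = [-2.11, -2.35, -2.24]  # stabilising stacks S1, S2, S3


def hairpin_free_energy(loop, stacks):
    """Nearest-neighbour style sum: loop penalty plus stacking energies."""
    return loop + sum(stacks)


total = hairpin_free_energy(LOOP_ENERGY, STACK_ENERGIES)
print(f"{total:.2f} kcal/mol")
```

The additive form is what lets a Nussinov-style dynamic programme be adapted to minimise energy rather than maximise pair count: each stack or loop contributes its own term independently.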


Structure prediction: Zuker

WX \ YZ   CG      GC      AU      UA      GU      UG
CG        -3.26   -2.36   -2.11   -2.08   -1.41   -2.11
GC        -3.42   -3.26   -2.35   -2.24   -1.53   -2.51
AU        -2.24   -2.08   -0.93   -1.10   -0.55   -1.36
UA        -2.35   -2.11   -1.33   -0.93   -1.00   -1.27
GU        -2.51   -2.11   -1.27   -1.36   +0.47   +1.29
UG        -1.53   -1.41   -1.00   -0.55   +0.30   +0.47

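- Energies ($\Delta G$ in kcal/mol) of ${}^{5'}_{3'}\,{}^{W}_{X}\,{}^{Y}_{Z}\,{}^{3'}_{5'}$ stacked base-pairs.

- Note that $\Delta G$ of ${}^{5'}_{3'}\,{}^{W}_{X}\,{}^{Y}_{Z}\,{}^{3'}_{5'}$ stacks is the same as that of ${}^{5'}_{3'}\,{}^{Z}_{Y}\,{}^{X}_{W}\,{}^{3'}_{5'}$ stacks.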

Mathews et al. (1999) Expanded sequence dependence of thermodynamic parameters improves prediction of RNA secondary structure. JMB.
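The symmetry noted above can be checked directly against the table. The sketch below assumes the row label is the pair W·X and the column label is the pair Y·Z, so that reading the same stack from the other strand swaps the two pairs and reverses each one. The names `STACK`, `stack_energy` and `other_strand` are hypothetical; the numbers are copied from the table.

```python
PAIRS = ["CG", "GC", "AU", "UA", "GU", "UG"]
ROWS = [  # rows: WX pair, columns: YZ pair, values in kcal/mol
    [-3.26, -2.36, -2.11, -2.08, -1.41, -2.11],
    [-3.42, -3.26, -2.35, -2.24, -1.53, -2.51],
    [-2.24, -2.08, -0.93, -1.10, -0.55, -1.36],
    [-2.35, -2.11, -1.33, -0.93, -1.00, -1.27],
    [-2.51, -2.11, -1.27, -1.36, +0.47, +1.29],
    [-1.53, -1.41, -1.00, -0.55, +0.30, +0.47],
]
STACK = {(wx, yz): ROWS[r][c]
         for r, wx in enumerate(PAIRS)
         for c, yz in enumerate(PAIRS)}


def stack_energy(wx, yz):
    """Look up the stacking energy of pair WX followed by pair YZ."""
    return STACK[(wx, yz)]


def other_strand(wx, yz):
    """The same stack read 5' to 3' along the opposite strand."""
    return yz[::-1], wx[::-1]


# Entries whose energy differs from that of the equivalent stack.
asymmetric = [key for key in STACK
              if STACK[key] != STACK[other_strand(*key)]]
print(asymmetric)
```

An empty list means every entry agrees with its other-strand equivalent, so only about half of the 36 values are independent parameters.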


Suboptimal structures

“There is an embarrassing abundance of structures having a free energy near that of the optimum.” (McCaskill 1990)

[Figure: free energy ∆G (kcal/mol, axis from −22 to −20) plotted against base-pair distance to the MFE structure, dBP(Si, Smfe) (axis from −5 to 35), for suboptimal structures of one sequence. Three secondary structure drawings are labelled Biological, Suboptimal and MFE.]

Wuchty et al. (1999) Complete suboptimal folding of RNA and the stability of secondary structures, Biopolymers.


Accuracy of MFE predictions

Non-independent benchmarks:

- Walter et al. (1994) Mean sensitivity 63.6%

- Mathews et al. (1999) Mean sensitivity 72.9%

Independent benchmarks:

- Doshi et al. (2004) Mean sensitivity 41%

- Dowell & Eddy (2004) Mean sensitivity 56%, mean PPV 48%

- Gardner & Giegerich (2004) Mean sensitivity 56%, mean PPV 46%

Data-sets: tRNA, SSU rRNA, LSU rRNA, SRP, RNase P, tmRNA.
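Sensitivity and PPV compare the predicted base-pairs with a reference structure. The sketch below assumes the simplest definitions (sensitivity = TP / reference pairs, PPV = TP / predicted pairs, with only exact matches counted as true positives); the cited benchmarks may use slightly different rules. Function names and the dot-bracket inputs are hypothetical.

```python
def base_pairs(dot_bracket):
    """Return the set of (i, j) pairs encoded by a dot-bracket string."""
    stack, pairs = [], set()
    for pos, char in enumerate(dot_bracket):
        if char == "(":
            stack.append(pos)
        elif char == ")":
            pairs.add((stack.pop(), pos))
    return pairs


def sensitivity_and_ppv(reference, predicted):
    """Fraction of reference pairs found, and of predicted pairs correct."""
    ref, pred = base_pairs(reference), base_pairs(predicted)
    true_pos = len(ref & pred)
    sensitivity = true_pos / len(ref) if ref else 0.0
    ppv = true_pos / len(pred) if pred else 0.0
    return sensitivity, ppv


if __name__ == "__main__":
    # Toy example: the prediction misses the innermost reference pair.
    print(sensitivity_and_ppv("(((...)))", "((.....))"))
```

A method that predicts few pairs can score a high PPV with a low sensitivity, which is why the independent benchmarks report both.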


Limitations of MFE predictions

- Energy parameters: estimated at constant salt concentrations and temperatures.

- Energy model: models of loop energies are extrapolated from relatively few experiments, no pseudoknots, ...

- Cellular environment: contains proteins, RNAs, DNAs, sugars, etc.

- Post-transcriptional modifications: many functional RNAs have been covalently modified.

- Folding kinetics: RNAs fold along "pathways", perhaps becoming trapped in sub-optimal conformations.

- Co-transcriptional folding: RNAs fold during transcription, and the transcriptional apparatus occludes 3' portions of the sequence.

- Transcription is jerky: transcriptional pausing can influence folding.


Comparative sequence analysis

- Input: a set of sequences with the same biological function which are assumed to have approximately the same structure.

- Output: the common structural elements, aligned sequences and a phylogeny which best explains the observed data.

[Figure: phylogeny of the five sequences, with leaves ordered 2, 4, 5, 3, 1.]

>1
GCAUCCAUGGCUGAAUGGUUAAAGCGCCCAACUCAUAAUUGGCGAACUCGCGGGUUCAAUUCCUGCUGGAUGCA
>2
GCAUUGGUGGUUCAGUGGUAGAAUUCUCGCCUGCCACGCGGGAGGCCCGGGUUCGAUUCCCGGCCAAUGCA
>3
UGGGCUAUGGUGUAAUUGGCAGCACGACUGAUUCUGGUUCAGUUAGUCUAGGUUCGAGUCCUGGUAGCCCAG
>4
GAAGAUCGUCGUCUCCGGUGAGGCGGCUGGACUUCAAAUCCAGUUGGGGCCGCCAGCGGUCCCGGGCAGGUUCGACUCCUGUGAUCUUCCG
>5
CUAAAUAUAUUUCAAUGGUUAGCAAAAUACGCUUGUGGUGCGUUAAAUCUAAGUUCGAUUCUUAGUAUUUACC

Alignment:

1 GCAUCCAUGGCUGAAU-GGUU-AAAGCGCCCAACUCAUAAUUGGCGAA--
2 GCAUUGGUGGUUCAGU-GGU--AGAAUUCUCGCCUGCCACGCGG-GAG--
3 UGGGCUAUGGUGUAAUUGGC--AGCACGACUGAUUCUGGUUCAG-UUA--
4 GAAGAUCGUCGUCUCC-GGUG-AGGCGGCUGGACUUCAAAUCCA-GU-UG
5 CUAAAUAUAUUUCAAU-GGUUAGCAAAAUACGCUUGUGGUGCGU-UAA--

1 ------------------CUCGCGGGUUCAAUUCCUGCUGGAUGC-A
2 ------------------G-CCCGGGUUCGAUUCCCGGCCAAUGC-A
3 ------------------G-UCUAGGUUCGAGUCCUGGUAGCCCA-G
4 GGGCCGCCAGCGGUCCCG--GGCAGGUUCGACUCCUGUGAUCUUCCG
5 ------------------A-UCUAAGUUCGAUUCUUAGUAUUUAC-C

[Figure: consensus secondary structure drawn with IUPAC ambiguity codes.]

Comparative sequence analysis

- Evolution of RNA sequences

- Base-pairs that covary have strong evolutionary support

[Figure, panel a: four aligned sequences sharing one structure, with derived Ancestral, MIS and consensus sequences.]

(((..(((....)))..)))
UACAAGAGUGCGCUUAAGUA
UGCAAAAGUCCGUUUAAGCA
UAUAACCUUUCGAGGAAAUA
CAUAAUAAUGCGUUGAAGUG

UACAAGAGUGCGUUUAAGUA  Ancestral
YAUAANADUGCGUUGAAGUR  MIS
YRYAASMGUSCGYKKAAGYR  consensus

[Figure, panels b and c: the consensus, Ancestral and MIS sequences drawn on the hairpin structure, and base-pair substitutions among G·U, A·U, G·C, U·G, C·G and U·A labelled as fast or slow.]
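Covariation can be made concrete by listing, for each base-pair of the shared structure, which pair types occur across the aligned sequences. The sketch below uses the four sequences and the structure from this slide; the helper names are hypothetical and the count of distinct pair types is only a crude indicator, not a published covariation measure.

```python
STRUCTURE = "(((..(((....)))..)))"
ALIGNMENT = [
    "UACAAGAGUGCGCUUAAGUA",
    "UGCAAAAGUCCGUUUAAGCA",
    "UAUAACCUUUCGAGGAAAUA",
    "CAUAAUAAUGCGUUGAAGUG",
]


def paired_columns(dot_bracket):
    """Return (i, j) column pairs from a dot-bracket string (0-based)."""
    stack, pairs = [], []
    for pos, char in enumerate(dot_bracket):
        if char == "(":
            stack.append(pos)
        elif char == ")":
            pairs.append((stack.pop(), pos))
    return sorted(pairs)


def pair_types(alignment, structure):
    """Map each paired column pair to the set of observed base-pairs."""
    return {(i, j): {seq[i] + seq[j] for seq in alignment}
            for i, j in paired_columns(structure)}


if __name__ == "__main__":
    for columns, observed in pair_types(ALIGNMENT, STRUCTURE).items():
        print(columns, sorted(observed))
```

Column pairs where several different pair types all remain complementary are the ones that give evolutionary support to the structure.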


Alignment Folding: RNAalifold

- Generate an alignment (e.g. with ClustalW)

- Find a consensus structure that is both energetically stable in all sequences and has covariation support

[Figure: dot plot with the sequence GCGGAAUUAGCUCAGUU_GGGAGAGCGCCAGACUGAAAAUCUGGAGGUCCCC_GGUUCGAAUCCCGGAAUCCGCA along its axes, alongside a consensus secondary structure drawn with IUPAC ambiguity codes.]

Hofacker et al. (2002) Secondary Structure Prediction for Aligned RNA Sequences, J.Mol.Biol.


Alignment Folding: RNAalifold

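RNAalifold: energy + covariation.

$\beta_{i,j} = \frac{1}{N} \sum_{\alpha}^{N} Z^{\alpha}_{i,j} - \mathrm{Cov}$

$C_{i,j} = \frac{2}{N(N-1)} \sum_{b^{\alpha}_i b^{\alpha}_j,\; b^{\beta}_i b^{\beta}_j} D_H\left(b^{\alpha}_i b^{\alpha}_j,\; b^{\beta}_i b^{\beta}_j\right) \Pi^{\alpha}_{ij}\, \Pi^{\beta}_{ij}$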

Hofacker et al. (2002) Secondary Structure Prediction for Aligned RNA Sequences, J.Mol.Biol.
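The covariation term can be sketched in a few lines. The slide does not define its symbols, so the following are assumptions: N is the number of aligned sequences, the sum runs over unordered pairs of sequences, D_H is the Hamming distance between the two base-pairs, and Π is 1 when a sequence has a Watson-Crick or G·U pair at columns i and j and 0 otherwise. The function names are hypothetical.

```python
from itertools import combinations

# Assumption: Pi = 1 only for these pair types.
VALID_PAIRS = {"AU", "UA", "GC", "CG", "GU", "UG"}


def hamming(pair_a, pair_b):
    """Number of positions (0, 1 or 2) at which two base-pairs differ."""
    return sum(x != y for x, y in zip(pair_a, pair_b))


def covariation(column_i, column_j):
    """Covariation score for two alignment columns, one base per sequence."""
    pairs = [a + b for a, b in zip(column_i, column_j)]
    n = len(pairs)
    if n < 2:
        return 0.0
    total = 0
    for p, q in combinations(pairs, 2):
        if p in VALID_PAIRS and q in VALID_PAIRS:  # Pi^alpha * Pi^beta
            total += hamming(p, q)
    return 2.0 * total / (n * (n - 1))


if __name__ == "__main__":
    # Four sequences with four different, all valid, pair types.
    print(covariation("GACU", "CUGA"))
```

Under these assumptions the score is largest when both columns change together while staying complementary, and it is zero for a perfectly conserved pair.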


Covariation metrics

Lindgreen, Gardner & Krogh (2006) Measuring covariation in RNA alignments: physical realism improves information measures. Bioinformatics.


Rfam: annotation hierarchy

[Figure: Rfam annotation hierarchy with four levels: Types, Clans, Families, Sequences.]

Type labels shown: Gene, Cis-reg., Intron, tRNA, rRNA, miRNA, sRNA, antisense, ribozyme, CRISPR, snRNA, splicing, snoRNA, CD-box_snoRNA, HACA-box_snoRNA, scaRNA, thermoregulator, leader, IRES, frameshift_element, riboswitch.

Building an Rfam family

- A structure from literature

- An Rfam family: produced manually from publication figures


An example Rfam entry


Relevant reading

- Reviews:

  - Eddy SR (2004) How do RNA folding algorithms work? Nature Biotechnology.

- Methods:

  - Hofacker et al. (2002) Secondary Structure Prediction for Aligned RNA Sequences, J.Mol.Biol.


The End
