biol335: rna bioinformatics
TRANSCRIPT
Why do we care about RNA?
- RNA is important for translation and gene regulation.
  - 2/3 of the ribosome is RNA. Ribosomal function is preserved even after amino-acid residues are deleted from the active site!
- Current estimates indicate that the number of ncRNA genes is comparable to the number of protein-coding genes.
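[Figure: RNAs involved in gene expression. Genes: mDNA, uDNA, rDNA, tDNA. Products: pre-mRNA, mRNA, nascent protein, localised protein. Processes: transcription, splicing, translation, transport. RNAs and RNPs: spliceosome, ribosome, tRNA, RNase P, RNase MRP, snoRNP, SRP, tmRNA, RISC (miRNA).]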
Paul Gardner RNA bioinformatics
RNA: why is this stuff interesting?
- The RNA world was an essential step towards modern protein- and DNA-based life (using current reasonable models).
- Which came first, DNA or protein?
  - RNA has catalytic potential (like protein) and carries hereditary information (like DNA).
Image by James W. Brown, www.mbio.ncsu.edu/JWB/soup.html
RNA interference
Image lifted from: http://en.wikipedia.org/wiki/RNA_interference
RNA: structure
[Figure: structure of a tRNA in three panels. Labelled elements: acceptor stem, D loop, anticodon loop, variable loop, TΨC loop.]
A. Primary structure:
5' GCGGAUUUAGCUCAGDDGGGAGAGCGCCAGACUGAAYAΨCUGGAGGUCCUGUGTΨCGAUCCACAGAAUUCGCACCA 3'
B. Secondary structure
C. Tertiary structure
RNA: base-pairing
- Canonical (Watson-Crick) base-pairs: C·G, A·U.
- Non-canonical (wobble) base-pair: G·U.
- Note: other non-canonical base-pairs do occur, but these are “rare” and generally re-defined as “tertiary” interactions.
- Central dogma of structural biology: structure is important for function.
- Images lifted from: http://en.wikipedia.org/wiki/Base_pair
RNA: base-pairing
Images lifted from: http://eternawiki.org/wiki/index.php5/Base_Pair
RNA: base-pairing
Conformation \ base-pair (WC = Watson-Crick, Wb = wobble):

        C:G     U:A     U:G     G:A     C:A     U:C     A:A     C:C     G:G     U:U     Total
WC      49.8%   14.4%   0.01%   1.2%    0.1%    0.5%    -       -       -       -       66.1%
Wb      0.06%   0.06%   7.1%    -       0.2%    -       0.3%    0.5%    0.2%    0.9%    9.6%
Other   0.8%    5.8%    1.5%    9.4%    2.3%    0.6%    2.6%    0.5%    0.7%    0.3%    24.3%
Total   50.7%   20.3%   8.7%    10.6%   2.6%    1.0%    2.9%    1.0%    0.9%    1.3%    100.0%
Just 71.3% of rRNA contacts are canonical or G:U wobble!
Lee & Gutell (2004) Diversity of base-pair conformations and their occurrence in rRNA structure and RNA structural motifs. J Mol Biol.
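The 71.3% figure appears to come from three cells of the table: C:G and U:A in the Watson-Crick conformation plus U:G in the wobble conformation. A minimal Python check, with the values copied from the table (reading WC and Wb as the Watson-Crick and wobble conformations is an assumption, and the dictionary layout is only for illustration):

```python
# Percentages copied from the Lee & Gutell (2004) table.
# Keys are (conformation, base-pair); this layout is illustrative only.
canonical_cells = {
    ("WC", "C:G"): 49.8,
    ("WC", "U:A"): 14.4,
    ("Wb", "U:G"): 7.1,
}

# Round to one decimal place to match the precision of the table.
canonical_total = round(sum(canonical_cells.values()), 1)
print(f"canonical or G:U wobble contacts: {canonical_total}%")
```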
RNA stacking
Laurberg et al. (2008) Structural basis for translation termination on the 70S ribosome. Nature. Image lifted from: http://rna.ucsc.edu/pdbrestraints/index.html
RNA: number of structures
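$A_N$ is the number of possible sequences of length $N$.
\[ A_N \sim 4^N \]
$S_N$ is the number of possible secondary structures of length $N$, where $m$ is the minimum number of unpaired bases in a hairpin loop.
\[ S_0 = S_1 = \dots = S_{m+1} = 1 \]
\[ S_{N+1} = S_N + \sum_{j=1}^{N-m} S_{j-1}\, S_{N-j} \]
\[ S_N \sim 1.8^N \quad (m = 3) \]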
Hofacker et al. (1998) Combinatorics of RNA Secondary Structures, Discrete Applied Mathematics.
How can we make a secondary structure prediction algorithm?
- Maximize the number of base-pairs in an RNA sequence?
Nussinov et al. (1978) Algorithms for loop matching, SIAM J. Appl. Math.
Structure prediction: Nussinov
Nussinov et al. (1978) Algorithms for loop matching, SIAM J. Appl. Math.
Image from: Eddy SR (2004) How do RNA folding algorithms work? Nature Biotechnology.
Structure prediction: Nussinov
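- Maximize the number of base-pairs in an RNA sequence.

\[ \mathrm{Seq} = s_1 s_2 \cdots s_n \]
\[ N_{i,j} = 0, \quad \forall\, j - i < 3 \]
\[ N_{i,j} = \max \begin{cases}
N_{i+1,j-1} + \rho(i,j) & i, j \text{ pair} \\
N_{i+1,j} & i \text{ unpaired} \\
N_{i,j-1} & j \text{ unpaired} \\
\max_{i<k<j} \left[ N_{i,k} + N_{k+1,j} \right] & \text{bifurcation}
\end{cases} \]

- $O(n^3)$ in CPU, $O(n^2)$ in memory.
- $\rho(i,j) = 1$ if $s_i$ and $s_j$ are complementary, otherwise $\rho(i,j) = 0$.
- $N_{1,n} = BP_{max}$.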
Nussinov et al. (1978) Algorithms for loop matching, SIAM J. Appl. Math.
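The recursion translates almost line for line into a dynamic programming table filled by increasing subsequence length. The Python sketch below returns only the maximal number of base-pairs (no traceback). Treating G·U as complementary alongside the Watson-Crick pairs is an assumption, as are the function and parameter names.

```python
# Bases counted as complementary: Watson-Crick pairs plus the G.U wobble
# (including the wobble here is an assumption for this sketch).
COMPLEMENTARY = {("A", "U"), ("U", "A"), ("G", "C"),
                 ("C", "G"), ("G", "U"), ("U", "G")}


def nussinov(seq, min_sep=3):
    """Maximal number of base-pairs in seq; N[i][j] = 0 when j - i < min_sep."""
    n = len(seq)
    if n == 0:
        return 0
    N = [[0] * n for _ in range(n)]
    for span in range(min_sep, n):          # j - i, shortest spans first
        for i in range(n - span):
            j = i + span
            rho = 1 if (seq[i], seq[j]) in COMPLEMENTARY else 0
            best = max(N[i + 1][j - 1] + rho,   # i, j pair
                       N[i + 1][j],             # i unpaired
                       N[i][j - 1])             # j unpaired
            for k in range(i + 1, j):           # bifurcation
                best = max(best, N[i][k] + N[k + 1][j])
            N[i][j] = best
    return N[0][n - 1]


print(nussinov("GGGAAACCC"))  # three G.C pairs closing an AAA loop
```

The three nested loops over `span`, `i` and `k` give the cubic running time, and the table `N` gives the quadratic memory use.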
Structure prediction: Nussinov
- There are a few problems with this approach:
  - The solution to Nussinov is frequently not unique. For example, the 77-nucleotide tRNAhis has 22 base-pairs in the phylogenetic structure, yet there are 149,126 structures with the maximal number of 26 base-pairs!
- The method ignores stacking interactions.
Fontana (2002) Modelling ‘evo-devo’ with RNA. BioEssays.
Structure prediction: Zuker
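- Nearest neighbour model
  - Modified Nussinov algorithm to find minimal free energy (most stable) structures

[Figure: hairpin with four stacked base-pairs (A·U, C·G, U·A, G·C), giving three stacks S1, S2, S3, closed by a loop L.]

\[ \text{Free energy} = L + S_1 + S_2 + S_3 = 5.00 - 2.11 - 2.35 - 2.24 = -1.70 \text{ kcal/mol} \]
\[ \Delta G_{\text{stack}} = \Delta H_{37,\text{stack}} - T \Delta S_{37,\text{stack}} \]
\[ \Delta G_{\text{loop}} = -T \Delta S_{37,\text{loop}} \]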
Tinoco et al. (1971) Estimation of secondary structure in RNA. Nature.
Structure prediction: Zuker
WX\YZ    CG       GC       AU       UA       GU       UG
CG      -3.26    -2.36    -2.11    -2.08    -1.41    -2.11
GC      -3.42    -3.26    -2.35    -2.24    -1.53    -2.51
AU      -2.24    -2.08    -0.93    -1.10    -0.55    -1.36
UA      -2.35    -2.11    -1.33    -0.93    -1.00    -1.27
GU      -2.51    -2.11    -1.27    -1.36    +0.47    +1.29
UG      -1.53    -1.41    -1.00    -0.55    +0.30    +0.47

- Energies (ΔG in kcal/mol) of 5'-WX-3' / 3'-YZ-5' stacked base-pairs.
- Note that ΔG of a 5'-WX-3' / 3'-YZ-5' stack is the same as that of a 5'-ZY-3' / 3'-XW-5' stack.
Mathews et al. (1999) Expanded sequence dependence of thermodynamic parameters improves prediction of RNA secondary structure. JMB.
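The worked hairpin energy from the previous slide (5.00 − 2.11 − 2.35 − 2.24 = −1.70 kcal/mol) can be reproduced from this table. The Python sketch below assumes that each row label is one base-pair and each column label the next base-pair along the helix, both written as the 5'-strand base followed by its partner. That reading is an assumption, chosen because it reproduces the slide's three stack values and the symmetry noted above. The hairpin orientation (5'-GUCA...UGAC-3') and the function name are likewise assumptions.

```python
PAIRS = ["CG", "GC", "AU", "UA", "GU", "UG"]
ROWS = [  # stacking energies in kcal/mol, copied from the table
    [-3.26, -2.36, -2.11, -2.08, -1.41, -2.11],
    [-3.42, -3.26, -2.35, -2.24, -1.53, -2.51],
    [-2.24, -2.08, -0.93, -1.10, -0.55, -1.36],
    [-2.35, -2.11, -1.33, -0.93, -1.00, -1.27],
    [-2.51, -2.11, -1.27, -1.36, +0.47, +1.29],
    [-1.53, -1.41, -1.00, -0.55, +0.30, +0.47],
]
# STACK[(first pair, next pair)] under the assumed row/column reading.
STACK = {(r, c): ROWS[i][k]
         for i, r in enumerate(PAIRS) for k, c in enumerate(PAIRS)}


def hairpin_energy(strand5, partners, loop_energy):
    """Sum nearest-neighbour stacks plus a loop term.

    strand5:  helix bases on the 5' side, read 5'->3'
    partners: the base paired with each of them, in the same order
    """
    pairs = [a + b for a, b in zip(strand5, partners)]
    stacks = [STACK[(pairs[k], pairs[k + 1])] for k in range(len(pairs) - 1)]
    return round(loop_energy + sum(stacks), 2), stacks


# Assumed hairpin: pairs G.C, U.A, C.G, A.U, with a loop term of 5.00.
energy, stacks = hairpin_energy("GUCA", "CAGU", loop_energy=5.00)
print(energy, stacks)
```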
Suboptimal structures
“There is an embarrassing abundance of structures having a free energy near that of the optimum.” (McCaskill 1990)

[Figure: free energy ΔG (kcal/mol, axis from −22 to −20) plotted against base-pair distance to the MFE structure, d_BP(S_i, S_mfe) (axis from −5 to 35), with drawings of the biological structure, a suboptimal structure and the MFE structure.]
Wuchty et al. (1999) Complete suboptimal folding of RNA and the stability of secondary structures, Biopolymers.
Accuracy of MFE predictions
Non-independent benchmarks:
- Walter et al. (1994): mean sensitivity 63.6%
- Mathews et al. (1999): mean sensitivity 72.9%
Independent benchmarks:
- Doshi et al. (2004): mean sensitivity 41%
- Dowell & Eddy (2004): mean sensitivity 56%, mean PPV 48%
- Gardner & Giegerich (2004): mean sensitivity 56%, mean PPV 46%
Data-sets: tRNA, SSU rRNA, LSU rRNA, SRP, RNase P, tmRNA.
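Sensitivity and PPV are computed per structure by comparing predicted base-pairs with a reference structure: sensitivity = TP / (TP + FN) and PPV = TP / (TP + FP). A minimal Python sketch, assuming base-pairs are given as (i, j) index tuples and counting only exact matches (individual benchmarks may relax this, for example by tolerating slipped pairs); the function name and example pairs are made up for illustration:

```python
def sensitivity_ppv(predicted, reference):
    """Return (sensitivity, PPV) for two collections of (i, j) base-pairs."""
    predicted, reference = set(predicted), set(reference)
    true_pos = len(predicted & reference)       # pairs found in both
    sensitivity = true_pos / len(reference) if reference else 0.0
    ppv = true_pos / len(predicted) if predicted else 0.0
    return sensitivity, ppv


# Hypothetical example: 2 of 4 reference pairs recovered, 1 false positive.
reference = [(0, 19), (1, 18), (2, 17), (5, 14)]
predicted = [(0, 19), (1, 18), (3, 16)]
sens, ppv = sensitivity_ppv(predicted, reference)
print(f"sensitivity={sens:.2f} PPV={ppv:.2f}")
```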
Limitations of MFE predictions
- Energy parameters: estimated at constant salt concentrations and temperatures.
- Energy model: models of loop energies are extrapolated from relatively few experiments, no pseudoknots, ...
- Cellular environment: contains proteins, RNAs, DNAs, sugars, etc.
- Post-transcriptional modifications: many functional RNAs have been covalently modified.
- Folding kinetics: RNAs fold along “pathways”, perhaps becoming trapped in sub-optimal conformations.
- Co-transcriptional folding: RNAs fold during transcription; the transcriptional apparatus occludes 3' portions of the sequence.
- Transcription is jerky: transcriptional pausing can influence folding.
Comparative sequence analysis
- Input: a set of sequences with the same biological function which are assumed to have approximately the same structure.
- Output: the common structural elements, aligned sequences and a phylogeny which best explains the observed data.
[Figure: phylogeny of sequences 1 to 5.]

>1
GCAUCCAUGGCUGAAUGGUUAAAGCGCCCAACUCAUAAUUGGCGAACUCGCGGGUUCAAUUCCUGCUGGAUGCA
>2
GCAUUGGUGGUUCAGUGGUAGAAUUCUCGCCUGCCACGCGGGAGGCCCGGGUUCGAUUCCCGGCCAAUGCA
>3
UGGGCUAUGGUGUAAUUGGCAGCACGACUGAUUCUGGUUCAGUUAGUCUAGGUUCGAGUCCUGGUAGCCCAG
>4
GAAGAUCGUCGUCUCCGGUGAGGCGGCUGGACUUCAAAUCCAGUUGGGGCCGCCAGCGGUCCCGGGCAGGUUCGACUCCUGUGAUCUUCCG
>5
CUAAAUAUAUUUCAAUGGUUAGCAAAAUACGCUUGUGGUGCGUUAAAUCUAAGUUCGAUUCUUAGUAUUUACC

                   **               *
1 GCAUCCAUGGCUGAAU-GGUU-AAAGCGCCCAACUCAUAAUUGGCGAA--
2 GCAUUGGUGGUUCAGU-GGU--AGAAUUCUCGCCUGCCACGCGG-GAG--
3 UGGGCUAUGGUGUAAUUGGC--AGCACGACUGAUUCUGGUUCAG-UUA--
4 GAAGAUCGUCGUCUCC-GGUG-AGGCGGCUGGACUUCAAAUCCA-GU-UG
5 CUAAAUAUAUUUCAAU-GGUUAGCAAAAUACGCUUGUGGUGCGU-UAA--

                           **** * **
1 ------------------CUCGCGGGUUCAAUUCCUGCUGGAUGC-A
2 ------------------G-CCCGGGUUCGAUUCCCGGCCAAUGC-A
3 ------------------G-UCUAGGUUCGAGUCCUGGUAGCCCA-G
4 GGGCCGCCAGCGGUCCCG--GGCAGGUUCGACUCCUGUGAUCUUCCG
5 ------------------A-UCUAAGUUCGAUUCUUAGUAUUUAC-C

[Figure: consensus secondary structure of the alignment, annotated with IUPAC consensus codes.]
Comparative sequence analysis
- Evolution of RNA sequences
- Base-pairs that covary have strong evolutionary support

[Figure, panels a to c: an alignment of four sequences sharing one secondary structure, with the consensus, the most informative sequence (MIS) and the ancestral sequence drawn on that structure. A further diagram links the base-pairs G·U, A·U, G·C, U·G, C·G and U·A by substitutions labelled “fast” or “slow”.]

(((..(((....)))..)))
UACAAGAGUGCGCUUAAGUA
UGCAAAAGUCCGUUUAAGCA
UAUAACCUUUCGAGGAAAUA
CAUAAUAAUGCGUUGAAGUG

MIS        YAUAANADUGCGUUGAAGUR
Ancestral  UACAAGAGUGCGUUUAAGUA
Consensus  YRYAASMGUSCGYKKAAGYR
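The evolutionary support for each base-pair can be read off the alignment by machine. The Python sketch below parses the dot-bracket structure into paired columns and lists the base-pairs each of the four sequences has at those columns. The structure and sequences are the ones shown in the figure; the helper names `pair_table` and `observed_pairs` are made up for this illustration.

```python
VALID = {"AU", "UA", "GC", "CG", "GU", "UG"}  # Watson-Crick plus G.U wobble

structure = "(((..(((....)))..)))"
alignment = [
    "UACAAGAGUGCGCUUAAGUA",
    "UGCAAAAGUCCGUUUAAGCA",
    "UAUAACCUUUCGAGGAAAUA",
    "CAUAAUAAUGCGUUGAAGUG",
]


def pair_table(dot_bracket):
    """Return the sorted list of (i, j) columns paired in a dot-bracket string."""
    stack, pairs = [], []
    for pos, symbol in enumerate(dot_bracket):
        if symbol == "(":
            stack.append(pos)
        elif symbol == ")":
            pairs.append((stack.pop(), pos))
    return sorted(pairs)


def observed_pairs(alignment, i, j):
    """Base-pair (as a two-letter string) seen at columns i, j in each sequence."""
    return [seq[i] + seq[j] for seq in alignment]


for i, j in pair_table(structure):
    seen = observed_pairs(alignment, i, j)
    all_valid = all(pair in VALID for pair in seen)
    # Several distinct valid pairs at one position means the columns changed
    # while pairing was maintained.
    print(i + 1, j + 1, seen, "distinct:", len(set(seen)), "all pair:", all_valid)
```

Every paired position in this alignment keeps a valid base-pair in all four sequences while the bases themselves change, which is the pattern the slide describes as strong evolutionary support.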
Alignment Folding: RNAalifold
- Generate an alignment (e.g. with ClustalW)
- Find a consensus structure that is both energetically stable in all sequences and has covariation support

[Figure: base-pair dot plot with the sequence GCGGAAUUAGCUCAGUU_GGGAGAGCGCCAGACUGAAAAUCUGGAGGUCCCC_GGUUCGAAUCCCGGAAUCCGCA along its axes, and the predicted consensus secondary structure annotated with IUPAC consensus codes.]
Hofacker et al. (2002) Secondary Structure Prediction for Aligned RNA Sequences, J.Mol.Biol.
Alignment Folding: RNAalifold
RNAalifold: energy + covariation.
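\[ \beta_{i,j} = \frac{1}{N} \sum_{\alpha}^{N} Z^{\alpha}_{i,j} - \mathrm{Cov} \]
\[ C_{i,j} = \frac{2}{N(N-1)} \sum_{b^{\alpha}_i b^{\alpha}_j,\; b^{\beta}_i b^{\beta}_j} D_H\!\left( b^{\alpha}_i b^{\alpha}_j,\; b^{\beta}_i b^{\beta}_j \right) \Pi^{\alpha}_{ij}\, \Pi^{\beta}_{ij} \]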
Hofacker et al. (2002) Secondary Structure Prediction for Aligned RNA Sequences, J.Mol.Biol.
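The covariation term can be sketched in a few lines of Python. The sketch assumes that the sum runs over all pairs of sequences α < β, that $D_H$ is the Hamming distance between the two dinucleotides, and that $\Pi^{\alpha}_{ij}$ is 1 when sequence α can form a Watson-Crick or G·U pair at columns i and j and 0 otherwise. These readings, the function names and the small example alignment are assumptions for illustration, not the RNAalifold implementation.

```python
from itertools import combinations

VALID = {"AU", "UA", "GC", "CG", "GU", "UG"}  # assumed definition of Pi = 1


def hamming(p, q):
    """Number of positions at which two equal-length strings differ."""
    return sum(a != b for a, b in zip(p, q))


def covariation(alignment, i, j):
    """C_ij: mean Hamming distance between the base-pairs at columns i, j,
    taken over all sequence pairs in which both sequences can pair."""
    pairs = [seq[i] + seq[j] for seq in alignment]
    n = len(pairs)
    total = sum(hamming(p, q)
                for p, q in combinations(pairs, 2)
                if p in VALID and q in VALID)
    return 2.0 * total / (n * (n - 1))


# Small example alignment (four sequences, 0-based columns below).
alignment = [
    "UACAAGAGUGCGCUUAAGUA",
    "UGCAAAAGUCCGUUUAAGCA",
    "UAUAACCUUUCGAGGAAAUA",
    "CAUAAUAAUGCGUUGAAGUG",
]
print(covariation(alignment, 0, 19))  # pairs UA, UA, UA, CG
print(covariation(alignment, 3, 4))   # conserved, non-pairing AA columns
```

Columns that keep pairing while changing bases score high, while conserved or non-pairing columns score zero.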
Covariation metrics
Lindgreen, Gardner & Krogh (2006) Measuring covariation in RNA alignments: physical realism improves information measures. Bioinformatics.
Rfam: annotation hierarchy
[Figure: the Rfam annotation hierarchy, with levels Types, Clans, Families and Sequences.]
Top-level types: Gene, Cis-reg., Intron.
Other type labels shown: ribozyme, tRNA, CD-box_snoRNA, splicing, thermoregulator, leader, HACA-box_snoRNA, scaRNA, IRES, frameshift_element, sRNA, riboswitch, antisense, rRNA, miRNA, CRISPR, snRNA, snoRNA.
Building an Rfam family
- A structure from literature
- An Rfam family: produced manually from publication figures
Relevant reading
- Reviews:
  - Eddy SR (2004) How do RNA folding algorithms work? Nature Biotechnology.
- Methods:
  - Hofacker et al. (2002) Secondary Structure Prediction for Aligned RNA Sequences, J.Mol.Biol.