BackgroundTurbo-Decoding RNA Secondary Structure
References
Turbo-Decoding of RNA Secondary Structure
Gaurav Sharma
Department of Electrical and Computer Engineering
Department of Biostatistics and Computational Biology
June 18, 2013
Outline
◮ Turbo-decoding in Communications: A Quick Review
◮ RNA Structure Analysis: Motivation and Background
  ◮ RNA, noncoding RNA, RNA structure and its significance
  ◮ RNA structure prediction
    ◮ Single/Multiple sequence methods
◮ Turbo-decoding RNA secondary structure
  ◮ Iterative probabilistic decoding of structures of multiple homologs: TurboFold
  ◮ Turbo-decoding: RNA vs communications
◮ Conclusions
◮ Ongoing related work
Convolutional Codes
◮ Encoder
[Figure: convolutional encoder built from delay elements D, mapping the messageword symbol streams m1(D), m2(D), m3(D) to the codeword symbol streams c1(D), c2(D), c3(D), c4(D)]
◮ Finite state machine
  ◮ Output and next state are functions of current state and inputs
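The finite-state-machine view can be sketched in a few lines. This is a minimal sketch under assumptions, not the encoder drawn on the slide: it assumes a rate-1/2 feedforward code with two delay elements and the generator polynomials (7, 5) in octal, and the name `conv_encode` is hypothetical.

```python
def conv_encode(bits, generators=(0b111, 0b101), memory=2):
    """Encode a list of bits with a feedforward convolutional code.

    Assumption: the (7, 5) octal generators with two delay elements,
    chosen only for illustration.
    """
    state = 0  # contents of the delay elements (most recent input in the top bit)
    out = []
    for b in bits:
        reg = (b << memory) | state  # current input followed by the stored bits
        for g in generators:
            # Each output bit is the parity of the register positions tapped by g.
            out.append(bin(reg & g).count("1") % 2)
        state = reg >> 1  # shift register: the oldest bit drops out
    return out


# Impulse response of the assumed code: one input bit spreads over three steps.
print(conv_encode([1, 0, 0]))
```

The output at each step depends only on the current input and `state`, which is exactly the "output and next state are functions of current state and inputs" property.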
ML Decoding: Convolutional Code
◮ Convolutional code structure constrains possibilities to a trellis
[Figure: code trellis with branch labels (00, 01, 10, 11), the received (Rx) bits, and the decoded message bits along the selected path]
◮ ML decoding: most likely path through the trellis given the received information
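Finding the most likely trellis path is usually done with the Viterbi algorithm. The sketch below is illustrative only: it assumes the same hypothetical rate-1/2 (7, 5) code as a stand-in for the slide's code, and hard decisions over a binary symmetric channel, so that the most likely path is the one at minimum Hamming distance from the received bits. The names `step`, `encode`, and `viterbi` are assumptions.

```python
def step(state, bit, generators=(0b111, 0b101), memory=2):
    """One transition of the encoder state machine: (next_state, output bits)."""
    reg = (bit << memory) | state
    out = tuple(bin(reg & g).count("1") % 2 for g in generators)
    return reg >> 1, out


def encode(bits):
    """Run the state machine over a message, starting from the all-zero state."""
    state, out = 0, []
    for b in bits:
        state, o = step(state, b)
        out.extend(o)
    return out


def viterbi(received, memory=2):
    """Hard-decision ML decoding: trellis path closest to `received`."""
    n_states = 1 << memory
    inf = float("inf")
    cost = [0.0] + [inf] * (n_states - 1)  # the encoder starts in state 0
    paths = [[] for _ in range(n_states)]
    for t in range(0, len(received), 2):
        r = received[t:t + 2]
        new_cost = [inf] * n_states
        new_paths = [None] * n_states
        for s in range(n_states):
            if cost[s] == inf:
                continue  # state not reachable yet
            for bit in (0, 1):
                nxt, out = step(s, bit)
                c = cost[s] + sum(a != b for a, b in zip(out, r))
                if c < new_cost[nxt]:  # keep only the survivor into each state
                    new_cost[nxt] = c
                    new_paths[nxt] = paths[s] + [bit]
        cost, paths = new_cost, new_paths
    best = min(range(n_states), key=lambda s: cost[s])
    return paths[best]


message = [1, 0, 1, 1, 0, 0, 0]
noisy = encode(message)
noisy[1] ^= 1  # flip one received bit
print(viterbi(noisy) == message)
```

Because only one survivor is kept per state at each step, the work grows linearly with the message length rather than with the number of possible messages.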
Turbo Decoding in Communications
◮ An encoder construction
Figure: Systematic convolutional encoder (recursive). The messageword m(l) = (m0(l), ..., mk−1(l)) passes through the systematic subunit, the parity computation subunit P(D) produces the parity p(l), and together they form the codeword c(l).
Figure: Two encoders. P1(D) maps m(l) to the parity p1(l); P2(D) maps m(l) to the parity p2(l).
Figure: Parallel concatenation. m(l) feeds P1(D) directly and P2(D) through the interleaver Π; m(l), p1(l), and p2(l) form the codeword c(l).
Turbo Decoding in Communications
Figure: Encoder + channel. The message m̃ and the parities p̃1 (from P1(D)) and p̃2 (from P2(D), after the interleaver Π) pass through the channel to give the received streams r̃.
Figure: Iterative decoder. Iterative phase: a forward-backward algorithm updates the extrinsic information for code 1 and for code 2 from the received streams r̃0, r̃1, r̃2, exchanging $\{T_l^{(t)}(m)\}_{l=0}^{L-1}$ and $\{U_l^{(t)}(m)\}_{l=0}^{L-1}$ through Π and Π⁻¹. Final decoding step (t = T): approximate MAP decoding gives $\{\hat{m}^{(l)}\}_{l=0}^{L-1} \equiv \hat{\tilde{m}}$.
Symbol-wise MAP Decoding: Convolutional Code
◮ Convolutional code structure constrains possibilities to a trellis
[Figure: code trellis with branch labels (00, 01, 10, 11) and the received (Rx) bits]
◮ Most probable value of a bit for all possible paths through the trellis, given the received information
◮ Localized probabilistic information
Turbo Decoding in Communications: Observations
◮ Multiple encodings of same message information
◮ Joint (optimal) decoding desirable
  ◮ Exact joint decoding ≈ exponential complexity
  ◮ Computationally efficient decoding: iterative approximation (belief propagation)
    ◮ Localized MAP probabilistic formulation
    ◮ Decomposition into loosely coupled individual decodings + information exchange at each iteration
    ◮ Linear complexity in length of data
    ◮ Pseudo-prior interpretation
RNA?
What does this have to do with RNA?
RNA: Ribonucleic Acid
◮ Nucleic acid made of a long chain of units named nucleotides: nitrogenous base, ribose sugar, phosphate
◮ Adjacent nucleotides linked together by strong (covalent) phosphodiester bonds between sugar and phosphate
◮ Information encoded with 4 different types of nucleotides differentiated by base content: Adenine, Guanine, Cytosine, Uracil
http://www.genome.gov
[Figure: DNA (deoxyribonucleic acid) vs. RNA (ribonucleic acid), showing base pairs, the sugar-phosphate backbone, and the nitrogenous bases (A, U, C, G in RNA; A, T, C, G in DNA): Adenine, Guanine, Cytosine, Thymine, and Uracil, which replaces Thymine in RNA]
The Central Dogma
[Figure: DNA → mRNA (transcription) → mature mRNA → transport to cytoplasm across the nuclear membrane → translation at the ribosome (tRNA, codon, anti-codon) → amino acid chain (protein)]
◮ Genetic information flows unidirectionally:
  ◮ DNA → RNA → Protein
◮ RNA plays a passive role
  ◮ Transient copy created for protein synthesis
http://www.genome.gov
RNA an Active Player: ncRNAs
[Figure: ncRNAs in the flow from DNA template to protein: transcription, pre-mRNA, self-splicing intron, mRNA codons (AUG, UAA, UAG), tRNA anti-codon, ribosomal RNA, micro RNA]
◮ ncRNAs: play direct functional roles in cellular processes
  ◮ w/o translation to protein ⟹ “noncoding”
◮ Increasing numbers (being) discovered
◮ 1989 Nobel Prize in Chemistry: Ribozymes
  ◮ Thomas Cech and Sidney Altman
◮ 2006 Nobel Prize in Physiology/Medicine: siRNA
  ◮ Andrew Fire and Craig Mello
Noncoding RNAs (ncRNAs): Examples
◮ Commonly known ncRNAs
  ◮ Protein synthesis: tRNA, rRNA
  ◮ RNA modification: snoRNAs
◮ Up/down regulation of gene expression
  ◮ Regulation of transcription
  ◮ siRNA/miRNA: post-transcriptional regulation (silencing) of genes
  ◮ piRNAs: regulation of retrotransposons
◮ RNA splicing (autocatalysis)
◮ Many more: ...
◮ RNA genomes (many viruses, including HIV and SIV)
◮ ncRNAs and diseases
  ◮ Abnormal expression of ncRNAs observed in cancerous cells
  ◮ Prader-Willi Syndrome (over-eating and learning disabilities)
  ◮ Autism, Alzheimer’s, ...
Noncoding RNAs (ncRNAs)
◮ RNA molecules that directly play functional roles in cellular processes
◮ Do not code for protein synthesis ⟹ “noncoding”
◮ Structure determines function in noncoding roles
◮ Determination of structure is of significant interest
  ◮ Further understanding of ncRNA function
  ◮ Enhances understanding of cellular processes and interactions
  ◮ Provides targets for drug design
Computational Prediction of RNA Structure
◮ Structure determines function in noncoding roles
◮ Experimental determination of structure is challenging
  ◮ X-ray crystallography
  ◮ Crystallization difficult and expensive
◮ Computational estimation of structure is of significant interest
  ◮ Understanding ncRNA function in cellular processes and interactions
  ◮ Genome understanding: structure-based ncRNA gene search
  ◮ Therapeutics: targets for drug design
RNA Structure Hierarchy [Tinoco and Bustamante, 1999]
AAUUGCGGGAAAGGGGUCAA
CAGCCGUUCAGUACCAAGUC
UCAGGGGAAACUUUGAGAUG
GCCUUGCAAAGGGUAUGGUA
AUAAGCUGACGGACAUGGUC
CUAACCACGCAGCCAAGUCC
UAAGUCAACAGAUCUUCUGU
UGAUAUGGAUGCAGUUCA
Figure: Hierarchy of RNA structure formation [Waring and Davies, 1984, Doudna and Cech, 2002, Doudna and Cate, 1997]. Recoverable labels: helices P4, P5, P5a, P5b, P5c, P6, P6a, P6b; nucleotide indices 120 to 260; 5’ and 3’ ends.
RNA Secondary Structure
◮ Folding of RNA linear molecular chain onto itself with base pairing rules
◮ Formation of hydrogen bonds between nucleotides
  ◮ Canonical base pairs
    ◮ A can pair with U
    ◮ G can pair with C and U
    ◮ G-U pair called non-Watson-Crick pair
◮ Greater variety of structures than the DNA double helix
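The pairing rules can be written down directly. The following is a minimal sketch: the function names and the index-pair representation of a structure are assumptions made for illustration, and the check covers only the pairing rules and the requirement that a nucleotide takes part in at most one pair.

```python
# Pairs allowed by the rules above: A-U, G-C, and the G-U pair.
ALLOWED_PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def can_pair(a, b):
    """True if nucleotides a and b may form a base pair."""
    return (a.upper(), b.upper()) in ALLOWED_PAIRS


def is_valid_structure(seq, pairs):
    """Check a structure given as a list of 0-based index pairs (i, j).

    Every pair must obey the pairing rules and every nucleotide may
    appear in at most one pair.
    """
    used = set()
    for i, j in pairs:
        if i in used or j in used or i == j:
            return False
        if not can_pair(seq[i], seq[j]):
            return False
        used.update((i, j))
    return True


print(is_valid_structure("GGGAAAUCC", [(0, 8), (1, 7), (2, 6)]))
```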
RNA Secondary Structure Elements
Figure: Structural elements of the LGW17 sequence from the RNase P Database [Brown, 1999]: hairpin loop, bulge loop, internal loop, multibranch loop, pseudoknot.
RNA Structure: Thermodynamics
◮ Equilibrium: Boltzmann Distribution of structures
[Figure: the unpaired state in equilibrium with the structures $S_1, S_2, \ldots, S_k, \ldots$; structure $S_k$ has weight $\exp(-\Delta G^\circ(S_k)/RT)$]
◮ The lower $\Delta G^\circ(S_k)$, the higher the probability of $S_k$
◮ Most likely structure → minimization of free energy
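A small numeric sketch makes the link between free energy and probability concrete. The free energy values below are made up for illustration, the temperature is assumed to be 37 °C, and the function name is hypothetical.

```python
import math

R = 1.987e-3  # gas constant in kcal/(mol K)
T = 310.15    # assumed temperature: 37 degrees C in kelvin


def boltzmann_probabilities(free_energies, rt=R * T):
    """Equilibrium probability of each structure from its free energy change."""
    weights = [math.exp(-g / rt) for g in free_energies]
    z = sum(weights)  # partition function over the listed structures
    return [w / z for w in weights]


# Made-up example: the unpaired state (0 kcal/mol) and two folded structures.
for g, p in zip([0.0, -1.0, -3.0], boltzmann_probabilities([0.0, -1.0, -3.0])):
    print(f"dG = {g:5.1f} kcal/mol  ->  P = {p:.4f}")
```

The structure with the lowest free energy change always receives the largest probability, which is why the most likely structure is the minimum free energy structure.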
Modeling RNA Thermodynamics: Nearest neighbor model
◮ Nearest neighbor model [Xia et al., 1998, Mathews et al., 1999]
  ◮ Computational model for free energy change of RNA structure
  ◮ Experimentally determined free energy terms for each nearest neighbor interaction in secondary structure
  ◮ Loop decomposition
[Figure: example structure (5’ to 3’) annotated with its nearest neighbor terms]
  1 nucleotide bulge loop: +3.3 kcal/mol
  Stacked loop (UA/GC): −1.8 kcal/mol
  4 nucleotide hairpin loop: +5.9 kcal/mol
  Terminal mismatch (GC/AA): −1.1 kcal/mol
  5’ dangling nucleotide: −0.3 kcal/mol
  Stacked loop (AU/CG): −2.1 kcal/mol
  Stacked loop (CG/AU): −1.8 kcal/mol
  Stacked loop (AU/UA): −0.9 kcal/mol
  Stacking of base pairs (GC/GC): −2.9 kcal/mol
  Stacked loop (GC/GC): −2.9 kcal/mol
Figure: Total free energy change is the summation of all nearest neighbor energies [Durbin et al., 1999]
RNA ML Decoding of Structure: Single Sequence
◮ Most likely or minimum free energy structure, given sequence
◮ Dynamic programming: MFold [Zuker, 1989], $O(N^3)$ complexity
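The shape of the dynamic program can be illustrated with a much simpler objective. The sketch below is not MFold and does not use the nearest neighbor energies: it substitutes Nussinov-style base pair maximization, which shares the $O(N^3)$ recursion over subsequences. The minimum hairpin loop size of 3 and the function names are assumptions.

```python
ALLOWED_PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def nussinov(seq, min_loop=3):
    """Return (max number of nested pairs, one optimal list of pairs)."""
    n = len(seq)
    best = [[0] * n for _ in range(n)]  # best[i][j]: max pairs within seq[i..j]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            score = best[i][j - 1]  # case: j is unpaired
            for k in range(i, j - min_loop):  # case: j pairs with k
                if (seq[k], seq[j]) in ALLOWED_PAIRS:
                    left = best[i][k - 1] if k > i else 0
                    score = max(score, left + 1 + best[k + 1][j - 1])
            best[i][j] = score

    # Traceback: recover one structure that achieves the optimum.
    pairs, stack = [], [(0, n - 1)]
    while stack:
        i, j = stack.pop()
        if j - i <= min_loop:
            continue  # too short to contain a pair
        if best[i][j] == best[i][j - 1]:
            stack.append((i, j - 1))
            continue
        for k in range(i, j - min_loop):
            if (seq[k], seq[j]) in ALLOWED_PAIRS:
                left = best[i][k - 1] if k > i else 0
                if left + 1 + best[k + 1][j - 1] == best[i][j]:
                    pairs.append((k, j))
                    if k > i:
                        stack.append((i, k - 1))
                    stack.append((k + 1, j - 1))
                    break
    return (best[0][n - 1] if n else 0), sorted(pairs)


print(nussinov("GGGAAAUCC"))
```

Two nested loops over (i, j) and an inner loop over k give the cubic running time; a free energy minimizer replaces the "+1 per pair" score with loop energies but keeps the same decomposition idea.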
RNA MAP Decoding of Structure
◮ Posterior probability of base pairing, given sequence
◮ Dynamic programming [McCaskill, 1990]; MFold, RNAfold ($O(N^3)$ in time, $O(N^2)$ in space)
◮ Localized probabilistic information
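What a base pairing posterior means can be shown by brute force on a short sequence. This sketch is not the McCaskill dynamic program: it enumerates every pseudoknot-free structure explicitly, which is only feasible for very short sequences. The toy energy model (a fixed −2 kcal/mol per base pair), the minimum hairpin loop of 3, RT ≈ 0.616 kcal/mol, and all names are assumptions.

```python
import math
from collections import defaultdict

ALLOWED_PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def structures(seq, i, j, min_loop=3):
    """Yield every pseudoknot-free structure on seq[i..j] as a tuple of pairs."""
    if j - i < min_loop + 1:
        yield ()  # too short to hold a pair: only the open chain
        return
    # Case 1: nucleotide i is unpaired.
    yield from structures(seq, i + 1, j, min_loop)
    # Case 2: nucleotide i pairs with some k, splitting the rest in two.
    for k in range(i + min_loop + 1, j + 1):
        if (seq[i], seq[k]) in ALLOWED_PAIRS:
            for inner in structures(seq, i + 1, k - 1, min_loop):
                for outer in structures(seq, k + 1, j, min_loop):
                    yield ((i, k),) + inner + outer


def pair_probabilities(seq, pair_energy=-2.0, rt=0.616):
    """Posterior probability of each base pair under the toy energy model."""
    z = 0.0
    weight_with_pair = defaultdict(float)
    for s in structures(seq, 0, len(seq) - 1):
        w = math.exp(-pair_energy * len(s) / rt)  # Boltzmann weight of s
        z += w
        for p in s:
            weight_with_pair[p] += w
    return {p: w / z for p, w in weight_with_pair.items()}


probs = pair_probabilities("GGGAAAUCC")
for pair, p in sorted(probs.items(), key=lambda kv: -kv[1])[:3]:
    print(pair, round(p, 3))
```

Each probability sums the weights of all structures that contain the pair, so it is a per-pair ("localized") quantity rather than a statement about a single structure.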
RNA Structure Prediction (Single Sequence)
RD0260 Sequence
GCGACCGGGGCUGGCUUGGUAAUGGUACUCCCCUGUCACGGGAGAGAAUGUGGGUUCAAAUCCCAUCGGUCGCGCCA
[Figure: predictions for RD0260. Panel labels: free energy minimization (predicted structure, −28.6 kcal/mol) and partition function computation (base pairing probability dot plots; both axes: RD0260 nucleotide indices)]
◮ Free energy minimization: “Hard” prediction
  ◮ Single predicted structure
◮ Base pairing probabilities: “Soft” prediction
  ◮ Thresholding may yield pseudo-knotted structures
  ◮ Maximum expected accuracy structure prediction [Do et al., 2006, Lu et al., 2009]
Structure Prediction for Multiple Sequences: Homologous ncRNAs
◮ Homologous ncRNAs
  ◮ Share evolutionary ancestor
  ◮ Serve same function
  ◮ Structural similarity in terms of topology of structures
[Figure: secondary structures of three homologous tRNAs: RD0260, Bacteriophage T5 (Asp); RD0500, Haloferax volcanii (Asp); RE2140, Synechocystis (Glu)]
Structure Prediction for Multiple Sequences: Homologous ncRNAs
RD0260 GCGACCGGGGCUGGCUUGGUA-AUGGUACUCCCCUGUCACGGGAGAGAAUGUGGGUUCAAAUCCCAUCGGUCGCGCCA
RD0500 GCCCGGGUGGUGUAGU-GGCCCAUCAUACGACCCUGUCACGGUCGUGA-CGCGGGUUCGAAUCCCGCCUCGGGCGCCA
RE2140 GCCCCCAUCGUCUAGA-GGCCUAGGACACCUCCCUUUCACGGAGGCGA-CAGGGAUUCGAAUUCCCUUGGGGGUACCA
[Figure: secondary structures of RD0260, RD0500, and RE2140 with corresponding helices extracted and matched to the aligned columns above (5’ to 3’)]
◮ “Common” structures and conforming sequence alignment
◮ Joint estimation can harness comparative structure and sequence information across homologs
Multiple Sequence RNA Structure Prediction
[Figure: multiple structure prediction block. Inputs: sequences $x_1, x_2, \ldots, x_K$ of lengths $N_1, N_2, \ldots, N_K$. Outputs: structures $S_1, S_2, \ldots, S_K$ and, optionally, an alignment A]
◮ Sankoff’s dynamic programming algorithm [Sankoff, 1985]
  ◮ Simultaneous folding (pseudo-knot free) and alignment of K sequences
  ◮ Time (memory) complexity: $O(N^{3K})$ ($O(N^{2K})$)
  ◮ Computationally infeasible even for short sequences and K = 2 w/o cutting corners
Turbo-Decoding RNA Secondary Structure
◮ Goal: performance similar to (or “better” than) joint estimation, with complexity similar to single sequence computation
◮ Probabilistic formulation of folding and alignment
  ◮ Base pairing probabilities, posterior alignment probabilities
◮ Iteratively update each using information from the other
  ◮ TurboFold [Harmanci et al., 2007, 2011]
[Figure: TurboFold block. Inputs: sequences $x_1, x_2, \ldots, x_K$ of lengths $N_1, N_2, \ldots, N_K$. Outputs: base pairing probabilities, one $N_k \times N_k$ matrix per sequence]
Structural Alignment: Joint Representation of Structures and Alignment
◮ Two sequence case
[Figure: structural alignment of Sequence 1 and Sequence 2 (5’ to 3’), showing Structure 1, Structure 2, and the co-incidence path for the alignment]
Decoupled Probabilistic Representation for RD0260, RD0500 Structural Alignment
◮ Formulate in probabilistic framework and separate the folding/alignment representations
[Figure: three matrices. Base pairing probabilities for RD0260 (via partition function); co-incidence probabilities between RD0260 and RD0500 (via HMM); base pairing probabilities for RD0500 (via partition function)]
TurboFold
Given K homologous RNA sequences, each sequence contains some information about folding of every other sequence.
◮ Extrinsic information for a sequence
  ◮ The information about folding of a sequence that is computed using base pairing probabilities of other sequences
  ◮ Thermodynamic model + alignment model
◮ Base pairing probabilities of a sequence (intrinsic information)
  ◮ From sequence itself
  ◮ Thermodynamic model
◮ Iterative updates:
  ◮ Compute extrinsic information using base pairing probabilities and alignment co-incidence probabilities
  ◮ Update base pairing probabilities using updated extrinsic information
  ◮ Update extrinsic information using updated base pairing probabilities
  ◮ . . .
Extrinsic Information for Base Pairing for RD0260
[Figure: extrinsic information for RD0260 in the two-sequence case. The initial base pairing probabilities of RD0500, ${}^{0}_{p}\Pi^{RD0500}$, are combined with the co-incidence probabilities ${}^{0}_{c}\Pi^{RD0500,RD0260}$ and their transpose, then weighted by $(1 - \psi_{RD0500,RD0260})$. Panels: base pairing probabilities of RD0260 and RD0500 and the co-incidence probabilities between them]
3 Sequences
[Figure: base pairing probability matrices for the three sequences RD0260, RD0500, and RE2140, each plotted against its own nucleotide indices]
Base Pairing Proclivity Matrix for RD0260 Induced by RE2140
[Figure: the base pairing probabilities of RE2140, multiplied by the co-incidence probabilities between RD0260 and RE2140 and by their transpose, give the base pairing proclivity matrix for RD0260 induced by RE2140]
◮ Information in RE2140 about folding of RD0260
$${}^{t}_{p}\tilde{\Pi}^{(s \to m)} = {}_{c}\Pi^{(m,s)} \; {}^{t-1}_{p}\Pi^{s} \; \left({}_{c}\Pi^{(m,s)}\right)^{T} \qquad (1)$$
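Equation (1) is a product of three matrices, which can be sketched with plain lists. This is an illustrative sketch only: the helper names are hypothetical, and the tiny matrices are made up rather than taken from the tRNA example.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def proclivity(coinc_ms, pairing_s):
    """Base pairing proclivity matrix for sequence m induced by sequence s.

    coinc_ms:  N_m x N_s co-incidence probabilities between m and s
    pairing_s: N_s x N_s base pairing probabilities of s
    Entry (i, j) of the result is sum over (k, l) of
    coinc_ms[i][k] * pairing_s[k][l] * coinc_ms[j][l].
    """
    return matmul(matmul(coinc_ms, pairing_s), transpose(coinc_ms))


# Made-up example: m has 3 nucleotides, s has 2; the first two positions of m
# align to s with certainty, the third aligns to nothing.
coinc = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
pairing = [[0.0, 0.9], [0.9, 0.0]]
for row in proclivity(coinc, pairing):
    print(row)
```

The pairing evidence for positions (k, l) of s is carried over to the positions (i, j) of m that are likely to be aligned with them, which is how one sequence informs the folding of another.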
3 Sequences: Extrinsic information Computation
[Figure: extrinsic information computation for RD0260 with three sequences. The proclivity matrix induced by RE2140 and the proclivity matrix induced by RD0500 are each weighted by (1 − similarity), combined, and normalized by the maximum to give the extrinsic information matrix for RD0260]
K Sequences: Extrinsic Information Computation for xm
[Figure: extrinsic information computation for $x_m$. For every other sequence $x_s$, $s \in N \setminus m$, the base pairing probabilities $\{{}^{t-1}_{p}\Pi^{s}\}_{s \in N \setminus m}$ and the co-incidence probabilities with $x_m$ give the induced base pairing proclivity matrices $\{{}^{t}_{p}\tilde{\Pi}^{(s \to m)}\}_{s \in N \setminus m}$, each of size $N_m \times N_m$. These are weighted by $(1 - \psi_{s,m})$, combined, and normalized by the maximum to give the extrinsic information ${}^{t}_{p}\tilde{\Pi}^{m}$ for base pairing of $x_m$]
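The weighting and normalization steps can be sketched as follows. This is a minimal sketch that assumes the weighted proclivity matrices are summed before normalization; the function name and the example numbers are hypothetical.

```python
def extrinsic_information(proclivities, similarities):
    """Combine proclivity matrices into extrinsic information for one sequence.

    proclivities: list of N_m x N_m matrices, one per other sequence s
    similarities: matching list of sequence similarities psi_{s,m} in [0, 1]
    Each matrix is weighted by (1 - psi), the weighted matrices are summed
    (an assumption of this sketch), and the result is divided by its maximum.
    """
    n = len(proclivities[0])
    total = [[0.0] * n for _ in range(n)]
    for mat, psi in zip(proclivities, similarities):
        for i in range(n):
            for j in range(n):
                total[i][j] += (1.0 - psi) * mat[i][j]
    peak = max(max(row) for row in total)
    if peak == 0.0:
        return total  # no evidence from the other sequences
    return [[v / peak for v in row] for row in total]


# Made-up proclivity matrices induced by two other sequences.
a = [[0.0, 0.2, 0.8], [0.2, 0.0, 0.0], [0.8, 0.0, 0.0]]
b = [[0.0, 0.4, 0.4], [0.4, 0.0, 0.0], [0.4, 0.0, 0.0]]
for row in extrinsic_information([a, b], [0.5, 0.75]):
    print([round(v, 3) for v in row])
```

Weighting by (1 − similarity) gives more influence to dissimilar sequences, which is where comparative information is most valuable.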
Base Pairing Probability Computation Using Extrinsic Information
◮ Modified Boltzmann distribution of secondary structures:
$$P(S) \propto \exp\left(-\frac{\Delta\tilde{G}(S)}{RT}\right) \qquad (2)$$
where
$$\Delta\tilde{G}(S) = \Delta G^\circ(S) - \gamma \sum_{(i,j) \in S} \log(\tilde{\pi}(i,j)) \qquad (3)$$
is the modified free energy change for structure S.
◮ $\tilde{\pi}(i,j)$: Extrinsic information for pairing of nucleotides at indices i and j
◮ $\gamma$: Weight of extrinsic information on modified free energy relative to $\Delta G^\circ(S)$
Extrinsic information introduced via a pseudo free energy for each base pair
Substituting (3) in (2):
$$P(S) \propto \underbrace{\exp\left(-\frac{\Delta G^\circ(S)}{RT}\right)}_{\text{Boltzmann distribution proportionality term}} \; \underbrace{\prod_{(i,j) \in S} \left(\tilde{\pi}(i,j)\right)^{\gamma/RT}}_{\text{Extrinsic information}}$$
◮ Base pair (i, j) has a pseudo prior probability of $(\tilde{\pi}(i,j))^{\gamma/RT}$ due to extrinsic information.
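The substitution step can be checked numerically. The sketch below uses made-up values for the free energy, the extrinsic information, and RT (about 0.616 kcal/mol at 37 °C); the function names are hypothetical.

```python
import math

RT = 0.616  # kcal/mol, approximate value at 37 degrees C (assumption)


def modified_free_energy(dg, pairs, extrinsic, gamma):
    """Free energy with a pseudo free energy term for each base pair."""
    return dg - gamma * sum(math.log(extrinsic[p]) for p in pairs)


def weight_direct(dg, pairs, extrinsic, gamma, rt=RT):
    """Boltzmann weight computed from the modified free energy."""
    return math.exp(-modified_free_energy(dg, pairs, extrinsic, gamma) / rt)


def weight_factored(dg, pairs, extrinsic, gamma, rt=RT):
    """Same weight written as Boltzmann term times pseudo priors."""
    prior = math.prod(extrinsic[p] ** (gamma / rt) for p in pairs)
    return math.exp(-dg / rt) * prior


# Made-up structure with two base pairs and their extrinsic information.
ext = {(0, 8): 0.9, (1, 7): 0.5}
pairs = [(0, 8), (1, 7)]
print(weight_direct(-3.0, pairs, ext, 0.3 * RT))
print(weight_factored(-3.0, pairs, ext, 0.3 * RT))
```

Both forms give the same weight, so the extrinsic information can be read either as an energy bonus per base pair or as a per-pair pseudo prior.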
TurboFold: Iterative Updates
[Figure: TurboFold iterations. The alignment probabilities feed the extrinsic information computation for the 1st, ..., mth, ..., Kth sequence (giving $\tilde{\Pi}_1, \ldots, \tilde{\Pi}_m, \ldots, \tilde{\Pi}_K$); each is followed by the base pairing probability computation for that sequence (giving $\Pi_1, \ldots, \Pi_m, \ldots, \Pi_K$), and the base pairing probabilities are fed back through a unit "time" delay for η iterations]
◮ For low K, per-iteration complexity is comparable to single sequence structure prediction
◮ Benefits from comparative analysis
TurboFold Structure Prediction Overview
◮ Obtain base pairing probabilities after η iterations, then predict structures
  ◮ Significant base pairs
  ◮ Maximum expected accuracy (MEA) structures
[Figure: input sequences $\{x_m\}_{m \in N}$ → TurboFold (η iterations) → base pairing probabilities $\{{}^{\eta}_{p}\Pi^{m}\}_{m \in N}$ → thresholding, or maximum expected accuracy prediction (TurboFold-MEA)]
Structure Prediction
◮ Structure for $x_m$ composed of base pairs with probabilities greater than $P_{thresh}$:
$$S^{*}_{m} = \{(i,j) : {}^{\eta}_{p}\pi^{m}(i,j) > P_{thresh}\} \qquad (4)$$
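Thresholding as in (4) is a one-line filter over the probability matrix. In this sketch the matrix is a made-up nested list, and the function name and default threshold are assumptions.

```python
def threshold_structure(pair_probs, p_thresh=0.5):
    """Base pairs (i, j), i < j, whose probability exceeds p_thresh."""
    n = len(pair_probs)
    return [(i, j)
            for i in range(n)
            for j in range(i + 1, n)
            if pair_probs[i][j] > p_thresh]


# Made-up symmetric base pairing probability matrix for 3 nucleotides.
probs = [[0.0, 0.1, 0.9],
         [0.1, 0.0, 0.0],
         [0.9, 0.0, 0.0]]
print(threshold_structure(probs, 0.5))
```

Raising `p_thresh` keeps fewer, more confident pairs; lowering it keeps more pairs at the cost of more false positives.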
TurboFold: Computational Complexity
◮ Initialization
  ◮ Computation of co-incidence matrices: $O(K^2 N^2)$
  ◮ Computation of sequence similarities: $O(K^2 N^2)$
◮ Iterations
  ◮ Extrinsic information computation: $O(\eta K^2 d^2 N^2)$
  ◮ Base pairing probability computation: $O(\eta K U N^3)$
◮ Structure prediction
  ◮ Thresholding: $O(K N^2)$
  ◮ MEA prediction: $O(K N^3)$
Compare to Sankoff’s algorithm: O(N3(U2d)K)
Evaluating Accuracy of Estimates
◮ Sensitivity: ratio of the number of correctly predicted base pairs to the total number of base pairs in the known structure
$$\text{Sensitivity} = \frac{\text{True Positive}}{\text{True Positive} + \text{False Negative}}$$
  ◮ Recall
◮ Positive Predictive Value (PPV): ratio of the number of correctly predicted base pairs to the total number of base pairs in the predicted structure
$$\text{PPV} = \frac{\text{True Positive}}{\text{True Positive} + \text{False Positive}}$$
  ◮ Precision
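Both measures follow directly from set operations on base pairs. This sketch assumes a predicted pair counts as correct only when it matches a known pair exactly; the function name and the example structures are hypothetical.

```python
def sensitivity_ppv(predicted, known):
    """Return (sensitivity, PPV) for a predicted set of base pairs.

    Assumption: a predicted pair is correct only on an exact (i, j) match.
    """
    predicted, known = set(predicted), set(known)
    true_positive = len(predicted & known)
    sensitivity = true_positive / len(known) if known else 0.0
    ppv = true_positive / len(predicted) if predicted else 0.0
    return sensitivity, ppv


# Made-up example: 2 of the 4 known pairs are found, 1 prediction is wrong.
predicted = [(0, 8), (1, 7), (3, 5)]
known = [(0, 8), (1, 7), (2, 6), (10, 20)]
print(sensitivity_ppv(predicted, known))
```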
Parameter Selection: Number of iterations, η
[Figure: Sensitivity vs. PPV over the 5S rRNA dataset with changing η; points labeled η = 0, 1, 2, 3, 4, 5 (x-axis: sensitivity, y-axis: PPV)]
◮ η = 3 is used in TurboFold
Parameter Selection: Weight of Extrinsic Information, γ
[Figure: Sensitivity vs. PPV over the 5S rRNA dataset with changing γ/RT; points labeled γ/RT = 0.001, 0.05, 0.1, 0.2, 0.3, 0.5, 0.8, 1.0, 1.2 (x-axis: sensitivity, y-axis: PPV)]
◮ γ = 0.3RT is used in TurboFold
Benchmarking Experiments: Datasets
◮ Randomly choose 200 RNase P, 400 5S rRNA, 400 SRP, and 400 tRNA sequences and divide into K combinations
◮ Choose and divide for K = 2, . . . , 10
◮ Yields 36 datasets
The datasets have significant diversity:
◮ RNase Ps: 336 nucleotides, 50% average pairwise identity
◮ tmRNA: 366 nucleotides, 45% average pairwise identity
◮ telomerase RNA: 445 nucleotides, 54% average pairwise identity
◮ SRPs: 187 nucleotides, 42% average pairwise identity
◮ tRNAs: 77 nucleotides, 47% average pairwise identity
◮ 5S rRNAs: 119 nucleotides, 63% average pairwise identity
Benchmarking Experiments
◮ TurboFold is benchmarked against methods that estimate base pairing probabilities:
  ◮ LocARNA [Will et al., 2007]
  ◮ RNAalifold [Bernhart et al., 2008]
  ◮ Single sequence partition function [Mathews, 2004]
◮ The set of base pairs with estimated probabilities higher than $P_{thresh}$ is scored
◮ Sensitivity is plotted versus PPV while varying $P_{thresh}$ between 0 and 1 with a step size of 0.04
Benchmarking Experiments
[Figure: Sensitivity vs. PPV ROC curves for TurboFold vs. three alternative methods (x-axis: sensitivity, y-axis: PPV). Curves: TurboFold (K = 3), TurboFold (K = 10), LocARNA (K = 3), LocARNA (K = 10), RNAalifold (K = 3), RNAalifold (K = 10), Single (K = 3), Single (K = 10)]
Run Time Requirements
◮ Run time requirements over 50 RNase P sequence datasets
Method        K = 3     K = 5     K = 10
TurboFold     136.75    277.9     517.0
LocARNA       746.44    2815.9    11395.8
RNAalifold    0.2       0.3       0.6
Table: Time requirements (in seconds) for the methods.
◮ TurboFold's run time grows more slowly than LocARNA's as K increases
Conclusions
◮ TurboFold: a multiple sequence structure prediction method
  ◮ Lowers complexity with iterative combination of intrinsic and extrinsic information for folding
  ◮ Intrinsic information: from sequence via thermodynamic folding model (nearest neighbor model)
  ◮ Extrinsic information: from other sequences
◮ TurboFold accuracy: close to or higher than the simultaneous folding and alignment methods
◮ Details: BMC Bioinformatics article [Harmanci et al., 2011]
◮ Connections to coding theory in digital communications
Turbo Decoding: RNA vs Communications
◮ Multiple encodings of same information
  ◮ Nature/Man
◮ Joint (optimal) decoding desirable
  ◮ Exact joint decoding ≈ exponential complexity
  ◮ Iterative approximation (belief propagation)
◮ Localized MAP probabilistic formulation (base pairing/symbol probs.)
◮ Decomposition into loosely coupled individual decodings + information exchange at each iteration
◮ Linear complexity in length of data
  ◮ Pseudo-prior interpretation
Ongoing Related Work
◮ Moving beyond TurboFold
  ◮ Alignment probability updates based on structures
  ◮ Better handling of dependencies
  ◮ Domain insertions/deletions
◮ Connecting with experiments
  ◮ Incorporating experimental information (e.g. SHAPE) in structural alignments
  ◮ Postulating mechanisms and experimental validation (HIV)
Acknowledgments
◮ Collaborators at UR:
  ◮ Arif O. Harmanci, UR ECE, currently PostDoc at Yale
  ◮ David H. Mathews, Department of Biochemistry and Biophysics
◮ Research support:
  ◮ National Institutes of Health (NIH) (Award # GM097334-01)
  ◮ Center for Research Computing, University of Rochester
Thank you
Questions?
References
S. H. Bernhart, I. L. Hofacker, S. Will, A. R. Gruber, and P. F. Stadler. RNAalifold: improved consensus structure prediction for RNA alignments. BMC Bioinformatics, 9:474, 2008.
J. W. Brown. The Ribonuclease P database. Nucleic Acids Res., 27(1):314, Jan. 1999.
Chuong B. Do, Daniel A. Woods, and Serafim Batzoglou. CONTRAfold: RNA secondary structure prediction without physics-based models. Bioinformatics, 22(14):90–98, 2006.
J. A. Doudna and T. R. Cech. The chemical repertoire of natural ribozymes. Nature, 418(6894):222–228, 2002.
J. A. Doudna and J. H. Cate. RNA structure: crystal clear? Current Opinion in Structural Biology, 7:310–316, 1997.
R. Durbin, S. R. Eddy, A. Krogh, and G. Mitchison. Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge University Press, Cambridge, UK, 1999. ISBN 0521629713.
A. Ozgun Harmanci, Gaurav Sharma, and David H. Mathews. Toward turbo decoding of RNA secondary structure. In Proc. IEEE Intl. Conf. Acoustics Speech and Sig. Proc., volume I, pages 365–368, Apr. 2007.
A. Ozgun Harmanci, Gaurav Sharma, and David H. Mathews. TurboFold: Iterative Probabilistic Estimation of Secondary Structures for Multiple RNA Sequences. BMC Bioinformatics, 12:108, 2011. Early access available online, April 20, 2011.
Zhi John Lu, Jason W. Gloor, and David H. Mathews. Improved RNA secondary structure prediction by maximizing expected pair accuracy. RNA, 15(10):1805–1813, 2009.
D. H. Mathews. Using an RNA secondary structure partition function to determine confidence in base pairs predicted by free energy minimization. RNA, 10(8):1178–1190, 2004.
D. H. Mathews, J. Sabina, M. Zuker, and D. H. Turner. Expanded sequence dependence of thermodynamic parameters provides improved prediction of RNA secondary structure. J. Mol. Biol., 288(5):911–940, 1999.
J. S. McCaskill. The equilibrium partition function and base pair binding probabilities for RNA secondary structure. Biopolymers, 29(6-7):1105–1119, Nov. 1990.
D. Sankoff. Simultaneous solution of RNA folding, alignment and protosequence problems. SIAM J. App. Math., 45(5):810–825, Oct. 1985.
I. Tinoco, Jr. and C. Bustamante. How RNA folds. J. Mol. Biol., 293(2):271–281, 1999.
Richard B. Waring and R. Wayne Davies. Assessment of a model for intron RNA secondary structure relevant to RNA self-splicing: a review. Gene, 28(3):277–291, 1984.
Sebastian Will, Kristin Reiche, Ivo L. Hofacker, Peter F. Stadler, and Rolf Backofen. Inferring noncoding RNA families and classes by means of genome-scale structure-based clustering. PLoS Comput. Biol., 3(4):680–691, Apr. 2007.
T. Xia, J. SantaLucia, Jr., R. Kierzek, S. J. Schroeder, X. Jiao, C. Cox, and Douglas Henry Turner. Thermodynamic parameters for an expanded nearest-neighbor model for formation of RNA duplexes with Watson-Crick pairs. Biochemistry, 37(42):14719–14735, 1998.
M. Zuker. Computer prediction of RNA structure. Methods Enzymol., 180:262–288, 1989.