Turbo-Decoding of RNA Secondary Structure

Gaurav Sharma

Department of Electrical and Computer Engineering
Department of Biostatistics and Computational Biology

June 18, 2013


Outline

◮ Turbo-decoding in Communications: A Quick Review
◮ RNA Structure Analysis: Motivation and Background
  ◮ RNA, noncoding RNA, RNA structure and its significance
  ◮ RNA structure prediction
    ◮ Single/Multiple sequence methods
◮ Turbo-decoding RNA secondary structure
  ◮ Iterative probabilistic decoding of structures of multiple homologs: TurboFold
  ◮ Turbo-decoding: RNA vs communications
◮ Conclusions
◮ Ongoing related work

Convolutional Codes

◮ Encoder

[Figure: Encoder built from delay elements (D). Message-word symbol streams $m_1(D)$, $m_2(D)$, $m_3(D)$ enter; codeword symbol streams $c_1(D)$, $c_2(D)$, $c_3(D)$, $c_4(D)$ leave.]

◮ Finite state machine
  ◮ Output and next state are functions of current state and inputs
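The finite-state-machine view can be sketched in a few lines of Python. This is a minimal illustration, not the encoder drawn on the slide: it assumes a single input stream, two output streams, and the textbook generator taps 7 and 5 (octal) with two delay elements. The function name `conv_encode` and its parameters are hypothetical.

```python
def conv_encode(bits, gens=(0b111, 0b101), memory=2):
    """Encode a bit list with a rate-1/len(gens) feedforward convolutional code.

    Assumed code (not from the slides): generators 7 and 5 in octal,
    two delay elements, encoder starting in the all-zero state.
    """
    state = 0  # contents of the delay elements, newest bit in the top position
    out = []
    for b in bits:
        reg = (b << memory) | state  # current input followed by the stored bits
        for g in gens:
            # Each output bit is the parity of the register positions tapped by g.
            out.append(bin(reg & g).count("1") % 2)
        state = reg >> 1  # shift: the oldest stored bit falls off
    return out


if __name__ == "__main__":
    print(conv_encode([1, 0, 1, 1]))
```

Both the output pair and the next state depend only on the current state and the current input bit, which is exactly the finite-state-machine property stated above.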

ML Decoding: Convolutional Code

◮ Convolutional code structure constrains possibilities to a trellis

[Figure: Trellis with branches labeled by output pairs (00, 01, 10, 11), the received (Rx) bits, and the decoded message bits along the selected path.]

◮ ML Decoding: Most likely path through the trellis given the received information
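A compact way to see "most likely path through the trellis" is a hard-decision Viterbi decoder. The sketch below assumes the same illustrative (7, 5) rate-1/2 code as a stand-in for the slide's encoder and uses Hamming distance as the path metric, which corresponds to ML decoding on a binary symmetric channel with flip probability below 0.5. The names `step` and `viterbi` are hypothetical.

```python
def step(state, bit):
    """One trellis transition of the assumed (7, 5) code: (next_state, output pair)."""
    reg = (bit << 2) | state
    out = (bin(reg & 0b111).count("1") % 2, bin(reg & 0b101).count("1") % 2)
    return reg >> 1, out


def viterbi(received):
    """Return the message whose codeword is closest in Hamming distance to `received`."""
    inf = float("inf")
    cost = [0, inf, inf, inf]      # the encoder is assumed to start in state 0
    paths = [[], [], [], []]       # best message prefix ending in each state
    for t in range(0, len(received), 2):
        r = received[t:t + 2]
        new_cost = [inf] * 4
        new_paths = [None] * 4
        for s in range(4):
            if cost[s] == inf:
                continue           # state not reachable yet
            for bit in (0, 1):
                ns, out = step(s, bit)
                c = cost[s] + (out[0] != r[0]) + (out[1] != r[1])
                if c < new_cost[ns]:   # keep only the survivor path into ns
                    new_cost[ns] = c
                    new_paths[ns] = paths[s] + [bit]
        cost, paths = new_cost, new_paths
    best = min(range(4), key=lambda s: cost[s])
    return paths[best]


if __name__ == "__main__":
    # Codeword for message 1 0 1 1 0 0 with its third bit flipped by the channel.
    print(viterbi([1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1]))
```

Only one survivor path per state is kept at each step, so the work grows linearly with the message length rather than with the number of possible messages.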

Turbo Decoding in Communications

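◮ An encoder construction

[Figure: The message word $m^{(l)} = (m^{(l)}_0, m^{(l)}_1, \ldots, m^{(l)}_{k-1})$ feeds a systematic subunit and a parity computation subunit $P(D)$ that produces the parity $p^{(l)}$; together they give the codeword symbols $c^{(l)}_0, \ldots, c^{(l)}_n$.]

Figure: Systematic convolutional encoder (recursive)

[Figure: The same message $m^{(l)}$ feeds two encoders, $P_1(D)$ and $P_2(D)$, producing parities $p^{(l)}_1$ and $p^{(l)}_2$.]

Figure: Two encoders.

[Figure: $m^{(l)}$ feeds $P_1(D)$ directly and $P_2(D)$ through an interleaver $\Pi$; the outputs $m^{(l)}$, $p^{(l)}_1$, $p^{(l)}_2$ form the codeword $c^{(l)}$.]

Figure: Parallel concatenation.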

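Turbo Decoding in Communications

[Figure: The message $\tilde{m}$ is encoded by $P_1(D)$ directly and by $P_2(D)$ after the interleaver $\Pi$, giving parities $\tilde{p}_1$ and $\tilde{p}_2$; the channel delivers the received sequence $\tilde{r}$.]

Figure: Encoder + channel

[Figure: Iterative phase: a forward-backward algorithm for the extrinsic information update for code 1 (using $\tilde{r}_1$) and another for code 2 (using $\tilde{r}_2$) exchange extrinsic information $\{T^{(t)}_l(m)\}_{l=0}^{L-1}$ and $\{U^{(t)}_l(m)\}_{l=0}^{L-1}$ through $\Pi$ and $\Pi^{-1}$. Final decoding step at $t = T$: approximate MAP decoding yields $\{\hat{m}^{(l)}\}_{l=0}^{L-1} \equiv \hat{\tilde{m}}$.]

Figure: Iterative Decoder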

Symbol-wise MAP Decoding: Convolutional Code

◮ Convolutional code structure constrains possibilities to a trellis

[Figure: Trellis with branches labeled by output pairs (00, 01, 10, 11) and the received (Rx) bits.]

◮ Most probable value of a bit, taken over all possible paths through the trellis, given the received information

◮ Localized probabilistic information
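To make the symbol-wise MAP criterion concrete, the sketch below computes the posterior probability of each message bit by brute-force summation over every message. This is deliberately not the forward-backward algorithm, which obtains the same posteriors efficiently on the trellis. The (7, 5) code, the uniform prior, the binary symmetric channel, and the flip probability `p_flip` are all assumptions for illustration.

```python
from itertools import product


def encode(bits):
    """Assumed (7, 5) rate-1/2 feedforward convolutional code, starting in state 0."""
    state, out = 0, []
    for b in bits:
        reg = (b << 2) | state
        out += [bin(reg & 0b111).count("1") % 2, bin(reg & 0b101).count("1") % 2]
        state = reg >> 1
    return out


def bit_posteriors(received, n_bits, p_flip=0.1):
    """P(m_l = 1 | received) for each l, summing over all 2**n_bits messages."""
    ones = [0.0] * n_bits
    norm = 0.0
    for msg in product((0, 1), repeat=n_bits):
        d = sum(c != r for c, r in zip(encode(msg), received))
        # Likelihood of the received bits on a binary symmetric channel.
        like = (p_flip ** d) * ((1 - p_flip) ** (len(received) - d))
        norm += like
        for l, b in enumerate(msg):
            if b:
                ones[l] += like
    return [v / norm for v in ones]


if __name__ == "__main__":
    rx = [1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1]  # one channel error in position 2
    print([round(p, 3) for p in bit_posteriors(rx, 6)])
```

Unlike ML sequence decoding, the result is one probability per bit, which is the kind of localized soft information that can be exchanged between decoders.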

Turbo Decoding in Communications: Observations

◮ Multiple encodings of same message information
◮ Joint (optimal) decoding desirable
  ◮ Exact joint decoding ≈ exponential complexity
◮ Computationally efficient decoding: iterative approximation (belief propagation)
  ◮ Localized MAP probabilistic formulation
  ◮ Decomposition into loosely coupled individual decodings + information exchange at each iteration
  ◮ Linear complexity in length of data
  ◮ Pseudo-prior interpretation

RNA?

What does this have to do with RNA?

RNA: Ribonucleic Acid

◮ Nucleic acid made of a long chain of units named nucleotides: nitrogenous base, ribose sugar, phosphate
◮ Adjacent nucleotides linked together by strong (covalent) phosphodiester bonds between sugar and phosphate
◮ Information encoded with 4 different types of nucleotides differentiated by base content: Adenine, Guanine, Cytosine, Uracil

http://www.genome.gov

[Figure: DNA (deoxyribonucleic acid) and RNA (ribonucleic acid) side by side, showing the sugar-phosphate backbone, base pairs, and the nitrogenous bases: Adenine (A), Guanine (G), and Cytosine (C) in both, Thymine (T) in DNA; Uracil (U) replaces Thymine in RNA.]

The Central Dogma

[Figure: DNA is transcribed to mRNA; the mature mRNA is transported across the nuclear membrane to the cytoplasm, where translation takes place at the ribosome: tRNA anti-codons match mRNA codons and the attached amino acids form the amino acid chain (protein).]

◮ Genetic information flows unidirectionally:
  ◮ DNA → RNA → Protein
◮ RNA plays a passive role
  ◮ Transient copy created for protein synthesis

http://www.genome.gov

RNA an Active Player: ncRNAs

[Figure: ncRNAs along the path from DNA template through transcription, pre-mRNA, and mRNA to protein: self-splicing introns, micro RNA, ribosomal RNA, and tRNA (anti-codon matched to mRNA codons, from AUG to UAA/UAG).]

◮ ncRNAs: play direct functional roles in cellular processes
  ◮ w/o translation to protein ⇒ “noncoding”
◮ Increasing numbers (being) discovered
◮ 1989 Nobel Prize in Chemistry: Ribozymes
  ◮ Thomas Cech and Sidney Altman
◮ 2006 Nobel Prize in Physiology/Medicine: siRNA
  ◮ Andrew Fire and Craig Mello



Noncoding RNAs (ncRNAs): Examples

◮ Commonly known ncRNAs
  ◮ Protein synthesis: tRNA, rRNA
  ◮ RNA modification: snoRNAs
◮ Up/Down regulation of gene expression
  ◮ Regulation of transcription
  ◮ siRNA/miRNA: post-transcriptional regulation, silencing of genes
  ◮ piRNAs: regulation of retrotransposons
◮ RNA Splicing (autocatalysis)
◮ Many more: ...
◮ RNA Genomes (many viruses, including HIV and SIV)
◮ ncRNAs and diseases
  ◮ Abnormal expression of ncRNAs observed in cancerous cells
  ◮ Prader-Willi Syndrome (over-eating and learning disabilities)
  ◮ Autism, Alzheimer’s, ...


Noncoding RNAs (ncRNAs)

◮ RNA molecules that directly play functional roles in cellular processes
◮ Do not code for protein synthesis ⇒ “noncoding”.
◮ Structure determines function in noncoding roles
◮ Determination of structure is of significant interest
  ◮ Further understanding of ncRNA function
  ◮ Enhances understanding of cellular processes and interactions
  ◮ Provides targets for drug design


Computational Prediction of RNA Structure




◮ Experimental determination of structure is challenging
  ◮ X-ray Crystallography
  ◮ Crystallization difficult and expensive
◮ Computational estimation of structure is of significant interest
  ◮ Understanding ncRNA function in cellular processes and interactions
  ◮ Genome understanding: structure based ncRNA gene search
  ◮ Therapeutics: targets for drug design

RNA Structure Hierarchy [Tinoco and Bustamante, 1999]

AAUUGCGGGAAAGGGGUCAA

CAGCCGUUCAGUACCAAGUC

UCAGGGGAAACUUUGAGAUG

GCCUUGCAAAGGGUAUGGUA

AUAAGCUGACGGACAUGGUC

CUAACCACGCAGCCAAGUCC

UAAGUCAACAGAUCUUCUGU

UGAUAUGGAUGCAGUUCA

[Figure: The sequence above (positions 120 to 260, 5’ to 3’) folded into its secondary structure, with paired regions labeled P4, P5, P5a, P5b, P5c, P6, P6a, and P6b.]

Figure: Hierarchy of RNA structure formation [Waring and Davies, 1984, Doudna and Cech, 2002, Doudna and Cate, 1997]

RNA Secondary Structure

◮ Folding of RNA linear molecular chain onto itself with base pairing rules
◮ Formation of hydrogen bonds between nucleotides
◮ Canonical base pairs
  ◮ A can pair with U
  ◮ G can pair with C and U
  ◮ G-U pair called non Watson-Crick pair

◮ Greater variety of structures than the DNA double helix
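The pairing rules listed above translate directly into a lookup. The sketch below encodes them and checks a candidate set of base pairs against them; the function names and the 0-based index convention are illustrative assumptions, and the check covers only the pairing rules and the one-partner-per-nucleotide constraint, not loop-size or pseudoknot restrictions.

```python
# Canonical pairs from the rules above: A-U, G-C, and the G-U wobble pair.
CANONICAL_PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def can_pair(x, y):
    """True if nucleotides x and y form a canonical base pair."""
    return (x.upper(), y.upper()) in CANONICAL_PAIRS


def is_valid_pairing(seq, pairs):
    """Check that every (i, j) pair (0-based) is canonical and no nucleotide pairs twice."""
    used = set()
    for i, j in pairs:
        if i == j or i in used or j in used or not can_pair(seq[i], seq[j]):
            return False
        used.update((i, j))
    return True


if __name__ == "__main__":
    print(can_pair("G", "U"), can_pair("A", "C"))
    print(is_valid_pairing("GCAAAGC", [(0, 6), (1, 5)]))
```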

RNA Secondary Structure Elements

[Figure: Secondary structure diagram (nucleotides numbered 10 to 410, 5’ to 3’) with examples of a pseudoknot, internal loop, multibranch loop, bulge loop, and hairpin loop marked.]

Figure: Structural elements of the LGW17 sequence from the RNase P Database [Brown, 1999]

RNA Structure: Thermodynamics

◮ Equilibrium: Boltzmann Distribution of structures

[Figure: The unpaired state in equilibrium with structures $S_1, S_2, \ldots, S_k, \ldots$; structure $S_k$ carries the Boltzmann weight $\exp(-\Delta G^{\circ}(S_k)/RT)$.]

◮ The lower $\Delta G^{\circ}(S_k)$, the higher the probability of $S_k$
◮ Most likely structure → Minimization of free energy
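The relation between free energy and probability can be checked numerically. The sketch below normalizes the Boltzmann weights of a small, made-up ensemble; the energies other than −28.6 kcal/mol are hypothetical, the temperature of 310.15 K (37 °C) is an assumption, and a real ensemble contains vastly more structures.

```python
import math

R = 1.987e-3  # gas constant in kcal/(mol K)


def boltzmann_probabilities(free_energies, temperature=310.15):
    """Equilibrium probabilities of structures given their free energy changes (kcal/mol)."""
    weights = [math.exp(-g / (R * temperature)) for g in free_energies]
    z = sum(weights)  # partition function over the listed structures
    return [w / z for w in weights]


if __name__ == "__main__":
    # Hypothetical three-structure ensemble; 0.0 plays the role of the unpaired state.
    for g, p in zip([-28.6, -27.0, 0.0], boltzmann_probabilities([-28.6, -27.0, 0.0])):
        print(f"dG = {g:6.1f} kcal/mol  ->  P = {p:.3e}")
```

Because the weight is a decreasing function of the free energy, the structure with the minimum free energy is always the most probable one.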

Modeling RNA Thermodynamics: Nearest neighbor model

◮ Nearest neighbor model [Xia et al., 1998, Mathews et al., 1999]
  ◮ Computational model for free energy change of RNA structure
  ◮ Experimentally determined free energy terms for each nearest neighbor interaction in secondary structure
  ◮ Loop decomposition

[Figure: A small hairpin structure (5’ to 3’) decomposed into loops, each annotated with its free energy term:]

  1 nucleotide bulge loop: +3.3 kcal/mol
  Stacked loop (UA/GC): −1.8 kcal/mol
  4 nucleotide hairpin loop: +5.9 kcal/mol
  Terminal mismatch (GC/AA): −1.1 kcal/mol
  5’ dangling nucleotide: −0.3 kcal/mol
  Stacked loop (AU/CG): −2.1 kcal/mol
  Stacked loop (CG/AU): −1.8 kcal/mol
  Stacked loop (AU/UA): −0.9 kcal/mol
  Stacking of base pairs (GC/GC): −2.9 kcal/mol
  Stacked loop (GC/GC): −2.9 kcal/mol

Figure: Total free energy change is summation of all nearest neighbor energies [Durbin et al., 1999]

RNA ML Decoding of Structure: Single Sequence

◮ Most likely or minimum free energy structure, given sequence

◮ Dynamic Programming: MFold [Zuker, 1989], $O(N^3)$ complexity
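The shape of the $O(N^3)$ dynamic program can be illustrated with a much simpler objective. The sketch below is a Nussinov-style recursion that maximizes the number of nested canonical base pairs; it is a stand-in chosen for brevity and is not the nearest neighbor free energy minimization that MFold performs. The minimum hairpin loop size of 3 and the function name are assumptions.

```python
PAIRS = {"AU", "UA", "GC", "CG", "GU", "UG"}


def max_base_pairs(seq, min_loop=3):
    """Maximum number of nested canonical pairs, with at least min_loop unpaired
    nucleotides enclosed by every hairpin (simplified stand-in for free energy)."""
    n = len(seq)
    if n == 0:
        return 0
    best = [[0] * n for _ in range(n)]  # best[i][j]: optimum for seq[i..j]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            score = best[i][j - 1]  # case 1: nucleotide j stays unpaired
            for k in range(i, j - min_loop):  # case 2: j pairs with k
                if seq[k] + seq[j] in PAIRS:
                    left = best[i][k - 1] if k > i else 0
                    score = max(score, left + 1 + best[k + 1][j - 1])
            best[i][j] = score
    return best[0][n - 1]


if __name__ == "__main__":
    print(max_base_pairs("GGGAAACCC"))
```

There are $O(N^2)$ subsequences and each considers $O(N)$ pairing partners, which gives the cubic complexity quoted above.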

RNA MAP Decoding of Structure

◮ Posterior probability of base pairing, given sequence

◮ Dynamic Programming [McCaskill, 1990], MFold, RNAfold ($O(N^3)$ in time, $O(N^2)$ in space)

◮ Localized probabilistic information
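As a rough illustration of how base pairing probabilities come out of a partition function, the sketch below uses a toy energy model in which every canonical base pair contributes the same free energy and structures are nested. This is an assumption made for brevity: McCaskill's algorithm works with the full nearest neighbor model and reaches $O(N^3)$ time, whereas this simplified version spends $O(N^4)$ on the probability step. The names, the pair energy of −1 kcal/mol, and RT = 0.616 kcal/mol are illustrative.

```python
import math

PAIRS = {"AU", "UA", "GC", "CG", "GU", "UG"}


def pair_probabilities(seq, pair_energy=-1.0, rt=0.616, min_loop=3):
    """Base pair probabilities under a toy model: each pair adds pair_energy (kcal/mol)."""
    n = len(seq)
    w = math.exp(-pair_energy / rt)          # Boltzmann weight of a single base pair
    Z = [[1.0] * n for _ in range(n)]        # Z[i][j]: partition function of seq[i..j]

    def z(i, j):                             # an empty interval contributes weight 1
        return Z[i][j] if i <= j else 1.0

    def ok(k, l):                            # can k pair with l?
        return l - k > min_loop and seq[k] + seq[l] in PAIRS

    for span in range(1, n):
        for i in range(n - span):
            j = i + span
            # Either j is unpaired, or j pairs with some k in i..j-1.
            Z[i][j] = z(i, j - 1) + sum(
                z(i, k - 1) * w * z(k + 1, j - 1) for k in range(i, j) if ok(k, j))
    total = z(0, n - 1)

    P = {}
    for span in range(n - 1, min_loop, -1):  # widest pairs first
        for k in range(n - span):
            l = k + span
            if not ok(k, l):
                continue
            inner = w * z(k + 1, l - 1)
            # Case 1: (k, l) is not enclosed by any other pair.
            p = z(0, k - 1) * inner * z(l + 1, n - 1) / total
            # Case 2: (i, j) is the closest pair enclosing (k, l).
            for (i, j), pij in P.items():
                if i < k and l < j:
                    p += pij * z(i + 1, k - 1) * inner * z(l + 1, j - 1) / z(i + 1, j - 1)
            P[(k, l)] = p
    return P


if __name__ == "__main__":
    for pair, p in sorted(pair_probabilities("GGAAACC").items()):
        print(pair, round(p, 4))
```

For the 7-nucleotide example the ensemble can be enumerated by hand (the empty structure, four single pairs, and one two-pair stack), which gives a direct check on the recursion.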

RNA Structure Prediction (Single Sequence)

RD0260 Sequence

GCGACCGGGGCUGGCUUGGUAAUGGUACUCCCCUGUCACGGGAGAGAAUGUGGGUUCAAAUCCCAUCGGUCGCGCCA

[Figure: For RD0260, free energy minimization gives a single predicted structure (−28.6 kcal/mol), while partition function computation gives a dot plot of base pairing probabilities. Both axes of each dot plot: RD0260 nucleotide indices (0 to 80).]

◮ Free energy minimization: “Hard” Prediction
  ◮ Single prediction structure
◮ Base pairing probabilities: “Soft” Prediction
  ◮ Thresholding may yield pseudo-knotted structures
  ◮ Maximum Expected Accuracy Structure Prediction [Do et al., 2006, Lu et al., 2009]

Structure Prediction for Multiple Sequences: Homologous ncRNAs

◮ Homologous ncRNAs
  ◮ Share evolutionary ancestor
  ◮ Serve same function
  ◮ Structural similarity in terms of topology of structures

[Figure: Predicted secondary structures of three homologs, RD0260 (Bacteriophage T5, Asp), RD0500 (Haloferax volcanii, Asp), and RE2140 (Synechocystis, Glu), with the corresponding helices of the three structures highlighted, followed by the sequence alignment that conforms to the common structure:]

RD0260 GCGACCGGGGCUGGCUUGGUA-AUGGUACUCCCCUGUCACGGGAGAGAAUGUGGGUUCAAAUCCCAUCGGUCGCGCCA

RD0500 GCCCGGGUGGUGUAGU-GGCCCAUCAUACGACCCUGUCACGGUCGUGA-CGCGGGUUCGAAUCCCGCCUCGGGCGCCA

RE2140 GCCCCCAUCGUCUAGA-GGCCUAGGACACCUCCCUUUCACGGAGGCGA-CAGGGAUUCGAAUUCCCUUGGGGGUACCA


◮ “Common” structures and conforming sequence alignment

◮ Joint estimation can harness comparative structure and sequence information across homologs

Multiple Sequence RNA Structure Prediction

[Figure: Input sequences $x_1, x_2, \ldots, x_K$ (lengths $N_1, N_2, \ldots, N_K$) enter a multiple structure prediction block, which outputs structures $S_1, S_2, \ldots, S_K$ and (optionally) an alignment $A$.]

◮ Sankoff’s dynamic programming algorithm [Sankoff, 1985]
  ◮ Simultaneous folding (pseudo-knot free) and alignment of K sequences
  ◮ Time (memory) complexity: $O(N^{3K})$ ($O(N^{2K})$)
  ◮ Computationally infeasible even for short sequences and K = 2 w/o cutting corners

Turbo-Decoding RNA Secondary Structure

◮ Goal: Performance similar to (or “better” than) joint estimation, with complexity similar to single sequence computation.
◮ Probabilistic formulation of folding and alignment
  ◮ Base pairing probabilities, posterior alignment probabilities
◮ Iteratively update each using information from other
  ◮ TurboFold [Harmanci et al., 2007, 2011].

[Figure: TurboFold takes the input sequences $x_1, x_2, \ldots, x_K$ (lengths $N_1, N_2, \ldots, N_K$) and outputs, for each sequence $x_m$, an $N_m \times N_m$ matrix of base pairing probabilities.]

Structural Alignment: Joint Representation of Structures and Alignment

◮ Two sequence case

[Figure: Sequence 1 and Sequence 2 (5’ to 3’) drawn with Structure 1 and Structure 2, together with the co-incidence path for alignment between the two sequences.]

Decoupled Probabilistic Representation for RD0260, RD0500 Structural Alignment

◮ Formulate in probabilistic framework and separate the folding/alignment representations

[Figure: Three dot plots over nucleotide indices 0 to 80: base pairing probabilities for RD0260 (via partition function), co-incidence probabilities between RD0260 and RD0500 (via HMM), and base pairing probabilities for RD0500 (via partition function).]

TurboFold

Given K homologous RNA sequences, each sequence contains some information about folding of every other sequence.

◮ Extrinsic information for a sequence
  ◮ The information about folding of a sequence which is computed using base pairing probabilities of other sequences
  ◮ Thermodynamic model + Alignment model
◮ Base pairing probabilities of a sequence (Intrinsic Information)
  ◮ From sequence itself
  ◮ Thermodynamic model
◮ Iterative updates:
  ◮ Compute extrinsic information using base pairing probabilities and alignment co-incidence probabilities
  ◮ Update base pairing probabilities using updated extrinsic information
  ◮ Update extrinsic information using updated base pairing probabilities
  ◮ . . .

Extrinsic Information for Base Pairing for RD0260

[Figure: The extrinsic information for RD0260, ${}^{0}_{p}\tilde{\Pi}^{\mathrm{RD0260}}$, is formed from the base pairing probabilities of RD0500, ${}^{0}_{p}\Pi^{\mathrm{RD0500}}$, multiplied on either side by the co-incidence probabilities ${}^{0}_{c}\Pi^{\mathrm{RD0500,\,RD0260}}$ and their transpose, and weighted by $(1 - \psi_{\mathrm{RD0500,\,RD0260}})$. Axes: RD0260 and RD0500 indices (0 to 80).]

3 Sequences

[Figure: Base pairing probability dot plots for RD0260, RD0500, and RE2140, each over its own nucleotide indices (0 to 80).]

Base Pairing Proclivity Matrix for RD0260 Induced by RE2140

[Figure: The base pairing probabilities of RE2140 are mapped through the co-incidence probabilities between RD0260 and RE2140 (and their transpose) to give the base pairing proclivity matrix for RD0260 induced by RE2140. Axes: RD0260 and RE2140 indices (0 to 80).]

◮ Information in RE2140 about folding of RD0260

$$ {}^{t}_{p}\tilde{\Pi}^{(s \to m)} = {}_{c}\Pi^{(m,s)} \; {}^{t-1}_{p}\Pi^{s} \; \left({}_{c}\Pi^{(m,s)}\right)^{T} \qquad (1) $$
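Equation (1) is a matrix triple product, which the following sketch spells out in plain Python. The tiny matrices are made-up numbers, not values from the talk: `C_ms` stands for the $N_m \times N_s$ co-incidence matrix ${}_{c}\Pi^{(m,s)}$ and `P_s` for the $N_s \times N_s$ base pairing probability matrix of sequence $s$.

```python
def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(row) for row in zip(*a)]


def proclivity(C_ms, P_s):
    """Proclivity matrix for sequence m induced by sequence s: C P C^T, as in (1)."""
    return matmul(matmul(C_ms, P_s), transpose(C_ms))


if __name__ == "__main__":
    # Hypothetical example: sequence m has 3 nucleotides, sequence s has 2.
    C_ms = [[1.0, 0.0],
            [0.0, 1.0],
            [0.5, 0.5]]
    P_s = [[0.0, 0.8],
           [0.8, 0.0]]
    for row in proclivity(C_ms, P_s):
        print([round(v, 3) for v in row])
```

Entry $(i, j)$ of the result sums, over all positions $k$ and $l$ of sequence $s$, the probability that $i$ aligns with $k$, that $k$ pairs with $l$, and that $j$ aligns with $l$, so a pair in $s$ votes for the pair of positions it aligns to in $m$.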

3 Sequences: Extrinsic information Computation

[Figure: The proclivity matrices for RD0260 induced by RD0500 and by RE2140 are each weighted by (1 − similarity), combined, and normalized by the maximum to give the extrinsic information matrix for RD0260. Axes: RD0260 indices (0 to 80).]

K Sequences: Extrinsic Information Computation for xm

[Figure: For each $s \in N \setminus m$, the co-incidence probabilities ${}^{t}_{c}\Pi^{(s,m)}$ and the base pairing probabilities ${}^{t-1}_{p}\Pi^{s}$ of $x_s$ give the induced base pairing proclivity matrix ${}^{t}_{p}\tilde{\Pi}^{(s \to m)}$ of size $N_m \times N_m$. These matrices are weighted by $(1 - \psi_{s,m})$, combined, and normalized by the maximum to give the extrinsic information for base pairing for $x_m$, ${}^{t}_{p}\tilde{\Pi}^{m}$.]
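The combination step can be sketched as follows. It assumes the weighted proclivity matrices are summed before normalization (the slide shows only that they are weighted by (1 − similarity) and normalized by the maximum), and the input numbers are invented for illustration.

```python
def extrinsic_information(proclivities, similarities):
    """Combine proclivity matrices for one sequence into its extrinsic information.

    proclivities: list of N x N matrices, one per other sequence s.
    similarities: list of similarities psi between each s and the target sequence.
    Assumption: the weighted matrices are summed, then divided by the largest entry.
    """
    n = len(proclivities[0])
    total = [[0.0] * n for _ in range(n)]
    for matrix, psi in zip(proclivities, similarities):
        for i in range(n):
            for j in range(n):
                # Less similar sequences contribute more independent information.
                total[i][j] += (1.0 - psi) * matrix[i][j]
    peak = max(max(row) for row in total)
    if peak == 0:
        return total
    return [[v / peak for v in row] for row in total]


if __name__ == "__main__":
    m1 = [[0.0, 0.8], [0.8, 0.0]]   # hypothetical proclivity induced by sequence 1
    m2 = [[0.2, 0.4], [0.4, 0.0]]   # hypothetical proclivity induced by sequence 2
    print(extrinsic_information([m1, m2], [0.5, 0.75]))
```

Weighting by (1 − similarity) gives near-identical sequences little influence, since they carry almost the same information as the target sequence itself.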

Base Pairing Probability Computation Using Extrinsic Information

◮ Modified Boltzmann distribution of secondary structures:

$$ P(S) \propto \exp\left(-\frac{\Delta\tilde{G}(S)}{RT}\right) \qquad (2) $$

where

$$ \Delta\tilde{G}(S) = \Delta G^{\circ}(S) - \gamma \sum_{(i,j) \in S} \log\left(\tilde{\pi}(i,j)\right) \qquad (3) $$

is the modified free energy change for structure $S$.

◮ $\tilde{\pi}(i,j)$: Extrinsic information for pairing of nucleotides at indices $i$ and $j$
◮ $\gamma$: Weight of extrinsic information on modified free energy relative to $\Delta G^{\circ}(S)$

Extrinsic information introduced via a pseudo free energy for each base pair

Substituting (3) into (2):

$$ P(S) \propto \underbrace{\exp\left(-\frac{\Delta G^{\circ}(S)}{RT}\right)}_{\text{Boltzmann distribution proportionality term}} \; \underbrace{\prod_{(i,j) \in S} \left(\tilde{\pi}(i,j)\right)^{\gamma/RT}}_{\text{Extrinsic information}} $$

◮ Base pair $(i, j)$ has a pseudo prior probability of $\left(\tilde{\pi}(i,j)\right)^{\gamma/RT}$ due to extrinsic information.
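The equivalence of the pseudo free energy form and the pseudo-prior form can be verified numerically. In the sketch below the free energy, the extrinsic values, and RT = 0.616 kcal/mol are made-up inputs; only the two formulas come from the slide.

```python
import math


def weight_via_pseudo_energy(dg, pis, gamma, rt):
    """Unnormalized P(S): exp(-dG_mod / RT), with dG_mod = dG - gamma * sum(log(pi))."""
    dg_mod = dg - gamma * sum(math.log(p) for p in pis)
    return math.exp(-dg_mod / rt)


def weight_via_pseudo_prior(dg, pis, gamma, rt):
    """Unnormalized P(S): Boltzmann term times the product of pi ** (gamma / RT)."""
    return math.exp(-dg / rt) * math.prod(p ** (gamma / rt) for p in pis)


if __name__ == "__main__":
    rt = 0.616                     # assumed RT in kcal/mol (about 37 degrees C)
    gamma = 0.3 * rt               # weight of the extrinsic information
    pis = [0.9, 0.5, 0.7]          # hypothetical extrinsic values for the pairs in S
    print(weight_via_pseudo_energy(-5.0, pis, gamma, rt))
    print(weight_via_pseudo_prior(-5.0, pis, gamma, rt))
```

Since every $\tilde{\pi}(i,j)$ is at most 1, its logarithm is non-positive, so a pair with weak extrinsic support raises the modified free energy and is penalized.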

TurboFold: Iterative Updates

[Figure: Block diagram of the iterations. For each sequence (1st, ..., mth, ..., Kth), an extrinsic information computation produces $\tilde{\Pi}_m$ from the alignment probabilities and the base pairing probabilities of the other sequences; a base pairing probability computation then produces $\Pi_m$, which is fed back through a unit “time” delay. The loop runs for $\eta$ iterations.]

◮ For low K, per iteration complexity is comparable to single sequence structure prediction
◮ Benefits from comparative analysis

TurboFold Structure Prediction Overview

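◮ Obtain base pairing probabilities after η iterations, then predict structures
  ◮ Significant base pairs
  ◮ Maximum expected accuracy (MEA) structures

[Figure: Input sequences $\{x_m\}_{m \in N}$ enter TurboFold, which alternates between $\{{}^{t}_{p}\tilde{\Pi}^{m}\}$ and $\{{}^{t}_{p}\Pi^{m}\}$ for $\eta$ iterations and outputs $\{{}^{\eta}_{p}\Pi^{m}\}_{m \in N}$. Structures are then predicted either by thresholding or by maximum expected accuracy prediction (TurboFold-MEA).]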

Structure Prediction

◮ Structure for $x_m$ composed of base pairs with probabilities greater than $P_{\text{thresh}}$:

$$ S^{*}_{m} = \left\{ (i,j) \;\middle|\; {}^{\eta}_{p}\pi^{m}(i,j) > P_{\text{thresh}} \right\} \qquad (4) $$
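Equation (4) amounts to a single pass over the upper triangle of the probability matrix. The sketch below uses a made-up 4 × 4 matrix and a hypothetical function name.

```python
def threshold_structure(pair_probs, p_thresh=0.5):
    """Set of base pairs (i, j), i < j, whose probability exceeds p_thresh, as in (4)."""
    n = len(pair_probs)
    return {(i, j)
            for i in range(n)
            for j in range(i + 1, n)
            if pair_probs[i][j] > p_thresh}


if __name__ == "__main__":
    # Hypothetical symmetric base pairing probability matrix for a 4-nucleotide sequence.
    probs = [[0.0, 0.0, 0.0, 0.9],
             [0.0, 0.0, 0.3, 0.0],
             [0.0, 0.3, 0.0, 0.0],
             [0.9, 0.0, 0.0, 0.0]]
    print(threshold_structure(probs, 0.5))
```

With a threshold of 0.5 or more, no nucleotide can appear in two selected pairs, because the pairing probabilities of a single nucleotide sum to at most 1; nothing in the rule prevents the selected pairs from crossing, however.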

TurboFold: Computation Complexity

◮ Initialization
  ◮ Computation of co-incidence matrices: $O(K^2 N^2)$
  ◮ Computation of sequence similarities: $O(K^2 N^2)$
◮ Iterations
  ◮ Extrinsic information computation: $O(\eta K^2 d^2 N^2)$
  ◮ Base pairing probability computation: $O(\eta K U N^3)$
◮ Structure prediction
  ◮ Thresholding: $O(K N^2)$
  ◮ MEA prediction: $O(K N^3)$

Compare to Sankoff’s algorithm: O(N3(U2d)K)

Evaluating Accuracy of Estimates

◮ Sensitivity: Ratio of number of correctly predicted base pairs to the total number of base pairs in the known structure

$$ \text{Sensitivity} = \frac{\text{True Positive}}{\text{True Positive} + \text{False Negative}} $$

  ◮ Recall

◮ Positive Predictive Value (PPV): Ratio of number of correctly predicted base pairs to the total number of base pairs in the predicted structure

$$ \text{PPV} = \frac{\text{True Positive}}{\text{True Positive} + \text{False Positive}} $$

  ◮ Precision
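Both scores follow from set operations on base pairs. The sketch below counts a predicted pair as correct only when it matches a known pair exactly, which is an assumption of this illustration; the example pairs are invented.

```python
def sensitivity_ppv(predicted, known):
    """Return (sensitivity, PPV) for predicted vs. known base pairs.

    Pairs are (i, j) tuples; a prediction counts as a true positive
    only on an exact match (assumption of this sketch).
    """
    predicted, known = set(predicted), set(known)
    true_pos = len(predicted & known)
    sensitivity = true_pos / len(known) if known else 0.0
    ppv = true_pos / len(predicted) if predicted else 0.0
    return sensitivity, ppv


if __name__ == "__main__":
    predicted = {(0, 8), (1, 7), (2, 5)}
    known = {(0, 8), (1, 7), (2, 6), (3, 5)}
    print(sensitivity_ppv(predicted, known))
```

Raising the probability threshold typically trades sensitivity for PPV, which is why the two are reported together as a curve.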

Parameter Selection: Number of iterations, η

[Figure: Sensitivity vs. PPV over the 5S rRNA dataset with changing η; the plotted points are labeled η = 0, 1, 2, 3, 4, 5.]

◮ η = 3 is used in TurboFold

Parameter Selection: Weight of Extrinsic Information, γ

[Figure: Sensitivity vs. PPV over the 5S rRNA dataset with changing γ/RT; the plotted points are labeled γ/RT = 0.001, 0.05, 0.1, 0.2, 0.3, 0.5, 0.8, 1.0, 1.2.]

◮ γ = 0.3RT is used in TurboFold

Benchmarking Experiments: Datasets

◮ Randomly choose 200 RNase P, 400 5S rRNA, 400 SRP, and 400 tRNA sequences and divide into K combinations

◮ Choose and divide for K = 2, . . . , 10

◮ Yields 36 datasets

The datasets have significant diversity:

◮ RNase Ps: 336 nucleotides, 50% average pairwise identity

◮ tmRNA: 366 nucleotides, 45% average pairwise identity

◮ telomerase RNA: 445 nucleotides, 54% average pairwise identity

◮ SRPs: 187 nucleotides, 42% average pairwise identity

◮ tRNAs: 77 nucleotides, 47% average pairwise identity

◮ 5S rRNAs: 119 nucleotides, 63% average pairwise identity

Benchmarking Experiments

◮ TurboFold is benchmarked against methods that estimate base pairing probabilities:
  ◮ LocARNA [Will et al., 2007]
  ◮ RNAalifold [Bernhart et al., 2008]
  ◮ Single sequence partition function [Mathews, 2004]
◮ The set of base pairs with estimated probabilities higher than $P_{\text{thresh}}$ is scored
◮ Plotted sensitivity versus PPV while varying $P_{\text{thresh}}$ between 0 and 1 with step size of 0.04

Benchmarking Experiments

Sensitivity vs PPV ROC curves for TurboFold vs three alternative methods

[Figure: PPV versus sensitivity curves for TurboFold, LocARNA, RNAalifold, and single sequence prediction, each shown for K = 3 and K = 10.]

Run Time Requirements

◮ Run time requirements over 50 RNase P sequence datasets

Runtime (seconds) for    K = 3      K = 5      K = 10
TurboFold                136.75     277.9      517.0
LocARNA                  746.44     2815.9     11395.8
RNAalifold               0.2        0.3        0.6

Table: Time requirements (in seconds) for the methods.

◮ TurboFold's run time grows more slowly with increasing K than LocARNA's

Conclusions

◮ TurboFold: A multiple sequence structure prediction method
  ◮ Lowers complexity with iterative combination of intrinsic and extrinsic information for folding
  ◮ Intrinsic information: From sequence via thermodynamic folding model (nearest neighbor model)
  ◮ Extrinsic information: From other sequences
◮ TurboFold accuracy: close to or higher than the simultaneous folding and alignment methods
◮ Details: BMC Bioinformatics article [Harmanci et al., 2011].

◮ Connections to coding theory in digital communications

Turbo Decoding: RNA vs Communications

◮ Multiple encodings of same information
  ◮ Nature/Man
◮ Joint (optimal) decoding desirable
  ◮ Exact joint decoding ≈ exponential complexity
  ◮ Iterative approximation (belief propagation)
◮ Localized MAP probabilistic formulation (base pairing/symbol probs.)
◮ Decomposition into loosely coupled individual decodings + information exchange at each iteration
◮ Linear complexity in length of data
◮ Pseudo-prior interpretation

Ongoing Related Work

◮ Moving beyond TurboFold
  ◮ Alignment probability updates based on structures
  ◮ Better handling of dependencies
  ◮ Domain insertions/deletions
◮ Connecting with experiments
  ◮ Incorporating experimental information (e.g. SHAPE) in structural alignments
  ◮ Postulating mechanisms and experimental validation (HIV)

Acknowledgments

◮ Collaborators at UR:
  ◮ Arif O. Harmanci, UR ECE, currently PostDoc at Yale
  ◮ David H. Mathews, Department of Biochemistry and Biophysics
◮ Research support:
  ◮ National Institutes of Health (NIH) (Award # GM097334-01)
  ◮ Center for Research Computing, University of Rochester

Thank you

Questions?


References

S. H. Bernhart, I. L. Hofacker, S. Will, A. R. Gruber, and P. F. Stadler. RNAalifold: improved consensus structure prediction for RNA alignments. BMC Bioinformatics, 9:474, 2008.

J. W. Brown. The Ribonuclease P database. Nucleic Acids Res., 27(1):314, Jan. 1999.

Chuong B. Do, Daniel A. Woods, and Serafim Batzoglou. CONTRAfold: RNA secondary structure prediction without physics-based models. Bioinformatics, 22(14):90–98, 2006.

J. A. Doudna and T. R. Cech. The chemical repertoire of natural ribozymes. Nature, 418(6894):222–228, 2002.

J. A. Doudna and J. H. Cate. RNA structure: crystal clear? Current Opinion in Structural Biology, 7:310–316, 1997.

R. Durbin, S. R. Eddy, A. Krogh, and G. Mitchison. Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge University Press, Cambridge, UK, 1999. ISBN 0521629713.

A. Ozgun Harmanci, Gaurav Sharma, and David H. Mathews. Toward turbo decoding of RNA secondary structure. In Proc. IEEE Intl. Conf. Acoustics Speech and Sig. Proc., volume I, pages 365–368, Apr. 2007.

A. Ozgun Harmanci, Gaurav Sharma, and David H. Mathews. TurboFold: Iterative Probabilistic Estimation of Secondary Structures for Multiple RNA Sequences. BMC Bioinformatics, 12:108, 2011. Early access available online, April 20, 2011.

Zhi John Lu, Jason W. Gloor, and David H. Mathews. Improved RNA secondary structure prediction by maximizing expected pair accuracy. RNA, 15(10):1805–1813, 2009.

D. H. Mathews. Using an RNA secondary structure partition function to determine confidence in base pairs predicted by free energy minimization. RNA, 10(8):1178–1190, 2004.

D. H. Mathews, J. Sabina, M. Zuker, and D. H. Turner. Expanded sequence dependence of thermodynamic parameters provides improved prediction of RNA secondary structure. J. Mol. Biol., 288(5):911–940, 1999.

J. S. McCaskill. The equilibrium partition function and base pair binding probabilities for RNA secondary structure. Biopolymers, 29(6-7):1105–1119, Nov. 1990.

D. Sankoff. Simultaneous solution of RNA folding, alignment and protosequence problems. SIAM J. App. Math., 45(5):810–825, Oct. 1985.

I. Tinoco, Jr. and C. Bustamante. How RNA folds. J. Mol. Biol., 293(2):271–281, 1999.

Richard B. Waring and R. Wayne Davies. Assessment of a model for intron RNA secondary structure relevant to RNA self-splicing: a review. Gene, 28(3):277–291, 1984.

Sebastian Will, Kristin Reiche, Ivo L. Hofacker, Peter F. Stadler, and Rolf Backofen. Inferring noncoding RNA families and classes by means of genome-scale structure-based clustering. PLoS Comput. Biol., 3(4):680–691, Apr. 2007.

T. Xia, J. SantaLucia, Jr., R. Kierzek, S. J. Schroeder, X. Jiao, C. Cox, and D. H. Turner. Thermodynamic parameters for an expanded nearest-neighbor model for formation of RNA duplexes with Watson-Crick pairs. Biochemistry, 37(42):14719–14735, 1998.

M. Zuker. Computer prediction of RNA structure. Methods Enzymol., 180:262–288, 1989.
