Computational ncRNA gene finding (& ncRNA structure prediction), Liming Cai (BINF8210@UGA, Fall 2010)

Upload: beverly-cameron

Post on 29-Dec-2015


Computational ncRNA gene finding(& ncRNA structure prediction)

Liming Cai

(BINF8210@UGA, Fall 2010)

ncRNA structure prediction (& computational ncRNA gene finding)

Non-coding RNAs

• Functions other than coding for proteins, e.g., structural, catalytic, and regulatory roles

functional RNAs = ncRNAs + UTR motifs

• (-) No strong statistical signals, such as ORFs or polyadenylation, of the kind seen in protein-coding genes

• (+) Transcribed ncRNA molecules can fold into secondary and tertiary structures (more conserved than sequences)

Sources of ncRNAs

• Non-coding RNA genes encode RNAs, e.g., miRNAs, and the roX1 and roX2 RNAs in male Drosophila melanogaster.

• In introns and intergenic regions, e.g., snoRNAs

• In 5’ and 3’ UTRs, e.g., regulatory motifs (functional RNAs)

Functions of ncRNAs

• rRNAs and tRNAs
• RNA maturation: snRNAs in recognizing splice sites
• RNA modification: snoRNAs converting uridine to pseudouridine
• Regulation of gene expression and translation: e.g., miRNAs
• DNA replication: e.g., telomerase RNAs, the template for addition of telomeric repeats
• Etc.

Classes of ncRNAs (Bompfünewerer et al., 2005)

Class                | Size (nt)        | Function                     | Phylogenetic distribution
tRNA                 | 70-80            | translation                  | ubiquitous
rRNA: 16S/18S        | 1.5K             | translation                  | ubiquitous
rRNA: 28S+5.8S/23S   | 3K               | translation                  | ubiquitous
rRNA: 5S             | 130              | translation                  | ubiquitous
RNase P              | 220-440          | tRNA maturation              | ubiquitous
MRP                  | 250-350          |                              | eukarya
snoRNA               | 130              | pseudouridinylation          |
telomerase           | 400-550          | addition of repeats          |
snRNA (U1 ~ U6)      | 100-600; 130-140 | spliceosome, mRNA maturation | eukarya; eukarya, archaea
U7                   | ~65              | histone mRNA maturation      | eukaryotes
7SK                  | ~300             | translational regulation     | vertebrata
tmRNA                | 300-400          | tags protein for proteolysis | bacteria
miRNA                | ~22              | post-transcriptional regulation | multicellular organisms

Some ncRNA databases

• Rfam (280,000 regions of 379 families)
• NONCODE (109 transitional classes and 9 groups)
• RNAdb (800 mammalian ncRNAs, excluding tRNAs, rRNAs and snRNAs)
• Arabidopsis Small RNA Project (ASRP)
• Etc.

ncRNA gene finding strategies

1. Computational predictive methods
2. cDNA cloning to enrich ncRNAs
3. Detecting new transcripts with oligonucleotide microarrays

ncRNA gene finding: a computational challenge

• ncRNA genes do not have significant statistical signals

• large in number
• diverse, 20 nts to 22,000 nts

- Not sure what to look for
- Computationally intensive
- Simply no good method
- Methods compromising accuracy

Computational ncRNA gene finding methods

• Specific (custom-designed) ncRNA search and annotation (e.g., tRNAscan, methylation-guide snoRNA, miRNA, tmRNA)

• Reconfigurable search systems (e.g., Infernal, ERPIN, RNATOPS, FastR)
– mechanism to profile the target ncRNA (structure)
– need training data

• De novo ncRNA gene detection with
– base composition (e.g., G+C %)
– structure fold (e.g., RNAz)

• Comparative analysis (e.g., QRNA, EvoFold)
– consensus structure

• ncRNA “holy grail”?

Review literature in computational ncRNA gene finding and annotation

• A. Laederach (2007) Informatics challenges in structured RNA, Briefings in Bioinformatics, 8(5): 294-303.

• S. Eddy (2001) Non-coding RNA genes and the modern RNA world, Nature Reviews Genetics, 2(12): 919-929.

• S. Griffiths-Jones (2007) Annotating noncoding RNA genes, Annual Review of Genomics and Human Genetics, 8: 279-298.

• Machado-Lima et al. (2008) Computational methods in noncoding RNA research, Journal of Mathematical Biology, 56: 15-49.

[Figure: 506 miRNAs, comparison between NUPACK and Triple. Series: z-score with Triple, z-score with NUPACK. Vertical axis: z-score.]

[Figure: 499 tRNAs, comparison between NUPACK and Triple. Series: z-score with Triple, z-score with NUPACK. Vertical axis: z-score.]

Data were from Bonnet et al., 2004

[Figure: 499 tRNAs, comparisons between HG, Triple, and NUPACK. Series: z-score with Triple, z-score with HG, z-score with NUPACK. Vertical axis: z-score.]

Data were from Bonnet et al., 2004

What is in this lecture?

• RNA secondary structure prediction
1. ab initio structure prediction
2. consensus structure prediction
3. structural model-based prediction

[Doudna et al., 1999] [tRNA unfolding pathway]

but why just secondary structure?

What else is in this lecture?

• ncRNA gene finding and annotation
4. structural profile-based ncRNA gene annotation
5. comparative analysis based ncRNA gene finding
6. ab initio ncRNA gene detection

Base pairings of RNAs

• Base pairings allow RNA to fold

Watson-Crick base pairs: A-U, C-G
Wobble pair: G-U

These are called canonical pairs for secondary structure

Note: all 16 (including non-canonical) base pairs are possible for RNA tertiary structure

[Figure: chemical structures of the four bases (cytosine, guanine, uracil, adenine) attached to the phosphate backbone, running 5’ to 3’, of an example sequence]

5’-u-u-c-c-g-a-a-g-c-u-c-a-a-c-g-g-g-a-a-a-u-g-a-g-c-u-3’

Secondary structure is important to tertiary structure

Stems in nested or parallel pattern

[Figure: an example sequence folded into stems arranged in nested and parallel patterns]

stem (double helix): stacked base pairs

loop: strand of unpaired bases

Stems in crossing patterns

[Figure: an example sequence folded into stems that cross each other]

Pseudoknots: crossing patterns of stems

RNA secondary structure elements

• Hairpin loop
• Junction (multiloop)
• Bulge loop
• Single-stranded region
• Interior loop
• Stem
• Pseudoknot

Image: Wuchty

RNA stem-loop (pseudoknot-free) structure example

RNA secondary structure prediction

1. ab initio structure prediction: to predict the structure of a single sequence

2. Consensus structure prediction: to predict the structure shared by two or more sequences

3. Statistical model-based prediction and alignment: to search for desired structures in genomes or databases

1. ab initio structure prediction

• Forming hydrogen bonds (base pairs) lowers the free energy of the molecule.

• The lower the free energy, the more stable the folded structure.

ab initio structure prediction (cont’)

• Consider only canonical base pairs: A-U, C-G, and G-U. Base pairings reduce the amount of free energy contained in the molecule.

• Maximizing the number of base pairs would minimize the free energy in the molecule. (Only an approximate model)

ab initio structure prediction (cont’)

• But how to count?

An RNA can be very long, and there may be many possible ways for base pairs to form:

e.g., …ACGGUACGUC… has conflicting candidate pairs such as A-U and A-U, or G-C and G-C.

Even the number of non-conflicting combinations of base pairs is exponentially large.
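
To get a feel for how quickly the number of candidate structures grows, the following sketch counts them for a short sequence. It is only an illustration under stated assumptions: "non-conflicting" is taken to mean that each base is in at most one pair and that pairs do not cross (pseudoknot-free), only canonical pairs are allowed, and adjacent bases may pair. The function name `count_structures` is hypothetical.

```python
from functools import lru_cache

# Canonical pairs (Watson-Crick plus the G-U wobble pair).
PAIRS = {("A", "U"), ("U", "A"), ("C", "G"), ("G", "C"), ("G", "U"), ("U", "G")}


def count_structures(seq):
    """Count pseudoknot-free sets of non-conflicting canonical base pairs."""
    seq = seq.upper()

    @lru_cache(maxsize=None)
    def count(i, j):
        # Number of structures on seq[i..j] (inclusive); an empty or
        # single-base region has exactly one structure (no pairs).
        if j <= i:
            return 1
        total = count(i + 1, j)  # base i stays unpaired
        for k in range(i + 1, j + 1):  # base i pairs with base k
            if (seq[i], seq[k]) in PAIRS:
                total += count(i + 1, k - 1) * count(k + 1, j)
        return total

    return count(0, len(seq) - 1)


for n in (2, 4, 8, 16):
    print(n, count_structures("GC" * (n // 2)))
```

The loop prints the count for alternating GC sequences of increasing length, which shows the rapid growth that rules out enumerating every candidate.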

ab initio structure prediction (cont’)

For a subsequence running from position i (head) to position j (tail), there are four cases:

(1) head paired with tail

(2) tail is unpaired

(3) head is unpaired

(4) split at a position k between i and j, giving two subfolds

looking at shorter (e.g., very short) subsequences in a long sequence ACGGU…ACGUC

• For subsequences of length 1: A, C, G, G, U, …, A, C, G, U, C
  # of base pairs: 0, 0, 0, 0, 0, …, 0, 0, 0, 0, 0

• For subsequences of length 2: AC, CG, GG, GU, …, AC, CG, GU, UC
  # of base pairs: 0, 1, 0, 1, …, 0, 1, 1, 0

• For subsequences of length 3: ACG, CGG, GGU, …, UAC, ACG, CGU, GUC
  # of base pairs: ? e.g., for GUC:
  (1) G-C + U --> 1+0 = 1 (head-tail paired)
  (2) G + UC --> 0+0 = 0 (head unpaired)
  (3) GU + C --> 1+0 = 1 (tail unpaired)
  (4) GU + C --> 1+0 = 1 (split)
  (5) G + UC --> 0+0 = 0 (split)

examine a slightly longer sequence …ACGGUACGU…, where i marks its first base (A) and j its last base (U)

The count for i..j is the max of cases 1, 2, 3, 4:

1. Head-tail paired: count = 1 + max count in subsequence CGGUACG (i+1 to j-1)

2. Head unpaired: count = max count in subsequence CGGUACGU (i+1 to j)

3. Tail unpaired: count = max count in subsequence ACGGUACG (i to j-1)

4. Split (why is it needed, and where to split?): for ACGGUACGU with k = i+2, the split is ACG + GUACGU, and count = max count in ACG + max count in GUACGU

Ab initio structure prediction (cont’)

• Maximizing the number of base pairs (Nussinov et al., 1978)

simple model: δ(i, j) = 1 if bases i and j form a canonical pair, and 0 otherwise
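
The four cases (head-tail paired, head unpaired, tail unpaired, split) can be turned into a small dynamic program. The sketch below is a minimal illustration of the base-pair maximization idea, not the lecture's own code: it assumes the simple model with score 1 for every canonical pair (including G-U), allows adjacent bases to pair as in the length-2 subsequence counts of this lecture, and uses hypothetical names (`nussinov`, `PAIRS`).

```python
# Canonical pairs (Watson-Crick plus the G-U wobble pair).
PAIRS = {("A", "U"), ("U", "A"), ("C", "G"), ("G", "C"), ("G", "U"), ("U", "G")}


def nussinov(seq):
    """Return (max number of base pairs, one optimal list of pairs)."""
    seq = seq.upper()
    n = len(seq)
    if n == 0:
        return 0, []
    # C[i][j] = max number of pairs in seq[i..j]; entries with i >= j stay 0.
    C = [[0] * n for _ in range(n)]
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length - 1
            best = max(C[i + 1][j], C[i][j - 1])  # head or tail unpaired
            if (seq[i], seq[j]) in PAIRS:  # head paired with tail
                best = max(best, C[i + 1][j - 1] + 1)
            for k in range(i + 1, j):  # split into i..k and k+1..j
                best = max(best, C[i][k] + C[k + 1][j])
            C[i][j] = best

    # Trace back one optimal structure.
    pairs, stack = [], [(0, n - 1)]
    while stack:
        i, j = stack.pop()
        if i >= j:
            continue
        if C[i][j] == C[i + 1][j]:
            stack.append((i + 1, j))
        elif C[i][j] == C[i][j - 1]:
            stack.append((i, j - 1))
        elif (seq[i], seq[j]) in PAIRS and C[i][j] == C[i + 1][j - 1] + 1:
            pairs.append((i, j))
            stack.append((i + 1, j - 1))
        else:
            for k in range(i + 1, j):
                if C[i][j] == C[i][k] + C[k + 1][j]:
                    stack.extend([(i, k), (k + 1, j)])
                    break
    return C[0][n - 1], sorted(pairs)


print(nussinov("GGGAAAUCC"))
print(nussinov("ACGGUU"))
```

The table has O(n^2) entries and each entry scans O(n) split points, which gives the O(n^3) running time usually quoted for pseudoknot-free folding.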

GGGAAAUCC

C(i, j) = 0 when i = j

[Figure: dynamic programming matrix C for GGGAAAUCC, filled in order of increasing subsequence length (e.g., GG, GA, AA, AU, UCC, AAUC, GAAAUC); the entry for the whole sequence is 3 base pairs]

Example 2: ACGGUU

subsequence of length 0: empty sequence; 0 pairs
subsequences of length 1: A, C, G, G, U, U; 0, 0, 0, 0, 0, 0 pairs
subsequences of length 2: AC, CG, GG, GU, UU; 0, 1, 0, 1, 0 pairs
subsequences of length 3: ACG, CGG, GGU, GUU; 1, 1, 1, 1 pairs
subsequences of length 4: ACGG, CGGU, GGUU; 1, 2, 2 pairs
subsequences of length 5: ACGGU, CGGUU; 2, 2 pairs
subsequence of length 6: ACGGUU; 3 pairs

Prediction Algorithm Web Server

• http://frontend.bioinfo.rpi.edu/applications/mfold/cgi-bin/rna-form1.cgi

• Sample sequence: (1) tRNA

GGGGUCAUAGCUCAGUUGGUAGAGCGCUACAAUGGCAUUGUAGAGGUCAGCGGUUCGAUCCCGCUUGGCUCCACCA

(2) a part of tmRNA CCUCUCUCCCUAGCCUCCGCUCUUAGGACGGGGAUCAAGAGAGGUCAAACCCAAAAGAGA

• Simple matrix
• Simple matrix with G-U pair
• Complex matrix

Rfam database: http://www.sanger.ac.uk/Software/Rfam/

Thermodynamic energy based structure prediction

• The energy minimization algorithm predicts the secondary structure by minimizing the free energy (ΔG)

ΔG is calculated as the sum of individual contributions of:
– loops
– base pairs
– secondary structure elements

Free-energy values (kcal/mol at 37 °C)

• Energies of stems calculated as stacking contributions between neighboring base pairs

Free-energy values (kcal/mol at 37 °C)

Zuker’s algorithm MFOLD: computing loop dependent energies


Assumptions in such algorithms

• Most likely structure corresponds to energetically most stable structure

• Energy associated with any position is only influenced by local sequence and structure

• Structure formed does not produce pseudoknots


RNA structure prediction web servers

• MFOLD http://www.bioinfo.rpi.edu/applications/mfold/rna/form1.cgi

• RNAfold (part of the Vienna RNA Package) http://rna.tbi.univie.ac.at/cgi-bin/RNAfold.cgi

Examples:

GCTTACGACCATATCACGTTGAATGCACGCCATCCCGTCCGATCTGGCAAGTTAAGCAACGTTGAGTCCAGTTAGTACTTGGATCGGAGACGGCCTGGGAATCCTGGATGTTGTAAGCT


RNA pseudoknot (tmRNAs)

terminates translation errors

Bacterial tmRNA consensus structure

(Felden et al. 2001. NAR 29)

Functions of pseudoknots (TMV 3’ UTR)

Promotes efficient translation
Binds EF1A, cooperates with 5’ UTR

(Leathers et al. 1993, MCB 13; Zeenko et al. 2002, JVI 76)

Pseudoknots drastically increase computational complexity


RNA pseudoknot prediction web servers

• Pknots-RG: http://bibiserv.techfak.uni-bielefeld.de/pknotsrg/

• Pknots-RE (the first pseudoknot prediction algorithm)

• Kinefold: http://kinefold.curie.fr/cgi-bin/form.pl

• ILM: http://cic.cs.wustl.edu/RNA/

Computational complexity issues

• Pseudoknot-free structures: O(n^3) CPU time

• Pseudoknots: NP-hard; restricted cases O(n^5)

• Heuristics added: O(n^4)

• Difficult to use for searching RNA structures in genomes

2. Consensus structure prediction

Covariance fact for RNAs:

Variations in RNA sequences maintain the base-pairing patterns of secondary structures

When one base of a pair changes, the base it pairs with must also change to maintain the same structure

Structure alignments (example)

[Figure: hairpin drawings of the query RNA structure, of A (a structural homolog), and of B (nonhomologous)]

Primary sequence alignment scoring:

query: GGGGGCAACCCC
           | ||
A:     AUCCGAAAGGAU    score: -6

query: GGGGGCAACCCC
           | ||
B:     CCUAGAAAGGAU    score: -6

Structure + sequence alignment scoring:

query: GGGGGCAACCCC
       ||||| ||||||
A:     AUCCGAAAGGAU    score: +11

query: GGGGGCAACCCC
           | ||
B:     CCUAGAAAGGAU    score: -6

Multiple structural alignment of 13 tmRNA genes from the β-proteobacteria [Felden et al. ’01]

Covariance

Dynamic programming approach (Sankoff 1985)

This can be regarded as running ‘two Nussinov algorithms at the same time’ to simultaneously fold two RNAs

[Figure: two sequences, one with subsequence endpoints i and j, the other with endpoints p and q]

‘the coordinated fold’ is found through computing C(i, j, p, q)

needs: O(n^6) time for two sequences and O(n^(3k)) time for k sequences

Inferring structure by comparative sequence analysis

• (1) calculate a multiple sequence alignment

Requires sequences to be similar enough so that they can be initially aligned

Sequences should be dissimilar enough for covarying substitutions to be detected

 


Inferring structure by comparative sequence analysis (cont’)

• (2) compute mutual information

• f_{x_i}: frequency of base x_i in column i

• f_{x_i x_j}: joint (pairwise) frequency of the base pair x_i-x_j between columns i and j

• If i and j are uncorrelated, the mutual information is 0

M_{ij} = \sum_{x_i, x_j} f_{x_i x_j} \log_2 \frac{f_{x_i x_j}}{f_{x_i} f_{x_j}}
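
As a small illustration of the mutual information formula, the sketch below computes M_ij for two alignment columns given as strings of equal length. The function name and the toy columns are assumptions made for this example; gaps are not treated specially.

```python
from collections import Counter
from math import log2


def mutual_information(col_i, col_j):
    """Mutual information (in bits) between two alignment columns."""
    n = len(col_i)
    f_i = Counter(col_i)              # single-column counts for column i
    f_j = Counter(col_j)              # single-column counts for column j
    f_ij = Counter(zip(col_i, col_j))  # joint counts of (x_i, x_j)
    return sum(
        (c / n) * log2((c / n) / ((f_i[x] / n) * (f_j[y] / n)))
        for (x, y), c in f_ij.items()
    )


# Hypothetical columns from four aligned sequences.
print(mutual_information("GCAU", "CGUA"))  # perfectly covarying pair
print(mutual_information("AACC", "GUGU"))  # independent columns
print(mutual_information("AAAA", "UUUU"))  # fully conserved pair
```

Note that a fully conserved base pair scores 0, just like an uncorrelated pair of columns: mutual information only rewards covariation, which is why the input sequences need to be dissimilar enough.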


Inferring structure by comparative sequence analysis (cont’)

• (3) use mutual information Mi,j as pairing “energy” and treat the multiple alignment as a “generic” sequence

• apply a Nussinov-like algorithm to find the most likely common structure
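
A minimal sketch of step (3), under stated assumptions: the mutual information of two columns is used directly as the pairing score, scores below a hypothetical cutoff `min_mi` are treated as 0, and a Nussinov-style recursion picks the non-crossing set of column pairs with the largest total score. The toy alignment is generated for this example only (two covarying stem positions around a conserved GAAA loop); none of the names come from an existing tool.

```python
from collections import Counter
from itertools import product
from math import log2


def mutual_information(col_i, col_j):
    """Mutual information (in bits) between two alignment columns."""
    n = len(col_i)
    f_i, f_j = Counter(col_i), Counter(col_j)
    f_ij = Counter(zip(col_i, col_j))
    return sum((c / n) * log2((c / n) / ((f_i[x] / n) * (f_j[y] / n)))
               for (x, y), c in f_ij.items())


def consensus_pairs(alignment, min_mi=0.5):
    """Column pairs of a maximum-MI, pseudoknot-free consensus structure."""
    n = len(alignment[0])
    cols = ["".join(row[c] for row in alignment) for c in range(n)]
    M = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            mi = mutual_information(cols[i], cols[j])
            if mi >= min_mi:  # ignore weak (noise-level) signals
                M[i][j] = mi
    # C[i][j] = best total score on columns i..j; i >= j entries stay 0.
    C = [[0.0] * n for _ in range(n)]
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length - 1
            best = max(C[i + 1][j], C[i][j - 1], C[i + 1][j - 1] + M[i][j])
            for k in range(i + 1, j):
                best = max(best, C[i][k] + C[k + 1][j])
            C[i][j] = best
    pairs, stack = [], [(0, n - 1)]
    while stack:  # trace back one optimal set of pairs
        i, j = stack.pop()
        if i >= j:
            continue
        if C[i][j] == C[i + 1][j]:
            stack.append((i + 1, j))
        elif C[i][j] == C[i][j - 1]:
            stack.append((i, j - 1))
        elif M[i][j] > 0 and C[i][j] == C[i + 1][j - 1] + M[i][j]:
            pairs.append((i, j))
            stack.append((i + 1, j - 1))
        else:
            for k in range(i + 1, j):
                if C[i][j] == C[i][k] + C[k + 1][j]:
                    stack.extend([(i, k), (k + 1, j)])
                    break
    return sorted(pairs)


# Hypothetical alignment: every combination of two covarying stem pairs.
STEM_PAIRS = ["GC", "CG", "AU", "UA"]
alignment = [p1[0] + p2[0] + "GAAA" + p2[1] + p1[1]
             for p1, p2 in product(STEM_PAIRS, repeat=2)]
print(consensus_pairs(alignment))
```

With only a handful of sequences, unrelated columns can show high mutual information by chance, so a sketch like this is sensitive to the alignment depth and to the cutoff.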

Inferring consensus structure by a graph-theoretic approach (comRNA and RNA Sampler)

• Identify all stems in every sequence, assigning each stem a vertex in the graph

• Connect two stems in two different sequences with an edge if they are similar

• Connect two stems in the same sequence with an edge if they do not conflict

• The optimal consensus structure corresponds to the maximum clique

Consensus structure prediction programs

• Dynalign: http://rna.urmc.rochester.edu/dynalign.html

• Foldalign: http://www.bioinf.au.dk/cgi-bin/webparser-1.5.pl

• ComRNA: http://ural.wustl.edu/~yji/comRNA/

• RNA Sampler: http://ural.wustl.edu/~xingxu/RNASampler/index.html

• Carnac: http://bioinfo.lifl.fr/RNA/carnac/carnac.php

3. Statistical model-based structure prediction and alignment

• Extension from HMM to include mechanisms that can describe (long-distance) base pairings

• Stochastic grammars can describe models defined by HMMs

• Stochastic grammars can describe models not definable by HMMs


Stochastic context-free grammar

• Covariance model (CM) [Eddy and Durbin ’94], based on computational grammar systems

M2 → a M’2     I2 → a I2     D2 → I2
M’2 → I2       I2 → M3       D2 → M3
M’2 → D3       I2 → D3       D2 → D3

A path in the HMM ↔ a derivation in the grammar

Stochastic context-free grammar (cont’)

• Stochastic Context-free Grammars (SCFGs) [Lari and Young ’90, Sakakibara et al. ’94]

S → aSu     L → aL
S → uSa     L → cL
S → gSc     L → a
S → cSg     L → c
S → L

[Figure: derivation tree for an example hairpin, with S nodes generating the paired bases of the stem and L nodes generating the unpaired bases of the loop]

• Each derivation tree corresponds to a structure.

Stochastic context-free grammar (cont’)

1. A CFG

S → aSu | cSg | gSc | uSa
S → a | c | g | u
S → SS

2. A derivation of “accuggacccuuagaggu”

S ⇒ aSu
  ⇒ acSgu
  ⇒ accSggu
  ⇒ accuSaggu
  ⇒ accuSSaggu
  ⇒ accugScSaggu
  ⇒ accuggSccSaggu
  ⇒ accuggaccSaggu
  ⇒ accuggacccSgaggu
  ⇒ accuggacccuSagaggu
  ⇒ accuggacccuuagaggu

3. Corresponding structure

[Figure: the structure corresponding to the derivation]
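
To make the link between a derivation and a structure concrete, the sketch below replays a leftmost derivation in this CFG and emits both the derived sequence and its structure in dot-bracket notation (a matched pair of parentheses for each pairing rule, a dot for each unpaired base). The function name `derive` and the encoding of rules by their right-hand sides are choices made for this illustration.

```python
def derive(rules):
    """Replay a leftmost derivation given as a list of right-hand sides.

    Each entry is one of: a pairing rule such as "aSu", a single base
    such as "a", or the branching rule "SS".
    Returns (sequence, dot-bracket structure).
    """
    it = iter(rules)

    def expand():
        rhs = next(it)
        if rhs == "SS":  # branching: expand the left S fully, then the right S
            left_seq, left_db = expand()
            right_seq, right_db = expand()
            return left_seq + right_seq, left_db + right_db
        if len(rhs) == 1:  # S -> single unpaired base
            return rhs, "."
        inner_seq, inner_db = expand()  # S -> x S y: a base pair around S
        return rhs[0] + inner_seq + rhs[2], "(" + inner_db + ")"

    return expand()


# The rules applied, in order, in the derivation of "accuggacccuuagaggu".
steps = ["aSu", "cSg", "cSg", "uSa", "SS",
         "gSc", "gSc", "a", "cSg", "uSa", "u"]
sequence, structure = derive(steps)
print(sequence)
print(structure)
```

The two hairpins produced by the S → SS step sit side by side inside the outer four-pair stem, which is the nested and parallel arrangement a context-free grammar can describe.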

What to do with SCFGs?

• Structure prediction requires the SCFG model to be flexible enough

• Structure search requires the model to be specific

• Both need sequence-structure alignment

Structure prediction with SCFG

S → aSu | cSg | gSc | uSa
S → aS | cS | gS | uS
S → Sa | Sc | Sg | Su
S → a | c | g | u
S → SS

Probability parameter assignment:

(1) The probabilities of rules with the same LHS sum to 1
(2) Geometric distributions for loop and stem lengths
(3) Parameters are obtained from training sequences with known structures

The alignment score between model S and subsequence x[i..j], when x[i] = a and x[j] = u, is computed as

C(S, i, j) = max { C(S, i+1, j-1) · P(S → aSu),
                   C(S, i+1, j) · P(S → aS),
                   C(S, i, j-1) · P(S → Su),
                   max_k { C(S, i, k) · C(S, k+1, j) · P(S → SS) } }
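
The recurrence can be sketched as a CYK-style dynamic program over this single-nonterminal grammar. Everything numeric below is an assumption for illustration: the rule probabilities are made up (chosen only so that they sum to 1 over all rules with left-hand side S), and logarithms are used instead of raw products to avoid underflow on longer sequences. The base case C(S, i, i) = P(S → x[i]) is added here because the recurrence needs one.

```python
from math import inf, log

PAIR_RULES = {("a", "u"), ("u", "a"), ("c", "g"), ("g", "c")}

# Hypothetical probabilities, one value per rule of each kind:
# 4 * 0.10 + 4 * 0.05 + 4 * 0.05 + 4 * 0.04 + 0.04 = 1.0
P_PAIR, P_LEFT, P_RIGHT, P_SINGLE, P_SPLIT = 0.10, 0.05, 0.05, 0.04, 0.04


def cyk(x):
    """Return (log-probability of the best parse, dot-bracket structure)."""
    n = len(x)
    if n == 0:
        raise ValueError("the grammar has no rule for the empty sequence")
    C = [[-inf] * n for _ in range(n)]
    back = [[None] * n for _ in range(n)]
    for i in range(n):
        C[i][i] = log(P_SINGLE)  # S -> x[i]
        back[i][i] = ("single",)
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length - 1
            cands = [
                (C[i + 1][j] + log(P_LEFT), ("left",)),   # S -> x[i] S
                (C[i][j - 1] + log(P_RIGHT), ("right",)),  # S -> S x[j]
            ]
            if length >= 3 and (x[i], x[j]) in PAIR_RULES:  # S -> x[i] S x[j]
                cands.append((C[i + 1][j - 1] + log(P_PAIR), ("pair",)))
            for k in range(i, j):  # S -> S S, split after position k
                cands.append((C[i][k] + C[k + 1][j] + log(P_SPLIT),
                              ("split", k)))
            C[i][j], back[i][j] = max(cands, key=lambda c: c[0])

    structure = ["."] * n
    stack = [(0, n - 1)]
    while stack:  # follow back-pointers to recover the best parse
        i, j = stack.pop()
        move = back[i][j]
        if move[0] == "pair":
            structure[i], structure[j] = "(", ")"
            stack.append((i + 1, j - 1))
        elif move[0] == "left":
            stack.append((i + 1, j))
        elif move[0] == "right":
            stack.append((i, j - 1))
        elif move[0] == "split":
            stack.extend([(i, move[1]), (move[1] + 1, j)])
    return C[0][n - 1], "".join(structure)


print(cyk("gggaaaccc"))
```

In a trained model the probabilities would come from sequences with known structures, as item (3) states; with the made-up values used here the output only shows how the recurrence operates.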

Web servers for RNA structure prediction with SCFG

Infernal: http://infernal.janelia.org/

Pfold (multiple sequence + SCFG): http://www.daimi.au.dk/~compbio/rnafold/

• RNA secondary structure prediction
1. ab initio structure prediction
2. consensus structure prediction
3. structural (SCFG) model-based prediction

• ncRNA gene finding and annotation
4. profile-based ncRNA gene annotation
5. comparative analysis based ncRNA gene finding
6. ab initio ncRNA gene prediction

4. Structure profile based RNA gene annotation

Secondary structure alone is not sufficient for predicting ncRNA genes, BUT it remains the best hope for an exploitable statistical signal

To find RNA structures or genes, one can profile the structure to be searched. Often, an SCFG is used as the modeling tool.

Structure profile based RNA gene annotation (cont’)

• Search for a specific family of RNAs (structures)
• Need an effective mechanism to profile the family
• Need a fast structure-sequence alignment algorithm

[Figure: search pipeline. RNA training sequences with annotated structures → modeling (e.g., CM, SCFG) → model (profiling); genome → scanning window (target sequence) → alignment of each window against the model]

• CM is a profile-SCFG, position-specific, very effective
• Slow: O(n^3 N) time even for pseudoknot-free RNAs in genomes or large databases
• Cannot handle pseudoknots
• HMM-based filtering to improve speed
• Examples:

tRNAscan-SE (http://lowelab.ucsc.edu/tRNAscan-SE/)

infernal (http://infernal.janelia.org/)

5. Comparative analysis based ncRNA gene finding

• Based on structure features of RNA

• Consider two or more phylogenetically related genomes
• Use sequence alignment tools (such as BLASTN) to find local alignments between them
• Search with a sliding window
• Identify a potential RNA fold within the window
• Computationally verify it to be a putative RNA

QRNA (Eddy group, 2001)
EvoFold (Haussler group, 2006)
RNAz (Stadler group, 2005)

QRNA (Eddy et al, 2001)

Detects ncRNA genes with SCFGs: given two aligned sequences, it tests the pattern of substitutions observed in the pairwise alignment of two homologous sequences using

• a pair SCFG for ncRNAs (compensatory mutations)
• a pair HMM for protein-coding genes (conserved regions)
• a pair HMM for other regions (random evolution)

QRNA:


Probability parameters


Other software to detect RNA genes based on comparative analysis

• EvoFold
– multiple genomes
– uses SCFG + phylogeny to predict the consensus structure

• RNAz
– multiple genomes
– predicts the consensus fold
– compares the energy of the fold to the background energy

6. Ab initio prediction of ncRNA genes

• mainly based on base composition difference between real RNAs and the background, limited success.

• Unsuccessful by simply predicting the structure of RNAs

• Other methods?

Fold energy and fold certainty

• Methods based on folding energy do not seem to work [just like structure prediction]

• How do we distinguish a real ncRNA from random sequences that fold to the same structure by chance? [both could have the same energy]

• The difference seems to be the structure certainty

• But how to compute structure certainty?

Fold certainty

• For a real ncRNA sequence:
– Base pairs contributing to the real fold should not be everywhere.
– The ‘overall strength’ of base pairs contributing to other, false folds should be weak.

• For a random sequence:
– either it does not fold
– or there is a low probability to form a certain fold

Fold certainty (cont’)

• Compute Shannon entropy: En(S) = -∑ Pij log Pij

where Pij is the probability for bases i and j to pair

• Pij = (number of folds in which pair (i, j) is involved) / (number of folds)

[simplified]
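
The simplified definition can be tried on very short sequences by brute force. The sketch below enumerates every fold, counts how often each pair (i, j) occurs, and sums -Pij log Pij. It rests on several assumptions: a "fold" is any pseudoknot-free set of canonical pairs (including the empty one), all folds are weighted equally as in the simplified formula, and the logarithm is taken base 2. Enumeration is exponential in the sequence length, so this is for illustration only.

```python
from collections import Counter
from math import log2

PAIRS = {("A", "U"), ("U", "A"), ("C", "G"), ("G", "C"), ("G", "U"), ("U", "G")}


def folds(seq, i, j):
    """Yield every pseudoknot-free set of canonical pairs on seq[i..j]."""
    if j <= i:
        yield []
        return
    yield from folds(seq, i + 1, j)  # base i unpaired
    for k in range(i + 1, j + 1):    # base i paired with base k
        if (seq[i], seq[k]) in PAIRS:
            for inner in folds(seq, i + 1, k - 1):
                for outer in folds(seq, k + 1, j):
                    yield [(i, k)] + inner + outer


def fold_entropy(seq):
    """Entropy of the pair frequencies Pij over all enumerated folds."""
    seq = seq.upper()
    all_folds = list(folds(seq, 0, len(seq) - 1))
    total = len(all_folds)
    counts = Counter(pair for fold in all_folds for pair in fold)
    return -sum((c / total) * log2(c / total) for c in counts.values())


print(fold_entropy("GCGC"))
print(fold_entropy("AAAA"))
```

A practical implementation would replace the equal weighting with thermodynamic pair probabilities computed by dynamic programming rather than by enumeration.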

Page 77: C omputational ncRNA gene finding (& nc RNA structure prediction) Liming Cai (BINF8210@UGA, Fall 2010) nc RNA structure prediction (& computational ncRNA

Fold certainty (cont’d)

• We measured the entropy Z-score of a real ncRNA against the entropies of its random counterparts

• But the entropy Z-score performs differently on different classes of ncRNA

• miRNAs perform well while tRNAs do not

• What happened?
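
A sketch of the Z-score step, under two assumptions that the slides do not spell out: the random counterparts are made by shuffling the real sequence (which preserves base composition), and the sample standard deviation is used. The entropy of each sequence is taken as given; both function names are hypothetical.

```python
import random
import statistics


def shuffled_counterparts(sequence, count, seed=0):
    """Return `count` random permutations of `sequence` (same base composition)."""
    rng = random.Random(seed)
    counterparts = []
    for _ in range(count):
        letters = list(sequence)
        rng.shuffle(letters)
        counterparts.append("".join(letters))
    return counterparts


def z_score(real_value, background_values):
    """(real - mean(background)) / stdev(background), using the sample stdev."""
    mean = statistics.mean(background_values)
    spread = statistics.stdev(background_values)
    if spread == 0:
        raise ValueError("background values are all identical")
    return (real_value - mean) / spread


# Illustrative numbers only: entropy of the real sequence vs. three shuffles.
print(z_score(1.0, [2.0, 3.0, 4.0]))
```

A strongly negative Z-score means the real sequence folds with much lower entropy (higher certainty) than its shuffled counterparts.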


Readings and projects in RNA informatics

• www.uga.edu/RNA-Informatics/Readings

• www.uga.edu/RNA-Informatics/Projects

• www.uga.edu/RNA-Informatics/Projects/project-details.html


What about pseudoknots?


Tree decomposition based search algorithms

• Dynamic programming at the nucleotide level is time-consuming

• Very slow: $O(n^6 N)$ time even for restricted pseudoknot categories

• Pseudoknots are not very complex from a graph-theoretic point of view


Tree decomposition based search algorithms (cont’d)

• Profile each stem with an SCFG, connecting the two halves with an edge

• Profile each loop with an HMM, connecting the two ends of the loop with a directed edge

• Produce a mixed graph H for the structure

• Preprocess the target sequence with the profiles to obtain all potential candidates, then construct a graph G for the sequence

• Structure-sequence alignment corresponds to finding an optimal subgraph in G isomorphic to H
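
The construction of the mixed graph H can be sketched as follows. This covers only the graph topology: the SCFG and HMM profiles attached to the edges are omitted. It assumes stem halves are given in 5’ to 3’ order, with the second half of each stem marked by a trailing ASCII apostrophe, and that s and t are source and sink vertices. The function name is hypothetical.

```python
def build_structure_graph(stem_order):
    """Build a mixed graph from stem-half labels listed in 5'->3' order.

    Returns (nodes, loop_edges, stem_edges):
      loop_edges are directed edges between consecutive halves (the loops),
      stem_edges are undirected edges joining the two halves of each stem.
    """
    labels = set(stem_order)
    stem_edges = []
    for label in stem_order:
        if label.endswith("'"):
            if label[:-1] not in labels:
                raise ValueError(f"{label} has no first half")
        else:
            if label + "'" not in labels:
                raise ValueError(f"{label} has no second half")
            stem_edges.append((label, label + "'"))
    nodes = ["s", *stem_order, "t"]             # s = source, t = sink (assumed)
    loop_edges = list(zip(nodes, nodes[1:]))    # directed, 5'->3'
    return nodes, loop_edges, stem_edges


# Stem order taken from the structure-graph figure in these slides.
order = ["a", "b", "b'", "c", "c'", "d", "d'", "a'"]
nodes, loop_edges, stem_edges = build_structure_graph(order)
print(stem_edges)
```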


Tree decomposition based search algorithms (cont’d)

• H is decomposed into a tree representation

• A fast alignment algorithm can be obtained, running in $O(k^t N n)$ time, where t is the tree width of H (usually small for pseudoknotted RNAs) and k is a parameter that is also small

• Successful for RNA structures that belong to a well-defined family


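Sequence-structure alignment

[Figure: pipeline from a structure and a sequence to a structure graph and a sequence graph, then to a tree decomposition of the structure graph; the alignment is obtained as a subgraph isomorphism. The numbers 1 to 3 in the figure refer to the steps below.]

1. Construct graphs

2. Tree decompose the structure graph

3. Dynamic programming based on tree decomposition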


Structure graph

[Figure: structure graph with end vertices s and t and stem halves in the order a b b’ c c’ d d’ a’. Legend: Hidden Markov Model (HMM), Covariance Model (CM).]


Sequence graph

For each stem, identify k candidates in the sequence

[Figure: structure graph (s, a b b’ c c’ d d’ a’, t) above the genome sequence, with candidates a1, a2, a’1, a’2 marked on the sequence.]


Sequence graph (cont’d)

For each stem, identify k candidates in the sequence

[Figure: structure graph (s, a b b’ c c’ d d’ a’, t) above the genome sequence, with candidates a1, a2, a’1, a’2 and b1, b2, b’1, b’2 marked on the sequence.]


Sequence graph (cont’d)

[Figure: sequence graph from s to t built over the genome sequence, with two candidates for each stem half: a1, a2, a’1, a’2, b1, b2, b’1, b’2, c1, c2, c’1, c’2, d1, d2, d’1, d’2.]


Subgraph isomorphism

Sequence-structure alignment becomes subgraph isomorphism

[Figure: structure graph (s, a a’ b b’ c c’, t) matched against the sequence graph with k=2 candidates for each stem half (a1, a2, a’1, a’2, b1, b2, b’1, b’2, c1, c2, c’1, c’2).]
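
To make the reduction concrete, here is a brute-force baseline that tries every combination of candidates (k per stem) and keeps the best-scoring one whose halves appear in the order required by the structure. It is not the tree-decomposition algorithm from these slides, only the exhaustive search that it is meant to speed up. The candidate format (left position, right position, score), the single-position treatment of each stem half, and the function name are simplifying assumptions.

```python
from itertools import product


def best_alignment(stem_order, candidates):
    """Pick one candidate per stem, maximizing total score.

    stem_order: stem-half labels in 5'->3' order, e.g. ["a", "b", "a'", "b'"].
    candidates: {stem: [(left_pos, right_pos, score), ...]} with k entries each.
    A choice is valid when the chosen positions are strictly increasing
    along stem_order. Runs in O(k^m) time for m stems.
    """
    stems = [label for label in stem_order if not label.endswith("'")]
    best_score, best_choice = None, None
    for choice in product(*(candidates[stem] for stem in stems)):
        position = {}
        for stem, (left, right, _score) in zip(stems, choice):
            position[stem] = left
            position[stem + "'"] = right
        ordered = [position[label] for label in stem_order]
        if any(p >= q for p, q in zip(ordered, ordered[1:])):
            continue  # halves out of order: not isomorphic to the structure
        total = sum(score for _left, _right, score in choice)
        if best_score is None or total > best_score:
            best_score, best_choice = total, dict(zip(stems, choice))
    return best_score, best_choice


# Illustrative pseudoknot (crossing stems a and b) with k=2 candidates per stem.
order = ["a", "b", "a'", "b'"]
cands = {"a": [(0, 20, 5.0), (2, 12, 4.0)],
         "b": [(6, 18, 6.0), (10, 30, 3.0)]}
score, choice = best_alignment(order, cands)
print(score, choice)
```

In this example the two individually best candidates cannot be combined, because a’ at position 20 would fall after b’ at position 18, so the search settles on a lower-scoring candidate for stem a.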


Tree decomposition of structure graph

[Figure: structure graph over s, a, a’, b, b’, c, c’, d, d’, t and its tree decomposition into three-vertex bags (for example s a a’ and a a’ t).]

(1) Pseudoknot-free structure graphs have tree width = 2
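
The tree-width claim can be checked on a small case. The sketch below validates a proposed tree decomposition using the standard definition (every edge lies inside some bag, and the bags containing any given vertex form a connected subtree) and returns its width, the largest bag size minus one. The example graph is a hypothetical pseudoknot-free structure with two nested stems, not one taken from the slides.

```python
from collections import defaultdict


def tree_decomposition_width(graph_edges, bags, tree_edges):
    """Return the width of a tree decomposition, or raise ValueError if invalid.

    bags: list of vertex collections; tree_edges: pairs of bag indices.
    Isolated graph vertices (in no edge) are not checked.
    """
    bags = [set(bag) for bag in bags]
    if not bags:
        raise ValueError("no bags given")
    neighbours = defaultdict(set)
    for i, j in tree_edges:
        neighbours[i].add(j)
        neighbours[j].add(i)

    def connected(indices):
        """True if the given bag indices form one connected piece of the tree."""
        indices = set(indices)
        start = next(iter(indices))
        seen, stack = {start}, [start]
        while stack:
            for j in neighbours[stack.pop()]:
                if j in indices and j not in seen:
                    seen.add(j)
                    stack.append(j)
        return seen == indices

    # The bags must form a tree: connected, with one edge fewer than bags.
    if len(tree_edges) != len(bags) - 1 or not connected(range(len(bags))):
        raise ValueError("bags do not form a tree")
    # Every graph edge must lie inside at least one bag.
    for u, v in graph_edges:
        if not any(u in bag and v in bag for bag in bags):
            raise ValueError(f"edge ({u}, {v}) is not covered by any bag")
    # The bags containing any one vertex must be connected in the tree.
    for vertex in set().union(*bags):
        holders = [i for i, bag in enumerate(bags) if vertex in bag]
        if not connected(holders):
            raise ValueError(f"bags containing {vertex} are not connected")
    return max(len(bag) for bag in bags) - 1


# Hypothetical nested structure: backbone s-a-b-b'-a'-t plus stem edges a-a', b-b'.
edges = [("s", "a"), ("a", "b"), ("b", "b'"), ("b'", "a'"), ("a'", "t"), ("a", "a'")]
bags = [{"s", "a", "a'"}, {"a", "a'", "t"}, {"a", "b", "a'"}, {"b", "b'", "a'"}]
width = tree_decomposition_width(edges, bags, [(0, 1), (1, 2), (2, 3)])
print(width)
```

The check only confirms that a decomposition of this width exists; it does not prove that no narrower one does.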


Tree decomposition of structure graph (cont’d)

[Figure: structure graph over s, a, a’, b, b’, c, c’, d, d’, x, y, t with its tree decomposition; several bags now also contain y.]

(1) Pseudoknot-free structure graphs have tree width = 2

(2) Almost all pseudoknot structure graphs have small tree width


Tree width of tmRNA

Tree width = 5


Tree decomposition based search algorithms (cont’d)


Tree decomposition based search algorithms (cont’d)

HI: Haemophilus influenzae
NM: Neisseria meningitidis
SC: Saccharomyces cerevisiae
SB: Saccharomyces bayanus


RNA structure and gene search

How can we identify novel RNAs whose structure may deviate from the common structure of the family?

- make the profile accommodate novel structures (this may mean testing more potential structures)

- make the structure-sequence alignment fast enough