Computational ncRNA gene finding (& ncRNA structure prediction)

Computational ncRNA gene finding (& ncRNA structure prediction)

Liming Cai

(BINF8210@UGA, Fall 2010) ncRNA structure prediction (& computational ncRNA gene finding)

Non-coding RNAs

Functions other than coding for proteins, e.g., as structural, catalytic, and regulatory factors

functional RNAs = ncRNAs + UTR motifs

(-) No strong statistical features of the kind seen in protein-coding genes, such as ORFs or polyadenylation

(+) Transcribed ncRNA molecules can fold into secondary and tertiary structures (more conserved than sequences)

Sources of ncRNAs

Non-coding RNA genes encode RNAs, e.g., miRNAs, and the roX1 and roX2 RNAs in male Drosophila melanogaster.

In introns and intergenic regions, e.g., snoRNAs

In 5' and 3' UTRs, e.g., regulatory motifs (functional RNAs)

Functions of ncRNAs

- rRNAs and tRNAs
- RNA maturation: snRNAs recognize splicing sites
- RNA modification: snoRNAs convert uridine to pseudouridine
- Regulation of gene expression and translation: e.g., miRNAs
- DNA replication: e.g., telomerase RNAs, the template for addition of telomeric repeats
- Etc.

Classes of ncRNAs (Bompfunewerer et al, 2005)

Class | Size (nt) | Function | Phylogenetic distribution
tRNA | 70-80 | translation | ubiquitous
rRNA 16S/18S | 1.5K | translation | ubiquitous
rRNA 28S+5.8S/23S | 3K | translation | ubiquitous
rRNA 5S | 130 | translation | ubiquitous
RNase P / MRP | 220-440 / 250-350 | tRNA maturation | ubiquitous / Eukarya
snoRNA | 130 | pseudouridinylation | Eukarya, archaea
telomerase | 400-550 | addition of repeats | Eukarya
snRNA, U1 ~ U6 | 100-600, 130-140 | spliceosome, mRNA maturation | Eukarya
U7 | ~65 | histone mRNA maturation | eukaryotes
7SK | ~300 | translational regulation | vertebrata
tmRNA | 300-400 | tags protein for proteolysis | bacteria
miRNA | ~22 | post-transcriptional regulation | multicellular organisms

Some ncRNA databases

- Rfam (280,000 regions of 379 families)
- NONCODE (109 transitional classes and 9 groups)
- RNAdb (800 mammalian ncRNAs, excluding tRNAs, rRNAs and snRNAs)
- Arabidopsis Small RNA Project (ASRP)
- Etc.

ncRNA gene finding strategies

- Computational predictive methods
- cDNA cloning to enrich ncRNAs
- Detecting new transcripts with oligonucleotide microarrays

ncRNA gene finding: a computational challenge

ncRNA genes
- do not have significant statistical signals
- are large in number
- are diverse, 20 nts to 22,000 nts

Not sure what to look for

Computationally intensive
- Simply no good method
- Methods compromising accuracy

Computational ncRNA gene finding methods

Specific (custom-designed) ncRNA search and annotation (e.g., tRNAscan, methylation-guide snoRNA, miRNA, tmRNA)

Reconfigurable search systems (e.g., Infernal, ERPIN, RNATOPS, FastR)
- mechanism to profile the target ncRNA (structure)
- need training data

De novo ncRNA gene detection with
- base composition (e.g., G+C %)
- structure fold (e.g., RNAz)

Comparative analysis (e.g., QRNA, EvoFold)
- consensus structure

ncRNA holy grail ?

Review literature in computational ncRNA gene finding and annotation

A. Laederach (2007) Informatics challenges in structural RNA, Briefings in Bioinformatics, 8(5), 294-303.

S. Eddy (2001) Non-coding RNA genes and modern RNA world, Nature Reviews Genetics, 2(12), 919-929.

S. Griffiths-Jones (2007) Annotating noncoding RNA genes, Annual Rev. Genomics & Human Genetics, 8:279-298.

Machado-Lima et al (2008) Computational methods in noncoding RNA research, Mathematical Biology, 56: 15-49.

[Figure: 506 miRNAs, comparison between NUPACK and Triple]

[Figure: 499 tRNAs, comparison between NUPACK and Triple. Data were from Bonnet et al, 2004]

[Figure: 499 tRNAs, comparisons between HG, Triple, and NUPACK. Data were from Bonnet et al, 2004]

[Figure: 499 tRNAs, comparison between HG and NUPACK]

What is in this lecture?

RNA secondary structure prediction
1. ab initio structure prediction
2. consensus structure prediction
3. structural model-based prediction

But why just secondary structure?

Tertiary structure:
- Less understood non-canonical interactions
- Only a small number of resolved structures

Measuring ncRNA secondary structure may be a feasible solution for ncRNA gene finding

What else is in this lecture?

ncRNA gene finding and annotation
4. structural profile-based ncRNA gene annotation
5. comparative analysis based ncRNA gene finding
6. ab initio ncRNA gene detection

Base pairings of RNAs

- Base pairings allow RNA to fold
- Watson-Crick base pairs: A-U, C-G
- Wobble pair: G-U
- Together these are called the canonical pairs for secondary structure
- Note: all 16 base pairs (including non-canonical ones) are possible in RNA tertiary structure

5'-u-u-c-c-g-a-a-g-c-u-c-a-a-c-g-g-g-a-a-a-u-g-a-g-c-u-3'

[Figure: the four RNA bases: cytosine, guanine, uracil, adenine]

Secondary structure is important to tertiary structure

Stems in nested or parallel pattern

[Figure: stems drawn over the bases aacguuccccucuggggcagcccagaugccc]

stem (double helix): stacked base pairs

loop: strand of unpaired bases

Stems in crossing patterns

[Figure: crossing stems drawn over the bases aacguuccccucuaccggggcagcgguccagaugcacccc]

Pseudoknots: crossing patterns of stems
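The distinction between nested or parallel stems and crossing stems can be checked mechanically. The following is a minimal Python sketch, assuming base pairs are given as 0-based index tuples (i, j); the function name `has_pseudoknot` and the three toy pair lists are illustrative and not taken from the lecture.

```python
from itertools import combinations

def has_pseudoknot(pairs):
    """Return True if any two base pairs cross, i.e. i < k < j < l."""
    ordered = sorted(tuple(sorted(p)) for p in pairs)
    for (i, j), (k, l) in combinations(ordered, 2):
        # After sorting, i <= k. The pairs cross when k opens inside (i, j)
        # but its partner l closes outside of it.
        if i < k < j < l:
            return True
    return False

nested = [(0, 9), (1, 8), (2, 7)]   # one stem: pairs stacked inside each other
parallel = [(0, 4), (5, 9)]         # two stems side by side
crossing = [(0, 6), (3, 9)]         # second pair opens inside the first, closes outside

print(has_pseudoknot(nested), has_pseudoknot(parallel), has_pseudoknot(crossing))
```

Only the third pair list is reported as a pseudoknot; nested and parallel arrangements never satisfy the crossing condition.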

RNA secondary structure elements

- Hairpin loop
- Junction (multiloop)
- Bulge loop
- Single-stranded
- Interior loop
- Stem
- Pseudoknot

(Image: Wuchty)

RNA stem-loop (pseudoknot-free) structure example

RNA secondary structure prediction

1. ab initio structure prediction
to predict the structure of a single sequence

2. Consensus structure prediction
to predict the structure shared by two or more sequences

3. Statistical model-based prediction and alignment
to search for desirable structures in genomes or databases

1. ab initio structure prediction

Forming hydrogen bonds (base pairs) lowers the free energy contained in the molecule.

The lower the free energy, the more stable the folded structure.

ab initio structure prediction (cont)

Consider only canonical base pairs: A-U, C-G, and G-U.

Base pairings reduce the amount of free energy contained in the molecule.

Maximizing the number of base pairs would minimize the free energy in the molecule. (Only an approximate model)

ab initio structure prediction (cont)

But how to count?

An RNA could be very long; there may be many possible ways that base pairs can be formed, e.g., in ACGGUACGUC.. the candidate pairs A-U, A-U, G-C, G-C, etc. conflict with one another.

Even the number of non-conflicting combinations of base pairs is exponentially large.

ab initio structure prediction (cont)

Look at shorter (e.g., very short) subsequences in a long sequence ACGGUACGUC

For subsequences of length 1: A, C, G, G, U, ..., A, C, G, U, C
# of base pairs: 0, 0, 0, 0, 0, ..., 0, 0, 0, 0, 0

For subsequences of length 2: AC, CG, GG, GU, ..., AC, CG, GU, UC
# of base pairs: 0, 1, 0, 1, ..., 0, 1, 1, 0

For subsequences of length 3: ACG, CGG, GGU, ..., UAC, ACG, CGU, GUC, UUC
# of base pairs: ? e.g., for GUC:
(1) G-C + U --> 1+0 = 1 (head-tail paired)
(2) G + UC --> 0+0 = 0 (head unpaired)
(3) GU + C --> 1+0 = 1 (tail unpaired)
(4) GU + C --> 1+0 = 1 (split)
(5) G + UC --> 0+0 = 0 (split)

Examine a slightly longer sequence ..ACGGUACGU.., with i at its first base and j at its last ==> max of {cases 1, 2, 3, 4}
1. Head-tail paired: count = 1 + max count in subsequence CGGUACG (i+1 .. j-1)
2. Head unpaired: count = max count in subsequence CGGUACGU (i+1 .. j)
3. Tail unpaired: count = max count in subsequence ACGGUACG (i .. j-1)
4. Split (why needed and where to split?): e.g., when k = i+2, ACGGUACGU ==> ACG + GUACGU, count = max count in ACG + max count in GUACGU

Ab initio structure prediction (cont)

Maximizing the number of base pairs (Nussinov et al, 1978)

Example 1: GGGAAAUCCC

[Figure: dynamic programming matrix over G G G A A A U C C C; entry (i,j) = 0 when i = j]

Example 2: ACGGUU
subsequence of length 0: empty sequence, 0 pairs
subsequences of length 1: A, C, G, G, U, U --> 0, 0, 0, 0, 0, 0 pairs
subsequences of length 2: AC, CG, GG, GU, UU --> 0, 1, 0, 1, 0 pairs
subsequences of length 3: ACG, CGG, GGU, GUU --> 1, 1, 1, 1 pairs
subsequences of length 4: ACGG, CGGU, GGUU --> 1, 2, 2 pairs
subsequences of length 5: ACGGU, CGGUU --> 2, 2 pairs
subsequence of length 6: ACGGUU --> 3 pairs
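The four cases (head-tail paired, head unpaired, tail unpaired, split) translate directly into a dynamic program over subsequences of increasing length. Below is a minimal Python sketch of that pair-maximization idea, not the lecture's own code. The function name `max_base_pairs` and the `min_loop` parameter are assumptions added for illustration: with `min_loop=0` adjacent bases may pair, as in the hand counts on the slides, while folding programs usually require at least 3 unpaired bases in a hairpin loop.

```python
# Canonical pairs counted on the slides: Watson-Crick plus the G-U wobble pair.
CANONICAL = {("A", "U"), ("U", "A"), ("C", "G"), ("G", "C"), ("G", "U"), ("U", "G")}

def max_base_pairs(seq, min_loop=0):
    """Maximum number of non-crossing canonical base pairs in seq."""
    seq = seq.upper()
    n = len(seq)
    if n == 0:
        return 0
    count = [[0] * n for _ in range(n)]  # count[i][j]: best for seq[i..j]

    def cell(i, j):
        # Empty subsequences (i > j) contain no pairs.
        return count[i][j] if i <= j else 0

    for span in range(1, n):             # fill by increasing subsequence length
        for i in range(n - span):
            j = i + span
            best = max(cell(i + 1, j),   # head unpaired
                       cell(i, j - 1))   # tail unpaired
            if (seq[i], seq[j]) in CANONICAL and j - i > min_loop:
                best = max(best, cell(i + 1, j - 1) + 1)       # head-tail paired
            for k in range(i + 1, j - 1):                      # split into two parts
                best = max(best, cell(i, k) + cell(k + 1, j))
            count[i][j] = best
    return count[0][n - 1]

print(max_base_pairs("ACGGUU"))                   # Example 2 from the slides
print(max_base_pairs("GGGAAAUCCC", min_loop=3))   # with a 3-base minimum hairpin loop
```

The triple loop over span, i, and k is where the cubic running time of pseudoknot-free prediction comes from.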

Prediction Algorithm Web Server

http://frontend.bioinfo.rpi.edu/applications/mfold/cgi-bin/rna-form1.cgi

Sample sequences:
(1) tRNA: GGGGUCAUAGCUCAGUUGGUAGAGCGCUACAAUGGCAUUGUAGAGGUCAGCGGUUCGAUCCCGCUUGGCUCCACCA
(2) a part of tmRNA: CCUCUCUCCCUAGCCUCCGCUCUUAGGACGGGGAUCAAGAGAGGUCAAACCCAAAAGAGA

- Simple matrix
- Simple matrix with G-U pair
- Complex matrix

Rfam database: http://www.sanger.ac.uk/Software/Rfam/

Thermodynamic energy based structure prediction

Energy minimization algorithm predicts the correct secondary structure by minimizing the free energy (ΔG)

ΔG calculated as sum of individual contributions of:
- loops
- base pairs
- secondary structure elements

Free-energy values (kcal/mole at 37 °C)

Energies of stems calculated as stacking contributions between neighboring base pairs

Free-energy values (kcal/mole at 37 °C)

Zuker's algorithm (MFOLD): computing loop-dependent energies

Assumptions in such algorithms

The most likely structure corresponds to the energetically most stable structure

Energy associated with any position is only influenced by local sequence and structure

Structure formed does not produce pseudoknots

RNA structure prediction web servers

MFOLD http://www.bioinfo.rpi.edu/applications/mfold/rna/form1.cgi

RNAfold (a part of the Vienna Package) http://rna.tbi.univie.ac.at/cgi-bin/RNAfold.cgi

Example:
GCTTACGACCATATCACGTTGAATGCACGCCATCCCGTCCGATCTGGCAAGTTAAGCAACGTTGAGTCCAGTTAGTACTTGGATCGGAGACGGCCTGGGAATCCTGGATGTTGTAAGCT

RNA pseudoknot (tmRNAs)

- terminates translation errors
- Bacterial tmRNA consensus structure (Felden et al. 2001. NAR 29)

Functions of pseudoknots (TMV 3' UTR)

- Promotes efficient translation
- Binds EF1A, cooperates with 5' UTR
(Leathers et al. 1993 MCB 13; Zeenko et al. 2002 JVI 76)

Pseudoknots drastically increase computational complexity

RNA pseudoknot prediction web servers

Pknots-RG: http://bibiserv.techfak.uni-bielefeld.de/pknotsrg/

Pknots-RE (the first pseudoknot prediction algorithm)

Kinefold: http://kinefold.curie.fr/cgi-bin/form.pl

ILM http://cic.cs.wustl.edu/RNA/

Computational complexity issues

- Pseudoknot-free structures: O(n^3) CPU time
- Pseudoknots: NP-hard; restricted cases O(n^5)
- Heuristics added: O(n^4)
- Difficult to search for RNA structures in genomes

2. Consensus structure prediction

Covariance fact for RNAs:

Variations in RNA sequence maintain base-pairing patterns for secondary structures

When one nucleotide of a base pair changes, its partner must also change to maintain the same structure

Structure alignments (example)

[Figure: a query RNA structure compared with A, a structural homolog, and B, a nonhomologous sequence, under primary sequence alignment scoring and under structure + sequence alignment scoring; scores shown in the figure: -6, -6, +11, -6]

Multiple structural alignment of 13 tmRNA genes from the -proteobacteria [Felden et al. 01]

Dynamic programming approach (Sankoff 1984)

This can be regarded as running two Nussinov algorithms at the same time to simultaneously fold two RNAs

The coordinated fold is found through computing C(i,j,p,q)

Needs O(n^6) time for two sequences and O(n^(3k)) time for k sequences

Inferring structure by comparative sequence analysis

(1) calculate a multiple sequence alignment

Requires sequences to be similar enough so that they can be initially aligned

Sequences should be dissimilar enough for covarying substitutions to be detected

Inferring structure by comparative sequence analysis (cont)

(2) compute Mutual Information

f_{x_i}: frequency of a base x in column i

f_{x_i y_j}: joint (pairwise) frequency of base pair x-y between columns i and j

If i and j are uncorrelated, mutual information is 0
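The mutual information formula itself did not survive in these notes. The sketch below assumes the standard definition, M(i,j) = sum over x,y of f_{x_i y_j} * log2( f_{x_i y_j} / (f_{x_i} * f_{y_j}) ), measured in bits. The function name `mutual_information` and the four-sequence toy alignment are made up for illustration, and gap handling is ignored.

```python
import math
from collections import Counter

def mutual_information(alignment, i, j):
    """Mutual information (in bits) between alignment columns i and j."""
    n = len(alignment)
    col_i = [seq[i] for seq in alignment]
    col_j = [seq[j] for seq in alignment]
    f_i = Counter(col_i)                 # single-column base counts
    f_j = Counter(col_j)
    f_ij = Counter(zip(col_i, col_j))    # joint counts of (x, y)
    mi = 0.0
    for (x, y), c in f_ij.items():
        p_xy = c / n
        mi += p_xy * math.log2(p_xy / ((f_i[x] / n) * (f_j[y] / n)))
    return mi

# Toy alignment: columns 0 and 4 covary (G-C, C-G, A-U, U-A); the rest are constant.
toy_alignment = ["GAAAC", "CAAAG", "AAAAU", "UAAAA"]
print(mutual_information(toy_alignment, 0, 4))  # covarying columns
print(mutual_information(toy_alignment, 1, 2))  # fully conserved columns
```

In the toy alignment the covarying columns carry 2 bits, while fully conserved columns carry 0 bits even though they would align perfectly. This is why the input sequences need to be dissimilar enough for covariation to show up.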

Inferring structure by comparative sequence analysis (cont)

(3) use the mutual information M(i,j) as pairing energy and treat the multiple alignment as a generic sequence

apply a Nussinov-like algorithm to find the most likely common structure

Inferring consensus structure by a graph-theoretic approach (comRNA and RNASampler)

Identify all stems in every sequence, assigning each stem a vertex in the graph

Connect two stems in two different sequences with an edge if they are similar

Connect two stems in the same sequence with an edge if they do not conflict

The optimal consensus structure corresponds to a maximum clique

Consensus structure prediction programs

- Dynalign: http://rna.urmc.rochester.edu/dynalign.html
- Foldalign: http://www.bioinf.au.dk/cgi-bin/webparser-1.5.pl
- ComRNA: http://ural.wustl.edu/~yji/comRNA/
- RNA sampler: http://ural.wustl.edu/~xingxu/RNASampler/index.html
- Carnac: http://bioinfo.lifl.fr/RNA/carnac/carnac.php

3. Statistical model-based structure prediction and alignment

Extension from HMM to include mechanisms that can describe (long-distance) base pairings

Stochastic grammars can describe models defined by HMMs

Stochastic grammars can describe models not definable by HMMs

Stochastic context-free grammar

Covariance model (CM) [Eddy and Durbin 94], based on computational grammar systems

[Figure: profile HMM states M2, I2, D2, M3, D3 and the corresponding grammar rules]

A path in the HMM corresponds to a derivation in the grammar

Stochastic context-free grammar (cont)

Stochastic Context-free Grammars (SCFGs) [Lari and Young 90, Sakakibara et al. 94]

[Figure: derivation tree of an RNA sequence, using nonterminals S and L]

Each derivation tree corresponds to a structure.

Stochastic context-free grammar (cont)

1. A CFG
S -> aSu | cSg | gSc | uSa | a | c | g | u | SS

2. A derivation of accuggacccuuagaggu
S ==> aSu ==> acSgu ==> accSggu ==> accuSaggu ==> accuSSaggu ==> accugScSaggu ==> accuggSccSaggu ==> accuggaccSaggu ==> accuggacccSgaggu ==> accuggacccuSagaggu ==> accuggacccuuagaggu

3. Corresponding structure
[Figure: secondary structure corresponding to the derivation]

What to do with SCFGs?

Structure prediction: requires the SCFG model to be flexible enough

Structure search: requires the model to be specific

Both need to do sequence-structure alignment

Structure prediction with SCFG

S -> aSu | cSg | gSc | uSa | aS | cS | gS | uS | Sa | Sc | Sg | Su | a | c | g | u | SS

Probability parameter assignment:
(1) Sum of probabilities of the rules with the same LHS = 1
(2) Geometric distributions for loop and stem lengths
(3) Parameters are obtained from training sequences with known structures

Alignment score between model S and subsequence x[i..j] is computed as follows, when x[i]=a, x[j]=u:

C(S, i, j) = max { C(S, i+1, j-1) * P(S -> aSu),
                   C(S, i+1, j) * P(S -> aS),
                   C(S, i, j-1) * P(S -> Su),
                   max_k { C(S, i, k) * C(S, k+1, j) * P(S -> SS) } }
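The recurrence for C(S, i, j) is a CYK-style maximization and can be run as written on the 17-rule grammar of this slide. The sketch below is a minimal Python illustration, not the lecture's implementation. The rule probabilities are hypothetical placeholders chosen only so that they sum to 1 (they are not trained values), the function name `cyk_fold` is made up, and scores are kept as log-probabilities to avoid underflow.

```python
import math

WATSON_CRICK = {("a", "u"), ("u", "a"), ("c", "g"), ("g", "c")}

# Hypothetical probabilities per rule (not trained): 4 pair rules, 4 + 4
# one-sided rules, 4 terminal rules and 1 split rule, summing to 1.0.
P_PAIR, P_LEFT, P_RIGHT, P_SINGLE, P_SPLIT = 0.15, 0.02, 0.02, 0.04, 0.08

def cyk_fold(seq):
    """Return (probability, dot-bracket string) of the most likely derivation."""
    seq = seq.lower()
    n = len(seq)
    if n == 0:
        raise ValueError("sequence must be non-empty")
    score = [[float("-inf")] * n for _ in range(n)]  # log C(S, i, j)
    back = [[None] * n for _ in range(n)]            # rule chosen for each cell
    for i in range(n):
        score[i][i] = math.log(P_SINGLE)             # S -> x
        back[i][i] = ("single", None)
    for span in range(1, n):
        for i in range(n - span):
            j = i + span
            cands = [
                (score[i + 1][j] + math.log(P_LEFT), ("left", None)),    # S -> xS
                (score[i][j - 1] + math.log(P_RIGHT), ("right", None)),  # S -> Sx
            ]
            # S -> xSy needs a non-empty inner S, hence span >= 2.
            if span >= 2 and (seq[i], seq[j]) in WATSON_CRICK:
                cands.append((score[i + 1][j - 1] + math.log(P_PAIR), ("pair", None)))
            for k in range(i, j):                                        # S -> SS
                cands.append((score[i][k] + score[k + 1][j] + math.log(P_SPLIT),
                              ("split", k)))
            score[i][j], back[i][j] = max(cands, key=lambda c: c[0])

    # Trace the chosen rules back into a dot-bracket structure.
    structure = ["."] * n
    stack = [(0, n - 1)]
    while stack:
        i, j = stack.pop()
        rule, k = back[i][j]
        if rule == "pair":
            structure[i], structure[j] = "(", ")"
            stack.append((i + 1, j - 1))
        elif rule == "left":
            stack.append((i + 1, j))
        elif rule == "right":
            stack.append((i, j - 1))
        elif rule == "split":
            stack.extend([(i, k), (k + 1, j)])
    return math.exp(score[0][n - 1]), "".join(structure)

print(cyk_fold("gaaac"))
```

With these placeholder values the short hairpin gaaac is derived as one g-c pair around three unpaired bases. Changing the probabilities changes which derivation wins, which is the sense in which parameters must be trained on sequences with known structures.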

Web servers for RNA structure prediction with SCFG

Infernal:

http://infernal.janelia.org/

Pfold: (multiple sequence + SCFG)

http://www.daimi.au.dk/~compbio/rnafold/

RNA secondary structure prediction
1. ab initio structure prediction
2. consensus structure prediction
3. structural (SCFG) model-based prediction

ncRNA gene finding and annotation
4. profile-based ncRNA gene annotation
5. comparative analysis based ncRNA gene finding
6. ab initio ncRNA gene prediction

4. Structure profile based RNA gene annotation

Secondary structure alone is not sufficient for predicting ncRNA genes, BUT it remains the best hope for an exploitable statistical signal

To find RNA structures or genes, one can profile the structure to be searched. Often, an SCFG is used as the modeling tool.

Structure profile based RNA gene annotation (cont)

Search for a specific family of RNAs (structures)
- Need an effective mechanism to profile the family
- Need a fast structure-sequence alignment algorithm

[Figure: RNA training sequences with annotated structures are profiled into a model; the model is aligned against a scanning window (target sequence) moving along the genome]

- CM is a profile-SCFG, position-specific, very effective
- Slow: O(n^3 N) time even for pseudoknot-free RNAs in genomes or large databases
- Cannot handle pseudoknots
- HMM-based filtering to improve speed

Examples:
- tRNAscan-SE (http://lowelab.ucsc.edu/tRNAscan-SE/)
- Infernal (http://infernal.janelia.org/)

5. Comparative analysis based ncRNA gene finding

Based on structure features of RNA

- Consider two or more phylogenetically related genomes
- Use sequence alignment tools (such as BLASTN) to find local alignments between the two
- Search with a sliding window
- Identify a potential RNA fold within the window
- Computationally verify it to be a putative RNA

- QRNA (Eddy group, 2001)
- EvoFold (Haussler group, 2006)
- RNAz (Stadler group, 2005)

QRNA (Eddy et al, 2001)

Detecting ncRNA genes with SCFGs: given two aligned sequences, test the pattern of substitutions observed in the pairwise alignment of the two homologous sequences using

- a pair of SCFGs for ncRNAs (compensatory mutations)
- a pair HMM for protein-coding genes (conserved regions)
- a pair HMM for other regions (random evolution)

Probability parameters

Other software to detect RNA genes based on comparative analysis

EvoFold
- multiple genomes
- use SCFG + phylogeny to predict consensus structure

RNAz
- multiple genomes
- predict the consensus fold
- compare energy of the fold to background energy

6. Ab initio prediction of ncRNA genes

- Mainly based on the base composition difference between real RNAs and the background; limited success
- Unsuccessful by simply predicting the structure of RNAs
- Other methods?
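A base-composition screen of this kind can be sketched as a sliding window that flags regions whose G+C fraction stands out from the genome-wide background. The Python sketch below is illustrative only. The function names, the window size, step, and `min_excess` threshold are assumptions, the test genome is synthetic, and a real screen would need a proper statistical background model.

```python
def gc_fraction(seq):
    """Fraction of G and C bases in seq (0.0 for an empty string)."""
    seq = seq.upper()
    return sum(base in "GC" for base in seq) / len(seq) if seq else 0.0

def gc_rich_windows(genome, window=50, step=10, min_excess=0.15):
    """Return (start, end, G+C fraction) for windows exceeding the background."""
    background = gc_fraction(genome)
    hits = []
    for start in range(0, len(genome) - window + 1, step):
        frac = gc_fraction(genome[start:start + window])
        if frac - background >= min_excess:
            hits.append((start, start + window, frac))
    return hits

# Synthetic genome: an A+T-only background with a 50-base G+C island at 100..150.
synthetic = "AT" * 50 + "GC" * 25 + "AT" * 50
for start, end, frac in gc_rich_windows(synthetic):
    print(start, end, round(frac, 2))
```

Every flagged window overlaps the planted island, but composition alone says nothing about structure, which is consistent with the limited success noted above.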

Fold energy and fold certainty

- Methods based on folding energy do not seem to work [just like structure prediction]
- How do we distinguish a real ncRNA from random sequences that fold to the same structure by chance? [both could have the same energy]
- The difference seems to be the structure certainty
- But how to compute structure certainty?

Fold certainty

For a real ncRNA sequence:
- Base pairs contributing to the real fold should not be everywhere.
- The overall strength of base pairs contributing to other, false folds should be weak.

For a random sequence:
- either it does not fold
- or there is a low probability to form a certain fold

Fold certainty (cont)

Compute Shannon entropy

En(S) = - Σ_{i<j} Pij log Pij

where Pij is the probability for bases i and j to pair

Pij = (number of folds in which pair (i,j) occurs) / (total number of folds) [simplified]
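The simplified definition of Pij can be turned into a few lines of code once a set of candidate folds is available. The Python sketch below assumes each fold is given as a list of (i, j) base pairs and that the entropy takes the usual Shannon form with a leading minus sign and log base 2. How the folds are sampled is outside the sketch, and both function names and both toy fold sets are illustrative.

```python
import math
from collections import Counter

def pair_probabilities(folds):
    """Pij = (number of folds containing pair (i, j)) / (number of folds)."""
    counts = Counter(pair for fold in folds for pair in set(fold))
    return {pair: c / len(folds) for pair, c in counts.items()}

def fold_entropy(folds):
    """Shannon entropy (bits) of the base-pair probabilities over the folds."""
    return -sum(p * math.log2(p) for p in pair_probabilities(folds).values())

# Every candidate fold agrees: high certainty, zero entropy.
certain = [[(0, 9), (1, 8)]] * 4
# Every candidate fold uses a different pair: low certainty.
uncertain = [[(0, 9)], [(1, 8)], [(2, 7)], [(0, 5)]]

print(fold_entropy(certain), fold_entropy(uncertain))
```

A sequence whose candidate folds all share the same pairs scores 0, and the score grows as the pairs spread over competing folds.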

Fold certainty (cont)

- We measured the entropy Z-score of a real ncRNA based on the entropies of its random counterparts
- But the entropy Z-score performs differently on different ncRNAs
- miRNAs perform well while tRNAs do not
- What happened?
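The Z-score compares the entropy of the real sequence with the entropy distribution of its random counterparts. The notes do not say how the counterparts were generated, so the Python sketch below assumes simple mononucleotide shuffling, which preserves base composition. The entropy values themselves would come from a folding step that is not shown, and the function names are illustrative.

```python
import random
import statistics

def shuffled_copies(seq, n_copies, seed=0):
    """Random counterparts with the same base composition (assumed shuffling scheme)."""
    rng = random.Random(seed)
    copies = []
    for _ in range(n_copies):
        bases = list(seq)
        rng.shuffle(bases)
        copies.append("".join(bases))
    return copies

def z_score(real_value, random_values):
    """Standard deviations separating real_value from the mean of random_values."""
    mean = statistics.mean(random_values)
    sd = statistics.stdev(random_values)   # needs at least two values
    return (real_value - mean) / sd        # fails if all random values are equal

# Illustrative numbers only: the real entropy is lower than every shuffled one.
print(z_score(1.0, [2.0, 3.0, 4.0]))
print(shuffled_copies("GGGAAAUCCC", 2))
```

A strongly negative Z-score means the real sequence folds with more certainty than shuffled sequences of the same composition.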

Readings and projects in RNA informatics

www.uga.edu/RNA-Informatics/Readings

www.uga.edu/RNA-Informatics/Projects

www.uga.edu/RNA-Informatics/Projects/project-details.html

What about pseudoknots?

Tree decomposition based search algorithms

- Dynamic programming at the nucleotide level is time consuming
- Very slow: O(n^6 N) time even for restricted pseudoknot categories
- Pseudoknots are not very complex from a graph-theoretic point of view

Tree decomposition based search algorithms (cont)

- profile each stem with an SCFG, connecting the two halves with an edge
- profile each loop with an HMM, connecting the two ends of the loop with a directed edge
- produce a mixed graph H for the structure

Preprocess the target sequence with the profiles to obtain all potential candidates, and construct a graph G for the sequence

Structure-sequence alignment corresponds to finding an optimal subgraph in G isomorphic to H

Tree decomposition based search algorithms (cont)

H is decomposed into a tree representation

A fast alignment algorithm can be obtained: O(k^t N n), where t is the tree width of H, usually small for pseudoknotted RNAs, and k is a parameter, also small

Successful for RNA structures that belong to a well-defined family

Sequence-structure alignment

1. Construct graphs (structure graph and sequence graph)
2. Tree decompose the structure graph
3. Dynamic programming based on tree decomposition

[Figure: structure -> structure graph -> tree decomposition; sequence -> sequence graph; alignment as subgraph isomorphism]

[Figure: structure graph with source s, sink t, and stem halves a, b, c, d; loops profiled with Hidden Markov Models (HMM), stems with Covariance Models (CM)]

For each stem, identify k candidates in the sequence

[Figure: structure graph and genome sequence; candidates a1, a2, then b1, b2, are added to the sequence graph]

[Figure: sequence graph over the genome sequence with candidates a1, a2, b1, b2, c1, c2, d1, d2]

Subgraph isomorphism

Sequence-structure alignment becomes subgraph isomorphism

[Figure: structure graph and sequence graph, k=2]

Tree decomposition of structure graph

(1) Pseudoknot-free structure graphs have tree width = 2

Tree decomposition of structure graph (cont'd)

(1) Pseudoknot-free structure graphs have tree width = 2
(2) Almost all pseudoknot structure graphs have small tree width

Tree width of tmRNA

Tree width = 5

Tree decomposition based search algorithms (cont)

[Figure: search results; HI: Haemophilus influenzae, NM: Neisseria meningitidis, SC: Saccharomyces cerevisiae, SB: Saccharomyces bayanus]

RNA structure and gene search

How to identify novel RNAs whose structure may deviate from the common structure of the family?

- make the profile accommodate novel structures (this may mean testing more potential structures)

- make the structure-sequence alignment fast enough

Now we come back to RNA gene search. The core part of RNA gene search is sequence-structure alignment. First, from the structure we want to search for in the genome, we construct a structure graph and a sequence graph. Then we tree decompose the structure graph. Finally, we find the sequence-structure alignment based on the tree. I will explain each step in detail in the following slides.

First, we construct the structure graph from the structure. Each vertex is a half stem. Following the sequence order, we connect two neighboring vertices with a directed edge for a loop, and the two halves of a stem with an undirected edge.

Then we construct the sequence graph. On the target genome sequence, we choose k candidates for each vertex. In this example, k is two.

We do the same thing for the other stems.

Finally, we get all possible structures in the sequence, so we get a sequence graph.

So now we have both the structure graph and the sequence graph. To find the optimal sequence-structure alignment is to find the optimal subgraph in the sequence graph that is isomorphic to the structure graph!

To get the optimal alignment, we do the second step: tree decompose the structure graph. This is an example of the graph of a pseudoknot-free structure. For pseudoknot-free graph representations, the tree width is always 2.

If we add a crossing stem to it, the tree width increases only by 1. For all RNA structures with pseudoknots, the tree widths increase only slightly.
