gutell 108.jmb.2009.391.769

Correlation of RNA Secondary Structure Statistics with Thermodynamic Stability and Applications to Folding

Johnny C. Wu 1, David P. Gardner 2, Stuart Ozer 3, Robin R. Gutell 2 and Pengyu Ren 1

1 Department of Biomedical Engineering, University of Texas at Austin, Austin, Texas 78712-1062, USA
2 Center for Computational Biology and Bioinformatics, Section of Integrative Biology in the School of Biological Sciences, and the Institute for Cellular and Molecular Biology, University of Texas at Austin, 2401 Speedway, Austin, TX 78712, USA
3 Microsoft Corporation, Redmond, WA 98052, USA

Received 16 March 2009; received in revised form 5 June 2009; accepted 12 June 2009. Available online 18 June 2009. Edited by D. E. Draper.

Wu J.C., Gardner D.P., Ozer S., Gutell R.R. and Ren P. (2009). Correlation of RNA Secondary Structure Statistics with Thermodynamic Stability and Applications to Folding. Journal of Molecular Biology, 391(4):769-783.

doi:10.1016/j.jmb.2009.06.036. J. Mol. Biol. (2009) 391, 769-783. Available online at www.sciencedirect.com. 0022-2836/$ - see front matter. Published by Elsevier Ltd.

*Corresponding authors. E-mail addresses: [email protected]; [email protected].

Abbreviations used: rCAD, RNA Comparative Analysis Database; BP, base pair; BP-ST, base-pair stack; SE, statistical energy; HF, hairpin flank; IL, internal loop.

Keywords: statistical potentials; RNA folding; thermodynamic stability; comparative analysis

The accurate prediction of the secondary and tertiary structure of an RNA with different folding algorithms is dependent on several factors, including the energy functions. However, an RNA higher-order structure cannot be predicted accurately from its sequence based on a limited set of energy parameters. The inter- and intramolecular forces between this RNA and other small molecules and macromolecules, in addition to other factors in the cell such as pH, ionic strength, and temperature, influence the complex dynamics associated with transition of a single stranded RNA to its secondary and tertiary structure. Since all of the factors that affect the formation of an RNA's 3D structure cannot be determined experimentally, statistically derived potential energy has been used in the prediction of protein structure. In the current work, we evaluate the statistical free energy of various secondary structure motifs, including base-pair stacks, hairpin loops, and internal loops, using their statistical frequency obtained from the comparative analysis of more than 50,000 RNA sequences stored in the RNA Comparative Analysis Database (rCAD) at the Comparative RNA Web (CRW) Site. Statistical energy was computed from the structural statistics for several datasets. While the statistical energy for a base-pair stack correlates with experimentally derived free energy values, suggesting a Boltzmann-like distribution, variation is observed between different molecules and their location on the phylogenetic tree of life. Our statistical energy values calculated for several structural elements were utilized in the Mfold RNA-folding algorithm. The combined statistical energy values for base-pair stacks, hairpins and internal loop flanks result in a significant improvement in the accuracy of secondary structure prediction; the hairpin flanks contribute the most.

Introduction

It has been appreciated, since canonical base pairs were first identified and arranged antiparallel and consecutive with one another to form regular helices, that G:C, A:U, and G:U base pairs are stabilized by hydrogen bonding and base stacking. With experimental calorimetric measurements of simple oligonucleotides that base-pair with one another, the majority of the known free energy values were determined for consecutive neighbor-joining base pairs by Turner and his collaborators [1]. While it is appreciated that base-pair stacks (BP-STs; Fig. 1a) make a significant contribution to the stability of an RNA structure, the relative contribution of base stacking and the hydrogen bonding that forms base pairs and other interactions in RNA structure to the overall stability of an RNA structure is not well understood [2,3]. While the full extent of the types of RNA structural elements and helices have not been identified and characterized, the energetic contribution for only a small percentage of the characterized
RNA structural elements have been determined. It is known that approximately 66% of the nucleotides in larger RNAs like the 16 S and 23 S rRNA form regular secondary structure helices [4]. The remaining third of the nucleotides are involved in more complex secondary and tertiary structures [5]. A partial list of these includes: U-turns [6], lone pair tri-loops [7], a very high percentage of unpaired As in the secondary structure [8] that are involved in several motifs, including A minor motifs [9,10], E and E-like motifs [8], UAA/GAN internal loop motif [11], GNRA tetraloops [11,12], and a high percentage of A:A and A:G juxtapositions at the ends of regular helices [13]. The majority of the base pairs in these structural motifs form unusual non-canonical base pair types and conformations [14,15,16].

With incomplete knowledge about all of the possible RNA structural motifs and their energetic stabilities in different structural environments, an alternative approach to the simplified energy model dominated by BP-STs in regular secondary structure helices is needed. An analysis of the population of 750 optimal and sub-optimal structures for two different 16 S rRNA sequences revealed that the accuracy of the most stable (or optimal) predicted structures with Mfold is higher when the majority of the predicted structures with a minimal variation among themselves have a minimal difference in their total ΔG [17]. These results are consistent with those reported by Ding et al., who performed cluster analysis on a distribution of folded structures to improve the accuracy of prediction [18].

While experimental approaches have been essential to our understanding of macromolecular structure and their energetic stability, it is not feasible to determine the energetic stability for all possible structural motifs.
In contrast, an analysis of high-resolution crystal structures in parallel with statistical analysis of different sets of comparative macromolecular sequences that form identical or very similar secondary and tertiary structure has been utilized to determine knowledge-based potentials or scoring functions. This approach has been used frequently in the prediction of protein structure [19,20] following the work by Scheraga [21]. An important assumption involved in the conversion of structural statistics into (pseudo) free energy is the Boltzmann-like distribution of the structure, which has been substantiated for proteins [22,23].

Stochastic grammar-based models [24,25,26,27] have been used as a nonphysical-based method to model RNA. Recently, Do et al. developed a method that maximizes the expectation of the objective function related to the accuracy of the prediction [28].

Although this knowledge-based approach has been applied to RNA only recently, there has been an increased effort to apply statistically derived potentials to the prediction of RNA structure. A few years ago, Dima and co-workers [29] extracted RNA structural statistics from experimentally determined high-resolution structures from the Protein Data Bank (PDB) [30] to attain PDB-derived potentials for consecutive base pairs in helices. The statistics of the tertiary interactions in the 3D structures of RNAs in PDB were not determined due to the complexity and uncertainty of the sets of nucleotides to study. They determined, to a first approximation, that the structural potentials derived from the base pairings in the secondary structure of experimentally determined 3D structures are similar to the energy values determined experimentally by Turner and collaborators [29].

Recently, Das et al.
developed a Rosetta-like scheme to predict the tertiary structure of small RNA sequences of 30 nucleotides [31]. In their scheme, a statistical potential is inferred from the distance and angle distributions of base pairs in the ribosome crystal structure, following the method described by Sykes and Levitt [32]. Parisien et al. predicted the secondary and tertiary structures of short RNA molecules with the statistics of nucleotide cyclic motifs using the dynamic programming Waterman-Byers algorithm [33]. Recently, Jonikas et al. developed a coarse-grain nucleotide-based model to predict structure effectively [34].

Since the first few tRNA sequences were determined, it has been appreciated that different RNA sequences with similar function can form similar secondary and tertiary structure [4,35]. Comparative analysis of tRNA sequences revealed the classic cloverleaf secondary structure [36] and parts of the 3D structure [37]. Comparative analysis has revealed other RNA secondary structures that are each similar to different sets of RNA sequences with similar functions. This list includes 5 S rRNA [38], 16 S rRNA [39,40,41], 23 S rRNA [42,43,44], RNase P [45], and group I and II introns [46,47,48]. Covariation and other comparative analysis have the potential to be extremely accurate when there are a sufficient number and diversity of properly aligned sequences. For the rRNA, 97-98% of the base pairs predicted with covariation analysis were present in the high-resolution crystal structure of the ribosome [5].

Fig. 1. Illustrations of four secondary structures. (a) An example of a base-pair stack that is denoted UA/CG. (b) An example of a base-pair stack that is denoted CG/UA. (c) An example of a hairpin flank that is denoted GC/CA. (d) An example of an internal loop that is denoted GC/CG/AG/AA.

Due to the emerging technologies that determine nucleic acid sequences rapidly for entire genomes, we are obtaining nucleotide sequences for an increasing number of RNA families with significant increases in the number of sequences per family. To facilitate the analysis of these increasing comparative RNA datasets, we have collaborated with Microsoft Research to develop an RNA Comparative Analysis Database (rCAD) that cross-indexes sequence, structure, and phylogenetic information (S. O. et al., unpublished results). In this work, we utilized rCAD to obtain structural statistics from the comparative analysis of these sequences, and derived statistical energy values that can be used for RNA structure prediction. We demonstrate that structural motifs beyond base-pair stacking in helices are also important in determining the RNA structure. The statistical energy derived from sequence information has the potential to improve the accuracy of RNA secondary structure prediction significantly.

Results

Correlation of base-pair stack (BP-ST) statistical energy (SE) and experimental thermodynamic stability

With comparative structural information, we have examined the base-pair stack statistics and their correlation with thermodynamic stability extracted from RNA duplex melting experiments [49]. Since statistically derived energies require a reference value to set the absolute scale, we normalized the lowest BP-ST SE to the lowest BP-ST experimental value (see Materials and Methods).
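The conversion from motif frequencies to statistical energy is defined in Materials and Methods, which is not reproduced in this section. As a rough sketch only, the following assumes plain Boltzmann inversion (SE proportional to the negative logarithm of the observed frequency) at 37 °C and treats the normalization as an additive shift that aligns the lowest SE with the lowest experimental value. The function name, the motif counts, the temperature, and the additive form of the shift are all assumptions made for illustration, not the authors' implementation.

```python
import math

R_KCAL = 1.987e-3   # gas constant in kcal/(mol*K)
T_KELVIN = 310.15   # 37 C; assumed temperature


def statistical_energy(counts, lowest_experimental):
    """Boltzmann-invert motif counts, then shift so min(SE) == lowest_experimental.

    counts: dict mapping a motif label to a positive observed count.
    The additive shift is an assumption; the paper's own normalization may differ.
    """
    total = sum(counts.values())
    rt = R_KCAL * T_KELVIN
    raw = {motif: -rt * math.log(n / total) for motif, n in counts.items()}
    shift = lowest_experimental - min(raw.values())
    return {motif: energy + shift for motif, energy in raw.items()}


# Hypothetical counts, not taken from rCAD.
counts = {"GC/CG": 900, "UA/CG": 300, "UG/GU": 30}
se = statistical_energy(counts, lowest_experimental=-3.4)
for motif, energy in sorted(se.items(), key=lambda kv: kv[1]):
    print(f"{motif}: {energy:.2f} kcal/mol")
```

Under this assumption, relative energies depend only on frequency ratios: a motif observed three times less often than another lies RT ln 3 higher in energy, and the shift changes only the absolute scale.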
We have computed molecule-specific BP-ST SE for tRNA (for amino acids A, D, E, G, I, L, M, F), eukaryotic 5 S rRNA, bacterial 5 S rRNA, and bacterial 16 S rRNA. A BP-ST SE was also determined by combining all the sequences from the four datasets (Supplementary Data tables). The statistical energy is plotted against the experimental data in Fig. 2. The correlation coefficient and standard deviation of difference between SE and experimental values for each BP-ST are given in Table 1. The highest correlation coefficient (0.870) between the statistical and experimental free energy is found for bacterial 16 S rRNA, followed closely by eukaryotic 5 S rRNA (0.857) and tRNA (0.833). The statistical energy for bacterial 5 S rRNA is the least correlated with experimental stability (0.477). More than 80% of the nucleotides in the all-sequence dataset are from bacterial 16 S rRNA; thus, the calculated BP-ST SE is biased by the bacterial 16 S rRNA (Table 1). As a result, the correlation of the all-sequence base-pair stack SE with experimental stability is also high (0.862). The standard deviations for all-sequence base-pair stack SE, bacterial 16 S rRNA, bacterial 5 S rRNA, eukaryotic 5 S rRNA, and tRNA are 0.525, 0.507, 1.194, 0.677, and 0.988, respectively. In comparison, the PDB-derived BP-ST energies have a correlation coefficient of 0.7764 with experimental stability (Fig. 2f) and a standard deviation of 0.8138 [29]. (All structural statistics extracted from each molecular dataset and free energy data used in Fig. 2 are available in the Supplementary Data Tables S.1-10.)

Fig. 2. (a) Base-pair stack statistical energy (SE) derived from all-sequence dataset. (b) Bacterial 16 S rRNA. (c) Bacterial 5 S rRNA. (d) Eukaryotic 5 S rRNA. (e) tRNA versus free energy obtained experimentally for a given base-pair stack. (f) PDB-derived statistical potentials versus free energy obtained experimentally [29]. Experimental values are in kcal/mol.

For comparison with experimental BP-ST free energy we have used symmetric SE for correlation calculations. Hence, 5' and 3' directionality are equivalent, e.g. GC/CG and CG/GC are considered to be the same structure. However, asymmetric SE, in which 5' and 3' directions are differentiated, will be used to study the folding applications of BP-ST SE. Overall, the statistics of BP-ST frequency show a good correlation with experiment. However, a few BP-STs differ more significantly. One of them involves consecutive UG base pairs. The UG/GU BP-ST SE from the bacterial 16 S rRNA (-1.0 kcal/mol) and all molecules (-0.82 kcal/mol) datasets (Supplementary Data Tables S.4 and S.5) are much lower than that used by Mfold (-0.5 kcal/mol).
However, the value for that BP-ST in Mfold has been adjusted from the experimental value of 0.47 kcal/mol [50]. Additionally, the experimental stability of UG/GU has the largest error (0.96 kcal/mol) among all the BP-STs, since only one duplex containing the UG/GU was measured, which was insufficient for linear regression [50]. The statistical potentials for the UG/GU BP-ST derived from PDB structures also varied significantly from the experimental energy values [29].

The largest discrepancy between the all-sequence SE and experimental stability is for the UG/UG (-1.23 kcal/mol) and GU/GU (0.12 kcal/mol) BP-STs (Supplementary Data Table S.5). The corresponding experimental energy for UG/UG is 0.30 kcal/mol and for GU/GU is 1.30 kcal/mol (Supplementary Data Table S.12). As noted earlier, the statistical energies for these are significantly different from the experimental data [50]. The UA/AU BP-ST statistical energies (Supplementary Data Tables S.1-S.5) are all consistently below -2.3 kcal/mol and differ from the experimental value by more than 1 kcal/mol. This indicates that UA/AU BP-STs occur more frequently in these RNA molecules than suggested by the stability derived from duplex melting experiments. The GC/GC and GC/CG BP-STs determined from calorimetric experiments are the most stable structures, with the stability of the former being slightly lower than the latter by 0.1 kcal/mol. However, the lowest (i.e., most stable) BP-ST SE is GC/CG for the all-sequence, bacterial 16 S rRNA, and bacterial 5 S rRNA datasets. The difference between the SE of the GC/CG BP-ST for each aforementioned dataset (Supplementary Data Tables S.3-S.5) and the corresponding experimental value (Supplementary Data Table S.12) is 0.9 kcal/mol on average. The GC/GC BP-ST SEs do not immediately follow GC/CG in terms of stability.
The PDB-derived potential [29] GC/CG BP-ST minimum agrees with that of our BP-ST SE.

Overall, BP-ST SEs derived from comparative sequence analysis show a better correlation with experimental free energy [1] than the PDB-derived potentials. However, noticeable differences exist between SEs of different phylogenetic domains, which concurs with experimental observations. For example, although the loop E of Escherichia coli bacterial 5 S rRNA and that of Spinacia oleracea eukaryotic 5 S rRNA are found to be isosteric and are able to bind to the same L25 protein, they show substantial differences of stability in various ionic conditions [51]. In addition, sequences of RNA molecules have been confirmed by Kiparisov et al. to be highly specific and optimized through evolution, as only seven alleles in Saccharomyces cerevisiae 5 S rRNA are found to be viable [52]. The high degree of correlation indicates that the distribution of BP-STs in three out of the four molecule-specific datasets or, when combined, is Boltzmann-like. The agreement is rather remarkable, especially given that the experiments were performed on isolated duplexes while the statistics were collected from biological sequences, which also represent tertiary contacts and many other factors in the cell. For example, structural studies have supported that hairpin loops can be stabilized by tertiary interactions that are not considered in the oligomer experiments [13,53,54].

Table 1. Statistical energy derived from sequences of each specified molecule

Dataset                No. nucleotides   No. sequences   Corr coef   SD
Bacterial 16 S rRNA    1,468,052         319             0.870       0.507
Bacterial 5 S rRNA     34,443            96              0.477       1.194
Eukaryotic 5 S rRNA    95,778            263             0.857       0.677
tRNA                   148,248           650             0.833       0.988
All sequence           1,746,521         1328            0.862       0.525
PDB-derived [29]       7424 (a)          N/A             0.7764      0.8138

For statistical energies derived from the sequences of each specified molecule, the number of nucleotides in base-pair stacks, correlation coefficient and standard deviation compared with experimental base-pair stack energies are presented [50]. The last row lists equivalent information for PDB-derived statistical potentials.
(a) Number of bases in stacks.
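The two summary statistics reported in Table 1 can be illustrated with a short calculation. The sketch below computes a Pearson correlation coefficient and the standard deviation of the differences for five (SE, experimental) pairs taken from the BP-ST SE and experimental columns of Table 2. The choice of five rows is purely illustrative, and the use of the sample (n - 1) standard deviation is an assumption, since the text does not state which estimator was used.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def sd_of_difference(xs, ys):
    """Sample standard deviation of the pairwise differences x - y (assumed estimator)."""
    diffs = [x - y for x, y in zip(xs, ys)]
    mean_d = sum(diffs) / len(diffs)
    return math.sqrt(sum((d - mean_d) ** 2 for d in diffs) / (len(diffs) - 1))


# Rows CG/CG, GC/CG, GC/GC, UG/CG and AU/AU of Table 2, in kcal/mol.
se_values = [-2.43, -3.40, -2.70, -1.76, -1.65]
exp_values = [-2.40, -3.30, -3.40, -1.40, -1.10]

r = pearson(se_values, exp_values)
sd = sd_of_difference(se_values, exp_values)
print(f"correlation = {r:.3f}, SD of difference = {sd:.3f} kcal/mol")
```

Because only five rows are used, the printed numbers are not expected to match the all-sequence entry of Table 1.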
Decomposition of pairing and stacking

With the statistical frequency from comparative structures, we also modeled the individual contributions of the base pairing (BP) and stacking (ST) from the BP-ST energy by fitting the decomposed energy to experimental BP-ST energies. As described earlier, SE for individual (Watson-Crick, G-U) base pairs and stacks are computed from statistical frequency with Eqs. (4) and (5), respectively. A total of 16 different types of stacks can form from the six acceptable types of base pair. The total energy is then expressed as a function of the individual pair and stack contributions in Eq. (6). Two assumptions are made in this decomposition model: (1) the contributions of the base pair and stack are independent and can be separated; and (2) the total energy of a BP-ST is obtained from the sum of the decomposed base pair and stack. Although the assumptions may be over-simplified, this examination reveals information on the relationship of the two interactions. The two coefficients in Eq. (6) characterize the relative contribution of each term. By fitting to the 21 BP-STs simultaneously using linear least-squares regression, we determined 2.3893 and 0.6189 for the two coefficients, respectively, and 0.4088 for C. With this decomposition model, we predict the energy for each of the 16 canonical BP-STs. As shown in Table 2, the calculated values are in good agreement with experimental energy values, with a correlation coefficient of 0.882 and an RMS error of 0.51 kcal/mol. Since we fit for C, the decomposed base-pair and stack SE reflect only their relative contributions. Accordingly, positive energetic values for stacks indicate only that the BP-ST containing that stack has a higher energy than the undecomposed value, C. (The decomposed statistics and corresponding energy for each base pair and stack, and the total energy computed from the decomposition model, are available in Supplementary Data Tables S.13 and S.14.)
Although almost all adjacent nucleotides that are involved in base pairing are stacked, many adjacent nucleotides in other structures, such as hairpin loops and nonconsecutive nucleotides that flank coaxial stacked helices, are stacked as well. We also attempted to count all adjacent nucleotides in a sequence as stacks, which yielded similar results.

Since no experimental data have been determined for the individual energy contribution for base pairs or base stacks in isolation from one another, no reference value is available to establish the absolute scale. Thus, the constant C, which is intended to be the intrinsic or sequence-independent stability factor of a BP-ST, was added to the decomposition model. Hence, the energy value of every BP-ST includes the value of C in addition to the sequence-specific term. Although no conclusion is drawn from the absolute contribution of pairing or stacking, our model elucidates the relative stability between different BP-STs.

By comparing the individual contributions from pairing and stacking with the BP-ST SE, it is evident that the relative stability of BP-STs is dominated by the sequences in pairing. Figure 3 reveals a strong correlation (0.8582) between the Watson-Crick/GU base-pair SE and the BP-ST SE determined using all sequences. By contrast, the stacking SE shows almost no correlation with BP-ST SE (correlation coefficient of 0.1790). The statistics collected in the current study show clearly that pairing is responsible for the variation of stabilities of base-pair stacking. It is worth emphasizing that this finding does not dispute the importance of the base stack, nor that it provides more stability than the base pair [2,55,56].

Table 2. Base-pair stack statistics

BP 1   BP 2   Decomp. BP SE   Decomp. ST SE   Decomp. BP + ST SE   BP-ST SE   Exp. BP-ST energy   Difference (exp. - decomp.)
CG     CG     -3.1814         -0.2416         -3.01                -2.43      -2.40                0.61
GC     CG     -3.2819         -0.2444         -3.12                -3.40      -3.30               -0.18
GC     GC     -3.3824         -0.0578         -3.03                -2.70      -3.40               -0.37
GU     CG     -1.7603         -0.1206         -1.47                -2.32      -2.10               -0.63
GU     GC     -1.8608         -0.1169         -1.57                -2.35      -2.50               -0.93
GU     GU     -0.3392         -0.1761         -0.11                 0.14       1.30                1.41
UG     CG     -1.724          -0.2009         -1.52                -1.76      -1.40                0.12
UG     GC     -1.8245         -0.2202         -1.64                -1.93      -1.50                0.14
UG     GU     -0.3029         -0.0461          0.06                -0.83      -0.50               -0.56
UG     UG     -0.2667         -0.1602         -0.02                -1.22       0.30                0.32
AU     CG     -2.6577          0.1496         -2.10                -3.19      -2.10                0.00
AU     GC     -2.7583         -0.0203         -2.37                -3.09      -2.20                0.17
AU     GU     -1.2367          0.1435         -0.68                -2.03      -1.40               -0.72
AU     UG     -1.2004          0.2242         -0.57                -1.33      -0.60               -0.03
AU     AU     -2.1341          0.4631         -1.26                -1.65      -1.10                0.16
UA     CG     -2.8042          0.0226         -2.37                -3.19      -2.10                0.27
UA     GC     -2.9048         -0.0084         -2.50                -2.91      -2.40                0.10
UA     GU     -1.3832          0.1658         -0.81                -1.62      -1.30               -0.49
UA     UG     -1.3469          0.1261         -0.81                -1.07      -1.00               -0.19
UA     AU     -2.2806          0.4795         -1.39                -2.53      -0.90                0.49
UA     UA     -2.4271          0.4124         -1.61                -2.04      -1.30                0.31

For each base-pair stack, the decomposed base-pair SE, decomposed stacking SE, base-pair stack energy computed from the decomposition model (Eq. (6)), base-pair stack SE, experimental base-pair stack energy (kcal/mol), and the difference between the experimental base-pair stack energy and the sum of decomposed base-pair and stack SEs (experimental minus decomposed) are presented. SE are in units of kcal/mol and are derived from statistics using all sequences.
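As a consistency check on Table 2, the sketch below adds the two decomposed columns to the fitted constant C = 0.4088 and compares the result with the experimental column. In this reading, which is an inference from the tabulated numbers rather than something stated in the text, the decomposed columns already include the fitted coefficients, so the model energy is simply their sum plus C. The variable and function names are illustrative.

```python
import math

C = 0.4088  # fitted constant reported in the text, kcal/mol

# (BP 1, BP 2, decomposed BP SE, decomposed ST SE, experimental energy), from Table 2.
ROWS = [
    ("CG", "CG", -3.1814, -0.2416, -2.40),
    ("GC", "CG", -3.2819, -0.2444, -3.30),
    ("GC", "GC", -3.3824, -0.0578, -3.40),
    ("GU", "CG", -1.7603, -0.1206, -2.10),
    ("GU", "GC", -1.8608, -0.1169, -2.50),
    ("GU", "GU", -0.3392, -0.1761, 1.30),
    ("UG", "CG", -1.7240, -0.2009, -1.40),
    ("UG", "GC", -1.8245, -0.2202, -1.50),
    ("UG", "GU", -0.3029, -0.0461, -0.50),
    ("UG", "UG", -0.2667, -0.1602, 0.30),
    ("AU", "CG", -2.6577, 0.1496, -2.10),
    ("AU", "GC", -2.7583, -0.0203, -2.20),
    ("AU", "GU", -1.2367, 0.1435, -1.40),
    ("AU", "UG", -1.2004, 0.2242, -0.60),
    ("AU", "AU", -2.1341, 0.4631, -1.10),
    ("UA", "CG", -2.8042, 0.0226, -2.10),
    ("UA", "GC", -2.9048, -0.0084, -2.40),
    ("UA", "GU", -1.3832, 0.1658, -1.30),
    ("UA", "UG", -1.3469, 0.1261, -1.00),
    ("UA", "AU", -2.2806, 0.4795, -0.90),
    ("UA", "UA", -2.4271, 0.4124, -1.30),
]


def model_energy(bp_se, st_se, c=C):
    """Decomposition-model energy, assuming the columns are already scaled."""
    return bp_se + st_se + c


# Experimental minus model, matching the sign of the last column of Table 2.
diffs = [exp - model_energy(bp, st) for _, _, bp, st, exp in ROWS]
rms = math.sqrt(sum(d * d for d in diffs) / len(diffs))
print(f"RMS difference over {len(ROWS)} BP-STs: {rms:.2f} kcal/mol")
```

For the first row, the sum is -3.1814 - 0.2416 + 0.4088 = -3.0142, which rounds to the tabulated -3.01, and the RMS over all 21 rows comes out close to the 0.51 kcal/mol quoted in the text.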
Specifically, suppose that ΔGa and ΔGb are the free energies of two BP-STs. It is possible for the absolute energy of each structure to be stabilized primarily by base stacking. However, ΔΔGba, the relative stability of the BP-STs, could be largely due to the difference between the base-pair energies. Hence, it is likely that stacking is the driving force for helix formation, yet such interactions are not as specific as the complementarities from hydrogen bonding.

Application of statistical energy to folding

We investigated the ability of the statistically derived energies of BP-ST, hairpin flanks, and internal loops to improve the prediction of RNA secondary structure. We incorporated our statistical energy values in place of those energy parameters determined experimentally in the Mfold program. SE derived from the molecule-specific and the all-sequence datasets are utilized within Mfold to predict the secondary structure for tRNA, eukaryotic 5 S rRNA, bacterial 5 S rRNA, and bacterial 16 S rRNA. Similar to previous studies [50], the folding accuracy is evaluated by comparing base pairs that are predicted accurately by Mfold with base pairs determined by comparative sequence analysis (see details in Materials and Methods) [5]. Although Mfold determines the optimal (most stable) structure and a set of sub-optimal secondary structure models, the structure with the lowest energy was used in our analysis for folding accuracy. Furthermore, the experimental energy values were obtained from oligonucleotide duplexes where BP-STs have no directionality. However, the anisotropic nature of a BP-ST is apparent when, for example, it is adjacent to a hairpin loop. Thus, Eq. (1) and its corresponding modifications were used to evaluate symmetric and asymmetric BP-ST energies for folding accuracy.
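The precise accuracy measure is given in Materials and Methods, which is not part of this section. As a minimal sketch, assuming that accuracy is the fraction of comparatively determined base pairs that also appear in the predicted structure, and assuming that both structures are supplied in dot-bracket notation without pseudoknots (a representation chosen here for illustration only), the comparison could look as follows. The function names and the example structures are hypothetical.

```python
def pairs_from_dot_bracket(structure):
    """Return the set of (i, j) base-pair index tuples in a dot-bracket string."""
    open_positions = []
    pairs = set()
    for index, symbol in enumerate(structure):
        if symbol == "(":
            open_positions.append(index)
        elif symbol == ")":
            pairs.add((open_positions.pop(), index))
    return pairs


def folding_accuracy(predicted, reference):
    """Fraction of reference base pairs that are present in the prediction."""
    reference_pairs = pairs_from_dot_bracket(reference)
    if not reference_pairs:
        return 0.0
    predicted_pairs = pairs_from_dot_bracket(predicted)
    return len(reference_pairs & predicted_pairs) / len(reference_pairs)


# Hypothetical 11-nucleotide hairpin: the prediction misses the innermost pair.
reference = "((((...))))"
predicted = "(((.....)))"
print(folding_accuracy(predicted, reference))  # 0.75
```

This measure counts only the reference pairs that are recovered; it does not penalize extra predicted pairs, which a stricter definition might.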
As shown inSupplementary Data Tables S.6 S.10, asymmetricSE offered slightly better folding results overall andis used in the folding evaluation.Figure 4 provides the accuracy of folding tRNA,eukaryotic 5 S rRNA, bacterial 5 S rRNA, bacterial16 S rRNA, using BP-ST SE derived from eachmolecular and phylogenetic dataset and the all-sequence dataset. The accuracy in secondary struc-ture prediction was either improved or remained thesame when the SE derived from each dataset wasFig. 4. Each group of bars represents folding accuracy of (from left to right) tRNA, eukaryotic 5 S rRNA, bacterial 5 SrRNA, and bacterial 16 S rRNA. Within each group, each bar represents (from left to right) unmodified Mfold, base-pairstack SE derived using tRNA, eukaryotic 5 S rRNA, bacterial 5 S rRNA, bacterial 16 S rRNA, and all-sequence dataset.Fig. 3. (a) Base-pair statistical energy versus base-pair stack statistical energy with a correlation coefficient of 0.8582.(b) Stacking statistical energy versus base-pair stack statistical energy with a correlation coefficient of 0.1790.774 RNA Secondary Structure
  • 7. applied to the sequences within the same dataset. Forexample, when tRNA-specific SEs were used in placeof the experimental BP-ST energies to fold tRNAsequences, the accuracy improved from 0.70 to 0.79.The improvement of prediction accuracy is moresignificant for bacterial 5 S rRNA; 0.74 versus 0.63 forSE and experimentally derived energy values.However, for other datasets (e.g., bacterial 16 SrRNA and eukaryotic 5 S rRNA) the predictionaccuracy was similar between the molecule-specificSEs and the experimentally derived energy values.The prediction accuracy usually decreased when theBP-ST SE determined from one molecule/phyloge-netic dataset was utilized to predict the secondarystructure for another dataset. And last, the predictionaccuracy for the SE derived from all-sequence datasetor the bacterial 16 S rRNAwas similar to the accuracygiven by the experimentally determined energyvalues. These results indicated that, the statistics ofbase-pair stacking in these two datasets is indeedBoltzmann-like, and the statistically derived energyis as reliable as the experimental measurement.Extension of statistical energy to hairpin flanksand internal loopsOur analysis has shown some improvement in theprediction of RNA secondary structure helices fromthe structural statistics of consecutive base pairswithin a helix. About 66% of the nucleotides in anRNA structure predicted with covariation analysisform a base pair, and the vast majority of these areG:C, A:U, and G:U pairings that occur within aregular helix. The remaining third of the RNAsecondary structure form hairpin, internal, and mul-tistem loops. 
However, an analysis of the 3D structure and our growing knowledge about the variety of structural motifs that occur in RNA structures provide a foundation to improve the accuracy in the prediction of secondary and even tertiary structure of an RNA.

The majority of these unpaired nucleotides in the secondary structure are base-paired with non-canonical pairing types and conformation [14]. Nearly all of the nucleotides in the rRNA high-resolution structure are stacked onto another nucleotide. While the majority of these stackings are formed between adjacent nucleotides that are base-paired, a significant number of nucleotides that are stacked are not base-paired or consecutive. Towards that end, Lee J.C. & Gutell R.R. (unpublished results) have identified and characterized a set of different types of base stackings at the ends of helices that add stability to the helix and potentially protect the ends of the helix, while bridging the ends of regular helices with different structural motifs that occur in the hairpin, internal, and multistem loops in an RNA secondary structure. Many of these helix cappings are associated with numerous structural motifs that have been identified and characterized for their chemical structure and energetic properties. One example, the UAA/GAN motif [70], has several non-canonical base pairs and unpaired nucleotides that form a longer co-axial stack that bridges the two flanking helices.
In another example, the E loop and the E-like loop have been well characterized,8,58 and contain several non-canonical base pairs that form a contiguous stack onto the regular secondary structure helix.

While the current thermodynamics-based folding algorithms consider these unpaired regions of the secondary structure to be destabilizing, as stated earlier, it is suggested that base stacking contributes more to the overall stability of the RNA structure. Our longer-term objective is to determine the structural potentials associated with all of the structural motifs in the unpaired regions of the secondary structure, but here we determine only the structural potentials that can be utilized within the Mfold program.

The statistical energies for hairpin flanks (HF) and internal loops (IL) have been evaluated and utilized in the prediction of an RNA secondary structure with Mfold. HF and IL SE are computed according to Eqs. (7), (10), (12) and (13). (The SEs for HF and IL (1×1, 1×2, 2×2, and flanks) are available as Supplementary Data Tables S.16-S.20, respectively.) The minimum and maximum SE values were calibrated with those determined by experiment.50 SE is derived with comparative structures from several RNA datasets: tRNA, eukaryotic 5 S rRNA, bacterial 5 S rRNA, and bacterial 16 S rRNA. The all-sequence SE is obtained from the Boltzmann average of the individual datasets, since the occurrences of HF and IL are sparse in structures. We then replaced the energy values for HF-IL in Mfold with the corresponding SE, and set the rest to maximum experimental energy values.

We compare the SE of the motifs from the all-sequence dataset with the corresponding free energy obtained from experiment. HFs yield a poor correlation with experiment (0.5132), which was computed for the SE values in Supplementary Data Table S.16 and the experimental values reported by Mathews.50 We analyzed the differences in SE values between each of the datasets.
(The SE values are shown only for the base pair (BP)/HFs that occur within each dataset.) The arrangement of the nucleotides is described in the following example. The HF denoted GC-AC, as illustrated in Fig. 1c, is composed of a GC base pair at the end of a helix; the A is an unpaired nucleotide flanking the paired G, and the C is an unpaired nucleotide flanking the paired C. These nucleotides correspond to the first two columns of Supplementary Data Table S.16. With six types of helix-ending base pairs (GC, CG, etc.) and 16 possible unpaired flanking pairs (AA, AC, AG, AU, etc.), 96 BP/HF flanks are possible. Of the 55 observed, 27 occur in only one dataset (excluding the all-sequence dataset), 16 occur in two datasets, nine occur in three datasets, and three occur in all four datasets.

The most pronounced patterns are:

• The hairpin-flanking nucleotides associated with the largest number of helix-ending base pair types are GA, occurring with six different base pairs, followed by CA, UC and UU, each of which occurs with four different helix-ending base pairs.
• The vast majority of the most stable helix-flanking nucleotides in hairpin loops are GA, UC, UU, and CA with most types of helix-ending base pairs.
• The other very stable BP/HF nucleotide sets that do not have a GA, UC, UU, or CA hairpin flank are CG/AA, CG/AC, GC/AC, GC/AG, UA/UA, and CG/UG.

Several of these BP/HF sets are associated with known tetraloops, although not exclusively. For example, CG/UG, which has one of the most stable SE values, occurs frequently in numerous RNA molecules with the sequence C(UUCG)G; this motif has been determined to be very stable.11,59 Many, but not all, of the family of BP/HF nucleotide sets with a GA HF are associated with the GNRA tetraloop.11,12 The closing base pairs of these tetraloops were investigated experimentally and were found to be generally consistent with our SE values for the BP/HF.60 Some of the other GA BP/HFs are associated with hairpin loops with six nucleotides. Two experimental studies revealed that the G and A in the HF form a base pair with a sheared conformation.61,62

An earlier study revealed that many helices in the ribosomal RNAs have AA or AG flanking the end of the helix in the unpaired region of the secondary structure.13 This analysis revealed that all of the GA flanks associated with a hairpin loop form a base pair. Two of the three most stable BP/HF SEs have a helix ending with a CG base pair and an AA (2.81) or GA (2.8) flank (the SE for AU/UC, the most stable BP/HF, is 2.87). An experimental study revealed that the HF GA is more stable than AA for a hairpin loop with the same intervening four nucleotides.63 We anticipate that the accuracy of the prediction of an RNA secondary structure will be further enhanced once we associate the BP/HF statistical energy with the size and sequence of the hairpin loop.

The 1×1 IL-SE has a poor correlation with experiment (0.18, as computed from Supplementary Data Table S.17).
Two of the most stable BP/IL/BP 1×1 internal loops have UU in juxtaposition. The next most stable BP/IL/BP 1×1 internal loop has a GA juxtaposition. The UU juxtaposition in the 1×1 internal loops occurs with eight different sets of base pair types that flank both sides of the non-canonical set of nucleotides. Of these, five are considered more stable, with SE greater than 1.0. The GA and AG juxtapositions in the 1×1 internal loops are both associated with seven different sets of flanking base pairs. However, while six of the seven GAs have a stability greater than 1.0, only two of the AGs have an energy value greater than 1.0. Five of the six UC juxtapositions have an energy value greater than 1.0, and all three of the CU juxtapositions have an energy value greater than 1.0. Some of our SEs are similar to those determined by experiment, but others are very different. For example, the UG/U-U/AU (BP/IL/BP) loop SE of 1.59 kcal/mol is an underestimate compared to the 1.7 kcal/mol obtained from experiment.50 The AU/G-G/GC loop SE of 0.45 kcal/mol is an underestimate of the 1.4 kcal/mol obtained from experiment. No 1×1 IL (Supplementary Data Table S.17) was observed in the tRNA datasets we analyzed, whereas the eukaryotic 5 S rRNA contains six and the bacterial 5 S rRNA has eight. All of these are considered stable (energy value greater than 1.0), and all have a different set of BP/IL/BP sequences.

The number of occurrences of internal loops with one nucleotide on one side of the helix and two on the other is low. None was present in the tRNA or 5 S rRNA datasets analyzed, and seven were present in the bacterial 16 S rRNA. Of the nine 1×2 internal loops, 24 of the 27 nucleotides are purines.
The correlation between the 1×2 IL-SE and experiment (0.11) is poor; the largest underestimation of SE is 0.31 kcal/mol for the GC/A-AA/AU loop, compared with the experimental value of 3.2 kcal/mol.

A total of 25 different 2×2 internal loops with unique sets of nucleotides within the internal loop and the two base pairs that flank the internal loop occur in the bacterial 16 S rRNA dataset. These 2×2 internal loops are not in the tRNA or 5 S rRNA datasets analyzed here. Of the 25 different 2×2 internal loops, nine have a tandem GA/AG internal loop, four have the tandem AA/AG IL, four more have the GA/AA IL, and two have the AA/AA IL. In total, 19 have one of the 2×2 internal loops that form a unique structural motif family.64 The G in the GA juxtaposition can be replaced with an A and maintain the same sheared conformation in many of the tandem GA motifs. Experimental studies revealed an association between the base pairs that flank the tandem GA 2×2 internal loop.65 The 2×2 internal loop in three of the 27 BP/2×2 IL/BP is UU/UU. Tandem UU mismatches have been studied experimentally and found to be stable in some structural environments.66 The 2×2 IL-SE has a poor correlation with experiment as well (0.08). The AU/AA-AG/GC loop is underestimated, with a calculated value of -4.70 kcal/mol while the experimental value is 1.00 kcal/mol.

While the SE values determined from the nucleotide frequencies of consecutive base pairs are similar to, and sometimes slightly better than, the experimentally determined energy values for the same consecutive base pairs, the incorporation of HF and internal loop SE terms increased the accuracy of the prediction significantly (Fig. 5). Using all SEs (BP-ST, HF, and internal loop), we attain an increase in folding accuracy from 0.70 to 0.89 in tRNA, from 0.72 to 0.84 in eukaryotic 5 S rRNA, from 0.63 to 0.88 in bacterial 5 S rRNA, and from 0.49 to 0.56 in bacterial 16 S rRNA.
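The accuracy values quoted here follow the metric described in Materials and Methods: the number of predicted canonical base pairs that agree with the comparative structure, divided by the number of canonical base pairs in the comparative structure. A minimal sketch of that ratio is given below. The function name, the 0-based index convention, and the toy sequence are illustrative assumptions, not part of the original workflow.

```python
# Canonical pairs counted by the metric: G-C, A-U and G-U (order-independent).
CANONICAL = {frozenset("GC"), frozenset("AU"), frozenset("GU")}


def folding_accuracy(seq, predicted_pairs, comparative_pairs):
    """Fraction of canonical comparative base pairs present in the prediction.

    Pairs are (i, j) tuples of 0-based positions in seq (an assumed convention).
    Predicted pairs absent from the comparative structure are not penalized.
    """
    def canonical(pairs):
        return {
            (min(i, j), max(i, j))
            for i, j in pairs
            if frozenset((seq[i], seq[j])) in CANONICAL
        }

    reference = canonical(comparative_pairs)
    if not reference:
        raise ValueError("comparative structure has no canonical base pairs")
    return len(reference & canonical(predicted_pairs)) / len(reference)


# Toy hairpin (hypothetical data): three G-C pairs in the comparative structure.
toy_seq = "GGGAAACCC"
toy_comparative = [(0, 8), (1, 7), (2, 6)]
toy_predicted = [(0, 8), (1, 7), (3, 5)]  # (3, 5) is A-A and is ignored
toy_accuracy = folding_accuracy(toy_seq, toy_predicted, toy_comparative)
```

Because the denominator counts only comparative base pairs, this is a sensitivity-style measure; a stricter evaluation would also report the fraction of predicted pairs that are correct.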
The substantial improvement suggests that the structural elements, by themselves and/or coordinated with the ends of the secondary structure helix, stabilize the higher-order structure of the RNA molecule. The SEs for the hairpin flank contributed more to the improvement in the prediction of an RNA secondary structure than those for the internal loops and BP-STs. In particular, for tRNA, while the folding accuracy increased from 0.70 (for experimental energy values) to 0.89 (for all SEs), SEs for only internal loops increased the accuracy to 0.78, while SEs for only hairpin flanks increased the accuracy to 0.85. Accordingly, the folding accuracies of eukaryotic 5 S rRNA when only hairpin or only internal loop SEs are used are 0.79 and 0.74, respectively, in contrast with 0.72 (experimental) and 0.84 (for all SEs). However, SEs for HFs do not always contribute more than internal loop flanks to the accuracy of the prediction of an RNA secondary structure. The internal loop flank SE increases the accuracy of RNA folding more than hairpin flanks for bacterial 5 S rRNA. This may be due to the small sampling of the molecule. While the improvement in the folding accuracy with HF SE is not as significant for bacterial 16 S rRNA, an overall improvement does occur. This observation emphasizes the significance of HFs to RNA structures.

While internal loops within 5 S rRNA range from 1 to 8 nucleotides in length and those in 16 S rRNA range from 1 to 12 nucleotides in length, only energies for internal loops of lengths <4 nucleotides can be utilized by Mfold. As a result, the SE contributions for the longer internal loops are not used. Similarly, Mfold has energy functions based on the length and nucleotide composition of hairpins and yet cannot assign an energy contribution for specific hairpin loops with >4 nucleotides. This is particularly important, since tRNAs are composed mainly of loops of 7 and 8 nucleotides, while 5 S rRNAs are composed of hairpin loops that are >4 nucleotides as well. Hence, the inclusion of energy parameters for larger hairpin loops in the folding algorithm should increase the prediction accuracy.
Although bulge loops are known to be important to the stability of RNA, the simple model that uses only the length of the loop as input to evaluate energy does not lend itself to much improvement when statistical methods are applied. Additional development of Mfold would be needed to incorporate SEs of bulge loops. Furthermore, many unpaired nucleotides in RNA secondary structure participate in tertiary interactions that further stabilize the structure.67-70 Information about such interactions may be needed to determine loop energies. In addition, a special bonus is used by Mfold when empirical results deviate from the general rules. For example, hairpin loops are checked for a special GU closure and given a bonus.50 Due to these limitations, a new framework beyond the current Mfold is necessary to utilize the SEs that can be evaluated from comparative structures.

While the SEs derived from one molecular/phylogenetic dataset have the potential to improve the prediction accuracy for sequences in the same dataset significantly (i.e., self-folding), we expect that a single set of SEs, derived from numerous RNA sequences that span the phylogenetic tree of life, can be determined and utilized by an RNA folding algorithm to predict the secondary structure accurately for any RNA molecule.

The accuracy of folding each molecule using SEs for BP-ST-HF-IL structural parameters derived from molecule-specific and all-sequence datasets is shown in Fig. 6. The SEs for the all-sequence dataset predicted the secondary structures more accurately (by about 0.1) than the experimentally derived energy values for tRNA, eukaryotic 5 S rRNA, and bacterial 5 S rRNA. The improved accuracy for the bacterial 16 S rRNA dataset was moderate (about 0.05).
However, as noted earlier, bacterial 16 S rRNA has more HF and IL structural elements that cannot be utilized by the current Mfold program.

The use of HF and internal loop energies with BP-ST energies will enhance the prediction accuracy for some self- and cross-folding, and decrease the prediction accuracy for other folding. For example, tRNA-specific BP-ST-HF-IL SEs increase the accuracy of the base pairs in tRNA to 89%. However, the SEs derived from eukaryotic and bacterial 5 S rRNA decrease the accuracy for sequences in the tRNA dataset to 0.62 and 0.65, respectively. This corresponds to a decrease of almost 0.3 when the SE values for a different molecule/phylogenetic dataset are used. However, when only BP-ST SE is examined (Fig. 5), the difference in folding accuracy is about 0.1 between the tRNA-specific SE and SE from other molecules. Similar trends are observed for other molecules. This is again due to the infrequent occurrences of secondary structure elements such as HFs and ILs as opposed to BP-STs. It is possible that additional comparative structures from other molecules would be helpful in the future. Currently, it appears that the Boltzmann average between molecular contributions is an effective approach to retain the signals and to derive a general SE for these motifs.

Fig. 5. Each group of bars represents the folding accuracy of (from left to right) tRNA, eukaryotic 5 S rRNA, bacterial 5 S rRNA, and bacterial 16 S rRNA. Within each group, each bar represents (from left to right) unmodified Mfold, base-pair stack SE, hairpin flank SE, internal loop SE, and all available SE (base-pair stack, hairpin flank, and internal loops).

Discussion

Structural statistics from comparative sequence analysis can be used to generate SEs that agree with experimental measurements. For BP-ST, a correlation coefficient of 0.9 has been achieved between the energy values derived by structural statistics of bacterial 16 S rRNA and the all-sequence dataset and the free energy values extracted from experiments. Statistics for individual molecules, such as bacterial 5 S rRNA and tRNA, express specificity and, to some extent, rigidity. Smith et al.57 have found nucleotides in Loop E, Helix I, and Helix IV that are lethal to Saccharomyces cerevisiae. However, statistics from a single molecule may not sufficiently represent the complete Boltzmann distribution of BP-STs due to biologically driven biases as well as the small sample size of each small molecule, as shown by the number of nucleotides in Table 1, although the sampling issue will improve as more sequences are aligned.
Conversely, the dataset for bacterial 16 S rRNA has a vast sample size (slightly less than 1.5 million nucleotides) and, consequently, represents a Boltzmann-like distribution.

Utilizing the structural statistics, the BP-ST energy was decomposed into its corresponding base pair and stacking contributions. We find that base pairing is the main factor that determines the relative stability of that particular motif. Base pairing stability is highly correlated to BP-ST stability, with a correlation coefficient of 0.87, while no correlation is observed between the BP-ST SE and the decomposed stacking SE. The analysis suggests that the stacking likely contributes to the intrinsic stability in a non-specific way, while the base pairing, driven by the hydrogen bonding, renders the sequence specificity in the BP-ST.

We further evaluated the SE for BP-ST, internal loop and HF by applying them in Mfold to predict the secondary structures. The SEs have been derived from sequences of individual molecules (molecule-specific) and all combined (all-sequence). Molecule-specific BP-ST energy values improve folding accuracy for some molecules and have little effect on others, as compared to the experimental values used by the original Mfold. Using all-sequence BP-ST SE, we observe the accuracy of folding prediction to be comparable to that of free energy obtained experimentally. When SEs for hairpin loops and internal loops are included, we see dramatic improvements in the folding accuracy to 0.80, 0.79, and 0.77 for tRNA, eukaryotic 5 S rRNA, and bacterial 5 S rRNA, respectively. Much of the improvement in accuracy is due to the application of HF SEs.
Since Mfold does not utilize energy parameters for structures such as internal loops that have more than three nucleotides, not all SEs can be employed, which is the likely cause of the moderate improvement for 16 S rRNA.

Overall, the prediction accuracy when applying the SE values from one dataset to the sequences in another dataset is worse than the prediction accuracy for the experimentally determined energy values. Consistent with our previous analysis, the results suggest that individual RNA molecules have insufficient HF and IL statistics for a Boltzmann-like distribution, and care needs to be taken when combining the statistics from different molecules into general SEs. More importantly, we have demonstrated here that motifs beyond BP-ST are critical to a more complete understanding of RNA folding and to the refinement of folding algorithms.

Fig. 6. Each group of bars represents folding accuracy of (from left to right) tRNA, eukaryotic 5 S rRNA, bacterial 5 S rRNA, and bacterial 16 S rRNA. Within each group, each bar represents (from left to right) unmodified Mfold, and base-pair stack, hairpin flank, and internal loop SEs derived using the tRNA, eukaryotic 5 S rRNA, bacterial 5 S rRNA, bacterial 16 S rRNA, and all-sequence datasets.

Materials and Methods

RNA comparative structure

The RNA molecules 5 S rRNA, 16 S rRNA, and 23 S rRNA are available for different phylogenetic groups from the rCAD database. The rCAD database is implemented in Microsoft SQL Server, a relational database management system. Analysis performed on rCAD can be accessed online at the Comparative RNA Web Site (CRW). The rCAD system stores over 50,000 RNA sequences, 174,6521 nucleotides, and comparative structures. We have rich information on BP-ST statistics of 319 bacterial 16 S rRNA sequences, 650 tRNA sequences, 263 eukaryotic 5 S rRNA sequences, and 96 bacterial 5 S rRNA sequences. Sequences of the above molecules are available from all phylogenetic domains including Eukaryote, Bacteria, and Archaea. Sequences with similarity of greater than 97% have been pruned to eliminate duplicates. We assume an ensemble of good distribution.71 The ribosomal RNA is studied for its rich structural diversity, for its functional abilities, and for its well-conserved qualities within phylogenetic domains. The secondary structures of all sequences are evaluated from comparative analysis.4,35,42,72 Structural motifs such as BP-STs, HFs, hairpin loops, ILs, multistem loops, and bulges are stored in tables within the SQL server.

Base-pair stack statistical energy

Statistics of BP-STs found via comparative analysis have been collected to calculate statistical energies. A BP-ST is denoted AB/CD, where AB and CD are base pairs, and A and D are on the 5′ ends of the stack.
For example, the UA/GC BP-ST is identified in Fig. 1a. The sample size of BP-STs obtained from sequence analysis is three orders of magnitude greater than that obtained from crystal structures.29 In addition to containing a larger sample size than the set of crystal structures, the sequence data contain a greater diversity spanning a larger portion of the phylogenetic tree of life and, hence, a more complete ensemble. Statistics were obtained by counting BP-STs composed of canonical Watson-Crick (CG and AU) and GU base pairs, which are available in the Supplementary Data.

SEs are calculated from BP-ST statistics (see Supplementary Data figures and tables) and are evaluated as:

$$\Delta G_{\text{BP-ST}}(ij,kl) = -\lambda k_B T \ln\left(\frac{P_{\text{BP-ST}}(ij,kl)}{P^{\text{rand}}_{\text{BP-ST}}(ij,kl)}\right) \tag{1}$$

where:

$$P_{\text{BP-ST}}(ij,kl) = \frac{N_{\text{BP-ST}}(ij,kl) + N_{\text{BP-ST}}(kl,ij)}{N_{\text{BP-ST}}} \tag{2}$$

and:

$$P^{\text{rand}}_{\text{BP-ST}}(ij,kl) = (1 + \delta_{ij,kl})\, P_i P_j P_k P_l \tag{3}$$

The indices i, j, k, and l represent any of the nucleotides A, C, G, and U. We define $N_{\text{BP-ST}}(ij,kl)$ as the number of BP-STs composed of ij and kl. The total number of BP-STs is $N_{\text{BP-ST}}$. The delta function $\delta_{ij,kl}$ is equal to 1 if ij = kl and 0 otherwise. $P_i$ is the probability of occurrence of nucleotide i. The scaling factor, $\lambda$, is determined by setting $\lambda = \min\{\Delta G^{\text{Turner}}_{\text{BP-ST}}(ij,kl)\} / \min\{\Delta G_{\text{BP-ST}}(pq,rs)\}$, where the numerator is the minimum experimental BP-ST energy,50 and the denominator is the minimum BP-ST SE. Any SEs that are higher than the corresponding maximum experimental value are set equal to the maximum experimental value.

Equations (1)-(3) are used as a common treatment of BP-STs where symmetric BP-STs such as UA/CG (Fig. 1a) and CG/UA (Fig. 1b) are considered to be equivalent. For example, the BP-STs indicated in Fig. 1a and b are degenerate, since the same nucleotides are on the 5′ end of the strands (namely U and C) while A and G are on the 3′ end of the strands in both figures. These two rotated configurations are commonly considered to be equivalent and their statistics would be the average of the two conformations.
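The symmetric treatment of Eqs. (1)-(3) can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the code used in the study: the function name, the dictionary layout, the made-up counts, and the value of k_B T (0.616 kcal/mol, which assumes 37 °C) are all hypothetical, and the scaling factor is left at 1.

```python
import math

KB_T = 0.616  # kcal/mol; assumes 37 degrees C, which this section does not state


def bp_stack_se(stack_counts, nuc_freq, scale=1.0, kbt=KB_T):
    """Symmetric BP-ST statistical energies, following Eqs. (1)-(3).

    stack_counts maps (ij, kl) -> observed count, e.g. ("UA", "GC") -> 12.
    nuc_freq maps each nucleotide to its probability of occurrence.
    scale stands in for the scaling factor of Eq. (1).
    """
    total = sum(stack_counts.values())
    energies = {}
    for ij, kl in stack_counts:
        # Eq. (2): pool the counts of a stack and its rotated partner.
        pooled = stack_counts[(ij, kl)] + stack_counts.get((kl, ij), 0)
        if pooled == 0:
            continue  # unobserved stacks have no defined statistical energy
        p_obs = pooled / total
        # Eq. (3): reference state from single-nucleotide frequencies.
        delta = 1 if ij == kl else 0
        p_rand = (1 + delta) * math.prod(nuc_freq[n] for n in ij + kl)
        # Eq. (1): inverse Boltzmann relation.
        energies[(ij, kl)] = -scale * kbt * math.log(p_obs / p_rand)
    return energies


# Made-up counts and uniform frequencies, for illustration only.
example_counts = {("GC", "CG"): 30, ("CG", "GC"): 10, ("AU", "AU"): 10}
example_freqs = {"A": 0.25, "C": 0.25, "G": 0.25, "U": 0.25}
example_se = bp_stack_se(example_counts, example_freqs)
```

In this sketch a stack and its rotated partner always receive the same energy, which is exactly the degeneracy described for Fig. 1a and b.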
However, such treatment may not accurately represent the distribution of BP-ST configurations and the asymmetry of directionality of RNA structures, such as the consideration of individual nucleotides that are immediately 5′ and 3′ to a hairpin loop. Hence, we will investigate the asymmetric statistical energy as well. Only slight modifications are needed: eliminating the sum in Eq. (2) and eliminating the delta function in Eq. (3). The statistical energy is normalized by the reference state, $P^{\text{rand}}$, calculated from the probability of finding each individual nucleotide in the sequences of a given molecule.

We have developed molecule-specific energies by sampling sequences that are specific to particular molecules. For example, to evaluate tRNA-specific energy, we sample from the set of tRNA sequences.

Extracting components of base-pair stacking free energy

We attempt to separate the BP-ST energy into pairing and stacking components to elucidate the contributions of the two interactions. In order to evaluate the energy of base pairing, we have:

$$\Delta G_{\text{BP}}(i'j') = -\alpha k_B T \ln\left(\frac{P_{\text{BP}}(i'j')}{P^{\text{rand}}_{\text{BP}}(i'j')}\right) \tag{4}$$

where i′ and j′ are the nucleotides in consideration, $P_{\text{BP}}(i'j')$ is the probability of finding the base pair consisting of i′ and j′, and $P^{\text{rand}}_{\text{BP}}(i'j')$ is the reference state computed as the probability of finding individual nucleotides i′ and j′. The energy of stacking is evaluated as:

$$\Delta G_{\text{ST}}(i''j'') = -\beta k_B T \ln\left(\frac{P_{\text{ST}}(i''j'')}{P^{\text{rand}}_{\text{ST}}(i''j'')}\right) \tag{5}$$

where i″ and j″ are the nucleotides in consideration, $P_{\text{ST}}(i''j'')$ is the probability of finding the stack consisting of i″ and j″, and $P^{\text{rand}}_{\text{ST}}(i''j'')$ is essentially the same reference state as that in Eq. (4). With energies for base pairs, stacks, and BP-STs, we can use a linear fit to estimate the relative contribution of base pairs and stacks using the following form:

$$\Delta G_{\text{BP-ST}}(ij,kl) = \tfrac{1}{2}\left[\Delta G_{\text{BP}}(ij) + \Delta G_{\text{BP}}(kl)\right] + \Delta G_{\text{ST}}(ik) + \Delta G_{\text{ST}}(jl) + C \tag{6}$$

In the equation above, nucleotides ij and kl are base pairs, and ik and jl are stacked.

CRW Site: http://www.rna.ccbb.utexas.edu

Additional statistical energy terms

In addition to BP-STs, the energetics of other secondary structural motifs, although limited in Mfold, may be modified to further refine our free energy calculations and potentially improve folding results. Statistical potentials can be developed by identifying nucleotides at the ends of hairpin loops and nucleotides that flank those loops, as shown in Fig. 1c. We let i be the nucleotide in a base pair that surrounds the 5′ end of the hairpin loop, let j be the nucleotide on the 5′ end of the loop, let k be the nucleotide on the 3′ end of the loop, and let l be the nucleotide in a base pair (with nucleotide i) that surrounds the 3′ end of the loop. Then the free energy of hairpin flanks can be evaluated as:

$$\Delta G_{\text{HF}}(ij,kl) = -\lambda k_B T \ln\left(\frac{P_{\text{HF}}(ij,kl)}{P^{\text{rand}}_{\text{HF}}(ij,kl)}\right) \tag{7}$$

where:

$$P_{\text{HF}}(ij,kl) = \frac{N_{\text{HF}}(ij,kl)}{N_{\text{HF}}} \tag{8}$$

and:

$$P^{\text{rand}}_{\text{HF}}(ij,kl) = P_i P_j P_k P_l \tag{9}$$

The indices i, j, k, and l represent any of the nucleotides A, C, G, and U. We define $N_{\text{HF}}(ij,kl)$ as the number of hairpins composed of nucleotides ij and kl surrounding the motifs. The total number of hairpin flanks is $N_{\text{HF}}$. $P_i$ is the probability of occurrence of nucleotide i. The scaling factor, $\lambda$, is estimated in a manner similar to that used for the previous scaling factors, by comparing the minimum value of experimental data50 with the minimum value of $\Delta G_{\text{HF}}(ij,kl)$. Furthermore, HF statistical potentials are limited to the highest energy found in experimental results.

Internal loops are unpaired nucleotides surrounded by helices, as shown in Fig. 1d.
Energy terms for internal loops are derived using nucleotides of the loops on the 5′ and 3′ strands as well as the base pairs surrounding those nucleotides. Loops of different lengths on each strand use a different energy function. For the example of internal loops with two nucleotides on the 5′ and 3′ ends, we evaluate the free energy of internal loops as:

$$\Delta G_{\text{IL}}(i_1,i_2,i_3,i_4,j_1,j_2,j_3,j_4) = -\lambda k_B T \ln\left(\frac{P_{\text{IL}}(i_1,i_2,i_3,i_4,j_1,j_2,j_3,j_4)}{P^{\text{rand}}_{\text{IL}}(i_1,i_2,i_3,i_4,j_1,j_2,j_3,j_4)}\right) \tag{10}$$

where:

$$P_{\text{IL}}(i_1,i_2,i_3,i_4,j_1,j_2,j_3,j_4) = \frac{N_{\text{IL}}(i_1,i_2,i_3,i_4,j_1,j_2,j_3,j_4)}{N_{\text{IL}(2\times 2)}} \tag{11}$$

and:

$$P^{\text{rand}}_{\text{IL}}(i_1,i_2,i_3,i_4,j_1,j_2,j_3,j_4) = P_{i_1}P_{i_2}P_{i_3}P_{i_4}P_{j_1}P_{j_2}P_{j_3}P_{j_4} \tag{12}$$

The indices $i_1$ and $i_4$ are the base pairs around the internal loop on the 5′ end of the molecule. Indices $i_2$ and $i_3$ are the nucleotides constituting the internal loop on the 5′ end of the molecule. Indices $j_1$, $j_2$, $j_3$, and $j_4$ are the corresponding nucleotides on the 3′ end of the molecule. We define $N_{\text{IL}}(i_1,i_2,i_3,i_4,j_1,j_2,j_3,j_4)$ as the number of internal loops composed of nucleotides $i_1$, $i_2$, $i_3$, $i_4$, $j_1$, $j_2$, $j_3$, and $j_4$. Similarly, internal loops with one nucleotide on the 5′ and 3′ ends are evaluated as:

$$\Delta G_{\text{IL}}(i_1,i_2,i_3,j_1,j_2,j_3) = -\lambda k_B T \ln\left(\frac{P_{\text{IL}}(i_1,i_2,i_3,j_1,j_2,j_3)}{P^{\text{rand}}_{\text{IL}}(i_1,i_2,i_3,j_1,j_2,j_3)}\right) \tag{12}$$

In this case, $i_1$, $i_3$, $j_1$, and $j_3$ are the surrounding base pairs, while $i_2$ and $j_2$ are the nucleotides of the internal loops. Finally, internal loops with one nucleotide on one strand and two nucleotides on another strand are evaluated as:

$$\Delta G_{\text{IL}}(i_1,i_2,i_3,j_1,j_2,j_3,j_4) = -\lambda k_B T \ln\left(\frac{P_{\text{IL}}(i_1,i_2,i_3,j_1,j_2,j_3,j_4)}{P^{\text{rand}}_{\text{IL}}(i_1,i_2,i_3,j_1,j_2,j_3,j_4)}\right) \tag{13}$$

The nucleotides $i_1$, $i_3$, $j_1$, and $j_4$ are the surrounding base pairs and the rest are the internal loops. The scaling factor, $\lambda$, was estimated similarly to the previous scaling factors, by comparing the minimum value of experimental data50 with the minimum value of $\Delta G_{\text{IL}}(i_1,i_2,i_3,j_1,j_2,j_3,j_4)$.
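The hairpin-flank and internal-loop energies share one recipe: an observed frequency, a reference state built from single-nucleotide frequencies, and a calibration against experiment. The sketch below illustrates that recipe for motifs keyed by their nucleotide string. It assumes that the scaling factor is the ratio of the experimental minimum to the minimum of the unscaled SE, and that capping is applied after scaling; the function names, counts, and experimental bounds are hypothetical.

```python
import math


def motif_se(counts, nuc_freq, kbt=0.616):
    """Unscaled statistical energy for motifs keyed by nucleotide strings.

    A hairpin flank is keyed as i+j+k+l (e.g. "GACC" for the GC-AC flank);
    an internal loop is keyed by its six, seven or eight nucleotides.
    kbt is in kcal/mol and assumes 37 degrees C (an assumption of this sketch).
    """
    total = sum(counts.values())
    energies = {}
    for motif, n in counts.items():
        p_obs = n / total
        p_rand = math.prod(nuc_freq[x] for x in motif)
        energies[motif] = -kbt * math.log(p_obs / p_rand)
    return energies


def calibrate(raw, exp_min, exp_max):
    """Scale so the most stable SE equals exp_min, then cap values at exp_max."""
    raw_min = min(raw.values())
    if raw_min >= 0:
        raise ValueError("calibration needs at least one stabilizing motif")
    scale = exp_min / raw_min  # plays the role of the scaling factor
    return {motif: min(scale * e, exp_max) for motif, e in raw.items()}


# Hypothetical hairpin-flank counts and experimental bounds (kcal/mol).
hf_counts = {"CUGG": 6, "GACC": 3, "AAAU": 1}
uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "U": 0.25}
hf_raw = motif_se(hf_counts, uniform)
hf_se = calibrate(hf_raw, exp_min=-1.5, exp_max=-1.0)
```

Keeping the raw and calibrated energies separate makes it easy to recompute the scaling factor for each motif class, since a separate factor is used for each equation.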
Note that $\lambda$ is calculated uniquely for each equation. Furthermore, HF statistical potentials are limited to the highest energy found in experimental results. All statistically derived energies can be downloaded and applied to the original Mfold program on the web.

Statistical energy derived from all-sequence dataset

In addition to molecule-specific potentials, sequences from all molecules are combined to derive the all-sequence SE that can potentially be applied to prediction. For BP-STs, we have combined all sequences from all molecules to derive the SE. For HFs and ILs, however, due to their limited occurrence in smaller molecules such as the 5 S rRNA and tRNA, a different approach was used to combine the statistics from individual molecules. Hence, the all-sequence hairpin flank SE and all-sequence internal loop SE are derived by averaging the molecule-specific SEs from all molecules using Boltzmann weights:

$$\Delta G^{P}_{\text{allseq}} = \frac{\sum_{i \in I} \exp\left(-\Delta G^{P}_{i}/k_B T\right) \Delta G^{P}_{i}}{\sum_{i \in I} \exp\left(-\Delta G^{P}_{i}/k_B T\right)} \tag{14}$$

For each structural element P, such as the HFs and ILs discussed above, the all-sequence statistical energy is a weighted sum of energies from each molecule i.
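Equation (14) can be illustrated with a short Python sketch. It assumes that a motif absent from a molecule's dataset is simply left out of the sum for that motif, which the text does not specify, and it uses k_B T = 0.616 kcal/mol (37 °C); the function names and the example energies are hypothetical.

```python
import math


def boltzmann_average(energies, kbt=0.616):
    """Boltzmann-weighted mean of molecule-specific energies, as in Eq. (14)."""
    weights = [math.exp(-e / kbt) for e in energies]
    return sum(w * e for w, e in zip(weights, energies)) / sum(weights)


def all_sequence_se(per_molecule, kbt=0.616):
    """Combine {molecule: {motif: SE}} tables into one all-sequence table.

    Assumption: a motif contributes only from the molecules in which it occurs.
    """
    motifs = set().union(*per_molecule.values())
    combined = {}
    for motif in motifs:
        observed = [se[motif] for se in per_molecule.values() if motif in se]
        combined[motif] = boltzmann_average(observed, kbt)
    return combined


# Hypothetical molecule-specific hairpin-flank energies (kcal/mol).
tables = {
    "tRNA": {"CUGG": -2.0},
    "bacterial 16 S rRNA": {"CUGG": -1.0, "GACC": -1.2},
}
allseq = all_sequence_se(tables)
```

Because the weights favor lower energies, the combined value lies closer to the most stabilizing molecule-specific estimate than a plain arithmetic mean would.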
Evaluation of statistical potentials

We applied different combinations of BP-ST, HF and IL SEs to the Mfold program50 developed by Zuker et al. to test their ability to predict RNA folding.1 The Mfold program was also modified to utilize asymmetric BP-ST SE in energy calculations. Five sets of asymmetric SE were derived from tRNA, eukaryotic 5 S rRNA, bacterial 5 S rRNA, bacterial 16 S rRNA, and all sequences combined. Each set of SE utilizing BP-ST and BP-ST-HF-IL terms was used to fold each of the four molecules.

In order to make performance comparisons of energy values, the number of base pairs of a structure predicted by Mfold that are in agreement with comparative sequence analysis5 is divided by the number of all base pairs determined by comparative sequence analysis. Only canonical (Watson-Crick G-C and A-U, and G-U) base pairs are counted. Although this performance measurement does not count erroneously predicted base pairs, false positives affect the predicted structure and may preclude correct base pairs from forming. This metric is simple and sufficient for evaluating the accuracy of folding.

Acknowledgements

The authors acknowledge the Robert A. Welch Foundation (grant numbers F-1691 and F-1427), the National Institutes of Health (R01GM079686 and R01GM067317), and Microsoft Research for financial gifts and a TCI grant. Kishore Doshi is acknowledged for his contribution to the development of the rCAD system. The authors are also grateful for resources provided by the Texas Advanced Computing Center.

Supplementary Data

Supplementary data associated with this article can be found, in the online version, at doi:10.1016/j.jmb.2009.06.036

References

1. Zuker, M., Jaeger, J. A. & Turner, D. H. (1991). A comparison of optimal and suboptimal RNA secondary structures predicted by free energy minimization with structures determined by phylogenetic comparison. Nucleic Acids Res. 19, 2707-2714.
2. Yakovchuk, P., Protozanova, E. & Frank-Kamenetskii, M. D. (2006).
Base-stacking and base-pairing contributions into thermal stability of the DNA double helix. Nucleic Acids Res. 34, 564-574.
3. Guckian, K. M., Schweitzer, B. A., Ren, R. X. F., Sheils, C. J., Tahmassebi, D. C. & Kool, E. T. (2000). Factors contributing to aromatic stacking in water: evaluation in the context of DNA. J. Am. Chem. Soc. 122, 2213-2222.
4. Gutell, R. R., Weiser, B., Woese, C. R. & Noller, H. F. (1985). Comparative anatomy of 16 S-like ribosomal RNA. Prog. Nucleic Acid Res. Mol. Biol. 32, 155-216.
5. Gutell, R. R., Lee, J. C. & Cannone, J. J. (2002). The accuracy of ribosomal RNA comparative structure models. Curr. Opin. Struct. Biol. 12, 301-310.
6. Gutell, R. R., Cannone, J. J., Konings, D. & Gautheret, D. (2000). Predicting U-turns in ribosomal RNA with comparative sequence analysis. J. Mol. Biol. 300, 791-803.
7. Lee, J. C., Cannone, J. J. & Gutell, R. R. (2003). The lonepair triloop: a new motif in RNA structure. J. Mol. Biol. 325, 65-83.
8. Gutell, R. R., Cannone, J. J., Shang, Z., Du, Y. & Serra, M. J. (2000). A story: unpaired adenosine bases in ribosomal RNAs. J. Mol. Biol. 304, 335-354.
9. Nissen, P., Ippolito, J. A., Ban, N., Moore, P. B. & Steitz, T. A. (2001). RNA tertiary interactions in the large ribosomal subunit: the A-minor motif. Proc. Natl Acad. Sci. USA, 98, 4899-4903.
10. Battle, D. J. & Doudna, J. A. (2002). Specificity of RNA-RNA helix recognition. Proc. Natl Acad. Sci. USA, 99, 11676-11681.
11. Woese, C. R., Winker, S. & Gutell, R. R. (1990). Architecture of ribosomal-RNA - constraints on the sequence of tetra-loops. Proc. Natl Acad. Sci. USA, 87, 8467-8471.
12. Michel, F. & Westhof, E. (1990). Modeling of the 3-dimensional architecture of group-I catalytic introns based on comparative sequence-analysis. J. Mol. Biol. 216, 585-610.
13. Elgavish, T., Cannone, J. J., Lee, J. C., Harvey, S. C. & Gutell, R. R. (2001). AA.AG@helix.ends: A:A and A:G base-pairs at the ends of 16 S and 23 S rRNA helices. J. Mol. Biol. 310, 735-753.
14. Lee, J. C. & Gutell, R. R. (2004).
Diversity of base-pair conformations and their occurrence in rRNA structure and RNA structural motifs. J. Mol. Biol. 344, 1225-1249.
15. Leontis, N. B., Stombaugh, J. & Westhof, E. (2002). The non-Watson-Crick base pairs and their associated isostericity matrices. Nucleic Acids Res. 30, 3497-3531.
16. Xin, Y. R. & Olson, W. K. (2009). BPS: a database of RNA base-pair structures. Nucleic Acids Res. 37, D83-D88.
17. Doshi, K. J., Cannone, J. J., Cobaugh, C. W. & Gutell, R. R. (2004). Evaluation of the suitability of free-energy minimization using nearest-neighbor energy parameters for RNA secondary structure prediction. BMC Bioinformatics, 5, 105.
18. Ding, Y., Chan, C. Y. & Lawrence, C. E. (2005). RNA secondary structure prediction by centroids in a Boltzmann weighted ensemble. RNA, 11, 1157-1166.
19. Floudas, C. A., Fung, H. K., McAllister, S. R., Monnigmann, M. & Rajgaria, R. (2006). Advances in protein structure prediction and de novo protein design: a review. Chem. Eng. Sci. 61, 966-988.
20. Shen, M. Y. & Sali, A. (2006). Statistical potential for assessment and prediction of protein structures. Protein Sci. 15, 2507-2524.
21. Tanaka, S. & Scheraga, H. A. (1976). Medium- and long-range interaction parameters between amino acids for predicting three-dimensional structures of proteins. Macromolecules, 9, 945-950.
22. Bryant, S. H. & Lawrence, C. E. (1991). The frequency of ion-pair substructures in proteins is quantitatively related to electrostatic potential - a statistical-model for nonbonded interactions. Proteins: Struct. Funct. Genet. 9, 108-119.
23. Finkelstein, A. V., Badretdinov, A. Y. & Gutin, A. M. (1995). Why do protein architectures have Boltzmann-like statistics. Proteins: Struct. Funct. Genet. 23, 142-150.
24. Sakakibara, Y., Brown, M., Hughey, R., Mian, I. S., Sjolander, K., Underwood, R. C. & Haussler, D.
  • 14. (1994). Stochastic context-free grammars for transfer-RNA modeling. Nucleic Acids Res. 22, 51125120.25. Durbin, R., Eddy, S. R., Krogh, A. & Mitchison, G.(1998). Biological Sequence Analysis: Probabilistic Modelsof Proteins and Nucleic Acids. Cambridge UniversityPress.26. Knudsen, B. & Hein, J. (1999). RNA secondary struc-ture prediction using stochastic context-free gram-mars and evolutionary history. Bioinformatics, 15,446454.27. Knudsen, B. & Hein, J. (2003). Pfold: RNA secondarystructure prediction using stochastic context-freegrammars. Nucleic Acids Res. 31, 34233428.28. Do, C. B., Woods, D. A. & Batzoglou, S. (2006).CONTRAfold: RNA secondary structure predictionwithout physics-based models. Bioinformatics, 22,E90E98.29. Dima, R. I., Hyeon, C. & Thirumalai, D. (2005).Extracting stacking interaction parameters for RNAfrom the data set of native structures. J. Mol. Biol. 347,5369.30. Berman, H. M., Westbrook, J., Feng, Z., Gilliland, G.,Bhat, T. N., Weissig, H., Shindyalov, N. & Bourne, P. E.(2000). The Protein Data Bank. Nucleic Acids Res. 28,235242.31. Das, R. & Baker, D. (2007). Automated de novoprediction of native-like RNA tertiary structures. Proc.Natl Acad. Sci. USA, 104, 1466414669.32. Sykes, M. T. & Levitt, M. (2005). Describing RNAstructure by libraries of clustered nucleotide doublets.J. Mol. Biol. 351, 2638.33. Parisien, M. & Major, F. (2008). The MC-Fold and MC-Sym pipeline infers RNA structure from sequencedata. Nature, 452, 5155.34. Jonikas, M. A., Radmer, R. J., Laederach, A., Das, R.,Pearlman, S., Herschlag, D. & Altman, R. B. (2009).Coarse-grained modeling of large RNA moleculeswith knowledge-based potentials and structuralfilters. RNA, 15, 189199.35. Woese, C. R., Gutell, R., Gupta, R. & Noller, H. F.(1983). Detailed analysis of the higher-order structureof 16 S-like ribosomal ribonucleic-acids. Microbiol. Rev.47, 621669.36. Holley, R. W., Apgar, J., Everett, G. A., Madison, J. T.,Marquise, M., Merrill, S. H. et al. (1965). 
Structure of aribonucleic acid. Science, 147, 14621465.37. Levitt, M. (1969). Detailed molecular model fortransfer ribonucleic acid. Nature, 224, 759763.38. Fox, G. E. & Woese, C. R. (1975). 5S-RNA secondarystructure. Nature, 256, 505507.39. Woese, C. R., Magrum, L. J., Gupta, R., Siegel,R. B., Stahl, D. A., Kop, J. et al. (1980). Secondarystructure model for bacterial 16S ribosomal-RNA phylogenetic, enzymatic and chemical evidence.Nucleic Acids Res. 8, 22752293.40. Zwieb, C., Glotz, C. & Brimacombe, R. (1981).Secondary structure comparisons between small sub-unit ribosomal-RNA molecules from 6 differentspecies. Nucleic Acids Res. 9, 36213640.41. Stiegler, P., Carbon, P., Zuker, M., Ebel, J. P. &Ehresmann, C. (1980). Secondary structure andtopography of 16S-ribosomal RNA from Escherichiacoli. C R Des Sci. D, 291, 937940.42. Noller, H. F., Kop, J., Wheaton, V., Brosius, J., Gutell,R. R., Kopylov, A. M. et al. (1981). Secondarystructure model for 23S ribosomal-RNA. NucleicAcids Res. 9, 61676189.43. Glotz, C., Zwieb, C., Brimacombe, R., Edwards, K. &Kossel, H. (1981). Secondary structure of the largesubunit ribosomal-RNA from Escherichia coli, Zeamays chloroplast, and human and mouse mitochon-drial ribosomes. Nucleic Acids Res. 9, 32873306.44. Branlant, C., Krol, A., Machatt, M. A., Pouyet, J.,Ebel, J. P., Edwards, K. & Kossel, H. (1981). Primaryand secondary structures of Escherichia coli Mre-600-23S ribosomal-RNA - comparison with models ofsecondary structure for maize chloroplast 23Sribosomal-RNA and for large portions of mouseand human 16S mitochondrial ribosomal-RNAs.Nucleic Acids Res. 9, 43034324.45. James, B. D., Olsen, G. J., Liu, J. S. & Pace, N. R.(1988). The secondary structure of ribonuclease-PRNA, the catalytic element of a ribonucleoproteinenzyme. Cell, 52, 1926.46. Michel, F., Jacquier, A. & Dujon, B. (1982).Comparison of fungal mitochondrial introns revealsextensive homologies in RNA secondary structure.Biochimie, 64, 867881.47. Cech, T. R. (1988). 
Conserved sequences and struc-tures of group-I introns - building an active-site forRNA catalysis a review. Gene, 73, 259271.48. Michel, F., Umesono, K. & Ozeki, H. (1989).Comparative and functional-anatomy of group-IIcatalytic introns - a review. Gene, 82, 530.49. Xia, T. B., SantaLucia, J., Burkard, M. E., Kierzek, R.,Schroeder, S. J., Jiao, X. Q. et al. (1998). Thermodynamicparameters for an expanded nearest-neighbor modelfor formation of RNA duplexes with Watson-Crickbase pairs. Biochemistry, 37, 1471914735.50. Mathews, D. H., Sabina, J., Zuker, M. & Turner, D. H.(1999). Expanded sequence dependence of thermo-dynamic parameters improves prediction of RNAsecondary structure. J. Mol. Biol. 288, 911940.51. Vallurupalli, P. & Moore, P. B. (2003). The solutionstructure of the loop E region of the 5 S rRNA fromspinach chloroplasts. J. Mol. Biol. 325, 843856.52. Kiparisov, S., Petrov, A., Meskauskas, A., Sergiev,P. V., Dontsova, O. A. & Dinman, J. D. (2005). Structuraland functional analysis of 5S rRNA in Saccharomycescerevisiae. Mol. Genet. Genomics, 274, 235247.53. Gutell, R. R. & Woese, C. R. (1990). Higher-orderstructural elements in ribosomal-RNAs - pseudo-knots and the use of noncanonical pairs. Proc. NatlAcad. Sci. USA, 87, 663667.54. Liang, X. G., Kuhn, H. & Frank-Kamenetskii, M. D.(2006). Monitoring single-stranded DNA secondarystructure formation by determining the topologicalstate of DNA catenanes. Biophys. J. 90, 28772889.55. Kool, E. T., Morales, J. C. & Guckian, K. M. (2000).Mimicking the structure and function of DNA:insights into DNA stability and replication. Angew.Chem. Int. Ed. 39, 9901009.56. Protozanova, E., Yakovchuk, P. & Frank-Kamenetskii,M. D. (2004). Stacked-unstacked equilibrium at thenick site of DNA. J. Mol. Biol. 342, 775785.57. Smith, M. W., Meskauskas, A., Wang, P., Sergiev, P. V.& Dinman, J. D. (2001). Saturation Mutagenesis of 5SrRNA in Saccharomyces cerevisiae. Mol. Cell. Biol. 21,82648275.58. Wimberly, B., Varani, G. & Tinoco, I. 
(1993). Theconformation of loop-E of eukaryotic 5S-ribosomalRNA. Biochemistry, 32, 10781087.59. Tuerk, C., Gauss, P., Thermes, C., Groebe, D. R.,Gayle, M., Guild, N. et al. (1988). CUUCGGhairpins - extraordinarily stable RNA secondarystructures associated with various biochemicalprocesses. Proc. Natl Acad. Sci. USA, 85, 13641368.60. Antao, V. P. & Tinoco, I. (1992). Thermodynamic782 RNA Secondary Structure
  • 15. parameters for loop formation in RNA and DNAhairpin tetraloops. Nucleic Acids Res. 20, 819824.61. Huang, S. G., Wang, Y. X. & Draper, D. E. (1996).Structure of a hexanucleotide RNA hairpin loop con-served in ribosomal RNAs. J. Mol. Biol. 258, 308321.62. Fountain, M. A., Serra, M. J., Krugh, T. R. & Turner,D. H. (1996). Structural features of a six-nucleotideRNA hairpin loop found in ribosomal RNA.Biochemistry, 35, 65396548.63. Serra, M. J., Lyttle, M. H., Axenson, T. J., Schadt,C. A. & Turner, D. H. (1993). RNA hairpin loopstability depends on closing base-pair. NucleicAcids Res. 21, 38453849.64. Gautheret, D., Konings, D. & Gutell, R. R. (1994). Amajor family of motifs involving GA mismatches inribosomal-RNA. J. Mol. Biol. 242, 18.65. Walter, A. E., Wu, M. & Turner, D. H. (1994). Thestability and structure of tandem GA mismatches inRNA depend on closing base-pairs. Biochemistry, 33,1134911354.66. Santalucia, J., Kierzek, R. & Turner, D. H. (1991).Stabilities of consecutive A.C, C.C, G.G, U.C, andU.U mismatches in RNA internal loops - evidencefor stable hydrogen-bonded U.U and C.C+ pairs.Biochemistry, 30, 82428251.67. Gate, J. H., Gooding, A. R., Podell, E., Zhou, K. H.,Golden, B. L., Szewczak, A. A. et al. (1996). RNAtertiary structure mediation by adenosine platforms.Science, 273, 16961699.68. Costa, M. & Michel, F. (1995). Frequent use of the sametertiary motif by self-folding RNAs. EMBO J. 14,12761285.69. Jaeger, L., Michel, F. & Westhof, E. (1994). Involve-ment OF A GNRA tetraloop in long-range tertiaryinteractions. J. Mol. Biol. 236, 12711276.70. Lee, J. C., Gutell, R. R. & Russell, R. (2006). TheUAA/GAN internal loop motif: a new RNAstructural element that forms a cross-strand AAAstack and long-range tertiary interactions. J. Mol.Biol. 360, 978988.71. Bloch, F. (2000). Fundamentals of Statistical Mechanics:Manuscript and Notes of Felix Bloch. Imperial CollegePress; World Scientific, London, Singapore.72. Gutell, R. R. (1994). 
Collection of small-subunit(16S- and 16S-like) ribosomal-RNA structures.Nucleic Acids Res. 22, 35023507.783RNA Secondary Structure