
doi:10.1016/j.jmb.2011.08.033 J. Mol. Biol. (2011) 413, 473–483


Journal of Molecular Biology
journal homepage: http://ees.elsevier.com.jmb

Statistical Potentials for Hairpin and Internal Loops Improve the Accuracy of the Predicted RNA Structure

David P. Gardner 1, Pengyu Ren 2, Stuart Ozer 3 and Robin R. Gutell 1⁎

1 Center for Computational Biology and Bioinformatics, Section of Integrative Biology in the School of Biological Sciences, and the Institute for Cellular and Molecular Biology, University of Texas at Austin, 2401 Speedway, Austin, TX 78712, USA
2 Department of Biomedical Engineering, University of Texas at Austin, Austin, TX 78712-1062, USA
3 Microsoft Corporation, 1 Microsoft Way, Redmond, WA 98052, USA

Received 16 February 2011; received in revised form 12 August 2011; accepted 16 August 2011. Available online 23 August 2011

Edited by D. E. Draper

Keywords: statistical potentials; RNA folding; comparative analysis; RNA structure; accuracy of the predicted RNA structure

*Corresponding author. E-mail address: [email protected].

Abbreviations used: rCAD, RNA Comparative Analysis Database; CRW site, Comparative RNA Web site; SRP, signal recognition particle; HCV IRES, hepatitis C virus internal ribosome entry site; IRE, iron response element; HIV DIS, human immunodeficiency virus type 1 dimerization initiation site; HDV, hepatitis delta virus; C/P ratio, comparative/potential ratio.

0022-2836/$ - see front matter © 2011 Elsevier Ltd. All rights reserved.

RNA is directly associated with a growing number of functions within the cell. The accurate prediction of different RNA higher-order structures from their nucleic acid sequences will provide insight into their functions and molecular mechanics. We have been determining statistical potentials for a collection of structural elements that is larger than the set of structural elements for which energy values have been determined experimentally. The experimentally derived free energies and the statistical potentials for canonical base-pair stacks are analogous, demonstrating that statistical potentials derived from comparative data can be used as an alternative energetic parameter. A new computational infrastructure, the RNA Comparative Analysis Database (rCAD), which utilizes a relational database, was developed to manipulate and analyze very large sequence alignments and secondary-structure data sets. Using rCAD, we determined a richer set of energetic parameters for RNA fundamental structural elements including hairpin and internal loops. A new version of RNAfold was developed to utilize these statistical potentials. Overall, these new statistical potentials for hairpin and internal loops integrated into the new version of RNAfold demonstrated significant improvements in the prediction accuracy of RNA secondary structure.

© 2011 Elsevier Ltd. All rights reserved.

Introduction

“The comparative approach indicates far more than the mere existence of a secondary structural element; it ultimately provides the detailed rules for constructing the functional form of each helix. Such rules are a transformation of the detailed physical relationships of a helix and perhaps even reflection of its detailed energetics as well. (One might envision a future time when comparative sequencing provides energetic measurements too subtle for physical chemical measurements to determine).”1

The RNA sequences and their structures that we observe today are the last record of their biological ancestry. The snapshots of these RNA structures are the result of their evolution from a simpler structure and organization to their more sophisticated and complex state. Traditional experimental manipulation of biological systems expands our understanding of these systems. These laboratory experiments are designed to test or expand upon a hypothesis, based in part on the underlying principles of RNA structure and a predicted or experimentally determined higher-order structure. In contrast, Mother Nature's experiments during the evolution of RNA are derived from an apparently random collection of mutations and other changes to the biological systems. The molecules and cells that survive these mutations reveal the characteristics of the RNA that maintain the integrity of their structure and function. Thus, the task for comparative analysis is complementary to hypothesis-driven experimentation. Experimentalists prove, disprove, or determine more details for their hypothesis while comparative analysis attempts to decipher the principles that are the boundary conditions for the collections of biological data that have survived their evolutionary process.

The first stage of comparative analysis is the collection of a phylogenetically diverse set of RNA sequences and structures, followed by the comparative and covariation analysis of these linear strings of the four nucleotides in RNA, adenine (A), guanine (G), cytosine (C), and uracil (U), to identify a secondary structure that is similar for each of the RNA sequences that are in the same RNA family. For each of these RNA families, such as tRNA and 16S ribosomal (r)RNA, many different sequences fold into the same higher-order structure. Encrypted in these relationships between sequence and higher-order structure models are the fundamental rules that govern the multiple levels of RNA structure, starting with the formation of the smaller structural elements such as the base pair and base stacking, continuing to larger structural elements that are composed of different types and arrangements of these base pairs and base stacks, and culminating in the formation of significantly larger higher-order structures that have the capacity to dynamically catalyze chemical reactions and change their higher-order structure. To facilitate the RNA's function, these fundamental rules for RNA structure are also directly associated with the folding of an RNA's primary structure into its secondary, tertiary, and quaternary structures.

Comparative analysis is composed of multiple dimensions of information. New technology provides us with significant amounts of data for each of the dimensions of RNA: (1) nucleotide sequences for organisms that span the entire phylogenetic tree of life, (2) the accurate prediction of the secondary structures that are similar for each of the sequences in a single RNA family, (3) analysis of the high-resolution crystal structures and the comparative structure models, which reveals different RNA structural motifs and elements that are the basic building blocks of a complete RNA structure, and (4) the historical record of these evolving RNAs, which provides insight into their evolutionary dynamics and phylogenetic relationships.

In contrast to comparative analysis, physical biochemists usually use different experimental methods to solve simplified model systems that are less complex than the structure of the entire RNA. In particular, many laboratories have been obtaining free-energy values for different structural elements. Approximately 66% of many RNA structures are composed of a set of base pairs that form a regular helix.2,3 The energetic values for consecutive base pairs have been studied for more than 25 years, initially focusing on canonical (i.e., G:C, A:U, and G:U) and, later, noncanonical base pairs.4–7 The energetic values for other types of structural elements, including helices with dangling ends,8 hairpin,9 internal10,11 and multi-stem12 loops, coaxial stacking,13 and other structural motifs, for example, the UAA/GAN motif,14 have also been determined.

The most widely used program (and its derivatives) to predict an RNA secondary structure with the minimal free energy from a single nucleic acid sequence is Mfold.15 Early studies revealed that the accuracy of the predicted structures is dependent in part on the free-energy values for different structural motifs and the length of the RNA molecule.16 As more free-energy values were determined for consecutive base pairs and new RNA structural motifs, the prediction accuracies increased. For example, the identification of the GNRA, UUCG, and CUUG hairpin tetraloops17,18 and the subsequent determination of their extra-stable free-energy value19,20 resulted in an improvement in the prediction accuracy.16 Subsequent studies showed that the prediction accuracy is dependent on the phylogenetic group of the RNA molecule and the distance separating the nucleotides that are base paired (i.e., simple distance).21 An analysis of a significantly larger data set substantiated these earlier studies22 while providing a more detailed assessment of the factors that affect prediction accuracy. For example, base pairs with a smaller simple distance occur significantly more frequently than base pairs with larger simple distances, and the prediction accuracy of individual base pairs decreases exponentially as their simple distance increases.22

Thus, a larger number of free-energy values for a variety of structural elements are required to accurately and routinely predict the secondary structure for an RNA molecule. Carl Woese's remarkable foresight in 1983 that comparative analysis can be used to determine RNA energetic measurements of higher-order structural elements was not appreciated at that time. However, this approach has been used in the prediction of protein structure,23–29 suggesting that Woese's idea could have the potential to reveal free-energy values for RNA that are not easily discernible with experimental methods. Within the past few years, statistical potentials determined with comparative analysis30,31 for a few RNA structural elements were similar to the free-energy values determined with experimental methods. The replacement of base-pair stacking energetic parameters with statistical potentials generated from an analysis of RNA crystal structures showed similar prediction accuracies.30 These results emphasize that comparative data can be used to create similar energy values for some structural elements.

Fig. 1. The ranked order of the 20 tetraloop hairpin loops (with any closing canonical base pair) with the highest C/P ratios (red bars) is shown along the x-axis. The C/P ratio for each of these tetraloop hairpin loops is shown on the y-axis. The ratios for tetraloop hairpin loops flanked by any canonical base pair are shown as red bars, while the tetraloop hairpin loops flanked by a CG base pair are shown as blue bars. The values are for bacterial 16S rRNA.

Previously, we determined statistical potentials for canonical base-pair stacks that occur within a regular helix. While the statistical potentials for canonical base-pair stacks resulted in a very minimal improvement in the accuracy of the predicted secondary structure, a larger improvement was observed when statistical potentials were determined for the nucleotides immediately flanking the ends of the helix and in small internal loops (1×1, 1×2, 2×2)31 and used in place of the equivalent experimentally determined energetic parameters.

Statistical learning procedures are another form of a knowledge-based approach for improving energetic parameters. Methods using stochastic context-free grammars showed prediction accuracies32 near those of RNAstructure33 and Mfold.15 CONTRAfold34 is based upon conditional log-linear models, which are an extension of stochastic context-free grammars.34 The energetic parameters used by CONTRAfold were selected to maximize the conditional likelihood of the structures within the sequences analyzed. Andronescu et al. utilized constraint generation and Boltzmann likelihood methods to estimate their energetic parameters used by the program MultiFold.35

Our confidence in Woese's 1983 statement influenced the development of our RNA Comparative Analysis Database (rCAD) (Ozer, Doshi, Xu and Gutell, in press). One objective of this article is to utilize rCAD to determine a richer set of energetic parameters from our comparative analysis of RNA sequences and their structures. We have developed new statistical potentials for hairpin and internal loops but not for base-pair stacks and multi-stem loops. A modified version of RNAfold36,37 was developed to utilize this new set of statistical potentials. Another objective of this article is to quantify the effect that our new statistical potentials had on the accuracy of the predicted secondary-structure model.

Results and Discussion

Hairpin loop comparative/potential ratio

To determine the likelihood that a structural element will occur in the correct structure, we determined a ratio of the number of occurrences of that element in the comparative structure model divided by the number of potential occurrences of that element in the same RNA molecular class (see Methods). An example of the comparative/potential (C/P) ratio for tetraloop hairpin loops in bacterial 16S rRNA is shown in Figure 1. The following are a few of the highlights: (1) five of the tetraloop hairpin loops with any closing canonical base pair have a C/P value greater than 0.5; (2) the closing base pair of these hairpin loops can alter the C/P values. For example, the C:G closing base pair usually increases the C/P values significantly for the 20 tetraloops shown in Figure 1.
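The C/P ratio described above can be illustrated with a minimal Python sketch. The function name `cp_ratios`, the string encoding of a tetraloop with its closing base pair, and the counts are all hypothetical and chosen only for illustration; they are not taken from rCAD or from the data in Figure 1.

```python
from collections import Counter

def cp_ratios(comparative, potential):
    """Return the C/P ratio for every element that has at least one potential occurrence.

    comparative: iterable of elements observed in the comparative structure models
    potential:   iterable of elements that could form in the same sequences
                 (every comparative occurrence is also a potential occurrence)
    """
    c_counts = Counter(comparative)
    p_counts = Counter(potential)
    return {elem: c_counts[elem] / n for elem, n in p_counts.items() if n > 0}

# Hypothetical counts: closing pair and tetraloop written 5'-3' as "C-GAAA-G".
potential = ["C-GAAA-G"] * 4 + ["G-UUCG-C"] * 10
comparative = ["C-GAAA-G"] * 3 + ["G-UUCG-C"] * 1

ratios = cp_ratios(comparative, potential)
print(ratios["C-GAAA-G"])  # 0.75
```

In this toy data set the first element forms in three of its four potential sites, so it would rank high in a plot such as Figure 1, while the second forms in only one of ten.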


The effect of the different closing base pairs on the C/P value for tetraloops is available at the Comparative RNA Web (CRW) site†. Also available are the C/P ratios for hairpin loops of lengths 3–5 and for all of the molecular classes used in this study. The other structural statistics at the CRW site (i.e., nucleotide, base pairs, internal and multi-stem loops) all reveal significant biases in the frequencies of the sequences and their lengths. This general concept is used to create the statistical potentials.

†http://www.rna.ccbb.utexas.edu/SAE/2D/index.php

Hairpin loop statistical potentials

Hairpin loop statistical potentials were created and tested using Eqs. (2) and (4) (see Methods). The 16 RNA molecular classes (see Methods) included in the creation of our statistical potentials were the bacterial and eukaryotic 5S rRNA, bacterial and eukaryotic 16S rRNA, bacterial 23S rRNA, tRNA,38 bacterial RNase P class A,39 bacterial signal recognition particle (SRP),40 U1 spliceosomal RNA,41 hepatitis C virus internal ribosome entry site (HCV IRES),42 Ykok leader,43 TPP44 and SAM45 riboswitches, iron response element (IRE),46 human immunodeficiency virus type 1 dimerization initiation site (HIV DIS),47 and UnaL2 Line 3′ element.48 The first flanking (closing) canonical base pair is included when our comparative and potential counts and statistical potentials are generated.

For hairpin loops of length 4, the values of m and b in Eq. (2) (see Methods) with the best accuracy were 2.25 and 0.8, respectively. For the restricted range of 0 to 2 for −ln(C/P) (see Methods), the statistical potentials of hairpin loops of length 4 will vary from 5.3 to 0.8 kcal/mol, with 5.3 kcal/mol set as the default value. Hairpin loops of different sizes will have different m and b values (see Supplemental Data, Excel file HPComparison). Statistical potentials were generated for 908 hairpin loops plus default values.

The approach used to determine the statistical potentials for hairpin loops is illustrated with a comparison with recent experimentally derived tetraloop free-energy values.49 For the 1536 possible combinations (256 hairpin loops × 6 base pairs), 1225 (80%) had an absolute difference less than 0.5 kcal/mol and 1243 (81%) had an absolute difference less than 1.0 kcal/mol. A total of 191 (12%) combinations had absolute differences between 1.025 and 2.0 kcal/mol, and 102 (7%) combinations had differences between 2.075 and 3.1 kcal/mol (Supplemental Data, see Excel file HPComparison). The 14 tetraloop closing base-pair combinations with the largest absolute difference all had statistical potentials smaller than the experimentally derived values and are thus predicted to be more energetically stable. However, the majority of the combinations (232 out of 311) with absolute difference greater than 0.5 kcal/mol had experimentally derived energetic values smaller (i.e., more stable) than the derived statistical potential.

For triloops, the experimentally derived free-energy values were taken from Thulasi et al.50 Only 1 out of the 384 (0.2%) triloop combinations had an absolute difference of less than 1.0 kcal/mol between the experimentally derived free energies and statistical potentials. Most of the triloops (360 out of 384) (94%) had absolute differences between 1.0 and 2.0 kcal/mol. The absolute difference for the other 23 combinations ranged from 2.028 to 2.61 kcal/mol (Supplemental Data, see Excel file HPComparison). For the pentaloop comparison, the energetic parameters from TURNER04 (refs. 6 and 51) were used. Of the 6144 possible pentaloop combinations, 3354 (55%) had an absolute difference of 0.5 kcal/mol or less and 4674 (76%) had an absolute difference less than 1.0 kcal/mol. A total of 1146 (19%) had an absolute difference between 1.02 and 2.0 kcal/mol, 287 (5%) had an absolute difference between 2.068 and 3.0 kcal/mol, and 36 (0.6%) had an absolute difference between 3.1 and 4.0 kcal/mol. The remaining pentaloop had an absolute difference of 4.408 kcal/mol (Supplemental Data, see Excel file HPComparison). Statistical potentials have been created for hairpin loops for all observed lengths in the molecular classes studied with comparative methods.
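The totals of 384, 1536, and 6144 combinations quoted for triloops, tetraloops, and pentaloops follow from pairing every loop sequence with each of the six canonical closing base pairs. The short Python sketch below enumerates them; the helper name `hairpin_combinations` is an illustrative assumption and not part of RNAfold or rCAD.

```python
from itertools import product

NUCLEOTIDES = "ACGU"
# The six canonical closing base pairs, written as (5' nucleotide, 3' nucleotide).
CANONICAL_PAIRS = [("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")]

def hairpin_combinations(loop_length):
    """Yield (5' closing nt, loop sequence, 3' closing nt) for every possible combination."""
    for five_prime, three_prime in CANONICAL_PAIRS:
        for loop in product(NUCLEOTIDES, repeat=loop_length):
            yield five_prime, "".join(loop), three_prime

counts = {n: sum(1 for _ in hairpin_combinations(n)) for n in (3, 4, 5)}
print(counts)  # {3: 384, 4: 1536, 5: 6144}
```

Each count is 4^n loop sequences times 6 closing pairs, which is why the number of parameters grows quickly with loop length and why default values are needed for combinations that are rarely or never observed.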

Internal loop statistical potentials

Internal loop statistical potentials were created using Eqs. (2) and (4). The same 16 RNA molecular classes used in the generation of the hairpin loop statistical potentials were used for the internal loops. Both base pairs flanking an internal loop are included in the generation of statistical potentials for internal loops. For 1×1 internal loops, the values of m and b in Eq. (2) (see Methods) with the best accuracy were 2.5 and −1.0, respectively. For the restricted range of 0 to 2 for −ln(C/P) (see Methods), the statistical potentials of 1×1 internal loops will vary from 4.0 to −1.0 kcal/mol, with 4.0 kcal/mol set as the default value. Internal loops of different sizes will have different m and b values (see Supplemental Data, Excel file ILComparison). Statistical potentials were generated for 1368 internal loops plus default values.

The approach used to determine the statistical potentials for internal loops is illustrated with 1×1 internal loops. For these internal loops, the absolute differences between the statistical potentials and the TURNER04 (ref. 6) experimentally derived energetic parameters were usually large. There are 360 possible 1×1 internal loops (6 base pairs × 6 base pairs × 10 internal loops). Only 57 out of the 360 (16%) had an absolute difference of less than 1.0 kcal/mol and only 10 (3%) had absolute differences between 1.0 and 2.0 kcal/mol. A total of 130 (36%) had absolute differences between 2.0 and 3.0 kcal/mol, and 111 (30%) had absolute differences between 3.0 and 4.0 kcal/mol. The 30 1×1 internal loops with the largest difference between experimentally derived free energies and statistical potentials all had a G–G internal loop. The values for the experimentally derived free energies and statistical potentials for all 360 1×1 and all 9216 2×2 internal loops are in the Supplemental Data (Excel file ILComparison). Statistical potentials have been created for internal loops for any length observed on the 5′ and 3′ sides of the loop in those molecular classes studied with comparative methods.
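The counts of 360 possible 1×1 and 9216 possible 2×2 internal loops can be reproduced with a few lines of Python. This sketch assumes that the "10 internal loops" of a 1×1 loop are the 16 ordered nucleotide combinations minus the 6 that would themselves form a canonical base pair; that reading is an interpretation of the text, not something the article states explicitly.

```python
from itertools import product

NUCLEOTIDES = "ACGU"
CANONICAL_PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}

# 1x1 loop: one unpaired nucleotide on each strand. Assumption: combinations that
# would form a canonical base pair are excluded, leaving 16 - 6 = 10 mismatches.
one_by_one = [pair for pair in product(NUCLEOTIDES, repeat=2) if pair not in CANONICAL_PAIRS]

# Both flanking base pairs are part of the element, so each contributes a factor of 6.
n_closing = len(CANONICAL_PAIRS) ** 2
n_1x1 = n_closing * len(one_by_one)
# 2x2 loop: two unpaired nucleotides on each strand, i.e. 4**4 sequence combinations.
n_2x2 = n_closing * len(NUCLEOTIDES) ** 4

print(len(one_by_one), n_1x1, n_2x2)  # 10 360 9216
```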

Evaluation of hairpin loop statistical potentials

The prediction of an RNA structure is evaluated with the statistical potentials for hairpin loops. In previous versions of RNAfold, the only hairpin loops with specific free-energy values were triloops and tetraloops. Free-energy values for longer hairpin loops were calculated using the length of the hairpin loop and the composition of the first and last nucleotides of the hairpin loop and the flanking (closing) base pair. To determine if statistical potentials generated with Eqs. (2) and (4) would improve the accuracy of RNA secondary-structure prediction, we modified the program RNAfold36,37 to accept detailed statistical potentials for hairpin loops of any length. When testing the hairpin loop statistical potentials, the experimentally derived energetic parameters (TURNER99) for base-pair stacks and internal and multi-stem loops were used.

Similar to previous studies,21,31 sensitivity has been used to gauge prediction accuracy. Sensitivity is defined as the number of canonical base pairs in the predicted minimal free-energy structure present in the comparative model divided by the total number of comparative canonical base pairs. Differences in prediction accuracy are defined as (sensitivity using statistical potentials) − (sensitivity using other energetic parameters and/or folding programs). If a program returns suboptimal structures, only the optimal structure is used in our analysis.

Results in the Supplemental Data (supplemental.pdf, pages 1-4) reveal that the statistical potentials for hairpin loops improved the prediction of the RNA structure.
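The sensitivity measure and the accuracy difference defined in this section can be sketched in Python as follows. Base pairs are represented here as `(i, j)` position tuples with `i < j`; that representation, the function name, and the toy structures are assumptions made for illustration only.

```python
def sensitivity(predicted_pairs, comparative_pairs):
    """Fraction of comparative canonical base pairs that also occur in the prediction."""
    comparative = set(comparative_pairs)
    if not comparative:
        raise ValueError("the comparative model contains no base pairs")
    return len(set(predicted_pairs) & comparative) / len(comparative)

# Hypothetical example: a 4-bp helix in the comparative model.
comparative = [(1, 20), (2, 19), (3, 18), (4, 17)]
predicted_sp = [(1, 20), (2, 19), (3, 18), (6, 15)]   # folded with statistical potentials
predicted_other = [(1, 20), (2, 19)]                  # folded with another parameter set

# Difference in prediction accuracy, as defined in the text.
difference = sensitivity(predicted_sp, comparative) - sensitivity(predicted_other, comparative)
print(difference)  # 0.25
```

Note that sensitivity does not penalize predicted base pairs that are absent from the comparative model, such as `(6, 15)` above; it only counts how many comparative pairs were recovered.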

Evaluation of internal loop statistical potentials

To utilize the new internal loop statistical potentials, the functionality of RNAfold was again extended to accept a wider range of energetic parameters. The original version of RNAfold had specific free-energy values for internal loops of lengths 1×1, 1×2, 2×2, and 2×3. For larger internal loops, the calculation of the experimentally derived free-energy values was based on the number of nucleotides in the internal loop plus the composition of the ends of the internal loop and both flanking base pairs. The modified RNAfold accepts specific free-energy values for internal loops of any size. When testing the internal loop statistical potentials, the experimentally derived energetic parameters (TURNER99) for base-pair stacks and hairpin and multi-stem loops were used.

Results in the Supplemental Data (supplemental.pdf, pages 1-4) reveal that the statistical potentials for the internal loops improved the prediction of the RNA structure.

Combining statistical potentials and comparison with other programs

The prediction accuracy using the combination of hairpin and internal loop molecule-independent statistical potentials for all 16 RNA molecular classes was compared with the results from four other RNA folding programs: RNAfold36 (TURNER99), RNAstructure33 using just TURNER04 and using TURNER04 plus the newer triloop and tetraloop thermodynamic parameters,49,50 CONTRAfold,34 and MultiFold (BL⁎ parameter set).35 RNAfold and RNAstructure utilize experimentally derived energetic parameters while CONTRAfold and MultiFold use parameters derived with statistical learning. When testing the hairpin and internal loop statistical potentials with RNAfold, the experimentally derived energetic parameters (TURNER99) for base-pair stacks and multi-stem loops were used.

Overall, the combined molecule-independent statistical potentials outperformed the other four programs (Fig. 2a and b). On average, over the 16 RNA molecular classes, our statistical potentials scored 15% higher than RNAfold (TURNER99), 14% higher than RNAstructure (TURNER04), 14% higher than RNAstructure (TURNER04 Plus), 12% higher than CONTRAfold, and 13% higher than MultiFold. Our statistical potentials outperformed all four programs for all 16 RNA molecular classes with the exception of the Ykok leader RNA, where RNAfold (TURNER99) matched our score, and RNase P A, where CONTRAfold scored 3% higher. The difference in accuracy between our statistical potentials and the competing program with the best results for a given molecule ranged from −3% (RNase P A) to 15% (UnaL2 Line 3′ element) (Fig. 2a and b). On average, our statistical potentials outperformed the program with the best results for a given RNA molecule by 7% (Supplemental Data, see Excel file Accuracies.xlsx). Standard deviation results for each program on each molecule are contained in the Supplemental Data (supplemental.pdf, pages 5-6).

Fig. 2. RNA secondary-structure prediction accuracies for four RNA folding programs: RNAfold, RNAstructure (TURNER04 and TURNER04 plus the newer triloop and tetraloop thermodynamic parameters), CONTRAfold, MultiFold, and RNAfold using statistical potentials. Results for 16 RNA molecular classes are divided into (A) bacterial 5S rRNA, eukaryotic 5S rRNA, bacterial 16S rRNA, bacterial 23S rRNA, tRNA, eukaryotic 16S rRNA, RNase P A, and bacterial SRP and (B) U1 spliceosomal RNA, HCV IRES, Ykok leader, TPP and SAM riboswitches, IRE, HIV DIS, and UnaL2 Line 3′ element.


Two methods were used to cross-validate the statistical potentials. The first utilized the same method used for MultiFold.35 The results in the Supplemental Data reveal that the accuracies of the predicted RNA secondary structures are very similar between the training and testing on the full set of sequences and on an 80%/20% split (see Supplemental Data, supplemental.pdf, pages 7-8). The second method tested our statistical potentials and the four other RNA folding programs against nine control RNA molecular classes (see Methods) that were not used in the generation of the statistical potentials. The control molecular classes are RNase P B,39 Hammerhead III ribozyme,52 purine riboswitch,53 hepatitis delta virus (HDV) ribozyme,54 HIV ribosomal frameshift signal,55 GEMM cis-regulatory element,56 R2 RNA element,57 and mitochondrial and archaeal 16S rRNA.38 On average, over these nine RNA molecular classes, our statistical potentials essentially equaled the performance of the four other RNA folding programs (Supplemental Data, see supplemental.pdf, pages 9-14).

Given that our approach utilizes comparative data for generating the statistical potentials, it is not surprising that they perform only on par with the other RNA folding programs over the control RNA molecular classes. The nine RNA molecular classes in our test set must have some structural elements that are absent from the original 16 classes. This indicates that increasing the number of RNA molecular classes used to generate the statistical potentials is necessary before the statistical potentials will have higher accuracies for a larger number of molecular classes. During the course of these studies, we observed improvements in the accuracies for a larger number of molecular classes as the training set included more RNA families.

Fig. 3. a) Nucleotides in the tetraloop hairpin loops that occur in the comparative structure for a modified Escherichia coli 16S rRNA secondary structure between positions 118 and 241 are colored blue. For this figure the E. coli sequence was changed at a few positions to create better examples of potential base pairings that form hairpin loops. Potential tetraloop hairpin loops, defined as four nucleotides that are closed by two or more canonical base pairs, are colored red. The base pairs flanking the tetraloop hairpin loops are circled and connected with a red line. Nucleotides that are base paired in the comparative structure are connected with a thick black line. c) Nucleotides in the internal loop that occur in our modified Escherichia coli comparative secondary structure between positions 139 and 184 are colored blue; b&c) Nucleotides in potential internal loops are colored red and the nucleotides that form a set of base pairs within the potential helix in the internal loop are circled and connected with a red line. Nucleotides that are base paired in the comparative structure are connected with a thick black line.

RNA folding website

RNA sequences can be folded with our modified RNAfold program that contains our new statistical potentials‡. The C# code and the new statistical potentials will also be made available at this website.

‡http://www.rna.ccbb.utexas.edu/SAE/2E/Folding2D/

Summary

The focus of this study was to improve the energetic parameters for hairpin and internal loops. Previously, the base-pair stack statistical potentials created with comparative data, on average, only slightly improved the prediction accuracy, demonstrating that statistical potentials can generate analogous energetic parameters.31 This minor improvement in the accuracy from the base-pair stack statistical potentials was smaller than we anticipated. However, our previous analysis did reveal that statistical potentials for the flanking nucleotides of the hairpin and internal loops produced a more pronounced improvement, suggesting that a richer set of statistical potentials for the loop regions of the secondary structure could further enhance the prediction accuracy.

The new comparative analysis system in development in the Gutell laboratory, rCAD (Ozer, Doshi, Xu and Gutell, in press), was used to determine this collection of statistical potentials that represents more of the structural elements present in RNA molecules. This new set of energetic parameters used a new structural statistic, the C/P ratio. The RNAfold program was modified to utilize our larger set of statistical potentials since it originally had more limited hairpin and internal loop energetic parameters.

This modified RNAfold program and our new hairpin and internal loop statistical potentials demonstrated significant increases in the prediction accuracy of RNA secondary structure. Over 16 RNA molecular classes, the statistical potentials outperformed the four existing RNA folding programs in every case except two RNA molecular classes, where our accuracies were equal to or slightly worse than those of one other program. On average, the improvements ranged from 12% to 15% compared to the competing four programs. Our program predicted the RNA secondary structure more accurately in 78 of the 80 comparisons. When our program was not included in these comparisons, RNAfold (TURNER99) and RNAstructure (TURNER04 Plus) outperformed the other programs in 19 out of 64 comparisons; RNAstructure (TURNER04), MultiFold and CONTRAfold outperformed the other programs in 20 out of 64 comparisons, 39 out of 64 comparisons and 45 out of 64 comparisons, respectively. Our statistical potentials also performed approximately the same as the other four programs when tested over the nine additional control RNA molecular classes that were not used in the generation of the statistical potentials.

Our intention with this work was to determine if this generalized approach would improve the prediction of RNA secondary structure beyond current approaches. Given that this approach did significantly increase prediction accuracy in the 16 training RNA molecular classes, we will extend and improve upon it in several ways in the future.

We will add more RNA molecular classes when generating the statistical potentials. We will also aim to identify the most essential structural elements and components that will produce the highest accuracy of the predicted RNA structure. This should help identify general structural families and reduce the number of needed energetic parameters. We will also investigate extending the statistical potentials and folding program to utilize non-nearest-neighbor effects.

§http://www.rna.ccbb.utexas.edu/DAT/3C/Structure/index.php

Methods

Comparative and potential secondary structural elements

A potential secondary structural element, such as a hairpin loop, an internal loop, or a helix, is defined as the set of nucleotides that forms the motif. This potential structural element may or may not occur in the comparative secondary structure of the RNA molecule, while every comparative structural element is a potential structural element. Our objective is to generate a statistical potential from the ratio of comparative and potential structural elements.

Potential hairpin loops are a set of consecutive nucleotides of a specific length that are flanked by two or more canonical base pairs in the RNA sequence (Fig. 3). The determination of a potential internal loop initiates with a comparative helix. The nucleotides flanking the 5′ and 3′ ends of this helix that contain at least two potential canonical base pairs are identified (Fig. 3). The nucleotides between the comparative and the potential helices are defined as a potential internal loop.
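The definition of a potential hairpin loop given above can be sketched as a simple sequence scan. The Python below is a minimal illustration, not the rCAD implementation; the function name, the 0-based indexing, and the example sequence are assumptions. Because any stem of three or more base pairs contains a stem of two, checking for the two innermost canonical pairs covers the "two or more" condition.

```python
# Canonical base pairs written as 5' nucleotide followed by 3' nucleotide.
CANONICAL = {"AU", "UA", "GC", "CG", "GU", "UG"}

def potential_hairpins(seq, loop_length, min_pairs=2):
    """Return (start, loop) for every run of `loop_length` nucleotides that is
    flanked by at least `min_pairs` consecutive canonical base pairs.

    `start` is the 0-based index of the first loop nucleotide.
    """
    hits = []
    for start in range(min_pairs, len(seq) - loop_length - min_pairs + 1):
        end = start + loop_length  # index of the first nucleotide after the loop
        # The k-th flanking pair joins the nucleotide k+1 positions 5' of the loop
        # with the nucleotide k positions 3' of the loop's end.
        if all(seq[start - 1 - k] + seq[end + k] in CANONICAL for k in range(min_pairs)):
            hits.append((start, seq[start:end]))
    return hits

print(potential_hairpins("GCGAAAGC", 4))  # [(2, 'GAAA')]
```

Counting these hits across all sequences of a molecular class would give the potential count P for each loop, while the subset that coincides with the comparative structure model gives the comparative count C.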

Creation of statistical potentials

A basic assumption in the creation of the statistical potentials is:

$$-\ln(C/P) \sim \text{Free energy} \qquad (1)$$

where C is the frequency of a structural element appearing in the comparative structure and P is the potential frequency of the structural element. Every comparative structure is considered to be a potential structure as well; C/P will have values in the range between 0 and 1. A typical statistical potential utilizes −ln(C) with C normalized with the frequency of individual nucleotides. The formula proposed here can be considered as normalized by the potential to form a structure element. A statistical potential is determined with the equation:

$$-m \ln(C/P) + b = SP \qquad (2)$$

where SP is a statistical potential and m and b are global parameters that will be selected to optimize the overall accuracy of the folding program. For the vast majority of structural elements, the comparative count will be 0 or the C/P ratio too low and the default value will be used. Restricting the range of values for −ln(C/P) between 0 and 2 provides the best prediction accuracies; this restricts C/P values to a minimum of 0.01. If a structural element has no potential structures or the C/P value is less than 0.01, the C/P value is set to 0.01. The default value for a structural element is set to:

$$-m \times 2 + b = \text{default} \qquad (3)$$
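The calculation in Eq. (2), together with the floor on C/P, can be sketched as a short function. The values of m and b used below are placeholders, since the optimized global parameters are not given in this section. The sketch also assumes a base-10 logarithm, because that is the base for which the stated cap of 2 on the log term coincides with the stated minimum C/P of 0.01; passing `math.log` instead gives the natural-log form as written in Eq. (2).

```python
import math

RATIO_FLOOR = 0.01  # minimum C/P stated in the text


def statistical_potential(c, p, m, b, log=math.log10):
    """Return -m * log(C/P) + b with C/P floored at RATIO_FLOOR.

    c: comparative count, p: potential count (c <= p).
    m, b: global fitting parameters (values are not given in this section).
    log: logarithm to apply; base 10 is an assumption of this sketch.
    """
    if c < 0 or p < 0 or c > p:
        raise ValueError("counts must satisfy 0 <= c <= p")
    # No potential structures, or a ratio below the floor, map to the floor.
    ratio = RATIO_FLOOR if p == 0 else max(c / p, RATIO_FLOOR)
    return -m * log(ratio) + b


if __name__ == "__main__":
    m, b = 1.0, 0.0  # placeholder values, for illustration only
    for c, p in [(90, 100), (5, 100), (0, 100), (0, 0)]:
        print(c, p, round(statistical_potential(c, p, m, b), 3))
```

Elements that are never observed in the comparative structures, or that have no potential structures at all, all receive the same value at the floor, which is how the sketch reflects the shared default described above.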

Molecule-independent statistical potential

Initially, a set of statistical potentials will be generated for each type of RNA molecular class analyzed (e.g., 16S rRNA, bacteria). The statistical potentials for each molecule-specific set will not have detailed values for all possible structural elements. Our ultimate goal is to create one set of statistical potentials that is applicable to all types of RNAs. To create a molecule-independent set of statistical potentials, we treated each molecule-dependent set as a member of a Boltzmann distribution. For every secondary structural element, the molecule-independent statistical potential is a Boltzmann-weighted sum of statistical potentials from each molecule i:

$$SP_{\text{molecule-ind}} = \frac{\sum_{i \in I} \exp(-SP_i / k_b T)\, SP_i}{\sum_{i \in I} \exp(-SP_i / k_b T)} \qquad (4)$$
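The Boltzmann-weighted sum in Eq. (4) can be sketched as follows for one structural element. The default kT of 0.616 (the value of k_bT in kcal/mol at 37 °C) is an assumption, since the temperature and energy units are not stated in this section, and the function name is hypothetical. The sketch simply averages over whichever molecule-specific values are supplied; how missing values were handled is not described here.

```python
import math


def boltzmann_weighted_sp(sps, kT=0.616):
    """Combine molecule-specific potentials SP_i for one element (Eq. 4).

    sps: iterable of SP_i values, one per molecular class that has a value.
    kT:  thermal energy in the same units as SP_i (assumed value).
    """
    sps = list(sps)
    if not sps:
        raise ValueError("at least one molecule-specific potential is needed")
    # Shifting by the minimum leaves the ratio in Eq. (4) unchanged
    # and keeps the exponentials numerically well behaved.
    lowest = min(sps)
    weights = [math.exp(-(sp - lowest) / kT) for sp in sps]
    total = sum(weights)
    return sum(w * sp for w, sp in zip(weights, sps)) / total


if __name__ == "__main__":
    # Two hypothetical molecule-specific values for the same element.
    print(round(boltzmann_weighted_sp([0.5, 2.0]), 3))
```

Because the weights decay exponentially with SP_i, the combined value is pulled toward the lowest (most favorable) molecule-specific potential rather than toward the arithmetic mean.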

CRW site

The Gutell laboratory's CRW site§38 has a diverse collection of secondary-structure models predicted from comparative analysis for different phylogenetic groups of the 5S, 16S, and 23S rRNAs; tRNAs for different amino acids; and group I and II introns. The number of secondary-structure diagrams currently available is 1092, while the number of sequences with only base-pair information is 54,525. The accuracy of these secondary-structure models is extremely high; approximately 97% of the base pairs in the ribosomal RNA structures predicted with comparative methods are present in the high-resolution crystal structure.58

RNA Comparative Analysis Database

All sequence and comparative structure information is stored in rCAD. At the time the manuscript was submitted, rCAD contained 293,039 aligned RNA sequences and their comparative structure information. These data are utilized to determine the number of structural elements in the comparative structures. rCAD also contains structural statistics (comparative and potential counts) on nearly 500,000 different internal loops and almost 2.3 million different hairpin loops.

RNA molecular classes

The RNA molecule sequences and structures initially studied for their comparative and potential counts of structural elements and used in the generation of the statistical potentials were aligned and created by the Gutell laboratory∥. They include sequences from the bacterial and eukaryotic phylogenetic groups and from 5S, 16S, and 23S rRNA and tRNA.

∥Available at http://www.rna.ccbb.utexas.edu/DAT/3C

Additional RNA sequences and structures were obtained from the RFam website.59 These included bacterial RNase P class A, bacterial SRP, U1 spliceosomal RNA, HCV IRES, Ykok leader, TPP and SAM riboswitches, IRE, HIV DIS, and UnaL2 Line 3′ element. All of these sequences and structures were taken from their respective RFam full alignments.

For the training and initial testing of the statistical potentials, sequences with a similarity of greater than 97% were removed to minimize the folding of duplicate RNA sequences. Also, only complete or nearly complete sequences were analyzed. The total number of RNA sequences analyzed for testing RNA secondary-structure accuracy for each molecular class is as follows: 1094 bacterial and 258 eukaryotic 16S rRNA, 65 bacterial 23S rRNA, 230 bacterial and 310 eukaryotic 5S rRNA, 2112 tRNA, 274 RNase P class A, 937 U1 spliceosomal RNA, 1049 bacterial SRP, 550 HCV IRES, 188 Ykok leader, 726 TPP and 589 SAM riboswitches, 371 IRE, 136 HIV DIS, and 572 UnaL2 Line 3′ element. The number of sequences and their average length are available in the Supplemental Data (see supplemental.pdf).

For the additional testing of control RNA molecules, seven sets of RNA sequences and structures were obtained from the RFam website. These are the RNase P B, Hammerhead III ribozyme, purine riboswitch, HDV ribozyme, HIV ribosomal frameshift signal, GEMM cis-regulatory element, and R2 RNA element. All of these sequences are taken from their respective RFam seed alignment. Two sets of RNA sequences and structures are from the Gutell laboratory: mitochondrial and archaeal 16S rRNA.

The total number of RNA sequences for each of the nine classes is as follows: 366 RNase P B, 84 Hammerhead III ribozymes, 133 purine riboswitches, 33 HDV ribozymes, 145 HIV ribosomal frameshift signal, 162 GEMM cis-regulatory element, and 15 R2 RNA element. There were 128 and 143 RNA sequences tested for mitochondrial and archaeal 16S rRNA, respectively. The number of sequences and their average length are available in the Supplemental Data (see supplemental.pdf).
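The redundancy filter applied to the training sets (removal of sequences with a similarity of greater than 97%) could be sketched as a greedy pass over an alignment. The text does not say how similarity was computed or in what order sequences were compared, so the identity measure (matches over columns where neither sequence has a gap), the gap character, and the first-come ordering below are all assumptions of this example.

```python
def pairwise_identity(a, b, gap="-"):
    """Fraction of identical positions over columns ungapped in both rows."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for x, y in zip(a, b):
        if x == gap or y == gap:
            continue  # skip columns with a gap in either sequence
        compared += 1
        if x == y:
            matches += 1
    return matches / compared if compared else 0.0


def filter_redundant(aligned, threshold=0.97):
    """Keep a sequence only if it is at most `threshold` identical
    to every sequence already kept (greedy, in input order)."""
    kept = []
    for seq in aligned:
        if all(pairwise_identity(seq, k) <= threshold for k in kept):
            kept.append(seq)
    return kept


if __name__ == "__main__":
    rows = [
        "ACGU" * 25,
        "ACGU" * 24 + "ACGA",  # 99% identical to the first row
        "UGCA" * 25,
    ]
    print(len(filter_redundant(rows)))
```

A greedy pass depends on input order, so a different ordering may retain a different representative from each cluster of near-identical sequences.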

Acknowledgements

This article is dedicated to Dr. Carl Woese for his intuition that comparative analysis could reveal “energetic measurements too subtle for physical chemical measurements to determine” and to our erstwhile colleague Dr. Jim Gray whose pioneering work on transaction control enables database systems to be the foundation for Jim's vision of the “Fourth Paradigm”, following experimental, theoretical, and computer science. Jim appreciated that the overwhelming amount of multiple dimensions of information was not strictly a computer science problem, but instead a collaborative effort between computer scientists and (in this case) molecular biologists. The authors are also most grateful to Yuxing Li, Jamie Cannone, Ame Wongsa, and Yanan Jiang for help establishing the RNA folding website. Grants from the Robert A. Welch Foundation [grant numbers F-1691 (P.R.) and F-1427 (R.G.)], National Institutes of Health [grant numbers R01 GM0796686 (P.R.), R01 GM067317 (R.G.), and GM085337 (R.G.)], and Microsoft Research TCI/ER (R.G.) were essential for this project to come to fruition. The authors appreciated the constructive comments from the reviewers and the editor.

Supplementary Data

Supplementary data to this article can be found online at doi:10.1016/j.jmb.2011.08.033

References

1. Woese, C. R., Gutell, R., Gupta, R. & Noller, H. F. (1983). Detailed analysis of the higher-order structure of 16S-like ribosomal ribonucleic acids. Microbiol. Rev. 47, 621–669.

2. Gutell, R. R., Weiser, B., Woese, C. R. & Noller, H. F. (1985). Comparative anatomy of 16-S-like ribosomal RNA. Prog. Nucleic Acid Res. Mol. Biol. 32, 155–216.

3. Gutell, R. R., Cannone, J. J., Shang, Z., Du, Y. & Serra, M. J. (2000). A story: unpaired adenosine bases in ribosomal RNAs. J. Mol. Biol. 304, 335–354.



4. Freier, S. M., Kierzek, R., Jaeger, J. A., Sugimoto, N., Caruthers, M. H., Neilson, T. & Turner, D. H. (1986). Improved free-energy parameters for predictions of RNA duplex stability. Proc. Natl Acad. Sci. USA, 83, 9373–9377.

5. Mathews, D. H., Sabina, J., Zuker, M. & Turner, D. H. (1999). Expanded sequence dependence of thermodynamic parameters improves prediction of RNA secondary structure. J. Mol. Biol. 288, 911–940.

6. Turner, D. H. & Mathews, D. H. (2010). NNDB: the nearest neighbor parameter database for predicting stability of nucleic acid secondary structure. Nucleic Acids Res. 38, D280–D282.

7. Xia, T., SantaLucia, J., Jr, Burkard, M. E., Kierzek, R., Schroeder, S. J., Jiao, X. et al. (1998). Thermodynamic parameters for an expanded nearest-neighbor model for formation of RNA duplexes with Watson–Crick base pairs. Biochemistry, 37, 14719–14735.

8. Liu, J. D., Zhao, L. & Xia, T. (2008). The dynamic structural basis of differential enhancement of conformational stability by 5′- and 3′-dangling ends in RNA. Biochemistry, 47, 5962–5975.

9. Antao, V. P. & Tinoco, I., Jr (1992). Thermodynamic parameters for loop formation in RNA and DNA hairpin tetraloops. Nucleic Acids Res. 20, 819–824.

10. Schroeder, S. J., Burkard, M. E. & Turner, D. H. (1999). The energetics of small internal loops in RNA. Biopolymers, 52, 157–167.

11. Walter, A. E., Wu, M. & Turner, D. H. (1994). The stability and structure of tandem GA mismatches in RNA depend on closing base pairs. Biochemistry, 33, 11349–11354.

12. Diamond, J. M., Turner, D. H. & Mathews, D. H. (2001). Thermodynamics of three-way multibranch loops in RNA. Biochemistry, 40, 6971–6981.

13. Walter, A. E. & Turner, D. H. (1994). Sequence dependence of stability for coaxial stacking of RNA helixes with Watson–Crick base paired interfaces. Biochemistry, 33, 12715–12719.

14. Shankar, N., Kennedy, S. D., Chen, G., Krugh, T. R. & Turner, D. H. (2006). The NMR structure of an internal loop from 23S ribosomal RNA differs from its structure in crystals of 50S ribosomal subunits. Biochemistry, 45, 11776–11789.

15. Zuker, M. (1989). On finding all suboptimal foldings of an RNA molecule. Science, 244, 48–52.

16. Jaeger, J. A., Turner, D. H. & Zuker, M. (1989). Improved predictions of secondary structures for RNA. Proc. Natl Acad. Sci. USA, 86, 7706–7710.

17. Woese, C. R., Winker, S. & Gutell, R. R. (1990). Architecture of ribosomal RNA: constraints on the sequence of “tetra-loops”. Proc. Natl Acad. Sci. USA, 87, 8467–8471.

18. Michel, F. & Westhof, E. (1990). Modelling of the three-dimensional architecture of group I catalytic introns based on comparative sequence analysis. J. Mol. Biol. 216, 585–610.

19. Tuerk, C., Gauss, P., Thermes, C., Groebe, D. R., Gayle, M., Guild, N. et al. (1988). CUUCGG hairpins: extraordinarily stable RNA secondary structures associated with various biochemical processes. Proc. Natl Acad. Sci. USA, 85, 1364–1368.

20. Antao, V. P., Lai, S. Y. & Tinoco, I., Jr (1991). A thermodynamic study of unusually stable RNA and DNA hairpins. Nucleic Acids Res. 19, 5901–5905.

21. Konings, D. A. & Gutell, R. R. (1995). A comparison of thermodynamic foldings with comparatively derived structures of 16S and 16S-like rRNAs. RNA, 1, 559–574.

22. Doshi, K. J., Cannone, J. J., Cobaugh, C. W. & Gutell, R. R. (2004). Evaluation of the suitability of free-energy minimization using nearest-neighbor energy parameters for RNA secondary structure prediction. BMC Bioinformatics, 5, 105.

23. Tanaka, S. & Scheraga, H. A. (1976). Medium- and long-range interaction parameters between amino acids for predicting three-dimensional structures of proteins. Macromolecules, 9, 945–950.

24. Moult, J. (2005). A decade of CASP: progress, bottlenecks and prognosis in protein structure prediction. Curr. Opin. Struct. Biol. 15, 285–289.

25. Floudas, C. A., Fung, H. K., McAllister, S. R., Monnigmann, M. & Rajgaria, R. (2006). Advances in protein structure prediction and de novo protein design: a review. Chem. Eng. Sci. 61, 966–988.

26. Kryshtafovych, A., Venclovas, C., Fidelis, K. & Moult, J. (2005). Progress over the first decade of CASP experiments. Proteins, 61, 225–236.

27. Shen, M. Y. & Sali, A. (2006). Statistical potential for assessment and prediction of protein structures. Protein Sci. 15, 2507–2524.

28. Summa, C. M. & Levitt, M. (2007). Near-native structure refinement using in vacuo energy minimization. Proc. Natl Acad. Sci. USA, 104, 3177–3182.

29. Xu, B. S., Yang, Y. D., Liang, H. J. & Zhou, Y. Q. (2009). An all-atom knowledge-based energy function for protein–DNA threading, docking decoy discrimination, and prediction of transcription-factor binding profiles. Proteins: Struct. Funct. Bioinform. 76, 718–730.

30. Dima, R. I., Hyeon, C. & Thirumalai, D. (2005). Extracting stacking interaction parameters for RNA from the data set of native structures. J. Mol. Biol. 347, 53–69.

31. Wu, J. C., Gardner, D. P., Ozer, S., Gutell, R. R. & Ren, P. (2009). Correlation of RNA secondary structure statistics with thermodynamic stability and applications to folding. J. Mol. Biol. 391, 769–783.

32. Dowell, R. D. & Eddy, S. R. (2004). Evaluation of several lightweight stochastic context-free grammars for RNA secondary structure prediction. BMC Bioinformatics, 5, 71.

33. Reuter, J. S. & Mathews, D. H. (2010). RNAstructure: software for RNA secondary structure prediction and analysis. BMC Bioinformatics, 11, 129.

34. Do, C. B., Woods, D. A. & Batzoglou, S. (2006). CONTRAfold: RNA secondary structure prediction without physics-based models. Bioinformatics, 22, e90–e98.

35. Andronescu, M., Condon, A., Hoos, H. H., Mathews, D. H. & Murphy, K. P. (2010). Computational approaches for RNA energy parameter estimation. RNA, 16, 2304–2318.

36. Hofacker, I. L. (2003). Vienna RNA secondary structure server. Nucleic Acids Res. 31, 3429–3431.

37. Hofacker, I. L., Fontana, W., Stadler, P. F., Bonhoeffer, L. S., Tacker, M. & Schuster, P. (1994). Fast folding and comparison of RNA secondary structures. Monatsh. Chem. 125, 167–188.


38. Cannone, J. J., Subramanian, S., Schnare, M. N., Collett, J. R., D'Souza, L. M., Du, Y. et al. (2002). The comparative RNA web (CRW) site: an online database of comparative sequence and structure information for ribosomal, intron, and other RNAs. BMC Bioinformatics, 3, 2.

39. Brown, J. W. (1999). The Ribonuclease P Database. Nucleic Acids Res. 27, 314.

40. Rosenblad, M. A., Gorodkin, J., Knudsen, B., Zwieb, C. & Samuelsson, T. (2003). SRPDB: Signal Recognition Particle Database. Nucleic Acids Res. 31, 363–364.

41. Kretzner, L., Krol, A. & Rosbash, M. (1990). Saccharomyces cerevisiae U1 small nuclear RNA secondary structure contains both universal and yeast-specific domains. Proc. Natl Acad. Sci. USA, 87, 851–855.

42. Gallego, J. & Varani, G. (2002). The hepatitis C virus internal ribosome-entry site: a new target for antiviral research. Biochem. Soc. Trans. 30, 140–145.

43. Barrick, J. E., Corbino, K. A., Winkler, W. C., Nahvi, A., Mandal, M., Collins, J. et al. (2004). New RNA motifs suggest an expanded scope for riboswitches in bacterial genetic control. Proc. Natl Acad. Sci. USA, 101, 6421–6426.

44. Miranda-Rios, J., Navarro, M. & Soberon, M. (2001). A conserved RNA structure (thi box) is involved in regulation of thiamin biosynthetic gene expression in bacteria. Proc. Natl Acad. Sci. USA, 98, 9736–9741.

45. Grundy, F. J. & Henkin, T. M. (1998). The S box regulon: a new global transcription termination control system for methionine and cysteine biosynthesis genes in Gram-positive bacteria. Mol. Microbiol. 30, 737–749.

46. Hentze, M. W. & Kuhn, L. C. (1996). Molecular control of vertebrate iron metabolism: mRNA-based regulatory circuits operated by iron, nitric oxide, and oxidative stress. Proc. Natl Acad. Sci. USA, 93, 8175–8182.

47. McBride, M. S. & Panganiban, A. T. (1996). The human immunodeficiency virus type 1 encapsidation site is a multipartite RNA element composed of functional hairpin structures. J. Virol. 70, 2963–2973.

48. Baba, S., Kajikawa, M., Okada, N. & Kawai, G. (2004). Solution structure of an RNA stem–loop derived from the 3′ conserved region of eel LINE UnaL2. RNA, 10, 1380–1387.

49. Sheehy, J. P., Davis, A. R. & Znosko, B. M. (2010). Thermodynamic characterization of naturally occurring RNA tetraloops. RNA, 16, 417–429.

50. Thulasi, P., Pandya, L. K. & Znosko, B. M. (2010). Thermodynamic characterization of RNA triloops. Biochemistry, 49, 9058–9062.

51. Mathews, D. H., Disney, M. D., Childs, J. L., Schroeder, S. J., Zuker, M. & Turner, D. H. (2004). Incorporating chemical modification constraints into a dynamic programming algorithm for prediction of RNA secondary structure. Proc. Natl Acad. Sci. USA, 101, 7287–7292.

52. Murray, J. B., Terwey, D. P., Maloney, L., Karpeisky, A., Usman, N., Beigelman, L. & Scott, W. G. (1998). The structural basis of hammerhead ribozyme self-cleavage. Cell, 92, 665–673.

53. Mandal, M., Boese, B., Barrick, J. E., Winkler, W. C. & Breaker, R. R. (2003). Riboswitches control fundamental biochemical pathways in Bacillus subtilis and other bacteria. Cell, 113, 577–586.

54. Chen, P. J., Kalpana, G., Goldberg, J., Mason, W., Werner, B., Gerin, J. & Taylor, J. (1986). Structure and replication of the genome of the hepatitis delta-virus. Proc. Natl Acad. Sci. USA, 83, 8774–8778.

55. Biswas, P., Jiang, X., Pacchia, A. L., Dougherty, J. P. & Peltz, S. W. (2004). The human immunodeficiency virus type 1 ribosomal frameshifting site is an invariant sequence determinant and an important target for antiviral therapy. J. Virol. 78, 2082–2087.

56. Sudarsan, N., Lee, E. R., Weinberg, Z., Moy, R. H., Kim, J. N., Link, K. H. & Breaker, R. R. (2008). Riboswitches in eubacteria sense the second messenger cyclic di-GMP. Science, 321, 411–413.

57. Ruschak, A. M., Mathews, D. H., Bibillo, A., Spinelli, S. L., Childs, J. L., Eickbush, T. H. & Turner, D. H. (2004). Secondary structure models of the 3′ untranslated regions of diverse R2 RNAs. RNA, 10, 978–987.

58. Gutell, R. R., Lee, J. C. & Cannone, J. J. (2002). The accuracy of ribosomal RNA comparative structure models. Curr. Opin. Struct. Biol. 12, 301–310.

59. Gardner, P. P., Daub, J., Tate, J. G., Nawrocki, E. P., Kolbe, D. L., Lindgreen, S. et al. (2009). Rfam: updates to the RNA families database. Nucleic Acids Res. 37, D136–D140.
