doi:10.1093/nar/gki267 modular rna architecture revealed ... · modular rna architecture revealed...

15
Modular RNA architecture revealed by computational analysis of existing pseudoknots and ribosomal RNAs Samuela Pasquali, Hin Hark Gan 1 and Tamar Schlick 1,2, * Department of Physics, 1 Department of Chemistry and 2 Courant Institute of Mathematical Sciences, New York University, 251 Mercer Street, New York, NY 10021, USA Received December 1, 2004; Revised and Accepted February 5, 2005 ABSTRACT Modular architecture is a hallmark of RNA structures, implying structural, and possibly functional, similar- ity among existing RNAs. To systematically delineate the existence of smaller topologies within larger structures, we develop and apply an efficient RNA sec- ondary structure comparison algorithm using a newly developed two-dimensional RNA graphical repres- entation. Our survey of similarity among 14 pseudo- knots and subtopologies within ribosomal RNAs (rRNAs) uncovers eight pairs of structurally related pseudoknots with non-random sequence matches and reveals modular units in rRNAs. Significantly, three structurally related pseudoknot pairs have functional similarities not previously known: one pair involves the 3 0 end of brome mosaic virus genomic RNA (PKB134) and the alternative hammerhead ribozyme pseudoknot (PKB173), both of which are replicase templates for viral RNA replication; the second pair involves structural elements for transla- tion initiation and ribosome recruitment found in the viral internal ribosome entry site (PKB223) and the V4 domain of 18S rRNA (PKB205); the third pair involves 18S rRNA (PKB205) and viral tRNA-like pseudoknot (PKB134), which probably recruits ribo- somes via structural mimicry and base comple- mentarity. Additionally, we quantify the modularity of 16S and 23S rRNAs by showing that RNA motifs can be constructed from at least 210 building blocks. Interestingly, we find that the 5S rRNA and two tree modules within 16S and 23S rRNAs have similar topologies and tertiary shapes. These modules can be applied to design novel RNA motifs via build-up-like procedures for constructing sequences and folds. INTRODUCTION RNA secondary and tertiary structures are composed of modular substructures that often fold independently and in a hierarchical manner (1). This modularity of RNA secondary structure has been exploited in the design of novel functional molecules (2–6). Modular architecture also implies similarity of substructural motifs among existing RNAs, suggesting pos- sible functional relationships. An interesting example noted by Mitchell et al. (7) is the occurrence of the box H/ACA motif of snoRNA in telomerase RNA, indicating a functional relation- ship between these RNAs; human telomerase RNA (hTR) H/ACA domain is essential for hTR accumulation, hTR 3 0 end processing and telomerase activity. Indeed, similar func- tions are observed when the H/ACA snoRNA motif is sub- stituted for the hTR H/ACA domain in human telomerase. As the repertoire of RNA structures increases through the discov- ery of natural (8,9) and synthetic or designed (3,10,11) RNAs, and as our interest in RNA structure and functions intensifies (12–14), automatic computer approaches are needed to detect structural similarity of RNAs within RNAs, as commonly done for establishing relatedness in protein families (15,16). Most RNA structure comparison algorithms are designed for secondary structures because not many three-dimensional (3D) RNA structures are available for analysis (17–20). An exception is the recent PRIMOS program for comparing and identifying novel motifs in tertiary structures (21). Comparing RNAs based on secondary motifs/submotifs can yield insights about their relationships and topological properties because secondary structures belonging to the same functional group are generally conserved. Current secondary-structure comparison algorithms have focused exclusively on tree structures owing to their relative simplicity for quantitative analysis. Tree structures refer to mathematical constructs for RNA secondary structures without pseudoknots (2,19,20). Various types of graphical tree repres- entations have been used to develop RNA structure comparison and clustering algorithms, including ordered, labeled and *To whom correspondence should be addressed: Tel: +1 212 998 3116; Fax: +1 212 995 4152; Email: [email protected] ª The Author 2005. Published by Oxford University Press. All rights reserved. The online version of this article has been published under an open access model. Users are entitled to use, reproduce, disseminate, or display the open access version of this article for non-commercial purposes provided that: the original authorship is properly and fully attributed; the Journal and Oxford University Press are attributed as the original place of publication with the correct citation details given; if an article is subsequently reproduced or disseminated not in its entirety but only in part or as a derivative work this must be clearly indicated. For commercial re-use, please contact [email protected] 1384–1398 Nucleic Acids Research, 2005, Vol. 33, No. 4 doi:10.1093/nar/gki267

Upload: dinhanh

Post on 10-Jun-2018

219 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: doi:10.1093/nar/gki267 Modular RNA architecture revealed ... · Modular RNA architecture revealed by computational analysis of existing pseudoknots and ... Samuela Pasquali, ... RNAs

Modular RNA architecture revealed by computationalanalysis of existing pseudoknots andribosomal RNAsSamuela Pasquali Hin Hark Gan1 and Tamar Schlick12

Department of Physics 1Department of Chemistry and 2Courant Institute of Mathematical SciencesNew York University 251 Mercer Street New York NY 10021 USA

Received December 1 2004 Revised and Accepted February 5 2005

ABSTRACT

Modular architecture is a hallmark of RNA structuresimplying structural and possibly functional similar-ity among existing RNAs To systematically delineatethe existence of smaller topologies within largerstructures we develop and apply an efficient RNA sec-ondary structure comparison algorithm using a newlydeveloped two-dimensional RNA graphical repres-entation Our survey of similarity among 14 pseudo-knots and subtopologies within ribosomal RNAs(rRNAs) uncovers eight pairs of structurally relatedpseudoknots with non-random sequence matchesand reveals modular units in rRNAs Significantlythree structurally related pseudoknot pairs havefunctional similarities not previously known one pairinvolves the 30 end of brome mosaic virus genomicRNA (PKB134) and the alternative hammerheadribozyme pseudoknot (PKB173) both of which arereplicase templates for viral RNA replication thesecond pair involves structural elements for transla-tion initiation and ribosome recruitment found in theviral internal ribosome entry site (PKB223) and theV4 domain of 18S rRNA (PKB205) the third pairinvolves 18S rRNA (PKB205) and viral tRNA-likepseudoknot (PKB134) which probably recruits ribo-somes via structural mimicry and base comple-mentarity Additionally we quantify the modularityof 16S and 23S rRNAs by showing that RNA motifscan be constructed from at least 210 building blocksInterestingly we find that the 5S rRNA and two treemodules within 16S and 23S rRNAs have similartopologies and tertiary shapes These modules can beapplied to design novel RNA motifs via build-up-likeprocedures for constructing sequences and folds

INTRODUCTION

RNA secondary and tertiary structures are composed ofmodular substructures that often fold independently and ina hierarchical manner (1) This modularity of RNA secondarystructure has been exploited in the design of novel functionalmolecules (2ndash6) Modular architecture also implies similarityof substructural motifs among existing RNAs suggesting pos-sible functional relationships An interesting example noted byMitchell et al (7) is the occurrence of the box HACA motif ofsnoRNA in telomerase RNA indicating a functional relation-ship between these RNAs human telomerase RNA (hTR)HACA domain is essential for hTR accumulation hTR30 end processing and telomerase activity Indeed similar func-tions are observed when the HACA snoRNA motif is sub-stituted for the hTR HACA domain in human telomerase Asthe repertoire of RNA structures increases through the discov-ery of natural (89) and synthetic or designed (31011) RNAsand as our interest in RNA structure and functions intensifies(12ndash14) automatic computer approaches are needed to detectstructural similarity of RNAs within RNAs as commonlydone for establishing relatedness in protein families (1516)

Most RNA structure comparison algorithms are designedfor secondary structures because not many three-dimensional(3D) RNA structures are available for analysis (17ndash20) Anexception is the recent PRIMOS program for comparing andidentifying novel motifs in tertiary structures (21) ComparingRNAs based on secondary motifssubmotifs can yield insightsabout their relationships and topological properties becausesecondary structures belonging to the same functional groupare generally conserved

Current secondary-structure comparison algorithms havefocused exclusively on tree structures owing to their relativesimplicity for quantitative analysis Tree structures refer tomathematical constructs for RNA secondary structures withoutpseudoknots (21920) Various types of graphical tree repres-entations have been used to develop RNA structure comparisonand clustering algorithms including ordered labeled and

To whom correspondence should be addressed Tel +1 212 998 3116 Fax +1 212 995 4152 Email schlicknyuedu

ordf The Author 2005 Published by Oxford University Press All rights reserved

The online version of this article has been published under an open access model Users are entitled to use reproduce disseminate or display the open accessversion of this article for non-commercial purposes provided that the original authorship is properly and fully attributed the Journal and Oxford University Pressare attributed as the original place of publication with the correct citation details given if an article is subsequently reproduced or disseminated not in its entirety butonly in part or as a derivative work this must be clearly indicated For commercial re-use please contact journalspermissionsoupjournalsorg

1384ndash1398 Nucleic Acids Research 2005 Vol 33 No 4doi101093nargki267

unlabeled trees (17ndash2022ndash24) The use of tree representationshas made possible clustering of related tree structures (1819)detection of point mutations with specific structural effects (18)search of recurring subtrees to deduce consensus structuralmotifs (20) and analysis of secondary structure statistics oflarge ensembles of random RNA sequences (24)

RNA pseudoknots which cannot be represented as treegraphs are common in nature represented by many catalyticRNAs and small subunit (SSU) rRNA and their dominance inthe RNA repertoire universe may increase with RNA size (6)see PseudoBase for a catalogue of known pseudoknots (25)In fact our recent theoretical analysis of the abundance ofRNA types (trees pseudoknots and bridge topologies) predictsthat the collection of possible motifs is dominated bypseudoknots (6)

In this study we develop a structure-based RNA compar-ison algorithm using our 2D RNA dual graphical representa-tion for both trees and pseudoknots (22627) coupled with agraph similarity (isomorphism) search method and sequencealignment analysis Because of available work in the fieldrelated to RNA trees we focus here on uncovering structuraland functional similarities of pseudoknots Our graphicalanalysis of structure similarity emphasizes global topologicalsimilarity a well-known fact for existing functional RNAfamilies such as tRNA rRNA RNase P and snoRNA Thisautomated approach is advantageous for comparing relatedRNAs with a low-sequence similarity A disadvantage ofgraphical analysis is that both detailed base pair and struc-tural information are necessarily ignored For example minorsequence and secondary structure changes in a viral frameshift-ing pseudoknot that preserve the RNA topology but disrupt theRNA function would be considered as a match (2829) Such asubtle structuralndashfunctional relationship may not be amenableto automated analysis A similar problem also arises in struc-ture alignment of protein structures caused by for examplemutations in functional sites (1530) To reduce the error ratesof false functional identification such cases for proteins andRNAs are best screened by manual analysis after the detectionof structural similarity We adopt this two-step strategy for theidentification of structural and functional similarity in RNA

Our systematic comparison of structural similarities within14 existing RNA pseudoknotsmdashincluding viral frameshiftingtRNA-like internal ribosome entry site (IRES) ribosomalRNA (rRNA) and tmRNAmdashunravels eight non-trivialmatches including three novel findings supported by sequence

and functional data Functional relationship can be deducedsubsequent to finding structural matches by analyzingsequence alignment and pseudoknot functional propertiesOne such structural and functional similarity example is analternative hammerhead ribozyme pseudoknot of satellitecereal yellow dwarf virus (PKB173) embedded in the pseu-doknot topology occurring in the 30 end of brome mosaic virusgenomic RNA (PKB134) (see Pair 5 in Table 1) both RNAsare templates for replicase in viral RNA replication Anothernovel similarity example is between the viral IRES (PKB223)and the eukaryotic 18S rRNA (PKB205) (see Pair 10 inTable 1) both participate in translation initiation Yet anotherrelated pair (Pair 6) involves 18S rRNA (PKB205) and viraltRNA-like pseudoknot (PKB134) which recruits hostribosomes for viral replication We find that tRNA-likemolecules possess an 8ndash11 nt region complementary to 18SrRNA (PKB205) suggesting that tRNA-like molecules mayrecruit ribosomes via structural mimicry (of tRNA shape) (31)and base complementarity To the best of our knowledge thestructural and functional relationships identified in these threepseudoknot pairs have not been described previously

In addition we analyze the distribution of small treemodules or submotifs within the Saccharomyces cerevisiae16S and 23S rRNAs using mathematically enumerated (dis-tinct) tree graphs and our submotif search algorithm We findthat most small motifs (ie tree graphs with lt6 vertices or100 nt) exist within rRNAs whereas the larger submotifs(ie graphs with gt9 vertices or 160 nt) occur much lessfrequently Specifically our analysis suggests that rRNAsare constructed from at least 210 small distinct tree modules

In summary these quantitative analyses of similarity amongpseudoknots and modular subunits of rRNAs may be appliedto increasing repertoire of RNA structures and exploitedfor the identification and the modular design of novel RNAstructures

MATERIALS AND METHODS

We describe our computational methods for analyzing andcomparing RNA secondary structures in the following sub-sections Analyzing secondary rather than tertiary structuresas currently performed in many other studies is appropriatebecause the former is conserved for functional RNA classes(eg tRNA 5S rRNA and RNase P) Of course analysis oftertiary folds is a separate and important problem

Table 1 Significant LALIGN aligned pseudoknot pairs

Pair Aligned pair S(G) Sequence identity () Aligned length (nt) Score ltRgt Functional similarity

1 PKB134PKB135 117 56 129 86 154 Known2 PKB205PKB233 047 63 32 50 147 Unknown3 PKB178PKB135 117 59 29 54 142 Unknown4 PKB217PKB174 047 62 47 57 124 Known (Figure 4)5 PKB134PKB173 117 80 19 48 123 Novel (Figure 5)6 PKB134PKB205 117 76 17 49 123 Novel (Figure 5)7 PKB173PKB233 047 63 35 48 120 Unknown8 PKB173PKB191 117 70 20 46 115 Unknown9 PKB178PKB173 117 71 24 38 114 Unknown

10 PKB205PKB223 047 72 14 34 106 Novel (Figure 5)

Local alignment gap penalties 151 S(G) values for simple and double pseudoknots are 047 and 117 respectively ltRgt is R-value averaged over alignmentswith five randomized sequences

Nucleic Acids Research 2005 Vol 33 No 4 1385

Graphical representation of RNA secondary structures

RNA secondary structures can be schematically represented asmathematical graphs to capture the essential topological con-nectivity of loops bulges junctions and stems (220) The useof simplified graphical representation allows RNA structuresto be efficiently analyzed and compared Recently wedeveloped a general class of RNA graphs called dual graphscapable of representing both RNA tree and pseudoknot struc-tures (2) (compare dual and tree graphs in Figures 1 and 2respectively) We use both tree and dual graphs in this workThe rules for mapping RNA secondary structures onto tree anddual graphs are given in (2) and our RNA-As-Graphs webresource (httpmonodbiomathnyuedurna) (2627)

Graph isomorphism search algorithm

The representation of RNAs as graphs provides a systematicframework to search for similar substructures in RNAsthrough the concept of graph isomorphism (32) Isomorphicgraphs share the same pattern of connectivity among all ver-tices The challenge is to identify by an efficient computeralgorithm a graph as a subgraph of a larger graph or in

molecular terms an existing RNA motif contained in a largerRNA The computational complexity of identifying two struc-turally equivalent (ie isomorphic) graphs with V vertices is ofthe order of V factorial (V) and known as the lsquograph isomorph-ism problemrsquo (3233)

Our efficient method for testing graph isomorphism is basedon graph topological numbers or invariants a sketch of thisidea was reported previously [(2) Appendix D] We associateeach graph or subgraph with one or more topological invari-ants which are computed based on the patterns of connectivityamong graph vertices Thus isomorphic graphs have the sametopological invariants while dissimilar graphs have differenttopological invariants The similarity or dissimilarity betweengraphs can be readily established by comparing their topo-logical invariants (17) Our topological invariants allow theidentification of equivalent graphs differing only in the place-ment of chain ends andor hairpin loops as explained below

We define the topological invariants using a graphrsquos con-nectivity as described by the adjacency matrix Given a graphG with V vertices the V middot V adjacency matrix A(G) of thegraph specifies the connectivity between all pairs of verticesFor example matrix element Aij indicates the number of edges

1

2 3

4

6 5

0 1 0 1 0 11 1 1 0 0 00 1 1 1 0 01 0 1 0 1 00 0 0 1 0 31 0 0 0 3 0

A = S = 3988

0 1 2 1 2 11 0 1 2 3 22 1 0 1 2 31 2 1 0 1 22 3 2 1 0 131 2 3 2 13 0

D =

654321

0 2 0 0 0 02 0 2 0 0 00 2 0 2 0 00 0 2 0 2 00 0 0 2 0 20 0 0 0 2 1

A =

0 12 2 3 4 5

2 12 0 12 2 3

4 3 2 12 0 12 5 4 3 2 12 0

3 2 12 0 12 2

12 0 12 2 3 4

D = S = 4640

5

3

3

5

Figure 1 Two hypothetical RNA secondary structures and their dual graphical representations The RNA graphs are quantitatively described using adjacencymatrix A distance matrix D (Equation 1) and topological invariant S (Equation 2) We use these measures to compare and search for similar substructures inexisting RNAs

1386 Nucleic Acids Research 2005 Vol 33 No 4

connecting vertex i with vertex j For RNA dual graphs ele-ments of the adjacency matrix have the following propertiesAij is symmetric the allowed number of edges (connections)between two vertices is 0 1 2 or 3 a vertex can have at mostone self-loop (value of 1) representing a hairpin loop and eachvertex i is connected to four edges representing two incomingand two outgoing connecting strands except where the chainends occur

Our graph invariant is defined based on a modified distance-like matrix D(G) derived from the adjacency matrix A(G) Theelements of symmetric matrix (Dij) are defined as follows

Dij = A1ij for Aij bdquo 0

Dij = dij for Aij = 0

Dii = 0 for all i1

where dij is the minimum number of edges separating verticesi and j ie the number of edges defining the shortest possiblepath between these two vertices We define the graph invariantS(G) as follows

S Geth THORN2 = V1eth THORN1XVifrac141

XVjfrac141

Dij

2

2

Approximately S(G) measures the average distance amongthe vertices in the graph G For example the component Sidefined as

PVjfrac141 Dij measures how well-connected vertex i is

to the other vertices in the graph a peripheral vertex willproduce a higher Si value than a vertex located more in thecenter of the graph Effectively our scheme assigns adjacent

vertices with smaller weights than non-adjacent vertices Thusa large S value corresponds to an extended structure with fewbranches and a small S value to a compact graph with high-order junctions or a complex pseudoknot fold This property ofS can also be used to rank and measure the topological distancebetween graphs In general such topological invariants arebelieved to have a better correspondence with the physico-chemical properties of real molecules since many of the prop-erties depend critically on the exposed surface and lesscritically on the buried elements (34) Figure 1 shows theS values for two distinct RNA topologies

Our distance matrix D(G) and topological invariant S can beextended to tree graphs although their adjacency matriceshave different properties from those for dual graphs Wecall Stree a tree topological invariant which will be used foranalyzing submotifs in 16S and 23S rRNAs Figure 2A com-pares two topologically distinct 10-vertex tree graphs withdifferent Stree values

Our topological invariants [S(G) and Stree] are degeneratedwith respect to the location of chain ends or hairpin loopssince these elements are not scored Thus topologically equi-valent structures (ie same connectivity) embedded within alarger structure can be identified All known topological invari-ants are imperfect since distinct structures can be incorrectlyassigned as identical That is graphs with distinct topologicalinvariants are non-isomorphic but the converse is not true(not all graphs with same topological invariant are iso-morphic) This is the well-known graph isomorphism problem(33) In our analysis of 100 connected graphs with our topo-logical invariant S we only encountered one pair of similar but

1 3 4 5

6

7

9

8 10

2

101 2

34

5

6

78

9

S = 9079 S = 6285

S = 3085

A

B

Figure 2 Tree and dual graphical representations and topological invariants of hypothetical RNA secondary structures (A) RNA tree graphs and their topologicalinvariants Stree (B) The only example we found (out of 100) of two distinct RNA-like pseudoknot graphs with the same topological invariant S the pseudoknotsonly differ by the placement of a stemndashloop Although our structural comparison algorithm regards this structure pair as a positive match in practice such cases areeliminated by manual screening

Nucleic Acids Research 2005 Vol 33 No 4 1387

non-isomorphic graphs (for V = 5) that resulted in the sameinvariant S (see Figure 2B) In a previous work we have alsoused the Laplacian eigenvalue spectrum as a topologicalinvariant yielding a comparable error rate of a few percent(26) Some false positive matches are reported because of thisgraph isomorphism problem Since we manually examine allpositive matches the false positive matches are eliminated

Searching for the motif of a 10-vertex graph in a 30-vertexgraph requires 20 min CPU time on an SGI 300 MHz MIPSR12000 IP27 processor The search of a 15-vertex motif in a40-vertex graph would imply 1000-fold increase in CPU time

In practice our method can easily search (ie in minutes) forsubgraphs of size ranging from 5 to 11 vertices within graphsof size up to 30 vertices (corresponding to 600 nt RNAs)

RESULTS AND DISCUSSION

RNA pseudoknots

We analyze the structural similarity of a set of 14 pseudoknotsranging from 41 to 161 nt Figure 3 shows the secondarystructures and graphical representations of the pseudoknots

Viral frameshiftin g

PKB217

PKB174PKB131

PKB178

PKB223

PKB134PKB135Self cleavingVS ribozyme

PKB205

18S ribosomal RNA

PKB191

PKB173

Hammerhead ribozyme

PKB239

HIV1 5rsquo UTR

frameshiftingViral

Internal ribosome entry siteAptamer

PKB143

tmRNA

PKB210

PKB233

Figure 3 Topological matches generated by the simple (PKB217 PKB233) and double (PKB178) pseudoknots The pseudoknots are shown as schematic secondarystructures and dual graphs There are 2 probe and 11 target pseudoknot topologies representing diverse functional RNA pseudoknots For each probe pseudoknotstructure the corresponding matched submotifssubgraphs in the larger pseudoknots are shaded

1388 Nucleic Acids Research 2005 Vol 33 No 4

selected from the PseudoBase (25) (httpwwwbioleidenunivnl~BatenburgPKBhtml) The RNA pseudoknots are viralframeshifting (PKB174 PKB217 PKB233) viral tRNA-like(PKB134 PKB135 PKB143 PKB191) IRES RNA (PKB223)18S rRNA (PKB205) HIV-1 50-untranslated region (50-UTR)(PKB239) alternative pseudoknot of hammerhead ribozyme(PKB173) aptamer that binds human nerve growth factor(PKB131) tmRNA (PKB210) and self-cleaving Neurosporavs ribozyme (PKB178) (PKB233 and PKB217 appear thesame but have different sequences and this difference affectssubsequent sequence alignment analysis) These pseudoknotclasses are not known to be functionally related We choosemore than one member for some classes because of exhibitedtopological differences

The above pseudoknot structures were derived frommutational analysis (PKB173 PKB174 PKB191 PKB233PKB239) enzymatic and chemical structure probing(PKB131 PKB174 PKB191 PKB239) phylogenetic orsequence comparison analysis (PKB131 PKB135 PKB143PKB205 PKB217 PKB233 PKB234 PKB191 PKB210PKB239) and 3D modeling (PKB134 PKB135 PKB210)The pseudoknots listed in more than one category were probedusing two or more methods

To use these structures for graphical analysis we convertthe base pairing information of each structure provided byPSEUDOBASE to a dual graphical representation using ourdual graph rules [D1ndashD3 (2)] We then use pseudoknot graphsto specify their corresponding adjacency matrices Thesetwo steps are performed manually For RNA tree structureswe use our RNA Matrix program to automatically convert asecondary structure into its corresponding adjacency matrix(2635) The search for a small motif within a larger motif isperformed using input adjacency matrices and the comparisonof matrices is performed via the topological invariant S(Equation 2)

Sequence and structural relationships of functionallyrelated pseudoknots

We illustrate using several pseudoknot pairs the functionalsimilarity identified by topological rather than sequence sim-ilarity Figure 4 compares viral frameshifting pseudoknotstructures Viral RNA frameshifting is a process by whichexpression of overlapping open reading frames (ORFs) canbe achieved (36) We compare three pairs from subfamiliesGag-pro ORF1aORF1b and ORF2ORF3 ribosomalframeshifting Frameshifting signals involve two essentialcomponents a lsquoslipperyrsquo (or sliding) sequence of nucleotideswhere frameshifting takes place and a stimulatory RNAsecondary structure (usually a pseudoknot) located a fewnucleotides downstream Although members of the sameframeshifting subfamilies have significant sequence similarityglobal sequence identity between subfamilies is as low as28 similar to random sequences Because all frameshiftingpseudoknots in Figure 4A have the same topology and func-tion finding topological similarity between RNA structureshas an advantage for functional annotation over sequencecomparison

Figure 4B shows that this finding also holds for a frameshift-ing pseudoknot pair (PKB217 and PKB174 Pair 4 in Table 1)The 72 nt PKB217 is the 1 frameshifting pseudoknot of

lactate dehydrogenase-elevating virus (LDV) (37) Togetherwith its sliding sequence UUUAAAC it regulates the expres-sion of ORF1a for a polyprotein containing proteases andORF1b for an RNA-dependent RNA polymerase in LDVThe 127 nt PKB174 is the 1 frameshifting pseudoknot ofRous sarcoma virus (RSV) (3638) This pseudoknot and slid-ing sequence AAAUUUA regulate the expression of Gag-Pro-Pol and Gag-Pol polyproteins Although both the LDV andRSV frameshifting pseudoknots have a simple pseudoknotstructure the latter has two extra stemndashloops Despite theirfunctional similarity the global sequence identity is only 37[computed using the program ALIGN (39) available at httpworkbenchsdscedu] comparable with random sequencestheir sliding sequences are also dissimilar Using our localsequence alignment parameters we find a large significantmatched region in PKB217 (6833ndash6877) as shown inFigure 4B AAACUGCUAGCCACCUCUGGUCUCGACC-GCUGUACUAGAGGUGG

Although functional similarity between frameshiftingpseudoknots is expected this case illuminates how examiningtopological similarity aids in identifying pseudoknots withsimilar functions despite low overall sequence similarity

Comparison of pseudoknot structures

Our two probe motifsmdasha simple pseudoknot (PKB217 orPKB233) and a double pseudoknot (PKB178)mdashare used tosearch larger RNAs We generate two groups of topologicalmatches as shown in Figure 3 with the matched motifs showndarkened we found 11 such matched pseudoknot pairs Inaddition we analyze sequence similarity of many of the84 (14 middot 132) possible pseudoknot pairs since topologicalsimilarities are apparent within each pseudoknot group gen-erated by the probe motifs Although not exhaustive thisadditional analysis may help to identify pairs having similarsubstructures that are not detected by using fixed probes(simple and double pseudoknots) Table 1 lists 10 pseudoknotpairs (out of the 11 probestructure matches and subset of the84 structurendashstructure analysis) that have significant structuraland sequence similarities Of the total of 10 pairs listed(Pairs 1ndash10) two known functional pairs (1 and 4 in Table 1)are included for comparison with the eight other pairs that arenot previously known to be functionally related

We begin by analyzing the overall patterns of relationshipamong our 14 pseudoknots Figure 3 organizes the matchesinto two groups five matches for the simple pseudoknot(PKB217) and six for the double pseudoknot (PKB178)The matched motifs appear to be lsquomodular constructionsrsquoaround the probe motifmodule with added stemndashloops(three or fewer) To the best of our knowledge these resultsreveal modular features and topological similarities of RNApseudoknots that have not been described

The biological significance of topological matches dependson the size and complexity of pseudoknots a match for thedouble pseudoknot is probably more significant than for asimple pseudoknot The six pseudoknots generated by thedouble pseudoknot probe PKB178 show striking topologicalas well as functional similarity (eg viral tRNA-like PKB134and PKB135) we analyze their similarity with PKB178 andamong themselves Still the matched pairs of RNAs are func-tionally diverse besides PKB134 and PKB135 such examples

Nucleic Acids Research 2005 Vol 33 No 4 1389

of diverse RNAs include IRES (PKB223) RNA 18S rRNA(PKB205 PKB234) HIV-1 50-UTR (PKB239) and alternativepseudoknot of hammerhead ribozyme (PKB173) In this dou-ble pseudoknot group six pairs (1 3 5 6 8 and 9 in Table 1)produce local sequence alignments [using LALIGN (40)

httpworkbenchsdscedu] that suggest significant resultsIn particular the pair PKB134PKB173 (Pair 5 in Table 1)indicates a significant match The pairs PKB134PKB205(Pair 6) and PKB205PKB223 (Pair 10 in Table 1) alsodisplay intriguing sequence and functional similarities

A A A A A A C G G G A A G C A A G G G C U

GGGA

CA A

CCCU

A

A

A

U

GA U

C C C G G

G

A AACC

A

U

PKB 3

Gag-polglobal sequence identity 57

PKB 128U U U A A A C U G U U G A G A G G U G C C U G

U C U C U A C G G A C

GCCUG

GA G

CGGAC

UU U

G

U AUA

G

A

C

PKB 217U U U A A A C U G C U A G C C A C C U C U G G U

G G U G G A G A U C A

GACCGC

CU C C

UGGCG

CU

U

G

G

GCA

U

G GY

AG

CG

ORF 1a - ORF 1bglobal sequence identity 53

G G G A A A C G G G C A G G C G G

GCG

C GCGC

A

G

A

C C G C C

AC

AA

C

PKB 44

G G G A A A U G G A C U G A G G C G

GGCG

C CGGC

A

C

A

C G C C

A

CA

A

C

PKB 240

A

A

ORF2 - ORF3global sequence identity 79

global sequence identity as low as 28

A

U CC

G

C

AA A

C

U

C

GG

C

U

G

UU

AA

PKB 217GCUGACGGUGUYUGGCG

UA

CUC

UG

UG A C C G C

C U G G C G

CC

CCUCUGGU

GUGGAGA

CA

A

U

GU U U A A A C U G C U A G C 5

3

PKB 174

G A AG

UG

AA

AC

U

U

A

U

UG

C A

UUG

UC

UC

U

C

A

U A

U

AGGG

UCCCU

CGGGCCACUG

CCCGGUGAC

G

CC

GU

GU

G

A

CA

C

G C G C U A C A

C G C G A U G U

A G A C C G

U C U G G C

A

2603 3

52476 A A A U U U A U

6830

BPair 4

global sequence identity 37

G G G A A A C U G G A A G G C G G G G C

GCUGCA

GA C

GACUG

G

A

A

C C C C G

PKB 4

C

AA

AUA

UG

Figure 4 Sequence and structural comparisons of pseudoknots in the frameshifting functional families (A) Two secondary structures and corresponding topologiesfor each of the three viral frameshifting subfamilies Gag-pro ORF1aORF1b and ORF2ORF3 Sequence identity within the same subfamily ranges between 50 and80 it is as low as 28 when comparing pseudoknots from different subfamilies All frameshifting pseudoknots are functionally related and have the same topology(dual graph) (B) Secondary structures and their aligned nucleotides (color coded) for frameshifting pseudoknots of LDV (PKB217) and Rous sarcoma virus(PKB174) Global sequence alignments were performed using the program ALIGN with default (164) gap openinggap extension penalties

1390 Nucleic Acids Research 2005 Vol 33 No 4

These relationships suggest structural and functional similar-ities that are not previously known (Table 1)

To support the significance of these matches we performsequence alignments using SmithndashWaterman parameters forgap opening and gap extension penalties (41) We choose thegap opening penalty value of 15 from a range of 0 to 19 Toproduce better alignments for large sequence segments (shortalignments are less likely to be significant in our context) ourgap extension penalty of 1 from a range of 1 to 8 allowslarger extensions consistent with the possibility of the inser-tion of long sequences and added 2D motifs In Table 1 wealso report the alignment score which is the sum of nucleotidematches mismatches and gap penalties

Table 1 shows that the probe structure (PKB178) matchestwo other pseudoknots (PKB135 PKB173) with sequenceidentities ranging from 59 to 71 for 24ndash49 nt alignmentlengths (compared with 56 sequence identity over 129 ntfor the functionally related pair PKB134PKB135 Pair 1)Significantly the pairs PKB134PKB173 (Pair 5) PKB134PKB205 (Pair 6) and PKB205PKB223 (Pair 10) havesequence identity range of 72ndash80 The non-randomness ofthe 10 aligned pairs in Table 1 is also evident in the ratio Rwhich measures the level of significance above the randomexpectation and has value 1 for random pairs

R = (percent sequence identity)(percent sequence identitywhen one of the aligned sequences is randomized)

We also define ltRgt as the R-value averaged over align-ments with different randomized sequences Table 1 showsthat some pseudoknot pairs with no known functional rela-tionship have ltRgt values ranging between 106 and 147whereas the related pair PKB134PKB135 has a value of154 many matched pairs have ltRgt values lt1 Thus thetopological similarities we identify led to matched pairswith non-random sequence alignment results Non-randomsequence matches may arise from topological andor func-tional similarities as illustrated below for three pseudoknotpairs (5 6 and 10 in Table 1)

The rate of false positives generated by topological and ltRgtvalue screening can be easily estimated by considering allpossible pairs from the set of six matched double pseudo-knots in Figure 3 These pseudoknots form 15 distinctpairs 6 of which have ltRgt values gt114 [Table 1 withS(G) = 117] Table 1 shows that only four pairs have largeltRgt values (gt12) Pair 1 (PKB134PKB135) is related Pair 5(PKB134PKB173) and Pair 6 (PKB134PKB205) are mostprobably related and Pair 3 (PKB178PKB173) is of unknownrelationship (in fact both are self-cleaving ribozymes)Assuming that the Pair 3 is unrelated and therefore reflectsa false positive we have an error rate of 115 or 7 If theinferred functionally related Pairs 5 and 6 are also consideredas false positives we then have a conservative error rateof 20

Functional relationship between PKB134 and PKB173(Pair 5 in Table 1)

The 116 nt PKB134 pseudoknot occurs in the 30 end of bromemosaic virus genomic RNA (31) The 73 nt PKB173 is ahammerhead ribozyme of satellite cereal yellow dwarfvirus-RPV (satRPV) RNA capable of adopting an alternativepseudoknot conformation by forming an (L1ndashL2a) stem whose

existence is established by experimental mutational analysis(42) Figure 3 shows that their topologies have the same lsquocorersquobut PKB134 has two extra stemndashloops The local sequencealignment of this pair using LALIGN (Figure 5) shows 19bases aligned to 80 sequence identity and an ltRgt valueof 123 (Table 1) There are two significant segments inPKB173 that are matched with those in PKB134 UACUGU-CUGACGACGUAUCC (nt 11ndash30) and UAGAAGGCUG-GUGCC (nt 39ndash53) These segments and their matchedregions in PKB134 span the stem and unpaired elements ofthe secondary structures (see Figure 5) Significantly bothPKB134 and PKB173 pseudoknots serve as template for rep-licase a protein that recognizes specific sequence regions toinitiate replication of viral genomes (3142) Moreover a con-served sequence (CUGANGA) of PKB173 is in the alignedregion (nt 11ndash30) Thus our analysis suggests that PKB134and PKB173 pseudoknots are structurally and functionallyrelated

Further analysis of pseudoknots PKB134 and PKB173reveals specific domains interacting with the replicase Experi-ments have shown that mutations in domains A C B1 and B2[defined in (31)] of PKB134 impair viral RNA replication (43)Significantly our two aligned sequence regions span domainsA B1 C and also D Thus the replicase recognizes severaldomains in the PKB134 pseudoknot structure In PKB173 thepseudoknot forming five base pair helix (GCGCG) is implic-ated in replicase recognition or ligation (42) This pseudoknotalso has conserved sequence regions ACAAA (61ndash65) or itscomplement (possibly the replication origin) and AGAAA(nucleotides attached to the 30 end but not shown) Howeverthese regions are not observed in PKB134

Functional relationship between PKB205 and PKB223(Pair 10 in Table 1)

The 63 nt PKB223 is the pseudoknot of IRES of capsid-producing cricket paralysis-like viruses (CrPV) formethionine-independent initiation of translation (44) Appear-ing immediately upstream from the capsid-coding sequenceit initiates translation without a canonical AUG (methionine)initiation codon The 48 nt PKB205 is the pseudoknotsubstructure (V4 domain) of SSU of Palmaria palmata 18SrRNA which was found using extensive comparative analysis(4546) Figure 5 (Pair 10 lower panel) shows that PKB223 isa 4-vertex simple pseudoknot with two inserted stemndashloopsand PKB205 is a 4-vertex double pseudoknot with an insertedstemndashloop Figure 5 (middle panel) displays the local sequencealignment of PKB223 and PKB205 yielding a match of 14 ntwith 72 sequence identity This significant sequence simil-arity occurs in regions ACAAUAUCCAGGAA (6148ndash6161)and ACAUUAGCAUGGAA (765ndash778) of PKB223 andPKB205 respectively Remarkably we find that the 9 ntsegment 50-UUAGGUUAG-30 (nt 6100ndash6108) of PKB223is complementary to 30-GGUACGAUU-50 (nt 768ndash776) ofPKB205 except at G6103 Our finding corroborates withthe experimental identification that IRES elements in cellularmRNA contain a short 9 nt (nt 133ndash141 CCGGCGGGU) inthe 50-UTR of the homeodomain protein Gtx which is com-plementary to nt 1132ndash1124 of 18S rRNA (47) Indeed manysuch near complementary sequences have been found bet-ween the 30 end of 18S rRNA and picornavirus RNAs (48)

Nucleic Acids Research 2005 Vol 33 No 4 1391

Fig

ure

5

Th

ree

no

vel

fun

ctio

nal

lyre

late

dp

seu

do

kn

otp

airs

infe

rred

by

top

olo

gy

mat

chin

gan

dse

qu

ence

alig

nm

ent

Up

per

pan

els

Sec

ond

ary

stru

ctu

res

and

thei

ral

ign

edn

ucl

eoti

des

(co

lor

cod

ed)

for

pse

udo

kn

ot

pai

rsP

KB

13

4P

KB

17

3(P

air

5T

able

1

left

pan

el)

PK

B2

05

PK

B2

23

(Pai

r1

0T

able

1

mid

dle

pan

el)

and

PK

B1

34

PK

B20

5(P

air

6

rig

ht

pan

el)

PK

B1

34

occ

urs

in30 e

nd

of

bro

me

mosa

icv

iru

sg

eno

mic

RN

A

PK

B1

73

isa

pse

udo

kn

oto

fsa

tell

ite

cere

aly

ello

wd

war

ftv

iru

s-R

PV

RN

AP

KB

22

3is

the

pse

ud

ok

no

tofin

tern

alen

try

site

(IR

ES

)o

fca

spid

-pro

du

cin

gC

rPV

san

dP

KB

20

5is

the

pse

udo

kn

ots

ub

stru

ctu

re(V

4d

om

ain

)o

fS

SU

of

PP

alm

ata

18

SrR

NA

L

ow

erp

anel

sT

he

alig

ned

sub

gra

ph

so

rre

gio

ns

are

shad

edo

rm

ark

ed

1392 Nucleic Acids Research 2005 Vol 33 No 4

Intriguingly for CrPVs the complementary segments occur inpseudoknots with similar folds not involving the dangling30 end of 18S rRNA Base pair complementary between IRESelements of mRNA and 18S rRNA is a mechanism for ribo-some recruitment and translation In viral IRES elements thesecondary and tertiary structures are believed to play criticalroles in translation initiation (44)

Functional relationship between PKB205 and PKB134(Pair 6 in Table 1)

The functional relationsip of the pair PKB205PKB134 issimilar to that for PKB205PKB223 above PKB134 belongsto a common viral tRNA-like functional class whose role is torecruit host ribosomes to initiate viral protein synthesis (49)It is known to interact with the ribosome via tRNA shapemimicry (31) Our analysis indicates that tRNA-like mole-cules also contain a region which is nearly complementary toPKB205 a pseudoknot region of the 18S rRNA Table 2shows regions of PKB205 complementary to four tRNA-like molecules (PKB134 191 138 17) the complementaryregion in PKB223 is shown for comparison Significantlyeach of the pseudoknots PKB223 PKB134 and PKB191has a region complementary to the same 8ndash9 nt unpairedregion in PKB205 (767ndash776) This single-stranded regionis most probably available for base pairing with incomingtRNA-like molecules In addition two other regions inPKB205 are complementary to tRNA-like pseudoknotsPKB138 and PKB17 one region is a hairpin loop and anotherregion is part of short stems Thus this analysis suggests thattRNA-like molecules may employ both structural mimicry andsequence complementarity to recruit SSU rRNA an interac-tion mode similar to tRNA PKB134 also has a region UACA-GUGUUGAAAAACA (nt 2078ndash2094) closely matched tothe region UAGAGUGUUCAAAGCCA (nt 732ndash748) inPKB205 with an average R-value of 123

Tertiary structure similarity of topological matches inribosomal RNAs

The preceding examination of three pseudoknot pairs showsthat topological matches can suggest functionally relatedpseudoknots It is also important to show that novel topo-logical matches led to similar tertiary structures We focusfor this purpose on rRNA since several bacterial and archaealribosome structures have been solved in recent years In factin our previous work we have found several topologicalmatches of the Scerevisiae 5S rRNA motif within its largerrRNAs two within 16S rRNA (domains II and III) and threewithin 23S rRNA (domains III IV and VI) (2) From availableX-ray structures for Thermus thermophilus 16S rRNA (PDBaccession no 1fjg) (50) and 23S rRNA from Deinococcus

radiodurans (1nkw) and Haloarcula marismortui (1ffk)(5152) only 5S rRNA motifrsquos matches in domain II of16S rRNA and domain III of 23S rRNA are relevant theother three cases are not valid matches in these species

Figure 6 displays the secondary diagrams graphical repres-entations and tertiary structures of 5S rRNA for bacteriumDradiodurans (1nkw) and archaea Hmarismortui (1ffk)and their topologically matched modules within 16S and23S rRNAs the secondary diagrams are adapted from thosein Gutellrsquos comparative RNA website (httpwwwrnaicmbutexasedu) (53) The module in 23S is an identicalmatch for the 6-vertex 5S graph except for a hairpin loopwhich does not contribute to the topological invariant SThe 16S module is a 5-vertex graph because a 2 bp stem atthe junction has a non-canonical base pair AG (dual graph ruleD1 defines a stem as having at least two canonical andorwobble base pairs) for Scerevisiae this stem consists ofonly canonical base pairs making the 16S module an identicalmatch for the 5S topology As shown in Figure 6 the 16S and23S modules have a similar L-shaped tertiary fold as the 5Sstructure Tertiary structure similarity with 5S is more strikingfor the module of domain II (nt 653ndash753) of Tthermophilus16S rRNA than for the module of domain III (nt 1033ndash1143) of23S rRNA Both the 5S structure and 16S module have apronounced 3-stem junction where the chain ends of the16S module are located Structural (backbone) overlap ofthe 5S structure and 16S module shows similar structuralfeatures except for the orientations of the smaller stems atthe junction (We aligned the structures manually because thechain ends of 5S and 16S module occur at non-correspondinglocations) In contrast the 23S module shows significant dis-tortions on the shorter helical arm with the hairpin loop foldedback to the helix as shown in the structural overlap of the 5Sstructure and 23S module Structural similarities seen in theseexamples are probably due to geometric constraints which areapparent in their secondary diagrams and graphical representa-tions Thus 2D topological matches can lead to similar 3Dstructures Below we survey the distribution of modules inrRNAs

Distribution of submotifs in 16S and 23S ribosomalRNAs

Characterizing the submotifs within the 16S and 23S rRNAsthe largest known RNA molecules can help uncover themodular construction of RNA structures as analysis ofrRNArsquos tertiary modules has revealed Furthermore our pre-vious analysis of rRNArsquos structural motifs indicated existenceof characteristic features (35) For example the distributionsof pairedunpaired bases in stems bulgesinternal loops hair-pin loops and junctions follow specific functional forms

Table 2 Short sequence regions in PKB205 complementary to tRNA-like molecules

PKB205 regions Complementary sequences

30-GGUACGAUU-50 (768ndash776) PKB223 50-UUAGGUUAG-30 (6100ndash6108) IRES30-GUGAGAUU-50 (769ndash776) PKB134 50-CACUGUAAA-30 (113ndash120) tRNA-like30-UACGAUUAC-50 (767ndash775) PKB191 50-AUGCUCAUG-30 (77ndash85) tRNA-like30-UGUGAGAUU-50 (731ndash739) PKB138 50-ACACUUUAA-30 (38ndash47) tRNA-like30-GUUUGCGGACC-50 (746ndash756) PKB17 50-CAAAACCCUGG-30 (19ndash29) tRNA-like

Nucleic Acids Research 2005 Vol 33 No 4 1393

Figure 6 Comparison of the graph topologies and 2D and 3D structures of 5S rRNAs (A and B) and modules in 16S and 23S rRNAs (C and D) (C and D) Twostructural (backbone) overlaps of the 5S structure (1ffk gold color) with 16S module (1fjg blue) and 5S structure with 23S module (1nkw green) are shown Thesecondary diagrams are adapted from those in Gutellrsquos comparative RNA website (httpwwwrnaicmbutexasedu) and 3D structures are from the Protein DataBank (httpwwwrcsborgpdb) The 2D modules were identified computationally by our earlier work on finding120 nt 5S rRNA motifs in 16S and 23S rRNAs ofScerevisiae(2) The equivalent tertiary structures and modules for 5S 16S and 23S rRNAs from other organisms show that similar graph topologies can lead tosimilar secondary and tertiary structures All structures have similar graph topologies The Tthermophilus 16S module is a 5-vertex instead of 6-vertex graph becausea 2 bp stem at the junction has a non-canonical base pair AG for Scerevisiae this stem consists of only canonical base pairs making the 16S module an identicalmatch for the 5S topology

1394 Nucleic Acids Research 2005 Vol 33 No 4

Moreover the percentages of nucleotides in stems bulgesinternal loops hairpin loops and junctions do not vary for16S and 23S rRNAs even though the latter is twice thesize of the former Such motif characteristics help discriminateRNA-like from non-RNA-like molecules and guide the designof novel RNAs Here we analyze the distinct modular unitsmdashsmall tree submotifsmdashthat makeup the rRNAs to advance suchapplications Since all small tree graphs up to 10 vertices havebeen exhaustively enumerated (5455) we can use our sub-structure search algorithm to identify the occurrences of alldistinct tree submotifs in rRNAs

Specifically we use sets of tree graphs for V = 4 5 6 7 8 9and 10 comprising 2 3 6 11 23 47 and 106 graphs respect-ively for a total of 198 tree motifs The complete enumeratedmotifs can be found in the RAG web resource (httpmonodbiomathnyuedurna) Figure 7 shows as an example the11 possible 7-vertex tree graphs and lists the frequency ofoccurrence of each tree graph in rRNAs Remarkably thesame tree motifs are abundant in both 16S and 23S rRNAsThe most abundant (10 or 15 times) 7-vertex tree has two3-stem junctions (topological invariant Stree = 6160) 4- and5-stem junctions are the next most prevalent submotifs(Stree = 5137 5921 and 5667) Interestingly the unbranchedtree (Stree = 7219) and highly branched trees (Stree = 4509)are rare

Figure 8 plots the percentage of distinct tree graphs in eachset V (fraction of motifs identified relative to possible motifs)

found in rRNA For small tree graphs V = 4 5 and 6 allpossible topologies are present but not for V gt 6 In fact thepercentage of trees found declines with V In 23S rRNA90 of possible trees are found for V = 7 and only40 (representing 47 tree topologies) are found forV = 10 Larger trees (V gt 10) are expected to have muchsmaller percentages This percentage decline is expectedsince small tree modules are less specific than large treesIn total 111 submotifs out of 198 are identified The observedtrends in Figure 8 may be generally valid for rRNA structuresof other organisms since they are highly conserved althoughsome statistical fluctuations are expected due to structuraldifferences and RNArsquos finite size

These data show that RNA structures are modular constructsof relatively few (tree) building blocks Extrapolation of thecurves in Figure 8 suggests that 20 of V = 11 motifs (totalof 235 trees) and 10 of V = 12 motifs (total of 551 trees)probably exist in rRNAs Thus the total number of buildingblocks in rRNAs with V lt 13 would be 210 Although thepercentages of trees with higher V values are low they aremore numerous and will contribute to the overall number ofbuilding blocks We thus consider 210 as the minimum num-ber of building blocks The building block number may beexploited in the future design of novel functional RNAmolecules using modular assembly from 210 distinct treemotifs The tree motifs may be taken from substructures ofexisting RNAs allowing simultaneous sequence and structure

treeS =

16S 23SOccurrence

ID

6 4

)

0 1

(7 8)

6191

(7 11)

4509

0 1

(7 9)

5385

3 3

7

(7 10)

5137

5 5

(7 6)

5916

5

(7 5)

5921

(7 1)

0 0

7219

(7 3)

2 4

6446

4 4

(7 7)

5667

10 15

(7 4

6160

0 3

6692

(7 2)

Figure 7 Occurrence frequency of 11 distinct 7-vertex tree topologies as submotifs of Scerevisiae 16S and 23S rRNAs The tree graphs are labeled using topologicalinvariant Stree (second top row) and ID number (i j) (first row) which refers to motif identification number as used in RAG database (httpmonodbiomathnyuedurna) The two numbers at the bottom of each graph refer to the number of times that topology is found embedded in 16S and in 23S rRNAs respectively

Nucleic Acids Research 2005 Vol 33 No 4 1395

design Such a build-up approach appears feasible from our(26) and other works (35) and is akin to the lsquobuilduprsquo tech-nique for proteins pioneered by Vasquez and Scheraga (56)

The above estimate of building block number can be eithera consequence of the limited number of available structures orthe fact that RNA adopts only a subset of possible graphsCertainly future discoveries of novel RNA structures mightindicate a larger repertoire of building blocks Still our sur-veys of natural RNAs (26) suggest that RNAs possess specifictopological characteristics and favor a subset of the possiblegraph motifs The missing motifs may not exist in nature forRNA owing to physical or functional constraints

The building block number can also be defined in terms ofirreducible graphs ie trees that cannot be decomposed intosmaller trees However this approach has a major drawbackSince all RNAs are composed of connected stems their sub-structures can be reduced to two vertex graphs which is not aninformative description of complex RNA folds Still if wedefine 4-vertex trees as the smallest irreducible graphs thenumber of irreducible graphs can be calculated Based on oursubmotif data we estimate that for V up to 10 there are about42 irreducible tree modules in rRNAs

SUMMARY AND CONCLUSION

Our automated comparison of RNA secondary structuresbased on graphical representation allows the examination ofconserved 2D RNA topological features which in turn suggestfunctional relationships This approach has an advantage oversequence alignment because functionally related RNAs oftenlack sequence similarity as shown for frameshifting pseudo-knots with low global sequence identities Of course ourstructure alignment approach relies on available 2D RNA

structures and thus will increase in scope as more RNAstructures become available Our eight identified pairs ofpotentially related pseudoknots (Table 1) reveal three pairsthat are functionally related the pseudoknots of 30 end ofbrome mosaic virus genomic RNA (PKB134) and alternativehammerhead ribozyme pseudoknot (PKB173) are templatesfor replicase to initiate replication of viral genomes the pseu-doknots of internal entry site (IRES) of caspid-producingCrPVs (PKB223) and SSU of 18S rRNA (PKB205) are asso-ciated with translation initiation and the pseudoknots of 18SrRNA (PKB205) and viral tRNA-like (PKB134) most prob-ably interact via both structural mimicry and base comple-mentarity To the best of our knowledge these functionalrelationships have not been described previously

The developed RNA comparison algorithm also led us toquantify the modularity of 16S and 23S rRNAs using smallenumerated tree motifs Our results suggest that rRNAs areconstructed from at least 210 distinct tree subtopologies(or modules) having up to 180 nt Significantly we findthat 5S rRNA and two tree modules within 16S and 23SrRNAs with similar topologies have similar tertiary shapesaffirming the relationship between 2D graphical representa-tion and 3D structure The tree modules in rRNA may be usedas building blocks for the design of novel RNA folds A similaridea has been employed to design and improve functionalRNA molecules (510) Design by assembly of modulesexploits motifs present in natural RNAs unlike de novo designof sequences and structures (11) The issues of discoveringnovel topologies (6) building module library and designingfunctional folds are likely to increase in importance withinterest in novel RNAs

Our algorithm has several limitations Of course its utilitydepends on availability of reliable RNA secondary foldingalgorithms (5758) More fundamentally although simplifiedgraphical representations specify topological characteristicsdetailed sequence and motif information is missing An asso-ciated problem is that unusual RNA structurendashfunction rela-tionships due to deleterious mutations as in frameshiftingpseudoknots (28) and structural flexibility are not capturedMore advanced graph constructs such as weighted graphs mayhowever specify detailed motif features at the secondary andbase pair levels Another limitation of the present algorithmis the requirement to fix the probe structure or graph As inprotein structure alignment algorithms (15) future refinementsof our algorithm should allow some flexibility in motifdefinitions as the similarity search proceeds We inviteRNA scientists to visit our RAG web resource at httpmonodbiomathnyuedurna and suggest to us specific enhancementsthat will be useful in practice

ACKNOWLEDGEMENTS

We thank Dinshaw Patel for comments that motivated thiswork This research was supported by a Joint NSFNIGMSInitiative in Mathematical Biology (DMS-0201160) and theHuman Frontier Science Program (HFSP)

REFERENCES

1 TinocoI and BustamanteC (1999) How RNA folds J Mol Biol293 271ndash281

Figure 8 Distribution of all possible distinct tree modules with V vertices foundin Scerevisiae 16S and 23S rRNAs In graph theory each V is represented by aset of possible tree structures For example for V= 4 5 6 7 8 9 and 10 there are2 3 6 11 23 47 and 106 possible trees respectively The distribution of treesquantifies the diversity of tree modules within rRNAs This analysis suggeststhat rRNAs possess at least 210 tree modules

1396 Nucleic Acids Research 2005 Vol 33 No 4

2 GanHH PasqualiS and SchlickT (2003) Exploring the repertoire ofRNA secondary motifs using graph theory implications for RNA designNucleic Acids Res 31 2926ndash2943

3 SoukupGA and BreakerRR (1999) Engineering precision RNAmolecular switches Proc Natl Acad Sci USA 96 3584ndash3589

4 BreakerRR (2002) Engineered allosteric ribozymes as biosensorcomponents Curr Opin Biotechnol 13 31ndash39

5 IkawaY FukadaK WatanabeS ShiraishiH and InoueT (2002)Design construction and analysis of a novel class of self-foldingRNA Structure 10 527ndash534

6 KimN ShiffeldrimN GanHH and SchlickT (2004) Candidatesfor novel RNA topologies J Mol Biol 341 1129ndash1144

7 MitchellJR ChengJ and CollinsK (1999) A box HACAsmall nucleolar RNA-like domain at the human telomerase RNA 30

end Mol Cell Biol 19 567ndash5768 DoudnaJA and CechTR (2002) The chemical repertoire

of natural ribozymes Nature 418 222ndash2289 StorzG (2002) An expanding universe of noncoding RNAs Science

296 1260ndash126310 BreakerRR and JoyceGF (1994) Inventing and improving ribozyme

function rational design versus iterative selection methods TrendsBiotechnol 12 268ndash275

11 AndronescuM FejesAP HutterF HoosHH and CondonA (2004)A new algorithm for RNA secondary structure design J Mol Biol336 607ndash624

12 CarterRJ DubchakI and HolbrookSR (2001) A computationalapproach to identify genes for functional RNAs in genomic sequencesNucleic Acids Res 29 3928ndash3938

13 DoudnaJA (2000) Structural genomics of RNA Nature Struct Biol7 954ndash956

14 RivasE KleinRJ JonesTA and EddySR (2001) Computationalidentification of noncoding RNAs in Ecoli by comparativegenomics Curr Biol 11 1369ndash1373

15 HolmL and SanderC (1996) Mapping the protein universeScience 273 595ndash602

16 ShindyalovIN and BournePE (2001) A database and tools for 3-Dprotein structure comparison and alignment using the CombinatorialExtension (CE) algorithm Nucleic Acids Res 29 228ndash229

17 BenedettiG and MorosettiS (1996) A graph-topological approach torecognition of pattern and similarity in RNA secondary structuresBiophys Chem 59 179ndash184

18 MargalitH ShapiroBA OppenheimAB and MaizelJVJr (1989)Detection of common motifs in RNA secondary structures Nucleic AcidsRes 17 4829ndash4845

19 ShapiroBA and ZhangKZ (1990) Comparing multiple RNAsecondary structures using tree comparisons Comput Appl Biosci6 309ndash318

20 LeSY NussinovR and MaizelJV (1989) Tree graphs of RNAsecondary structures and their comparisons Comput Biomed Res22 461ndash473

21 DuarteCM WadleyLM and PyleAM (2003) RNA structurecomparison motif search and discovery using a reduced representation ofRNA conformational space Nucleic Acids Res 31 4755ndash4761

22 ChevaletC and MichotB (1992) An algorithm for comparing RNAsecondary structures and searching for similar substructures ComputAppl Biosci 8 215ndash225

23 WangL and ZhaoJ (2003) Parametric alignment of ordered treesBioinformatics 19 2237ndash2245

24 FontanaW KoningsDAM StadlerPF and SchusterP (1993)Statistics of RNA secondary structures Biopolymers 33 1389ndash1404

25 van BatenburgFHD GultyaevAP PleijCWA NgJ andOliehoekJ (2000) PseudoBase a database with RNA pseudoknotsNucleic Acids Res 28 201ndash204

26 GanHH FeraD ZornJ ShiffeldrimN TangM LasersonUKimN and SchlickT (2004) RAG RNA-As-Graphsdatabasemdashconcepts analysis and features Bioinformatics20 1285ndash1291

27 FeraD KimN ShiffeldrimN ZornJ LasersonU GanHH andSchlickT (2004) RAG RNA-As-Graphs web resource BMCBioinformatics 5 88

28 ChenXY KangHS ShenLX ChamorroM VarmusHE andTinocoI (1996) A characteristic bent conformation of RNA pseudoknotspromotes 1 frameshifting during translation of retroviral RNAJ Mol Biol 260 479ndash483

29 KangHS HinesJV and TinocoI (1996) Conformation of anon-frameshifting RNA pseudoknot from mouse mammary tumor virusJ Mol Biol 259 135ndash147

30 GanHH PerlowRA RoyS KoJ WuM HuangJ YanSNicolettaA VafaiJ SunD et al (2002) Analysis of protein sequencestructure similarity relationships Biophys J 83 2781ndash2791

31 FeldenB FlorentzC GiegeR and WesthofE (1994) Solutionstructure of the 30-end of brome mosaic-virus genomic RNAsmdashconformational mimicry with canonical transfer-RNAs J Mol Biol235 508ndash531

32 GrossJL and YellenJ (1999) Graph Theory and its Applications CRCPress Boca Raton FL

33 KoblerJ SchoningU and ToraanJ (1993) The Graph IsomorphismProblem Birkhauser Boston MA

34 RandicM and ZupanJ (2001) On interpretation of well-knowntopological indices J Chem Inform Comput Sci 41 550ndash560

35 ZornJ GanHH ShiffeldrimN and SchlickT (2004) Structuralmotifs in ribosomal RNAs implications for RNA design and genomicsBiopolymers 73 340ndash347

36 JacksT MadhaniHD MasiarzFR and VarmusHE (1988) Signalsfor ribosomal frameshifting in the Rous-Sarcoma virus GagndashPolregion Cell 55 447ndash458

37 GodenyEK ChenL KumarSN MethvenSL KooninEV andBrintonMA (1993) Complete genomic sequence and phylogeneticanalysis of the lactate dehydrogenase-elevating virus (Ldv) Virology194 585ndash596

38 MarczinkeB FisherR VidakovicM BloysAJ and BrierleyI (1998)Secondary structure and mutational analysis of the ribosomalframeshift signal of Rous sarcoma virus J Mol Biol 284 205ndash225

39 MyersEW and MillerW (1988) Optimal alignments in linear spaceComput Appl Biosci 4 11ndash17

40 HuangXQ and MillerW (1991) A time-efficient linear-space localsimilarity algorithm Adv Appl Math 12 337ndash357

41 SmithTF and WatermanMS (1981) Identification of commonmolecular subsequences J Mol Biol 147 195ndash197

42 SongSI SilverSL AulikMA RasochovaL MohanBR andMillerWA (1999) Satellite cereal yellow dwarf virus-RPV (satRPV)RNA requires a DouXble hammerhead for self-cleavage and analternative structure for replication J Mol Biol 293 781ndash793

43 DreherTW and HallTC (1988) Mutational analysis of thetransfer-RNA mimicry of brome mosaic-virus RNAmdashsequence andstructural requirements for aminoacylation and 30-adenylationJ Mol Biol 201 41ndash55

44 KanamoriY and NakashimaN (2001) A tertiary structure model of theinternal ribosome entry site (IRES) for methionine-independent initiationof translation RNA 7 266ndash274

45 WuytsJ De RijkP Van de PeerY PisonG RousseeuwP andDe WachterR (2000)Comparativeanalysisof more than 3000 sequencesreveals the existence of two pseudoknots in area V4 of eukaryotic smallsubunit ribosomal RNA Nucleic Acids Res 28 4698ndash4708

46 Van de PeerY De RijkP WuytsJ WinkelmansT and De WachterR(2000) The European small subunit ribosomal RNA databaseNucleic Acids Res 28 175ndash176

47 ChappellSA EdelmanGM and MauroVP (2000) A 9-nt segment ofa cellular mRNA can function as an internal ribosome entry site (IRES)and when present in linked multiple copies greatly enhances IRESactivity Proc Natl Acad Sci USA 97 1536ndash1541

48 ScheperGC VoormaHO and ThomasAA (1994) Basepairing with18S ribosomal RNA in internal initiation of translation FEBS Lett352 271ndash275

49 BarendsS Rudinger-ThirionJ FlorentzC GiegeR PleijCW andKraalB (2004) tRNA-like structure regulates translation of Bromemosaic virus RNA J Virol 78 4003ndash4010

50 CarterAP ClemonsWM BrodersenDE Morgan-WarrenRJWimberlyBT and RamakrishnanV (2000) Functional insights fromthe structure of the 30S ribosomal subunit and its interactions withantibiotics Nature 407 340ndash348

51 HarmsJ SchluenzenF ZarivachR BashanA GatS AgmonIBartelsH FranceschiF and YonathA (2001) High resolution structureof the large ribosomal subunit from a mesophilic Eubacterium Cell 107679ndash688

52 BanN NissenP HansenJ MoorePB and SteitzTA (2000) Thecomplete atomic structure of the large ribosomal subunit at 24 angstromresolution Science 289 905ndash920

Nucleic Acids Research 2005 Vol 33 No 4 1397

53 CannoneJJ SubramanianS SchnareMN CollettJRDrsquoSouzaLM DuYS FengB LinN MadabusiLV MullerKMet al (2002) The Comparative RNA Web (CRW) Site an online databaseof comparative sequence and structure information for ribosomal intronand other RNAs BMC Bioinformatics 3 2

54 HararyF and PrinsG (1959) The number of homeomorphicallyirreducible trees and other species Acta Math 101141ndash162

55 HararyF (1969) Graph Theory Addison-Wesley Reading Mass

56 VasquezM and ScheragaHA (1985) Use of buildup andenergy-minimization procedures to compute low-energy structures ofthe backbone of enkephalin Biopolymers 24 1437ndash1447

57 ZukerM MathewsDH and TurnerDH (1999) Algorithms andthermodynamics for RNA secondary structure prediction A practicalguide In BarciszewskiJ and ClarkBFC (eds) RNA Biochemistry andBiotechnology Kluwer Academic Publishers Dordrecht pp 11ndash43

58 HofackerIL (2003) Vienna RNA secondary structure serverNucleic Acids Res 31 3429ndash3431

1398 Nucleic Acids Research 2005 Vol 33 No 4

Page 2: doi:10.1093/nar/gki267 Modular RNA architecture revealed ... · Modular RNA architecture revealed by computational analysis of existing pseudoknots and ... Samuela Pasquali, ... RNAs

unlabeled trees (17ndash2022ndash24) The use of tree representationshas made possible clustering of related tree structures (1819)detection of point mutations with specific structural effects (18)search of recurring subtrees to deduce consensus structuralmotifs (20) and analysis of secondary structure statistics oflarge ensembles of random RNA sequences (24)

RNA pseudoknots which cannot be represented as treegraphs are common in nature represented by many catalyticRNAs and small subunit (SSU) rRNA and their dominance inthe RNA repertoire universe may increase with RNA size (6)see PseudoBase for a catalogue of known pseudoknots (25)In fact our recent theoretical analysis of the abundance ofRNA types (trees pseudoknots and bridge topologies) predictsthat the collection of possible motifs is dominated bypseudoknots (6)

In this study we develop a structure-based RNA compar-ison algorithm using our 2D RNA dual graphical representa-tion for both trees and pseudoknots (22627) coupled with agraph similarity (isomorphism) search method and sequencealignment analysis Because of available work in the fieldrelated to RNA trees we focus here on uncovering structuraland functional similarities of pseudoknots Our graphicalanalysis of structure similarity emphasizes global topologicalsimilarity a well-known fact for existing functional RNAfamilies such as tRNA rRNA RNase P and snoRNA Thisautomated approach is advantageous for comparing relatedRNAs with a low-sequence similarity A disadvantage ofgraphical analysis is that both detailed base pair and struc-tural information are necessarily ignored For example minorsequence and secondary structure changes in a viral frameshift-ing pseudoknot that preserve the RNA topology but disrupt theRNA function would be considered as a match (2829) Such asubtle structuralndashfunctional relationship may not be amenableto automated analysis A similar problem also arises in struc-ture alignment of protein structures caused by for examplemutations in functional sites (1530) To reduce the error ratesof false functional identification such cases for proteins andRNAs are best screened by manual analysis after the detectionof structural similarity We adopt this two-step strategy for theidentification of structural and functional similarity in RNA

Our systematic comparison of structural similarities within14 existing RNA pseudoknotsmdashincluding viral frameshiftingtRNA-like internal ribosome entry site (IRES) ribosomalRNA (rRNA) and tmRNAmdashunravels eight non-trivialmatches including three novel findings supported by sequence

and functional data Functional relationship can be deducedsubsequent to finding structural matches by analyzingsequence alignment and pseudoknot functional propertiesOne such structural and functional similarity example is analternative hammerhead ribozyme pseudoknot of satellitecereal yellow dwarf virus (PKB173) embedded in the pseu-doknot topology occurring in the 30 end of brome mosaic virusgenomic RNA (PKB134) (see Pair 5 in Table 1) both RNAsare templates for replicase in viral RNA replication Anothernovel similarity example is between the viral IRES (PKB223)and the eukaryotic 18S rRNA (PKB205) (see Pair 10 inTable 1) both participate in translation initiation Yet anotherrelated pair (Pair 6) involves 18S rRNA (PKB205) and viraltRNA-like pseudoknot (PKB134) which recruits hostribosomes for viral replication We find that tRNA-likemolecules possess an 8ndash11 nt region complementary to 18SrRNA (PKB205) suggesting that tRNA-like molecules mayrecruit ribosomes via structural mimicry (of tRNA shape) (31)and base complementarity To the best of our knowledge thestructural and functional relationships identified in these threepseudoknot pairs have not been described previously

In addition we analyze the distribution of small treemodules or submotifs within the Saccharomyces cerevisiae16S and 23S rRNAs using mathematically enumerated (dis-tinct) tree graphs and our submotif search algorithm We findthat most small motifs (ie tree graphs with lt6 vertices or100 nt) exist within rRNAs whereas the larger submotifs(ie graphs with gt9 vertices or 160 nt) occur much lessfrequently Specifically our analysis suggests that rRNAsare constructed from at least 210 small distinct tree modules

In summary these quantitative analyses of similarity amongpseudoknots and modular subunits of rRNAs may be appliedto increasing repertoire of RNA structures and exploitedfor the identification and the modular design of novel RNAstructures

MATERIALS AND METHODS

We describe our computational methods for analyzing andcomparing RNA secondary structures in the following sub-sections Analyzing secondary rather than tertiary structuresas currently performed in many other studies is appropriatebecause the former is conserved for functional RNA classes(eg tRNA 5S rRNA and RNase P) Of course analysis oftertiary folds is a separate and important problem

Table 1 Significant LALIGN aligned pseudoknot pairs

Pair Aligned pair S(G) Sequence identity () Aligned length (nt) Score ltRgt Functional similarity

1 PKB134PKB135 117 56 129 86 154 Known2 PKB205PKB233 047 63 32 50 147 Unknown3 PKB178PKB135 117 59 29 54 142 Unknown4 PKB217PKB174 047 62 47 57 124 Known (Figure 4)5 PKB134PKB173 117 80 19 48 123 Novel (Figure 5)6 PKB134PKB205 117 76 17 49 123 Novel (Figure 5)7 PKB173PKB233 047 63 35 48 120 Unknown8 PKB173PKB191 117 70 20 46 115 Unknown9 PKB178PKB173 117 71 24 38 114 Unknown

10 PKB205PKB223 047 72 14 34 106 Novel (Figure 5)

Local alignment gap penalties 151 S(G) values for simple and double pseudoknots are 047 and 117 respectively ltRgt is R-value averaged over alignmentswith five randomized sequences

Nucleic Acids Research 2005 Vol 33 No 4 1385

Graphical representation of RNA secondary structures

RNA secondary structures can be schematically represented asmathematical graphs to capture the essential topological con-nectivity of loops bulges junctions and stems (220) The useof simplified graphical representation allows RNA structuresto be efficiently analyzed and compared Recently wedeveloped a general class of RNA graphs called dual graphscapable of representing both RNA tree and pseudoknot struc-tures (2) (compare dual and tree graphs in Figures 1 and 2respectively) We use both tree and dual graphs in this workThe rules for mapping RNA secondary structures onto tree anddual graphs are given in (2) and our RNA-As-Graphs webresource (httpmonodbiomathnyuedurna) (2627)

Graph isomorphism search algorithm

The representation of RNAs as graphs provides a systematicframework to search for similar substructures in RNAsthrough the concept of graph isomorphism (32) Isomorphicgraphs share the same pattern of connectivity among all ver-tices The challenge is to identify by an efficient computeralgorithm a graph as a subgraph of a larger graph or in

molecular terms an existing RNA motif contained in a largerRNA The computational complexity of identifying two struc-turally equivalent (ie isomorphic) graphs with V vertices is ofthe order of V factorial (V) and known as the lsquograph isomorph-ism problemrsquo (3233)

Our efficient method for testing graph isomorphism is basedon graph topological numbers or invariants a sketch of thisidea was reported previously [(2) Appendix D] We associateeach graph or subgraph with one or more topological invari-ants which are computed based on the patterns of connectivityamong graph vertices Thus isomorphic graphs have the sametopological invariants while dissimilar graphs have differenttopological invariants The similarity or dissimilarity betweengraphs can be readily established by comparing their topo-logical invariants (17) Our topological invariants allow theidentification of equivalent graphs differing only in the place-ment of chain ends andor hairpin loops as explained below

We define the topological invariants using a graphrsquos con-nectivity as described by the adjacency matrix Given a graphG with V vertices the V middot V adjacency matrix A(G) of thegraph specifies the connectivity between all pairs of verticesFor example matrix element Aij indicates the number of edges

1

2 3

4

6 5

0 1 0 1 0 11 1 1 0 0 00 1 1 1 0 01 0 1 0 1 00 0 0 1 0 31 0 0 0 3 0

A = S = 3988

0 1 2 1 2 11 0 1 2 3 22 1 0 1 2 31 2 1 0 1 22 3 2 1 0 131 2 3 2 13 0

D =

654321

0 2 0 0 0 02 0 2 0 0 00 2 0 2 0 00 0 2 0 2 00 0 0 2 0 20 0 0 0 2 1

A =

0 12 2 3 4 5

2 12 0 12 2 3

4 3 2 12 0 12 5 4 3 2 12 0

3 2 12 0 12 2

12 0 12 2 3 4

D = S = 4640

5

3

3

5

Figure 1 Two hypothetical RNA secondary structures and their dual graphical representations The RNA graphs are quantitatively described using adjacencymatrix A distance matrix D (Equation 1) and topological invariant S (Equation 2) We use these measures to compare and search for similar substructures inexisting RNAs

1386 Nucleic Acids Research 2005 Vol 33 No 4

connecting vertex i with vertex j For RNA dual graphs ele-ments of the adjacency matrix have the following propertiesAij is symmetric the allowed number of edges (connections)between two vertices is 0 1 2 or 3 a vertex can have at mostone self-loop (value of 1) representing a hairpin loop and eachvertex i is connected to four edges representing two incomingand two outgoing connecting strands except where the chainends occur

Our graph invariant is defined based on a modified distance-like matrix D(G) derived from the adjacency matrix A(G) Theelements of symmetric matrix (Dij) are defined as follows

Dij = A1ij for Aij bdquo 0

Dij = dij for Aij = 0

Dii = 0 for all i1

where dij is the minimum number of edges separating verticesi and j ie the number of edges defining the shortest possiblepath between these two vertices We define the graph invariantS(G) as follows

S Geth THORN2 = V1eth THORN1XVifrac141

XVjfrac141

Dij

2

2

Approximately S(G) measures the average distance amongthe vertices in the graph G For example the component Sidefined as

PVjfrac141 Dij measures how well-connected vertex i is

to the other vertices in the graph a peripheral vertex willproduce a higher Si value than a vertex located more in thecenter of the graph Effectively our scheme assigns adjacent

vertices with smaller weights than non-adjacent vertices Thusa large S value corresponds to an extended structure with fewbranches and a small S value to a compact graph with high-order junctions or a complex pseudoknot fold This property ofS can also be used to rank and measure the topological distancebetween graphs In general such topological invariants arebelieved to have a better correspondence with the physico-chemical properties of real molecules since many of the prop-erties depend critically on the exposed surface and lesscritically on the buried elements (34) Figure 1 shows theS values for two distinct RNA topologies

Our distance matrix D(G) and topological invariant S can beextended to tree graphs although their adjacency matriceshave different properties from those for dual graphs Wecall Stree a tree topological invariant which will be used foranalyzing submotifs in 16S and 23S rRNAs Figure 2A com-pares two topologically distinct 10-vertex tree graphs withdifferent Stree values

Our topological invariants [S(G) and Stree] are degeneratedwith respect to the location of chain ends or hairpin loopssince these elements are not scored Thus topologically equi-valent structures (ie same connectivity) embedded within alarger structure can be identified All known topological invari-ants are imperfect since distinct structures can be incorrectlyassigned as identical That is graphs with distinct topologicalinvariants are non-isomorphic but the converse is not true(not all graphs with same topological invariant are iso-morphic) This is the well-known graph isomorphism problem(33) In our analysis of 100 connected graphs with our topo-logical invariant S we only encountered one pair of similar but

1 3 4 5

6

7

9

8 10

2

101 2

34

5

6

78

9

S = 9079 S = 6285

S = 3085

A

B

Figure 2 Tree and dual graphical representations and topological invariants of hypothetical RNA secondary structures (A) RNA tree graphs and their topologicalinvariants Stree (B) The only example we found (out of 100) of two distinct RNA-like pseudoknot graphs with the same topological invariant S the pseudoknotsonly differ by the placement of a stemndashloop Although our structural comparison algorithm regards this structure pair as a positive match in practice such cases areeliminated by manual screening

Nucleic Acids Research 2005 Vol 33 No 4 1387

non-isomorphic graphs (for V = 5) that resulted in the sameinvariant S (see Figure 2B) In a previous work we have alsoused the Laplacian eigenvalue spectrum as a topologicalinvariant yielding a comparable error rate of a few percent(26) Some false positive matches are reported because of thisgraph isomorphism problem Since we manually examine allpositive matches the false positive matches are eliminated

Searching for the motif of a 10-vertex graph in a 30-vertexgraph requires 20 min CPU time on an SGI 300 MHz MIPSR12000 IP27 processor The search of a 15-vertex motif in a40-vertex graph would imply 1000-fold increase in CPU time

In practice our method can easily search (ie in minutes) forsubgraphs of size ranging from 5 to 11 vertices within graphsof size up to 30 vertices (corresponding to 600 nt RNAs)

RESULTS AND DISCUSSION

RNA pseudoknots

We analyze the structural similarity of a set of 14 pseudoknotsranging from 41 to 161 nt Figure 3 shows the secondarystructures and graphical representations of the pseudoknots

Viral frameshiftin g

PKB217

PKB174PKB131

PKB178

PKB223

PKB134PKB135Self cleavingVS ribozyme

PKB205

18S ribosomal RNA

PKB191

PKB173

Hammerhead ribozyme

PKB239

HIV1 5rsquo UTR

frameshiftingViral

Internal ribosome entry siteAptamer

PKB143

tmRNA

PKB210

PKB233

Figure 3 Topological matches generated by the simple (PKB217 PKB233) and double (PKB178) pseudoknots The pseudoknots are shown as schematic secondarystructures and dual graphs There are 2 probe and 11 target pseudoknot topologies representing diverse functional RNA pseudoknots For each probe pseudoknotstructure the corresponding matched submotifssubgraphs in the larger pseudoknots are shaded

1388 Nucleic Acids Research 2005 Vol 33 No 4

selected from the PseudoBase (25) (httpwwwbioleidenunivnl~BatenburgPKBhtml) The RNA pseudoknots are viralframeshifting (PKB174 PKB217 PKB233) viral tRNA-like(PKB134 PKB135 PKB143 PKB191) IRES RNA (PKB223)18S rRNA (PKB205) HIV-1 50-untranslated region (50-UTR)(PKB239) alternative pseudoknot of hammerhead ribozyme(PKB173) aptamer that binds human nerve growth factor(PKB131) tmRNA (PKB210) and self-cleaving Neurosporavs ribozyme (PKB178) (PKB233 and PKB217 appear thesame but have different sequences and this difference affectssubsequent sequence alignment analysis) These pseudoknotclasses are not known to be functionally related We choosemore than one member for some classes because of exhibitedtopological differences

The above pseudoknot structures were derived frommutational analysis (PKB173 PKB174 PKB191 PKB233PKB239) enzymatic and chemical structure probing(PKB131 PKB174 PKB191 PKB239) phylogenetic orsequence comparison analysis (PKB131 PKB135 PKB143PKB205 PKB217 PKB233 PKB234 PKB191 PKB210PKB239) and 3D modeling (PKB134 PKB135 PKB210)The pseudoknots listed in more than one category were probedusing two or more methods

To use these structures for graphical analysis we convertthe base pairing information of each structure provided byPSEUDOBASE to a dual graphical representation using ourdual graph rules [D1ndashD3 (2)] We then use pseudoknot graphsto specify their corresponding adjacency matrices Thesetwo steps are performed manually For RNA tree structureswe use our RNA Matrix program to automatically convert asecondary structure into its corresponding adjacency matrix(2635) The search for a small motif within a larger motif isperformed using input adjacency matrices and the comparisonof matrices is performed via the topological invariant S(Equation 2)

Sequence and structural relationships of functionallyrelated pseudoknots

We illustrate using several pseudoknot pairs the functionalsimilarity identified by topological rather than sequence sim-ilarity Figure 4 compares viral frameshifting pseudoknotstructures Viral RNA frameshifting is a process by whichexpression of overlapping open reading frames (ORFs) canbe achieved (36) We compare three pairs from subfamiliesGag-pro ORF1aORF1b and ORF2ORF3 ribosomalframeshifting Frameshifting signals involve two essentialcomponents a lsquoslipperyrsquo (or sliding) sequence of nucleotideswhere frameshifting takes place and a stimulatory RNAsecondary structure (usually a pseudoknot) located a fewnucleotides downstream Although members of the sameframeshifting subfamilies have significant sequence similarityglobal sequence identity between subfamilies is as low as28 similar to random sequences Because all frameshiftingpseudoknots in Figure 4A have the same topology and func-tion finding topological similarity between RNA structureshas an advantage for functional annotation over sequencecomparison

Figure 4B shows that this finding also holds for a frameshift-ing pseudoknot pair (PKB217 and PKB174 Pair 4 in Table 1)The 72 nt PKB217 is the 1 frameshifting pseudoknot of

lactate dehydrogenase-elevating virus (LDV) (37) Togetherwith its sliding sequence UUUAAAC it regulates the expres-sion of ORF1a for a polyprotein containing proteases andORF1b for an RNA-dependent RNA polymerase in LDVThe 127 nt PKB174 is the 1 frameshifting pseudoknot ofRous sarcoma virus (RSV) (3638) This pseudoknot and slid-ing sequence AAAUUUA regulate the expression of Gag-Pro-Pol and Gag-Pol polyproteins Although both the LDV andRSV frameshifting pseudoknots have a simple pseudoknotstructure the latter has two extra stemndashloops Despite theirfunctional similarity the global sequence identity is only 37[computed using the program ALIGN (39) available at httpworkbenchsdscedu] comparable with random sequencestheir sliding sequences are also dissimilar Using our localsequence alignment parameters we find a large significantmatched region in PKB217 (6833ndash6877) as shown inFigure 4B AAACUGCUAGCCACCUCUGGUCUCGACC-GCUGUACUAGAGGUGG

Although functional similarity between frameshiftingpseudoknots is expected this case illuminates how examiningtopological similarity aids in identifying pseudoknots withsimilar functions despite low overall sequence similarity

Comparison of pseudoknot structures

Our two probe motifsmdasha simple pseudoknot (PKB217 orPKB233) and a double pseudoknot (PKB178)mdashare used tosearch larger RNAs We generate two groups of topologicalmatches as shown in Figure 3 with the matched motifs showndarkened we found 11 such matched pseudoknot pairs Inaddition we analyze sequence similarity of many of the84 (14 middot 132) possible pseudoknot pairs since topologicalsimilarities are apparent within each pseudoknot group gen-erated by the probe motifs Although not exhaustive thisadditional analysis may help to identify pairs having similarsubstructures that are not detected by using fixed probes(simple and double pseudoknots) Table 1 lists 10 pseudoknotpairs (out of the 11 probestructure matches and subset of the84 structurendashstructure analysis) that have significant structuraland sequence similarities Of the total of 10 pairs listed(Pairs 1ndash10) two known functional pairs (1 and 4 in Table 1)are included for comparison with the eight other pairs that arenot previously known to be functionally related

We begin by analyzing the overall patterns of relationshipamong our 14 pseudoknots Figure 3 organizes the matchesinto two groups five matches for the simple pseudoknot(PKB217) and six for the double pseudoknot (PKB178)The matched motifs appear to be lsquomodular constructionsrsquoaround the probe motifmodule with added stemndashloops(three or fewer) To the best of our knowledge these resultsreveal modular features and topological similarities of RNApseudoknots that have not been described

The biological significance of topological matches dependson the size and complexity of pseudoknots a match for thedouble pseudoknot is probably more significant than for asimple pseudoknot The six pseudoknots generated by thedouble pseudoknot probe PKB178 show striking topologicalas well as functional similarity (eg viral tRNA-like PKB134and PKB135) we analyze their similarity with PKB178 andamong themselves Still the matched pairs of RNAs are func-tionally diverse besides PKB134 and PKB135 such examples

Nucleic Acids Research 2005 Vol 33 No 4 1389

of diverse RNAs include IRES (PKB223) RNA 18S rRNA(PKB205 PKB234) HIV-1 50-UTR (PKB239) and alternativepseudoknot of hammerhead ribozyme (PKB173) In this dou-ble pseudoknot group six pairs (1 3 5 6 8 and 9 in Table 1)produce local sequence alignments [using LALIGN (40)

httpworkbenchsdscedu] that suggest significant resultsIn particular the pair PKB134PKB173 (Pair 5 in Table 1)indicates a significant match The pairs PKB134PKB205(Pair 6) and PKB205PKB223 (Pair 10 in Table 1) alsodisplay intriguing sequence and functional similarities

A A A A A A C G G G A A G C A A G G G C U

GGGA

CA A

CCCU

A

A

A

U

GA U

C C C G G

G

A AACC

A

U

PKB 3

Gag-polglobal sequence identity 57

PKB 128U U U A A A C U G U U G A G A G G U G C C U G

U C U C U A C G G A C

GCCUG

GA G

CGGAC

UU U

G

U AUA

G

A

C

PKB 217U U U A A A C U G C U A G C C A C C U C U G G U

G G U G G A G A U C A

GACCGC

CU C C

UGGCG

CU

U

G

G

GCA

U

G GY

AG

CG

ORF 1a - ORF 1bglobal sequence identity 53

G G G A A A C G G G C A G G C G G

GCG

C GCGC

A

G

A

C C G C C

AC

AA

C

PKB 44

G G G A A A U G G A C U G A G G C G

GGCG

C CGGC

A

C

A

C G C C

A

CA

A

C

PKB 240

A

A

ORF2 - ORF3global sequence identity 79

global sequence identity as low as 28

A

U CC

G

C

AA A

C

U

C

GG

C

U

G

UU

AA

PKB 217GCUGACGGUGUYUGGCG

UA

CUC

UG

UG A C C G C

C U G G C G

CC

CCUCUGGU

GUGGAGA

CA

A

U

GU U U A A A C U G C U A G C 5

3

PKB 174

G A AG

UG

AA

AC

U

U

A

U

UG

C A

UUG

UC

UC

U

C

A

U A

U

AGGG

UCCCU

CGGGCCACUG

CCCGGUGAC

G

CC

GU

GU

G

A

CA

C

G C G C U A C A

C G C G A U G U

A G A C C G

U C U G G C

A

2603 3

52476 A A A U U U A U

6830

BPair 4

global sequence identity 37

G G G A A A C U G G A A G G C G G G G C

GCUGCA

GA C

GACUG

G

A

A

C C C C G

PKB 4

C

AA

AUA

UG

Figure 4 Sequence and structural comparisons of pseudoknots in the frameshifting functional families (A) Two secondary structures and corresponding topologiesfor each of the three viral frameshifting subfamilies Gag-pro ORF1aORF1b and ORF2ORF3 Sequence identity within the same subfamily ranges between 50 and80 it is as low as 28 when comparing pseudoknots from different subfamilies All frameshifting pseudoknots are functionally related and have the same topology(dual graph) (B) Secondary structures and their aligned nucleotides (color coded) for frameshifting pseudoknots of LDV (PKB217) and Rous sarcoma virus(PKB174) Global sequence alignments were performed using the program ALIGN with default (164) gap openinggap extension penalties

1390 Nucleic Acids Research 2005 Vol 33 No 4

These relationships suggest structural and functional similar-ities that are not previously known (Table 1)

To support the significance of these matches we performsequence alignments using SmithndashWaterman parameters forgap opening and gap extension penalties (41) We choose thegap opening penalty value of 15 from a range of 0 to 19 Toproduce better alignments for large sequence segments (shortalignments are less likely to be significant in our context) ourgap extension penalty of 1 from a range of 1 to 8 allowslarger extensions consistent with the possibility of the inser-tion of long sequences and added 2D motifs In Table 1 wealso report the alignment score which is the sum of nucleotidematches mismatches and gap penalties

Table 1 shows that the probe structure (PKB178) matchestwo other pseudoknots (PKB135 PKB173) with sequenceidentities ranging from 59 to 71 for 24ndash49 nt alignmentlengths (compared with 56 sequence identity over 129 ntfor the functionally related pair PKB134PKB135 Pair 1)Significantly the pairs PKB134PKB173 (Pair 5) PKB134PKB205 (Pair 6) and PKB205PKB223 (Pair 10) havesequence identity range of 72ndash80 The non-randomness ofthe 10 aligned pairs in Table 1 is also evident in the ratio Rwhich measures the level of significance above the randomexpectation and has value 1 for random pairs

R = (percent sequence identity)(percent sequence identitywhen one of the aligned sequences is randomized)

We also define ltRgt as the R-value averaged over align-ments with different randomized sequences Table 1 showsthat some pseudoknot pairs with no known functional rela-tionship have ltRgt values ranging between 106 and 147whereas the related pair PKB134PKB135 has a value of154 many matched pairs have ltRgt values lt1 Thus thetopological similarities we identify led to matched pairswith non-random sequence alignment results Non-randomsequence matches may arise from topological andor func-tional similarities as illustrated below for three pseudoknotpairs (5 6 and 10 in Table 1)

The rate of false positives generated by topological and ltRgtvalue screening can be easily estimated by considering allpossible pairs from the set of six matched double pseudo-knots in Figure 3 These pseudoknots form 15 distinctpairs 6 of which have ltRgt values gt114 [Table 1 withS(G) = 117] Table 1 shows that only four pairs have largeltRgt values (gt12) Pair 1 (PKB134PKB135) is related Pair 5(PKB134PKB173) and Pair 6 (PKB134PKB205) are mostprobably related and Pair 3 (PKB178PKB173) is of unknownrelationship (in fact both are self-cleaving ribozymes)Assuming that the Pair 3 is unrelated and therefore reflectsa false positive we have an error rate of 115 or 7 If theinferred functionally related Pairs 5 and 6 are also consideredas false positives we then have a conservative error rateof 20

Functional relationship between PKB134 and PKB173(Pair 5 in Table 1)

The 116 nt PKB134 pseudoknot occurs in the 30 end of bromemosaic virus genomic RNA (31) The 73 nt PKB173 is ahammerhead ribozyme of satellite cereal yellow dwarfvirus-RPV (satRPV) RNA capable of adopting an alternativepseudoknot conformation by forming an (L1ndashL2a) stem whose

existence is established by experimental mutational analysis(42) Figure 3 shows that their topologies have the same lsquocorersquobut PKB134 has two extra stemndashloops The local sequencealignment of this pair using LALIGN (Figure 5) shows 19bases aligned to 80 sequence identity and an ltRgt valueof 123 (Table 1) There are two significant segments inPKB173 that are matched with those in PKB134 UACUGU-CUGACGACGUAUCC (nt 11ndash30) and UAGAAGGCUG-GUGCC (nt 39ndash53) These segments and their matchedregions in PKB134 span the stem and unpaired elements ofthe secondary structures (see Figure 5) Significantly bothPKB134 and PKB173 pseudoknots serve as template for rep-licase a protein that recognizes specific sequence regions toinitiate replication of viral genomes (3142) Moreover a con-served sequence (CUGANGA) of PKB173 is in the alignedregion (nt 11ndash30) Thus our analysis suggests that PKB134and PKB173 pseudoknots are structurally and functionallyrelated

Further analysis of pseudoknots PKB134 and PKB173reveals specific domains interacting with the replicase Experi-ments have shown that mutations in domains A C B1 and B2[defined in (31)] of PKB134 impair viral RNA replication (43)Significantly our two aligned sequence regions span domainsA B1 C and also D Thus the replicase recognizes severaldomains in the PKB134 pseudoknot structure In PKB173 thepseudoknot forming five base pair helix (GCGCG) is implic-ated in replicase recognition or ligation (42) This pseudoknotalso has conserved sequence regions ACAAA (61ndash65) or itscomplement (possibly the replication origin) and AGAAA(nucleotides attached to the 30 end but not shown) Howeverthese regions are not observed in PKB134

Functional relationship between PKB205 and PKB223(Pair 10 in Table 1)

The 63 nt PKB223 is the pseudoknot of IRES of capsid-producing cricket paralysis-like viruses (CrPV) formethionine-independent initiation of translation (44) Appear-ing immediately upstream from the capsid-coding sequenceit initiates translation without a canonical AUG (methionine)initiation codon The 48 nt PKB205 is the pseudoknotsubstructure (V4 domain) of SSU of Palmaria palmata 18SrRNA which was found using extensive comparative analysis(4546) Figure 5 (Pair 10 lower panel) shows that PKB223 isa 4-vertex simple pseudoknot with two inserted stemndashloopsand PKB205 is a 4-vertex double pseudoknot with an insertedstemndashloop Figure 5 (middle panel) displays the local sequencealignment of PKB223 and PKB205 yielding a match of 14 ntwith 72 sequence identity This significant sequence simil-arity occurs in regions ACAAUAUCCAGGAA (6148ndash6161)and ACAUUAGCAUGGAA (765ndash778) of PKB223 andPKB205 respectively Remarkably we find that the 9 ntsegment 50-UUAGGUUAG-30 (nt 6100ndash6108) of PKB223is complementary to 30-GGUACGAUU-50 (nt 768ndash776) ofPKB205 except at G6103 Our finding corroborates withthe experimental identification that IRES elements in cellularmRNA contain a short 9 nt (nt 133ndash141 CCGGCGGGU) inthe 50-UTR of the homeodomain protein Gtx which is com-plementary to nt 1132ndash1124 of 18S rRNA (47) Indeed manysuch near complementary sequences have been found bet-ween the 30 end of 18S rRNA and picornavirus RNAs (48)

Nucleic Acids Research 2005 Vol 33 No 4 1391

Fig

ure

5

Th

ree

no

vel

fun

ctio

nal

lyre

late

dp

seu

do

kn

otp

airs

infe

rred

by

top

olo

gy

mat

chin

gan

dse

qu

ence

alig

nm

ent

Up

per

pan

els

Sec

ond

ary

stru

ctu

res

and

thei

ral

ign

edn

ucl

eoti

des

(co

lor

cod

ed)

for

pse

udo

kn

ot

pai

rsP

KB

13

4P

KB

17

3(P

air

5T

able

1

left

pan

el)

PK

B2

05

PK

B2

23

(Pai

r1

0T

able

1

mid

dle

pan

el)

and

PK

B1

34

PK

B20

5(P

air

6

rig

ht

pan

el)

PK

B1

34

occ

urs

in30 e

nd

of

bro

me

mosa

icv

iru

sg

eno

mic

RN

A

PK

B1

73

isa

pse

udo

kn

oto

fsa

tell

ite

cere

aly

ello

wd

war

ftv

iru

s-R

PV

RN

AP

KB

22

3is

the

pse

ud

ok

no

tofin

tern

alen

try

site

(IR

ES

)o

fca

spid

-pro

du

cin

gC

rPV

san

dP

KB

20

5is

the

pse

udo

kn

ots

ub

stru

ctu

re(V

4d

om

ain

)o

fS

SU

of

PP

alm

ata

18

SrR

NA

L

ow

erp

anel

sT

he

alig

ned

sub

gra

ph

so

rre

gio

ns

are

shad

edo

rm

ark

ed

1392 Nucleic Acids Research 2005 Vol 33 No 4

Intriguingly for CrPVs the complementary segments occur inpseudoknots with similar folds not involving the dangling30 end of 18S rRNA Base pair complementary between IRESelements of mRNA and 18S rRNA is a mechanism for ribo-some recruitment and translation In viral IRES elements thesecondary and tertiary structures are believed to play criticalroles in translation initiation (44)

Functional relationship between PKB205 and PKB134(Pair 6 in Table 1)

The functional relationsip of the pair PKB205PKB134 issimilar to that for PKB205PKB223 above PKB134 belongsto a common viral tRNA-like functional class whose role is torecruit host ribosomes to initiate viral protein synthesis (49)It is known to interact with the ribosome via tRNA shapemimicry (31) Our analysis indicates that tRNA-like mole-cules also contain a region which is nearly complementary toPKB205 a pseudoknot region of the 18S rRNA Table 2shows regions of PKB205 complementary to four tRNA-like molecules (PKB134 191 138 17) the complementaryregion in PKB223 is shown for comparison Significantlyeach of the pseudoknots PKB223 PKB134 and PKB191has a region complementary to the same 8ndash9 nt unpairedregion in PKB205 (767ndash776) This single-stranded regionis most probably available for base pairing with incomingtRNA-like molecules In addition two other regions inPKB205 are complementary to tRNA-like pseudoknotsPKB138 and PKB17 one region is a hairpin loop and anotherregion is part of short stems Thus this analysis suggests thattRNA-like molecules may employ both structural mimicry andsequence complementarity to recruit SSU rRNA an interac-tion mode similar to tRNA PKB134 also has a region UACA-GUGUUGAAAAACA (nt 2078ndash2094) closely matched tothe region UAGAGUGUUCAAAGCCA (nt 732ndash748) inPKB205 with an average R-value of 123

Tertiary structure similarity of topological matches inribosomal RNAs

The preceding examination of three pseudoknot pairs showsthat topological matches can suggest functionally relatedpseudoknots It is also important to show that novel topo-logical matches led to similar tertiary structures We focusfor this purpose on rRNA since several bacterial and archaealribosome structures have been solved in recent years In factin our previous work we have found several topologicalmatches of the Scerevisiae 5S rRNA motif within its largerrRNAs two within 16S rRNA (domains II and III) and threewithin 23S rRNA (domains III IV and VI) (2) From availableX-ray structures for Thermus thermophilus 16S rRNA (PDBaccession no 1fjg) (50) and 23S rRNA from Deinococcus

radiodurans (1nkw) and Haloarcula marismortui (1ffk)(5152) only 5S rRNA motifrsquos matches in domain II of16S rRNA and domain III of 23S rRNA are relevant theother three cases are not valid matches in these species

Figure 6 displays the secondary diagrams graphical repres-entations and tertiary structures of 5S rRNA for bacteriumDradiodurans (1nkw) and archaea Hmarismortui (1ffk)and their topologically matched modules within 16S and23S rRNAs the secondary diagrams are adapted from thosein Gutellrsquos comparative RNA website (httpwwwrnaicmbutexasedu) (53) The module in 23S is an identicalmatch for the 6-vertex 5S graph except for a hairpin loopwhich does not contribute to the topological invariant SThe 16S module is a 5-vertex graph because a 2 bp stem atthe junction has a non-canonical base pair AG (dual graph ruleD1 defines a stem as having at least two canonical andorwobble base pairs) for Scerevisiae this stem consists ofonly canonical base pairs making the 16S module an identicalmatch for the 5S topology As shown in Figure 6 the 16S and23S modules have a similar L-shaped tertiary fold as the 5Sstructure Tertiary structure similarity with 5S is more strikingfor the module of domain II (nt 653ndash753) of Tthermophilus16S rRNA than for the module of domain III (nt 1033ndash1143) of23S rRNA Both the 5S structure and 16S module have apronounced 3-stem junction where the chain ends of the16S module are located Structural (backbone) overlap ofthe 5S structure and 16S module shows similar structuralfeatures except for the orientations of the smaller stems atthe junction (We aligned the structures manually because thechain ends of 5S and 16S module occur at non-correspondinglocations) In contrast the 23S module shows significant dis-tortions on the shorter helical arm with the hairpin loop foldedback to the helix as shown in the structural overlap of the 5Sstructure and 23S module Structural similarities seen in theseexamples are probably due to geometric constraints which areapparent in their secondary diagrams and graphical representa-tions Thus 2D topological matches can lead to similar 3Dstructures Below we survey the distribution of modules inrRNAs

Distribution of submotifs in 16S and 23S ribosomalRNAs

Characterizing the submotifs within the 16S and 23S rRNAsthe largest known RNA molecules can help uncover themodular construction of RNA structures as analysis ofrRNArsquos tertiary modules has revealed Furthermore our pre-vious analysis of rRNArsquos structural motifs indicated existenceof characteristic features (35) For example the distributionsof pairedunpaired bases in stems bulgesinternal loops hair-pin loops and junctions follow specific functional forms

Table 2 Short sequence regions in PKB205 complementary to tRNA-like molecules

PKB205 regions Complementary sequences

30-GGUACGAUU-50 (768ndash776) PKB223 50-UUAGGUUAG-30 (6100ndash6108) IRES30-GUGAGAUU-50 (769ndash776) PKB134 50-CACUGUAAA-30 (113ndash120) tRNA-like30-UACGAUUAC-50 (767ndash775) PKB191 50-AUGCUCAUG-30 (77ndash85) tRNA-like30-UGUGAGAUU-50 (731ndash739) PKB138 50-ACACUUUAA-30 (38ndash47) tRNA-like30-GUUUGCGGACC-50 (746ndash756) PKB17 50-CAAAACCCUGG-30 (19ndash29) tRNA-like

Nucleic Acids Research 2005 Vol 33 No 4 1393

Figure 6 Comparison of the graph topologies and 2D and 3D structures of 5S rRNAs (A and B) and modules in 16S and 23S rRNAs (C and D) (C and D) Twostructural (backbone) overlaps of the 5S structure (1ffk gold color) with 16S module (1fjg blue) and 5S structure with 23S module (1nkw green) are shown Thesecondary diagrams are adapted from those in Gutellrsquos comparative RNA website (httpwwwrnaicmbutexasedu) and 3D structures are from the Protein DataBank (httpwwwrcsborgpdb) The 2D modules were identified computationally by our earlier work on finding120 nt 5S rRNA motifs in 16S and 23S rRNAs ofScerevisiae(2) The equivalent tertiary structures and modules for 5S 16S and 23S rRNAs from other organisms show that similar graph topologies can lead tosimilar secondary and tertiary structures All structures have similar graph topologies The Tthermophilus 16S module is a 5-vertex instead of 6-vertex graph becausea 2 bp stem at the junction has a non-canonical base pair AG for Scerevisiae this stem consists of only canonical base pairs making the 16S module an identicalmatch for the 5S topology

1394 Nucleic Acids Research 2005 Vol 33 No 4

Moreover the percentages of nucleotides in stems bulgesinternal loops hairpin loops and junctions do not vary for16S and 23S rRNAs even though the latter is twice thesize of the former Such motif characteristics help discriminateRNA-like from non-RNA-like molecules and guide the designof novel RNAs Here we analyze the distinct modular unitsmdashsmall tree submotifsmdashthat makeup the rRNAs to advance suchapplications Since all small tree graphs up to 10 vertices havebeen exhaustively enumerated (5455) we can use our sub-structure search algorithm to identify the occurrences of alldistinct tree submotifs in rRNAs

Specifically we use sets of tree graphs for V = 4 5 6 7 8 9and 10 comprising 2 3 6 11 23 47 and 106 graphs respect-ively for a total of 198 tree motifs The complete enumeratedmotifs can be found in the RAG web resource (httpmonodbiomathnyuedurna) Figure 7 shows as an example the11 possible 7-vertex tree graphs and lists the frequency ofoccurrence of each tree graph in rRNAs Remarkably thesame tree motifs are abundant in both 16S and 23S rRNAsThe most abundant (10 or 15 times) 7-vertex tree has two3-stem junctions (topological invariant Stree = 6160) 4- and5-stem junctions are the next most prevalent submotifs(Stree = 5137 5921 and 5667) Interestingly the unbranchedtree (Stree = 7219) and highly branched trees (Stree = 4509)are rare

Figure 8 plots the percentage of distinct tree graphs in eachset V (fraction of motifs identified relative to possible motifs)

found in rRNA For small tree graphs V = 4 5 and 6 allpossible topologies are present but not for V gt 6 In fact thepercentage of trees found declines with V In 23S rRNA90 of possible trees are found for V = 7 and only40 (representing 47 tree topologies) are found forV = 10 Larger trees (V gt 10) are expected to have muchsmaller percentages This percentage decline is expectedsince small tree modules are less specific than large treesIn total 111 submotifs out of 198 are identified The observedtrends in Figure 8 may be generally valid for rRNA structuresof other organisms since they are highly conserved althoughsome statistical fluctuations are expected due to structuraldifferences and RNArsquos finite size

These data show that RNA structures are modular constructsof relatively few (tree) building blocks Extrapolation of thecurves in Figure 8 suggests that 20 of V = 11 motifs (totalof 235 trees) and 10 of V = 12 motifs (total of 551 trees)probably exist in rRNAs Thus the total number of buildingblocks in rRNAs with V lt 13 would be 210 Although thepercentages of trees with higher V values are low they aremore numerous and will contribute to the overall number ofbuilding blocks We thus consider 210 as the minimum num-ber of building blocks The building block number may beexploited in the future design of novel functional RNAmolecules using modular assembly from 210 distinct treemotifs The tree motifs may be taken from substructures ofexisting RNAs allowing simultaneous sequence and structure

treeS =

16S 23SOccurrence

ID

6 4

)

0 1

(7 8)

6191

(7 11)

4509

0 1

(7 9)

5385

3 3

7

(7 10)

5137

5 5

(7 6)

5916

5

(7 5)

5921

(7 1)

0 0

7219

(7 3)

2 4

6446

4 4

(7 7)

5667

10 15

(7 4

6160

0 3

6692

(7 2)

Figure 7 Occurrence frequency of 11 distinct 7-vertex tree topologies as submotifs of Scerevisiae 16S and 23S rRNAs The tree graphs are labeled using topologicalinvariant Stree (second top row) and ID number (i j) (first row) which refers to motif identification number as used in RAG database (httpmonodbiomathnyuedurna) The two numbers at the bottom of each graph refer to the number of times that topology is found embedded in 16S and in 23S rRNAs respectively

Nucleic Acids Research 2005 Vol 33 No 4 1395

design Such a build-up approach appears feasible from our(26) and other works (35) and is akin to the lsquobuilduprsquo tech-nique for proteins pioneered by Vasquez and Scheraga (56)

The above estimate of building block number can be eithera consequence of the limited number of available structures orthe fact that RNA adopts only a subset of possible graphsCertainly future discoveries of novel RNA structures mightindicate a larger repertoire of building blocks Still our sur-veys of natural RNAs (26) suggest that RNAs possess specifictopological characteristics and favor a subset of the possiblegraph motifs The missing motifs may not exist in nature forRNA owing to physical or functional constraints

The building block number can also be defined in terms ofirreducible graphs ie trees that cannot be decomposed intosmaller trees However this approach has a major drawbackSince all RNAs are composed of connected stems their sub-structures can be reduced to two vertex graphs which is not aninformative description of complex RNA folds Still if wedefine 4-vertex trees as the smallest irreducible graphs thenumber of irreducible graphs can be calculated Based on oursubmotif data we estimate that for V up to 10 there are about42 irreducible tree modules in rRNAs

SUMMARY AND CONCLUSION

Our automated comparison of RNA secondary structuresbased on graphical representation allows the examination ofconserved 2D RNA topological features which in turn suggestfunctional relationships This approach has an advantage oversequence alignment because functionally related RNAs oftenlack sequence similarity as shown for frameshifting pseudo-knots with low global sequence identities Of course ourstructure alignment approach relies on available 2D RNA

structures and thus will increase in scope as more RNAstructures become available Our eight identified pairs ofpotentially related pseudoknots (Table 1) reveal three pairsthat are functionally related the pseudoknots of 30 end ofbrome mosaic virus genomic RNA (PKB134) and alternativehammerhead ribozyme pseudoknot (PKB173) are templatesfor replicase to initiate replication of viral genomes the pseu-doknots of internal entry site (IRES) of caspid-producingCrPVs (PKB223) and SSU of 18S rRNA (PKB205) are asso-ciated with translation initiation and the pseudoknots of 18SrRNA (PKB205) and viral tRNA-like (PKB134) most prob-ably interact via both structural mimicry and base comple-mentarity To the best of our knowledge these functionalrelationships have not been described previously

The developed RNA comparison algorithm also led us toquantify the modularity of 16S and 23S rRNAs using smallenumerated tree motifs Our results suggest that rRNAs areconstructed from at least 210 distinct tree subtopologies(or modules) having up to 180 nt Significantly we findthat 5S rRNA and two tree modules within 16S and 23SrRNAs with similar topologies have similar tertiary shapesaffirming the relationship between 2D graphical representa-tion and 3D structure The tree modules in rRNA may be usedas building blocks for the design of novel RNA folds A similaridea has been employed to design and improve functionalRNA molecules (510) Design by assembly of modulesexploits motifs present in natural RNAs unlike de novo designof sequences and structures (11) The issues of discoveringnovel topologies (6) building module library and designingfunctional folds are likely to increase in importance withinterest in novel RNAs

Our algorithm has several limitations Of course its utilitydepends on availability of reliable RNA secondary foldingalgorithms (5758) More fundamentally although simplifiedgraphical representations specify topological characteristicsdetailed sequence and motif information is missing An asso-ciated problem is that unusual RNA structurendashfunction rela-tionships due to deleterious mutations as in frameshiftingpseudoknots (28) and structural flexibility are not capturedMore advanced graph constructs such as weighted graphs mayhowever specify detailed motif features at the secondary andbase pair levels Another limitation of the present algorithmis the requirement to fix the probe structure or graph As inprotein structure alignment algorithms (15) future refinementsof our algorithm should allow some flexibility in motifdefinitions as the similarity search proceeds We inviteRNA scientists to visit our RAG web resource at httpmonodbiomathnyuedurna and suggest to us specific enhancementsthat will be useful in practice

ACKNOWLEDGEMENTS

We thank Dinshaw Patel for comments that motivated thiswork This research was supported by a Joint NSFNIGMSInitiative in Mathematical Biology (DMS-0201160) and theHuman Frontier Science Program (HFSP)

REFERENCES

1 TinocoI and BustamanteC (1999) How RNA folds J Mol Biol293 271ndash281

Figure 8 Distribution of all possible distinct tree modules with V vertices foundin Scerevisiae 16S and 23S rRNAs In graph theory each V is represented by aset of possible tree structures For example for V= 4 5 6 7 8 9 and 10 there are2 3 6 11 23 47 and 106 possible trees respectively The distribution of treesquantifies the diversity of tree modules within rRNAs This analysis suggeststhat rRNAs possess at least 210 tree modules

1396 Nucleic Acids Research 2005 Vol 33 No 4

2 GanHH PasqualiS and SchlickT (2003) Exploring the repertoire ofRNA secondary motifs using graph theory implications for RNA designNucleic Acids Res 31 2926ndash2943

3 SoukupGA and BreakerRR (1999) Engineering precision RNAmolecular switches Proc Natl Acad Sci USA 96 3584ndash3589

4 BreakerRR (2002) Engineered allosteric ribozymes as biosensorcomponents Curr Opin Biotechnol 13 31ndash39

5 IkawaY FukadaK WatanabeS ShiraishiH and InoueT (2002)Design construction and analysis of a novel class of self-foldingRNA Structure 10 527ndash534

6 KimN ShiffeldrimN GanHH and SchlickT (2004) Candidatesfor novel RNA topologies J Mol Biol 341 1129ndash1144

7 MitchellJR ChengJ and CollinsK (1999) A box HACAsmall nucleolar RNA-like domain at the human telomerase RNA 30

end Mol Cell Biol 19 567ndash5768 DoudnaJA and CechTR (2002) The chemical repertoire

of natural ribozymes Nature 418 222ndash2289 StorzG (2002) An expanding universe of noncoding RNAs Science

296 1260ndash126310 BreakerRR and JoyceGF (1994) Inventing and improving ribozyme

function rational design versus iterative selection methods TrendsBiotechnol 12 268ndash275

11 AndronescuM FejesAP HutterF HoosHH and CondonA (2004)A new algorithm for RNA secondary structure design J Mol Biol336 607ndash624

12 CarterRJ DubchakI and HolbrookSR (2001) A computationalapproach to identify genes for functional RNAs in genomic sequencesNucleic Acids Res 29 3928ndash3938

13 DoudnaJA (2000) Structural genomics of RNA Nature Struct Biol7 954ndash956

14 RivasE KleinRJ JonesTA and EddySR (2001) Computationalidentification of noncoding RNAs in Ecoli by comparativegenomics Curr Biol 11 1369ndash1373

15 HolmL and SanderC (1996) Mapping the protein universeScience 273 595ndash602

16 ShindyalovIN and BournePE (2001) A database and tools for 3-Dprotein structure comparison and alignment using the CombinatorialExtension (CE) algorithm Nucleic Acids Res 29 228ndash229

17 BenedettiG and MorosettiS (1996) A graph-topological approach torecognition of pattern and similarity in RNA secondary structuresBiophys Chem 59 179ndash184

18 MargalitH ShapiroBA OppenheimAB and MaizelJVJr (1989)Detection of common motifs in RNA secondary structures Nucleic AcidsRes 17 4829ndash4845

19 ShapiroBA and ZhangKZ (1990) Comparing multiple RNAsecondary structures using tree comparisons Comput Appl Biosci6 309ndash318

20 LeSY NussinovR and MaizelJV (1989) Tree graphs of RNAsecondary structures and their comparisons Comput Biomed Res22 461ndash473

21 DuarteCM WadleyLM and PyleAM (2003) RNA structurecomparison motif search and discovery using a reduced representation ofRNA conformational space Nucleic Acids Res 31 4755ndash4761

22 ChevaletC and MichotB (1992) An algorithm for comparing RNAsecondary structures and searching for similar substructures ComputAppl Biosci 8 215ndash225

23 WangL and ZhaoJ (2003) Parametric alignment of ordered treesBioinformatics 19 2237ndash2245

24 FontanaW KoningsDAM StadlerPF and SchusterP (1993)Statistics of RNA secondary structures Biopolymers 33 1389ndash1404

25 van BatenburgFHD GultyaevAP PleijCWA NgJ andOliehoekJ (2000) PseudoBase a database with RNA pseudoknotsNucleic Acids Res 28 201ndash204

26 GanHH FeraD ZornJ ShiffeldrimN TangM LasersonUKimN and SchlickT (2004) RAG RNA-As-Graphsdatabasemdashconcepts analysis and features Bioinformatics20 1285ndash1291

27 FeraD KimN ShiffeldrimN ZornJ LasersonU GanHH andSchlickT (2004) RAG RNA-As-Graphs web resource BMCBioinformatics 5 88

28 ChenXY KangHS ShenLX ChamorroM VarmusHE andTinocoI (1996) A characteristic bent conformation of RNA pseudoknotspromotes 1 frameshifting during translation of retroviral RNAJ Mol Biol 260 479ndash483

29 KangHS HinesJV and TinocoI (1996) Conformation of anon-frameshifting RNA pseudoknot from mouse mammary tumor virusJ Mol Biol 259 135ndash147

30 GanHH PerlowRA RoyS KoJ WuM HuangJ YanSNicolettaA VafaiJ SunD et al (2002) Analysis of protein sequencestructure similarity relationships Biophys J 83 2781ndash2791

31 FeldenB FlorentzC GiegeR and WesthofE (1994) Solutionstructure of the 30-end of brome mosaic-virus genomic RNAsmdashconformational mimicry with canonical transfer-RNAs J Mol Biol235 508ndash531

32 GrossJL and YellenJ (1999) Graph Theory and its Applications CRCPress Boca Raton FL

33 KoblerJ SchoningU and ToraanJ (1993) The Graph IsomorphismProblem Birkhauser Boston MA

34 RandicM and ZupanJ (2001) On interpretation of well-knowntopological indices J Chem Inform Comput Sci 41 550ndash560

35 ZornJ GanHH ShiffeldrimN and SchlickT (2004) Structuralmotifs in ribosomal RNAs implications for RNA design and genomicsBiopolymers 73 340ndash347

36 JacksT MadhaniHD MasiarzFR and VarmusHE (1988) Signalsfor ribosomal frameshifting in the Rous-Sarcoma virus GagndashPolregion Cell 55 447ndash458

37 GodenyEK ChenL KumarSN MethvenSL KooninEV andBrintonMA (1993) Complete genomic sequence and phylogeneticanalysis of the lactate dehydrogenase-elevating virus (Ldv) Virology194 585ndash596

38 MarczinkeB FisherR VidakovicM BloysAJ and BrierleyI (1998)Secondary structure and mutational analysis of the ribosomalframeshift signal of Rous sarcoma virus J Mol Biol 284 205ndash225

39 MyersEW and MillerW (1988) Optimal alignments in linear spaceComput Appl Biosci 4 11ndash17

40 HuangXQ and MillerW (1991) A time-efficient linear-space localsimilarity algorithm Adv Appl Math 12 337ndash357

41 SmithTF and WatermanMS (1981) Identification of commonmolecular subsequences J Mol Biol 147 195ndash197

42 SongSI SilverSL AulikMA RasochovaL MohanBR andMillerWA (1999) Satellite cereal yellow dwarf virus-RPV (satRPV)RNA requires a DouXble hammerhead for self-cleavage and analternative structure for replication J Mol Biol 293 781ndash793

43 DreherTW and HallTC (1988) Mutational analysis of thetransfer-RNA mimicry of brome mosaic-virus RNAmdashsequence andstructural requirements for aminoacylation and 30-adenylationJ Mol Biol 201 41ndash55

44 KanamoriY and NakashimaN (2001) A tertiary structure model of theinternal ribosome entry site (IRES) for methionine-independent initiationof translation RNA 7 266ndash274

45 WuytsJ De RijkP Van de PeerY PisonG RousseeuwP andDe WachterR (2000)Comparativeanalysisof more than 3000 sequencesreveals the existence of two pseudoknots in area V4 of eukaryotic smallsubunit ribosomal RNA Nucleic Acids Res 28 4698ndash4708

46 Van de PeerY De RijkP WuytsJ WinkelmansT and De WachterR(2000) The European small subunit ribosomal RNA databaseNucleic Acids Res 28 175ndash176

47 ChappellSA EdelmanGM and MauroVP (2000) A 9-nt segment ofa cellular mRNA can function as an internal ribosome entry site (IRES)and when present in linked multiple copies greatly enhances IRESactivity Proc Natl Acad Sci USA 97 1536ndash1541

48 ScheperGC VoormaHO and ThomasAA (1994) Basepairing with18S ribosomal RNA in internal initiation of translation FEBS Lett352 271ndash275

49 BarendsS Rudinger-ThirionJ FlorentzC GiegeR PleijCW andKraalB (2004) tRNA-like structure regulates translation of Bromemosaic virus RNA J Virol 78 4003ndash4010

50 CarterAP ClemonsWM BrodersenDE Morgan-WarrenRJWimberlyBT and RamakrishnanV (2000) Functional insights fromthe structure of the 30S ribosomal subunit and its interactions withantibiotics Nature 407 340ndash348

51 HarmsJ SchluenzenF ZarivachR BashanA GatS AgmonIBartelsH FranceschiF and YonathA (2001) High resolution structureof the large ribosomal subunit from a mesophilic Eubacterium Cell 107679ndash688

52 BanN NissenP HansenJ MoorePB and SteitzTA (2000) Thecomplete atomic structure of the large ribosomal subunit at 24 angstromresolution Science 289 905ndash920

Nucleic Acids Research 2005 Vol 33 No 4 1397

53 CannoneJJ SubramanianS SchnareMN CollettJRDrsquoSouzaLM DuYS FengB LinN MadabusiLV MullerKMet al (2002) The Comparative RNA Web (CRW) Site an online databaseof comparative sequence and structure information for ribosomal intronand other RNAs BMC Bioinformatics 3 2

54 HararyF and PrinsG (1959) The number of homeomorphicallyirreducible trees and other species Acta Math 101141ndash162

55 HararyF (1969) Graph Theory Addison-Wesley Reading Mass

56 VasquezM and ScheragaHA (1985) Use of buildup andenergy-minimization procedures to compute low-energy structures ofthe backbone of enkephalin Biopolymers 24 1437ndash1447

57 ZukerM MathewsDH and TurnerDH (1999) Algorithms andthermodynamics for RNA secondary structure prediction A practicalguide In BarciszewskiJ and ClarkBFC (eds) RNA Biochemistry andBiotechnology Kluwer Academic Publishers Dordrecht pp 11ndash43

58 HofackerIL (2003) Vienna RNA secondary structure serverNucleic Acids Res 31 3429ndash3431

1398 Nucleic Acids Research 2005 Vol 33 No 4

Page 3: doi:10.1093/nar/gki267 Modular RNA architecture revealed ... · Modular RNA architecture revealed by computational analysis of existing pseudoknots and ... Samuela Pasquali, ... RNAs

Graphical representation of RNA secondary structures

RNA secondary structures can be schematically represented asmathematical graphs to capture the essential topological con-nectivity of loops bulges junctions and stems (220) The useof simplified graphical representation allows RNA structuresto be efficiently analyzed and compared Recently wedeveloped a general class of RNA graphs called dual graphscapable of representing both RNA tree and pseudoknot struc-tures (2) (compare dual and tree graphs in Figures 1 and 2respectively) We use both tree and dual graphs in this workThe rules for mapping RNA secondary structures onto tree anddual graphs are given in (2) and our RNA-As-Graphs webresource (httpmonodbiomathnyuedurna) (2627)

Graph isomorphism search algorithm

The representation of RNAs as graphs provides a systematicframework to search for similar substructures in RNAsthrough the concept of graph isomorphism (32) Isomorphicgraphs share the same pattern of connectivity among all ver-tices The challenge is to identify by an efficient computeralgorithm a graph as a subgraph of a larger graph or in

molecular terms an existing RNA motif contained in a largerRNA The computational complexity of identifying two struc-turally equivalent (ie isomorphic) graphs with V vertices is ofthe order of V factorial (V) and known as the lsquograph isomorph-ism problemrsquo (3233)

Our efficient method for testing graph isomorphism is basedon graph topological numbers or invariants a sketch of thisidea was reported previously [(2) Appendix D] We associateeach graph or subgraph with one or more topological invari-ants which are computed based on the patterns of connectivityamong graph vertices Thus isomorphic graphs have the sametopological invariants while dissimilar graphs have differenttopological invariants The similarity or dissimilarity betweengraphs can be readily established by comparing their topo-logical invariants (17) Our topological invariants allow theidentification of equivalent graphs differing only in the place-ment of chain ends andor hairpin loops as explained below

We define the topological invariants using a graphrsquos con-nectivity as described by the adjacency matrix Given a graphG with V vertices the V middot V adjacency matrix A(G) of thegraph specifies the connectivity between all pairs of verticesFor example matrix element Aij indicates the number of edges

1

2 3

4

6 5

0 1 0 1 0 11 1 1 0 0 00 1 1 1 0 01 0 1 0 1 00 0 0 1 0 31 0 0 0 3 0

A = S = 3988

0 1 2 1 2 11 0 1 2 3 22 1 0 1 2 31 2 1 0 1 22 3 2 1 0 131 2 3 2 13 0

D =

654321

0 2 0 0 0 02 0 2 0 0 00 2 0 2 0 00 0 2 0 2 00 0 0 2 0 20 0 0 0 2 1

A =

0 12 2 3 4 5

2 12 0 12 2 3

4 3 2 12 0 12 5 4 3 2 12 0

3 2 12 0 12 2

12 0 12 2 3 4

D = S = 4640

5

3

3

5

Figure 1 Two hypothetical RNA secondary structures and their dual graphical representations The RNA graphs are quantitatively described using adjacencymatrix A distance matrix D (Equation 1) and topological invariant S (Equation 2) We use these measures to compare and search for similar substructures inexisting RNAs

1386 Nucleic Acids Research 2005 Vol 33 No 4

connecting vertex i with vertex j For RNA dual graphs ele-ments of the adjacency matrix have the following propertiesAij is symmetric the allowed number of edges (connections)between two vertices is 0 1 2 or 3 a vertex can have at mostone self-loop (value of 1) representing a hairpin loop and eachvertex i is connected to four edges representing two incomingand two outgoing connecting strands except where the chainends occur

Our graph invariant is defined based on a modified distance-like matrix D(G) derived from the adjacency matrix A(G) Theelements of symmetric matrix (Dij) are defined as follows

Dij = A1ij for Aij bdquo 0

Dij = dij for Aij = 0

Dii = 0 for all i1

where dij is the minimum number of edges separating verticesi and j ie the number of edges defining the shortest possiblepath between these two vertices We define the graph invariantS(G) as follows

S Geth THORN2 = V1eth THORN1XVifrac141

XVjfrac141

Dij

2

2

Approximately S(G) measures the average distance amongthe vertices in the graph G For example the component Sidefined as

PVjfrac141 Dij measures how well-connected vertex i is

to the other vertices in the graph a peripheral vertex willproduce a higher Si value than a vertex located more in thecenter of the graph Effectively our scheme assigns adjacent

vertices with smaller weights than non-adjacent vertices Thusa large S value corresponds to an extended structure with fewbranches and a small S value to a compact graph with high-order junctions or a complex pseudoknot fold This property ofS can also be used to rank and measure the topological distancebetween graphs In general such topological invariants arebelieved to have a better correspondence with the physico-chemical properties of real molecules since many of the prop-erties depend critically on the exposed surface and lesscritically on the buried elements (34) Figure 1 shows theS values for two distinct RNA topologies

Our distance matrix D(G) and topological invariant S can beextended to tree graphs although their adjacency matriceshave different properties from those for dual graphs Wecall Stree a tree topological invariant which will be used foranalyzing submotifs in 16S and 23S rRNAs Figure 2A com-pares two topologically distinct 10-vertex tree graphs withdifferent Stree values

Our topological invariants [S(G) and Stree] are degeneratedwith respect to the location of chain ends or hairpin loopssince these elements are not scored Thus topologically equi-valent structures (ie same connectivity) embedded within alarger structure can be identified All known topological invari-ants are imperfect since distinct structures can be incorrectlyassigned as identical That is graphs with distinct topologicalinvariants are non-isomorphic but the converse is not true(not all graphs with same topological invariant are iso-morphic) This is the well-known graph isomorphism problem(33) In our analysis of 100 connected graphs with our topo-logical invariant S we only encountered one pair of similar but

1 3 4 5

6

7

9

8 10

2

101 2

34

5

6

78

9

S = 9079 S = 6285

S = 3085

A

B

Figure 2 Tree and dual graphical representations and topological invariants of hypothetical RNA secondary structures (A) RNA tree graphs and their topologicalinvariants Stree (B) The only example we found (out of 100) of two distinct RNA-like pseudoknot graphs with the same topological invariant S the pseudoknotsonly differ by the placement of a stemndashloop Although our structural comparison algorithm regards this structure pair as a positive match in practice such cases areeliminated by manual screening

Nucleic Acids Research 2005 Vol 33 No 4 1387

non-isomorphic graphs (for V = 5) that resulted in the sameinvariant S (see Figure 2B) In a previous work we have alsoused the Laplacian eigenvalue spectrum as a topologicalinvariant yielding a comparable error rate of a few percent(26) Some false positive matches are reported because of thisgraph isomorphism problem Since we manually examine allpositive matches the false positive matches are eliminated

Searching for the motif of a 10-vertex graph in a 30-vertexgraph requires 20 min CPU time on an SGI 300 MHz MIPSR12000 IP27 processor The search of a 15-vertex motif in a40-vertex graph would imply 1000-fold increase in CPU time

In practice our method can easily search (ie in minutes) forsubgraphs of size ranging from 5 to 11 vertices within graphsof size up to 30 vertices (corresponding to 600 nt RNAs)

RESULTS AND DISCUSSION

RNA pseudoknots

We analyze the structural similarity of a set of 14 pseudoknotsranging from 41 to 161 nt Figure 3 shows the secondarystructures and graphical representations of the pseudoknots

Viral frameshiftin g

PKB217

PKB174PKB131

PKB178

PKB223

PKB134PKB135Self cleavingVS ribozyme

PKB205

18S ribosomal RNA

PKB191

PKB173

Hammerhead ribozyme

PKB239

HIV1 5rsquo UTR

frameshiftingViral

Internal ribosome entry siteAptamer

PKB143

tmRNA

PKB210

PKB233

Figure 3 Topological matches generated by the simple (PKB217 PKB233) and double (PKB178) pseudoknots The pseudoknots are shown as schematic secondarystructures and dual graphs There are 2 probe and 11 target pseudoknot topologies representing diverse functional RNA pseudoknots For each probe pseudoknotstructure the corresponding matched submotifssubgraphs in the larger pseudoknots are shaded

1388 Nucleic Acids Research 2005 Vol 33 No 4

selected from the PseudoBase (25) (httpwwwbioleidenunivnl~BatenburgPKBhtml) The RNA pseudoknots are viralframeshifting (PKB174 PKB217 PKB233) viral tRNA-like(PKB134 PKB135 PKB143 PKB191) IRES RNA (PKB223)18S rRNA (PKB205) HIV-1 50-untranslated region (50-UTR)(PKB239) alternative pseudoknot of hammerhead ribozyme(PKB173) aptamer that binds human nerve growth factor(PKB131) tmRNA (PKB210) and self-cleaving Neurosporavs ribozyme (PKB178) (PKB233 and PKB217 appear thesame but have different sequences and this difference affectssubsequent sequence alignment analysis) These pseudoknotclasses are not known to be functionally related We choosemore than one member for some classes because of exhibitedtopological differences

The above pseudoknot structures were derived frommutational analysis (PKB173 PKB174 PKB191 PKB233PKB239) enzymatic and chemical structure probing(PKB131 PKB174 PKB191 PKB239) phylogenetic orsequence comparison analysis (PKB131 PKB135 PKB143PKB205 PKB217 PKB233 PKB234 PKB191 PKB210PKB239) and 3D modeling (PKB134 PKB135 PKB210)The pseudoknots listed in more than one category were probedusing two or more methods

To use these structures for graphical analysis we convertthe base pairing information of each structure provided byPSEUDOBASE to a dual graphical representation using ourdual graph rules [D1ndashD3 (2)] We then use pseudoknot graphsto specify their corresponding adjacency matrices Thesetwo steps are performed manually For RNA tree structureswe use our RNA Matrix program to automatically convert asecondary structure into its corresponding adjacency matrix(2635) The search for a small motif within a larger motif isperformed using input adjacency matrices and the comparisonof matrices is performed via the topological invariant S(Equation 2)

Sequence and structural relationships of functionallyrelated pseudoknots

We illustrate using several pseudoknot pairs the functionalsimilarity identified by topological rather than sequence sim-ilarity Figure 4 compares viral frameshifting pseudoknotstructures Viral RNA frameshifting is a process by whichexpression of overlapping open reading frames (ORFs) canbe achieved (36) We compare three pairs from subfamiliesGag-pro ORF1aORF1b and ORF2ORF3 ribosomalframeshifting Frameshifting signals involve two essentialcomponents a lsquoslipperyrsquo (or sliding) sequence of nucleotideswhere frameshifting takes place and a stimulatory RNAsecondary structure (usually a pseudoknot) located a fewnucleotides downstream Although members of the sameframeshifting subfamilies have significant sequence similarityglobal sequence identity between subfamilies is as low as28 similar to random sequences Because all frameshiftingpseudoknots in Figure 4A have the same topology and func-tion finding topological similarity between RNA structureshas an advantage for functional annotation over sequencecomparison

Figure 4B shows that this finding also holds for a frameshift-ing pseudoknot pair (PKB217 and PKB174 Pair 4 in Table 1)The 72 nt PKB217 is the 1 frameshifting pseudoknot of

lactate dehydrogenase-elevating virus (LDV) (37) Togetherwith its sliding sequence UUUAAAC it regulates the expres-sion of ORF1a for a polyprotein containing proteases andORF1b for an RNA-dependent RNA polymerase in LDVThe 127 nt PKB174 is the 1 frameshifting pseudoknot ofRous sarcoma virus (RSV) (3638) This pseudoknot and slid-ing sequence AAAUUUA regulate the expression of Gag-Pro-Pol and Gag-Pol polyproteins Although both the LDV andRSV frameshifting pseudoknots have a simple pseudoknotstructure the latter has two extra stemndashloops Despite theirfunctional similarity the global sequence identity is only 37[computed using the program ALIGN (39) available at httpworkbenchsdscedu] comparable with random sequencestheir sliding sequences are also dissimilar Using our localsequence alignment parameters we find a large significantmatched region in PKB217 (6833ndash6877) as shown inFigure 4B AAACUGCUAGCCACCUCUGGUCUCGACC-GCUGUACUAGAGGUGG

Although functional similarity between frameshiftingpseudoknots is expected this case illuminates how examiningtopological similarity aids in identifying pseudoknots withsimilar functions despite low overall sequence similarity

Comparison of pseudoknot structures

Our two probe motifsmdasha simple pseudoknot (PKB217 orPKB233) and a double pseudoknot (PKB178)mdashare used tosearch larger RNAs We generate two groups of topologicalmatches as shown in Figure 3 with the matched motifs showndarkened we found 11 such matched pseudoknot pairs Inaddition we analyze sequence similarity of many of the84 (14 middot 132) possible pseudoknot pairs since topologicalsimilarities are apparent within each pseudoknot group gen-erated by the probe motifs Although not exhaustive thisadditional analysis may help to identify pairs having similarsubstructures that are not detected by using fixed probes(simple and double pseudoknots) Table 1 lists 10 pseudoknotpairs (out of the 11 probestructure matches and subset of the84 structurendashstructure analysis) that have significant structuraland sequence similarities Of the total of 10 pairs listed(Pairs 1ndash10) two known functional pairs (1 and 4 in Table 1)are included for comparison with the eight other pairs that arenot previously known to be functionally related

We begin by analyzing the overall patterns of relationshipamong our 14 pseudoknots Figure 3 organizes the matchesinto two groups five matches for the simple pseudoknot(PKB217) and six for the double pseudoknot (PKB178)The matched motifs appear to be lsquomodular constructionsrsquoaround the probe motifmodule with added stemndashloops(three or fewer) To the best of our knowledge these resultsreveal modular features and topological similarities of RNApseudoknots that have not been described

The biological significance of topological matches dependson the size and complexity of pseudoknots a match for thedouble pseudoknot is probably more significant than for asimple pseudoknot The six pseudoknots generated by thedouble pseudoknot probe PKB178 show striking topologicalas well as functional similarity (eg viral tRNA-like PKB134and PKB135) we analyze their similarity with PKB178 andamong themselves Still the matched pairs of RNAs are func-tionally diverse besides PKB134 and PKB135 such examples

Nucleic Acids Research 2005 Vol 33 No 4 1389

of diverse RNAs include IRES (PKB223) RNA 18S rRNA(PKB205 PKB234) HIV-1 50-UTR (PKB239) and alternativepseudoknot of hammerhead ribozyme (PKB173) In this dou-ble pseudoknot group six pairs (1 3 5 6 8 and 9 in Table 1)produce local sequence alignments [using LALIGN (40)

httpworkbenchsdscedu] that suggest significant resultsIn particular the pair PKB134PKB173 (Pair 5 in Table 1)indicates a significant match The pairs PKB134PKB205(Pair 6) and PKB205PKB223 (Pair 10 in Table 1) alsodisplay intriguing sequence and functional similarities

A A A A A A C G G G A A G C A A G G G C U

GGGA

CA A

CCCU

A

A

A

U

GA U

C C C G G

G

A AACC

A

U

PKB 3

Gag-polglobal sequence identity 57

PKB 128U U U A A A C U G U U G A G A G G U G C C U G

U C U C U A C G G A C

GCCUG

GA G

CGGAC

UU U

G

U AUA

G

A

C

PKB 217U U U A A A C U G C U A G C C A C C U C U G G U

G G U G G A G A U C A

GACCGC

CU C C

UGGCG

CU

U

G

G

GCA

U

G GY

AG

CG

ORF 1a - ORF 1bglobal sequence identity 53

G G G A A A C G G G C A G G C G G

GCG

C GCGC

A

G

A

C C G C C

AC

AA

C

PKB 44

G G G A A A U G G A C U G A G G C G

GGCG

C CGGC

A

C

A

C G C C

A

CA

A

C

PKB 240

A

A

ORF2 - ORF3global sequence identity 79

global sequence identity as low as 28

A

U CC

G

C

AA A

C

U

C

GG

C

U

G

UU

AA

PKB 217GCUGACGGUGUYUGGCG

UA

CUC

UG

UG A C C G C

C U G G C G

CC

CCUCUGGU

GUGGAGA

CA

A

U

GU U U A A A C U G C U A G C 5

3

PKB 174

G A AG

UG

AA

AC

U

U

A

U

UG

C A

UUG

UC

UC

U

C

A

U A

U

AGGG

UCCCU

CGGGCCACUG

CCCGGUGAC

G

CC

GU

GU

G

A

CA

C

G C G C U A C A

C G C G A U G U

A G A C C G

U C U G G C

A

2603 3

52476 A A A U U U A U

6830

BPair 4

global sequence identity 37

G G G A A A C U G G A A G G C G G G G C

GCUGCA

GA C

GACUG

G

A

A

C C C C G

PKB 4

C

AA

AUA

UG

Figure 4 Sequence and structural comparisons of pseudoknots in the frameshifting functional families (A) Two secondary structures and corresponding topologiesfor each of the three viral frameshifting subfamilies Gag-pro ORF1aORF1b and ORF2ORF3 Sequence identity within the same subfamily ranges between 50 and80 it is as low as 28 when comparing pseudoknots from different subfamilies All frameshifting pseudoknots are functionally related and have the same topology(dual graph) (B) Secondary structures and their aligned nucleotides (color coded) for frameshifting pseudoknots of LDV (PKB217) and Rous sarcoma virus(PKB174) Global sequence alignments were performed using the program ALIGN with default (164) gap openinggap extension penalties

1390 Nucleic Acids Research 2005 Vol 33 No 4

These relationships suggest structural and functional similar-ities that are not previously known (Table 1)

To support the significance of these matches we performsequence alignments using SmithndashWaterman parameters forgap opening and gap extension penalties (41) We choose thegap opening penalty value of 15 from a range of 0 to 19 Toproduce better alignments for large sequence segments (shortalignments are less likely to be significant in our context) ourgap extension penalty of 1 from a range of 1 to 8 allowslarger extensions consistent with the possibility of the inser-tion of long sequences and added 2D motifs In Table 1 wealso report the alignment score which is the sum of nucleotidematches mismatches and gap penalties

Table 1 shows that the probe structure (PKB178) matchestwo other pseudoknots (PKB135 PKB173) with sequenceidentities ranging from 59 to 71 for 24ndash49 nt alignmentlengths (compared with 56 sequence identity over 129 ntfor the functionally related pair PKB134PKB135 Pair 1)Significantly the pairs PKB134PKB173 (Pair 5) PKB134PKB205 (Pair 6) and PKB205PKB223 (Pair 10) havesequence identity range of 72ndash80 The non-randomness ofthe 10 aligned pairs in Table 1 is also evident in the ratio Rwhich measures the level of significance above the randomexpectation and has value 1 for random pairs

R = (percent sequence identity)(percent sequence identitywhen one of the aligned sequences is randomized)

We also define ltRgt as the R-value averaged over align-ments with different randomized sequences Table 1 showsthat some pseudoknot pairs with no known functional rela-tionship have ltRgt values ranging between 106 and 147whereas the related pair PKB134PKB135 has a value of154 many matched pairs have ltRgt values lt1 Thus thetopological similarities we identify led to matched pairswith non-random sequence alignment results Non-randomsequence matches may arise from topological andor func-tional similarities as illustrated below for three pseudoknotpairs (5 6 and 10 in Table 1)

The rate of false positives generated by topological and ltRgtvalue screening can be easily estimated by considering allpossible pairs from the set of six matched double pseudo-knots in Figure 3 These pseudoknots form 15 distinctpairs 6 of which have ltRgt values gt114 [Table 1 withS(G) = 117] Table 1 shows that only four pairs have largeltRgt values (gt12) Pair 1 (PKB134PKB135) is related Pair 5(PKB134PKB173) and Pair 6 (PKB134PKB205) are mostprobably related and Pair 3 (PKB178PKB173) is of unknownrelationship (in fact both are self-cleaving ribozymes)Assuming that the Pair 3 is unrelated and therefore reflectsa false positive we have an error rate of 115 or 7 If theinferred functionally related Pairs 5 and 6 are also consideredas false positives we then have a conservative error rateof 20

Functional relationship between PKB134 and PKB173(Pair 5 in Table 1)

The 116 nt PKB134 pseudoknot occurs in the 30 end of bromemosaic virus genomic RNA (31) The 73 nt PKB173 is ahammerhead ribozyme of satellite cereal yellow dwarfvirus-RPV (satRPV) RNA capable of adopting an alternativepseudoknot conformation by forming an (L1ndashL2a) stem whose

existence is established by experimental mutational analysis(42) Figure 3 shows that their topologies have the same lsquocorersquobut PKB134 has two extra stemndashloops The local sequencealignment of this pair using LALIGN (Figure 5) shows 19bases aligned to 80 sequence identity and an ltRgt valueof 123 (Table 1) There are two significant segments inPKB173 that are matched with those in PKB134 UACUGU-CUGACGACGUAUCC (nt 11ndash30) and UAGAAGGCUG-GUGCC (nt 39ndash53) These segments and their matchedregions in PKB134 span the stem and unpaired elements ofthe secondary structures (see Figure 5) Significantly bothPKB134 and PKB173 pseudoknots serve as template for rep-licase a protein that recognizes specific sequence regions toinitiate replication of viral genomes (3142) Moreover a con-served sequence (CUGANGA) of PKB173 is in the alignedregion (nt 11ndash30) Thus our analysis suggests that PKB134and PKB173 pseudoknots are structurally and functionallyrelated

Further analysis of pseudoknots PKB134 and PKB173reveals specific domains interacting with the replicase Experi-ments have shown that mutations in domains A C B1 and B2[defined in (31)] of PKB134 impair viral RNA replication (43)Significantly our two aligned sequence regions span domainsA B1 C and also D Thus the replicase recognizes severaldomains in the PKB134 pseudoknot structure In PKB173 thepseudoknot forming five base pair helix (GCGCG) is implic-ated in replicase recognition or ligation (42) This pseudoknotalso has conserved sequence regions ACAAA (61ndash65) or itscomplement (possibly the replication origin) and AGAAA(nucleotides attached to the 30 end but not shown) Howeverthese regions are not observed in PKB134

Functional relationship between PKB205 and PKB223(Pair 10 in Table 1)

The 63 nt PKB223 is the pseudoknot of IRES of capsid-producing cricket paralysis-like viruses (CrPV) formethionine-independent initiation of translation (44) Appear-ing immediately upstream from the capsid-coding sequenceit initiates translation without a canonical AUG (methionine)initiation codon The 48 nt PKB205 is the pseudoknotsubstructure (V4 domain) of SSU of Palmaria palmata 18SrRNA which was found using extensive comparative analysis(4546) Figure 5 (Pair 10 lower panel) shows that PKB223 isa 4-vertex simple pseudoknot with two inserted stemndashloopsand PKB205 is a 4-vertex double pseudoknot with an insertedstemndashloop Figure 5 (middle panel) displays the local sequencealignment of PKB223 and PKB205 yielding a match of 14 ntwith 72 sequence identity This significant sequence simil-arity occurs in regions ACAAUAUCCAGGAA (6148ndash6161)and ACAUUAGCAUGGAA (765ndash778) of PKB223 andPKB205 respectively Remarkably we find that the 9 ntsegment 50-UUAGGUUAG-30 (nt 6100ndash6108) of PKB223is complementary to 30-GGUACGAUU-50 (nt 768ndash776) ofPKB205 except at G6103 Our finding corroborates withthe experimental identification that IRES elements in cellularmRNA contain a short 9 nt (nt 133ndash141 CCGGCGGGU) inthe 50-UTR of the homeodomain protein Gtx which is com-plementary to nt 1132ndash1124 of 18S rRNA (47) Indeed manysuch near complementary sequences have been found bet-ween the 30 end of 18S rRNA and picornavirus RNAs (48)

Nucleic Acids Research 2005 Vol 33 No 4 1391

Fig

ure

5

Th

ree

no

vel

fun

ctio

nal

lyre

late

dp

seu

do

kn

otp

airs

infe

rred

by

top

olo

gy

mat

chin

gan

dse

qu

ence

alig

nm

ent

Up

per

pan

els

Sec

ond

ary

stru

ctu

res

and

thei

ral

ign

edn

ucl

eoti

des

(co

lor

cod

ed)

for

pse

udo

kn

ot

pai

rsP

KB

13

4P

KB

17

3(P

air

5T

able

1

left

pan

el)

PK

B2

05

PK

B2

23

(Pai

r1

0T

able

1

mid

dle

pan

el)

and

PK

B1

34

PK

B20

5(P

air

6

rig

ht

pan

el)

PK

B1

34

occ

urs

in30 e

nd

of

bro

me

mosa

icv

iru

sg

eno

mic

RN

A

PK

B1

73

isa

pse

udo

kn

oto

fsa

tell

ite

cere

aly

ello

wd

war

ftv

iru

s-R

PV

RN

AP

KB

22

3is

the

pse

ud

ok

no

tofin

tern

alen

try

site

(IR

ES

)o

fca

spid

-pro

du

cin

gC

rPV

san

dP

KB

20

5is

the

pse

udo

kn

ots

ub

stru

ctu

re(V

4d

om

ain

)o

fS

SU

of

PP

alm

ata

18

SrR

NA

L

ow

erp

anel

sT

he

alig

ned

sub

gra

ph

so

rre

gio

ns

are

shad

edo

rm

ark

ed

1392 Nucleic Acids Research 2005 Vol 33 No 4

Intriguingly for CrPVs the complementary segments occur inpseudoknots with similar folds not involving the dangling30 end of 18S rRNA Base pair complementary between IRESelements of mRNA and 18S rRNA is a mechanism for ribo-some recruitment and translation In viral IRES elements thesecondary and tertiary structures are believed to play criticalroles in translation initiation (44)

Functional relationship between PKB205 and PKB134(Pair 6 in Table 1)

The functional relationsip of the pair PKB205PKB134 issimilar to that for PKB205PKB223 above PKB134 belongsto a common viral tRNA-like functional class whose role is torecruit host ribosomes to initiate viral protein synthesis (49)It is known to interact with the ribosome via tRNA shapemimicry (31) Our analysis indicates that tRNA-like mole-cules also contain a region which is nearly complementary toPKB205 a pseudoknot region of the 18S rRNA Table 2shows regions of PKB205 complementary to four tRNA-like molecules (PKB134 191 138 17) the complementaryregion in PKB223 is shown for comparison Significantlyeach of the pseudoknots PKB223 PKB134 and PKB191has a region complementary to the same 8ndash9 nt unpairedregion in PKB205 (767ndash776) This single-stranded regionis most probably available for base pairing with incomingtRNA-like molecules In addition two other regions inPKB205 are complementary to tRNA-like pseudoknotsPKB138 and PKB17 one region is a hairpin loop and anotherregion is part of short stems Thus this analysis suggests thattRNA-like molecules may employ both structural mimicry andsequence complementarity to recruit SSU rRNA an interac-tion mode similar to tRNA PKB134 also has a region UACA-GUGUUGAAAAACA (nt 2078ndash2094) closely matched tothe region UAGAGUGUUCAAAGCCA (nt 732ndash748) inPKB205 with an average R-value of 123

Tertiary structure similarity of topological matches inribosomal RNAs

The preceding examination of three pseudoknot pairs showsthat topological matches can suggest functionally relatedpseudoknots It is also important to show that novel topo-logical matches led to similar tertiary structures We focusfor this purpose on rRNA since several bacterial and archaealribosome structures have been solved in recent years In factin our previous work we have found several topologicalmatches of the Scerevisiae 5S rRNA motif within its largerrRNAs two within 16S rRNA (domains II and III) and threewithin 23S rRNA (domains III IV and VI) (2) From availableX-ray structures for Thermus thermophilus 16S rRNA (PDBaccession no 1fjg) (50) and 23S rRNA from Deinococcus

radiodurans (1nkw) and Haloarcula marismortui (1ffk)(5152) only 5S rRNA motifrsquos matches in domain II of16S rRNA and domain III of 23S rRNA are relevant theother three cases are not valid matches in these species

Figure 6 displays the secondary diagrams graphical repres-entations and tertiary structures of 5S rRNA for bacteriumDradiodurans (1nkw) and archaea Hmarismortui (1ffk)and their topologically matched modules within 16S and23S rRNAs the secondary diagrams are adapted from thosein Gutellrsquos comparative RNA website (httpwwwrnaicmbutexasedu) (53) The module in 23S is an identicalmatch for the 6-vertex 5S graph except for a hairpin loopwhich does not contribute to the topological invariant SThe 16S module is a 5-vertex graph because a 2 bp stem atthe junction has a non-canonical base pair AG (dual graph ruleD1 defines a stem as having at least two canonical andorwobble base pairs) for Scerevisiae this stem consists ofonly canonical base pairs making the 16S module an identicalmatch for the 5S topology As shown in Figure 6 the 16S and23S modules have a similar L-shaped tertiary fold as the 5Sstructure Tertiary structure similarity with 5S is more strikingfor the module of domain II (nt 653ndash753) of Tthermophilus16S rRNA than for the module of domain III (nt 1033ndash1143) of23S rRNA Both the 5S structure and 16S module have apronounced 3-stem junction where the chain ends of the16S module are located Structural (backbone) overlap ofthe 5S structure and 16S module shows similar structuralfeatures except for the orientations of the smaller stems atthe junction (We aligned the structures manually because thechain ends of 5S and 16S module occur at non-correspondinglocations) In contrast the 23S module shows significant dis-tortions on the shorter helical arm with the hairpin loop foldedback to the helix as shown in the structural overlap of the 5Sstructure and 23S module Structural similarities seen in theseexamples are probably due to geometric constraints which areapparent in their secondary diagrams and graphical representa-tions Thus 2D topological matches can lead to similar 3Dstructures Below we survey the distribution of modules inrRNAs

Distribution of submotifs in 16S and 23S ribosomalRNAs

Characterizing the submotifs within the 16S and 23S rRNAsthe largest known RNA molecules can help uncover themodular construction of RNA structures as analysis ofrRNArsquos tertiary modules has revealed Furthermore our pre-vious analysis of rRNArsquos structural motifs indicated existenceof characteristic features (35) For example the distributionsof pairedunpaired bases in stems bulgesinternal loops hair-pin loops and junctions follow specific functional forms

Table 2 Short sequence regions in PKB205 complementary to tRNA-like molecules

PKB205 regions Complementary sequences

30-GGUACGAUU-50 (768ndash776) PKB223 50-UUAGGUUAG-30 (6100ndash6108) IRES30-GUGAGAUU-50 (769ndash776) PKB134 50-CACUGUAAA-30 (113ndash120) tRNA-like30-UACGAUUAC-50 (767ndash775) PKB191 50-AUGCUCAUG-30 (77ndash85) tRNA-like30-UGUGAGAUU-50 (731ndash739) PKB138 50-ACACUUUAA-30 (38ndash47) tRNA-like30-GUUUGCGGACC-50 (746ndash756) PKB17 50-CAAAACCCUGG-30 (19ndash29) tRNA-like

Nucleic Acids Research 2005 Vol 33 No 4 1393

Figure 6 Comparison of the graph topologies and 2D and 3D structures of 5S rRNAs (A and B) and modules in 16S and 23S rRNAs (C and D) (C and D) Twostructural (backbone) overlaps of the 5S structure (1ffk gold color) with 16S module (1fjg blue) and 5S structure with 23S module (1nkw green) are shown Thesecondary diagrams are adapted from those in Gutellrsquos comparative RNA website (httpwwwrnaicmbutexasedu) and 3D structures are from the Protein DataBank (httpwwwrcsborgpdb) The 2D modules were identified computationally by our earlier work on finding120 nt 5S rRNA motifs in 16S and 23S rRNAs ofScerevisiae(2) The equivalent tertiary structures and modules for 5S 16S and 23S rRNAs from other organisms show that similar graph topologies can lead tosimilar secondary and tertiary structures All structures have similar graph topologies The Tthermophilus 16S module is a 5-vertex instead of 6-vertex graph becausea 2 bp stem at the junction has a non-canonical base pair AG for Scerevisiae this stem consists of only canonical base pairs making the 16S module an identicalmatch for the 5S topology

1394 Nucleic Acids Research 2005 Vol 33 No 4

Moreover the percentages of nucleotides in stems bulgesinternal loops hairpin loops and junctions do not vary for16S and 23S rRNAs even though the latter is twice thesize of the former Such motif characteristics help discriminateRNA-like from non-RNA-like molecules and guide the designof novel RNAs Here we analyze the distinct modular unitsmdashsmall tree submotifsmdashthat makeup the rRNAs to advance suchapplications Since all small tree graphs up to 10 vertices havebeen exhaustively enumerated (5455) we can use our sub-structure search algorithm to identify the occurrences of alldistinct tree submotifs in rRNAs

Specifically we use sets of tree graphs for V = 4 5 6 7 8 9and 10 comprising 2 3 6 11 23 47 and 106 graphs respect-ively for a total of 198 tree motifs The complete enumeratedmotifs can be found in the RAG web resource (httpmonodbiomathnyuedurna) Figure 7 shows as an example the11 possible 7-vertex tree graphs and lists the frequency ofoccurrence of each tree graph in rRNAs Remarkably thesame tree motifs are abundant in both 16S and 23S rRNAsThe most abundant (10 or 15 times) 7-vertex tree has two3-stem junctions (topological invariant Stree = 6160) 4- and5-stem junctions are the next most prevalent submotifs(Stree = 5137 5921 and 5667) Interestingly the unbranchedtree (Stree = 7219) and highly branched trees (Stree = 4509)are rare

Figure 8 plots the percentage of distinct tree graphs in eachset V (fraction of motifs identified relative to possible motifs)

found in rRNA For small tree graphs V = 4 5 and 6 allpossible topologies are present but not for V gt 6 In fact thepercentage of trees found declines with V In 23S rRNA90 of possible trees are found for V = 7 and only40 (representing 47 tree topologies) are found forV = 10 Larger trees (V gt 10) are expected to have muchsmaller percentages This percentage decline is expectedsince small tree modules are less specific than large treesIn total 111 submotifs out of 198 are identified The observedtrends in Figure 8 may be generally valid for rRNA structuresof other organisms since they are highly conserved althoughsome statistical fluctuations are expected due to structuraldifferences and RNArsquos finite size

These data show that RNA structures are modular constructsof relatively few (tree) building blocks Extrapolation of thecurves in Figure 8 suggests that 20 of V = 11 motifs (totalof 235 trees) and 10 of V = 12 motifs (total of 551 trees)probably exist in rRNAs Thus the total number of buildingblocks in rRNAs with V lt 13 would be 210 Although thepercentages of trees with higher V values are low they aremore numerous and will contribute to the overall number ofbuilding blocks We thus consider 210 as the minimum num-ber of building blocks The building block number may beexploited in the future design of novel functional RNAmolecules using modular assembly from 210 distinct treemotifs The tree motifs may be taken from substructures ofexisting RNAs allowing simultaneous sequence and structure

treeS =

16S 23SOccurrence

ID

6 4

)

0 1

(7 8)

6191

(7 11)

4509

0 1

(7 9)

5385

3 3

7

(7 10)

5137

5 5

(7 6)

5916

5

(7 5)

5921

(7 1)

0 0

7219

(7 3)

2 4

6446

4 4

(7 7)

5667

10 15

(7 4

6160

0 3

6692

(7 2)

Figure 7 Occurrence frequency of 11 distinct 7-vertex tree topologies as submotifs of Scerevisiae 16S and 23S rRNAs The tree graphs are labeled using topologicalinvariant Stree (second top row) and ID number (i j) (first row) which refers to motif identification number as used in RAG database (httpmonodbiomathnyuedurna) The two numbers at the bottom of each graph refer to the number of times that topology is found embedded in 16S and in 23S rRNAs respectively

Nucleic Acids Research 2005 Vol 33 No 4 1395

design Such a build-up approach appears feasible from our(26) and other works (35) and is akin to the lsquobuilduprsquo tech-nique for proteins pioneered by Vasquez and Scheraga (56)

The above estimate of building block number can be eithera consequence of the limited number of available structures orthe fact that RNA adopts only a subset of possible graphsCertainly future discoveries of novel RNA structures mightindicate a larger repertoire of building blocks Still our sur-veys of natural RNAs (26) suggest that RNAs possess specifictopological characteristics and favor a subset of the possiblegraph motifs The missing motifs may not exist in nature forRNA owing to physical or functional constraints

The building block number can also be defined in terms ofirreducible graphs ie trees that cannot be decomposed intosmaller trees However this approach has a major drawbackSince all RNAs are composed of connected stems their sub-structures can be reduced to two vertex graphs which is not aninformative description of complex RNA folds Still if wedefine 4-vertex trees as the smallest irreducible graphs thenumber of irreducible graphs can be calculated Based on oursubmotif data we estimate that for V up to 10 there are about42 irreducible tree modules in rRNAs

SUMMARY AND CONCLUSION

Our automated comparison of RNA secondary structuresbased on graphical representation allows the examination ofconserved 2D RNA topological features which in turn suggestfunctional relationships This approach has an advantage oversequence alignment because functionally related RNAs oftenlack sequence similarity as shown for frameshifting pseudo-knots with low global sequence identities Of course ourstructure alignment approach relies on available 2D RNA

structures and thus will increase in scope as more RNAstructures become available Our eight identified pairs ofpotentially related pseudoknots (Table 1) reveal three pairsthat are functionally related the pseudoknots of 30 end ofbrome mosaic virus genomic RNA (PKB134) and alternativehammerhead ribozyme pseudoknot (PKB173) are templatesfor replicase to initiate replication of viral genomes the pseu-doknots of internal entry site (IRES) of caspid-producingCrPVs (PKB223) and SSU of 18S rRNA (PKB205) are asso-ciated with translation initiation and the pseudoknots of 18SrRNA (PKB205) and viral tRNA-like (PKB134) most prob-ably interact via both structural mimicry and base comple-mentarity To the best of our knowledge these functionalrelationships have not been described previously

The developed RNA comparison algorithm also led us toquantify the modularity of 16S and 23S rRNAs using smallenumerated tree motifs Our results suggest that rRNAs areconstructed from at least 210 distinct tree subtopologies(or modules) having up to 180 nt Significantly we findthat 5S rRNA and two tree modules within 16S and 23SrRNAs with similar topologies have similar tertiary shapesaffirming the relationship between 2D graphical representa-tion and 3D structure The tree modules in rRNA may be usedas building blocks for the design of novel RNA folds A similaridea has been employed to design and improve functionalRNA molecules (510) Design by assembly of modulesexploits motifs present in natural RNAs unlike de novo designof sequences and structures (11) The issues of discoveringnovel topologies (6) building module library and designingfunctional folds are likely to increase in importance withinterest in novel RNAs

Our algorithm has several limitations Of course its utilitydepends on availability of reliable RNA secondary foldingalgorithms (5758) More fundamentally although simplifiedgraphical representations specify topological characteristicsdetailed sequence and motif information is missing An asso-ciated problem is that unusual RNA structurendashfunction rela-tionships due to deleterious mutations as in frameshiftingpseudoknots (28) and structural flexibility are not capturedMore advanced graph constructs such as weighted graphs mayhowever specify detailed motif features at the secondary andbase pair levels Another limitation of the present algorithmis the requirement to fix the probe structure or graph As inprotein structure alignment algorithms (15) future refinementsof our algorithm should allow some flexibility in motifdefinitions as the similarity search proceeds We inviteRNA scientists to visit our RAG web resource at httpmonodbiomathnyuedurna and suggest to us specific enhancementsthat will be useful in practice

ACKNOWLEDGEMENTS

We thank Dinshaw Patel for comments that motivated thiswork This research was supported by a Joint NSFNIGMSInitiative in Mathematical Biology (DMS-0201160) and theHuman Frontier Science Program (HFSP)

REFERENCES

1 TinocoI and BustamanteC (1999) How RNA folds J Mol Biol293 271ndash281

Figure 8 Distribution of all possible distinct tree modules with V vertices foundin Scerevisiae 16S and 23S rRNAs In graph theory each V is represented by aset of possible tree structures For example for V= 4 5 6 7 8 9 and 10 there are2 3 6 11 23 47 and 106 possible trees respectively The distribution of treesquantifies the diversity of tree modules within rRNAs This analysis suggeststhat rRNAs possess at least 210 tree modules

1396 Nucleic Acids Research 2005 Vol 33 No 4

2 GanHH PasqualiS and SchlickT (2003) Exploring the repertoire ofRNA secondary motifs using graph theory implications for RNA designNucleic Acids Res 31 2926ndash2943

3 SoukupGA and BreakerRR (1999) Engineering precision RNAmolecular switches Proc Natl Acad Sci USA 96 3584ndash3589

4 BreakerRR (2002) Engineered allosteric ribozymes as biosensorcomponents Curr Opin Biotechnol 13 31ndash39

5 IkawaY FukadaK WatanabeS ShiraishiH and InoueT (2002)Design construction and analysis of a novel class of self-foldingRNA Structure 10 527ndash534

6 KimN ShiffeldrimN GanHH and SchlickT (2004) Candidatesfor novel RNA topologies J Mol Biol 341 1129ndash1144

7 MitchellJR ChengJ and CollinsK (1999) A box HACAsmall nucleolar RNA-like domain at the human telomerase RNA 30

end Mol Cell Biol 19 567ndash5768 DoudnaJA and CechTR (2002) The chemical repertoire

of natural ribozymes Nature 418 222ndash2289 StorzG (2002) An expanding universe of noncoding RNAs Science

296 1260ndash126310 BreakerRR and JoyceGF (1994) Inventing and improving ribozyme

function rational design versus iterative selection methods TrendsBiotechnol 12 268ndash275

11 AndronescuM FejesAP HutterF HoosHH and CondonA (2004)A new algorithm for RNA secondary structure design J Mol Biol336 607ndash624

12 CarterRJ DubchakI and HolbrookSR (2001) A computationalapproach to identify genes for functional RNAs in genomic sequencesNucleic Acids Res 29 3928ndash3938

13 DoudnaJA (2000) Structural genomics of RNA Nature Struct Biol7 954ndash956

14 RivasE KleinRJ JonesTA and EddySR (2001) Computationalidentification of noncoding RNAs in Ecoli by comparativegenomics Curr Biol 11 1369ndash1373

15 HolmL and SanderC (1996) Mapping the protein universeScience 273 595ndash602

16 ShindyalovIN and BournePE (2001) A database and tools for 3-Dprotein structure comparison and alignment using the CombinatorialExtension (CE) algorithm Nucleic Acids Res 29 228ndash229

17 BenedettiG and MorosettiS (1996) A graph-topological approach torecognition of pattern and similarity in RNA secondary structuresBiophys Chem 59 179ndash184

18 MargalitH ShapiroBA OppenheimAB and MaizelJVJr (1989)Detection of common motifs in RNA secondary structures Nucleic AcidsRes 17 4829ndash4845

19 ShapiroBA and ZhangKZ (1990) Comparing multiple RNAsecondary structures using tree comparisons Comput Appl Biosci6 309ndash318

20 LeSY NussinovR and MaizelJV (1989) Tree graphs of RNAsecondary structures and their comparisons Comput Biomed Res22 461ndash473

21 DuarteCM WadleyLM and PyleAM (2003) RNA structurecomparison motif search and discovery using a reduced representation ofRNA conformational space Nucleic Acids Res 31 4755ndash4761

22 ChevaletC and MichotB (1992) An algorithm for comparing RNAsecondary structures and searching for similar substructures ComputAppl Biosci 8 215ndash225

23 WangL and ZhaoJ (2003) Parametric alignment of ordered treesBioinformatics 19 2237ndash2245

24 FontanaW KoningsDAM StadlerPF and SchusterP (1993)Statistics of RNA secondary structures Biopolymers 33 1389ndash1404

25 van BatenburgFHD GultyaevAP PleijCWA NgJ andOliehoekJ (2000) PseudoBase a database with RNA pseudoknotsNucleic Acids Res 28 201ndash204

26 GanHH FeraD ZornJ ShiffeldrimN TangM LasersonUKimN and SchlickT (2004) RAG RNA-As-Graphsdatabasemdashconcepts analysis and features Bioinformatics20 1285ndash1291

27 FeraD KimN ShiffeldrimN ZornJ LasersonU GanHH andSchlickT (2004) RAG RNA-As-Graphs web resource BMCBioinformatics 5 88

28 ChenXY KangHS ShenLX ChamorroM VarmusHE andTinocoI (1996) A characteristic bent conformation of RNA pseudoknotspromotes 1 frameshifting during translation of retroviral RNAJ Mol Biol 260 479ndash483

29 KangHS HinesJV and TinocoI (1996) Conformation of anon-frameshifting RNA pseudoknot from mouse mammary tumor virusJ Mol Biol 259 135ndash147

30 GanHH PerlowRA RoyS KoJ WuM HuangJ YanSNicolettaA VafaiJ SunD et al (2002) Analysis of protein sequencestructure similarity relationships Biophys J 83 2781ndash2791

31 FeldenB FlorentzC GiegeR and WesthofE (1994) Solutionstructure of the 30-end of brome mosaic-virus genomic RNAsmdashconformational mimicry with canonical transfer-RNAs J Mol Biol235 508ndash531

32 GrossJL and YellenJ (1999) Graph Theory and its Applications CRCPress Boca Raton FL

33 KoblerJ SchoningU and ToraanJ (1993) The Graph IsomorphismProblem Birkhauser Boston MA

34 RandicM and ZupanJ (2001) On interpretation of well-knowntopological indices J Chem Inform Comput Sci 41 550ndash560

35 ZornJ GanHH ShiffeldrimN and SchlickT (2004) Structuralmotifs in ribosomal RNAs implications for RNA design and genomicsBiopolymers 73 340ndash347

36 JacksT MadhaniHD MasiarzFR and VarmusHE (1988) Signalsfor ribosomal frameshifting in the Rous-Sarcoma virus GagndashPolregion Cell 55 447ndash458

37 GodenyEK ChenL KumarSN MethvenSL KooninEV andBrintonMA (1993) Complete genomic sequence and phylogeneticanalysis of the lactate dehydrogenase-elevating virus (Ldv) Virology194 585ndash596

38 MarczinkeB FisherR VidakovicM BloysAJ and BrierleyI (1998)Secondary structure and mutational analysis of the ribosomalframeshift signal of Rous sarcoma virus J Mol Biol 284 205ndash225

39 MyersEW and MillerW (1988) Optimal alignments in linear spaceComput Appl Biosci 4 11ndash17

40 HuangXQ and MillerW (1991) A time-efficient linear-space localsimilarity algorithm Adv Appl Math 12 337ndash357

41 SmithTF and WatermanMS (1981) Identification of commonmolecular subsequences J Mol Biol 147 195ndash197

42 SongSI SilverSL AulikMA RasochovaL MohanBR andMillerWA (1999) Satellite cereal yellow dwarf virus-RPV (satRPV)RNA requires a DouXble hammerhead for self-cleavage and analternative structure for replication J Mol Biol 293 781ndash793

43 DreherTW and HallTC (1988) Mutational analysis of thetransfer-RNA mimicry of brome mosaic-virus RNAmdashsequence andstructural requirements for aminoacylation and 30-adenylationJ Mol Biol 201 41ndash55

44 KanamoriY and NakashimaN (2001) A tertiary structure model of theinternal ribosome entry site (IRES) for methionine-independent initiationof translation RNA 7 266ndash274

45 WuytsJ De RijkP Van de PeerY PisonG RousseeuwP andDe WachterR (2000)Comparativeanalysisof more than 3000 sequencesreveals the existence of two pseudoknots in area V4 of eukaryotic smallsubunit ribosomal RNA Nucleic Acids Res 28 4698ndash4708

46 Van de PeerY De RijkP WuytsJ WinkelmansT and De WachterR(2000) The European small subunit ribosomal RNA databaseNucleic Acids Res 28 175ndash176

47 ChappellSA EdelmanGM and MauroVP (2000) A 9-nt segment ofa cellular mRNA can function as an internal ribosome entry site (IRES)and when present in linked multiple copies greatly enhances IRESactivity Proc Natl Acad Sci USA 97 1536ndash1541

48 ScheperGC VoormaHO and ThomasAA (1994) Basepairing with18S ribosomal RNA in internal initiation of translation FEBS Lett352 271ndash275

49 BarendsS Rudinger-ThirionJ FlorentzC GiegeR PleijCW andKraalB (2004) tRNA-like structure regulates translation of Bromemosaic virus RNA J Virol 78 4003ndash4010

50 CarterAP ClemonsWM BrodersenDE Morgan-WarrenRJWimberlyBT and RamakrishnanV (2000) Functional insights fromthe structure of the 30S ribosomal subunit and its interactions withantibiotics Nature 407 340ndash348

51 HarmsJ SchluenzenF ZarivachR BashanA GatS AgmonIBartelsH FranceschiF and YonathA (2001) High resolution structureof the large ribosomal subunit from a mesophilic Eubacterium Cell 107679ndash688

52 BanN NissenP HansenJ MoorePB and SteitzTA (2000) Thecomplete atomic structure of the large ribosomal subunit at 24 angstromresolution Science 289 905ndash920

Nucleic Acids Research 2005 Vol 33 No 4 1397

53 CannoneJJ SubramanianS SchnareMN CollettJRDrsquoSouzaLM DuYS FengB LinN MadabusiLV MullerKMet al (2002) The Comparative RNA Web (CRW) Site an online databaseof comparative sequence and structure information for ribosomal intronand other RNAs BMC Bioinformatics 3 2

54 HararyF and PrinsG (1959) The number of homeomorphicallyirreducible trees and other species Acta Math 101141ndash162

55 HararyF (1969) Graph Theory Addison-Wesley Reading Mass

56 VasquezM and ScheragaHA (1985) Use of buildup andenergy-minimization procedures to compute low-energy structures ofthe backbone of enkephalin Biopolymers 24 1437ndash1447

57 ZukerM MathewsDH and TurnerDH (1999) Algorithms andthermodynamics for RNA secondary structure prediction A practicalguide In BarciszewskiJ and ClarkBFC (eds) RNA Biochemistry andBiotechnology Kluwer Academic Publishers Dordrecht pp 11ndash43

58 HofackerIL (2003) Vienna RNA secondary structure serverNucleic Acids Res 31 3429ndash3431

1398 Nucleic Acids Research 2005 Vol 33 No 4

Page 4: doi:10.1093/nar/gki267 Modular RNA architecture revealed ... · Modular RNA architecture revealed by computational analysis of existing pseudoknots and ... Samuela Pasquali, ... RNAs

connecting vertex i with vertex j For RNA dual graphs ele-ments of the adjacency matrix have the following propertiesAij is symmetric the allowed number of edges (connections)between two vertices is 0 1 2 or 3 a vertex can have at mostone self-loop (value of 1) representing a hairpin loop and eachvertex i is connected to four edges representing two incomingand two outgoing connecting strands except where the chainends occur

Our graph invariant is defined based on a modified distance-like matrix D(G) derived from the adjacency matrix A(G) Theelements of symmetric matrix (Dij) are defined as follows

Dij = A1ij for Aij bdquo 0

Dij = dij for Aij = 0

Dii = 0 for all i1

where dij is the minimum number of edges separating verticesi and j ie the number of edges defining the shortest possiblepath between these two vertices We define the graph invariantS(G) as follows

S Geth THORN2 = V1eth THORN1XVifrac141

XVjfrac141

Dij

2

2

Approximately S(G) measures the average distance amongthe vertices in the graph G For example the component Sidefined as

PVjfrac141 Dij measures how well-connected vertex i is

to the other vertices in the graph a peripheral vertex willproduce a higher Si value than a vertex located more in thecenter of the graph Effectively our scheme assigns adjacent

vertices with smaller weights than non-adjacent vertices Thusa large S value corresponds to an extended structure with fewbranches and a small S value to a compact graph with high-order junctions or a complex pseudoknot fold This property ofS can also be used to rank and measure the topological distancebetween graphs In general such topological invariants arebelieved to have a better correspondence with the physico-chemical properties of real molecules since many of the prop-erties depend critically on the exposed surface and lesscritically on the buried elements (34) Figure 1 shows theS values for two distinct RNA topologies

Our distance matrix D(G) and topological invariant S can beextended to tree graphs although their adjacency matriceshave different properties from those for dual graphs Wecall Stree a tree topological invariant which will be used foranalyzing submotifs in 16S and 23S rRNAs Figure 2A com-pares two topologically distinct 10-vertex tree graphs withdifferent Stree values

Our topological invariants [S(G) and Stree] are degeneratedwith respect to the location of chain ends or hairpin loopssince these elements are not scored Thus topologically equi-valent structures (ie same connectivity) embedded within alarger structure can be identified All known topological invari-ants are imperfect since distinct structures can be incorrectlyassigned as identical That is graphs with distinct topologicalinvariants are non-isomorphic but the converse is not true(not all graphs with same topological invariant are iso-morphic) This is the well-known graph isomorphism problem(33) In our analysis of 100 connected graphs with our topo-logical invariant S we only encountered one pair of similar but

1 3 4 5

6

7

9

8 10

2

101 2

34

5

6

78

9

S = 9079 S = 6285

S = 3085

A

B

Figure 2 Tree and dual graphical representations and topological invariants of hypothetical RNA secondary structures (A) RNA tree graphs and their topologicalinvariants Stree (B) The only example we found (out of 100) of two distinct RNA-like pseudoknot graphs with the same topological invariant S the pseudoknotsonly differ by the placement of a stemndashloop Although our structural comparison algorithm regards this structure pair as a positive match in practice such cases areeliminated by manual screening

Nucleic Acids Research 2005 Vol 33 No 4 1387

non-isomorphic graphs (for V = 5) that resulted in the sameinvariant S (see Figure 2B) In a previous work we have alsoused the Laplacian eigenvalue spectrum as a topologicalinvariant yielding a comparable error rate of a few percent(26) Some false positive matches are reported because of thisgraph isomorphism problem Since we manually examine allpositive matches the false positive matches are eliminated

Searching for the motif of a 10-vertex graph in a 30-vertexgraph requires 20 min CPU time on an SGI 300 MHz MIPSR12000 IP27 processor The search of a 15-vertex motif in a40-vertex graph would imply 1000-fold increase in CPU time

In practice our method can easily search (ie in minutes) forsubgraphs of size ranging from 5 to 11 vertices within graphsof size up to 30 vertices (corresponding to 600 nt RNAs)

RESULTS AND DISCUSSION

RNA pseudoknots

We analyze the structural similarity of a set of 14 pseudoknotsranging from 41 to 161 nt Figure 3 shows the secondarystructures and graphical representations of the pseudoknots

Viral frameshiftin g

PKB217

PKB174PKB131

PKB178

PKB223

PKB134PKB135Self cleavingVS ribozyme

PKB205

18S ribosomal RNA

PKB191

PKB173

Hammerhead ribozyme

PKB239

HIV1 5rsquo UTR

frameshiftingViral

Internal ribosome entry siteAptamer

PKB143

tmRNA

PKB210

PKB233

Figure 3 Topological matches generated by the simple (PKB217 PKB233) and double (PKB178) pseudoknots The pseudoknots are shown as schematic secondarystructures and dual graphs There are 2 probe and 11 target pseudoknot topologies representing diverse functional RNA pseudoknots For each probe pseudoknotstructure the corresponding matched submotifssubgraphs in the larger pseudoknots are shaded

1388 Nucleic Acids Research 2005 Vol 33 No 4

selected from the PseudoBase (25) (httpwwwbioleidenunivnl~BatenburgPKBhtml) The RNA pseudoknots are viralframeshifting (PKB174 PKB217 PKB233) viral tRNA-like(PKB134 PKB135 PKB143 PKB191) IRES RNA (PKB223)18S rRNA (PKB205) HIV-1 50-untranslated region (50-UTR)(PKB239) alternative pseudoknot of hammerhead ribozyme(PKB173) aptamer that binds human nerve growth factor(PKB131) tmRNA (PKB210) and self-cleaving Neurosporavs ribozyme (PKB178) (PKB233 and PKB217 appear thesame but have different sequences and this difference affectssubsequent sequence alignment analysis) These pseudoknotclasses are not known to be functionally related We choosemore than one member for some classes because of exhibitedtopological differences

The above pseudoknot structures were derived frommutational analysis (PKB173 PKB174 PKB191 PKB233PKB239) enzymatic and chemical structure probing(PKB131 PKB174 PKB191 PKB239) phylogenetic orsequence comparison analysis (PKB131 PKB135 PKB143PKB205 PKB217 PKB233 PKB234 PKB191 PKB210PKB239) and 3D modeling (PKB134 PKB135 PKB210)The pseudoknots listed in more than one category were probedusing two or more methods

To use these structures for graphical analysis we convertthe base pairing information of each structure provided byPSEUDOBASE to a dual graphical representation using ourdual graph rules [D1ndashD3 (2)] We then use pseudoknot graphsto specify their corresponding adjacency matrices Thesetwo steps are performed manually For RNA tree structureswe use our RNA Matrix program to automatically convert asecondary structure into its corresponding adjacency matrix(2635) The search for a small motif within a larger motif isperformed using input adjacency matrices and the comparisonof matrices is performed via the topological invariant S(Equation 2)

Sequence and structural relationships of functionallyrelated pseudoknots

We illustrate using several pseudoknot pairs the functionalsimilarity identified by topological rather than sequence sim-ilarity Figure 4 compares viral frameshifting pseudoknotstructures Viral RNA frameshifting is a process by whichexpression of overlapping open reading frames (ORFs) canbe achieved (36) We compare three pairs from subfamiliesGag-pro ORF1aORF1b and ORF2ORF3 ribosomalframeshifting Frameshifting signals involve two essentialcomponents a lsquoslipperyrsquo (or sliding) sequence of nucleotideswhere frameshifting takes place and a stimulatory RNAsecondary structure (usually a pseudoknot) located a fewnucleotides downstream Although members of the sameframeshifting subfamilies have significant sequence similarityglobal sequence identity between subfamilies is as low as28 similar to random sequences Because all frameshiftingpseudoknots in Figure 4A have the same topology and func-tion finding topological similarity between RNA structureshas an advantage for functional annotation over sequencecomparison

Figure 4B shows that this finding also holds for a frameshift-ing pseudoknot pair (PKB217 and PKB174 Pair 4 in Table 1)The 72 nt PKB217 is the 1 frameshifting pseudoknot of

lactate dehydrogenase-elevating virus (LDV) (37) Togetherwith its sliding sequence UUUAAAC it regulates the expres-sion of ORF1a for a polyprotein containing proteases andORF1b for an RNA-dependent RNA polymerase in LDVThe 127 nt PKB174 is the 1 frameshifting pseudoknot ofRous sarcoma virus (RSV) (3638) This pseudoknot and slid-ing sequence AAAUUUA regulate the expression of Gag-Pro-Pol and Gag-Pol polyproteins Although both the LDV andRSV frameshifting pseudoknots have a simple pseudoknotstructure the latter has two extra stemndashloops Despite theirfunctional similarity the global sequence identity is only 37[computed using the program ALIGN (39) available at httpworkbenchsdscedu] comparable with random sequencestheir sliding sequences are also dissimilar Using our localsequence alignment parameters we find a large significantmatched region in PKB217 (6833ndash6877) as shown inFigure 4B AAACUGCUAGCCACCUCUGGUCUCGACC-GCUGUACUAGAGGUGG

Although functional similarity between frameshiftingpseudoknots is expected this case illuminates how examiningtopological similarity aids in identifying pseudoknots withsimilar functions despite low overall sequence similarity

Comparison of pseudoknot structures

Our two probe motifsmdasha simple pseudoknot (PKB217 orPKB233) and a double pseudoknot (PKB178)mdashare used tosearch larger RNAs We generate two groups of topologicalmatches as shown in Figure 3 with the matched motifs showndarkened we found 11 such matched pseudoknot pairs Inaddition we analyze sequence similarity of many of the84 (14 middot 132) possible pseudoknot pairs since topologicalsimilarities are apparent within each pseudoknot group gen-erated by the probe motifs Although not exhaustive thisadditional analysis may help to identify pairs having similarsubstructures that are not detected by using fixed probes(simple and double pseudoknots) Table 1 lists 10 pseudoknotpairs (out of the 11 probestructure matches and subset of the84 structurendashstructure analysis) that have significant structuraland sequence similarities Of the total of 10 pairs listed(Pairs 1ndash10) two known functional pairs (1 and 4 in Table 1)are included for comparison with the eight other pairs that arenot previously known to be functionally related

We begin by analyzing the overall patterns of relationshipamong our 14 pseudoknots Figure 3 organizes the matchesinto two groups five matches for the simple pseudoknot(PKB217) and six for the double pseudoknot (PKB178)The matched motifs appear to be lsquomodular constructionsrsquoaround the probe motifmodule with added stemndashloops(three or fewer) To the best of our knowledge these resultsreveal modular features and topological similarities of RNApseudoknots that have not been described

The biological significance of topological matches dependson the size and complexity of pseudoknots a match for thedouble pseudoknot is probably more significant than for asimple pseudoknot The six pseudoknots generated by thedouble pseudoknot probe PKB178 show striking topologicalas well as functional similarity (eg viral tRNA-like PKB134and PKB135) we analyze their similarity with PKB178 andamong themselves Still the matched pairs of RNAs are func-tionally diverse besides PKB134 and PKB135 such examples

Nucleic Acids Research 2005 Vol 33 No 4 1389

of diverse RNAs include IRES (PKB223) RNA 18S rRNA(PKB205 PKB234) HIV-1 50-UTR (PKB239) and alternativepseudoknot of hammerhead ribozyme (PKB173) In this dou-ble pseudoknot group six pairs (1 3 5 6 8 and 9 in Table 1)produce local sequence alignments [using LALIGN (40)

httpworkbenchsdscedu] that suggest significant resultsIn particular the pair PKB134PKB173 (Pair 5 in Table 1)indicates a significant match The pairs PKB134PKB205(Pair 6) and PKB205PKB223 (Pair 10 in Table 1) alsodisplay intriguing sequence and functional similarities

A A A A A A C G G G A A G C A A G G G C U

GGGA

CA A

CCCU

A

A

A

U

GA U

C C C G G

G

A AACC

A

U

PKB 3

Gag-polglobal sequence identity 57

PKB 128U U U A A A C U G U U G A G A G G U G C C U G

U C U C U A C G G A C

GCCUG

GA G

CGGAC

UU U

G

U AUA

G

A

C

PKB 217U U U A A A C U G C U A G C C A C C U C U G G U

G G U G G A G A U C A

GACCGC

CU C C

UGGCG

CU

U

G

G

GCA

U

G GY

AG

CG

ORF 1a - ORF 1bglobal sequence identity 53

G G G A A A C G G G C A G G C G G

GCG

C GCGC

A

G

A

C C G C C

AC

AA

C

PKB 44

G G G A A A U G G A C U G A G G C G

GGCG

C CGGC

A

C

A

C G C C

A

CA

A

C

PKB 240

A

A

ORF2 - ORF3global sequence identity 79

global sequence identity as low as 28

A

U CC

G

C

AA A

C

U

C

GG

C

U

G

UU

AA

PKB 217GCUGACGGUGUYUGGCG

UA

CUC

UG

UG A C C G C

C U G G C G

CC

CCUCUGGU

GUGGAGA

CA

A

U

GU U U A A A C U G C U A G C 5

3

PKB 174

G A AG

UG

AA

AC

U

U

A

U

UG

C A

UUG

UC

UC

U

C

A

U A

U

AGGG

UCCCU

CGGGCCACUG

CCCGGUGAC

G

CC

GU

GU

G

A

CA

C

G C G C U A C A

C G C G A U G U

A G A C C G

U C U G G C

A

2603 3

52476 A A A U U U A U

6830

BPair 4

global sequence identity 37

G G G A A A C U G G A A G G C G G G G C

GCUGCA

GA C

GACUG

G

A

A

C C C C G

PKB 4

C

AA

AUA

UG

Figure 4 Sequence and structural comparisons of pseudoknots in the frameshifting functional families (A) Two secondary structures and corresponding topologiesfor each of the three viral frameshifting subfamilies Gag-pro ORF1aORF1b and ORF2ORF3 Sequence identity within the same subfamily ranges between 50 and80 it is as low as 28 when comparing pseudoknots from different subfamilies All frameshifting pseudoknots are functionally related and have the same topology(dual graph) (B) Secondary structures and their aligned nucleotides (color coded) for frameshifting pseudoknots of LDV (PKB217) and Rous sarcoma virus(PKB174) Global sequence alignments were performed using the program ALIGN with default (164) gap openinggap extension penalties

1390 Nucleic Acids Research 2005 Vol 33 No 4

These relationships suggest structural and functional similar-ities that are not previously known (Table 1)

To support the significance of these matches we performsequence alignments using SmithndashWaterman parameters forgap opening and gap extension penalties (41) We choose thegap opening penalty value of 15 from a range of 0 to 19 Toproduce better alignments for large sequence segments (shortalignments are less likely to be significant in our context) ourgap extension penalty of 1 from a range of 1 to 8 allowslarger extensions consistent with the possibility of the inser-tion of long sequences and added 2D motifs In Table 1 wealso report the alignment score which is the sum of nucleotidematches mismatches and gap penalties

Table 1 shows that the probe structure (PKB178) matchestwo other pseudoknots (PKB135 PKB173) with sequenceidentities ranging from 59 to 71 for 24ndash49 nt alignmentlengths (compared with 56 sequence identity over 129 ntfor the functionally related pair PKB134PKB135 Pair 1)Significantly the pairs PKB134PKB173 (Pair 5) PKB134PKB205 (Pair 6) and PKB205PKB223 (Pair 10) havesequence identity range of 72ndash80 The non-randomness ofthe 10 aligned pairs in Table 1 is also evident in the ratio Rwhich measures the level of significance above the randomexpectation and has value 1 for random pairs

R = (percent sequence identity)(percent sequence identitywhen one of the aligned sequences is randomized)

We also define ltRgt as the R-value averaged over align-ments with different randomized sequences Table 1 showsthat some pseudoknot pairs with no known functional rela-tionship have ltRgt values ranging between 106 and 147whereas the related pair PKB134PKB135 has a value of154 many matched pairs have ltRgt values lt1 Thus thetopological similarities we identify led to matched pairswith non-random sequence alignment results Non-randomsequence matches may arise from topological andor func-tional similarities as illustrated below for three pseudoknotpairs (5 6 and 10 in Table 1)

The rate of false positives generated by topological and ltRgtvalue screening can be easily estimated by considering allpossible pairs from the set of six matched double pseudo-knots in Figure 3 These pseudoknots form 15 distinctpairs 6 of which have ltRgt values gt114 [Table 1 withS(G) = 117] Table 1 shows that only four pairs have largeltRgt values (gt12) Pair 1 (PKB134PKB135) is related Pair 5(PKB134PKB173) and Pair 6 (PKB134PKB205) are mostprobably related and Pair 3 (PKB178PKB173) is of unknownrelationship (in fact both are self-cleaving ribozymes)Assuming that the Pair 3 is unrelated and therefore reflectsa false positive we have an error rate of 115 or 7 If theinferred functionally related Pairs 5 and 6 are also consideredas false positives we then have a conservative error rateof 20

Functional relationship between PKB134 and PKB173(Pair 5 in Table 1)

The 116 nt PKB134 pseudoknot occurs in the 30 end of bromemosaic virus genomic RNA (31) The 73 nt PKB173 is ahammerhead ribozyme of satellite cereal yellow dwarfvirus-RPV (satRPV) RNA capable of adopting an alternativepseudoknot conformation by forming an (L1ndashL2a) stem whose

existence is established by experimental mutational analysis(42) Figure 3 shows that their topologies have the same lsquocorersquobut PKB134 has two extra stemndashloops The local sequencealignment of this pair using LALIGN (Figure 5) shows 19bases aligned to 80 sequence identity and an ltRgt valueof 123 (Table 1) There are two significant segments inPKB173 that are matched with those in PKB134 UACUGU-CUGACGACGUAUCC (nt 11ndash30) and UAGAAGGCUG-GUGCC (nt 39ndash53) These segments and their matchedregions in PKB134 span the stem and unpaired elements ofthe secondary structures (see Figure 5) Significantly bothPKB134 and PKB173 pseudoknots serve as template for rep-licase a protein that recognizes specific sequence regions toinitiate replication of viral genomes (3142) Moreover a con-served sequence (CUGANGA) of PKB173 is in the alignedregion (nt 11ndash30) Thus our analysis suggests that PKB134and PKB173 pseudoknots are structurally and functionallyrelated

Further analysis of pseudoknots PKB134 and PKB173reveals specific domains interacting with the replicase Experi-ments have shown that mutations in domains A C B1 and B2[defined in (31)] of PKB134 impair viral RNA replication (43)Significantly our two aligned sequence regions span domainsA B1 C and also D Thus the replicase recognizes severaldomains in the PKB134 pseudoknot structure In PKB173 thepseudoknot forming five base pair helix (GCGCG) is implic-ated in replicase recognition or ligation (42) This pseudoknotalso has conserved sequence regions ACAAA (61ndash65) or itscomplement (possibly the replication origin) and AGAAA(nucleotides attached to the 30 end but not shown) Howeverthese regions are not observed in PKB134

Functional relationship between PKB205 and PKB223(Pair 10 in Table 1)

The 63 nt PKB223 is the pseudoknot of IRES of capsid-producing cricket paralysis-like viruses (CrPV) formethionine-independent initiation of translation (44) Appear-ing immediately upstream from the capsid-coding sequenceit initiates translation without a canonical AUG (methionine)initiation codon The 48 nt PKB205 is the pseudoknotsubstructure (V4 domain) of SSU of Palmaria palmata 18SrRNA which was found using extensive comparative analysis(4546) Figure 5 (Pair 10 lower panel) shows that PKB223 isa 4-vertex simple pseudoknot with two inserted stemndashloopsand PKB205 is a 4-vertex double pseudoknot with an insertedstemndashloop Figure 5 (middle panel) displays the local sequencealignment of PKB223 and PKB205 yielding a match of 14 ntwith 72 sequence identity This significant sequence simil-arity occurs in regions ACAAUAUCCAGGAA (6148ndash6161)and ACAUUAGCAUGGAA (765ndash778) of PKB223 andPKB205 respectively Remarkably we find that the 9 ntsegment 50-UUAGGUUAG-30 (nt 6100ndash6108) of PKB223is complementary to 30-GGUACGAUU-50 (nt 768ndash776) ofPKB205 except at G6103 Our finding corroborates withthe experimental identification that IRES elements in cellularmRNA contain a short 9 nt (nt 133ndash141 CCGGCGGGU) inthe 50-UTR of the homeodomain protein Gtx which is com-plementary to nt 1132ndash1124 of 18S rRNA (47) Indeed manysuch near complementary sequences have been found bet-ween the 30 end of 18S rRNA and picornavirus RNAs (48)

Nucleic Acids Research 2005 Vol 33 No 4 1391

Fig

ure

5

Th

ree

no

vel

fun

ctio

nal

lyre

late

dp

seu

do

kn

otp

airs

infe

rred

by

top

olo

gy

mat

chin

gan

dse

qu

ence

alig

nm

ent

Up

per

pan

els

Sec

ond

ary

stru

ctu

res

and

thei

ral

ign

edn

ucl

eoti

des

(co

lor

cod

ed)

for

pse

udo

kn

ot

pai

rsP

KB

13

4P

KB

17

3(P

air

5T

able

1

left

pan

el)

PK

B2

05

PK

B2

23

(Pai

r1

0T

able

1

mid

dle

pan

el)

and

PK

B1

34

PK

B20

5(P

air

6

rig

ht

pan

el)

PK

B1

34

occ

urs

in30 e

nd

of

bro

me

mosa

icv

iru

sg

eno

mic

RN

A

PK

B1

73

isa

pse

udo

kn

oto

fsa

tell

ite

cere

aly

ello

wd

war

ftv

iru

s-R

PV

RN

AP

KB

22

3is

the

pse

ud

ok

no

tofin

tern

alen

try

site

(IR

ES

)o

fca

spid

-pro

du

cin

gC

rPV

san

dP

KB

20

5is

the

pse

udo

kn

ots

ub

stru

ctu

re(V

4d

om

ain

)o

fS

SU

of

PP

alm

ata

18

SrR

NA

L

ow

erp

anel

sT

he

alig

ned

sub

gra

ph

so

rre

gio

ns

are

shad

edo

rm

ark

ed

1392 Nucleic Acids Research 2005 Vol 33 No 4

Intriguingly for CrPVs the complementary segments occur inpseudoknots with similar folds not involving the dangling30 end of 18S rRNA Base pair complementary between IRESelements of mRNA and 18S rRNA is a mechanism for ribo-some recruitment and translation In viral IRES elements thesecondary and tertiary structures are believed to play criticalroles in translation initiation (44)

Functional relationship between PKB205 and PKB134(Pair 6 in Table 1)

The functional relationsip of the pair PKB205PKB134 issimilar to that for PKB205PKB223 above PKB134 belongsto a common viral tRNA-like functional class whose role is torecruit host ribosomes to initiate viral protein synthesis (49)It is known to interact with the ribosome via tRNA shapemimicry (31) Our analysis indicates that tRNA-like mole-cules also contain a region which is nearly complementary toPKB205 a pseudoknot region of the 18S rRNA Table 2shows regions of PKB205 complementary to four tRNA-like molecules (PKB134 191 138 17) the complementaryregion in PKB223 is shown for comparison Significantlyeach of the pseudoknots PKB223 PKB134 and PKB191has a region complementary to the same 8ndash9 nt unpairedregion in PKB205 (767ndash776) This single-stranded regionis most probably available for base pairing with incomingtRNA-like molecules In addition two other regions inPKB205 are complementary to tRNA-like pseudoknotsPKB138 and PKB17 one region is a hairpin loop and anotherregion is part of short stems Thus this analysis suggests thattRNA-like molecules may employ both structural mimicry andsequence complementarity to recruit SSU rRNA an interac-tion mode similar to tRNA PKB134 also has a region UACA-GUGUUGAAAAACA (nt 2078ndash2094) closely matched tothe region UAGAGUGUUCAAAGCCA (nt 732ndash748) inPKB205 with an average R-value of 123

Tertiary structure similarity of topological matches inribosomal RNAs

The preceding examination of three pseudoknot pairs showsthat topological matches can suggest functionally relatedpseudoknots It is also important to show that novel topo-logical matches led to similar tertiary structures We focusfor this purpose on rRNA since several bacterial and archaealribosome structures have been solved in recent years In factin our previous work we have found several topologicalmatches of the Scerevisiae 5S rRNA motif within its largerrRNAs two within 16S rRNA (domains II and III) and threewithin 23S rRNA (domains III IV and VI) (2) From availableX-ray structures for Thermus thermophilus 16S rRNA (PDBaccession no 1fjg) (50) and 23S rRNA from Deinococcus

radiodurans (1nkw) and Haloarcula marismortui (1ffk)(5152) only 5S rRNA motifrsquos matches in domain II of16S rRNA and domain III of 23S rRNA are relevant theother three cases are not valid matches in these species

Figure 6 displays the secondary diagrams graphical repres-entations and tertiary structures of 5S rRNA for bacteriumDradiodurans (1nkw) and archaea Hmarismortui (1ffk)and their topologically matched modules within 16S and23S rRNAs the secondary diagrams are adapted from thosein Gutellrsquos comparative RNA website (httpwwwrnaicmbutexasedu) (53) The module in 23S is an identicalmatch for the 6-vertex 5S graph except for a hairpin loopwhich does not contribute to the topological invariant SThe 16S module is a 5-vertex graph because a 2 bp stem atthe junction has a non-canonical base pair AG (dual graph ruleD1 defines a stem as having at least two canonical andorwobble base pairs) for Scerevisiae this stem consists ofonly canonical base pairs making the 16S module an identicalmatch for the 5S topology As shown in Figure 6 the 16S and23S modules have a similar L-shaped tertiary fold as the 5Sstructure Tertiary structure similarity with 5S is more strikingfor the module of domain II (nt 653ndash753) of Tthermophilus16S rRNA than for the module of domain III (nt 1033ndash1143) of23S rRNA Both the 5S structure and 16S module have apronounced 3-stem junction where the chain ends of the16S module are located Structural (backbone) overlap ofthe 5S structure and 16S module shows similar structuralfeatures except for the orientations of the smaller stems atthe junction (We aligned the structures manually because thechain ends of 5S and 16S module occur at non-correspondinglocations) In contrast the 23S module shows significant dis-tortions on the shorter helical arm with the hairpin loop foldedback to the helix as shown in the structural overlap of the 5Sstructure and 23S module Structural similarities seen in theseexamples are probably due to geometric constraints which areapparent in their secondary diagrams and graphical representa-tions Thus 2D topological matches can lead to similar 3Dstructures Below we survey the distribution of modules inrRNAs

Distribution of submotifs in 16S and 23S ribosomalRNAs

Characterizing the submotifs within the 16S and 23S rRNAsthe largest known RNA molecules can help uncover themodular construction of RNA structures as analysis ofrRNArsquos tertiary modules has revealed Furthermore our pre-vious analysis of rRNArsquos structural motifs indicated existenceof characteristic features (35) For example the distributionsof pairedunpaired bases in stems bulgesinternal loops hair-pin loops and junctions follow specific functional forms

Table 2 Short sequence regions in PKB205 complementary to tRNA-like molecules

PKB205 regions Complementary sequences

30-GGUACGAUU-50 (768ndash776) PKB223 50-UUAGGUUAG-30 (6100ndash6108) IRES30-GUGAGAUU-50 (769ndash776) PKB134 50-CACUGUAAA-30 (113ndash120) tRNA-like30-UACGAUUAC-50 (767ndash775) PKB191 50-AUGCUCAUG-30 (77ndash85) tRNA-like30-UGUGAGAUU-50 (731ndash739) PKB138 50-ACACUUUAA-30 (38ndash47) tRNA-like30-GUUUGCGGACC-50 (746ndash756) PKB17 50-CAAAACCCUGG-30 (19ndash29) tRNA-like

Nucleic Acids Research 2005 Vol 33 No 4 1393

Figure 6 Comparison of the graph topologies and 2D and 3D structures of 5S rRNAs (A and B) and modules in 16S and 23S rRNAs (C and D) (C and D) Twostructural (backbone) overlaps of the 5S structure (1ffk gold color) with 16S module (1fjg blue) and 5S structure with 23S module (1nkw green) are shown Thesecondary diagrams are adapted from those in Gutellrsquos comparative RNA website (httpwwwrnaicmbutexasedu) and 3D structures are from the Protein DataBank (httpwwwrcsborgpdb) The 2D modules were identified computationally by our earlier work on finding120 nt 5S rRNA motifs in 16S and 23S rRNAs ofScerevisiae(2) The equivalent tertiary structures and modules for 5S 16S and 23S rRNAs from other organisms show that similar graph topologies can lead tosimilar secondary and tertiary structures All structures have similar graph topologies The Tthermophilus 16S module is a 5-vertex instead of 6-vertex graph becausea 2 bp stem at the junction has a non-canonical base pair AG for Scerevisiae this stem consists of only canonical base pairs making the 16S module an identicalmatch for the 5S topology

1394 Nucleic Acids Research 2005 Vol 33 No 4

Moreover the percentages of nucleotides in stems bulgesinternal loops hairpin loops and junctions do not vary for16S and 23S rRNAs even though the latter is twice thesize of the former Such motif characteristics help discriminateRNA-like from non-RNA-like molecules and guide the designof novel RNAs Here we analyze the distinct modular unitsmdashsmall tree submotifsmdashthat makeup the rRNAs to advance suchapplications Since all small tree graphs up to 10 vertices havebeen exhaustively enumerated (5455) we can use our sub-structure search algorithm to identify the occurrences of alldistinct tree submotifs in rRNAs

Specifically we use sets of tree graphs for V = 4 5 6 7 8 9and 10 comprising 2 3 6 11 23 47 and 106 graphs respect-ively for a total of 198 tree motifs The complete enumeratedmotifs can be found in the RAG web resource (httpmonodbiomathnyuedurna) Figure 7 shows as an example the11 possible 7-vertex tree graphs and lists the frequency ofoccurrence of each tree graph in rRNAs Remarkably thesame tree motifs are abundant in both 16S and 23S rRNAsThe most abundant (10 or 15 times) 7-vertex tree has two3-stem junctions (topological invariant Stree = 6160) 4- and5-stem junctions are the next most prevalent submotifs(Stree = 5137 5921 and 5667) Interestingly the unbranchedtree (Stree = 7219) and highly branched trees (Stree = 4509)are rare

Figure 8 plots the percentage of distinct tree graphs in eachset V (fraction of motifs identified relative to possible motifs)

found in rRNA For small tree graphs V = 4 5 and 6 allpossible topologies are present but not for V gt 6 In fact thepercentage of trees found declines with V In 23S rRNA90 of possible trees are found for V = 7 and only40 (representing 47 tree topologies) are found forV = 10 Larger trees (V gt 10) are expected to have muchsmaller percentages This percentage decline is expectedsince small tree modules are less specific than large treesIn total 111 submotifs out of 198 are identified The observedtrends in Figure 8 may be generally valid for rRNA structuresof other organisms since they are highly conserved althoughsome statistical fluctuations are expected due to structuraldifferences and RNArsquos finite size

These data show that RNA structures are modular constructsof relatively few (tree) building blocks Extrapolation of thecurves in Figure 8 suggests that 20 of V = 11 motifs (totalof 235 trees) and 10 of V = 12 motifs (total of 551 trees)probably exist in rRNAs Thus the total number of buildingblocks in rRNAs with V lt 13 would be 210 Although thepercentages of trees with higher V values are low they aremore numerous and will contribute to the overall number ofbuilding blocks We thus consider 210 as the minimum num-ber of building blocks The building block number may beexploited in the future design of novel functional RNAmolecules using modular assembly from 210 distinct treemotifs The tree motifs may be taken from substructures ofexisting RNAs allowing simultaneous sequence and structure

treeS =

16S 23SOccurrence

ID

6 4

)

0 1

(7 8)

6191

(7 11)

4509

0 1

(7 9)

5385

3 3

7

(7 10)

5137

5 5

(7 6)

5916

5

(7 5)

5921

(7 1)

0 0

7219

(7 3)

2 4

6446

4 4

(7 7)

5667

10 15

(7 4

6160

0 3

6692

(7 2)

Figure 7 Occurrence frequency of 11 distinct 7-vertex tree topologies as submotifs of Scerevisiae 16S and 23S rRNAs The tree graphs are labeled using topologicalinvariant Stree (second top row) and ID number (i j) (first row) which refers to motif identification number as used in RAG database (httpmonodbiomathnyuedurna) The two numbers at the bottom of each graph refer to the number of times that topology is found embedded in 16S and in 23S rRNAs respectively

Nucleic Acids Research 2005 Vol 33 No 4 1395

design Such a build-up approach appears feasible from our(26) and other works (35) and is akin to the lsquobuilduprsquo tech-nique for proteins pioneered by Vasquez and Scheraga (56)

The above estimate of building block number can be eithera consequence of the limited number of available structures orthe fact that RNA adopts only a subset of possible graphsCertainly future discoveries of novel RNA structures mightindicate a larger repertoire of building blocks Still our sur-veys of natural RNAs (26) suggest that RNAs possess specifictopological characteristics and favor a subset of the possiblegraph motifs The missing motifs may not exist in nature forRNA owing to physical or functional constraints

The building block number can also be defined in terms ofirreducible graphs ie trees that cannot be decomposed intosmaller trees However this approach has a major drawbackSince all RNAs are composed of connected stems their sub-structures can be reduced to two vertex graphs which is not aninformative description of complex RNA folds Still if wedefine 4-vertex trees as the smallest irreducible graphs thenumber of irreducible graphs can be calculated Based on oursubmotif data we estimate that for V up to 10 there are about42 irreducible tree modules in rRNAs

SUMMARY AND CONCLUSION

Our automated comparison of RNA secondary structuresbased on graphical representation allows the examination ofconserved 2D RNA topological features which in turn suggestfunctional relationships This approach has an advantage oversequence alignment because functionally related RNAs oftenlack sequence similarity as shown for frameshifting pseudo-knots with low global sequence identities Of course ourstructure alignment approach relies on available 2D RNA

structures and thus will increase in scope as more RNAstructures become available Our eight identified pairs ofpotentially related pseudoknots (Table 1) reveal three pairsthat are functionally related the pseudoknots of 30 end ofbrome mosaic virus genomic RNA (PKB134) and alternativehammerhead ribozyme pseudoknot (PKB173) are templatesfor replicase to initiate replication of viral genomes the pseu-doknots of internal entry site (IRES) of caspid-producingCrPVs (PKB223) and SSU of 18S rRNA (PKB205) are asso-ciated with translation initiation and the pseudoknots of 18SrRNA (PKB205) and viral tRNA-like (PKB134) most prob-ably interact via both structural mimicry and base comple-mentarity To the best of our knowledge these functionalrelationships have not been described previously

The developed RNA comparison algorithm also led us toquantify the modularity of 16S and 23S rRNAs using smallenumerated tree motifs Our results suggest that rRNAs areconstructed from at least 210 distinct tree subtopologies(or modules) having up to 180 nt Significantly we findthat 5S rRNA and two tree modules within 16S and 23SrRNAs with similar topologies have similar tertiary shapesaffirming the relationship between 2D graphical representa-tion and 3D structure The tree modules in rRNA may be usedas building blocks for the design of novel RNA folds A similaridea has been employed to design and improve functionalRNA molecules (510) Design by assembly of modulesexploits motifs present in natural RNAs unlike de novo designof sequences and structures (11) The issues of discoveringnovel topologies (6) building module library and designingfunctional folds are likely to increase in importance withinterest in novel RNAs

Our algorithm has several limitations Of course its utilitydepends on availability of reliable RNA secondary foldingalgorithms (5758) More fundamentally although simplifiedgraphical representations specify topological characteristicsdetailed sequence and motif information is missing An asso-ciated problem is that unusual RNA structurendashfunction rela-tionships due to deleterious mutations as in frameshiftingpseudoknots (28) and structural flexibility are not capturedMore advanced graph constructs such as weighted graphs mayhowever specify detailed motif features at the secondary andbase pair levels Another limitation of the present algorithmis the requirement to fix the probe structure or graph As inprotein structure alignment algorithms (15) future refinementsof our algorithm should allow some flexibility in motifdefinitions as the similarity search proceeds We inviteRNA scientists to visit our RAG web resource at httpmonodbiomathnyuedurna and suggest to us specific enhancementsthat will be useful in practice

ACKNOWLEDGEMENTS

We thank Dinshaw Patel for comments that motivated thiswork This research was supported by a Joint NSFNIGMSInitiative in Mathematical Biology (DMS-0201160) and theHuman Frontier Science Program (HFSP)

REFERENCES

1 TinocoI and BustamanteC (1999) How RNA folds J Mol Biol293 271ndash281

Figure 8 Distribution of all possible distinct tree modules with V vertices foundin Scerevisiae 16S and 23S rRNAs In graph theory each V is represented by aset of possible tree structures For example for V= 4 5 6 7 8 9 and 10 there are2 3 6 11 23 47 and 106 possible trees respectively The distribution of treesquantifies the diversity of tree modules within rRNAs This analysis suggeststhat rRNAs possess at least 210 tree modules

1396 Nucleic Acids Research 2005 Vol 33 No 4

2 GanHH PasqualiS and SchlickT (2003) Exploring the repertoire ofRNA secondary motifs using graph theory implications for RNA designNucleic Acids Res 31 2926ndash2943

3 SoukupGA and BreakerRR (1999) Engineering precision RNAmolecular switches Proc Natl Acad Sci USA 96 3584ndash3589

4 BreakerRR (2002) Engineered allosteric ribozymes as biosensorcomponents Curr Opin Biotechnol 13 31ndash39

5 IkawaY FukadaK WatanabeS ShiraishiH and InoueT (2002)Design construction and analysis of a novel class of self-foldingRNA Structure 10 527ndash534

6 KimN ShiffeldrimN GanHH and SchlickT (2004) Candidatesfor novel RNA topologies J Mol Biol 341 1129ndash1144

7 MitchellJR ChengJ and CollinsK (1999) A box HACAsmall nucleolar RNA-like domain at the human telomerase RNA 30

end Mol Cell Biol 19 567ndash5768 DoudnaJA and CechTR (2002) The chemical repertoire

of natural ribozymes Nature 418 222ndash2289 StorzG (2002) An expanding universe of noncoding RNAs Science

296 1260ndash126310 BreakerRR and JoyceGF (1994) Inventing and improving ribozyme

function rational design versus iterative selection methods TrendsBiotechnol 12 268ndash275

11 AndronescuM FejesAP HutterF HoosHH and CondonA (2004)A new algorithm for RNA secondary structure design J Mol Biol336 607ndash624

12 CarterRJ DubchakI and HolbrookSR (2001) A computationalapproach to identify genes for functional RNAs in genomic sequencesNucleic Acids Res 29 3928ndash3938

13 DoudnaJA (2000) Structural genomics of RNA Nature Struct Biol7 954ndash956

14 RivasE KleinRJ JonesTA and EddySR (2001) Computationalidentification of noncoding RNAs in Ecoli by comparativegenomics Curr Biol 11 1369ndash1373

15 HolmL and SanderC (1996) Mapping the protein universeScience 273 595ndash602

16 ShindyalovIN and BournePE (2001) A database and tools for 3-Dprotein structure comparison and alignment using the CombinatorialExtension (CE) algorithm Nucleic Acids Res 29 228ndash229

17 BenedettiG and MorosettiS (1996) A graph-topological approach torecognition of pattern and similarity in RNA secondary structuresBiophys Chem 59 179ndash184

18 MargalitH ShapiroBA OppenheimAB and MaizelJVJr (1989)Detection of common motifs in RNA secondary structures Nucleic AcidsRes 17 4829ndash4845

19 ShapiroBA and ZhangKZ (1990) Comparing multiple RNAsecondary structures using tree comparisons Comput Appl Biosci6 309ndash318

20 LeSY NussinovR and MaizelJV (1989) Tree graphs of RNAsecondary structures and their comparisons Comput Biomed Res22 461ndash473

21 DuarteCM WadleyLM and PyleAM (2003) RNA structurecomparison motif search and discovery using a reduced representation ofRNA conformational space Nucleic Acids Res 31 4755ndash4761

22 ChevaletC and MichotB (1992) An algorithm for comparing RNAsecondary structures and searching for similar substructures ComputAppl Biosci 8 215ndash225

23 WangL and ZhaoJ (2003) Parametric alignment of ordered treesBioinformatics 19 2237ndash2245

24 FontanaW KoningsDAM StadlerPF and SchusterP (1993)Statistics of RNA secondary structures Biopolymers 33 1389ndash1404

25 van BatenburgFHD GultyaevAP PleijCWA NgJ andOliehoekJ (2000) PseudoBase a database with RNA pseudoknotsNucleic Acids Res 28 201ndash204

26 GanHH FeraD ZornJ ShiffeldrimN TangM LasersonUKimN and SchlickT (2004) RAG RNA-As-Graphsdatabasemdashconcepts analysis and features Bioinformatics20 1285ndash1291

27 FeraD KimN ShiffeldrimN ZornJ LasersonU GanHH andSchlickT (2004) RAG RNA-As-Graphs web resource BMCBioinformatics 5 88

28 ChenXY KangHS ShenLX ChamorroM VarmusHE andTinocoI (1996) A characteristic bent conformation of RNA pseudoknotspromotes 1 frameshifting during translation of retroviral RNAJ Mol Biol 260 479ndash483

29 KangHS HinesJV and TinocoI (1996) Conformation of anon-frameshifting RNA pseudoknot from mouse mammary tumor virusJ Mol Biol 259 135ndash147

30 GanHH PerlowRA RoyS KoJ WuM HuangJ YanSNicolettaA VafaiJ SunD et al (2002) Analysis of protein sequencestructure similarity relationships Biophys J 83 2781ndash2791

31 FeldenB FlorentzC GiegeR and WesthofE (1994) Solutionstructure of the 30-end of brome mosaic-virus genomic RNAsmdashconformational mimicry with canonical transfer-RNAs J Mol Biol235 508ndash531

32 GrossJL and YellenJ (1999) Graph Theory and its Applications CRCPress Boca Raton FL

33 KoblerJ SchoningU and ToraanJ (1993) The Graph IsomorphismProblem Birkhauser Boston MA

34 RandicM and ZupanJ (2001) On interpretation of well-knowntopological indices J Chem Inform Comput Sci 41 550ndash560

35 ZornJ GanHH ShiffeldrimN and SchlickT (2004) Structuralmotifs in ribosomal RNAs implications for RNA design and genomicsBiopolymers 73 340ndash347

36 JacksT MadhaniHD MasiarzFR and VarmusHE (1988) Signalsfor ribosomal frameshifting in the Rous-Sarcoma virus GagndashPolregion Cell 55 447ndash458

37 GodenyEK ChenL KumarSN MethvenSL KooninEV andBrintonMA (1993) Complete genomic sequence and phylogeneticanalysis of the lactate dehydrogenase-elevating virus (Ldv) Virology194 585ndash596

38 MarczinkeB FisherR VidakovicM BloysAJ and BrierleyI (1998)Secondary structure and mutational analysis of the ribosomalframeshift signal of Rous sarcoma virus J Mol Biol 284 205ndash225

39 MyersEW and MillerW (1988) Optimal alignments in linear spaceComput Appl Biosci 4 11ndash17

40 HuangXQ and MillerW (1991) A time-efficient linear-space localsimilarity algorithm Adv Appl Math 12 337ndash357

41 SmithTF and WatermanMS (1981) Identification of commonmolecular subsequences J Mol Biol 147 195ndash197

42 SongSI SilverSL AulikMA RasochovaL MohanBR andMillerWA (1999) Satellite cereal yellow dwarf virus-RPV (satRPV)RNA requires a DouXble hammerhead for self-cleavage and analternative structure for replication J Mol Biol 293 781ndash793

43 DreherTW and HallTC (1988) Mutational analysis of thetransfer-RNA mimicry of brome mosaic-virus RNAmdashsequence andstructural requirements for aminoacylation and 30-adenylationJ Mol Biol 201 41ndash55

44 KanamoriY and NakashimaN (2001) A tertiary structure model of theinternal ribosome entry site (IRES) for methionine-independent initiationof translation RNA 7 266ndash274

45 WuytsJ De RijkP Van de PeerY PisonG RousseeuwP andDe WachterR (2000)Comparativeanalysisof more than 3000 sequencesreveals the existence of two pseudoknots in area V4 of eukaryotic smallsubunit ribosomal RNA Nucleic Acids Res 28 4698ndash4708

46 Van de PeerY De RijkP WuytsJ WinkelmansT and De WachterR(2000) The European small subunit ribosomal RNA databaseNucleic Acids Res 28 175ndash176

47 ChappellSA EdelmanGM and MauroVP (2000) A 9-nt segment ofa cellular mRNA can function as an internal ribosome entry site (IRES)and when present in linked multiple copies greatly enhances IRESactivity Proc Natl Acad Sci USA 97 1536ndash1541

48 ScheperGC VoormaHO and ThomasAA (1994) Basepairing with18S ribosomal RNA in internal initiation of translation FEBS Lett352 271ndash275

49 BarendsS Rudinger-ThirionJ FlorentzC GiegeR PleijCW andKraalB (2004) tRNA-like structure regulates translation of Bromemosaic virus RNA J Virol 78 4003ndash4010

50 CarterAP ClemonsWM BrodersenDE Morgan-WarrenRJWimberlyBT and RamakrishnanV (2000) Functional insights fromthe structure of the 30S ribosomal subunit and its interactions withantibiotics Nature 407 340ndash348

51 HarmsJ SchluenzenF ZarivachR BashanA GatS AgmonIBartelsH FranceschiF and YonathA (2001) High resolution structureof the large ribosomal subunit from a mesophilic Eubacterium Cell 107679ndash688

52 BanN NissenP HansenJ MoorePB and SteitzTA (2000) Thecomplete atomic structure of the large ribosomal subunit at 24 angstromresolution Science 289 905ndash920

Nucleic Acids Research 2005 Vol 33 No 4 1397

53 CannoneJJ SubramanianS SchnareMN CollettJRDrsquoSouzaLM DuYS FengB LinN MadabusiLV MullerKMet al (2002) The Comparative RNA Web (CRW) Site an online databaseof comparative sequence and structure information for ribosomal intronand other RNAs BMC Bioinformatics 3 2

54 HararyF and PrinsG (1959) The number of homeomorphicallyirreducible trees and other species Acta Math 101141ndash162

55 HararyF (1969) Graph Theory Addison-Wesley Reading Mass

56 VasquezM and ScheragaHA (1985) Use of buildup andenergy-minimization procedures to compute low-energy structures ofthe backbone of enkephalin Biopolymers 24 1437ndash1447

57 ZukerM MathewsDH and TurnerDH (1999) Algorithms andthermodynamics for RNA secondary structure prediction A practicalguide In BarciszewskiJ and ClarkBFC (eds) RNA Biochemistry andBiotechnology Kluwer Academic Publishers Dordrecht pp 11ndash43

58 HofackerIL (2003) Vienna RNA secondary structure serverNucleic Acids Res 31 3429ndash3431

1398 Nucleic Acids Research 2005 Vol 33 No 4

Page 5: doi:10.1093/nar/gki267 Modular RNA architecture revealed ... · Modular RNA architecture revealed by computational analysis of existing pseudoknots and ... Samuela Pasquali, ... RNAs

non-isomorphic graphs (for V = 5) that resulted in the sameinvariant S (see Figure 2B) In a previous work we have alsoused the Laplacian eigenvalue spectrum as a topologicalinvariant yielding a comparable error rate of a few percent(26) Some false positive matches are reported because of thisgraph isomorphism problem Since we manually examine allpositive matches the false positive matches are eliminated

Searching for the motif of a 10-vertex graph in a 30-vertexgraph requires 20 min CPU time on an SGI 300 MHz MIPSR12000 IP27 processor The search of a 15-vertex motif in a40-vertex graph would imply 1000-fold increase in CPU time

In practice our method can easily search (ie in minutes) forsubgraphs of size ranging from 5 to 11 vertices within graphsof size up to 30 vertices (corresponding to 600 nt RNAs)

RESULTS AND DISCUSSION

RNA pseudoknots

We analyze the structural similarity of a set of 14 pseudoknotsranging from 41 to 161 nt Figure 3 shows the secondarystructures and graphical representations of the pseudoknots

Viral frameshiftin g

PKB217

PKB174PKB131

PKB178

PKB223

PKB134PKB135Self cleavingVS ribozyme

PKB205

18S ribosomal RNA

PKB191

PKB173

Hammerhead ribozyme

PKB239

HIV1 5rsquo UTR

frameshiftingViral

Internal ribosome entry siteAptamer

PKB143

tmRNA

PKB210

PKB233

Figure 3 Topological matches generated by the simple (PKB217 PKB233) and double (PKB178) pseudoknots The pseudoknots are shown as schematic secondarystructures and dual graphs There are 2 probe and 11 target pseudoknot topologies representing diverse functional RNA pseudoknots For each probe pseudoknotstructure the corresponding matched submotifssubgraphs in the larger pseudoknots are shaded

1388 Nucleic Acids Research 2005 Vol 33 No 4

selected from the PseudoBase (25) (httpwwwbioleidenunivnl~BatenburgPKBhtml) The RNA pseudoknots are viralframeshifting (PKB174 PKB217 PKB233) viral tRNA-like(PKB134 PKB135 PKB143 PKB191) IRES RNA (PKB223)18S rRNA (PKB205) HIV-1 50-untranslated region (50-UTR)(PKB239) alternative pseudoknot of hammerhead ribozyme(PKB173) aptamer that binds human nerve growth factor(PKB131) tmRNA (PKB210) and self-cleaving Neurosporavs ribozyme (PKB178) (PKB233 and PKB217 appear thesame but have different sequences and this difference affectssubsequent sequence alignment analysis) These pseudoknotclasses are not known to be functionally related We choosemore than one member for some classes because of exhibitedtopological differences

The above pseudoknot structures were derived frommutational analysis (PKB173 PKB174 PKB191 PKB233PKB239) enzymatic and chemical structure probing(PKB131 PKB174 PKB191 PKB239) phylogenetic orsequence comparison analysis (PKB131 PKB135 PKB143PKB205 PKB217 PKB233 PKB234 PKB191 PKB210PKB239) and 3D modeling (PKB134 PKB135 PKB210)The pseudoknots listed in more than one category were probedusing two or more methods

To use these structures for graphical analysis we convertthe base pairing information of each structure provided byPSEUDOBASE to a dual graphical representation using ourdual graph rules [D1ndashD3 (2)] We then use pseudoknot graphsto specify their corresponding adjacency matrices Thesetwo steps are performed manually For RNA tree structureswe use our RNA Matrix program to automatically convert asecondary structure into its corresponding adjacency matrix(2635) The search for a small motif within a larger motif isperformed using input adjacency matrices and the comparisonof matrices is performed via the topological invariant S(Equation 2)

Sequence and structural relationships of functionallyrelated pseudoknots

We illustrate using several pseudoknot pairs the functionalsimilarity identified by topological rather than sequence sim-ilarity Figure 4 compares viral frameshifting pseudoknotstructures Viral RNA frameshifting is a process by whichexpression of overlapping open reading frames (ORFs) canbe achieved (36) We compare three pairs from subfamiliesGag-pro ORF1aORF1b and ORF2ORF3 ribosomalframeshifting Frameshifting signals involve two essentialcomponents a lsquoslipperyrsquo (or sliding) sequence of nucleotideswhere frameshifting takes place and a stimulatory RNAsecondary structure (usually a pseudoknot) located a fewnucleotides downstream Although members of the sameframeshifting subfamilies have significant sequence similarityglobal sequence identity between subfamilies is as low as28 similar to random sequences Because all frameshiftingpseudoknots in Figure 4A have the same topology and func-tion finding topological similarity between RNA structureshas an advantage for functional annotation over sequencecomparison

Figure 4B shows that this finding also holds for a frameshift-ing pseudoknot pair (PKB217 and PKB174 Pair 4 in Table 1)The 72 nt PKB217 is the 1 frameshifting pseudoknot of

lactate dehydrogenase-elevating virus (LDV) (37) Togetherwith its sliding sequence UUUAAAC it regulates the expres-sion of ORF1a for a polyprotein containing proteases andORF1b for an RNA-dependent RNA polymerase in LDVThe 127 nt PKB174 is the 1 frameshifting pseudoknot ofRous sarcoma virus (RSV) (3638) This pseudoknot and slid-ing sequence AAAUUUA regulate the expression of Gag-Pro-Pol and Gag-Pol polyproteins Although both the LDV andRSV frameshifting pseudoknots have a simple pseudoknotstructure the latter has two extra stemndashloops Despite theirfunctional similarity the global sequence identity is only 37[computed using the program ALIGN (39) available at httpworkbenchsdscedu] comparable with random sequencestheir sliding sequences are also dissimilar Using our localsequence alignment parameters we find a large significantmatched region in PKB217 (6833ndash6877) as shown inFigure 4B AAACUGCUAGCCACCUCUGGUCUCGACC-GCUGUACUAGAGGUGG

Although functional similarity between frameshiftingpseudoknots is expected this case illuminates how examiningtopological similarity aids in identifying pseudoknots withsimilar functions despite low overall sequence similarity

Comparison of pseudoknot structures

Our two probe motifsmdasha simple pseudoknot (PKB217 orPKB233) and a double pseudoknot (PKB178)mdashare used tosearch larger RNAs We generate two groups of topologicalmatches as shown in Figure 3 with the matched motifs showndarkened we found 11 such matched pseudoknot pairs Inaddition we analyze sequence similarity of many of the84 (14 middot 132) possible pseudoknot pairs since topologicalsimilarities are apparent within each pseudoknot group gen-erated by the probe motifs Although not exhaustive thisadditional analysis may help to identify pairs having similarsubstructures that are not detected by using fixed probes(simple and double pseudoknots) Table 1 lists 10 pseudoknotpairs (out of the 11 probestructure matches and subset of the84 structurendashstructure analysis) that have significant structuraland sequence similarities Of the total of 10 pairs listed(Pairs 1ndash10) two known functional pairs (1 and 4 in Table 1)are included for comparison with the eight other pairs that arenot previously known to be functionally related

We begin by analyzing the overall patterns of relationshipamong our 14 pseudoknots Figure 3 organizes the matchesinto two groups five matches for the simple pseudoknot(PKB217) and six for the double pseudoknot (PKB178)The matched motifs appear to be lsquomodular constructionsrsquoaround the probe motifmodule with added stemndashloops(three or fewer) To the best of our knowledge these resultsreveal modular features and topological similarities of RNApseudoknots that have not been described

The biological significance of topological matches dependson the size and complexity of pseudoknots a match for thedouble pseudoknot is probably more significant than for asimple pseudoknot The six pseudoknots generated by thedouble pseudoknot probe PKB178 show striking topologicalas well as functional similarity (eg viral tRNA-like PKB134and PKB135) we analyze their similarity with PKB178 andamong themselves Still the matched pairs of RNAs are func-tionally diverse besides PKB134 and PKB135 such examples

Nucleic Acids Research 2005 Vol 33 No 4 1389

of diverse RNAs include IRES (PKB223) RNA 18S rRNA(PKB205 PKB234) HIV-1 50-UTR (PKB239) and alternativepseudoknot of hammerhead ribozyme (PKB173) In this dou-ble pseudoknot group six pairs (1 3 5 6 8 and 9 in Table 1)produce local sequence alignments [using LALIGN (40)

httpworkbenchsdscedu] that suggest significant resultsIn particular the pair PKB134PKB173 (Pair 5 in Table 1)indicates a significant match The pairs PKB134PKB205(Pair 6) and PKB205PKB223 (Pair 10 in Table 1) alsodisplay intriguing sequence and functional similarities

A A A A A A C G G G A A G C A A G G G C U

GGGA

CA A

CCCU

A

A

A

U

GA U

C C C G G

G

A AACC

A

U

PKB 3

Gag-polglobal sequence identity 57

PKB 128U U U A A A C U G U U G A G A G G U G C C U G

U C U C U A C G G A C

GCCUG

GA G

CGGAC

UU U

G

U AUA

G

A

C

PKB 217U U U A A A C U G C U A G C C A C C U C U G G U

G G U G G A G A U C A

GACCGC

CU C C

UGGCG

CU

U

G

G

GCA

U

G GY

AG

CG

ORF 1a - ORF 1bglobal sequence identity 53

G G G A A A C G G G C A G G C G G

GCG

C GCGC

A

G

A

C C G C C

AC

AA

C

PKB 44

G G G A A A U G G A C U G A G G C G

GGCG

C CGGC

A

C

A

C G C C

A

CA

A

C

PKB 240

A

A

ORF2 - ORF3global sequence identity 79

global sequence identity as low as 28

A

U CC

G

C

AA A

C

U

C

GG

C

U

G

UU

AA

PKB 217GCUGACGGUGUYUGGCG

UA

CUC

UG

UG A C C G C

C U G G C G

CC

CCUCUGGU

GUGGAGA

CA

A

U

GU U U A A A C U G C U A G C 5

3

PKB 174

G A AG

UG

AA

AC

U

U

A

U

UG

C A

UUG

UC

UC

U

C

A

U A

U

AGGG

UCCCU

CGGGCCACUG

CCCGGUGAC

G

CC

GU

GU

G

A

CA

C

G C G C U A C A

C G C G A U G U

A G A C C G

U C U G G C

A

2603 3

52476 A A A U U U A U

6830

BPair 4

global sequence identity 37

G G G A A A C U G G A A G G C G G G G C

GCUGCA

GA C

GACUG

G

A

A

C C C C G

PKB 4

C

AA

AUA

UG

Figure 4 Sequence and structural comparisons of pseudoknots in the frameshifting functional families (A) Two secondary structures and corresponding topologiesfor each of the three viral frameshifting subfamilies Gag-pro ORF1aORF1b and ORF2ORF3 Sequence identity within the same subfamily ranges between 50 and80 it is as low as 28 when comparing pseudoknots from different subfamilies All frameshifting pseudoknots are functionally related and have the same topology(dual graph) (B) Secondary structures and their aligned nucleotides (color coded) for frameshifting pseudoknots of LDV (PKB217) and Rous sarcoma virus(PKB174) Global sequence alignments were performed using the program ALIGN with default (164) gap openinggap extension penalties

1390 Nucleic Acids Research 2005 Vol 33 No 4

These relationships suggest structural and functional similar-ities that are not previously known (Table 1)

To support the significance of these matches we performsequence alignments using SmithndashWaterman parameters forgap opening and gap extension penalties (41) We choose thegap opening penalty value of 15 from a range of 0 to 19 Toproduce better alignments for large sequence segments (shortalignments are less likely to be significant in our context) ourgap extension penalty of 1 from a range of 1 to 8 allowslarger extensions consistent with the possibility of the inser-tion of long sequences and added 2D motifs In Table 1 wealso report the alignment score which is the sum of nucleotidematches mismatches and gap penalties

Table 1 shows that the probe structure (PKB178) matchestwo other pseudoknots (PKB135 PKB173) with sequenceidentities ranging from 59 to 71 for 24ndash49 nt alignmentlengths (compared with 56 sequence identity over 129 ntfor the functionally related pair PKB134PKB135 Pair 1)Significantly the pairs PKB134PKB173 (Pair 5) PKB134PKB205 (Pair 6) and PKB205PKB223 (Pair 10) havesequence identity range of 72ndash80 The non-randomness ofthe 10 aligned pairs in Table 1 is also evident in the ratio Rwhich measures the level of significance above the randomexpectation and has value 1 for random pairs

R = (percent sequence identity)(percent sequence identitywhen one of the aligned sequences is randomized)

We also define ltRgt as the R-value averaged over align-ments with different randomized sequences Table 1 showsthat some pseudoknot pairs with no known functional rela-tionship have ltRgt values ranging between 106 and 147whereas the related pair PKB134PKB135 has a value of154 many matched pairs have ltRgt values lt1 Thus thetopological similarities we identify led to matched pairswith non-random sequence alignment results Non-randomsequence matches may arise from topological andor func-tional similarities as illustrated below for three pseudoknotpairs (5 6 and 10 in Table 1)

The rate of false positives generated by topological and ltRgtvalue screening can be easily estimated by considering allpossible pairs from the set of six matched double pseudo-knots in Figure 3 These pseudoknots form 15 distinctpairs 6 of which have ltRgt values gt114 [Table 1 withS(G) = 117] Table 1 shows that only four pairs have largeltRgt values (gt12) Pair 1 (PKB134PKB135) is related Pair 5(PKB134PKB173) and Pair 6 (PKB134PKB205) are mostprobably related and Pair 3 (PKB178PKB173) is of unknownrelationship (in fact both are self-cleaving ribozymes)Assuming that the Pair 3 is unrelated and therefore reflectsa false positive we have an error rate of 115 or 7 If theinferred functionally related Pairs 5 and 6 are also consideredas false positives we then have a conservative error rateof 20

Functional relationship between PKB134 and PKB173(Pair 5 in Table 1)

The 116 nt PKB134 pseudoknot occurs in the 30 end of bromemosaic virus genomic RNA (31) The 73 nt PKB173 is ahammerhead ribozyme of satellite cereal yellow dwarfvirus-RPV (satRPV) RNA capable of adopting an alternativepseudoknot conformation by forming an (L1ndashL2a) stem whose

existence is established by experimental mutational analysis(42) Figure 3 shows that their topologies have the same lsquocorersquobut PKB134 has two extra stemndashloops The local sequencealignment of this pair using LALIGN (Figure 5) shows 19bases aligned to 80 sequence identity and an ltRgt valueof 123 (Table 1) There are two significant segments inPKB173 that are matched with those in PKB134 UACUGU-CUGACGACGUAUCC (nt 11ndash30) and UAGAAGGCUG-GUGCC (nt 39ndash53) These segments and their matchedregions in PKB134 span the stem and unpaired elements ofthe secondary structures (see Figure 5) Significantly bothPKB134 and PKB173 pseudoknots serve as template for rep-licase a protein that recognizes specific sequence regions toinitiate replication of viral genomes (3142) Moreover a con-served sequence (CUGANGA) of PKB173 is in the alignedregion (nt 11ndash30) Thus our analysis suggests that PKB134and PKB173 pseudoknots are structurally and functionallyrelated

Further analysis of pseudoknots PKB134 and PKB173reveals specific domains interacting with the replicase Experi-ments have shown that mutations in domains A C B1 and B2[defined in (31)] of PKB134 impair viral RNA replication (43)Significantly our two aligned sequence regions span domainsA B1 C and also D Thus the replicase recognizes severaldomains in the PKB134 pseudoknot structure In PKB173 thepseudoknot forming five base pair helix (GCGCG) is implic-ated in replicase recognition or ligation (42) This pseudoknotalso has conserved sequence regions ACAAA (61ndash65) or itscomplement (possibly the replication origin) and AGAAA(nucleotides attached to the 30 end but not shown) Howeverthese regions are not observed in PKB134

Functional relationship between PKB205 and PKB223(Pair 10 in Table 1)

The 63 nt PKB223 is the pseudoknot of IRES of capsid-producing cricket paralysis-like viruses (CrPV) formethionine-independent initiation of translation (44) Appear-ing immediately upstream from the capsid-coding sequenceit initiates translation without a canonical AUG (methionine)initiation codon The 48 nt PKB205 is the pseudoknotsubstructure (V4 domain) of SSU of Palmaria palmata 18SrRNA which was found using extensive comparative analysis(4546) Figure 5 (Pair 10 lower panel) shows that PKB223 isa 4-vertex simple pseudoknot with two inserted stemndashloopsand PKB205 is a 4-vertex double pseudoknot with an insertedstemndashloop Figure 5 (middle panel) displays the local sequencealignment of PKB223 and PKB205 yielding a match of 14 ntwith 72 sequence identity This significant sequence simil-arity occurs in regions ACAAUAUCCAGGAA (6148ndash6161)and ACAUUAGCAUGGAA (765ndash778) of PKB223 andPKB205 respectively Remarkably we find that the 9 ntsegment 50-UUAGGUUAG-30 (nt 6100ndash6108) of PKB223is complementary to 30-GGUACGAUU-50 (nt 768ndash776) ofPKB205 except at G6103 Our finding corroborates withthe experimental identification that IRES elements in cellularmRNA contain a short 9 nt (nt 133ndash141 CCGGCGGGU) inthe 50-UTR of the homeodomain protein Gtx which is com-plementary to nt 1132ndash1124 of 18S rRNA (47) Indeed manysuch near complementary sequences have been found bet-ween the 30 end of 18S rRNA and picornavirus RNAs (48)

Nucleic Acids Research 2005 Vol 33 No 4 1391

Fig

ure

5

Th

ree

no

vel

fun

ctio

nal

lyre

late

dp

seu

do

kn

otp

airs

infe

rred

by

top

olo

gy

mat

chin

gan

dse

qu

ence

alig

nm

ent

Up

per

pan

els

Sec

ond

ary

stru

ctu

res

and

thei

ral

ign

edn

ucl

eoti

des

(co

lor

cod

ed)

for

pse

udo

kn

ot

pai

rsP

KB

13

4P

KB

17

3(P

air

5T

able

1

left

pan

el)

PK

B2

05

PK

B2

23

(Pai

r1

0T

able

1

mid

dle

pan

el)

and

PK

B1

34

PK

B20

5(P

air

6

rig

ht

pan

el)

PK

B1

34

occ

urs

in30 e

nd

of

bro

me

mosa

icv

iru

sg

eno

mic

RN

A

PK

B1

73

isa

pse

udo

kn

oto

fsa

tell

ite

cere

aly

ello

wd

war

ftv

iru

s-R

PV

RN

AP

KB

22

3is

the

pse

ud

ok

no

tofin

tern

alen

try

site

(IR

ES

)o

fca

spid

-pro

du

cin

gC

rPV

san

dP

KB

20

5is

the

pse

udo

kn

ots

ub

stru

ctu

re(V

4d

om

ain

)o

fS

SU

of

PP

alm

ata

18

SrR

NA

L

ow

erp

anel

sT

he

alig

ned

sub

gra

ph

so

rre

gio

ns

are

shad

edo

rm

ark

ed

1392 Nucleic Acids Research 2005 Vol 33 No 4

Intriguingly for CrPVs the complementary segments occur inpseudoknots with similar folds not involving the dangling30 end of 18S rRNA Base pair complementary between IRESelements of mRNA and 18S rRNA is a mechanism for ribo-some recruitment and translation In viral IRES elements thesecondary and tertiary structures are believed to play criticalroles in translation initiation (44)

Functional relationship between PKB205 and PKB134(Pair 6 in Table 1)

The functional relationsip of the pair PKB205PKB134 issimilar to that for PKB205PKB223 above PKB134 belongsto a common viral tRNA-like functional class whose role is torecruit host ribosomes to initiate viral protein synthesis (49)It is known to interact with the ribosome via tRNA shapemimicry (31) Our analysis indicates that tRNA-like mole-cules also contain a region which is nearly complementary toPKB205 a pseudoknot region of the 18S rRNA Table 2shows regions of PKB205 complementary to four tRNA-like molecules (PKB134 191 138 17) the complementaryregion in PKB223 is shown for comparison Significantlyeach of the pseudoknots PKB223 PKB134 and PKB191has a region complementary to the same 8ndash9 nt unpairedregion in PKB205 (767ndash776) This single-stranded regionis most probably available for base pairing with incomingtRNA-like molecules In addition two other regions inPKB205 are complementary to tRNA-like pseudoknotsPKB138 and PKB17 one region is a hairpin loop and anotherregion is part of short stems Thus this analysis suggests thattRNA-like molecules may employ both structural mimicry andsequence complementarity to recruit SSU rRNA an interac-tion mode similar to tRNA PKB134 also has a region UACA-GUGUUGAAAAACA (nt 2078ndash2094) closely matched tothe region UAGAGUGUUCAAAGCCA (nt 732ndash748) inPKB205 with an average R-value of 123

Tertiary structure similarity of topological matches inribosomal RNAs

The preceding examination of three pseudoknot pairs showsthat topological matches can suggest functionally relatedpseudoknots It is also important to show that novel topo-logical matches led to similar tertiary structures We focusfor this purpose on rRNA since several bacterial and archaealribosome structures have been solved in recent years In factin our previous work we have found several topologicalmatches of the Scerevisiae 5S rRNA motif within its largerrRNAs two within 16S rRNA (domains II and III) and threewithin 23S rRNA (domains III IV and VI) (2) From availableX-ray structures for Thermus thermophilus 16S rRNA (PDBaccession no 1fjg) (50) and 23S rRNA from Deinococcus

radiodurans (1nkw) and Haloarcula marismortui (1ffk)(5152) only 5S rRNA motifrsquos matches in domain II of16S rRNA and domain III of 23S rRNA are relevant theother three cases are not valid matches in these species

Figure 6 displays the secondary diagrams graphical repres-entations and tertiary structures of 5S rRNA for bacteriumDradiodurans (1nkw) and archaea Hmarismortui (1ffk)and their topologically matched modules within 16S and23S rRNAs the secondary diagrams are adapted from thosein Gutellrsquos comparative RNA website (httpwwwrnaicmbutexasedu) (53) The module in 23S is an identicalmatch for the 6-vertex 5S graph except for a hairpin loopwhich does not contribute to the topological invariant SThe 16S module is a 5-vertex graph because a 2 bp stem atthe junction has a non-canonical base pair AG (dual graph ruleD1 defines a stem as having at least two canonical andorwobble base pairs) for Scerevisiae this stem consists ofonly canonical base pairs making the 16S module an identicalmatch for the 5S topology As shown in Figure 6 the 16S and23S modules have a similar L-shaped tertiary fold as the 5Sstructure Tertiary structure similarity with 5S is more strikingfor the module of domain II (nt 653ndash753) of Tthermophilus16S rRNA than for the module of domain III (nt 1033ndash1143) of23S rRNA Both the 5S structure and 16S module have apronounced 3-stem junction where the chain ends of the16S module are located Structural (backbone) overlap ofthe 5S structure and 16S module shows similar structuralfeatures except for the orientations of the smaller stems atthe junction (We aligned the structures manually because thechain ends of 5S and 16S module occur at non-correspondinglocations) In contrast the 23S module shows significant dis-tortions on the shorter helical arm with the hairpin loop foldedback to the helix as shown in the structural overlap of the 5Sstructure and 23S module Structural similarities seen in theseexamples are probably due to geometric constraints which areapparent in their secondary diagrams and graphical representa-tions Thus 2D topological matches can lead to similar 3Dstructures Below we survey the distribution of modules inrRNAs

Distribution of submotifs in 16S and 23S ribosomalRNAs

Characterizing the submotifs within the 16S and 23S rRNAsthe largest known RNA molecules can help uncover themodular construction of RNA structures as analysis ofrRNArsquos tertiary modules has revealed Furthermore our pre-vious analysis of rRNArsquos structural motifs indicated existenceof characteristic features (35) For example the distributionsof pairedunpaired bases in stems bulgesinternal loops hair-pin loops and junctions follow specific functional forms

Table 2 Short sequence regions in PKB205 complementary to tRNA-like molecules

PKB205 regions Complementary sequences

30-GGUACGAUU-50 (768ndash776) PKB223 50-UUAGGUUAG-30 (6100ndash6108) IRES30-GUGAGAUU-50 (769ndash776) PKB134 50-CACUGUAAA-30 (113ndash120) tRNA-like30-UACGAUUAC-50 (767ndash775) PKB191 50-AUGCUCAUG-30 (77ndash85) tRNA-like30-UGUGAGAUU-50 (731ndash739) PKB138 50-ACACUUUAA-30 (38ndash47) tRNA-like30-GUUUGCGGACC-50 (746ndash756) PKB17 50-CAAAACCCUGG-30 (19ndash29) tRNA-like

Nucleic Acids Research 2005 Vol 33 No 4 1393

Figure 6 Comparison of the graph topologies and 2D and 3D structures of 5S rRNAs (A and B) and modules in 16S and 23S rRNAs (C and D) (C and D) Twostructural (backbone) overlaps of the 5S structure (1ffk gold color) with 16S module (1fjg blue) and 5S structure with 23S module (1nkw green) are shown Thesecondary diagrams are adapted from those in Gutellrsquos comparative RNA website (httpwwwrnaicmbutexasedu) and 3D structures are from the Protein DataBank (httpwwwrcsborgpdb) The 2D modules were identified computationally by our earlier work on finding120 nt 5S rRNA motifs in 16S and 23S rRNAs ofScerevisiae(2) The equivalent tertiary structures and modules for 5S 16S and 23S rRNAs from other organisms show that similar graph topologies can lead tosimilar secondary and tertiary structures All structures have similar graph topologies The Tthermophilus 16S module is a 5-vertex instead of 6-vertex graph becausea 2 bp stem at the junction has a non-canonical base pair AG for Scerevisiae this stem consists of only canonical base pairs making the 16S module an identicalmatch for the 5S topology

1394 Nucleic Acids Research 2005 Vol 33 No 4

Moreover the percentages of nucleotides in stems bulgesinternal loops hairpin loops and junctions do not vary for16S and 23S rRNAs even though the latter is twice thesize of the former Such motif characteristics help discriminateRNA-like from non-RNA-like molecules and guide the designof novel RNAs Here we analyze the distinct modular unitsmdashsmall tree submotifsmdashthat makeup the rRNAs to advance suchapplications Since all small tree graphs up to 10 vertices havebeen exhaustively enumerated (5455) we can use our sub-structure search algorithm to identify the occurrences of alldistinct tree submotifs in rRNAs

Specifically we use sets of tree graphs for V = 4 5 6 7 8 9and 10 comprising 2 3 6 11 23 47 and 106 graphs respect-ively for a total of 198 tree motifs The complete enumeratedmotifs can be found in the RAG web resource (httpmonodbiomathnyuedurna) Figure 7 shows as an example the11 possible 7-vertex tree graphs and lists the frequency ofoccurrence of each tree graph in rRNAs Remarkably thesame tree motifs are abundant in both 16S and 23S rRNAsThe most abundant (10 or 15 times) 7-vertex tree has two3-stem junctions (topological invariant Stree = 6160) 4- and5-stem junctions are the next most prevalent submotifs(Stree = 5137 5921 and 5667) Interestingly the unbranchedtree (Stree = 7219) and highly branched trees (Stree = 4509)are rare

Figure 8 plots the percentage of distinct tree graphs in eachset V (fraction of motifs identified relative to possible motifs)

found in rRNA For small tree graphs V = 4 5 and 6 allpossible topologies are present but not for V gt 6 In fact thepercentage of trees found declines with V In 23S rRNA90 of possible trees are found for V = 7 and only40 (representing 47 tree topologies) are found forV = 10 Larger trees (V gt 10) are expected to have muchsmaller percentages This percentage decline is expectedsince small tree modules are less specific than large treesIn total 111 submotifs out of 198 are identified The observedtrends in Figure 8 may be generally valid for rRNA structuresof other organisms since they are highly conserved althoughsome statistical fluctuations are expected due to structuraldifferences and RNArsquos finite size

These data show that RNA structures are modular constructsof relatively few (tree) building blocks Extrapolation of thecurves in Figure 8 suggests that 20 of V = 11 motifs (totalof 235 trees) and 10 of V = 12 motifs (total of 551 trees)probably exist in rRNAs Thus the total number of buildingblocks in rRNAs with V lt 13 would be 210 Although thepercentages of trees with higher V values are low they aremore numerous and will contribute to the overall number ofbuilding blocks We thus consider 210 as the minimum num-ber of building blocks The building block number may beexploited in the future design of novel functional RNAmolecules using modular assembly from 210 distinct treemotifs The tree motifs may be taken from substructures ofexisting RNAs allowing simultaneous sequence and structure

treeS =

16S 23SOccurrence

ID

6 4

)

0 1

(7 8)

6191

(7 11)

4509

0 1

(7 9)

5385

3 3

7

(7 10)

5137

5 5

(7 6)

5916

5

(7 5)

5921

(7 1)

0 0

7219

(7 3)

2 4

6446

4 4

(7 7)

5667

10 15

(7 4

6160

0 3

6692

(7 2)

Figure 7 Occurrence frequency of 11 distinct 7-vertex tree topologies as submotifs of Scerevisiae 16S and 23S rRNAs The tree graphs are labeled using topologicalinvariant Stree (second top row) and ID number (i j) (first row) which refers to motif identification number as used in RAG database (httpmonodbiomathnyuedurna) The two numbers at the bottom of each graph refer to the number of times that topology is found embedded in 16S and in 23S rRNAs respectively

Nucleic Acids Research 2005 Vol 33 No 4 1395

design Such a build-up approach appears feasible from our(26) and other works (35) and is akin to the lsquobuilduprsquo tech-nique for proteins pioneered by Vasquez and Scheraga (56)

The above estimate of building block number can be eithera consequence of the limited number of available structures orthe fact that RNA adopts only a subset of possible graphsCertainly future discoveries of novel RNA structures mightindicate a larger repertoire of building blocks Still our sur-veys of natural RNAs (26) suggest that RNAs possess specifictopological characteristics and favor a subset of the possiblegraph motifs The missing motifs may not exist in nature forRNA owing to physical or functional constraints

The building block number can also be defined in terms ofirreducible graphs ie trees that cannot be decomposed intosmaller trees However this approach has a major drawbackSince all RNAs are composed of connected stems their sub-structures can be reduced to two vertex graphs which is not aninformative description of complex RNA folds Still if wedefine 4-vertex trees as the smallest irreducible graphs thenumber of irreducible graphs can be calculated Based on oursubmotif data we estimate that for V up to 10 there are about42 irreducible tree modules in rRNAs

SUMMARY AND CONCLUSION

Our automated comparison of RNA secondary structuresbased on graphical representation allows the examination ofconserved 2D RNA topological features which in turn suggestfunctional relationships This approach has an advantage oversequence alignment because functionally related RNAs oftenlack sequence similarity as shown for frameshifting pseudo-knots with low global sequence identities Of course ourstructure alignment approach relies on available 2D RNA

structures and thus will increase in scope as more RNAstructures become available Our eight identified pairs ofpotentially related pseudoknots (Table 1) reveal three pairsthat are functionally related the pseudoknots of 30 end ofbrome mosaic virus genomic RNA (PKB134) and alternativehammerhead ribozyme pseudoknot (PKB173) are templatesfor replicase to initiate replication of viral genomes the pseu-doknots of internal entry site (IRES) of caspid-producingCrPVs (PKB223) and SSU of 18S rRNA (PKB205) are asso-ciated with translation initiation and the pseudoknots of 18SrRNA (PKB205) and viral tRNA-like (PKB134) most prob-ably interact via both structural mimicry and base comple-mentarity To the best of our knowledge these functionalrelationships have not been described previously

The developed RNA comparison algorithm also led us toquantify the modularity of 16S and 23S rRNAs using smallenumerated tree motifs Our results suggest that rRNAs areconstructed from at least 210 distinct tree subtopologies(or modules) having up to 180 nt Significantly we findthat 5S rRNA and two tree modules within 16S and 23SrRNAs with similar topologies have similar tertiary shapesaffirming the relationship between 2D graphical representa-tion and 3D structure The tree modules in rRNA may be usedas building blocks for the design of novel RNA folds A similaridea has been employed to design and improve functionalRNA molecules (510) Design by assembly of modulesexploits motifs present in natural RNAs unlike de novo designof sequences and structures (11) The issues of discoveringnovel topologies (6) building module library and designingfunctional folds are likely to increase in importance withinterest in novel RNAs

Our algorithm has several limitations Of course its utilitydepends on availability of reliable RNA secondary foldingalgorithms (5758) More fundamentally although simplifiedgraphical representations specify topological characteristicsdetailed sequence and motif information is missing An asso-ciated problem is that unusual RNA structurendashfunction rela-tionships due to deleterious mutations as in frameshiftingpseudoknots (28) and structural flexibility are not capturedMore advanced graph constructs such as weighted graphs mayhowever specify detailed motif features at the secondary andbase pair levels Another limitation of the present algorithmis the requirement to fix the probe structure or graph As inprotein structure alignment algorithms (15) future refinementsof our algorithm should allow some flexibility in motifdefinitions as the similarity search proceeds We inviteRNA scientists to visit our RAG web resource at httpmonodbiomathnyuedurna and suggest to us specific enhancementsthat will be useful in practice

ACKNOWLEDGEMENTS

We thank Dinshaw Patel for comments that motivated thiswork This research was supported by a Joint NSFNIGMSInitiative in Mathematical Biology (DMS-0201160) and theHuman Frontier Science Program (HFSP)

REFERENCES

1 TinocoI and BustamanteC (1999) How RNA folds J Mol Biol293 271ndash281

Figure 8 Distribution of all possible distinct tree modules with V vertices foundin Scerevisiae 16S and 23S rRNAs In graph theory each V is represented by aset of possible tree structures For example for V= 4 5 6 7 8 9 and 10 there are2 3 6 11 23 47 and 106 possible trees respectively The distribution of treesquantifies the diversity of tree modules within rRNAs This analysis suggeststhat rRNAs possess at least 210 tree modules

1396 Nucleic Acids Research 2005 Vol 33 No 4

2 GanHH PasqualiS and SchlickT (2003) Exploring the repertoire ofRNA secondary motifs using graph theory implications for RNA designNucleic Acids Res 31 2926ndash2943

3 SoukupGA and BreakerRR (1999) Engineering precision RNAmolecular switches Proc Natl Acad Sci USA 96 3584ndash3589

4 BreakerRR (2002) Engineered allosteric ribozymes as biosensorcomponents Curr Opin Biotechnol 13 31ndash39

5 IkawaY FukadaK WatanabeS ShiraishiH and InoueT (2002)Design construction and analysis of a novel class of self-foldingRNA Structure 10 527ndash534

6 KimN ShiffeldrimN GanHH and SchlickT (2004) Candidatesfor novel RNA topologies J Mol Biol 341 1129ndash1144

7 MitchellJR ChengJ and CollinsK (1999) A box HACAsmall nucleolar RNA-like domain at the human telomerase RNA 30

end Mol Cell Biol 19 567ndash5768 DoudnaJA and CechTR (2002) The chemical repertoire

of natural ribozymes Nature 418 222ndash2289 StorzG (2002) An expanding universe of noncoding RNAs Science

296 1260ndash126310 BreakerRR and JoyceGF (1994) Inventing and improving ribozyme

function rational design versus iterative selection methods TrendsBiotechnol 12 268ndash275

11 AndronescuM FejesAP HutterF HoosHH and CondonA (2004)A new algorithm for RNA secondary structure design J Mol Biol336 607ndash624

12 CarterRJ DubchakI and HolbrookSR (2001) A computationalapproach to identify genes for functional RNAs in genomic sequencesNucleic Acids Res 29 3928ndash3938

13 DoudnaJA (2000) Structural genomics of RNA Nature Struct Biol7 954ndash956

14 RivasE KleinRJ JonesTA and EddySR (2001) Computationalidentification of noncoding RNAs in Ecoli by comparativegenomics Curr Biol 11 1369ndash1373

15 HolmL and SanderC (1996) Mapping the protein universeScience 273 595ndash602

16 ShindyalovIN and BournePE (2001) A database and tools for 3-Dprotein structure comparison and alignment using the CombinatorialExtension (CE) algorithm Nucleic Acids Res 29 228ndash229

17 BenedettiG and MorosettiS (1996) A graph-topological approach torecognition of pattern and similarity in RNA secondary structuresBiophys Chem 59 179ndash184

18 MargalitH ShapiroBA OppenheimAB and MaizelJVJr (1989)Detection of common motifs in RNA secondary structures Nucleic AcidsRes 17 4829ndash4845

19 ShapiroBA and ZhangKZ (1990) Comparing multiple RNAsecondary structures using tree comparisons Comput Appl Biosci6 309ndash318

20 LeSY NussinovR and MaizelJV (1989) Tree graphs of RNAsecondary structures and their comparisons Comput Biomed Res22 461ndash473

21 DuarteCM WadleyLM and PyleAM (2003) RNA structurecomparison motif search and discovery using a reduced representation ofRNA conformational space Nucleic Acids Res 31 4755ndash4761

22 ChevaletC and MichotB (1992) An algorithm for comparing RNAsecondary structures and searching for similar substructures ComputAppl Biosci 8 215ndash225

23 WangL and ZhaoJ (2003) Parametric alignment of ordered treesBioinformatics 19 2237ndash2245

24 FontanaW KoningsDAM StadlerPF and SchusterP (1993)Statistics of RNA secondary structures Biopolymers 33 1389ndash1404

25 van BatenburgFHD GultyaevAP PleijCWA NgJ andOliehoekJ (2000) PseudoBase a database with RNA pseudoknotsNucleic Acids Res 28 201ndash204

26 GanHH FeraD ZornJ ShiffeldrimN TangM LasersonUKimN and SchlickT (2004) RAG RNA-As-Graphsdatabasemdashconcepts analysis and features Bioinformatics20 1285ndash1291

27 FeraD KimN ShiffeldrimN ZornJ LasersonU GanHH andSchlickT (2004) RAG RNA-As-Graphs web resource BMCBioinformatics 5 88

28 ChenXY KangHS ShenLX ChamorroM VarmusHE andTinocoI (1996) A characteristic bent conformation of RNA pseudoknotspromotes 1 frameshifting during translation of retroviral RNAJ Mol Biol 260 479ndash483

29 KangHS HinesJV and TinocoI (1996) Conformation of anon-frameshifting RNA pseudoknot from mouse mammary tumor virusJ Mol Biol 259 135ndash147

30 GanHH PerlowRA RoyS KoJ WuM HuangJ YanSNicolettaA VafaiJ SunD et al (2002) Analysis of protein sequencestructure similarity relationships Biophys J 83 2781ndash2791

31 FeldenB FlorentzC GiegeR and WesthofE (1994) Solutionstructure of the 30-end of brome mosaic-virus genomic RNAsmdashconformational mimicry with canonical transfer-RNAs J Mol Biol235 508ndash531

32 GrossJL and YellenJ (1999) Graph Theory and its Applications CRCPress Boca Raton FL

33 KoblerJ SchoningU and ToraanJ (1993) The Graph IsomorphismProblem Birkhauser Boston MA

34 RandicM and ZupanJ (2001) On interpretation of well-knowntopological indices J Chem Inform Comput Sci 41 550ndash560

35 ZornJ GanHH ShiffeldrimN and SchlickT (2004) Structuralmotifs in ribosomal RNAs implications for RNA design and genomicsBiopolymers 73 340ndash347

36 JacksT MadhaniHD MasiarzFR and VarmusHE (1988) Signalsfor ribosomal frameshifting in the Rous-Sarcoma virus GagndashPolregion Cell 55 447ndash458

37 GodenyEK ChenL KumarSN MethvenSL KooninEV andBrintonMA (1993) Complete genomic sequence and phylogeneticanalysis of the lactate dehydrogenase-elevating virus (Ldv) Virology194 585ndash596

38 MarczinkeB FisherR VidakovicM BloysAJ and BrierleyI (1998)Secondary structure and mutational analysis of the ribosomalframeshift signal of Rous sarcoma virus J Mol Biol 284 205ndash225

39 MyersEW and MillerW (1988) Optimal alignments in linear spaceComput Appl Biosci 4 11ndash17

40 HuangXQ and MillerW (1991) A time-efficient linear-space localsimilarity algorithm Adv Appl Math 12 337ndash357

41 SmithTF and WatermanMS (1981) Identification of commonmolecular subsequences J Mol Biol 147 195ndash197

42 SongSI SilverSL AulikMA RasochovaL MohanBR andMillerWA (1999) Satellite cereal yellow dwarf virus-RPV (satRPV)RNA requires a DouXble hammerhead for self-cleavage and analternative structure for replication J Mol Biol 293 781ndash793

43 DreherTW and HallTC (1988) Mutational analysis of thetransfer-RNA mimicry of brome mosaic-virus RNAmdashsequence andstructural requirements for aminoacylation and 30-adenylationJ Mol Biol 201 41ndash55

44 KanamoriY and NakashimaN (2001) A tertiary structure model of theinternal ribosome entry site (IRES) for methionine-independent initiationof translation RNA 7 266ndash274

45 WuytsJ De RijkP Van de PeerY PisonG RousseeuwP andDe WachterR (2000)Comparativeanalysisof more than 3000 sequencesreveals the existence of two pseudoknots in area V4 of eukaryotic smallsubunit ribosomal RNA Nucleic Acids Res 28 4698ndash4708

46 Van de PeerY De RijkP WuytsJ WinkelmansT and De WachterR(2000) The European small subunit ribosomal RNA databaseNucleic Acids Res 28 175ndash176

47 ChappellSA EdelmanGM and MauroVP (2000) A 9-nt segment ofa cellular mRNA can function as an internal ribosome entry site (IRES)and when present in linked multiple copies greatly enhances IRESactivity Proc Natl Acad Sci USA 97 1536ndash1541

48 ScheperGC VoormaHO and ThomasAA (1994) Basepairing with18S ribosomal RNA in internal initiation of translation FEBS Lett352 271ndash275

49 BarendsS Rudinger-ThirionJ FlorentzC GiegeR PleijCW andKraalB (2004) tRNA-like structure regulates translation of Bromemosaic virus RNA J Virol 78 4003ndash4010

50 CarterAP ClemonsWM BrodersenDE Morgan-WarrenRJWimberlyBT and RamakrishnanV (2000) Functional insights fromthe structure of the 30S ribosomal subunit and its interactions withantibiotics Nature 407 340ndash348

51 HarmsJ SchluenzenF ZarivachR BashanA GatS AgmonIBartelsH FranceschiF and YonathA (2001) High resolution structureof the large ribosomal subunit from a mesophilic Eubacterium Cell 107679ndash688

52 BanN NissenP HansenJ MoorePB and SteitzTA (2000) Thecomplete atomic structure of the large ribosomal subunit at 24 angstromresolution Science 289 905ndash920

Nucleic Acids Research 2005 Vol 33 No 4 1397

53 CannoneJJ SubramanianS SchnareMN CollettJRDrsquoSouzaLM DuYS FengB LinN MadabusiLV MullerKMet al (2002) The Comparative RNA Web (CRW) Site an online databaseof comparative sequence and structure information for ribosomal intronand other RNAs BMC Bioinformatics 3 2

54 HararyF and PrinsG (1959) The number of homeomorphicallyirreducible trees and other species Acta Math 101141ndash162

55 HararyF (1969) Graph Theory Addison-Wesley Reading Mass

56 VasquezM and ScheragaHA (1985) Use of buildup andenergy-minimization procedures to compute low-energy structures ofthe backbone of enkephalin Biopolymers 24 1437ndash1447

57 ZukerM MathewsDH and TurnerDH (1999) Algorithms andthermodynamics for RNA secondary structure prediction A practicalguide In BarciszewskiJ and ClarkBFC (eds) RNA Biochemistry andBiotechnology Kluwer Academic Publishers Dordrecht pp 11ndash43

58 HofackerIL (2003) Vienna RNA secondary structure serverNucleic Acids Res 31 3429ndash3431

1398 Nucleic Acids Research 2005 Vol 33 No 4

Page 6: doi:10.1093/nar/gki267 Modular RNA architecture revealed ... · Modular RNA architecture revealed by computational analysis of existing pseudoknots and ... Samuela Pasquali, ... RNAs

selected from the PseudoBase (25) (httpwwwbioleidenunivnl~BatenburgPKBhtml) The RNA pseudoknots are viralframeshifting (PKB174 PKB217 PKB233) viral tRNA-like(PKB134 PKB135 PKB143 PKB191) IRES RNA (PKB223)18S rRNA (PKB205) HIV-1 50-untranslated region (50-UTR)(PKB239) alternative pseudoknot of hammerhead ribozyme(PKB173) aptamer that binds human nerve growth factor(PKB131) tmRNA (PKB210) and self-cleaving Neurosporavs ribozyme (PKB178) (PKB233 and PKB217 appear thesame but have different sequences and this difference affectssubsequent sequence alignment analysis) These pseudoknotclasses are not known to be functionally related We choosemore than one member for some classes because of exhibitedtopological differences

The above pseudoknot structures were derived frommutational analysis (PKB173 PKB174 PKB191 PKB233PKB239) enzymatic and chemical structure probing(PKB131 PKB174 PKB191 PKB239) phylogenetic orsequence comparison analysis (PKB131 PKB135 PKB143PKB205 PKB217 PKB233 PKB234 PKB191 PKB210PKB239) and 3D modeling (PKB134 PKB135 PKB210)The pseudoknots listed in more than one category were probedusing two or more methods

To use these structures for graphical analysis we convertthe base pairing information of each structure provided byPSEUDOBASE to a dual graphical representation using ourdual graph rules [D1ndashD3 (2)] We then use pseudoknot graphsto specify their corresponding adjacency matrices Thesetwo steps are performed manually For RNA tree structureswe use our RNA Matrix program to automatically convert asecondary structure into its corresponding adjacency matrix(2635) The search for a small motif within a larger motif isperformed using input adjacency matrices and the comparisonof matrices is performed via the topological invariant S(Equation 2)

Sequence and structural relationships of functionallyrelated pseudoknots

We illustrate using several pseudoknot pairs the functionalsimilarity identified by topological rather than sequence sim-ilarity Figure 4 compares viral frameshifting pseudoknotstructures Viral RNA frameshifting is a process by whichexpression of overlapping open reading frames (ORFs) canbe achieved (36) We compare three pairs from subfamiliesGag-pro ORF1aORF1b and ORF2ORF3 ribosomalframeshifting Frameshifting signals involve two essentialcomponents a lsquoslipperyrsquo (or sliding) sequence of nucleotideswhere frameshifting takes place and a stimulatory RNAsecondary structure (usually a pseudoknot) located a fewnucleotides downstream Although members of the sameframeshifting subfamilies have significant sequence similarityglobal sequence identity between subfamilies is as low as28 similar to random sequences Because all frameshiftingpseudoknots in Figure 4A have the same topology and func-tion finding topological similarity between RNA structureshas an advantage for functional annotation over sequencecomparison

Figure 4B shows that this finding also holds for a frameshift-ing pseudoknot pair (PKB217 and PKB174 Pair 4 in Table 1)The 72 nt PKB217 is the 1 frameshifting pseudoknot of

lactate dehydrogenase-elevating virus (LDV) (37) Togetherwith its sliding sequence UUUAAAC it regulates the expres-sion of ORF1a for a polyprotein containing proteases andORF1b for an RNA-dependent RNA polymerase in LDVThe 127 nt PKB174 is the 1 frameshifting pseudoknot ofRous sarcoma virus (RSV) (3638) This pseudoknot and slid-ing sequence AAAUUUA regulate the expression of Gag-Pro-Pol and Gag-Pol polyproteins Although both the LDV andRSV frameshifting pseudoknots have a simple pseudoknotstructure the latter has two extra stemndashloops Despite theirfunctional similarity the global sequence identity is only 37[computed using the program ALIGN (39) available at httpworkbenchsdscedu] comparable with random sequencestheir sliding sequences are also dissimilar Using our localsequence alignment parameters we find a large significantmatched region in PKB217 (6833ndash6877) as shown inFigure 4B AAACUGCUAGCCACCUCUGGUCUCGACC-GCUGUACUAGAGGUGG

Although functional similarity between frameshiftingpseudoknots is expected this case illuminates how examiningtopological similarity aids in identifying pseudoknots withsimilar functions despite low overall sequence similarity

Comparison of pseudoknot structures

Our two probe motifsmdasha simple pseudoknot (PKB217 orPKB233) and a double pseudoknot (PKB178)mdashare used tosearch larger RNAs We generate two groups of topologicalmatches as shown in Figure 3 with the matched motifs showndarkened we found 11 such matched pseudoknot pairs Inaddition we analyze sequence similarity of many of the84 (14 middot 132) possible pseudoknot pairs since topologicalsimilarities are apparent within each pseudoknot group gen-erated by the probe motifs Although not exhaustive thisadditional analysis may help to identify pairs having similarsubstructures that are not detected by using fixed probes(simple and double pseudoknots) Table 1 lists 10 pseudoknotpairs (out of the 11 probestructure matches and subset of the84 structurendashstructure analysis) that have significant structuraland sequence similarities Of the total of 10 pairs listed(Pairs 1ndash10) two known functional pairs (1 and 4 in Table 1)are included for comparison with the eight other pairs that arenot previously known to be functionally related

We begin by analyzing the overall patterns of relationshipamong our 14 pseudoknots Figure 3 organizes the matchesinto two groups five matches for the simple pseudoknot(PKB217) and six for the double pseudoknot (PKB178)The matched motifs appear to be lsquomodular constructionsrsquoaround the probe motifmodule with added stemndashloops(three or fewer) To the best of our knowledge these resultsreveal modular features and topological similarities of RNApseudoknots that have not been described

The biological significance of topological matches dependson the size and complexity of pseudoknots a match for thedouble pseudoknot is probably more significant than for asimple pseudoknot The six pseudoknots generated by thedouble pseudoknot probe PKB178 show striking topologicalas well as functional similarity (eg viral tRNA-like PKB134and PKB135) we analyze their similarity with PKB178 andamong themselves Still the matched pairs of RNAs are func-tionally diverse besides PKB134 and PKB135 such examples

Nucleic Acids Research 2005 Vol 33 No 4 1389

of diverse RNAs include IRES (PKB223) RNA 18S rRNA(PKB205 PKB234) HIV-1 50-UTR (PKB239) and alternativepseudoknot of hammerhead ribozyme (PKB173) In this dou-ble pseudoknot group six pairs (1 3 5 6 8 and 9 in Table 1)produce local sequence alignments [using LALIGN (40)

httpworkbenchsdscedu] that suggest significant resultsIn particular the pair PKB134PKB173 (Pair 5 in Table 1)indicates a significant match The pairs PKB134PKB205(Pair 6) and PKB205PKB223 (Pair 10 in Table 1) alsodisplay intriguing sequence and functional similarities

A A A A A A C G G G A A G C A A G G G C U

GGGA

CA A

CCCU

A

A

A

U

GA U

C C C G G

G

A AACC

A

U

PKB 3

Gag-polglobal sequence identity 57

PKB 128U U U A A A C U G U U G A G A G G U G C C U G

U C U C U A C G G A C

GCCUG

GA G

CGGAC

UU U

G

U AUA

G

A

C

PKB 217U U U A A A C U G C U A G C C A C C U C U G G U

G G U G G A G A U C A

GACCGC

CU C C

UGGCG

CU

U

G

G

GCA

U

G GY

AG

CG

ORF 1a - ORF 1bglobal sequence identity 53

G G G A A A C G G G C A G G C G G

GCG

C GCGC

A

G

A

C C G C C

AC

AA

C

PKB 44

G G G A A A U G G A C U G A G G C G

GGCG

C CGGC

A

C

A

C G C C

A

CA

A

C

PKB 240

A

A

ORF2 - ORF3global sequence identity 79

global sequence identity as low as 28

A

U CC

G

C

AA A

C

U

C

GG

C

U

G

UU

AA

PKB 217GCUGACGGUGUYUGGCG

UA

CUC

UG

UG A C C G C

C U G G C G

CC

CCUCUGGU

GUGGAGA

CA

A

U

GU U U A A A C U G C U A G C 5

3

PKB 174

G A AG

UG

AA

AC

U

U

A

U

UG

C A

UUG

UC

UC

U

C

A

U A

U

AGGG

UCCCU

CGGGCCACUG

CCCGGUGAC

G

CC

GU

GU

G

A

CA

C

G C G C U A C A

C G C G A U G U

A G A C C G

U C U G G C

A

2603 3

52476 A A A U U U A U

6830

BPair 4

global sequence identity 37

G G G A A A C U G G A A G G C G G G G C

GCUGCA

GA C

GACUG

G

A

A

C C C C G

PKB 4

C

AA

AUA

UG

Figure 4 Sequence and structural comparisons of pseudoknots in the frameshifting functional families (A) Two secondary structures and corresponding topologiesfor each of the three viral frameshifting subfamilies Gag-pro ORF1aORF1b and ORF2ORF3 Sequence identity within the same subfamily ranges between 50 and80 it is as low as 28 when comparing pseudoknots from different subfamilies All frameshifting pseudoknots are functionally related and have the same topology(dual graph) (B) Secondary structures and their aligned nucleotides (color coded) for frameshifting pseudoknots of LDV (PKB217) and Rous sarcoma virus(PKB174) Global sequence alignments were performed using the program ALIGN with default (164) gap openinggap extension penalties

1390 Nucleic Acids Research 2005 Vol 33 No 4

These relationships suggest structural and functional similar-ities that are not previously known (Table 1)

To support the significance of these matches we performsequence alignments using SmithndashWaterman parameters forgap opening and gap extension penalties (41) We choose thegap opening penalty value of 15 from a range of 0 to 19 Toproduce better alignments for large sequence segments (shortalignments are less likely to be significant in our context) ourgap extension penalty of 1 from a range of 1 to 8 allowslarger extensions consistent with the possibility of the inser-tion of long sequences and added 2D motifs In Table 1 wealso report the alignment score which is the sum of nucleotidematches mismatches and gap penalties

Table 1 shows that the probe structure (PKB178) matchestwo other pseudoknots (PKB135 PKB173) with sequenceidentities ranging from 59 to 71 for 24ndash49 nt alignmentlengths (compared with 56 sequence identity over 129 ntfor the functionally related pair PKB134PKB135 Pair 1)Significantly the pairs PKB134PKB173 (Pair 5) PKB134PKB205 (Pair 6) and PKB205PKB223 (Pair 10) havesequence identity range of 72ndash80 The non-randomness ofthe 10 aligned pairs in Table 1 is also evident in the ratio Rwhich measures the level of significance above the randomexpectation and has value 1 for random pairs

R = (percent sequence identity)(percent sequence identitywhen one of the aligned sequences is randomized)

We also define ltRgt as the R-value averaged over align-ments with different randomized sequences Table 1 showsthat some pseudoknot pairs with no known functional rela-tionship have ltRgt values ranging between 106 and 147whereas the related pair PKB134PKB135 has a value of154 many matched pairs have ltRgt values lt1 Thus thetopological similarities we identify led to matched pairswith non-random sequence alignment results Non-randomsequence matches may arise from topological andor func-tional similarities as illustrated below for three pseudoknotpairs (5 6 and 10 in Table 1)

The rate of false positives generated by topological and ltRgtvalue screening can be easily estimated by considering allpossible pairs from the set of six matched double pseudo-knots in Figure 3 These pseudoknots form 15 distinctpairs 6 of which have ltRgt values gt114 [Table 1 withS(G) = 117] Table 1 shows that only four pairs have largeltRgt values (gt12) Pair 1 (PKB134PKB135) is related Pair 5(PKB134PKB173) and Pair 6 (PKB134PKB205) are mostprobably related and Pair 3 (PKB178PKB173) is of unknownrelationship (in fact both are self-cleaving ribozymes)Assuming that the Pair 3 is unrelated and therefore reflectsa false positive we have an error rate of 115 or 7 If theinferred functionally related Pairs 5 and 6 are also consideredas false positives we then have a conservative error rateof 20

Functional relationship between PKB134 and PKB173(Pair 5 in Table 1)

The 116 nt PKB134 pseudoknot occurs in the 30 end of bromemosaic virus genomic RNA (31) The 73 nt PKB173 is ahammerhead ribozyme of satellite cereal yellow dwarfvirus-RPV (satRPV) RNA capable of adopting an alternativepseudoknot conformation by forming an (L1ndashL2a) stem whose

existence is established by experimental mutational analysis(42) Figure 3 shows that their topologies have the same lsquocorersquobut PKB134 has two extra stemndashloops The local sequencealignment of this pair using LALIGN (Figure 5) shows 19bases aligned to 80 sequence identity and an ltRgt valueof 123 (Table 1) There are two significant segments inPKB173 that are matched with those in PKB134 UACUGU-CUGACGACGUAUCC (nt 11ndash30) and UAGAAGGCUG-GUGCC (nt 39ndash53) These segments and their matchedregions in PKB134 span the stem and unpaired elements ofthe secondary structures (see Figure 5) Significantly bothPKB134 and PKB173 pseudoknots serve as template for rep-licase a protein that recognizes specific sequence regions toinitiate replication of viral genomes (3142) Moreover a con-served sequence (CUGANGA) of PKB173 is in the alignedregion (nt 11ndash30) Thus our analysis suggests that PKB134and PKB173 pseudoknots are structurally and functionallyrelated

Further analysis of pseudoknots PKB134 and PKB173reveals specific domains interacting with the replicase Experi-ments have shown that mutations in domains A C B1 and B2[defined in (31)] of PKB134 impair viral RNA replication (43)Significantly our two aligned sequence regions span domainsA B1 C and also D Thus the replicase recognizes severaldomains in the PKB134 pseudoknot structure In PKB173 thepseudoknot forming five base pair helix (GCGCG) is implic-ated in replicase recognition or ligation (42) This pseudoknotalso has conserved sequence regions ACAAA (61ndash65) or itscomplement (possibly the replication origin) and AGAAA(nucleotides attached to the 30 end but not shown) Howeverthese regions are not observed in PKB134

Functional relationship between PKB205 and PKB223(Pair 10 in Table 1)

The 63 nt PKB223 is the pseudoknot of IRES of capsid-producing cricket paralysis-like viruses (CrPV) formethionine-independent initiation of translation (44) Appear-ing immediately upstream from the capsid-coding sequenceit initiates translation without a canonical AUG (methionine)initiation codon The 48 nt PKB205 is the pseudoknotsubstructure (V4 domain) of SSU of Palmaria palmata 18SrRNA which was found using extensive comparative analysis(4546) Figure 5 (Pair 10 lower panel) shows that PKB223 isa 4-vertex simple pseudoknot with two inserted stemndashloopsand PKB205 is a 4-vertex double pseudoknot with an insertedstemndashloop Figure 5 (middle panel) displays the local sequencealignment of PKB223 and PKB205 yielding a match of 14 ntwith 72 sequence identity This significant sequence simil-arity occurs in regions ACAAUAUCCAGGAA (6148ndash6161)and ACAUUAGCAUGGAA (765ndash778) of PKB223 andPKB205 respectively Remarkably we find that the 9 ntsegment 50-UUAGGUUAG-30 (nt 6100ndash6108) of PKB223is complementary to 30-GGUACGAUU-50 (nt 768ndash776) ofPKB205 except at G6103 Our finding corroborates withthe experimental identification that IRES elements in cellularmRNA contain a short 9 nt (nt 133ndash141 CCGGCGGGU) inthe 50-UTR of the homeodomain protein Gtx which is com-plementary to nt 1132ndash1124 of 18S rRNA (47) Indeed manysuch near complementary sequences have been found bet-ween the 30 end of 18S rRNA and picornavirus RNAs (48)

Nucleic Acids Research 2005 Vol 33 No 4 1391

Fig

ure

5

Th

ree

no

vel

fun

ctio

nal

lyre

late

dp

seu

do

kn

otp

airs

infe

rred

by

top

olo

gy

mat

chin

gan

dse

qu

ence

alig

nm

ent

Up

per

pan

els

Sec

ond

ary

stru

ctu

res

and

thei

ral

ign

edn

ucl

eoti

des

(co

lor

cod

ed)

for

pse

udo

kn

ot

pai

rsP

KB

13

4P

KB

17

3(P

air

5T

able

1

left

pan

el)

PK

B2

05

PK

B2

23

(Pai

r1

0T

able

1

mid

dle

pan

el)

and

PK

B1

34

PK

B20

5(P

air

6

rig

ht

pan

el)

PK

B1

34

occ

urs

in30 e

nd

of

bro

me

mosa

icv

iru

sg

eno

mic

RN

A

PK

B1

73

isa

pse

udo

kn

oto

fsa

tell

ite

cere

aly

ello

wd

war

ftv

iru

s-R

PV

RN

AP

KB

22

3is

the

pse

ud

ok

no

tofin

tern

alen

try

site

(IR

ES

)o

fca

spid

-pro

du

cin

gC

rPV

san

dP

KB

20

5is

the

pse

udo

kn

ots

ub

stru

ctu

re(V

4d

om

ain

)o

fS

SU

of

PP

alm

ata

18

SrR

NA

L

ow

erp

anel

sT

he

alig

ned

sub

gra

ph

so

rre

gio

ns

are

shad

edo

rm

ark

ed

1392 Nucleic Acids Research 2005 Vol 33 No 4

Intriguingly for CrPVs the complementary segments occur inpseudoknots with similar folds not involving the dangling30 end of 18S rRNA Base pair complementary between IRESelements of mRNA and 18S rRNA is a mechanism for ribo-some recruitment and translation In viral IRES elements thesecondary and tertiary structures are believed to play criticalroles in translation initiation (44)

Functional relationship between PKB205 and PKB134(Pair 6 in Table 1)

The functional relationsip of the pair PKB205PKB134 issimilar to that for PKB205PKB223 above PKB134 belongsto a common viral tRNA-like functional class whose role is torecruit host ribosomes to initiate viral protein synthesis (49)It is known to interact with the ribosome via tRNA shapemimicry (31) Our analysis indicates that tRNA-like mole-cules also contain a region which is nearly complementary toPKB205 a pseudoknot region of the 18S rRNA Table 2shows regions of PKB205 complementary to four tRNA-like molecules (PKB134 191 138 17) the complementaryregion in PKB223 is shown for comparison Significantlyeach of the pseudoknots PKB223 PKB134 and PKB191has a region complementary to the same 8ndash9 nt unpairedregion in PKB205 (767ndash776) This single-stranded regionis most probably available for base pairing with incomingtRNA-like molecules In addition two other regions inPKB205 are complementary to tRNA-like pseudoknotsPKB138 and PKB17 one region is a hairpin loop and anotherregion is part of short stems Thus this analysis suggests thattRNA-like molecules may employ both structural mimicry andsequence complementarity to recruit SSU rRNA an interac-tion mode similar to tRNA PKB134 also has a region UACA-GUGUUGAAAAACA (nt 2078ndash2094) closely matched tothe region UAGAGUGUUCAAAGCCA (nt 732ndash748) inPKB205 with an average R-value of 123

Tertiary structure similarity of topological matches inribosomal RNAs

The preceding examination of three pseudoknot pairs showsthat topological matches can suggest functionally relatedpseudoknots It is also important to show that novel topo-logical matches led to similar tertiary structures We focusfor this purpose on rRNA since several bacterial and archaealribosome structures have been solved in recent years In factin our previous work we have found several topologicalmatches of the Scerevisiae 5S rRNA motif within its largerrRNAs two within 16S rRNA (domains II and III) and threewithin 23S rRNA (domains III IV and VI) (2) From availableX-ray structures for Thermus thermophilus 16S rRNA (PDBaccession no 1fjg) (50) and 23S rRNA from Deinococcus

radiodurans (1nkw) and Haloarcula marismortui (1ffk)(5152) only 5S rRNA motifrsquos matches in domain II of16S rRNA and domain III of 23S rRNA are relevant theother three cases are not valid matches in these species

Figure 6 displays the secondary diagrams graphical repres-entations and tertiary structures of 5S rRNA for bacteriumDradiodurans (1nkw) and archaea Hmarismortui (1ffk)and their topologically matched modules within 16S and23S rRNAs the secondary diagrams are adapted from thosein Gutellrsquos comparative RNA website (httpwwwrnaicmbutexasedu) (53) The module in 23S is an identicalmatch for the 6-vertex 5S graph except for a hairpin loopwhich does not contribute to the topological invariant SThe 16S module is a 5-vertex graph because a 2 bp stem atthe junction has a non-canonical base pair AG (dual graph ruleD1 defines a stem as having at least two canonical andorwobble base pairs) for Scerevisiae this stem consists ofonly canonical base pairs making the 16S module an identicalmatch for the 5S topology As shown in Figure 6 the 16S and23S modules have a similar L-shaped tertiary fold as the 5Sstructure Tertiary structure similarity with 5S is more strikingfor the module of domain II (nt 653ndash753) of Tthermophilus16S rRNA than for the module of domain III (nt 1033ndash1143) of23S rRNA Both the 5S structure and 16S module have apronounced 3-stem junction where the chain ends of the16S module are located Structural (backbone) overlap ofthe 5S structure and 16S module shows similar structuralfeatures except for the orientations of the smaller stems atthe junction (We aligned the structures manually because thechain ends of 5S and 16S module occur at non-correspondinglocations) In contrast the 23S module shows significant dis-tortions on the shorter helical arm with the hairpin loop foldedback to the helix as shown in the structural overlap of the 5Sstructure and 23S module Structural similarities seen in theseexamples are probably due to geometric constraints which areapparent in their secondary diagrams and graphical representa-tions Thus 2D topological matches can lead to similar 3Dstructures Below we survey the distribution of modules inrRNAs

Distribution of submotifs in 16S and 23S ribosomalRNAs

Characterizing the submotifs within the 16S and 23S rRNAsthe largest known RNA molecules can help uncover themodular construction of RNA structures as analysis ofrRNArsquos tertiary modules has revealed Furthermore our pre-vious analysis of rRNArsquos structural motifs indicated existenceof characteristic features (35) For example the distributionsof pairedunpaired bases in stems bulgesinternal loops hair-pin loops and junctions follow specific functional forms

Table 2 Short sequence regions in PKB205 complementary to tRNA-like molecules

PKB205 regions Complementary sequences

30-GGUACGAUU-50 (768ndash776) PKB223 50-UUAGGUUAG-30 (6100ndash6108) IRES30-GUGAGAUU-50 (769ndash776) PKB134 50-CACUGUAAA-30 (113ndash120) tRNA-like30-UACGAUUAC-50 (767ndash775) PKB191 50-AUGCUCAUG-30 (77ndash85) tRNA-like30-UGUGAGAUU-50 (731ndash739) PKB138 50-ACACUUUAA-30 (38ndash47) tRNA-like30-GUUUGCGGACC-50 (746ndash756) PKB17 50-CAAAACCCUGG-30 (19ndash29) tRNA-like

Nucleic Acids Research 2005 Vol 33 No 4 1393

Figure 6 Comparison of the graph topologies and 2D and 3D structures of 5S rRNAs (A and B) and modules in 16S and 23S rRNAs (C and D) (C and D) Twostructural (backbone) overlaps of the 5S structure (1ffk gold color) with 16S module (1fjg blue) and 5S structure with 23S module (1nkw green) are shown Thesecondary diagrams are adapted from those in Gutellrsquos comparative RNA website (httpwwwrnaicmbutexasedu) and 3D structures are from the Protein DataBank (httpwwwrcsborgpdb) The 2D modules were identified computationally by our earlier work on finding120 nt 5S rRNA motifs in 16S and 23S rRNAs ofScerevisiae(2) The equivalent tertiary structures and modules for 5S 16S and 23S rRNAs from other organisms show that similar graph topologies can lead tosimilar secondary and tertiary structures All structures have similar graph topologies The Tthermophilus 16S module is a 5-vertex instead of 6-vertex graph becausea 2 bp stem at the junction has a non-canonical base pair AG for Scerevisiae this stem consists of only canonical base pairs making the 16S module an identicalmatch for the 5S topology

1394 Nucleic Acids Research 2005 Vol 33 No 4

Moreover the percentages of nucleotides in stems bulgesinternal loops hairpin loops and junctions do not vary for16S and 23S rRNAs even though the latter is twice thesize of the former Such motif characteristics help discriminateRNA-like from non-RNA-like molecules and guide the designof novel RNAs Here we analyze the distinct modular unitsmdashsmall tree submotifsmdashthat makeup the rRNAs to advance suchapplications Since all small tree graphs up to 10 vertices havebeen exhaustively enumerated (5455) we can use our sub-structure search algorithm to identify the occurrences of alldistinct tree submotifs in rRNAs

Specifically we use sets of tree graphs for V = 4 5 6 7 8 9and 10 comprising 2 3 6 11 23 47 and 106 graphs respect-ively for a total of 198 tree motifs The complete enumeratedmotifs can be found in the RAG web resource (httpmonodbiomathnyuedurna) Figure 7 shows as an example the11 possible 7-vertex tree graphs and lists the frequency ofoccurrence of each tree graph in rRNAs Remarkably thesame tree motifs are abundant in both 16S and 23S rRNAsThe most abundant (10 or 15 times) 7-vertex tree has two3-stem junctions (topological invariant Stree = 6160) 4- and5-stem junctions are the next most prevalent submotifs(Stree = 5137 5921 and 5667) Interestingly the unbranchedtree (Stree = 7219) and highly branched trees (Stree = 4509)are rare

Figure 8 plots the percentage of distinct tree graphs in eachset V (fraction of motifs identified relative to possible motifs)

found in rRNA For small tree graphs V = 4 5 and 6 allpossible topologies are present but not for V gt 6 In fact thepercentage of trees found declines with V In 23S rRNA90 of possible trees are found for V = 7 and only40 (representing 47 tree topologies) are found forV = 10 Larger trees (V gt 10) are expected to have muchsmaller percentages This percentage decline is expectedsince small tree modules are less specific than large treesIn total 111 submotifs out of 198 are identified The observedtrends in Figure 8 may be generally valid for rRNA structuresof other organisms since they are highly conserved althoughsome statistical fluctuations are expected due to structuraldifferences and RNArsquos finite size

These data show that RNA structures are modular constructsof relatively few (tree) building blocks Extrapolation of thecurves in Figure 8 suggests that 20 of V = 11 motifs (totalof 235 trees) and 10 of V = 12 motifs (total of 551 trees)probably exist in rRNAs Thus the total number of buildingblocks in rRNAs with V lt 13 would be 210 Although thepercentages of trees with higher V values are low they aremore numerous and will contribute to the overall number ofbuilding blocks We thus consider 210 as the minimum num-ber of building blocks The building block number may beexploited in the future design of novel functional RNAmolecules using modular assembly from 210 distinct treemotifs The tree motifs may be taken from substructures ofexisting RNAs allowing simultaneous sequence and structure

treeS =

16S 23SOccurrence

ID

6 4

)

0 1

(7 8)

6191

(7 11)

4509

0 1

(7 9)

5385

3 3

7

(7 10)

5137

5 5

(7 6)

5916

5

(7 5)

5921

(7 1)

0 0

7219

(7 3)

2 4

6446

4 4

(7 7)

5667

10 15

(7 4

6160

0 3

6692

(7 2)

Figure 7 Occurrence frequency of 11 distinct 7-vertex tree topologies as submotifs of Scerevisiae 16S and 23S rRNAs The tree graphs are labeled using topologicalinvariant Stree (second top row) and ID number (i j) (first row) which refers to motif identification number as used in RAG database (httpmonodbiomathnyuedurna) The two numbers at the bottom of each graph refer to the number of times that topology is found embedded in 16S and in 23S rRNAs respectively

Nucleic Acids Research 2005 Vol 33 No 4 1395

design Such a build-up approach appears feasible from our(26) and other works (35) and is akin to the lsquobuilduprsquo tech-nique for proteins pioneered by Vasquez and Scheraga (56)

The above estimate of building block number can be eithera consequence of the limited number of available structures orthe fact that RNA adopts only a subset of possible graphsCertainly future discoveries of novel RNA structures mightindicate a larger repertoire of building blocks Still our sur-veys of natural RNAs (26) suggest that RNAs possess specifictopological characteristics and favor a subset of the possiblegraph motifs The missing motifs may not exist in nature forRNA owing to physical or functional constraints

The building block number can also be defined in terms ofirreducible graphs ie trees that cannot be decomposed intosmaller trees However this approach has a major drawbackSince all RNAs are composed of connected stems their sub-structures can be reduced to two vertex graphs which is not aninformative description of complex RNA folds Still if wedefine 4-vertex trees as the smallest irreducible graphs thenumber of irreducible graphs can be calculated Based on oursubmotif data we estimate that for V up to 10 there are about42 irreducible tree modules in rRNAs

SUMMARY AND CONCLUSION

Our automated comparison of RNA secondary structuresbased on graphical representation allows the examination ofconserved 2D RNA topological features which in turn suggestfunctional relationships This approach has an advantage oversequence alignment because functionally related RNAs oftenlack sequence similarity as shown for frameshifting pseudo-knots with low global sequence identities Of course ourstructure alignment approach relies on available 2D RNA

structures and thus will increase in scope as more RNAstructures become available Our eight identified pairs ofpotentially related pseudoknots (Table 1) reveal three pairsthat are functionally related the pseudoknots of 30 end ofbrome mosaic virus genomic RNA (PKB134) and alternativehammerhead ribozyme pseudoknot (PKB173) are templatesfor replicase to initiate replication of viral genomes the pseu-doknots of internal entry site (IRES) of caspid-producingCrPVs (PKB223) and SSU of 18S rRNA (PKB205) are asso-ciated with translation initiation and the pseudoknots of 18SrRNA (PKB205) and viral tRNA-like (PKB134) most prob-ably interact via both structural mimicry and base comple-mentarity To the best of our knowledge these functionalrelationships have not been described previously

The developed RNA comparison algorithm also led us toquantify the modularity of 16S and 23S rRNAs using smallenumerated tree motifs Our results suggest that rRNAs areconstructed from at least 210 distinct tree subtopologies(or modules) having up to 180 nt Significantly we findthat 5S rRNA and two tree modules within 16S and 23SrRNAs with similar topologies have similar tertiary shapesaffirming the relationship between 2D graphical representa-tion and 3D structure The tree modules in rRNA may be usedas building blocks for the design of novel RNA folds A similaridea has been employed to design and improve functionalRNA molecules (510) Design by assembly of modulesexploits motifs present in natural RNAs unlike de novo designof sequences and structures (11) The issues of discoveringnovel topologies (6) building module library and designingfunctional folds are likely to increase in importance withinterest in novel RNAs

Our algorithm has several limitations Of course its utilitydepends on availability of reliable RNA secondary foldingalgorithms (5758) More fundamentally although simplifiedgraphical representations specify topological characteristicsdetailed sequence and motif information is missing An asso-ciated problem is that unusual RNA structurendashfunction rela-tionships due to deleterious mutations as in frameshiftingpseudoknots (28) and structural flexibility are not capturedMore advanced graph constructs such as weighted graphs mayhowever specify detailed motif features at the secondary andbase pair levels Another limitation of the present algorithmis the requirement to fix the probe structure or graph As inprotein structure alignment algorithms (15) future refinementsof our algorithm should allow some flexibility in motifdefinitions as the similarity search proceeds We inviteRNA scientists to visit our RAG web resource at httpmonodbiomathnyuedurna and suggest to us specific enhancementsthat will be useful in practice

ACKNOWLEDGEMENTS

We thank Dinshaw Patel for comments that motivated thiswork This research was supported by a Joint NSFNIGMSInitiative in Mathematical Biology (DMS-0201160) and theHuman Frontier Science Program (HFSP)

REFERENCES

1 TinocoI and BustamanteC (1999) How RNA folds J Mol Biol293 271ndash281

Figure 8 Distribution of all possible distinct tree modules with V vertices foundin Scerevisiae 16S and 23S rRNAs In graph theory each V is represented by aset of possible tree structures For example for V= 4 5 6 7 8 9 and 10 there are2 3 6 11 23 47 and 106 possible trees respectively The distribution of treesquantifies the diversity of tree modules within rRNAs This analysis suggeststhat rRNAs possess at least 210 tree modules

1396 Nucleic Acids Research 2005 Vol 33 No 4

2 GanHH PasqualiS and SchlickT (2003) Exploring the repertoire ofRNA secondary motifs using graph theory implications for RNA designNucleic Acids Res 31 2926ndash2943

3 SoukupGA and BreakerRR (1999) Engineering precision RNAmolecular switches Proc Natl Acad Sci USA 96 3584ndash3589

4 BreakerRR (2002) Engineered allosteric ribozymes as biosensorcomponents Curr Opin Biotechnol 13 31ndash39

5 IkawaY FukadaK WatanabeS ShiraishiH and InoueT (2002)Design construction and analysis of a novel class of self-foldingRNA Structure 10 527ndash534

6 KimN ShiffeldrimN GanHH and SchlickT (2004) Candidatesfor novel RNA topologies J Mol Biol 341 1129ndash1144

7 MitchellJR ChengJ and CollinsK (1999) A box HACAsmall nucleolar RNA-like domain at the human telomerase RNA 30

end Mol Cell Biol 19 567ndash5768 DoudnaJA and CechTR (2002) The chemical repertoire

of natural ribozymes Nature 418 222ndash2289 StorzG (2002) An expanding universe of noncoding RNAs Science

296 1260ndash126310 BreakerRR and JoyceGF (1994) Inventing and improving ribozyme

function rational design versus iterative selection methods TrendsBiotechnol 12 268ndash275

11 AndronescuM FejesAP HutterF HoosHH and CondonA (2004)A new algorithm for RNA secondary structure design J Mol Biol336 607ndash624

12 CarterRJ DubchakI and HolbrookSR (2001) A computationalapproach to identify genes for functional RNAs in genomic sequencesNucleic Acids Res 29 3928ndash3938

13 DoudnaJA (2000) Structural genomics of RNA Nature Struct Biol7 954ndash956

14 RivasE KleinRJ JonesTA and EddySR (2001) Computationalidentification of noncoding RNAs in Ecoli by comparativegenomics Curr Biol 11 1369ndash1373

15 HolmL and SanderC (1996) Mapping the protein universeScience 273 595ndash602

16 ShindyalovIN and BournePE (2001) A database and tools for 3-Dprotein structure comparison and alignment using the CombinatorialExtension (CE) algorithm Nucleic Acids Res 29 228ndash229

17 BenedettiG and MorosettiS (1996) A graph-topological approach torecognition of pattern and similarity in RNA secondary structuresBiophys Chem 59 179ndash184

18 MargalitH ShapiroBA OppenheimAB and MaizelJVJr (1989)Detection of common motifs in RNA secondary structures Nucleic AcidsRes 17 4829ndash4845

19 ShapiroBA and ZhangKZ (1990) Comparing multiple RNAsecondary structures using tree comparisons Comput Appl Biosci6 309ndash318

20 LeSY NussinovR and MaizelJV (1989) Tree graphs of RNAsecondary structures and their comparisons Comput Biomed Res22 461ndash473

21 DuarteCM WadleyLM and PyleAM (2003) RNA structurecomparison motif search and discovery using a reduced representation ofRNA conformational space Nucleic Acids Res 31 4755ndash4761

22 ChevaletC and MichotB (1992) An algorithm for comparing RNAsecondary structures and searching for similar substructures ComputAppl Biosci 8 215ndash225

23 WangL and ZhaoJ (2003) Parametric alignment of ordered treesBioinformatics 19 2237ndash2245

24 FontanaW KoningsDAM StadlerPF and SchusterP (1993)Statistics of RNA secondary structures Biopolymers 33 1389ndash1404

25 van BatenburgFHD GultyaevAP PleijCWA NgJ andOliehoekJ (2000) PseudoBase a database with RNA pseudoknotsNucleic Acids Res 28 201ndash204

26 GanHH FeraD ZornJ ShiffeldrimN TangM LasersonUKimN and SchlickT (2004) RAG RNA-As-Graphsdatabasemdashconcepts analysis and features Bioinformatics20 1285ndash1291

27 FeraD KimN ShiffeldrimN ZornJ LasersonU GanHH andSchlickT (2004) RAG RNA-As-Graphs web resource BMCBioinformatics 5 88

28 ChenXY KangHS ShenLX ChamorroM VarmusHE andTinocoI (1996) A characteristic bent conformation of RNA pseudoknotspromotes 1 frameshifting during translation of retroviral RNAJ Mol Biol 260 479ndash483

29 KangHS HinesJV and TinocoI (1996) Conformation of anon-frameshifting RNA pseudoknot from mouse mammary tumor virusJ Mol Biol 259 135ndash147

30 GanHH PerlowRA RoyS KoJ WuM HuangJ YanSNicolettaA VafaiJ SunD et al (2002) Analysis of protein sequencestructure similarity relationships Biophys J 83 2781ndash2791

31 FeldenB FlorentzC GiegeR and WesthofE (1994) Solutionstructure of the 30-end of brome mosaic-virus genomic RNAsmdashconformational mimicry with canonical transfer-RNAs J Mol Biol235 508ndash531

32 GrossJL and YellenJ (1999) Graph Theory and its Applications CRCPress Boca Raton FL

33 KoblerJ SchoningU and ToraanJ (1993) The Graph IsomorphismProblem Birkhauser Boston MA

34 RandicM and ZupanJ (2001) On interpretation of well-knowntopological indices J Chem Inform Comput Sci 41 550ndash560

35 ZornJ GanHH ShiffeldrimN and SchlickT (2004) Structuralmotifs in ribosomal RNAs implications for RNA design and genomicsBiopolymers 73 340ndash347

36 JacksT MadhaniHD MasiarzFR and VarmusHE (1988) Signalsfor ribosomal frameshifting in the Rous-Sarcoma virus GagndashPolregion Cell 55 447ndash458

37 GodenyEK ChenL KumarSN MethvenSL KooninEV andBrintonMA (1993) Complete genomic sequence and phylogeneticanalysis of the lactate dehydrogenase-elevating virus (Ldv) Virology194 585ndash596

38 MarczinkeB FisherR VidakovicM BloysAJ and BrierleyI (1998)Secondary structure and mutational analysis of the ribosomalframeshift signal of Rous sarcoma virus J Mol Biol 284 205ndash225

39 MyersEW and MillerW (1988) Optimal alignments in linear spaceComput Appl Biosci 4 11ndash17

40 HuangXQ and MillerW (1991) A time-efficient linear-space localsimilarity algorithm Adv Appl Math 12 337ndash357

41 SmithTF and WatermanMS (1981) Identification of commonmolecular subsequences J Mol Biol 147 195ndash197

42 SongSI SilverSL AulikMA RasochovaL MohanBR andMillerWA (1999) Satellite cereal yellow dwarf virus-RPV (satRPV)RNA requires a DouXble hammerhead for self-cleavage and analternative structure for replication J Mol Biol 293 781ndash793

43 DreherTW and HallTC (1988) Mutational analysis of thetransfer-RNA mimicry of brome mosaic-virus RNAmdashsequence andstructural requirements for aminoacylation and 30-adenylationJ Mol Biol 201 41ndash55

44 KanamoriY and NakashimaN (2001) A tertiary structure model of theinternal ribosome entry site (IRES) for methionine-independent initiationof translation RNA 7 266ndash274

45 WuytsJ De RijkP Van de PeerY PisonG RousseeuwP andDe WachterR (2000)Comparativeanalysisof more than 3000 sequencesreveals the existence of two pseudoknots in area V4 of eukaryotic smallsubunit ribosomal RNA Nucleic Acids Res 28 4698ndash4708

46 Van de PeerY De RijkP WuytsJ WinkelmansT and De WachterR(2000) The European small subunit ribosomal RNA databaseNucleic Acids Res 28 175ndash176

47 ChappellSA EdelmanGM and MauroVP (2000) A 9-nt segment ofa cellular mRNA can function as an internal ribosome entry site (IRES)and when present in linked multiple copies greatly enhances IRESactivity Proc Natl Acad Sci USA 97 1536ndash1541

48 ScheperGC VoormaHO and ThomasAA (1994) Basepairing with18S ribosomal RNA in internal initiation of translation FEBS Lett352 271ndash275

49 BarendsS Rudinger-ThirionJ FlorentzC GiegeR PleijCW andKraalB (2004) tRNA-like structure regulates translation of Bromemosaic virus RNA J Virol 78 4003ndash4010

50 CarterAP ClemonsWM BrodersenDE Morgan-WarrenRJWimberlyBT and RamakrishnanV (2000) Functional insights fromthe structure of the 30S ribosomal subunit and its interactions withantibiotics Nature 407 340ndash348

51 HarmsJ SchluenzenF ZarivachR BashanA GatS AgmonIBartelsH FranceschiF and YonathA (2001) High resolution structureof the large ribosomal subunit from a mesophilic Eubacterium Cell 107679ndash688

52 BanN NissenP HansenJ MoorePB and SteitzTA (2000) Thecomplete atomic structure of the large ribosomal subunit at 24 angstromresolution Science 289 905ndash920

Nucleic Acids Research 2005 Vol 33 No 4 1397

53 CannoneJJ SubramanianS SchnareMN CollettJRDrsquoSouzaLM DuYS FengB LinN MadabusiLV MullerKMet al (2002) The Comparative RNA Web (CRW) Site an online databaseof comparative sequence and structure information for ribosomal intronand other RNAs BMC Bioinformatics 3 2

54 HararyF and PrinsG (1959) The number of homeomorphicallyirreducible trees and other species Acta Math 101141ndash162

55 HararyF (1969) Graph Theory Addison-Wesley Reading Mass

56 VasquezM and ScheragaHA (1985) Use of buildup andenergy-minimization procedures to compute low-energy structures ofthe backbone of enkephalin Biopolymers 24 1437ndash1447

57 ZukerM MathewsDH and TurnerDH (1999) Algorithms andthermodynamics for RNA secondary structure prediction A practicalguide In BarciszewskiJ and ClarkBFC (eds) RNA Biochemistry andBiotechnology Kluwer Academic Publishers Dordrecht pp 11ndash43

58 HofackerIL (2003) Vienna RNA secondary structure serverNucleic Acids Res 31 3429ndash3431

1398 Nucleic Acids Research 2005 Vol 33 No 4

Page 7: doi:10.1093/nar/gki267 Modular RNA architecture revealed ... · Modular RNA architecture revealed by computational analysis of existing pseudoknots and ... Samuela Pasquali, ... RNAs

of diverse RNAs include IRES (PKB223) RNA 18S rRNA(PKB205 PKB234) HIV-1 50-UTR (PKB239) and alternativepseudoknot of hammerhead ribozyme (PKB173) In this dou-ble pseudoknot group six pairs (1 3 5 6 8 and 9 in Table 1)produce local sequence alignments [using LALIGN (40)

httpworkbenchsdscedu] that suggest significant resultsIn particular the pair PKB134PKB173 (Pair 5 in Table 1)indicates a significant match The pairs PKB134PKB205(Pair 6) and PKB205PKB223 (Pair 10 in Table 1) alsodisplay intriguing sequence and functional similarities

A A A A A A C G G G A A G C A A G G G C U

GGGA

CA A

CCCU

A

A

A

U

GA U

C C C G G

G

A AACC

A

U

PKB 3

Gag-polglobal sequence identity 57

PKB 128U U U A A A C U G U U G A G A G G U G C C U G

U C U C U A C G G A C

GCCUG

GA G

CGGAC

UU U

G

U AUA

G

A

C

PKB 217U U U A A A C U G C U A G C C A C C U C U G G U

G G U G G A G A U C A

GACCGC

CU C C

UGGCG

CU

U

G

G

GCA

U

G GY

AG

CG

ORF 1a - ORF 1bglobal sequence identity 53

G G G A A A C G G G C A G G C G G

GCG

C GCGC

A

G

A

C C G C C

AC

AA

C

PKB 44

G G G A A A U G G A C U G A G G C G

GGCG

C CGGC

A

C

A

C G C C

A

CA

A

C

PKB 240

A

A

ORF2 - ORF3global sequence identity 79

global sequence identity as low as 28

A

U CC

G

C

AA A

C

U

C

GG

C

U

G

UU

AA

PKB 217GCUGACGGUGUYUGGCG

UA

CUC

UG

UG A C C G C

C U G G C G

CC

CCUCUGGU

GUGGAGA

CA

A

U

GU U U A A A C U G C U A G C 5

3

PKB 174

G A AG

UG

AA

AC

U

U

A

U

UG

C A

UUG

UC

UC

U

C

A

U A

U

AGGG

UCCCU

CGGGCCACUG

CCCGGUGAC

G

CC

GU

GU

G

A

CA

C

G C G C U A C A

C G C G A U G U

A G A C C G

U C U G G C

A

2603 3

52476 A A A U U U A U

6830

BPair 4

global sequence identity 37

G G G A A A C U G G A A G G C G G G G C

GCUGCA

GA C

GACUG

G

A

A

C C C C G

PKB 4

C

AA

AUA

UG

Figure 4 Sequence and structural comparisons of pseudoknots in the frameshifting functional families (A) Two secondary structures and corresponding topologiesfor each of the three viral frameshifting subfamilies Gag-pro ORF1aORF1b and ORF2ORF3 Sequence identity within the same subfamily ranges between 50 and80 it is as low as 28 when comparing pseudoknots from different subfamilies All frameshifting pseudoknots are functionally related and have the same topology(dual graph) (B) Secondary structures and their aligned nucleotides (color coded) for frameshifting pseudoknots of LDV (PKB217) and Rous sarcoma virus(PKB174) Global sequence alignments were performed using the program ALIGN with default (164) gap openinggap extension penalties

1390 Nucleic Acids Research 2005 Vol 33 No 4

These relationships suggest structural and functional similar-ities that are not previously known (Table 1)

To support the significance of these matches we performsequence alignments using SmithndashWaterman parameters forgap opening and gap extension penalties (41) We choose thegap opening penalty value of 15 from a range of 0 to 19 Toproduce better alignments for large sequence segments (shortalignments are less likely to be significant in our context) ourgap extension penalty of 1 from a range of 1 to 8 allowslarger extensions consistent with the possibility of the inser-tion of long sequences and added 2D motifs In Table 1 wealso report the alignment score which is the sum of nucleotidematches mismatches and gap penalties

Table 1 shows that the probe structure (PKB178) matchestwo other pseudoknots (PKB135 PKB173) with sequenceidentities ranging from 59 to 71 for 24ndash49 nt alignmentlengths (compared with 56 sequence identity over 129 ntfor the functionally related pair PKB134PKB135 Pair 1)Significantly the pairs PKB134PKB173 (Pair 5) PKB134PKB205 (Pair 6) and PKB205PKB223 (Pair 10) havesequence identity range of 72ndash80 The non-randomness ofthe 10 aligned pairs in Table 1 is also evident in the ratio Rwhich measures the level of significance above the randomexpectation and has value 1 for random pairs

R = (percent sequence identity)(percent sequence identitywhen one of the aligned sequences is randomized)

We also define ltRgt as the R-value averaged over align-ments with different randomized sequences Table 1 showsthat some pseudoknot pairs with no known functional rela-tionship have ltRgt values ranging between 106 and 147whereas the related pair PKB134PKB135 has a value of154 many matched pairs have ltRgt values lt1 Thus thetopological similarities we identify led to matched pairswith non-random sequence alignment results Non-randomsequence matches may arise from topological andor func-tional similarities as illustrated below for three pseudoknotpairs (5 6 and 10 in Table 1)

The rate of false positives generated by topological and ltRgtvalue screening can be easily estimated by considering allpossible pairs from the set of six matched double pseudo-knots in Figure 3 These pseudoknots form 15 distinctpairs 6 of which have ltRgt values gt114 [Table 1 withS(G) = 117] Table 1 shows that only four pairs have largeltRgt values (gt12) Pair 1 (PKB134PKB135) is related Pair 5(PKB134PKB173) and Pair 6 (PKB134PKB205) are mostprobably related and Pair 3 (PKB178PKB173) is of unknownrelationship (in fact both are self-cleaving ribozymes)Assuming that the Pair 3 is unrelated and therefore reflectsa false positive we have an error rate of 115 or 7 If theinferred functionally related Pairs 5 and 6 are also consideredas false positives we then have a conservative error rateof 20

Functional relationship between PKB134 and PKB173(Pair 5 in Table 1)

The 116 nt PKB134 pseudoknot occurs in the 30 end of bromemosaic virus genomic RNA (31) The 73 nt PKB173 is ahammerhead ribozyme of satellite cereal yellow dwarfvirus-RPV (satRPV) RNA capable of adopting an alternativepseudoknot conformation by forming an (L1ndashL2a) stem whose

existence is established by experimental mutational analysis(42) Figure 3 shows that their topologies have the same lsquocorersquobut PKB134 has two extra stemndashloops The local sequencealignment of this pair using LALIGN (Figure 5) shows 19bases aligned to 80 sequence identity and an ltRgt valueof 123 (Table 1) There are two significant segments inPKB173 that are matched with those in PKB134 UACUGU-CUGACGACGUAUCC (nt 11ndash30) and UAGAAGGCUG-GUGCC (nt 39ndash53) These segments and their matchedregions in PKB134 span the stem and unpaired elements ofthe secondary structures (see Figure 5) Significantly bothPKB134 and PKB173 pseudoknots serve as template for rep-licase a protein that recognizes specific sequence regions toinitiate replication of viral genomes (3142) Moreover a con-served sequence (CUGANGA) of PKB173 is in the alignedregion (nt 11ndash30) Thus our analysis suggests that PKB134and PKB173 pseudoknots are structurally and functionallyrelated

Further analysis of pseudoknots PKB134 and PKB173reveals specific domains interacting with the replicase Experi-ments have shown that mutations in domains A C B1 and B2[defined in (31)] of PKB134 impair viral RNA replication (43)Significantly our two aligned sequence regions span domainsA B1 C and also D Thus the replicase recognizes severaldomains in the PKB134 pseudoknot structure In PKB173 thepseudoknot forming five base pair helix (GCGCG) is implic-ated in replicase recognition or ligation (42) This pseudoknotalso has conserved sequence regions ACAAA (61ndash65) or itscomplement (possibly the replication origin) and AGAAA(nucleotides attached to the 30 end but not shown) Howeverthese regions are not observed in PKB134

Functional relationship between PKB205 and PKB223(Pair 10 in Table 1)

The 63 nt PKB223 is the pseudoknot of IRES of capsid-producing cricket paralysis-like viruses (CrPV) formethionine-independent initiation of translation (44) Appear-ing immediately upstream from the capsid-coding sequenceit initiates translation without a canonical AUG (methionine)initiation codon The 48 nt PKB205 is the pseudoknotsubstructure (V4 domain) of SSU of Palmaria palmata 18SrRNA which was found using extensive comparative analysis(4546) Figure 5 (Pair 10 lower panel) shows that PKB223 isa 4-vertex simple pseudoknot with two inserted stemndashloopsand PKB205 is a 4-vertex double pseudoknot with an insertedstemndashloop Figure 5 (middle panel) displays the local sequencealignment of PKB223 and PKB205 yielding a match of 14 ntwith 72 sequence identity This significant sequence simil-arity occurs in regions ACAAUAUCCAGGAA (6148ndash6161)and ACAUUAGCAUGGAA (765ndash778) of PKB223 andPKB205 respectively Remarkably we find that the 9 ntsegment 50-UUAGGUUAG-30 (nt 6100ndash6108) of PKB223is complementary to 30-GGUACGAUU-50 (nt 768ndash776) ofPKB205 except at G6103 Our finding corroborates withthe experimental identification that IRES elements in cellularmRNA contain a short 9 nt (nt 133ndash141 CCGGCGGGU) inthe 50-UTR of the homeodomain protein Gtx which is com-plementary to nt 1132ndash1124 of 18S rRNA (47) Indeed manysuch near complementary sequences have been found bet-ween the 30 end of 18S rRNA and picornavirus RNAs (48)

Nucleic Acids Research 2005 Vol 33 No 4 1391

Fig

ure

5

Th

ree

no

vel

fun

ctio

nal

lyre

late

dp

seu

do

kn

otp

airs

infe

rred

by

top

olo

gy

mat

chin

gan

dse

qu

ence

alig

nm

ent

Up

per

pan

els

Sec

ond

ary

stru

ctu

res

and

thei

ral

ign

edn

ucl

eoti

des

(co

lor

cod

ed)

for

pse

udo

kn

ot

pai

rsP

KB

13

4P

KB

17

3(P

air

5T

able

1

left

pan

el)

PK

B2

05

PK

B2

23

(Pai

r1

0T

able

1

mid

dle

pan

el)

and

PK

B1

34

PK

B20

5(P

air

6

rig

ht

pan

el)

PK

B1

34

occ

urs

in30 e

nd

of

bro

me

mosa

icv

iru

sg

eno

mic

RN

A

PK

B1

73

isa

pse

udo

kn

oto

fsa

tell

ite

cere

aly

ello

wd

war

ftv

iru

s-R

PV

RN

AP

KB

22

3is

the

pse

ud

ok

no

tofin

tern

alen

try

site

(IR

ES

)o

fca

spid

-pro

du

cin

gC

rPV

san

dP

KB

20

5is

the

pse

udo

kn

ots

ub

stru

ctu

re(V

4d

om

ain

)o

fS

SU

of

PP

alm

ata

18

SrR

NA

L

ow

erp

anel

sT

he

alig

ned

sub

gra

ph

so

rre

gio

ns

are

shad

edo

rm

ark

ed

1392 Nucleic Acids Research 2005 Vol 33 No 4

Intriguingly for CrPVs the complementary segments occur inpseudoknots with similar folds not involving the dangling30 end of 18S rRNA Base pair complementary between IRESelements of mRNA and 18S rRNA is a mechanism for ribo-some recruitment and translation In viral IRES elements thesecondary and tertiary structures are believed to play criticalroles in translation initiation (44)

Functional relationship between PKB205 and PKB134(Pair 6 in Table 1)

The functional relationsip of the pair PKB205PKB134 issimilar to that for PKB205PKB223 above PKB134 belongsto a common viral tRNA-like functional class whose role is torecruit host ribosomes to initiate viral protein synthesis (49)It is known to interact with the ribosome via tRNA shapemimicry (31) Our analysis indicates that tRNA-like mole-cules also contain a region which is nearly complementary toPKB205 a pseudoknot region of the 18S rRNA Table 2shows regions of PKB205 complementary to four tRNA-like molecules (PKB134 191 138 17) the complementaryregion in PKB223 is shown for comparison Significantlyeach of the pseudoknots PKB223 PKB134 and PKB191has a region complementary to the same 8ndash9 nt unpairedregion in PKB205 (767ndash776) This single-stranded regionis most probably available for base pairing with incomingtRNA-like molecules In addition two other regions inPKB205 are complementary to tRNA-like pseudoknotsPKB138 and PKB17 one region is a hairpin loop and anotherregion is part of short stems Thus this analysis suggests thattRNA-like molecules may employ both structural mimicry andsequence complementarity to recruit SSU rRNA an interac-tion mode similar to tRNA PKB134 also has a region UACA-GUGUUGAAAAACA (nt 2078ndash2094) closely matched tothe region UAGAGUGUUCAAAGCCA (nt 732ndash748) inPKB205 with an average R-value of 123

Tertiary structure similarity of topological matches inribosomal RNAs

The preceding examination of three pseudoknot pairs showsthat topological matches can suggest functionally relatedpseudoknots It is also important to show that novel topo-logical matches led to similar tertiary structures We focusfor this purpose on rRNA since several bacterial and archaealribosome structures have been solved in recent years In factin our previous work we have found several topologicalmatches of the Scerevisiae 5S rRNA motif within its largerrRNAs two within 16S rRNA (domains II and III) and threewithin 23S rRNA (domains III IV and VI) (2) From availableX-ray structures for Thermus thermophilus 16S rRNA (PDBaccession no 1fjg) (50) and 23S rRNA from Deinococcus

radiodurans (1nkw) and Haloarcula marismortui (1ffk)(5152) only 5S rRNA motifrsquos matches in domain II of16S rRNA and domain III of 23S rRNA are relevant theother three cases are not valid matches in these species

Figure 6 displays the secondary diagrams graphical repres-entations and tertiary structures of 5S rRNA for bacteriumDradiodurans (1nkw) and archaea Hmarismortui (1ffk)and their topologically matched modules within 16S and23S rRNAs the secondary diagrams are adapted from thosein Gutellrsquos comparative RNA website (httpwwwrnaicmbutexasedu) (53) The module in 23S is an identicalmatch for the 6-vertex 5S graph except for a hairpin loopwhich does not contribute to the topological invariant SThe 16S module is a 5-vertex graph because a 2 bp stem atthe junction has a non-canonical base pair AG (dual graph ruleD1 defines a stem as having at least two canonical andorwobble base pairs) for Scerevisiae this stem consists ofonly canonical base pairs making the 16S module an identicalmatch for the 5S topology As shown in Figure 6 the 16S and23S modules have a similar L-shaped tertiary fold as the 5Sstructure Tertiary structure similarity with 5S is more strikingfor the module of domain II (nt 653ndash753) of Tthermophilus16S rRNA than for the module of domain III (nt 1033ndash1143) of23S rRNA Both the 5S structure and 16S module have apronounced 3-stem junction where the chain ends of the16S module are located Structural (backbone) overlap ofthe 5S structure and 16S module shows similar structuralfeatures except for the orientations of the smaller stems atthe junction (We aligned the structures manually because thechain ends of 5S and 16S module occur at non-correspondinglocations) In contrast the 23S module shows significant dis-tortions on the shorter helical arm with the hairpin loop foldedback to the helix as shown in the structural overlap of the 5Sstructure and 23S module Structural similarities seen in theseexamples are probably due to geometric constraints which areapparent in their secondary diagrams and graphical representa-tions Thus 2D topological matches can lead to similar 3Dstructures Below we survey the distribution of modules inrRNAs

Distribution of submotifs in 16S and 23S ribosomalRNAs

Characterizing the submotifs within the 16S and 23S rRNAsthe largest known RNA molecules can help uncover themodular construction of RNA structures as analysis ofrRNArsquos tertiary modules has revealed Furthermore our pre-vious analysis of rRNArsquos structural motifs indicated existenceof characteristic features (35) For example the distributionsof pairedunpaired bases in stems bulgesinternal loops hair-pin loops and junctions follow specific functional forms

Table 2 Short sequence regions in PKB205 complementary to tRNA-like molecules

PKB205 regions Complementary sequences

30-GGUACGAUU-50 (768ndash776) PKB223 50-UUAGGUUAG-30 (6100ndash6108) IRES30-GUGAGAUU-50 (769ndash776) PKB134 50-CACUGUAAA-30 (113ndash120) tRNA-like30-UACGAUUAC-50 (767ndash775) PKB191 50-AUGCUCAUG-30 (77ndash85) tRNA-like30-UGUGAGAUU-50 (731ndash739) PKB138 50-ACACUUUAA-30 (38ndash47) tRNA-like30-GUUUGCGGACC-50 (746ndash756) PKB17 50-CAAAACCCUGG-30 (19ndash29) tRNA-like

Nucleic Acids Research 2005 Vol 33 No 4 1393

Figure 6 Comparison of the graph topologies and 2D and 3D structures of 5S rRNAs (A and B) and modules in 16S and 23S rRNAs (C and D) (C and D) Twostructural (backbone) overlaps of the 5S structure (1ffk gold color) with 16S module (1fjg blue) and 5S structure with 23S module (1nkw green) are shown Thesecondary diagrams are adapted from those in Gutellrsquos comparative RNA website (httpwwwrnaicmbutexasedu) and 3D structures are from the Protein DataBank (httpwwwrcsborgpdb) The 2D modules were identified computationally by our earlier work on finding120 nt 5S rRNA motifs in 16S and 23S rRNAs ofScerevisiae(2) The equivalent tertiary structures and modules for 5S 16S and 23S rRNAs from other organisms show that similar graph topologies can lead tosimilar secondary and tertiary structures All structures have similar graph topologies The Tthermophilus 16S module is a 5-vertex instead of 6-vertex graph becausea 2 bp stem at the junction has a non-canonical base pair AG for Scerevisiae this stem consists of only canonical base pairs making the 16S module an identicalmatch for the 5S topology

1394 Nucleic Acids Research 2005 Vol 33 No 4

Moreover the percentages of nucleotides in stems bulgesinternal loops hairpin loops and junctions do not vary for16S and 23S rRNAs even though the latter is twice thesize of the former Such motif characteristics help discriminateRNA-like from non-RNA-like molecules and guide the designof novel RNAs Here we analyze the distinct modular unitsmdashsmall tree submotifsmdashthat makeup the rRNAs to advance suchapplications Since all small tree graphs up to 10 vertices havebeen exhaustively enumerated (5455) we can use our sub-structure search algorithm to identify the occurrences of alldistinct tree submotifs in rRNAs

Specifically we use sets of tree graphs for V = 4 5 6 7 8 9and 10 comprising 2 3 6 11 23 47 and 106 graphs respect-ively for a total of 198 tree motifs The complete enumeratedmotifs can be found in the RAG web resource (httpmonodbiomathnyuedurna) Figure 7 shows as an example the11 possible 7-vertex tree graphs and lists the frequency ofoccurrence of each tree graph in rRNAs Remarkably thesame tree motifs are abundant in both 16S and 23S rRNAsThe most abundant (10 or 15 times) 7-vertex tree has two3-stem junctions (topological invariant Stree = 6160) 4- and5-stem junctions are the next most prevalent submotifs(Stree = 5137 5921 and 5667) Interestingly the unbranchedtree (Stree = 7219) and highly branched trees (Stree = 4509)are rare

Figure 8 plots the percentage of distinct tree graphs in eachset V (fraction of motifs identified relative to possible motifs)

found in rRNA For small tree graphs V = 4 5 and 6 allpossible topologies are present but not for V gt 6 In fact thepercentage of trees found declines with V In 23S rRNA90 of possible trees are found for V = 7 and only40 (representing 47 tree topologies) are found forV = 10 Larger trees (V gt 10) are expected to have muchsmaller percentages This percentage decline is expectedsince small tree modules are less specific than large treesIn total 111 submotifs out of 198 are identified The observedtrends in Figure 8 may be generally valid for rRNA structuresof other organisms since they are highly conserved althoughsome statistical fluctuations are expected due to structuraldifferences and RNArsquos finite size

These data show that RNA structures are modular constructsof relatively few (tree) building blocks Extrapolation of thecurves in Figure 8 suggests that 20 of V = 11 motifs (totalof 235 trees) and 10 of V = 12 motifs (total of 551 trees)probably exist in rRNAs Thus the total number of buildingblocks in rRNAs with V lt 13 would be 210 Although thepercentages of trees with higher V values are low they aremore numerous and will contribute to the overall number ofbuilding blocks We thus consider 210 as the minimum num-ber of building blocks The building block number may beexploited in the future design of novel functional RNAmolecules using modular assembly from 210 distinct treemotifs The tree motifs may be taken from substructures ofexisting RNAs allowing simultaneous sequence and structure

treeS =

16S 23SOccurrence

ID

6 4

)

0 1

(7 8)

6191

(7 11)

4509

0 1

(7 9)

5385

3 3

7

(7 10)

5137

5 5

(7 6)

5916

5

(7 5)

5921

(7 1)

0 0

7219

(7 3)

2 4

6446

4 4

(7 7)

5667

10 15

(7 4

6160

0 3

6692

(7 2)

Figure 7 Occurrence frequency of 11 distinct 7-vertex tree topologies as submotifs of Scerevisiae 16S and 23S rRNAs The tree graphs are labeled using topologicalinvariant Stree (second top row) and ID number (i j) (first row) which refers to motif identification number as used in RAG database (httpmonodbiomathnyuedurna) The two numbers at the bottom of each graph refer to the number of times that topology is found embedded in 16S and in 23S rRNAs respectively

Nucleic Acids Research 2005 Vol 33 No 4 1395

design Such a build-up approach appears feasible from our(26) and other works (35) and is akin to the lsquobuilduprsquo tech-nique for proteins pioneered by Vasquez and Scheraga (56)

The above estimate of building block number can be eithera consequence of the limited number of available structures orthe fact that RNA adopts only a subset of possible graphsCertainly future discoveries of novel RNA structures mightindicate a larger repertoire of building blocks Still our sur-veys of natural RNAs (26) suggest that RNAs possess specifictopological characteristics and favor a subset of the possiblegraph motifs The missing motifs may not exist in nature forRNA owing to physical or functional constraints

The building block number can also be defined in terms ofirreducible graphs ie trees that cannot be decomposed intosmaller trees However this approach has a major drawbackSince all RNAs are composed of connected stems their sub-structures can be reduced to two vertex graphs which is not aninformative description of complex RNA folds Still if wedefine 4-vertex trees as the smallest irreducible graphs thenumber of irreducible graphs can be calculated Based on oursubmotif data we estimate that for V up to 10 there are about42 irreducible tree modules in rRNAs

SUMMARY AND CONCLUSION

Our automated comparison of RNA secondary structuresbased on graphical representation allows the examination ofconserved 2D RNA topological features which in turn suggestfunctional relationships This approach has an advantage oversequence alignment because functionally related RNAs oftenlack sequence similarity as shown for frameshifting pseudo-knots with low global sequence identities Of course ourstructure alignment approach relies on available 2D RNA

structures and thus will increase in scope as more RNAstructures become available Our eight identified pairs ofpotentially related pseudoknots (Table 1) reveal three pairsthat are functionally related the pseudoknots of 30 end ofbrome mosaic virus genomic RNA (PKB134) and alternativehammerhead ribozyme pseudoknot (PKB173) are templatesfor replicase to initiate replication of viral genomes the pseu-doknots of internal entry site (IRES) of caspid-producingCrPVs (PKB223) and SSU of 18S rRNA (PKB205) are asso-ciated with translation initiation and the pseudoknots of 18SrRNA (PKB205) and viral tRNA-like (PKB134) most prob-ably interact via both structural mimicry and base comple-mentarity To the best of our knowledge these functionalrelationships have not been described previously

The developed RNA comparison algorithm also led us toquantify the modularity of 16S and 23S rRNAs using smallenumerated tree motifs Our results suggest that rRNAs areconstructed from at least 210 distinct tree subtopologies(or modules) having up to 180 nt Significantly we findthat 5S rRNA and two tree modules within 16S and 23SrRNAs with similar topologies have similar tertiary shapesaffirming the relationship between 2D graphical representa-tion and 3D structure The tree modules in rRNA may be usedas building blocks for the design of novel RNA folds A similaridea has been employed to design and improve functionalRNA molecules (510) Design by assembly of modulesexploits motifs present in natural RNAs unlike de novo designof sequences and structures (11) The issues of discoveringnovel topologies (6) building module library and designingfunctional folds are likely to increase in importance withinterest in novel RNAs

Our algorithm has several limitations Of course its utilitydepends on availability of reliable RNA secondary foldingalgorithms (5758) More fundamentally although simplifiedgraphical representations specify topological characteristicsdetailed sequence and motif information is missing An asso-ciated problem is that unusual RNA structurendashfunction rela-tionships due to deleterious mutations as in frameshiftingpseudoknots (28) and structural flexibility are not capturedMore advanced graph constructs such as weighted graphs mayhowever specify detailed motif features at the secondary andbase pair levels Another limitation of the present algorithmis the requirement to fix the probe structure or graph As inprotein structure alignment algorithms (15) future refinementsof our algorithm should allow some flexibility in motifdefinitions as the similarity search proceeds We inviteRNA scientists to visit our RAG web resource at httpmonodbiomathnyuedurna and suggest to us specific enhancementsthat will be useful in practice

ACKNOWLEDGEMENTS

We thank Dinshaw Patel for comments that motivated thiswork This research was supported by a Joint NSFNIGMSInitiative in Mathematical Biology (DMS-0201160) and theHuman Frontier Science Program (HFSP)

REFERENCES

1 TinocoI and BustamanteC (1999) How RNA folds J Mol Biol293 271ndash281

Figure 8 Distribution of all possible distinct tree modules with V vertices foundin Scerevisiae 16S and 23S rRNAs In graph theory each V is represented by aset of possible tree structures For example for V= 4 5 6 7 8 9 and 10 there are2 3 6 11 23 47 and 106 possible trees respectively The distribution of treesquantifies the diversity of tree modules within rRNAs This analysis suggeststhat rRNAs possess at least 210 tree modules

1396 Nucleic Acids Research 2005 Vol 33 No 4

2 GanHH PasqualiS and SchlickT (2003) Exploring the repertoire ofRNA secondary motifs using graph theory implications for RNA designNucleic Acids Res 31 2926ndash2943

3 SoukupGA and BreakerRR (1999) Engineering precision RNAmolecular switches Proc Natl Acad Sci USA 96 3584ndash3589

4 BreakerRR (2002) Engineered allosteric ribozymes as biosensorcomponents Curr Opin Biotechnol 13 31ndash39

5 IkawaY FukadaK WatanabeS ShiraishiH and InoueT (2002)Design construction and analysis of a novel class of self-foldingRNA Structure 10 527ndash534

6 KimN ShiffeldrimN GanHH and SchlickT (2004) Candidatesfor novel RNA topologies J Mol Biol 341 1129ndash1144

7 MitchellJR ChengJ and CollinsK (1999) A box HACAsmall nucleolar RNA-like domain at the human telomerase RNA 30

end Mol Cell Biol 19 567ndash5768 DoudnaJA and CechTR (2002) The chemical repertoire

of natural ribozymes Nature 418 222ndash2289 StorzG (2002) An expanding universe of noncoding RNAs Science

296 1260ndash126310 BreakerRR and JoyceGF (1994) Inventing and improving ribozyme

function rational design versus iterative selection methods TrendsBiotechnol 12 268ndash275

11 AndronescuM FejesAP HutterF HoosHH and CondonA (2004)A new algorithm for RNA secondary structure design J Mol Biol336 607ndash624

12 CarterRJ DubchakI and HolbrookSR (2001) A computationalapproach to identify genes for functional RNAs in genomic sequencesNucleic Acids Res 29 3928ndash3938

13 DoudnaJA (2000) Structural genomics of RNA Nature Struct Biol7 954ndash956

14 RivasE KleinRJ JonesTA and EddySR (2001) Computationalidentification of noncoding RNAs in Ecoli by comparativegenomics Curr Biol 11 1369ndash1373

15 HolmL and SanderC (1996) Mapping the protein universeScience 273 595ndash602

16 ShindyalovIN and BournePE (2001) A database and tools for 3-Dprotein structure comparison and alignment using the CombinatorialExtension (CE) algorithm Nucleic Acids Res 29 228ndash229

17 BenedettiG and MorosettiS (1996) A graph-topological approach torecognition of pattern and similarity in RNA secondary structuresBiophys Chem 59 179ndash184

18 MargalitH ShapiroBA OppenheimAB and MaizelJVJr (1989)Detection of common motifs in RNA secondary structures Nucleic AcidsRes 17 4829ndash4845

19 ShapiroBA and ZhangKZ (1990) Comparing multiple RNAsecondary structures using tree comparisons Comput Appl Biosci6 309ndash318

20 LeSY NussinovR and MaizelJV (1989) Tree graphs of RNAsecondary structures and their comparisons Comput Biomed Res22 461ndash473

21 DuarteCM WadleyLM and PyleAM (2003) RNA structurecomparison motif search and discovery using a reduced representation ofRNA conformational space Nucleic Acids Res 31 4755ndash4761

22 ChevaletC and MichotB (1992) An algorithm for comparing RNAsecondary structures and searching for similar substructures ComputAppl Biosci 8 215ndash225

23 WangL and ZhaoJ (2003) Parametric alignment of ordered treesBioinformatics 19 2237ndash2245

24 FontanaW KoningsDAM StadlerPF and SchusterP (1993)Statistics of RNA secondary structures Biopolymers 33 1389ndash1404

25 van BatenburgFHD GultyaevAP PleijCWA NgJ andOliehoekJ (2000) PseudoBase a database with RNA pseudoknotsNucleic Acids Res 28 201ndash204

26 GanHH FeraD ZornJ ShiffeldrimN TangM LasersonUKimN and SchlickT (2004) RAG RNA-As-Graphsdatabasemdashconcepts analysis and features Bioinformatics20 1285ndash1291

27 FeraD KimN ShiffeldrimN ZornJ LasersonU GanHH andSchlickT (2004) RAG RNA-As-Graphs web resource BMCBioinformatics 5 88

28 ChenXY KangHS ShenLX ChamorroM VarmusHE andTinocoI (1996) A characteristic bent conformation of RNA pseudoknotspromotes 1 frameshifting during translation of retroviral RNAJ Mol Biol 260 479ndash483

29 KangHS HinesJV and TinocoI (1996) Conformation of anon-frameshifting RNA pseudoknot from mouse mammary tumor virusJ Mol Biol 259 135ndash147

30 GanHH PerlowRA RoyS KoJ WuM HuangJ YanSNicolettaA VafaiJ SunD et al (2002) Analysis of protein sequencestructure similarity relationships Biophys J 83 2781ndash2791

31 FeldenB FlorentzC GiegeR and WesthofE (1994) Solutionstructure of the 30-end of brome mosaic-virus genomic RNAsmdashconformational mimicry with canonical transfer-RNAs J Mol Biol235 508ndash531

32 GrossJL and YellenJ (1999) Graph Theory and its Applications CRCPress Boca Raton FL

33 KoblerJ SchoningU and ToraanJ (1993) The Graph IsomorphismProblem Birkhauser Boston MA

34 RandicM and ZupanJ (2001) On interpretation of well-knowntopological indices J Chem Inform Comput Sci 41 550ndash560

35 ZornJ GanHH ShiffeldrimN and SchlickT (2004) Structuralmotifs in ribosomal RNAs implications for RNA design and genomicsBiopolymers 73 340ndash347

36 JacksT MadhaniHD MasiarzFR and VarmusHE (1988) Signalsfor ribosomal frameshifting in the Rous-Sarcoma virus GagndashPolregion Cell 55 447ndash458

37 GodenyEK ChenL KumarSN MethvenSL KooninEV andBrintonMA (1993) Complete genomic sequence and phylogeneticanalysis of the lactate dehydrogenase-elevating virus (Ldv) Virology194 585ndash596

38 MarczinkeB FisherR VidakovicM BloysAJ and BrierleyI (1998)Secondary structure and mutational analysis of the ribosomalframeshift signal of Rous sarcoma virus J Mol Biol 284 205ndash225

39 MyersEW and MillerW (1988) Optimal alignments in linear spaceComput Appl Biosci 4 11ndash17

40 HuangXQ and MillerW (1991) A time-efficient linear-space localsimilarity algorithm Adv Appl Math 12 337ndash357

41 SmithTF and WatermanMS (1981) Identification of commonmolecular subsequences J Mol Biol 147 195ndash197

42 SongSI SilverSL AulikMA RasochovaL MohanBR andMillerWA (1999) Satellite cereal yellow dwarf virus-RPV (satRPV)RNA requires a DouXble hammerhead for self-cleavage and analternative structure for replication J Mol Biol 293 781ndash793

43 DreherTW and HallTC (1988) Mutational analysis of thetransfer-RNA mimicry of brome mosaic-virus RNAmdashsequence andstructural requirements for aminoacylation and 30-adenylationJ Mol Biol 201 41ndash55

44 KanamoriY and NakashimaN (2001) A tertiary structure model of theinternal ribosome entry site (IRES) for methionine-independent initiationof translation RNA 7 266ndash274

45 WuytsJ De RijkP Van de PeerY PisonG RousseeuwP andDe WachterR (2000)Comparativeanalysisof more than 3000 sequencesreveals the existence of two pseudoknots in area V4 of eukaryotic smallsubunit ribosomal RNA Nucleic Acids Res 28 4698ndash4708

46 Van de PeerY De RijkP WuytsJ WinkelmansT and De WachterR(2000) The European small subunit ribosomal RNA databaseNucleic Acids Res 28 175ndash176

47 ChappellSA EdelmanGM and MauroVP (2000) A 9-nt segment ofa cellular mRNA can function as an internal ribosome entry site (IRES)and when present in linked multiple copies greatly enhances IRESactivity Proc Natl Acad Sci USA 97 1536ndash1541

48 ScheperGC VoormaHO and ThomasAA (1994) Basepairing with18S ribosomal RNA in internal initiation of translation FEBS Lett352 271ndash275

49 BarendsS Rudinger-ThirionJ FlorentzC GiegeR PleijCW andKraalB (2004) tRNA-like structure regulates translation of Bromemosaic virus RNA J Virol 78 4003ndash4010

50 CarterAP ClemonsWM BrodersenDE Morgan-WarrenRJWimberlyBT and RamakrishnanV (2000) Functional insights fromthe structure of the 30S ribosomal subunit and its interactions withantibiotics Nature 407 340ndash348

51 HarmsJ SchluenzenF ZarivachR BashanA GatS AgmonIBartelsH FranceschiF and YonathA (2001) High resolution structureof the large ribosomal subunit from a mesophilic Eubacterium Cell 107679ndash688

52 BanN NissenP HansenJ MoorePB and SteitzTA (2000) Thecomplete atomic structure of the large ribosomal subunit at 24 angstromresolution Science 289 905ndash920

Nucleic Acids Research 2005 Vol 33 No 4 1397

53 CannoneJJ SubramanianS SchnareMN CollettJRDrsquoSouzaLM DuYS FengB LinN MadabusiLV MullerKMet al (2002) The Comparative RNA Web (CRW) Site an online databaseof comparative sequence and structure information for ribosomal intronand other RNAs BMC Bioinformatics 3 2

54 HararyF and PrinsG (1959) The number of homeomorphicallyirreducible trees and other species Acta Math 101141ndash162

55 HararyF (1969) Graph Theory Addison-Wesley Reading Mass

56 VasquezM and ScheragaHA (1985) Use of buildup andenergy-minimization procedures to compute low-energy structures ofthe backbone of enkephalin Biopolymers 24 1437ndash1447

57 ZukerM MathewsDH and TurnerDH (1999) Algorithms andthermodynamics for RNA secondary structure prediction A practicalguide In BarciszewskiJ and ClarkBFC (eds) RNA Biochemistry andBiotechnology Kluwer Academic Publishers Dordrecht pp 11ndash43

58 HofackerIL (2003) Vienna RNA secondary structure serverNucleic Acids Res 31 3429ndash3431

1398 Nucleic Acids Research 2005 Vol 33 No 4

Page 8: doi:10.1093/nar/gki267 Modular RNA architecture revealed ... · Modular RNA architecture revealed by computational analysis of existing pseudoknots and ... Samuela Pasquali, ... RNAs

These relationships suggest structural and functional similar-ities that are not previously known (Table 1)

To support the significance of these matches we performsequence alignments using SmithndashWaterman parameters forgap opening and gap extension penalties (41) We choose thegap opening penalty value of 15 from a range of 0 to 19 Toproduce better alignments for large sequence segments (shortalignments are less likely to be significant in our context) ourgap extension penalty of 1 from a range of 1 to 8 allowslarger extensions consistent with the possibility of the inser-tion of long sequences and added 2D motifs In Table 1 wealso report the alignment score which is the sum of nucleotidematches mismatches and gap penalties

Table 1 shows that the probe structure (PKB178) matchestwo other pseudoknots (PKB135 PKB173) with sequenceidentities ranging from 59 to 71 for 24ndash49 nt alignmentlengths (compared with 56 sequence identity over 129 ntfor the functionally related pair PKB134PKB135 Pair 1)Significantly the pairs PKB134PKB173 (Pair 5) PKB134PKB205 (Pair 6) and PKB205PKB223 (Pair 10) havesequence identity range of 72ndash80 The non-randomness ofthe 10 aligned pairs in Table 1 is also evident in the ratio Rwhich measures the level of significance above the randomexpectation and has value 1 for random pairs

R = (percent sequence identity)(percent sequence identitywhen one of the aligned sequences is randomized)

We also define ltRgt as the R-value averaged over align-ments with different randomized sequences Table 1 showsthat some pseudoknot pairs with no known functional rela-tionship have ltRgt values ranging between 106 and 147whereas the related pair PKB134PKB135 has a value of154 many matched pairs have ltRgt values lt1 Thus thetopological similarities we identify led to matched pairswith non-random sequence alignment results Non-randomsequence matches may arise from topological andor func-tional similarities as illustrated below for three pseudoknotpairs (5 6 and 10 in Table 1)

The rate of false positives generated by topological and ltRgtvalue screening can be easily estimated by considering allpossible pairs from the set of six matched double pseudo-knots in Figure 3 These pseudoknots form 15 distinctpairs 6 of which have ltRgt values gt114 [Table 1 withS(G) = 117] Table 1 shows that only four pairs have largeltRgt values (gt12) Pair 1 (PKB134PKB135) is related Pair 5(PKB134PKB173) and Pair 6 (PKB134PKB205) are mostprobably related and Pair 3 (PKB178PKB173) is of unknownrelationship (in fact both are self-cleaving ribozymes)Assuming that the Pair 3 is unrelated and therefore reflectsa false positive we have an error rate of 115 or 7 If theinferred functionally related Pairs 5 and 6 are also consideredas false positives we then have a conservative error rateof 20

Functional relationship between PKB134 and PKB173(Pair 5 in Table 1)

The 116 nt PKB134 pseudoknot occurs in the 30 end of bromemosaic virus genomic RNA (31) The 73 nt PKB173 is ahammerhead ribozyme of satellite cereal yellow dwarfvirus-RPV (satRPV) RNA capable of adopting an alternativepseudoknot conformation by forming an (L1ndashL2a) stem whose

existence is established by experimental mutational analysis(42) Figure 3 shows that their topologies have the same lsquocorersquobut PKB134 has two extra stemndashloops The local sequencealignment of this pair using LALIGN (Figure 5) shows 19bases aligned to 80 sequence identity and an ltRgt valueof 123 (Table 1) There are two significant segments inPKB173 that are matched with those in PKB134 UACUGU-CUGACGACGUAUCC (nt 11ndash30) and UAGAAGGCUG-GUGCC (nt 39ndash53) These segments and their matchedregions in PKB134 span the stem and unpaired elements ofthe secondary structures (see Figure 5) Significantly bothPKB134 and PKB173 pseudoknots serve as template for rep-licase a protein that recognizes specific sequence regions toinitiate replication of viral genomes (3142) Moreover a con-served sequence (CUGANGA) of PKB173 is in the alignedregion (nt 11ndash30) Thus our analysis suggests that PKB134and PKB173 pseudoknots are structurally and functionallyrelated

Further analysis of pseudoknots PKB134 and PKB173reveals specific domains interacting with the replicase Experi-ments have shown that mutations in domains A C B1 and B2[defined in (31)] of PKB134 impair viral RNA replication (43)Significantly our two aligned sequence regions span domainsA B1 C and also D Thus the replicase recognizes severaldomains in the PKB134 pseudoknot structure In PKB173 thepseudoknot forming five base pair helix (GCGCG) is implic-ated in replicase recognition or ligation (42) This pseudoknotalso has conserved sequence regions ACAAA (61ndash65) or itscomplement (possibly the replication origin) and AGAAA(nucleotides attached to the 30 end but not shown) Howeverthese regions are not observed in PKB134

Functional relationship between PKB205 and PKB223(Pair 10 in Table 1)

The 63 nt PKB223 is the pseudoknot of IRES of capsid-producing cricket paralysis-like viruses (CrPV) formethionine-independent initiation of translation (44) Appear-ing immediately upstream from the capsid-coding sequenceit initiates translation without a canonical AUG (methionine)initiation codon The 48 nt PKB205 is the pseudoknotsubstructure (V4 domain) of SSU of Palmaria palmata 18SrRNA which was found using extensive comparative analysis(4546) Figure 5 (Pair 10 lower panel) shows that PKB223 isa 4-vertex simple pseudoknot with two inserted stemndashloopsand PKB205 is a 4-vertex double pseudoknot with an insertedstemndashloop Figure 5 (middle panel) displays the local sequencealignment of PKB223 and PKB205 yielding a match of 14 ntwith 72 sequence identity This significant sequence simil-arity occurs in regions ACAAUAUCCAGGAA (6148ndash6161)and ACAUUAGCAUGGAA (765ndash778) of PKB223 andPKB205 respectively Remarkably we find that the 9 ntsegment 50-UUAGGUUAG-30 (nt 6100ndash6108) of PKB223is complementary to 30-GGUACGAUU-50 (nt 768ndash776) ofPKB205 except at G6103 Our finding corroborates withthe experimental identification that IRES elements in cellularmRNA contain a short 9 nt (nt 133ndash141 CCGGCGGGU) inthe 50-UTR of the homeodomain protein Gtx which is com-plementary to nt 1132ndash1124 of 18S rRNA (47) Indeed manysuch near complementary sequences have been found bet-ween the 30 end of 18S rRNA and picornavirus RNAs (48)

Nucleic Acids Research 2005 Vol 33 No 4 1391

Fig

ure

5

Th

ree

no

vel

fun

ctio

nal

lyre

late

dp

seu

do

kn

otp

airs

infe

rred

by

top

olo

gy

mat

chin

gan

dse

qu

ence

alig

nm

ent

Up

per

pan

els

Sec

ond

ary

stru

ctu

res

and

thei

ral

ign

edn

ucl

eoti

des

(co

lor

cod

ed)

for

pse

udo

kn

ot

pai

rsP

KB

13

4P

KB

17

3(P

air

5T

able

1

left

pan

el)

PK

B2

05

PK

B2

23

(Pai

r1

0T

able

1

mid

dle

pan

el)

and

PK

B1

34

PK

B20

5(P

air

6

rig

ht

pan

el)

PK

B1

34

occ

urs

in30 e

nd

of

bro

me

mosa

icv

iru

sg

eno

mic

RN

A

PK

B1

73

isa

pse

udo

kn

oto

fsa

tell

ite

cere

aly

ello

wd

war

ftv

iru

s-R

PV

RN

AP

KB

22

3is

the

pse

ud

ok

no

tofin

tern

alen

try

site

(IR

ES

)o

fca

spid

-pro

du

cin

gC

rPV

san

dP

KB

20

5is

the

pse

udo

kn

ots

ub

stru

ctu

re(V

4d

om

ain

)o

fS

SU

of

PP

alm

ata

18

SrR

NA

L

ow

erp

anel

sT

he

alig

ned

sub

gra

ph

so

rre

gio

ns

are

shad

edo

rm

ark

ed

1392 Nucleic Acids Research 2005 Vol 33 No 4

Intriguingly for CrPVs the complementary segments occur inpseudoknots with similar folds not involving the dangling30 end of 18S rRNA Base pair complementary between IRESelements of mRNA and 18S rRNA is a mechanism for ribo-some recruitment and translation In viral IRES elements thesecondary and tertiary structures are believed to play criticalroles in translation initiation (44)

Functional relationship between PKB205 and PKB134(Pair 6 in Table 1)

The functional relationsip of the pair PKB205PKB134 issimilar to that for PKB205PKB223 above PKB134 belongsto a common viral tRNA-like functional class whose role is torecruit host ribosomes to initiate viral protein synthesis (49)It is known to interact with the ribosome via tRNA shapemimicry (31) Our analysis indicates that tRNA-like mole-cules also contain a region which is nearly complementary toPKB205 a pseudoknot region of the 18S rRNA Table 2shows regions of PKB205 complementary to four tRNA-like molecules (PKB134 191 138 17) the complementaryregion in PKB223 is shown for comparison Significantlyeach of the pseudoknots PKB223 PKB134 and PKB191has a region complementary to the same 8ndash9 nt unpairedregion in PKB205 (767ndash776) This single-stranded regionis most probably available for base pairing with incomingtRNA-like molecules In addition two other regions inPKB205 are complementary to tRNA-like pseudoknotsPKB138 and PKB17 one region is a hairpin loop and anotherregion is part of short stems Thus this analysis suggests thattRNA-like molecules may employ both structural mimicry andsequence complementarity to recruit SSU rRNA an interac-tion mode similar to tRNA PKB134 also has a region UACA-GUGUUGAAAAACA (nt 2078ndash2094) closely matched tothe region UAGAGUGUUCAAAGCCA (nt 732ndash748) inPKB205 with an average R-value of 123

Tertiary structure similarity of topological matches inribosomal RNAs

The preceding examination of three pseudoknot pairs showsthat topological matches can suggest functionally relatedpseudoknots It is also important to show that novel topo-logical matches led to similar tertiary structures We focusfor this purpose on rRNA since several bacterial and archaealribosome structures have been solved in recent years In factin our previous work we have found several topologicalmatches of the Scerevisiae 5S rRNA motif within its largerrRNAs two within 16S rRNA (domains II and III) and threewithin 23S rRNA (domains III IV and VI) (2) From availableX-ray structures for Thermus thermophilus 16S rRNA (PDBaccession no 1fjg) (50) and 23S rRNA from Deinococcus

radiodurans (1nkw) and Haloarcula marismortui (1ffk)(5152) only 5S rRNA motifrsquos matches in domain II of16S rRNA and domain III of 23S rRNA are relevant theother three cases are not valid matches in these species

Figure 6 displays the secondary diagrams graphical repres-entations and tertiary structures of 5S rRNA for bacteriumDradiodurans (1nkw) and archaea Hmarismortui (1ffk)and their topologically matched modules within 16S and23S rRNAs the secondary diagrams are adapted from thosein Gutellrsquos comparative RNA website (httpwwwrnaicmbutexasedu) (53) The module in 23S is an identicalmatch for the 6-vertex 5S graph except for a hairpin loopwhich does not contribute to the topological invariant SThe 16S module is a 5-vertex graph because a 2 bp stem atthe junction has a non-canonical base pair AG (dual graph ruleD1 defines a stem as having at least two canonical andorwobble base pairs) for Scerevisiae this stem consists ofonly canonical base pairs making the 16S module an identicalmatch for the 5S topology As shown in Figure 6 the 16S and23S modules have a similar L-shaped tertiary fold as the 5Sstructure Tertiary structure similarity with 5S is more strikingfor the module of domain II (nt 653ndash753) of Tthermophilus16S rRNA than for the module of domain III (nt 1033ndash1143) of23S rRNA Both the 5S structure and 16S module have apronounced 3-stem junction where the chain ends of the16S module are located Structural (backbone) overlap ofthe 5S structure and 16S module shows similar structuralfeatures except for the orientations of the smaller stems atthe junction (We aligned the structures manually because thechain ends of 5S and 16S module occur at non-correspondinglocations) In contrast the 23S module shows significant dis-tortions on the shorter helical arm with the hairpin loop foldedback to the helix as shown in the structural overlap of the 5Sstructure and 23S module Structural similarities seen in theseexamples are probably due to geometric constraints which areapparent in their secondary diagrams and graphical representa-tions Thus 2D topological matches can lead to similar 3Dstructures Below we survey the distribution of modules inrRNAs

Distribution of submotifs in 16S and 23S ribosomalRNAs

Characterizing the submotifs within the 16S and 23S rRNAsthe largest known RNA molecules can help uncover themodular construction of RNA structures as analysis ofrRNArsquos tertiary modules has revealed Furthermore our pre-vious analysis of rRNArsquos structural motifs indicated existenceof characteristic features (35) For example the distributionsof pairedunpaired bases in stems bulgesinternal loops hair-pin loops and junctions follow specific functional forms

Table 2 Short sequence regions in PKB205 complementary to tRNA-like molecules

PKB205 regions Complementary sequences

30-GGUACGAUU-50 (768ndash776) PKB223 50-UUAGGUUAG-30 (6100ndash6108) IRES30-GUGAGAUU-50 (769ndash776) PKB134 50-CACUGUAAA-30 (113ndash120) tRNA-like30-UACGAUUAC-50 (767ndash775) PKB191 50-AUGCUCAUG-30 (77ndash85) tRNA-like30-UGUGAGAUU-50 (731ndash739) PKB138 50-ACACUUUAA-30 (38ndash47) tRNA-like30-GUUUGCGGACC-50 (746ndash756) PKB17 50-CAAAACCCUGG-30 (19ndash29) tRNA-like

Nucleic Acids Research 2005 Vol 33 No 4 1393

Figure 6 Comparison of the graph topologies and 2D and 3D structures of 5S rRNAs (A and B) and modules in 16S and 23S rRNAs (C and D) (C and D) Twostructural (backbone) overlaps of the 5S structure (1ffk gold color) with 16S module (1fjg blue) and 5S structure with 23S module (1nkw green) are shown Thesecondary diagrams are adapted from those in Gutellrsquos comparative RNA website (httpwwwrnaicmbutexasedu) and 3D structures are from the Protein DataBank (httpwwwrcsborgpdb) The 2D modules were identified computationally by our earlier work on finding120 nt 5S rRNA motifs in 16S and 23S rRNAs ofScerevisiae(2) The equivalent tertiary structures and modules for 5S 16S and 23S rRNAs from other organisms show that similar graph topologies can lead tosimilar secondary and tertiary structures All structures have similar graph topologies The Tthermophilus 16S module is a 5-vertex instead of 6-vertex graph becausea 2 bp stem at the junction has a non-canonical base pair AG for Scerevisiae this stem consists of only canonical base pairs making the 16S module an identicalmatch for the 5S topology

1394 Nucleic Acids Research 2005 Vol 33 No 4

Moreover the percentages of nucleotides in stems bulgesinternal loops hairpin loops and junctions do not vary for16S and 23S rRNAs even though the latter is twice thesize of the former Such motif characteristics help discriminateRNA-like from non-RNA-like molecules and guide the designof novel RNAs Here we analyze the distinct modular unitsmdashsmall tree submotifsmdashthat makeup the rRNAs to advance suchapplications Since all small tree graphs up to 10 vertices havebeen exhaustively enumerated (5455) we can use our sub-structure search algorithm to identify the occurrences of alldistinct tree submotifs in rRNAs

Specifically we use sets of tree graphs for V = 4 5 6 7 8 9and 10 comprising 2 3 6 11 23 47 and 106 graphs respect-ively for a total of 198 tree motifs The complete enumeratedmotifs can be found in the RAG web resource (httpmonodbiomathnyuedurna) Figure 7 shows as an example the11 possible 7-vertex tree graphs and lists the frequency ofoccurrence of each tree graph in rRNAs Remarkably thesame tree motifs are abundant in both 16S and 23S rRNAsThe most abundant (10 or 15 times) 7-vertex tree has two3-stem junctions (topological invariant Stree = 6160) 4- and5-stem junctions are the next most prevalent submotifs(Stree = 5137 5921 and 5667) Interestingly the unbranchedtree (Stree = 7219) and highly branched trees (Stree = 4509)are rare

Figure 8 plots the percentage of distinct tree graphs in eachset V (fraction of motifs identified relative to possible motifs)

found in rRNA For small tree graphs V = 4 5 and 6 allpossible topologies are present but not for V gt 6 In fact thepercentage of trees found declines with V In 23S rRNA90 of possible trees are found for V = 7 and only40 (representing 47 tree topologies) are found forV = 10 Larger trees (V gt 10) are expected to have muchsmaller percentages This percentage decline is expectedsince small tree modules are less specific than large treesIn total 111 submotifs out of 198 are identified The observedtrends in Figure 8 may be generally valid for rRNA structuresof other organisms since they are highly conserved althoughsome statistical fluctuations are expected due to structuraldifferences and RNArsquos finite size

These data show that RNA structures are modular constructsof relatively few (tree) building blocks Extrapolation of thecurves in Figure 8 suggests that 20 of V = 11 motifs (totalof 235 trees) and 10 of V = 12 motifs (total of 551 trees)probably exist in rRNAs Thus the total number of buildingblocks in rRNAs with V lt 13 would be 210 Although thepercentages of trees with higher V values are low they aremore numerous and will contribute to the overall number ofbuilding blocks We thus consider 210 as the minimum num-ber of building blocks The building block number may beexploited in the future design of novel functional RNAmolecules using modular assembly from 210 distinct treemotifs The tree motifs may be taken from substructures ofexisting RNAs allowing simultaneous sequence and structure

treeS =

16S 23SOccurrence

ID

6 4

)

0 1

(7 8)

6191

(7 11)

4509

0 1

(7 9)

5385

3 3

7

(7 10)

5137

5 5

(7 6)

5916

5

(7 5)

5921

(7 1)

0 0

7219

(7 3)

2 4

6446

4 4

(7 7)

5667

10 15

(7 4

6160

0 3

6692

(7 2)

Figure 7 Occurrence frequency of 11 distinct 7-vertex tree topologies as submotifs of Scerevisiae 16S and 23S rRNAs The tree graphs are labeled using topologicalinvariant Stree (second top row) and ID number (i j) (first row) which refers to motif identification number as used in RAG database (httpmonodbiomathnyuedurna) The two numbers at the bottom of each graph refer to the number of times that topology is found embedded in 16S and in 23S rRNAs respectively

Nucleic Acids Research 2005 Vol 33 No 4 1395

design Such a build-up approach appears feasible from our(26) and other works (35) and is akin to the lsquobuilduprsquo tech-nique for proteins pioneered by Vasquez and Scheraga (56)

The above estimate of building block number can be eithera consequence of the limited number of available structures orthe fact that RNA adopts only a subset of possible graphsCertainly future discoveries of novel RNA structures mightindicate a larger repertoire of building blocks Still our sur-veys of natural RNAs (26) suggest that RNAs possess specifictopological characteristics and favor a subset of the possiblegraph motifs The missing motifs may not exist in nature forRNA owing to physical or functional constraints

The building block number can also be defined in terms ofirreducible graphs ie trees that cannot be decomposed intosmaller trees However this approach has a major drawbackSince all RNAs are composed of connected stems their sub-structures can be reduced to two vertex graphs which is not aninformative description of complex RNA folds Still if wedefine 4-vertex trees as the smallest irreducible graphs thenumber of irreducible graphs can be calculated Based on oursubmotif data we estimate that for V up to 10 there are about42 irreducible tree modules in rRNAs

SUMMARY AND CONCLUSION

Our automated comparison of RNA secondary structuresbased on graphical representation allows the examination ofconserved 2D RNA topological features which in turn suggestfunctional relationships This approach has an advantage oversequence alignment because functionally related RNAs oftenlack sequence similarity as shown for frameshifting pseudo-knots with low global sequence identities Of course ourstructure alignment approach relies on available 2D RNA

structures and thus will increase in scope as more RNAstructures become available Our eight identified pairs ofpotentially related pseudoknots (Table 1) reveal three pairsthat are functionally related the pseudoknots of 30 end ofbrome mosaic virus genomic RNA (PKB134) and alternativehammerhead ribozyme pseudoknot (PKB173) are templatesfor replicase to initiate replication of viral genomes the pseu-doknots of internal entry site (IRES) of caspid-producingCrPVs (PKB223) and SSU of 18S rRNA (PKB205) are asso-ciated with translation initiation and the pseudoknots of 18SrRNA (PKB205) and viral tRNA-like (PKB134) most prob-ably interact via both structural mimicry and base comple-mentarity To the best of our knowledge these functionalrelationships have not been described previously

The developed RNA comparison algorithm also led us toquantify the modularity of 16S and 23S rRNAs using smallenumerated tree motifs Our results suggest that rRNAs areconstructed from at least 210 distinct tree subtopologies(or modules) having up to 180 nt Significantly we findthat 5S rRNA and two tree modules within 16S and 23SrRNAs with similar topologies have similar tertiary shapesaffirming the relationship between 2D graphical representa-tion and 3D structure The tree modules in rRNA may be usedas building blocks for the design of novel RNA folds A similaridea has been employed to design and improve functionalRNA molecules (510) Design by assembly of modulesexploits motifs present in natural RNAs unlike de novo designof sequences and structures (11) The issues of discoveringnovel topologies (6) building module library and designingfunctional folds are likely to increase in importance withinterest in novel RNAs

Our algorithm has several limitations Of course its utilitydepends on availability of reliable RNA secondary foldingalgorithms (5758) More fundamentally although simplifiedgraphical representations specify topological characteristicsdetailed sequence and motif information is missing An asso-ciated problem is that unusual RNA structurendashfunction rela-tionships due to deleterious mutations as in frameshiftingpseudoknots (28) and structural flexibility are not capturedMore advanced graph constructs such as weighted graphs mayhowever specify detailed motif features at the secondary andbase pair levels Another limitation of the present algorithmis the requirement to fix the probe structure or graph As inprotein structure alignment algorithms (15) future refinementsof our algorithm should allow some flexibility in motifdefinitions as the similarity search proceeds We inviteRNA scientists to visit our RAG web resource at httpmonodbiomathnyuedurna and suggest to us specific enhancementsthat will be useful in practice

ACKNOWLEDGEMENTS

We thank Dinshaw Patel for comments that motivated thiswork This research was supported by a Joint NSFNIGMSInitiative in Mathematical Biology (DMS-0201160) and theHuman Frontier Science Program (HFSP)

REFERENCES

1 TinocoI and BustamanteC (1999) How RNA folds J Mol Biol293 271ndash281

Figure 8 Distribution of all possible distinct tree modules with V vertices foundin Scerevisiae 16S and 23S rRNAs In graph theory each V is represented by aset of possible tree structures For example for V= 4 5 6 7 8 9 and 10 there are2 3 6 11 23 47 and 106 possible trees respectively The distribution of treesquantifies the diversity of tree modules within rRNAs This analysis suggeststhat rRNAs possess at least 210 tree modules

1396 Nucleic Acids Research 2005 Vol 33 No 4

2 GanHH PasqualiS and SchlickT (2003) Exploring the repertoire ofRNA secondary motifs using graph theory implications for RNA designNucleic Acids Res 31 2926ndash2943

3 SoukupGA and BreakerRR (1999) Engineering precision RNAmolecular switches Proc Natl Acad Sci USA 96 3584ndash3589

4 BreakerRR (2002) Engineered allosteric ribozymes as biosensorcomponents Curr Opin Biotechnol 13 31ndash39

5 IkawaY FukadaK WatanabeS ShiraishiH and InoueT (2002)Design construction and analysis of a novel class of self-foldingRNA Structure 10 527ndash534

6 KimN ShiffeldrimN GanHH and SchlickT (2004) Candidatesfor novel RNA topologies J Mol Biol 341 1129ndash1144

7 MitchellJR ChengJ and CollinsK (1999) A box HACAsmall nucleolar RNA-like domain at the human telomerase RNA 30

end Mol Cell Biol 19 567ndash5768 DoudnaJA and CechTR (2002) The chemical repertoire

of natural ribozymes Nature 418 222ndash2289 StorzG (2002) An expanding universe of noncoding RNAs Science

296 1260ndash126310 BreakerRR and JoyceGF (1994) Inventing and improving ribozyme

function rational design versus iterative selection methods TrendsBiotechnol 12 268ndash275

11 AndronescuM FejesAP HutterF HoosHH and CondonA (2004)A new algorithm for RNA secondary structure design J Mol Biol336 607ndash624

12 CarterRJ DubchakI and HolbrookSR (2001) A computationalapproach to identify genes for functional RNAs in genomic sequencesNucleic Acids Res 29 3928ndash3938

13 DoudnaJA (2000) Structural genomics of RNA Nature Struct Biol7 954ndash956

14 RivasE KleinRJ JonesTA and EddySR (2001) Computationalidentification of noncoding RNAs in Ecoli by comparativegenomics Curr Biol 11 1369ndash1373

15 HolmL and SanderC (1996) Mapping the protein universeScience 273 595ndash602

16 ShindyalovIN and BournePE (2001) A database and tools for 3-Dprotein structure comparison and alignment using the CombinatorialExtension (CE) algorithm Nucleic Acids Res 29 228ndash229

17 BenedettiG and MorosettiS (1996) A graph-topological approach torecognition of pattern and similarity in RNA secondary structuresBiophys Chem 59 179ndash184

18 MargalitH ShapiroBA OppenheimAB and MaizelJVJr (1989)Detection of common motifs in RNA secondary structures Nucleic AcidsRes 17 4829ndash4845

19 ShapiroBA and ZhangKZ (1990) Comparing multiple RNAsecondary structures using tree comparisons Comput Appl Biosci6 309ndash318

20 LeSY NussinovR and MaizelJV (1989) Tree graphs of RNAsecondary structures and their comparisons Comput Biomed Res22 461ndash473

21 DuarteCM WadleyLM and PyleAM (2003) RNA structurecomparison motif search and discovery using a reduced representation ofRNA conformational space Nucleic Acids Res 31 4755ndash4761

22 ChevaletC and MichotB (1992) An algorithm for comparing RNAsecondary structures and searching for similar substructures ComputAppl Biosci 8 215ndash225

23 WangL and ZhaoJ (2003) Parametric alignment of ordered treesBioinformatics 19 2237ndash2245

24 FontanaW KoningsDAM StadlerPF and SchusterP (1993)Statistics of RNA secondary structures Biopolymers 33 1389ndash1404

25 van BatenburgFHD GultyaevAP PleijCWA NgJ andOliehoekJ (2000) PseudoBase a database with RNA pseudoknotsNucleic Acids Res 28 201ndash204

26 GanHH FeraD ZornJ ShiffeldrimN TangM LasersonUKimN and SchlickT (2004) RAG RNA-As-Graphsdatabasemdashconcepts analysis and features Bioinformatics20 1285ndash1291

27 FeraD KimN ShiffeldrimN ZornJ LasersonU GanHH andSchlickT (2004) RAG RNA-As-Graphs web resource BMCBioinformatics 5 88

28 ChenXY KangHS ShenLX ChamorroM VarmusHE andTinocoI (1996) A characteristic bent conformation of RNA pseudoknotspromotes 1 frameshifting during translation of retroviral RNAJ Mol Biol 260 479ndash483

29 KangHS HinesJV and TinocoI (1996) Conformation of anon-frameshifting RNA pseudoknot from mouse mammary tumor virusJ Mol Biol 259 135ndash147

30 GanHH PerlowRA RoyS KoJ WuM HuangJ YanSNicolettaA VafaiJ SunD et al (2002) Analysis of protein sequencestructure similarity relationships Biophys J 83 2781ndash2791

31 FeldenB FlorentzC GiegeR and WesthofE (1994) Solutionstructure of the 30-end of brome mosaic-virus genomic RNAsmdashconformational mimicry with canonical transfer-RNAs J Mol Biol235 508ndash531

32 GrossJL and YellenJ (1999) Graph Theory and its Applications CRCPress Boca Raton FL

33 KoblerJ SchoningU and ToraanJ (1993) The Graph IsomorphismProblem Birkhauser Boston MA

34 RandicM and ZupanJ (2001) On interpretation of well-knowntopological indices J Chem Inform Comput Sci 41 550ndash560

35 ZornJ GanHH ShiffeldrimN and SchlickT (2004) Structuralmotifs in ribosomal RNAs implications for RNA design and genomicsBiopolymers 73 340ndash347

36 JacksT MadhaniHD MasiarzFR and VarmusHE (1988) Signalsfor ribosomal frameshifting in the Rous-Sarcoma virus GagndashPolregion Cell 55 447ndash458

37 GodenyEK ChenL KumarSN MethvenSL KooninEV andBrintonMA (1993) Complete genomic sequence and phylogeneticanalysis of the lactate dehydrogenase-elevating virus (Ldv) Virology194 585ndash596

38 MarczinkeB FisherR VidakovicM BloysAJ and BrierleyI (1998)Secondary structure and mutational analysis of the ribosomalframeshift signal of Rous sarcoma virus J Mol Biol 284 205ndash225

39 MyersEW and MillerW (1988) Optimal alignments in linear spaceComput Appl Biosci 4 11ndash17

40 HuangXQ and MillerW (1991) A time-efficient linear-space localsimilarity algorithm Adv Appl Math 12 337ndash357

41 SmithTF and WatermanMS (1981) Identification of commonmolecular subsequences J Mol Biol 147 195ndash197

42 SongSI SilverSL AulikMA RasochovaL MohanBR andMillerWA (1999) Satellite cereal yellow dwarf virus-RPV (satRPV)RNA requires a DouXble hammerhead for self-cleavage and analternative structure for replication J Mol Biol 293 781ndash793

43 DreherTW and HallTC (1988) Mutational analysis of thetransfer-RNA mimicry of brome mosaic-virus RNAmdashsequence andstructural requirements for aminoacylation and 30-adenylationJ Mol Biol 201 41ndash55

44 KanamoriY and NakashimaN (2001) A tertiary structure model of theinternal ribosome entry site (IRES) for methionine-independent initiationof translation RNA 7 266ndash274

45 WuytsJ De RijkP Van de PeerY PisonG RousseeuwP andDe WachterR (2000)Comparativeanalysisof more than 3000 sequencesreveals the existence of two pseudoknots in area V4 of eukaryotic smallsubunit ribosomal RNA Nucleic Acids Res 28 4698ndash4708

46 Van de PeerY De RijkP WuytsJ WinkelmansT and De WachterR(2000) The European small subunit ribosomal RNA databaseNucleic Acids Res 28 175ndash176

47 ChappellSA EdelmanGM and MauroVP (2000) A 9-nt segment ofa cellular mRNA can function as an internal ribosome entry site (IRES)and when present in linked multiple copies greatly enhances IRESactivity Proc Natl Acad Sci USA 97 1536ndash1541

48 ScheperGC VoormaHO and ThomasAA (1994) Basepairing with18S ribosomal RNA in internal initiation of translation FEBS Lett352 271ndash275

49 BarendsS Rudinger-ThirionJ FlorentzC GiegeR PleijCW andKraalB (2004) tRNA-like structure regulates translation of Bromemosaic virus RNA J Virol 78 4003ndash4010

50 CarterAP ClemonsWM BrodersenDE Morgan-WarrenRJWimberlyBT and RamakrishnanV (2000) Functional insights fromthe structure of the 30S ribosomal subunit and its interactions withantibiotics Nature 407 340ndash348

51 HarmsJ SchluenzenF ZarivachR BashanA GatS AgmonIBartelsH FranceschiF and YonathA (2001) High resolution structureof the large ribosomal subunit from a mesophilic Eubacterium Cell 107679ndash688

52 BanN NissenP HansenJ MoorePB and SteitzTA (2000) Thecomplete atomic structure of the large ribosomal subunit at 24 angstromresolution Science 289 905ndash920

Nucleic Acids Research 2005 Vol 33 No 4 1397

53 CannoneJJ SubramanianS SchnareMN CollettJRDrsquoSouzaLM DuYS FengB LinN MadabusiLV MullerKMet al (2002) The Comparative RNA Web (CRW) Site an online databaseof comparative sequence and structure information for ribosomal intronand other RNAs BMC Bioinformatics 3 2

54 HararyF and PrinsG (1959) The number of homeomorphicallyirreducible trees and other species Acta Math 101141ndash162

55 HararyF (1969) Graph Theory Addison-Wesley Reading Mass

56 VasquezM and ScheragaHA (1985) Use of buildup andenergy-minimization procedures to compute low-energy structures ofthe backbone of enkephalin Biopolymers 24 1437ndash1447

57 ZukerM MathewsDH and TurnerDH (1999) Algorithms andthermodynamics for RNA secondary structure prediction A practicalguide In BarciszewskiJ and ClarkBFC (eds) RNA Biochemistry andBiotechnology Kluwer Academic Publishers Dordrecht pp 11ndash43

58 HofackerIL (2003) Vienna RNA secondary structure serverNucleic Acids Res 31 3429ndash3431

1398 Nucleic Acids Research 2005 Vol 33 No 4

Page 9: doi:10.1093/nar/gki267 Modular RNA architecture revealed ... · Modular RNA architecture revealed by computational analysis of existing pseudoknots and ... Samuela Pasquali, ... RNAs

Fig

ure

5

Th

ree

no

vel

fun

ctio

nal

lyre

late

dp

seu

do

kn

otp

airs

infe

rred

by

top

olo

gy

mat

chin

gan

dse

qu

ence

alig

nm

ent

Up

per

pan

els

Sec

ond

ary

stru

ctu

res

and

thei

ral

ign

edn

ucl

eoti

des

(co

lor

cod

ed)

for

pse

udo

kn

ot

pai

rsP

KB

13

4P

KB

17

3(P

air

5T

able

1

left

pan

el)

PK

B2

05

PK

B2

23

(Pai

r1

0T

able

1

mid

dle

pan

el)

and

PK

B1

34

PK

B20

5(P

air

6

rig

ht

pan

el)

PK

B1

34

occ

urs

in30 e

nd

of

bro

me

mosa

icv

iru

sg

eno

mic

RN

A

PK

B1

73

isa

pse

udo

kn

oto

fsa

tell

ite

cere

aly

ello

wd

war

ftv

iru

s-R

PV

RN

AP

KB

22

3is

the

pse

ud

ok

no

tofin

tern

alen

try

site

(IR

ES

)o

fca

spid

-pro

du

cin

gC

rPV

san

dP

KB

20

5is

the

pse

udo

kn

ots

ub

stru

ctu

re(V

4d

om

ain

)o

fS

SU

of

PP

alm

ata

18

SrR

NA

L

ow

erp

anel

sT

he

alig

ned

sub

gra

ph

so

rre

gio

ns

are

shad

edo

rm

ark

ed

1392 Nucleic Acids Research 2005 Vol 33 No 4

Intriguingly for CrPVs the complementary segments occur inpseudoknots with similar folds not involving the dangling30 end of 18S rRNA Base pair complementary between IRESelements of mRNA and 18S rRNA is a mechanism for ribo-some recruitment and translation In viral IRES elements thesecondary and tertiary structures are believed to play criticalroles in translation initiation (44)

Functional relationship between PKB205 and PKB134(Pair 6 in Table 1)

The functional relationsip of the pair PKB205PKB134 issimilar to that for PKB205PKB223 above PKB134 belongsto a common viral tRNA-like functional class whose role is torecruit host ribosomes to initiate viral protein synthesis (49)It is known to interact with the ribosome via tRNA shapemimicry (31) Our analysis indicates that tRNA-like mole-cules also contain a region which is nearly complementary toPKB205 a pseudoknot region of the 18S rRNA Table 2shows regions of PKB205 complementary to four tRNA-like molecules (PKB134 191 138 17) the complementaryregion in PKB223 is shown for comparison Significantlyeach of the pseudoknots PKB223 PKB134 and PKB191has a region complementary to the same 8ndash9 nt unpairedregion in PKB205 (767ndash776) This single-stranded regionis most probably available for base pairing with incomingtRNA-like molecules In addition two other regions inPKB205 are complementary to tRNA-like pseudoknotsPKB138 and PKB17 one region is a hairpin loop and anotherregion is part of short stems Thus this analysis suggests thattRNA-like molecules may employ both structural mimicry andsequence complementarity to recruit SSU rRNA an interac-tion mode similar to tRNA PKB134 also has a region UACA-GUGUUGAAAAACA (nt 2078ndash2094) closely matched tothe region UAGAGUGUUCAAAGCCA (nt 732ndash748) inPKB205 with an average R-value of 123

Tertiary structure similarity of topological matches inribosomal RNAs

The preceding examination of three pseudoknot pairs showsthat topological matches can suggest functionally relatedpseudoknots It is also important to show that novel topo-logical matches led to similar tertiary structures We focusfor this purpose on rRNA since several bacterial and archaealribosome structures have been solved in recent years In factin our previous work we have found several topologicalmatches of the Scerevisiae 5S rRNA motif within its largerrRNAs two within 16S rRNA (domains II and III) and threewithin 23S rRNA (domains III IV and VI) (2) From availableX-ray structures for Thermus thermophilus 16S rRNA (PDBaccession no 1fjg) (50) and 23S rRNA from Deinococcus

radiodurans (1nkw) and Haloarcula marismortui (1ffk)(5152) only 5S rRNA motifrsquos matches in domain II of16S rRNA and domain III of 23S rRNA are relevant theother three cases are not valid matches in these species

Figure 6 displays the secondary diagrams graphical repres-entations and tertiary structures of 5S rRNA for bacteriumDradiodurans (1nkw) and archaea Hmarismortui (1ffk)and their topologically matched modules within 16S and23S rRNAs the secondary diagrams are adapted from thosein Gutellrsquos comparative RNA website (httpwwwrnaicmbutexasedu) (53) The module in 23S is an identicalmatch for the 6-vertex 5S graph except for a hairpin loopwhich does not contribute to the topological invariant SThe 16S module is a 5-vertex graph because a 2 bp stem atthe junction has a non-canonical base pair AG (dual graph ruleD1 defines a stem as having at least two canonical andorwobble base pairs) for Scerevisiae this stem consists ofonly canonical base pairs making the 16S module an identicalmatch for the 5S topology As shown in Figure 6 the 16S and23S modules have a similar L-shaped tertiary fold as the 5Sstructure Tertiary structure similarity with 5S is more strikingfor the module of domain II (nt 653ndash753) of Tthermophilus16S rRNA than for the module of domain III (nt 1033ndash1143) of23S rRNA Both the 5S structure and 16S module have apronounced 3-stem junction where the chain ends of the16S module are located Structural (backbone) overlap ofthe 5S structure and 16S module shows similar structuralfeatures except for the orientations of the smaller stems atthe junction (We aligned the structures manually because thechain ends of 5S and 16S module occur at non-correspondinglocations) In contrast the 23S module shows significant dis-tortions on the shorter helical arm with the hairpin loop foldedback to the helix as shown in the structural overlap of the 5Sstructure and 23S module Structural similarities seen in theseexamples are probably due to geometric constraints which areapparent in their secondary diagrams and graphical representa-tions Thus 2D topological matches can lead to similar 3Dstructures Below we survey the distribution of modules inrRNAs

Distribution of submotifs in 16S and 23S ribosomalRNAs

Characterizing the submotifs within the 16S and 23S rRNAsthe largest known RNA molecules can help uncover themodular construction of RNA structures as analysis ofrRNArsquos tertiary modules has revealed Furthermore our pre-vious analysis of rRNArsquos structural motifs indicated existenceof characteristic features (35) For example the distributionsof pairedunpaired bases in stems bulgesinternal loops hair-pin loops and junctions follow specific functional forms

Table 2 Short sequence regions in PKB205 complementary to tRNA-like molecules

PKB205 regions Complementary sequences

30-GGUACGAUU-50 (768ndash776) PKB223 50-UUAGGUUAG-30 (6100ndash6108) IRES30-GUGAGAUU-50 (769ndash776) PKB134 50-CACUGUAAA-30 (113ndash120) tRNA-like30-UACGAUUAC-50 (767ndash775) PKB191 50-AUGCUCAUG-30 (77ndash85) tRNA-like30-UGUGAGAUU-50 (731ndash739) PKB138 50-ACACUUUAA-30 (38ndash47) tRNA-like30-GUUUGCGGACC-50 (746ndash756) PKB17 50-CAAAACCCUGG-30 (19ndash29) tRNA-like

Nucleic Acids Research 2005 Vol 33 No 4 1393

Figure 6 Comparison of the graph topologies and 2D and 3D structures of 5S rRNAs (A and B) and modules in 16S and 23S rRNAs (C and D) (C and D) Twostructural (backbone) overlaps of the 5S structure (1ffk gold color) with 16S module (1fjg blue) and 5S structure with 23S module (1nkw green) are shown Thesecondary diagrams are adapted from those in Gutellrsquos comparative RNA website (httpwwwrnaicmbutexasedu) and 3D structures are from the Protein DataBank (httpwwwrcsborgpdb) The 2D modules were identified computationally by our earlier work on finding120 nt 5S rRNA motifs in 16S and 23S rRNAs ofScerevisiae(2) The equivalent tertiary structures and modules for 5S 16S and 23S rRNAs from other organisms show that similar graph topologies can lead tosimilar secondary and tertiary structures All structures have similar graph topologies The Tthermophilus 16S module is a 5-vertex instead of 6-vertex graph becausea 2 bp stem at the junction has a non-canonical base pair AG for Scerevisiae this stem consists of only canonical base pairs making the 16S module an identicalmatch for the 5S topology

1394 Nucleic Acids Research 2005 Vol 33 No 4

Moreover the percentages of nucleotides in stems bulgesinternal loops hairpin loops and junctions do not vary for16S and 23S rRNAs even though the latter is twice thesize of the former Such motif characteristics help discriminateRNA-like from non-RNA-like molecules and guide the designof novel RNAs Here we analyze the distinct modular unitsmdashsmall tree submotifsmdashthat makeup the rRNAs to advance suchapplications Since all small tree graphs up to 10 vertices havebeen exhaustively enumerated (5455) we can use our sub-structure search algorithm to identify the occurrences of alldistinct tree submotifs in rRNAs

Specifically we use sets of tree graphs for V = 4 5 6 7 8 9and 10 comprising 2 3 6 11 23 47 and 106 graphs respect-ively for a total of 198 tree motifs The complete enumeratedmotifs can be found in the RAG web resource (httpmonodbiomathnyuedurna) Figure 7 shows as an example the11 possible 7-vertex tree graphs and lists the frequency ofoccurrence of each tree graph in rRNAs Remarkably thesame tree motifs are abundant in both 16S and 23S rRNAsThe most abundant (10 or 15 times) 7-vertex tree has two3-stem junctions (topological invariant Stree = 6160) 4- and5-stem junctions are the next most prevalent submotifs(Stree = 5137 5921 and 5667) Interestingly the unbranchedtree (Stree = 7219) and highly branched trees (Stree = 4509)are rare

Figure 8 plots the percentage of distinct tree graphs in eachset V (fraction of motifs identified relative to possible motifs)

found in rRNA For small tree graphs V = 4 5 and 6 allpossible topologies are present but not for V gt 6 In fact thepercentage of trees found declines with V In 23S rRNA90 of possible trees are found for V = 7 and only40 (representing 47 tree topologies) are found forV = 10 Larger trees (V gt 10) are expected to have muchsmaller percentages This percentage decline is expectedsince small tree modules are less specific than large treesIn total 111 submotifs out of 198 are identified The observedtrends in Figure 8 may be generally valid for rRNA structuresof other organisms since they are highly conserved althoughsome statistical fluctuations are expected due to structuraldifferences and RNArsquos finite size

These data show that RNA structures are modular constructsof relatively few (tree) building blocks Extrapolation of thecurves in Figure 8 suggests that 20 of V = 11 motifs (totalof 235 trees) and 10 of V = 12 motifs (total of 551 trees)probably exist in rRNAs Thus the total number of buildingblocks in rRNAs with V lt 13 would be 210 Although thepercentages of trees with higher V values are low they aremore numerous and will contribute to the overall number ofbuilding blocks We thus consider 210 as the minimum num-ber of building blocks The building block number may beexploited in the future design of novel functional RNAmolecules using modular assembly from 210 distinct treemotifs The tree motifs may be taken from substructures ofexisting RNAs allowing simultaneous sequence and structure

treeS =

16S 23SOccurrence

ID

6 4

)

0 1

(7 8)

6191

(7 11)

4509

0 1

(7 9)

5385

3 3

7

(7 10)

5137

5 5

(7 6)

5916

5

(7 5)

5921

(7 1)

0 0

7219

(7 3)

2 4

6446

4 4

(7 7)

5667

10 15

(7 4

6160

0 3

6692

(7 2)

Figure 7 Occurrence frequency of 11 distinct 7-vertex tree topologies as submotifs of Scerevisiae 16S and 23S rRNAs The tree graphs are labeled using topologicalinvariant Stree (second top row) and ID number (i j) (first row) which refers to motif identification number as used in RAG database (httpmonodbiomathnyuedurna) The two numbers at the bottom of each graph refer to the number of times that topology is found embedded in 16S and in 23S rRNAs respectively

Nucleic Acids Research 2005 Vol 33 No 4 1395

design Such a build-up approach appears feasible from our(26) and other works (35) and is akin to the lsquobuilduprsquo tech-nique for proteins pioneered by Vasquez and Scheraga (56)

The above estimate of building block number can be eithera consequence of the limited number of available structures orthe fact that RNA adopts only a subset of possible graphsCertainly future discoveries of novel RNA structures mightindicate a larger repertoire of building blocks Still our sur-veys of natural RNAs (26) suggest that RNAs possess specifictopological characteristics and favor a subset of the possiblegraph motifs The missing motifs may not exist in nature forRNA owing to physical or functional constraints

The building block number can also be defined in terms ofirreducible graphs ie trees that cannot be decomposed intosmaller trees However this approach has a major drawbackSince all RNAs are composed of connected stems their sub-structures can be reduced to two vertex graphs which is not aninformative description of complex RNA folds Still if wedefine 4-vertex trees as the smallest irreducible graphs thenumber of irreducible graphs can be calculated Based on oursubmotif data we estimate that for V up to 10 there are about42 irreducible tree modules in rRNAs

SUMMARY AND CONCLUSION

Our automated comparison of RNA secondary structuresbased on graphical representation allows the examination ofconserved 2D RNA topological features which in turn suggestfunctional relationships This approach has an advantage oversequence alignment because functionally related RNAs oftenlack sequence similarity as shown for frameshifting pseudo-knots with low global sequence identities Of course ourstructure alignment approach relies on available 2D RNA

structures and thus will increase in scope as more RNAstructures become available Our eight identified pairs ofpotentially related pseudoknots (Table 1) reveal three pairsthat are functionally related the pseudoknots of 30 end ofbrome mosaic virus genomic RNA (PKB134) and alternativehammerhead ribozyme pseudoknot (PKB173) are templatesfor replicase to initiate replication of viral genomes the pseu-doknots of internal entry site (IRES) of caspid-producingCrPVs (PKB223) and SSU of 18S rRNA (PKB205) are asso-ciated with translation initiation and the pseudoknots of 18SrRNA (PKB205) and viral tRNA-like (PKB134) most prob-ably interact via both structural mimicry and base comple-mentarity To the best of our knowledge these functionalrelationships have not been described previously

The developed RNA comparison algorithm also led us toquantify the modularity of 16S and 23S rRNAs using smallenumerated tree motifs Our results suggest that rRNAs areconstructed from at least 210 distinct tree subtopologies(or modules) having up to 180 nt Significantly we findthat 5S rRNA and two tree modules within 16S and 23SrRNAs with similar topologies have similar tertiary shapesaffirming the relationship between 2D graphical representa-tion and 3D structure The tree modules in rRNA may be usedas building blocks for the design of novel RNA folds A similaridea has been employed to design and improve functionalRNA molecules (510) Design by assembly of modulesexploits motifs present in natural RNAs unlike de novo designof sequences and structures (11) The issues of discoveringnovel topologies (6) building module library and designingfunctional folds are likely to increase in importance withinterest in novel RNAs

Our algorithm has several limitations Of course its utilitydepends on availability of reliable RNA secondary foldingalgorithms (5758) More fundamentally although simplifiedgraphical representations specify topological characteristicsdetailed sequence and motif information is missing An asso-ciated problem is that unusual RNA structurendashfunction rela-tionships due to deleterious mutations as in frameshiftingpseudoknots (28) and structural flexibility are not capturedMore advanced graph constructs such as weighted graphs mayhowever specify detailed motif features at the secondary andbase pair levels Another limitation of the present algorithmis the requirement to fix the probe structure or graph As inprotein structure alignment algorithms (15) future refinementsof our algorithm should allow some flexibility in motifdefinitions as the similarity search proceeds We inviteRNA scientists to visit our RAG web resource at httpmonodbiomathnyuedurna and suggest to us specific enhancementsthat will be useful in practice

ACKNOWLEDGEMENTS

We thank Dinshaw Patel for comments that motivated thiswork This research was supported by a Joint NSFNIGMSInitiative in Mathematical Biology (DMS-0201160) and theHuman Frontier Science Program (HFSP)

REFERENCES

1 TinocoI and BustamanteC (1999) How RNA folds J Mol Biol293 271ndash281

Figure 8 Distribution of all possible distinct tree modules with V vertices foundin Scerevisiae 16S and 23S rRNAs In graph theory each V is represented by aset of possible tree structures For example for V= 4 5 6 7 8 9 and 10 there are2 3 6 11 23 47 and 106 possible trees respectively The distribution of treesquantifies the diversity of tree modules within rRNAs This analysis suggeststhat rRNAs possess at least 210 tree modules

1396 Nucleic Acids Research 2005 Vol 33 No 4

2 GanHH PasqualiS and SchlickT (2003) Exploring the repertoire ofRNA secondary motifs using graph theory implications for RNA designNucleic Acids Res 31 2926ndash2943

3 SoukupGA and BreakerRR (1999) Engineering precision RNAmolecular switches Proc Natl Acad Sci USA 96 3584ndash3589

4 BreakerRR (2002) Engineered allosteric ribozymes as biosensorcomponents Curr Opin Biotechnol 13 31ndash39

5 IkawaY FukadaK WatanabeS ShiraishiH and InoueT (2002)Design construction and analysis of a novel class of self-foldingRNA Structure 10 527ndash534

6 KimN ShiffeldrimN GanHH and SchlickT (2004) Candidatesfor novel RNA topologies J Mol Biol 341 1129ndash1144

7 MitchellJR ChengJ and CollinsK (1999) A box HACAsmall nucleolar RNA-like domain at the human telomerase RNA 30

end Mol Cell Biol 19 567ndash5768 DoudnaJA and CechTR (2002) The chemical repertoire

of natural ribozymes Nature 418 222ndash2289 StorzG (2002) An expanding universe of noncoding RNAs Science

296 1260ndash126310 BreakerRR and JoyceGF (1994) Inventing and improving ribozyme

function rational design versus iterative selection methods TrendsBiotechnol 12 268ndash275

11 AndronescuM FejesAP HutterF HoosHH and CondonA (2004)A new algorithm for RNA secondary structure design J Mol Biol336 607ndash624

12 CarterRJ DubchakI and HolbrookSR (2001) A computationalapproach to identify genes for functional RNAs in genomic sequencesNucleic Acids Res 29 3928ndash3938

13 DoudnaJA (2000) Structural genomics of RNA Nature Struct Biol7 954ndash956

14 RivasE KleinRJ JonesTA and EddySR (2001) Computationalidentification of noncoding RNAs in Ecoli by comparativegenomics Curr Biol 11 1369ndash1373

15 HolmL and SanderC (1996) Mapping the protein universeScience 273 595ndash602

16 ShindyalovIN and BournePE (2001) A database and tools for 3-Dprotein structure comparison and alignment using the CombinatorialExtension (CE) algorithm Nucleic Acids Res 29 228ndash229

17 BenedettiG and MorosettiS (1996) A graph-topological approach torecognition of pattern and similarity in RNA secondary structuresBiophys Chem 59 179ndash184

18 MargalitH ShapiroBA OppenheimAB and MaizelJVJr (1989)Detection of common motifs in RNA secondary structures Nucleic AcidsRes 17 4829ndash4845

19 ShapiroBA and ZhangKZ (1990) Comparing multiple RNAsecondary structures using tree comparisons Comput Appl Biosci6 309ndash318

20 LeSY NussinovR and MaizelJV (1989) Tree graphs of RNAsecondary structures and their comparisons Comput Biomed Res22 461ndash473

21 DuarteCM WadleyLM and PyleAM (2003) RNA structurecomparison motif search and discovery using a reduced representation ofRNA conformational space Nucleic Acids Res 31 4755ndash4761

22 ChevaletC and MichotB (1992) An algorithm for comparing RNAsecondary structures and searching for similar substructures ComputAppl Biosci 8 215ndash225

23 WangL and ZhaoJ (2003) Parametric alignment of ordered treesBioinformatics 19 2237ndash2245

24 FontanaW KoningsDAM StadlerPF and SchusterP (1993)Statistics of RNA secondary structures Biopolymers 33 1389ndash1404

25 van BatenburgFHD GultyaevAP PleijCWA NgJ andOliehoekJ (2000) PseudoBase a database with RNA pseudoknotsNucleic Acids Res 28 201ndash204

26 GanHH FeraD ZornJ ShiffeldrimN TangM LasersonUKimN and SchlickT (2004) RAG RNA-As-Graphsdatabasemdashconcepts analysis and features Bioinformatics20 1285ndash1291

27 FeraD KimN ShiffeldrimN ZornJ LasersonU GanHH andSchlickT (2004) RAG RNA-As-Graphs web resource BMCBioinformatics 5 88

28 ChenXY KangHS ShenLX ChamorroM VarmusHE andTinocoI (1996) A characteristic bent conformation of RNA pseudoknotspromotes 1 frameshifting during translation of retroviral RNAJ Mol Biol 260 479ndash483

29 KangHS HinesJV and TinocoI (1996) Conformation of anon-frameshifting RNA pseudoknot from mouse mammary tumor virusJ Mol Biol 259 135ndash147

30 GanHH PerlowRA RoyS KoJ WuM HuangJ YanSNicolettaA VafaiJ SunD et al (2002) Analysis of protein sequencestructure similarity relationships Biophys J 83 2781ndash2791

31 FeldenB FlorentzC GiegeR and WesthofE (1994) Solutionstructure of the 30-end of brome mosaic-virus genomic RNAsmdashconformational mimicry with canonical transfer-RNAs J Mol Biol235 508ndash531

32 GrossJL and YellenJ (1999) Graph Theory and its Applications CRCPress Boca Raton FL

33 KoblerJ SchoningU and ToraanJ (1993) The Graph IsomorphismProblem Birkhauser Boston MA

34 RandicM and ZupanJ (2001) On interpretation of well-knowntopological indices J Chem Inform Comput Sci 41 550ndash560

35 ZornJ GanHH ShiffeldrimN and SchlickT (2004) Structuralmotifs in ribosomal RNAs implications for RNA design and genomicsBiopolymers 73 340ndash347

36 JacksT MadhaniHD MasiarzFR and VarmusHE (1988) Signalsfor ribosomal frameshifting in the Rous-Sarcoma virus GagndashPolregion Cell 55 447ndash458

37 GodenyEK ChenL KumarSN MethvenSL KooninEV andBrintonMA (1993) Complete genomic sequence and phylogeneticanalysis of the lactate dehydrogenase-elevating virus (Ldv) Virology194 585ndash596

38 MarczinkeB FisherR VidakovicM BloysAJ and BrierleyI (1998)Secondary structure and mutational analysis of the ribosomalframeshift signal of Rous sarcoma virus J Mol Biol 284 205ndash225

39 MyersEW and MillerW (1988) Optimal alignments in linear spaceComput Appl Biosci 4 11ndash17

40 HuangXQ and MillerW (1991) A time-efficient linear-space localsimilarity algorithm Adv Appl Math 12 337ndash357

41 SmithTF and WatermanMS (1981) Identification of commonmolecular subsequences J Mol Biol 147 195ndash197

42 SongSI SilverSL AulikMA RasochovaL MohanBR andMillerWA (1999) Satellite cereal yellow dwarf virus-RPV (satRPV)RNA requires a DouXble hammerhead for self-cleavage and analternative structure for replication J Mol Biol 293 781ndash793

43 DreherTW and HallTC (1988) Mutational analysis of thetransfer-RNA mimicry of brome mosaic-virus RNAmdashsequence andstructural requirements for aminoacylation and 30-adenylationJ Mol Biol 201 41ndash55

44 KanamoriY and NakashimaN (2001) A tertiary structure model of theinternal ribosome entry site (IRES) for methionine-independent initiationof translation RNA 7 266ndash274

45 WuytsJ De RijkP Van de PeerY PisonG RousseeuwP andDe WachterR (2000)Comparativeanalysisof more than 3000 sequencesreveals the existence of two pseudoknots in area V4 of eukaryotic smallsubunit ribosomal RNA Nucleic Acids Res 28 4698ndash4708

46 Van de PeerY De RijkP WuytsJ WinkelmansT and De WachterR(2000) The European small subunit ribosomal RNA databaseNucleic Acids Res 28 175ndash176

47 ChappellSA EdelmanGM and MauroVP (2000) A 9-nt segment ofa cellular mRNA can function as an internal ribosome entry site (IRES)and when present in linked multiple copies greatly enhances IRESactivity Proc Natl Acad Sci USA 97 1536ndash1541

48 ScheperGC VoormaHO and ThomasAA (1994) Basepairing with18S ribosomal RNA in internal initiation of translation FEBS Lett352 271ndash275

49 BarendsS Rudinger-ThirionJ FlorentzC GiegeR PleijCW andKraalB (2004) tRNA-like structure regulates translation of Bromemosaic virus RNA J Virol 78 4003ndash4010

50 CarterAP ClemonsWM BrodersenDE Morgan-WarrenRJWimberlyBT and RamakrishnanV (2000) Functional insights fromthe structure of the 30S ribosomal subunit and its interactions withantibiotics Nature 407 340ndash348

51 HarmsJ SchluenzenF ZarivachR BashanA GatS AgmonIBartelsH FranceschiF and YonathA (2001) High resolution structureof the large ribosomal subunit from a mesophilic Eubacterium Cell 107679ndash688

52 BanN NissenP HansenJ MoorePB and SteitzTA (2000) Thecomplete atomic structure of the large ribosomal subunit at 24 angstromresolution Science 289 905ndash920

Nucleic Acids Research 2005 Vol 33 No 4 1397

53 CannoneJJ SubramanianS SchnareMN CollettJRDrsquoSouzaLM DuYS FengB LinN MadabusiLV MullerKMet al (2002) The Comparative RNA Web (CRW) Site an online databaseof comparative sequence and structure information for ribosomal intronand other RNAs BMC Bioinformatics 3 2

54 HararyF and PrinsG (1959) The number of homeomorphicallyirreducible trees and other species Acta Math 101141ndash162

55 HararyF (1969) Graph Theory Addison-Wesley Reading Mass

56 VasquezM and ScheragaHA (1985) Use of buildup andenergy-minimization procedures to compute low-energy structures ofthe backbone of enkephalin Biopolymers 24 1437ndash1447

57 ZukerM MathewsDH and TurnerDH (1999) Algorithms andthermodynamics for RNA secondary structure prediction A practicalguide In BarciszewskiJ and ClarkBFC (eds) RNA Biochemistry andBiotechnology Kluwer Academic Publishers Dordrecht pp 11ndash43

58 HofackerIL (2003) Vienna RNA secondary structure serverNucleic Acids Res 31 3429ndash3431

1398 Nucleic Acids Research 2005 Vol 33 No 4

Page 10: doi:10.1093/nar/gki267 Modular RNA architecture revealed ... · Modular RNA architecture revealed by computational analysis of existing pseudoknots and ... Samuela Pasquali, ... RNAs

Intriguingly for CrPVs the complementary segments occur inpseudoknots with similar folds not involving the dangling30 end of 18S rRNA Base pair complementary between IRESelements of mRNA and 18S rRNA is a mechanism for ribo-some recruitment and translation In viral IRES elements thesecondary and tertiary structures are believed to play criticalroles in translation initiation (44)

Functional relationship between PKB205 and PKB134(Pair 6 in Table 1)

The functional relationsip of the pair PKB205PKB134 issimilar to that for PKB205PKB223 above PKB134 belongsto a common viral tRNA-like functional class whose role is torecruit host ribosomes to initiate viral protein synthesis (49)It is known to interact with the ribosome via tRNA shapemimicry (31) Our analysis indicates that tRNA-like mole-cules also contain a region which is nearly complementary toPKB205 a pseudoknot region of the 18S rRNA Table 2shows regions of PKB205 complementary to four tRNA-like molecules (PKB134 191 138 17) the complementaryregion in PKB223 is shown for comparison Significantlyeach of the pseudoknots PKB223 PKB134 and PKB191has a region complementary to the same 8ndash9 nt unpairedregion in PKB205 (767ndash776) This single-stranded regionis most probably available for base pairing with incomingtRNA-like molecules In addition two other regions inPKB205 are complementary to tRNA-like pseudoknotsPKB138 and PKB17 one region is a hairpin loop and anotherregion is part of short stems Thus this analysis suggests thattRNA-like molecules may employ both structural mimicry andsequence complementarity to recruit SSU rRNA an interac-tion mode similar to tRNA PKB134 also has a region UACA-GUGUUGAAAAACA (nt 2078ndash2094) closely matched tothe region UAGAGUGUUCAAAGCCA (nt 732ndash748) inPKB205 with an average R-value of 123

Tertiary structure similarity of topological matches inribosomal RNAs

The preceding examination of three pseudoknot pairs showsthat topological matches can suggest functionally relatedpseudoknots It is also important to show that novel topo-logical matches led to similar tertiary structures We focusfor this purpose on rRNA since several bacterial and archaealribosome structures have been solved in recent years In factin our previous work we have found several topologicalmatches of the Scerevisiae 5S rRNA motif within its largerrRNAs two within 16S rRNA (domains II and III) and threewithin 23S rRNA (domains III IV and VI) (2) From availableX-ray structures for Thermus thermophilus 16S rRNA (PDBaccession no 1fjg) (50) and 23S rRNA from Deinococcus

radiodurans (1nkw) and Haloarcula marismortui (1ffk)(5152) only 5S rRNA motifrsquos matches in domain II of16S rRNA and domain III of 23S rRNA are relevant theother three cases are not valid matches in these species

Figure 6 displays the secondary diagrams graphical repres-entations and tertiary structures of 5S rRNA for bacteriumDradiodurans (1nkw) and archaea Hmarismortui (1ffk)and their topologically matched modules within 16S and23S rRNAs the secondary diagrams are adapted from thosein Gutellrsquos comparative RNA website (httpwwwrnaicmbutexasedu) (53) The module in 23S is an identicalmatch for the 6-vertex 5S graph except for a hairpin loopwhich does not contribute to the topological invariant SThe 16S module is a 5-vertex graph because a 2 bp stem atthe junction has a non-canonical base pair AG (dual graph ruleD1 defines a stem as having at least two canonical andorwobble base pairs) for Scerevisiae this stem consists ofonly canonical base pairs making the 16S module an identicalmatch for the 5S topology As shown in Figure 6 the 16S and23S modules have a similar L-shaped tertiary fold as the 5Sstructure Tertiary structure similarity with 5S is more strikingfor the module of domain II (nt 653ndash753) of Tthermophilus16S rRNA than for the module of domain III (nt 1033ndash1143) of23S rRNA Both the 5S structure and 16S module have apronounced 3-stem junction where the chain ends of the16S module are located Structural (backbone) overlap ofthe 5S structure and 16S module shows similar structuralfeatures except for the orientations of the smaller stems atthe junction (We aligned the structures manually because thechain ends of 5S and 16S module occur at non-correspondinglocations) In contrast the 23S module shows significant dis-tortions on the shorter helical arm with the hairpin loop foldedback to the helix as shown in the structural overlap of the 5Sstructure and 23S module Structural similarities seen in theseexamples are probably due to geometric constraints which areapparent in their secondary diagrams and graphical representa-tions Thus 2D topological matches can lead to similar 3Dstructures Below we survey the distribution of modules inrRNAs

Distribution of submotifs in 16S and 23S ribosomalRNAs

Characterizing the submotifs within the 16S and 23S rRNAsthe largest known RNA molecules can help uncover themodular construction of RNA structures as analysis ofrRNArsquos tertiary modules has revealed Furthermore our pre-vious analysis of rRNArsquos structural motifs indicated existenceof characteristic features (35) For example the distributionsof pairedunpaired bases in stems bulgesinternal loops hair-pin loops and junctions follow specific functional forms

Table 2 Short sequence regions in PKB205 complementary to tRNA-like molecules

PKB205 regions Complementary sequences

30-GGUACGAUU-50 (768ndash776) PKB223 50-UUAGGUUAG-30 (6100ndash6108) IRES30-GUGAGAUU-50 (769ndash776) PKB134 50-CACUGUAAA-30 (113ndash120) tRNA-like30-UACGAUUAC-50 (767ndash775) PKB191 50-AUGCUCAUG-30 (77ndash85) tRNA-like30-UGUGAGAUU-50 (731ndash739) PKB138 50-ACACUUUAA-30 (38ndash47) tRNA-like30-GUUUGCGGACC-50 (746ndash756) PKB17 50-CAAAACCCUGG-30 (19ndash29) tRNA-like

Nucleic Acids Research 2005 Vol 33 No 4 1393

Figure 6 Comparison of the graph topologies and 2D and 3D structures of 5S rRNAs (A and B) and modules in 16S and 23S rRNAs (C and D) (C and D) Twostructural (backbone) overlaps of the 5S structure (1ffk gold color) with 16S module (1fjg blue) and 5S structure with 23S module (1nkw green) are shown Thesecondary diagrams are adapted from those in Gutellrsquos comparative RNA website (httpwwwrnaicmbutexasedu) and 3D structures are from the Protein DataBank (httpwwwrcsborgpdb) The 2D modules were identified computationally by our earlier work on finding120 nt 5S rRNA motifs in 16S and 23S rRNAs ofScerevisiae(2) The equivalent tertiary structures and modules for 5S 16S and 23S rRNAs from other organisms show that similar graph topologies can lead tosimilar secondary and tertiary structures All structures have similar graph topologies The Tthermophilus 16S module is a 5-vertex instead of 6-vertex graph becausea 2 bp stem at the junction has a non-canonical base pair AG for Scerevisiae this stem consists of only canonical base pairs making the 16S module an identicalmatch for the 5S topology

1394 Nucleic Acids Research 2005 Vol 33 No 4

Moreover the percentages of nucleotides in stems bulgesinternal loops hairpin loops and junctions do not vary for16S and 23S rRNAs even though the latter is twice thesize of the former Such motif characteristics help discriminateRNA-like from non-RNA-like molecules and guide the designof novel RNAs Here we analyze the distinct modular unitsmdashsmall tree submotifsmdashthat makeup the rRNAs to advance suchapplications Since all small tree graphs up to 10 vertices havebeen exhaustively enumerated (5455) we can use our sub-structure search algorithm to identify the occurrences of alldistinct tree submotifs in rRNAs

Specifically we use sets of tree graphs for V = 4 5 6 7 8 9and 10 comprising 2 3 6 11 23 47 and 106 graphs respect-ively for a total of 198 tree motifs The complete enumeratedmotifs can be found in the RAG web resource (httpmonodbiomathnyuedurna) Figure 7 shows as an example the11 possible 7-vertex tree graphs and lists the frequency ofoccurrence of each tree graph in rRNAs Remarkably thesame tree motifs are abundant in both 16S and 23S rRNAsThe most abundant (10 or 15 times) 7-vertex tree has two3-stem junctions (topological invariant Stree = 6160) 4- and5-stem junctions are the next most prevalent submotifs(Stree = 5137 5921 and 5667) Interestingly the unbranchedtree (Stree = 7219) and highly branched trees (Stree = 4509)are rare

Figure 8 plots the percentage of distinct tree graphs in eachset V (fraction of motifs identified relative to possible motifs)

found in rRNA For small tree graphs V = 4 5 and 6 allpossible topologies are present but not for V gt 6 In fact thepercentage of trees found declines with V In 23S rRNA90 of possible trees are found for V = 7 and only40 (representing 47 tree topologies) are found forV = 10 Larger trees (V gt 10) are expected to have muchsmaller percentages This percentage decline is expectedsince small tree modules are less specific than large treesIn total 111 submotifs out of 198 are identified The observedtrends in Figure 8 may be generally valid for rRNA structuresof other organisms since they are highly conserved althoughsome statistical fluctuations are expected due to structuraldifferences and RNArsquos finite size

These data show that RNA structures are modular constructsof relatively few (tree) building blocks Extrapolation of thecurves in Figure 8 suggests that 20 of V = 11 motifs (totalof 235 trees) and 10 of V = 12 motifs (total of 551 trees)probably exist in rRNAs Thus the total number of buildingblocks in rRNAs with V lt 13 would be 210 Although thepercentages of trees with higher V values are low they aremore numerous and will contribute to the overall number ofbuilding blocks We thus consider 210 as the minimum num-ber of building blocks The building block number may beexploited in the future design of novel functional RNAmolecules using modular assembly from 210 distinct treemotifs The tree motifs may be taken from substructures ofexisting RNAs allowing simultaneous sequence and structure

treeS =

16S 23SOccurrence

ID

6 4

)

0 1

(7 8)

6191

(7 11)

4509

0 1

(7 9)

5385

3 3

7

(7 10)

5137

5 5

(7 6)

5916

5

(7 5)

5921

(7 1)

0 0

7219

(7 3)

2 4

6446

4 4

(7 7)

5667

10 15

(7 4

6160

0 3

6692

(7 2)

Figure 7 Occurrence frequency of 11 distinct 7-vertex tree topologies as submotifs of Scerevisiae 16S and 23S rRNAs The tree graphs are labeled using topologicalinvariant Stree (second top row) and ID number (i j) (first row) which refers to motif identification number as used in RAG database (httpmonodbiomathnyuedurna) The two numbers at the bottom of each graph refer to the number of times that topology is found embedded in 16S and in 23S rRNAs respectively

Nucleic Acids Research 2005 Vol 33 No 4 1395

design Such a build-up approach appears feasible from our(26) and other works (35) and is akin to the lsquobuilduprsquo tech-nique for proteins pioneered by Vasquez and Scheraga (56)

The above estimate of building block number can be eithera consequence of the limited number of available structures orthe fact that RNA adopts only a subset of possible graphsCertainly future discoveries of novel RNA structures mightindicate a larger repertoire of building blocks Still our sur-veys of natural RNAs (26) suggest that RNAs possess specifictopological characteristics and favor a subset of the possiblegraph motifs The missing motifs may not exist in nature forRNA owing to physical or functional constraints

The building block number can also be defined in terms ofirreducible graphs ie trees that cannot be decomposed intosmaller trees However this approach has a major drawbackSince all RNAs are composed of connected stems their sub-structures can be reduced to two vertex graphs which is not aninformative description of complex RNA folds Still if wedefine 4-vertex trees as the smallest irreducible graphs thenumber of irreducible graphs can be calculated Based on oursubmotif data we estimate that for V up to 10 there are about42 irreducible tree modules in rRNAs

SUMMARY AND CONCLUSION

Our automated comparison of RNA secondary structuresbased on graphical representation allows the examination ofconserved 2D RNA topological features which in turn suggestfunctional relationships This approach has an advantage oversequence alignment because functionally related RNAs oftenlack sequence similarity as shown for frameshifting pseudo-knots with low global sequence identities Of course ourstructure alignment approach relies on available 2D RNA

structures and thus will increase in scope as more RNAstructures become available Our eight identified pairs ofpotentially related pseudoknots (Table 1) reveal three pairsthat are functionally related the pseudoknots of 30 end ofbrome mosaic virus genomic RNA (PKB134) and alternativehammerhead ribozyme pseudoknot (PKB173) are templatesfor replicase to initiate replication of viral genomes the pseu-doknots of internal entry site (IRES) of caspid-producingCrPVs (PKB223) and SSU of 18S rRNA (PKB205) are asso-ciated with translation initiation and the pseudoknots of 18SrRNA (PKB205) and viral tRNA-like (PKB134) most prob-ably interact via both structural mimicry and base comple-mentarity To the best of our knowledge these functionalrelationships have not been described previously

The developed RNA comparison algorithm also led us toquantify the modularity of 16S and 23S rRNAs using smallenumerated tree motifs Our results suggest that rRNAs areconstructed from at least 210 distinct tree subtopologies(or modules) having up to 180 nt Significantly we findthat 5S rRNA and two tree modules within 16S and 23SrRNAs with similar topologies have similar tertiary shapesaffirming the relationship between 2D graphical representa-tion and 3D structure The tree modules in rRNA may be usedas building blocks for the design of novel RNA folds A similaridea has been employed to design and improve functionalRNA molecules (510) Design by assembly of modulesexploits motifs present in natural RNAs unlike de novo designof sequences and structures (11) The issues of discoveringnovel topologies (6) building module library and designingfunctional folds are likely to increase in importance withinterest in novel RNAs

Our algorithm has several limitations Of course its utilitydepends on availability of reliable RNA secondary foldingalgorithms (5758) More fundamentally although simplifiedgraphical representations specify topological characteristicsdetailed sequence and motif information is missing An asso-ciated problem is that unusual RNA structurendashfunction rela-tionships due to deleterious mutations as in frameshiftingpseudoknots (28) and structural flexibility are not capturedMore advanced graph constructs such as weighted graphs mayhowever specify detailed motif features at the secondary andbase pair levels Another limitation of the present algorithmis the requirement to fix the probe structure or graph As inprotein structure alignment algorithms (15) future refinementsof our algorithm should allow some flexibility in motifdefinitions as the similarity search proceeds We inviteRNA scientists to visit our RAG web resource at httpmonodbiomathnyuedurna and suggest to us specific enhancementsthat will be useful in practice

ACKNOWLEDGEMENTS

We thank Dinshaw Patel for comments that motivated thiswork This research was supported by a Joint NSFNIGMSInitiative in Mathematical Biology (DMS-0201160) and theHuman Frontier Science Program (HFSP)

REFERENCES

1 TinocoI and BustamanteC (1999) How RNA folds J Mol Biol293 271ndash281

Figure 8 Distribution of all possible distinct tree modules with V vertices foundin Scerevisiae 16S and 23S rRNAs In graph theory each V is represented by aset of possible tree structures For example for V= 4 5 6 7 8 9 and 10 there are2 3 6 11 23 47 and 106 possible trees respectively The distribution of treesquantifies the diversity of tree modules within rRNAs This analysis suggeststhat rRNAs possess at least 210 tree modules

1396 Nucleic Acids Research 2005 Vol 33 No 4

2 GanHH PasqualiS and SchlickT (2003) Exploring the repertoire ofRNA secondary motifs using graph theory implications for RNA designNucleic Acids Res 31 2926ndash2943

3 SoukupGA and BreakerRR (1999) Engineering precision RNAmolecular switches Proc Natl Acad Sci USA 96 3584ndash3589

4 BreakerRR (2002) Engineered allosteric ribozymes as biosensorcomponents Curr Opin Biotechnol 13 31ndash39

5 IkawaY FukadaK WatanabeS ShiraishiH and InoueT (2002)Design construction and analysis of a novel class of self-foldingRNA Structure 10 527ndash534

6 KimN ShiffeldrimN GanHH and SchlickT (2004) Candidatesfor novel RNA topologies J Mol Biol 341 1129ndash1144

7 MitchellJR ChengJ and CollinsK (1999) A box HACAsmall nucleolar RNA-like domain at the human telomerase RNA 30

end Mol Cell Biol 19 567ndash5768 DoudnaJA and CechTR (2002) The chemical repertoire

of natural ribozymes Nature 418 222ndash2289 StorzG (2002) An expanding universe of noncoding RNAs Science

296 1260ndash126310 BreakerRR and JoyceGF (1994) Inventing and improving ribozyme

function rational design versus iterative selection methods TrendsBiotechnol 12 268ndash275

11 AndronescuM FejesAP HutterF HoosHH and CondonA (2004)A new algorithm for RNA secondary structure design J Mol Biol336 607ndash624

12 CarterRJ DubchakI and HolbrookSR (2001) A computationalapproach to identify genes for functional RNAs in genomic sequencesNucleic Acids Res 29 3928ndash3938

13 DoudnaJA (2000) Structural genomics of RNA Nature Struct Biol7 954ndash956

14 RivasE KleinRJ JonesTA and EddySR (2001) Computationalidentification of noncoding RNAs in Ecoli by comparativegenomics Curr Biol 11 1369ndash1373

15 HolmL and SanderC (1996) Mapping the protein universeScience 273 595ndash602

16 ShindyalovIN and BournePE (2001) A database and tools for 3-Dprotein structure comparison and alignment using the CombinatorialExtension (CE) algorithm Nucleic Acids Res 29 228ndash229

17 BenedettiG and MorosettiS (1996) A graph-topological approach torecognition of pattern and similarity in RNA secondary structuresBiophys Chem 59 179ndash184

18 MargalitH ShapiroBA OppenheimAB and MaizelJVJr (1989)Detection of common motifs in RNA secondary structures Nucleic AcidsRes 17 4829ndash4845

19 ShapiroBA and ZhangKZ (1990) Comparing multiple RNAsecondary structures using tree comparisons Comput Appl Biosci6 309ndash318

20 LeSY NussinovR and MaizelJV (1989) Tree graphs of RNAsecondary structures and their comparisons Comput Biomed Res22 461ndash473

21 DuarteCM WadleyLM and PyleAM (2003) RNA structurecomparison motif search and discovery using a reduced representation ofRNA conformational space Nucleic Acids Res 31 4755ndash4761

22 ChevaletC and MichotB (1992) An algorithm for comparing RNAsecondary structures and searching for similar substructures ComputAppl Biosci 8 215ndash225

23 WangL and ZhaoJ (2003) Parametric alignment of ordered treesBioinformatics 19 2237ndash2245

24 FontanaW KoningsDAM StadlerPF and SchusterP (1993)Statistics of RNA secondary structures Biopolymers 33 1389ndash1404

25 van BatenburgFHD GultyaevAP PleijCWA NgJ andOliehoekJ (2000) PseudoBase a database with RNA pseudoknotsNucleic Acids Res 28 201ndash204

26 GanHH FeraD ZornJ ShiffeldrimN TangM LasersonUKimN and SchlickT (2004) RAG RNA-As-Graphsdatabasemdashconcepts analysis and features Bioinformatics20 1285ndash1291

27 FeraD KimN ShiffeldrimN ZornJ LasersonU GanHH andSchlickT (2004) RAG RNA-As-Graphs web resource BMCBioinformatics 5 88

28 ChenXY KangHS ShenLX ChamorroM VarmusHE andTinocoI (1996) A characteristic bent conformation of RNA pseudoknotspromotes 1 frameshifting during translation of retroviral RNAJ Mol Biol 260 479ndash483

29 KangHS HinesJV and TinocoI (1996) Conformation of anon-frameshifting RNA pseudoknot from mouse mammary tumor virusJ Mol Biol 259 135ndash147

30 GanHH PerlowRA RoyS KoJ WuM HuangJ YanSNicolettaA VafaiJ SunD et al (2002) Analysis of protein sequencestructure similarity relationships Biophys J 83 2781ndash2791

31 FeldenB FlorentzC GiegeR and WesthofE (1994) Solutionstructure of the 30-end of brome mosaic-virus genomic RNAsmdashconformational mimicry with canonical transfer-RNAs J Mol Biol235 508ndash531

32 GrossJL and YellenJ (1999) Graph Theory and its Applications CRCPress Boca Raton FL

33 KoblerJ SchoningU and ToraanJ (1993) The Graph IsomorphismProblem Birkhauser Boston MA

34 RandicM and ZupanJ (2001) On interpretation of well-knowntopological indices J Chem Inform Comput Sci 41 550ndash560

35 ZornJ GanHH ShiffeldrimN and SchlickT (2004) Structuralmotifs in ribosomal RNAs implications for RNA design and genomicsBiopolymers 73 340ndash347

36 JacksT MadhaniHD MasiarzFR and VarmusHE (1988) Signalsfor ribosomal frameshifting in the Rous-Sarcoma virus GagndashPolregion Cell 55 447ndash458

37 GodenyEK ChenL KumarSN MethvenSL KooninEV andBrintonMA (1993) Complete genomic sequence and phylogeneticanalysis of the lactate dehydrogenase-elevating virus (Ldv) Virology194 585ndash596

38 MarczinkeB FisherR VidakovicM BloysAJ and BrierleyI (1998)Secondary structure and mutational analysis of the ribosomalframeshift signal of Rous sarcoma virus J Mol Biol 284 205ndash225

39 MyersEW and MillerW (1988) Optimal alignments in linear spaceComput Appl Biosci 4 11ndash17

40 HuangXQ and MillerW (1991) A time-efficient linear-space localsimilarity algorithm Adv Appl Math 12 337ndash357

41 SmithTF and WatermanMS (1981) Identification of commonmolecular subsequences J Mol Biol 147 195ndash197

42 SongSI SilverSL AulikMA RasochovaL MohanBR andMillerWA (1999) Satellite cereal yellow dwarf virus-RPV (satRPV)RNA requires a DouXble hammerhead for self-cleavage and analternative structure for replication J Mol Biol 293 781ndash793

43 DreherTW and HallTC (1988) Mutational analysis of thetransfer-RNA mimicry of brome mosaic-virus RNAmdashsequence andstructural requirements for aminoacylation and 30-adenylationJ Mol Biol 201 41ndash55

44 KanamoriY and NakashimaN (2001) A tertiary structure model of theinternal ribosome entry site (IRES) for methionine-independent initiationof translation RNA 7 266ndash274

45 WuytsJ De RijkP Van de PeerY PisonG RousseeuwP andDe WachterR (2000)Comparativeanalysisof more than 3000 sequencesreveals the existence of two pseudoknots in area V4 of eukaryotic smallsubunit ribosomal RNA Nucleic Acids Res 28 4698ndash4708

46 Van de PeerY De RijkP WuytsJ WinkelmansT and De WachterR(2000) The European small subunit ribosomal RNA databaseNucleic Acids Res 28 175ndash176

47 ChappellSA EdelmanGM and MauroVP (2000) A 9-nt segment ofa cellular mRNA can function as an internal ribosome entry site (IRES)and when present in linked multiple copies greatly enhances IRESactivity Proc Natl Acad Sci USA 97 1536ndash1541

48 ScheperGC VoormaHO and ThomasAA (1994) Basepairing with18S ribosomal RNA in internal initiation of translation FEBS Lett352 271ndash275

49 BarendsS Rudinger-ThirionJ FlorentzC GiegeR PleijCW andKraalB (2004) tRNA-like structure regulates translation of Bromemosaic virus RNA J Virol 78 4003ndash4010

50 CarterAP ClemonsWM BrodersenDE Morgan-WarrenRJWimberlyBT and RamakrishnanV (2000) Functional insights fromthe structure of the 30S ribosomal subunit and its interactions withantibiotics Nature 407 340ndash348

51 HarmsJ SchluenzenF ZarivachR BashanA GatS AgmonIBartelsH FranceschiF and YonathA (2001) High resolution structureof the large ribosomal subunit from a mesophilic Eubacterium Cell 107679ndash688

52 BanN NissenP HansenJ MoorePB and SteitzTA (2000) Thecomplete atomic structure of the large ribosomal subunit at 24 angstromresolution Science 289 905ndash920

Nucleic Acids Research 2005 Vol 33 No 4 1397

53 CannoneJJ SubramanianS SchnareMN CollettJRDrsquoSouzaLM DuYS FengB LinN MadabusiLV MullerKMet al (2002) The Comparative RNA Web (CRW) Site an online databaseof comparative sequence and structure information for ribosomal intronand other RNAs BMC Bioinformatics 3 2

54 HararyF and PrinsG (1959) The number of homeomorphicallyirreducible trees and other species Acta Math 101141ndash162

55 HararyF (1969) Graph Theory Addison-Wesley Reading Mass

56 VasquezM and ScheragaHA (1985) Use of buildup andenergy-minimization procedures to compute low-energy structures ofthe backbone of enkephalin Biopolymers 24 1437ndash1447

57 ZukerM MathewsDH and TurnerDH (1999) Algorithms andthermodynamics for RNA secondary structure prediction A practicalguide In BarciszewskiJ and ClarkBFC (eds) RNA Biochemistry andBiotechnology Kluwer Academic Publishers Dordrecht pp 11ndash43

58 HofackerIL (2003) Vienna RNA secondary structure serverNucleic Acids Res 31 3429ndash3431

1398 Nucleic Acids Research 2005 Vol 33 No 4

Page 11: doi:10.1093/nar/gki267 Modular RNA architecture revealed ... · Modular RNA architecture revealed by computational analysis of existing pseudoknots and ... Samuela Pasquali, ... RNAs

Figure 6 Comparison of the graph topologies and 2D and 3D structures of 5S rRNAs (A and B) and modules in 16S and 23S rRNAs (C and D) (C and D) Twostructural (backbone) overlaps of the 5S structure (1ffk gold color) with 16S module (1fjg blue) and 5S structure with 23S module (1nkw green) are shown Thesecondary diagrams are adapted from those in Gutellrsquos comparative RNA website (httpwwwrnaicmbutexasedu) and 3D structures are from the Protein DataBank (httpwwwrcsborgpdb) The 2D modules were identified computationally by our earlier work on finding120 nt 5S rRNA motifs in 16S and 23S rRNAs ofScerevisiae(2) The equivalent tertiary structures and modules for 5S 16S and 23S rRNAs from other organisms show that similar graph topologies can lead tosimilar secondary and tertiary structures All structures have similar graph topologies The Tthermophilus 16S module is a 5-vertex instead of 6-vertex graph becausea 2 bp stem at the junction has a non-canonical base pair AG for Scerevisiae this stem consists of only canonical base pairs making the 16S module an identicalmatch for the 5S topology

1394 Nucleic Acids Research 2005 Vol 33 No 4

Moreover the percentages of nucleotides in stems bulgesinternal loops hairpin loops and junctions do not vary for16S and 23S rRNAs even though the latter is twice thesize of the former Such motif characteristics help discriminateRNA-like from non-RNA-like molecules and guide the designof novel RNAs Here we analyze the distinct modular unitsmdashsmall tree submotifsmdashthat makeup the rRNAs to advance suchapplications Since all small tree graphs up to 10 vertices havebeen exhaustively enumerated (5455) we can use our sub-structure search algorithm to identify the occurrences of alldistinct tree submotifs in rRNAs

Specifically we use sets of tree graphs for V = 4 5 6 7 8 9and 10 comprising 2 3 6 11 23 47 and 106 graphs respect-ively for a total of 198 tree motifs The complete enumeratedmotifs can be found in the RAG web resource (httpmonodbiomathnyuedurna) Figure 7 shows as an example the11 possible 7-vertex tree graphs and lists the frequency ofoccurrence of each tree graph in rRNAs Remarkably thesame tree motifs are abundant in both 16S and 23S rRNAsThe most abundant (10 or 15 times) 7-vertex tree has two3-stem junctions (topological invariant Stree = 6160) 4- and5-stem junctions are the next most prevalent submotifs(Stree = 5137 5921 and 5667) Interestingly the unbranchedtree (Stree = 7219) and highly branched trees (Stree = 4509)are rare

Figure 8 plots the percentage of distinct tree graphs in eachset V (fraction of motifs identified relative to possible motifs)

found in rRNA For small tree graphs V = 4 5 and 6 allpossible topologies are present but not for V gt 6 In fact thepercentage of trees found declines with V In 23S rRNA90 of possible trees are found for V = 7 and only40 (representing 47 tree topologies) are found forV = 10 Larger trees (V gt 10) are expected to have muchsmaller percentages This percentage decline is expectedsince small tree modules are less specific than large treesIn total 111 submotifs out of 198 are identified The observedtrends in Figure 8 may be generally valid for rRNA structuresof other organisms since they are highly conserved althoughsome statistical fluctuations are expected due to structuraldifferences and RNArsquos finite size

These data show that RNA structures are modular constructsof relatively few (tree) building blocks Extrapolation of thecurves in Figure 8 suggests that 20 of V = 11 motifs (totalof 235 trees) and 10 of V = 12 motifs (total of 551 trees)probably exist in rRNAs Thus the total number of buildingblocks in rRNAs with V lt 13 would be 210 Although thepercentages of trees with higher V values are low they aremore numerous and will contribute to the overall number ofbuilding blocks We thus consider 210 as the minimum num-ber of building blocks The building block number may beexploited in the future design of novel functional RNAmolecules using modular assembly from 210 distinct treemotifs The tree motifs may be taken from substructures ofexisting RNAs allowing simultaneous sequence and structure

treeS =

16S 23SOccurrence

ID

6 4

)

0 1

(7 8)

6191

(7 11)

4509

0 1

(7 9)

5385

3 3

7

(7 10)

5137

5 5

(7 6)

5916

5

(7 5)

5921

(7 1)

0 0

7219

(7 3)

2 4

6446

4 4

(7 7)

5667

10 15

(7 4

6160

0 3

6692

(7 2)

Figure 7 Occurrence frequency of 11 distinct 7-vertex tree topologies as submotifs of Scerevisiae 16S and 23S rRNAs The tree graphs are labeled using topologicalinvariant Stree (second top row) and ID number (i j) (first row) which refers to motif identification number as used in RAG database (httpmonodbiomathnyuedurna) The two numbers at the bottom of each graph refer to the number of times that topology is found embedded in 16S and in 23S rRNAs respectively

Nucleic Acids Research 2005 Vol 33 No 4 1395

design Such a build-up approach appears feasible from our(26) and other works (35) and is akin to the lsquobuilduprsquo tech-nique for proteins pioneered by Vasquez and Scheraga (56)

The above estimate of building block number can be eithera consequence of the limited number of available structures orthe fact that RNA adopts only a subset of possible graphsCertainly future discoveries of novel RNA structures mightindicate a larger repertoire of building blocks Still our sur-veys of natural RNAs (26) suggest that RNAs possess specifictopological characteristics and favor a subset of the possiblegraph motifs The missing motifs may not exist in nature forRNA owing to physical or functional constraints

The building block number can also be defined in terms ofirreducible graphs ie trees that cannot be decomposed intosmaller trees However this approach has a major drawbackSince all RNAs are composed of connected stems their sub-structures can be reduced to two vertex graphs which is not aninformative description of complex RNA folds Still if wedefine 4-vertex trees as the smallest irreducible graphs thenumber of irreducible graphs can be calculated Based on oursubmotif data we estimate that for V up to 10 there are about42 irreducible tree modules in rRNAs

SUMMARY AND CONCLUSION

Our automated comparison of RNA secondary structuresbased on graphical representation allows the examination ofconserved 2D RNA topological features which in turn suggestfunctional relationships This approach has an advantage oversequence alignment because functionally related RNAs oftenlack sequence similarity as shown for frameshifting pseudo-knots with low global sequence identities Of course ourstructure alignment approach relies on available 2D RNA

structures and thus will increase in scope as more RNAstructures become available Our eight identified pairs ofpotentially related pseudoknots (Table 1) reveal three pairsthat are functionally related the pseudoknots of 30 end ofbrome mosaic virus genomic RNA (PKB134) and alternativehammerhead ribozyme pseudoknot (PKB173) are templatesfor replicase to initiate replication of viral genomes the pseu-doknots of internal entry site (IRES) of caspid-producingCrPVs (PKB223) and SSU of 18S rRNA (PKB205) are asso-ciated with translation initiation and the pseudoknots of 18SrRNA (PKB205) and viral tRNA-like (PKB134) most prob-ably interact via both structural mimicry and base comple-mentarity To the best of our knowledge these functionalrelationships have not been described previously

The developed RNA comparison algorithm also led us toquantify the modularity of 16S and 23S rRNAs using smallenumerated tree motifs Our results suggest that rRNAs areconstructed from at least 210 distinct tree subtopologies(or modules) having up to 180 nt Significantly we findthat 5S rRNA and two tree modules within 16S and 23SrRNAs with similar topologies have similar tertiary shapesaffirming the relationship between 2D graphical representa-tion and 3D structure The tree modules in rRNA may be usedas building blocks for the design of novel RNA folds A similaridea has been employed to design and improve functionalRNA molecules (510) Design by assembly of modulesexploits motifs present in natural RNAs unlike de novo designof sequences and structures (11) The issues of discoveringnovel topologies (6) building module library and designingfunctional folds are likely to increase in importance withinterest in novel RNAs

Our algorithm has several limitations Of course its utilitydepends on availability of reliable RNA secondary foldingalgorithms (5758) More fundamentally although simplifiedgraphical representations specify topological characteristicsdetailed sequence and motif information is missing An asso-ciated problem is that unusual RNA structurendashfunction rela-tionships due to deleterious mutations as in frameshiftingpseudoknots (28) and structural flexibility are not capturedMore advanced graph constructs such as weighted graphs mayhowever specify detailed motif features at the secondary andbase pair levels Another limitation of the present algorithmis the requirement to fix the probe structure or graph As inprotein structure alignment algorithms (15) future refinementsof our algorithm should allow some flexibility in motifdefinitions as the similarity search proceeds We inviteRNA scientists to visit our RAG web resource at httpmonodbiomathnyuedurna and suggest to us specific enhancementsthat will be useful in practice

ACKNOWLEDGEMENTS

We thank Dinshaw Patel for comments that motivated thiswork This research was supported by a Joint NSFNIGMSInitiative in Mathematical Biology (DMS-0201160) and theHuman Frontier Science Program (HFSP)

REFERENCES

1 TinocoI and BustamanteC (1999) How RNA folds J Mol Biol293 271ndash281

Figure 8 Distribution of all possible distinct tree modules with V vertices foundin Scerevisiae 16S and 23S rRNAs In graph theory each V is represented by aset of possible tree structures For example for V= 4 5 6 7 8 9 and 10 there are2 3 6 11 23 47 and 106 possible trees respectively The distribution of treesquantifies the diversity of tree modules within rRNAs This analysis suggeststhat rRNAs possess at least 210 tree modules

1396 Nucleic Acids Research 2005 Vol 33 No 4

2 GanHH PasqualiS and SchlickT (2003) Exploring the repertoire ofRNA secondary motifs using graph theory implications for RNA designNucleic Acids Res 31 2926ndash2943

3 SoukupGA and BreakerRR (1999) Engineering precision RNAmolecular switches Proc Natl Acad Sci USA 96 3584ndash3589

4 BreakerRR (2002) Engineered allosteric ribozymes as biosensorcomponents Curr Opin Biotechnol 13 31ndash39

5 IkawaY FukadaK WatanabeS ShiraishiH and InoueT (2002)Design construction and analysis of a novel class of self-foldingRNA Structure 10 527ndash534

6 KimN ShiffeldrimN GanHH and SchlickT (2004) Candidatesfor novel RNA topologies J Mol Biol 341 1129ndash1144

7 MitchellJR ChengJ and CollinsK (1999) A box HACAsmall nucleolar RNA-like domain at the human telomerase RNA 30

end Mol Cell Biol 19 567ndash5768 DoudnaJA and CechTR (2002) The chemical repertoire

of natural ribozymes Nature 418 222ndash2289 StorzG (2002) An expanding universe of noncoding RNAs Science

296 1260ndash126310 BreakerRR and JoyceGF (1994) Inventing and improving ribozyme

function rational design versus iterative selection methods TrendsBiotechnol 12 268ndash275

11 AndronescuM FejesAP HutterF HoosHH and CondonA (2004)A new algorithm for RNA secondary structure design J Mol Biol336 607ndash624

12 CarterRJ DubchakI and HolbrookSR (2001) A computationalapproach to identify genes for functional RNAs in genomic sequencesNucleic Acids Res 29 3928ndash3938

13 DoudnaJA (2000) Structural genomics of RNA Nature Struct Biol7 954ndash956

14 RivasE KleinRJ JonesTA and EddySR (2001) Computationalidentification of noncoding RNAs in Ecoli by comparativegenomics Curr Biol 11 1369ndash1373

15 HolmL and SanderC (1996) Mapping the protein universeScience 273 595ndash602

16 ShindyalovIN and BournePE (2001) A database and tools for 3-Dprotein structure comparison and alignment using the CombinatorialExtension (CE) algorithm Nucleic Acids Res 29 228ndash229

17 BenedettiG and MorosettiS (1996) A graph-topological approach torecognition of pattern and similarity in RNA secondary structuresBiophys Chem 59 179ndash184

18 MargalitH ShapiroBA OppenheimAB and MaizelJVJr (1989)Detection of common motifs in RNA secondary structures Nucleic AcidsRes 17 4829ndash4845

19 ShapiroBA and ZhangKZ (1990) Comparing multiple RNAsecondary structures using tree comparisons Comput Appl Biosci6 309ndash318

20 LeSY NussinovR and MaizelJV (1989) Tree graphs of RNAsecondary structures and their comparisons Comput Biomed Res22 461ndash473

21 DuarteCM WadleyLM and PyleAM (2003) RNA structurecomparison motif search and discovery using a reduced representation ofRNA conformational space Nucleic Acids Res 31 4755ndash4761

22 ChevaletC and MichotB (1992) An algorithm for comparing RNAsecondary structures and searching for similar substructures ComputAppl Biosci 8 215ndash225

23 WangL and ZhaoJ (2003) Parametric alignment of ordered treesBioinformatics 19 2237ndash2245

24 FontanaW KoningsDAM StadlerPF and SchusterP (1993)Statistics of RNA secondary structures Biopolymers 33 1389ndash1404

25 van BatenburgFHD GultyaevAP PleijCWA NgJ andOliehoekJ (2000) PseudoBase a database with RNA pseudoknotsNucleic Acids Res 28 201ndash204

26 GanHH FeraD ZornJ ShiffeldrimN TangM LasersonUKimN and SchlickT (2004) RAG RNA-As-Graphsdatabasemdashconcepts analysis and features Bioinformatics20 1285ndash1291

27 FeraD KimN ShiffeldrimN ZornJ LasersonU GanHH andSchlickT (2004) RAG RNA-As-Graphs web resource BMCBioinformatics 5 88

28 ChenXY KangHS ShenLX ChamorroM VarmusHE andTinocoI (1996) A characteristic bent conformation of RNA pseudoknotspromotes 1 frameshifting during translation of retroviral RNAJ Mol Biol 260 479ndash483

29 KangHS HinesJV and TinocoI (1996) Conformation of anon-frameshifting RNA pseudoknot from mouse mammary tumor virusJ Mol Biol 259 135ndash147

30 GanHH PerlowRA RoyS KoJ WuM HuangJ YanSNicolettaA VafaiJ SunD et al (2002) Analysis of protein sequencestructure similarity relationships Biophys J 83 2781ndash2791

31 FeldenB FlorentzC GiegeR and WesthofE (1994) Solutionstructure of the 30-end of brome mosaic-virus genomic RNAsmdashconformational mimicry with canonical transfer-RNAs J Mol Biol235 508ndash531

32 GrossJL and YellenJ (1999) Graph Theory and its Applications CRCPress Boca Raton FL

33 KoblerJ SchoningU and ToraanJ (1993) The Graph IsomorphismProblem Birkhauser Boston MA

34 RandicM and ZupanJ (2001) On interpretation of well-knowntopological indices J Chem Inform Comput Sci 41 550ndash560

35 ZornJ GanHH ShiffeldrimN and SchlickT (2004) Structuralmotifs in ribosomal RNAs implications for RNA design and genomicsBiopolymers 73 340ndash347

36 JacksT MadhaniHD MasiarzFR and VarmusHE (1988) Signalsfor ribosomal frameshifting in the Rous-Sarcoma virus GagndashPolregion Cell 55 447ndash458

37 GodenyEK ChenL KumarSN MethvenSL KooninEV andBrintonMA (1993) Complete genomic sequence and phylogeneticanalysis of the lactate dehydrogenase-elevating virus (Ldv) Virology194 585ndash596

38 MarczinkeB FisherR VidakovicM BloysAJ and BrierleyI (1998)Secondary structure and mutational analysis of the ribosomalframeshift signal of Rous sarcoma virus J Mol Biol 284 205ndash225

39 MyersEW and MillerW (1988) Optimal alignments in linear spaceComput Appl Biosci 4 11ndash17

40 HuangXQ and MillerW (1991) A time-efficient linear-space localsimilarity algorithm Adv Appl Math 12 337ndash357

41 SmithTF and WatermanMS (1981) Identification of commonmolecular subsequences J Mol Biol 147 195ndash197

42 SongSI SilverSL AulikMA RasochovaL MohanBR andMillerWA (1999) Satellite cereal yellow dwarf virus-RPV (satRPV)RNA requires a DouXble hammerhead for self-cleavage and analternative structure for replication J Mol Biol 293 781ndash793

43 DreherTW and HallTC (1988) Mutational analysis of thetransfer-RNA mimicry of brome mosaic-virus RNAmdashsequence andstructural requirements for aminoacylation and 30-adenylationJ Mol Biol 201 41ndash55

44 KanamoriY and NakashimaN (2001) A tertiary structure model of theinternal ribosome entry site (IRES) for methionine-independent initiationof translation RNA 7 266ndash274

45 WuytsJ De RijkP Van de PeerY PisonG RousseeuwP andDe WachterR (2000)Comparativeanalysisof more than 3000 sequencesreveals the existence of two pseudoknots in area V4 of eukaryotic smallsubunit ribosomal RNA Nucleic Acids Res 28 4698ndash4708

46 Van de PeerY De RijkP WuytsJ WinkelmansT and De WachterR(2000) The European small subunit ribosomal RNA databaseNucleic Acids Res 28 175ndash176

47 ChappellSA EdelmanGM and MauroVP (2000) A 9-nt segment ofa cellular mRNA can function as an internal ribosome entry site (IRES)and when present in linked multiple copies greatly enhances IRESactivity Proc Natl Acad Sci USA 97 1536ndash1541

48 ScheperGC VoormaHO and ThomasAA (1994) Basepairing with18S ribosomal RNA in internal initiation of translation FEBS Lett352 271ndash275

49 BarendsS Rudinger-ThirionJ FlorentzC GiegeR PleijCW andKraalB (2004) tRNA-like structure regulates translation of Bromemosaic virus RNA J Virol 78 4003ndash4010

50 CarterAP ClemonsWM BrodersenDE Morgan-WarrenRJWimberlyBT and RamakrishnanV (2000) Functional insights fromthe structure of the 30S ribosomal subunit and its interactions withantibiotics Nature 407 340ndash348

51 HarmsJ SchluenzenF ZarivachR BashanA GatS AgmonIBartelsH FranceschiF and YonathA (2001) High resolution structureof the large ribosomal subunit from a mesophilic Eubacterium Cell 107679ndash688

52 BanN NissenP HansenJ MoorePB and SteitzTA (2000) Thecomplete atomic structure of the large ribosomal subunit at 24 angstromresolution Science 289 905ndash920

Nucleic Acids Research 2005 Vol 33 No 4 1397

53 CannoneJJ SubramanianS SchnareMN CollettJRDrsquoSouzaLM DuYS FengB LinN MadabusiLV MullerKMet al (2002) The Comparative RNA Web (CRW) Site an online databaseof comparative sequence and structure information for ribosomal intronand other RNAs BMC Bioinformatics 3 2

54 HararyF and PrinsG (1959) The number of homeomorphicallyirreducible trees and other species Acta Math 101141ndash162

55 HararyF (1969) Graph Theory Addison-Wesley Reading Mass

56 VasquezM and ScheragaHA (1985) Use of buildup andenergy-minimization procedures to compute low-energy structures ofthe backbone of enkephalin Biopolymers 24 1437ndash1447

57 ZukerM MathewsDH and TurnerDH (1999) Algorithms andthermodynamics for RNA secondary structure prediction A practicalguide In BarciszewskiJ and ClarkBFC (eds) RNA Biochemistry andBiotechnology Kluwer Academic Publishers Dordrecht pp 11ndash43

58 HofackerIL (2003) Vienna RNA secondary structure serverNucleic Acids Res 31 3429ndash3431

1398 Nucleic Acids Research 2005 Vol 33 No 4

Page 12: doi:10.1093/nar/gki267 Modular RNA architecture revealed ... · Modular RNA architecture revealed by computational analysis of existing pseudoknots and ... Samuela Pasquali, ... RNAs

Moreover the percentages of nucleotides in stems bulgesinternal loops hairpin loops and junctions do not vary for16S and 23S rRNAs even though the latter is twice thesize of the former Such motif characteristics help discriminateRNA-like from non-RNA-like molecules and guide the designof novel RNAs Here we analyze the distinct modular unitsmdashsmall tree submotifsmdashthat makeup the rRNAs to advance suchapplications Since all small tree graphs up to 10 vertices havebeen exhaustively enumerated (5455) we can use our sub-structure search algorithm to identify the occurrences of alldistinct tree submotifs in rRNAs

Specifically we use sets of tree graphs for V = 4 5 6 7 8 9and 10 comprising 2 3 6 11 23 47 and 106 graphs respect-ively for a total of 198 tree motifs The complete enumeratedmotifs can be found in the RAG web resource (httpmonodbiomathnyuedurna) Figure 7 shows as an example the11 possible 7-vertex tree graphs and lists the frequency ofoccurrence of each tree graph in rRNAs Remarkably thesame tree motifs are abundant in both 16S and 23S rRNAsThe most abundant (10 or 15 times) 7-vertex tree has two3-stem junctions (topological invariant Stree = 6160) 4- and5-stem junctions are the next most prevalent submotifs(Stree = 5137 5921 and 5667) Interestingly the unbranchedtree (Stree = 7219) and highly branched trees (Stree = 4509)are rare

Figure 8 plots the percentage of distinct tree graphs in eachset V (fraction of motifs identified relative to possible motifs)

found in rRNA For small tree graphs V = 4 5 and 6 allpossible topologies are present but not for V gt 6 In fact thepercentage of trees found declines with V In 23S rRNA90 of possible trees are found for V = 7 and only40 (representing 47 tree topologies) are found forV = 10 Larger trees (V gt 10) are expected to have muchsmaller percentages This percentage decline is expectedsince small tree modules are less specific than large treesIn total 111 submotifs out of 198 are identified The observedtrends in Figure 8 may be generally valid for rRNA structuresof other organisms since they are highly conserved althoughsome statistical fluctuations are expected due to structuraldifferences and RNArsquos finite size

These data show that RNA structures are modular constructsof relatively few (tree) building blocks Extrapolation of thecurves in Figure 8 suggests that 20 of V = 11 motifs (totalof 235 trees) and 10 of V = 12 motifs (total of 551 trees)probably exist in rRNAs Thus the total number of buildingblocks in rRNAs with V lt 13 would be 210 Although thepercentages of trees with higher V values are low they aremore numerous and will contribute to the overall number ofbuilding blocks We thus consider 210 as the minimum num-ber of building blocks The building block number may beexploited in the future design of novel functional RNAmolecules using modular assembly from 210 distinct treemotifs The tree motifs may be taken from substructures ofexisting RNAs allowing simultaneous sequence and structure

treeS =

16S 23SOccurrence

ID

6 4

)

0 1

(7 8)

6191

(7 11)

4509

0 1

(7 9)

5385

3 3

7

(7 10)

5137

5 5

(7 6)

5916

5

(7 5)

5921

(7 1)

0 0

7219

(7 3)

2 4

6446

4 4

(7 7)

5667

10 15

(7 4

6160

0 3

6692

(7 2)

Figure 7 Occurrence frequency of 11 distinct 7-vertex tree topologies as submotifs of Scerevisiae 16S and 23S rRNAs The tree graphs are labeled using topologicalinvariant Stree (second top row) and ID number (i j) (first row) which refers to motif identification number as used in RAG database (httpmonodbiomathnyuedurna) The two numbers at the bottom of each graph refer to the number of times that topology is found embedded in 16S and in 23S rRNAs respectively

Nucleic Acids Research 2005 Vol 33 No 4 1395

design Such a build-up approach appears feasible from our(26) and other works (35) and is akin to the lsquobuilduprsquo tech-nique for proteins pioneered by Vasquez and Scheraga (56)

The above estimate of building block number can be eithera consequence of the limited number of available structures orthe fact that RNA adopts only a subset of possible graphsCertainly future discoveries of novel RNA structures mightindicate a larger repertoire of building blocks Still our sur-veys of natural RNAs (26) suggest that RNAs possess specifictopological characteristics and favor a subset of the possiblegraph motifs The missing motifs may not exist in nature forRNA owing to physical or functional constraints

The building block number can also be defined in terms ofirreducible graphs ie trees that cannot be decomposed intosmaller trees However this approach has a major drawbackSince all RNAs are composed of connected stems their sub-structures can be reduced to two vertex graphs which is not aninformative description of complex RNA folds Still if wedefine 4-vertex trees as the smallest irreducible graphs thenumber of irreducible graphs can be calculated Based on oursubmotif data we estimate that for V up to 10 there are about42 irreducible tree modules in rRNAs

SUMMARY AND CONCLUSION

Our automated comparison of RNA secondary structuresbased on graphical representation allows the examination ofconserved 2D RNA topological features which in turn suggestfunctional relationships This approach has an advantage oversequence alignment because functionally related RNAs oftenlack sequence similarity as shown for frameshifting pseudo-knots with low global sequence identities Of course ourstructure alignment approach relies on available 2D RNA

structures and thus will increase in scope as more RNAstructures become available Our eight identified pairs ofpotentially related pseudoknots (Table 1) reveal three pairsthat are functionally related the pseudoknots of 30 end ofbrome mosaic virus genomic RNA (PKB134) and alternativehammerhead ribozyme pseudoknot (PKB173) are templatesfor replicase to initiate replication of viral genomes the pseu-doknots of internal entry site (IRES) of caspid-producingCrPVs (PKB223) and SSU of 18S rRNA (PKB205) are asso-ciated with translation initiation and the pseudoknots of 18SrRNA (PKB205) and viral tRNA-like (PKB134) most prob-ably interact via both structural mimicry and base comple-mentarity To the best of our knowledge these functionalrelationships have not been described previously

The developed RNA comparison algorithm also led us toquantify the modularity of 16S and 23S rRNAs using smallenumerated tree motifs Our results suggest that rRNAs areconstructed from at least 210 distinct tree subtopologies(or modules) having up to 180 nt Significantly we findthat 5S rRNA and two tree modules within 16S and 23SrRNAs with similar topologies have similar tertiary shapesaffirming the relationship between 2D graphical representa-tion and 3D structure The tree modules in rRNA may be usedas building blocks for the design of novel RNA folds A similaridea has been employed to design and improve functionalRNA molecules (510) Design by assembly of modulesexploits motifs present in natural RNAs unlike de novo designof sequences and structures (11) The issues of discoveringnovel topologies (6) building module library and designingfunctional folds are likely to increase in importance withinterest in novel RNAs

Our algorithm has several limitations Of course its utilitydepends on availability of reliable RNA secondary foldingalgorithms (5758) More fundamentally although simplifiedgraphical representations specify topological characteristicsdetailed sequence and motif information is missing An asso-ciated problem is that unusual RNA structurendashfunction rela-tionships due to deleterious mutations as in frameshiftingpseudoknots (28) and structural flexibility are not capturedMore advanced graph constructs such as weighted graphs mayhowever specify detailed motif features at the secondary andbase pair levels Another limitation of the present algorithmis the requirement to fix the probe structure or graph As inprotein structure alignment algorithms (15) future refinementsof our algorithm should allow some flexibility in motifdefinitions as the similarity search proceeds We inviteRNA scientists to visit our RAG web resource at httpmonodbiomathnyuedurna and suggest to us specific enhancementsthat will be useful in practice

ACKNOWLEDGEMENTS

We thank Dinshaw Patel for comments that motivated thiswork This research was supported by a Joint NSFNIGMSInitiative in Mathematical Biology (DMS-0201160) and theHuman Frontier Science Program (HFSP)

REFERENCES

1 TinocoI and BustamanteC (1999) How RNA folds J Mol Biol293 271ndash281

Figure 8 Distribution of all possible distinct tree modules with V vertices foundin Scerevisiae 16S and 23S rRNAs In graph theory each V is represented by aset of possible tree structures For example for V= 4 5 6 7 8 9 and 10 there are2 3 6 11 23 47 and 106 possible trees respectively The distribution of treesquantifies the diversity of tree modules within rRNAs This analysis suggeststhat rRNAs possess at least 210 tree modules

1396 Nucleic Acids Research 2005 Vol 33 No 4

2 GanHH PasqualiS and SchlickT (2003) Exploring the repertoire ofRNA secondary motifs using graph theory implications for RNA designNucleic Acids Res 31 2926ndash2943

3 SoukupGA and BreakerRR (1999) Engineering precision RNAmolecular switches Proc Natl Acad Sci USA 96 3584ndash3589

4 BreakerRR (2002) Engineered allosteric ribozymes as biosensorcomponents Curr Opin Biotechnol 13 31ndash39

5 IkawaY FukadaK WatanabeS ShiraishiH and InoueT (2002)Design construction and analysis of a novel class of self-foldingRNA Structure 10 527ndash534

6 KimN ShiffeldrimN GanHH and SchlickT (2004) Candidatesfor novel RNA topologies J Mol Biol 341 1129ndash1144

7 MitchellJR ChengJ and CollinsK (1999) A box HACAsmall nucleolar RNA-like domain at the human telomerase RNA 30

end Mol Cell Biol 19 567ndash5768 DoudnaJA and CechTR (2002) The chemical repertoire

of natural ribozymes Nature 418 222ndash2289 StorzG (2002) An expanding universe of noncoding RNAs Science

296 1260ndash126310 BreakerRR and JoyceGF (1994) Inventing and improving ribozyme

function rational design versus iterative selection methods TrendsBiotechnol 12 268ndash275

11 AndronescuM FejesAP HutterF HoosHH and CondonA (2004)A new algorithm for RNA secondary structure design J Mol Biol336 607ndash624

12 CarterRJ DubchakI and HolbrookSR (2001) A computationalapproach to identify genes for functional RNAs in genomic sequencesNucleic Acids Res 29 3928ndash3938

13 DoudnaJA (2000) Structural genomics of RNA Nature Struct Biol7 954ndash956

14 RivasE KleinRJ JonesTA and EddySR (2001) Computationalidentification of noncoding RNAs in Ecoli by comparativegenomics Curr Biol 11 1369ndash1373

15 HolmL and SanderC (1996) Mapping the protein universeScience 273 595ndash602

16 ShindyalovIN and BournePE (2001) A database and tools for 3-Dprotein structure comparison and alignment using the CombinatorialExtension (CE) algorithm Nucleic Acids Res 29 228ndash229

17 BenedettiG and MorosettiS (1996) A graph-topological approach torecognition of pattern and similarity in RNA secondary structuresBiophys Chem 59 179ndash184

18 MargalitH ShapiroBA OppenheimAB and MaizelJVJr (1989)Detection of common motifs in RNA secondary structures Nucleic AcidsRes 17 4829ndash4845

19 ShapiroBA and ZhangKZ (1990) Comparing multiple RNAsecondary structures using tree comparisons Comput Appl Biosci6 309ndash318

20 LeSY NussinovR and MaizelJV (1989) Tree graphs of RNAsecondary structures and their comparisons Comput Biomed Res22 461ndash473

21 DuarteCM WadleyLM and PyleAM (2003) RNA structurecomparison motif search and discovery using a reduced representation ofRNA conformational space Nucleic Acids Res 31 4755ndash4761

22 ChevaletC and MichotB (1992) An algorithm for comparing RNAsecondary structures and searching for similar substructures ComputAppl Biosci 8 215ndash225

23 WangL and ZhaoJ (2003) Parametric alignment of ordered treesBioinformatics 19 2237ndash2245

24 FontanaW KoningsDAM StadlerPF and SchusterP (1993)Statistics of RNA secondary structures Biopolymers 33 1389ndash1404

25 van BatenburgFHD GultyaevAP PleijCWA NgJ andOliehoekJ (2000) PseudoBase a database with RNA pseudoknotsNucleic Acids Res 28 201ndash204

26 GanHH FeraD ZornJ ShiffeldrimN TangM LasersonUKimN and SchlickT (2004) RAG RNA-As-Graphsdatabasemdashconcepts analysis and features Bioinformatics20 1285ndash1291

27 FeraD KimN ShiffeldrimN ZornJ LasersonU GanHH andSchlickT (2004) RAG RNA-As-Graphs web resource BMCBioinformatics 5 88

28 ChenXY KangHS ShenLX ChamorroM VarmusHE andTinocoI (1996) A characteristic bent conformation of RNA pseudoknotspromotes 1 frameshifting during translation of retroviral RNAJ Mol Biol 260 479ndash483

29 KangHS HinesJV and TinocoI (1996) Conformation of anon-frameshifting RNA pseudoknot from mouse mammary tumor virusJ Mol Biol 259 135ndash147

30 GanHH PerlowRA RoyS KoJ WuM HuangJ YanSNicolettaA VafaiJ SunD et al (2002) Analysis of protein sequencestructure similarity relationships Biophys J 83 2781ndash2791

31 FeldenB FlorentzC GiegeR and WesthofE (1994) Solutionstructure of the 30-end of brome mosaic-virus genomic RNAsmdashconformational mimicry with canonical transfer-RNAs J Mol Biol235 508ndash531

32 GrossJL and YellenJ (1999) Graph Theory and its Applications CRCPress Boca Raton FL

33 KoblerJ SchoningU and ToraanJ (1993) The Graph IsomorphismProblem Birkhauser Boston MA

34 RandicM and ZupanJ (2001) On interpretation of well-knowntopological indices J Chem Inform Comput Sci 41 550ndash560

35 ZornJ GanHH ShiffeldrimN and SchlickT (2004) Structuralmotifs in ribosomal RNAs implications for RNA design and genomicsBiopolymers 73 340ndash347

36 JacksT MadhaniHD MasiarzFR and VarmusHE (1988) Signalsfor ribosomal frameshifting in the Rous-Sarcoma virus GagndashPolregion Cell 55 447ndash458

37 GodenyEK ChenL KumarSN MethvenSL KooninEV andBrintonMA (1993) Complete genomic sequence and phylogeneticanalysis of the lactate dehydrogenase-elevating virus (Ldv) Virology194 585ndash596

38 MarczinkeB FisherR VidakovicM BloysAJ and BrierleyI (1998)Secondary structure and mutational analysis of the ribosomalframeshift signal of Rous sarcoma virus J Mol Biol 284 205ndash225

39 MyersEW and MillerW (1988) Optimal alignments in linear spaceComput Appl Biosci 4 11ndash17

40 HuangXQ and MillerW (1991) A time-efficient linear-space localsimilarity algorithm Adv Appl Math 12 337ndash357

41 SmithTF and WatermanMS (1981) Identification of commonmolecular subsequences J Mol Biol 147 195ndash197

42 SongSI SilverSL AulikMA RasochovaL MohanBR andMillerWA (1999) Satellite cereal yellow dwarf virus-RPV (satRPV)RNA requires a DouXble hammerhead for self-cleavage and analternative structure for replication J Mol Biol 293 781ndash793

43 DreherTW and HallTC (1988) Mutational analysis of thetransfer-RNA mimicry of brome mosaic-virus RNAmdashsequence andstructural requirements for aminoacylation and 30-adenylationJ Mol Biol 201 41ndash55

44 KanamoriY and NakashimaN (2001) A tertiary structure model of theinternal ribosome entry site (IRES) for methionine-independent initiationof translation RNA 7 266ndash274

45 WuytsJ De RijkP Van de PeerY PisonG RousseeuwP andDe WachterR (2000)Comparativeanalysisof more than 3000 sequencesreveals the existence of two pseudoknots in area V4 of eukaryotic smallsubunit ribosomal RNA Nucleic Acids Res 28 4698ndash4708

46 Van de PeerY De RijkP WuytsJ WinkelmansT and De WachterR(2000) The European small subunit ribosomal RNA databaseNucleic Acids Res 28 175ndash176

47 ChappellSA EdelmanGM and MauroVP (2000) A 9-nt segment ofa cellular mRNA can function as an internal ribosome entry site (IRES)and when present in linked multiple copies greatly enhances IRESactivity Proc Natl Acad Sci USA 97 1536ndash1541

48 ScheperGC VoormaHO and ThomasAA (1994) Basepairing with18S ribosomal RNA in internal initiation of translation FEBS Lett352 271ndash275

49 BarendsS Rudinger-ThirionJ FlorentzC GiegeR PleijCW andKraalB (2004) tRNA-like structure regulates translation of Bromemosaic virus RNA J Virol 78 4003ndash4010

50 CarterAP ClemonsWM BrodersenDE Morgan-WarrenRJWimberlyBT and RamakrishnanV (2000) Functional insights fromthe structure of the 30S ribosomal subunit and its interactions withantibiotics Nature 407 340ndash348

51 HarmsJ SchluenzenF ZarivachR BashanA GatS AgmonIBartelsH FranceschiF and YonathA (2001) High resolution structureof the large ribosomal subunit from a mesophilic Eubacterium Cell 107679ndash688

52 BanN NissenP HansenJ MoorePB and SteitzTA (2000) Thecomplete atomic structure of the large ribosomal subunit at 24 angstromresolution Science 289 905ndash920

Nucleic Acids Research 2005 Vol 33 No 4 1397

53 CannoneJJ SubramanianS SchnareMN CollettJRDrsquoSouzaLM DuYS FengB LinN MadabusiLV MullerKMet al (2002) The Comparative RNA Web (CRW) Site an online databaseof comparative sequence and structure information for ribosomal intronand other RNAs BMC Bioinformatics 3 2

54 HararyF and PrinsG (1959) The number of homeomorphicallyirreducible trees and other species Acta Math 101141ndash162

55 HararyF (1969) Graph Theory Addison-Wesley Reading Mass

56 VasquezM and ScheragaHA (1985) Use of buildup andenergy-minimization procedures to compute low-energy structures ofthe backbone of enkephalin Biopolymers 24 1437ndash1447

57 ZukerM MathewsDH and TurnerDH (1999) Algorithms andthermodynamics for RNA secondary structure prediction A practicalguide In BarciszewskiJ and ClarkBFC (eds) RNA Biochemistry andBiotechnology Kluwer Academic Publishers Dordrecht pp 11ndash43

58 HofackerIL (2003) Vienna RNA secondary structure serverNucleic Acids Res 31 3429ndash3431

1398 Nucleic Acids Research 2005 Vol 33 No 4

Page 13: doi:10.1093/nar/gki267 Modular RNA architecture revealed ... · Modular RNA architecture revealed by computational analysis of existing pseudoknots and ... Samuela Pasquali, ... RNAs

design Such a build-up approach appears feasible from our(26) and other works (35) and is akin to the lsquobuilduprsquo tech-nique for proteins pioneered by Vasquez and Scheraga (56)

The above estimate of building block number can be eithera consequence of the limited number of available structures orthe fact that RNA adopts only a subset of possible graphsCertainly future discoveries of novel RNA structures mightindicate a larger repertoire of building blocks Still our sur-veys of natural RNAs (26) suggest that RNAs possess specifictopological characteristics and favor a subset of the possiblegraph motifs The missing motifs may not exist in nature forRNA owing to physical or functional constraints

The building block number can also be defined in terms ofirreducible graphs ie trees that cannot be decomposed intosmaller trees However this approach has a major drawbackSince all RNAs are composed of connected stems their sub-structures can be reduced to two vertex graphs which is not aninformative description of complex RNA folds Still if wedefine 4-vertex trees as the smallest irreducible graphs thenumber of irreducible graphs can be calculated Based on oursubmotif data we estimate that for V up to 10 there are about42 irreducible tree modules in rRNAs

SUMMARY AND CONCLUSION

Our automated comparison of RNA secondary structuresbased on graphical representation allows the examination ofconserved 2D RNA topological features which in turn suggestfunctional relationships This approach has an advantage oversequence alignment because functionally related RNAs oftenlack sequence similarity as shown for frameshifting pseudo-knots with low global sequence identities Of course ourstructure alignment approach relies on available 2D RNA

structures and thus will increase in scope as more RNAstructures become available Our eight identified pairs ofpotentially related pseudoknots (Table 1) reveal three pairsthat are functionally related the pseudoknots of 30 end ofbrome mosaic virus genomic RNA (PKB134) and alternativehammerhead ribozyme pseudoknot (PKB173) are templatesfor replicase to initiate replication of viral genomes the pseu-doknots of internal entry site (IRES) of caspid-producingCrPVs (PKB223) and SSU of 18S rRNA (PKB205) are asso-ciated with translation initiation and the pseudoknots of 18SrRNA (PKB205) and viral tRNA-like (PKB134) most prob-ably interact via both structural mimicry and base comple-mentarity To the best of our knowledge these functionalrelationships have not been described previously

The developed RNA comparison algorithm also led us toquantify the modularity of 16S and 23S rRNAs using smallenumerated tree motifs Our results suggest that rRNAs areconstructed from at least 210 distinct tree subtopologies(or modules) having up to 180 nt Significantly we findthat 5S rRNA and two tree modules within 16S and 23SrRNAs with similar topologies have similar tertiary shapesaffirming the relationship between 2D graphical representa-tion and 3D structure The tree modules in rRNA may be usedas building blocks for the design of novel RNA folds A similaridea has been employed to design and improve functionalRNA molecules (510) Design by assembly of modulesexploits motifs present in natural RNAs unlike de novo designof sequences and structures (11) The issues of discoveringnovel topologies (6) building module library and designingfunctional folds are likely to increase in importance withinterest in novel RNAs

Our algorithm has several limitations Of course its utilitydepends on availability of reliable RNA secondary foldingalgorithms (5758) More fundamentally although simplifiedgraphical representations specify topological characteristicsdetailed sequence and motif information is missing An asso-ciated problem is that unusual RNA structurendashfunction rela-tionships due to deleterious mutations as in frameshiftingpseudoknots (28) and structural flexibility are not capturedMore advanced graph constructs such as weighted graphs mayhowever specify detailed motif features at the secondary andbase pair levels Another limitation of the present algorithmis the requirement to fix the probe structure or graph As inprotein structure alignment algorithms (15) future refinementsof our algorithm should allow some flexibility in motifdefinitions as the similarity search proceeds We inviteRNA scientists to visit our RAG web resource at httpmonodbiomathnyuedurna and suggest to us specific enhancementsthat will be useful in practice

ACKNOWLEDGEMENTS

We thank Dinshaw Patel for comments that motivated thiswork This research was supported by a Joint NSFNIGMSInitiative in Mathematical Biology (DMS-0201160) and theHuman Frontier Science Program (HFSP)

REFERENCES

1 TinocoI and BustamanteC (1999) How RNA folds J Mol Biol293 271ndash281

Figure 8 Distribution of all possible distinct tree modules with V vertices foundin Scerevisiae 16S and 23S rRNAs In graph theory each V is represented by aset of possible tree structures For example for V= 4 5 6 7 8 9 and 10 there are2 3 6 11 23 47 and 106 possible trees respectively The distribution of treesquantifies the diversity of tree modules within rRNAs This analysis suggeststhat rRNAs possess at least 210 tree modules

1396 Nucleic Acids Research 2005 Vol 33 No 4

2 GanHH PasqualiS and SchlickT (2003) Exploring the repertoire ofRNA secondary motifs using graph theory implications for RNA designNucleic Acids Res 31 2926ndash2943

3 SoukupGA and BreakerRR (1999) Engineering precision RNAmolecular switches Proc Natl Acad Sci USA 96 3584ndash3589

4 BreakerRR (2002) Engineered allosteric ribozymes as biosensorcomponents Curr Opin Biotechnol 13 31ndash39

5 IkawaY FukadaK WatanabeS ShiraishiH and InoueT (2002)Design construction and analysis of a novel class of self-foldingRNA Structure 10 527ndash534

6 KimN ShiffeldrimN GanHH and SchlickT (2004) Candidatesfor novel RNA topologies J Mol Biol 341 1129ndash1144

7 MitchellJR ChengJ and CollinsK (1999) A box HACAsmall nucleolar RNA-like domain at the human telomerase RNA 30

end Mol Cell Biol 19 567ndash5768 DoudnaJA and CechTR (2002) The chemical repertoire

of natural ribozymes Nature 418 222ndash2289 StorzG (2002) An expanding universe of noncoding RNAs Science

296 1260ndash126310 BreakerRR and JoyceGF (1994) Inventing and improving ribozyme

function rational design versus iterative selection methods TrendsBiotechnol 12 268ndash275

11 AndronescuM FejesAP HutterF HoosHH and CondonA (2004)A new algorithm for RNA secondary structure design J Mol Biol336 607ndash624

12 CarterRJ DubchakI and HolbrookSR (2001) A computationalapproach to identify genes for functional RNAs in genomic sequencesNucleic Acids Res 29 3928ndash3938

13 DoudnaJA (2000) Structural genomics of RNA Nature Struct Biol7 954ndash956

14 RivasE KleinRJ JonesTA and EddySR (2001) Computationalidentification of noncoding RNAs in Ecoli by comparativegenomics Curr Biol 11 1369ndash1373

15 HolmL and SanderC (1996) Mapping the protein universeScience 273 595ndash602

16 ShindyalovIN and BournePE (2001) A database and tools for 3-Dprotein structure comparison and alignment using the CombinatorialExtension (CE) algorithm Nucleic Acids Res 29 228ndash229

17 BenedettiG and MorosettiS (1996) A graph-topological approach torecognition of pattern and similarity in RNA secondary structuresBiophys Chem 59 179ndash184

18 MargalitH ShapiroBA OppenheimAB and MaizelJVJr (1989)Detection of common motifs in RNA secondary structures Nucleic AcidsRes 17 4829ndash4845

19 ShapiroBA and ZhangKZ (1990) Comparing multiple RNAsecondary structures using tree comparisons Comput Appl Biosci6 309ndash318

20 LeSY NussinovR and MaizelJV (1989) Tree graphs of RNAsecondary structures and their comparisons Comput Biomed Res22 461ndash473

21 DuarteCM WadleyLM and PyleAM (2003) RNA structurecomparison motif search and discovery using a reduced representation ofRNA conformational space Nucleic Acids Res 31 4755ndash4761

22 ChevaletC and MichotB (1992) An algorithm for comparing RNAsecondary structures and searching for similar substructures ComputAppl Biosci 8 215ndash225

23 WangL and ZhaoJ (2003) Parametric alignment of ordered treesBioinformatics 19 2237ndash2245

24 FontanaW KoningsDAM StadlerPF and SchusterP (1993)Statistics of RNA secondary structures Biopolymers 33 1389ndash1404

25 van BatenburgFHD GultyaevAP PleijCWA NgJ andOliehoekJ (2000) PseudoBase a database with RNA pseudoknotsNucleic Acids Res 28 201ndash204

26 GanHH FeraD ZornJ ShiffeldrimN TangM LasersonUKimN and SchlickT (2004) RAG RNA-As-Graphsdatabasemdashconcepts analysis and features Bioinformatics20 1285ndash1291

27 FeraD KimN ShiffeldrimN ZornJ LasersonU GanHH andSchlickT (2004) RAG RNA-As-Graphs web resource BMCBioinformatics 5 88

28 ChenXY KangHS ShenLX ChamorroM VarmusHE andTinocoI (1996) A characteristic bent conformation of RNA pseudoknotspromotes 1 frameshifting during translation of retroviral RNAJ Mol Biol 260 479ndash483

29 KangHS HinesJV and TinocoI (1996) Conformation of anon-frameshifting RNA pseudoknot from mouse mammary tumor virusJ Mol Biol 259 135ndash147

30 GanHH PerlowRA RoyS KoJ WuM HuangJ YanSNicolettaA VafaiJ SunD et al (2002) Analysis of protein sequencestructure similarity relationships Biophys J 83 2781ndash2791

31 FeldenB FlorentzC GiegeR and WesthofE (1994) Solutionstructure of the 30-end of brome mosaic-virus genomic RNAsmdashconformational mimicry with canonical transfer-RNAs J Mol Biol235 508ndash531

32 GrossJL and YellenJ (1999) Graph Theory and its Applications CRCPress Boca Raton FL

33 KoblerJ SchoningU and ToraanJ (1993) The Graph IsomorphismProblem Birkhauser Boston MA

34 RandicM and ZupanJ (2001) On interpretation of well-knowntopological indices J Chem Inform Comput Sci 41 550ndash560

35 ZornJ GanHH ShiffeldrimN and SchlickT (2004) Structuralmotifs in ribosomal RNAs implications for RNA design and genomicsBiopolymers 73 340ndash347

36 JacksT MadhaniHD MasiarzFR and VarmusHE (1988) Signalsfor ribosomal frameshifting in the Rous-Sarcoma virus GagndashPolregion Cell 55 447ndash458

37 GodenyEK ChenL KumarSN MethvenSL KooninEV andBrintonMA (1993) Complete genomic sequence and phylogeneticanalysis of the lactate dehydrogenase-elevating virus (Ldv) Virology194 585ndash596

38 MarczinkeB FisherR VidakovicM BloysAJ and BrierleyI (1998)Secondary structure and mutational analysis of the ribosomalframeshift signal of Rous sarcoma virus J Mol Biol 284 205ndash225

39 MyersEW and MillerW (1988) Optimal alignments in linear spaceComput Appl Biosci 4 11ndash17

40 HuangXQ and MillerW (1991) A time-efficient linear-space localsimilarity algorithm Adv Appl Math 12 337ndash357

41 SmithTF and WatermanMS (1981) Identification of commonmolecular subsequences J Mol Biol 147 195ndash197

42 SongSI SilverSL AulikMA RasochovaL MohanBR andMillerWA (1999) Satellite cereal yellow dwarf virus-RPV (satRPV)RNA requires a DouXble hammerhead for self-cleavage and analternative structure for replication J Mol Biol 293 781ndash793

43 DreherTW and HallTC (1988) Mutational analysis of thetransfer-RNA mimicry of brome mosaic-virus RNAmdashsequence andstructural requirements for aminoacylation and 30-adenylationJ Mol Biol 201 41ndash55

44 KanamoriY and NakashimaN (2001) A tertiary structure model of theinternal ribosome entry site (IRES) for methionine-independent initiationof translation RNA 7 266ndash274

45 WuytsJ De RijkP Van de PeerY PisonG RousseeuwP andDe WachterR (2000)Comparativeanalysisof more than 3000 sequencesreveals the existence of two pseudoknots in area V4 of eukaryotic smallsubunit ribosomal RNA Nucleic Acids Res 28 4698ndash4708

46 Van de PeerY De RijkP WuytsJ WinkelmansT and De WachterR(2000) The European small subunit ribosomal RNA databaseNucleic Acids Res 28 175ndash176

47 ChappellSA EdelmanGM and MauroVP (2000) A 9-nt segment ofa cellular mRNA can function as an internal ribosome entry site (IRES)and when present in linked multiple copies greatly enhances IRESactivity Proc Natl Acad Sci USA 97 1536ndash1541

48 ScheperGC VoormaHO and ThomasAA (1994) Basepairing with18S ribosomal RNA in internal initiation of translation FEBS Lett352 271ndash275

49 BarendsS Rudinger-ThirionJ FlorentzC GiegeR PleijCW andKraalB (2004) tRNA-like structure regulates translation of Bromemosaic virus RNA J Virol 78 4003ndash4010

50 CarterAP ClemonsWM BrodersenDE Morgan-WarrenRJWimberlyBT and RamakrishnanV (2000) Functional insights fromthe structure of the 30S ribosomal subunit and its interactions withantibiotics Nature 407 340ndash348

51 HarmsJ SchluenzenF ZarivachR BashanA GatS AgmonIBartelsH FranceschiF and YonathA (2001) High resolution structureof the large ribosomal subunit from a mesophilic Eubacterium Cell 107679ndash688

52 BanN NissenP HansenJ MoorePB and SteitzTA (2000) Thecomplete atomic structure of the large ribosomal subunit at 24 angstromresolution Science 289 905ndash920

Nucleic Acids Research 2005 Vol 33 No 4 1397

53 CannoneJJ SubramanianS SchnareMN CollettJRDrsquoSouzaLM DuYS FengB LinN MadabusiLV MullerKMet al (2002) The Comparative RNA Web (CRW) Site an online databaseof comparative sequence and structure information for ribosomal intronand other RNAs BMC Bioinformatics 3 2

54 HararyF and PrinsG (1959) The number of homeomorphicallyirreducible trees and other species Acta Math 101141ndash162

55 HararyF (1969) Graph Theory Addison-Wesley Reading Mass

56 VasquezM and ScheragaHA (1985) Use of buildup andenergy-minimization procedures to compute low-energy structures ofthe backbone of enkephalin Biopolymers 24 1437ndash1447

57 ZukerM MathewsDH and TurnerDH (1999) Algorithms andthermodynamics for RNA secondary structure prediction A practicalguide In BarciszewskiJ and ClarkBFC (eds) RNA Biochemistry andBiotechnology Kluwer Academic Publishers Dordrecht pp 11ndash43

58 HofackerIL (2003) Vienna RNA secondary structure serverNucleic Acids Res 31 3429ndash3431

1398 Nucleic Acids Research 2005 Vol 33 No 4

Page 14: doi:10.1093/nar/gki267 Modular RNA architecture revealed ... · Modular RNA architecture revealed by computational analysis of existing pseudoknots and ... Samuela Pasquali, ... RNAs

2 GanHH PasqualiS and SchlickT (2003) Exploring the repertoire ofRNA secondary motifs using graph theory implications for RNA designNucleic Acids Res 31 2926ndash2943

3 SoukupGA and BreakerRR (1999) Engineering precision RNAmolecular switches Proc Natl Acad Sci USA 96 3584ndash3589

4 BreakerRR (2002) Engineered allosteric ribozymes as biosensorcomponents Curr Opin Biotechnol 13 31ndash39

5 IkawaY FukadaK WatanabeS ShiraishiH and InoueT (2002)Design construction and analysis of a novel class of self-foldingRNA Structure 10 527ndash534

6 KimN ShiffeldrimN GanHH and SchlickT (2004) Candidatesfor novel RNA topologies J Mol Biol 341 1129ndash1144

7 MitchellJR ChengJ and CollinsK (1999) A box HACAsmall nucleolar RNA-like domain at the human telomerase RNA 30

end Mol Cell Biol 19 567ndash5768 DoudnaJA and CechTR (2002) The chemical repertoire

of natural ribozymes Nature 418 222ndash2289 StorzG (2002) An expanding universe of noncoding RNAs Science

296 1260ndash126310 BreakerRR and JoyceGF (1994) Inventing and improving ribozyme

function rational design versus iterative selection methods TrendsBiotechnol 12 268ndash275

11 AndronescuM FejesAP HutterF HoosHH and CondonA (2004)A new algorithm for RNA secondary structure design J Mol Biol336 607ndash624

12 CarterRJ DubchakI and HolbrookSR (2001) A computationalapproach to identify genes for functional RNAs in genomic sequencesNucleic Acids Res 29 3928ndash3938

13 DoudnaJA (2000) Structural genomics of RNA Nature Struct Biol7 954ndash956

14 RivasE KleinRJ JonesTA and EddySR (2001) Computationalidentification of noncoding RNAs in Ecoli by comparativegenomics Curr Biol 11 1369ndash1373

15 HolmL and SanderC (1996) Mapping the protein universeScience 273 595ndash602

16 ShindyalovIN and BournePE (2001) A database and tools for 3-Dprotein structure comparison and alignment using the CombinatorialExtension (CE) algorithm Nucleic Acids Res 29 228ndash229

17 BenedettiG and MorosettiS (1996) A graph-topological approach torecognition of pattern and similarity in RNA secondary structuresBiophys Chem 59 179ndash184

18 MargalitH ShapiroBA OppenheimAB and MaizelJVJr (1989)Detection of common motifs in RNA secondary structures Nucleic AcidsRes 17 4829ndash4845

19 ShapiroBA and ZhangKZ (1990) Comparing multiple RNAsecondary structures using tree comparisons Comput Appl Biosci6 309ndash318

20 LeSY NussinovR and MaizelJV (1989) Tree graphs of RNAsecondary structures and their comparisons Comput Biomed Res22 461ndash473

21 DuarteCM WadleyLM and PyleAM (2003) RNA structurecomparison motif search and discovery using a reduced representation ofRNA conformational space Nucleic Acids Res 31 4755ndash4761

22 ChevaletC and MichotB (1992) An algorithm for comparing RNAsecondary structures and searching for similar substructures ComputAppl Biosci 8 215ndash225

23 WangL and ZhaoJ (2003) Parametric alignment of ordered treesBioinformatics 19 2237ndash2245

24 FontanaW KoningsDAM StadlerPF and SchusterP (1993)Statistics of RNA secondary structures Biopolymers 33 1389ndash1404

25 van BatenburgFHD GultyaevAP PleijCWA NgJ andOliehoekJ (2000) PseudoBase a database with RNA pseudoknotsNucleic Acids Res 28 201ndash204

26 GanHH FeraD ZornJ ShiffeldrimN TangM LasersonUKimN and SchlickT (2004) RAG RNA-As-Graphsdatabasemdashconcepts analysis and features Bioinformatics20 1285ndash1291

27 FeraD KimN ShiffeldrimN ZornJ LasersonU GanHH andSchlickT (2004) RAG RNA-As-Graphs web resource BMCBioinformatics 5 88

28 ChenXY KangHS ShenLX ChamorroM VarmusHE andTinocoI (1996) A characteristic bent conformation of RNA pseudoknotspromotes 1 frameshifting during translation of retroviral RNAJ Mol Biol 260 479ndash483

29 KangHS HinesJV and TinocoI (1996) Conformation of anon-frameshifting RNA pseudoknot from mouse mammary tumor virusJ Mol Biol 259 135ndash147

30 GanHH PerlowRA RoyS KoJ WuM HuangJ YanSNicolettaA VafaiJ SunD et al (2002) Analysis of protein sequencestructure similarity relationships Biophys J 83 2781ndash2791

31 FeldenB FlorentzC GiegeR and WesthofE (1994) Solutionstructure of the 30-end of brome mosaic-virus genomic RNAsmdashconformational mimicry with canonical transfer-RNAs J Mol Biol235 508ndash531

32 GrossJL and YellenJ (1999) Graph Theory and its Applications CRCPress Boca Raton FL

33 KoblerJ SchoningU and ToraanJ (1993) The Graph IsomorphismProblem Birkhauser Boston MA

34 RandicM and ZupanJ (2001) On interpretation of well-knowntopological indices J Chem Inform Comput Sci 41 550ndash560

35 ZornJ GanHH ShiffeldrimN and SchlickT (2004) Structuralmotifs in ribosomal RNAs implications for RNA design and genomicsBiopolymers 73 340ndash347

36 JacksT MadhaniHD MasiarzFR and VarmusHE (1988) Signalsfor ribosomal frameshifting in the Rous-Sarcoma virus GagndashPolregion Cell 55 447ndash458

37 GodenyEK ChenL KumarSN MethvenSL KooninEV andBrintonMA (1993) Complete genomic sequence and phylogeneticanalysis of the lactate dehydrogenase-elevating virus (Ldv) Virology194 585ndash596

38 MarczinkeB FisherR VidakovicM BloysAJ and BrierleyI (1998)Secondary structure and mutational analysis of the ribosomalframeshift signal of Rous sarcoma virus J Mol Biol 284 205ndash225

39 MyersEW and MillerW (1988) Optimal alignments in linear spaceComput Appl Biosci 4 11ndash17

40 HuangXQ and MillerW (1991) A time-efficient linear-space localsimilarity algorithm Adv Appl Math 12 337ndash357

41 SmithTF and WatermanMS (1981) Identification of commonmolecular subsequences J Mol Biol 147 195ndash197

42 SongSI SilverSL AulikMA RasochovaL MohanBR andMillerWA (1999) Satellite cereal yellow dwarf virus-RPV (satRPV)RNA requires a DouXble hammerhead for self-cleavage and analternative structure for replication J Mol Biol 293 781ndash793

43 DreherTW and HallTC (1988) Mutational analysis of thetransfer-RNA mimicry of brome mosaic-virus RNAmdashsequence andstructural requirements for aminoacylation and 30-adenylationJ Mol Biol 201 41ndash55

44 KanamoriY and NakashimaN (2001) A tertiary structure model of theinternal ribosome entry site (IRES) for methionine-independent initiationof translation RNA 7 266ndash274

45 WuytsJ De RijkP Van de PeerY PisonG RousseeuwP andDe WachterR (2000)Comparativeanalysisof more than 3000 sequencesreveals the existence of two pseudoknots in area V4 of eukaryotic smallsubunit ribosomal RNA Nucleic Acids Res 28 4698ndash4708

46 Van de PeerY De RijkP WuytsJ WinkelmansT and De WachterR(2000) The European small subunit ribosomal RNA databaseNucleic Acids Res 28 175ndash176

47 ChappellSA EdelmanGM and MauroVP (2000) A 9-nt segment ofa cellular mRNA can function as an internal ribosome entry site (IRES)and when present in linked multiple copies greatly enhances IRESactivity Proc Natl Acad Sci USA 97 1536ndash1541

48 ScheperGC VoormaHO and ThomasAA (1994) Basepairing with18S ribosomal RNA in internal initiation of translation FEBS Lett352 271ndash275

49 BarendsS Rudinger-ThirionJ FlorentzC GiegeR PleijCW andKraalB (2004) tRNA-like structure regulates translation of Bromemosaic virus RNA J Virol 78 4003ndash4010

50 CarterAP ClemonsWM BrodersenDE Morgan-WarrenRJWimberlyBT and RamakrishnanV (2000) Functional insights fromthe structure of the 30S ribosomal subunit and its interactions withantibiotics Nature 407 340ndash348

51 HarmsJ SchluenzenF ZarivachR BashanA GatS AgmonIBartelsH FranceschiF and YonathA (2001) High resolution structureof the large ribosomal subunit from a mesophilic Eubacterium Cell 107679ndash688

52 BanN NissenP HansenJ MoorePB and SteitzTA (2000) Thecomplete atomic structure of the large ribosomal subunit at 24 angstromresolution Science 289 905ndash920

Nucleic Acids Research 2005 Vol 33 No 4 1397

53 CannoneJJ SubramanianS SchnareMN CollettJRDrsquoSouzaLM DuYS FengB LinN MadabusiLV MullerKMet al (2002) The Comparative RNA Web (CRW) Site an online databaseof comparative sequence and structure information for ribosomal intronand other RNAs BMC Bioinformatics 3 2

54 HararyF and PrinsG (1959) The number of homeomorphicallyirreducible trees and other species Acta Math 101141ndash162

55 HararyF (1969) Graph Theory Addison-Wesley Reading Mass

56 VasquezM and ScheragaHA (1985) Use of buildup andenergy-minimization procedures to compute low-energy structures ofthe backbone of enkephalin Biopolymers 24 1437ndash1447

57 ZukerM MathewsDH and TurnerDH (1999) Algorithms andthermodynamics for RNA secondary structure prediction A practicalguide In BarciszewskiJ and ClarkBFC (eds) RNA Biochemistry andBiotechnology Kluwer Academic Publishers Dordrecht pp 11ndash43

58 HofackerIL (2003) Vienna RNA secondary structure serverNucleic Acids Res 31 3429ndash3431

1398 Nucleic Acids Research 2005 Vol 33 No 4

Page 15: doi:10.1093/nar/gki267 Modular RNA architecture revealed ... · Modular RNA architecture revealed by computational analysis of existing pseudoknots and ... Samuela Pasquali, ... RNAs

53 CannoneJJ SubramanianS SchnareMN CollettJRDrsquoSouzaLM DuYS FengB LinN MadabusiLV MullerKMet al (2002) The Comparative RNA Web (CRW) Site an online databaseof comparative sequence and structure information for ribosomal intronand other RNAs BMC Bioinformatics 3 2

54 HararyF and PrinsG (1959) The number of homeomorphicallyirreducible trees and other species Acta Math 101141ndash162

55 HararyF (1969) Graph Theory Addison-Wesley Reading Mass

56 VasquezM and ScheragaHA (1985) Use of buildup andenergy-minimization procedures to compute low-energy structures ofthe backbone of enkephalin Biopolymers 24 1437ndash1447

57 ZukerM MathewsDH and TurnerDH (1999) Algorithms andthermodynamics for RNA secondary structure prediction A practicalguide In BarciszewskiJ and ClarkBFC (eds) RNA Biochemistry andBiotechnology Kluwer Academic Publishers Dordrecht pp 11ndash43

58 HofackerIL (2003) Vienna RNA secondary structure serverNucleic Acids Res 31 3429ndash3431

1398 Nucleic Acids Research 2005 Vol 33 No 4