Conformational Analysis of Invariant Peptide Sequences in Bacterial Genomes

Tulika Prakash, C. Ramakrishnan, Debasis Dash and Samir K. Brahmachari*

G.N.R. Knowledge Centre for Genome Informatics, Institute of Genomics and Integrative Biology (Formerly Centre for Biochemical Technology), CSIR, Mall Road, Delhi 110007, India

*Corresponding author. E-mail address of the corresponding author: [email protected]

J. Mol. Biol. (2005) 345, 937–955. doi:10.1016/j.jmb.2004.11.008. 0022-2836/$ - see front matter © 2004 Elsevier Ltd. All rights reserved.

The functional significance of evolutionarily conserved motifs/patterns of short regions in proteins is well documented. Although a large number of sequences are conserved, only a small fraction of these are invariant across several organisms. Here, we have examined the structural features of the functionally important peptide sequences that have been found invariant across diverse bacterial genera. Ramachandran angles (φ,ψ) have been used to analyze the conformation, folding patterns and geometrical location (buried/exposed) of these invariant peptides in different crystal structures harboring these sequences. The analysis indicates that the peptides preferred a single conformation in different protein structures, with the exception of only a few longer peptides that exhibited some conformational variability. In addition, the variability of conformation occurs mainly due to flipping of peptide units about the virtual Cα–Cα bond. However, for a given invariant peptide, the folding patterns are found to be similar in almost all the cases. Moreover, such peptides are found to be buried in the protein core. Thus, we can safely conclude that these invariant peptides are structurally important for the proteins, since they acquire unique structures across different proteins and can act as structural determinants (SD) of the proteins. The location of these SD peptides on the protein chain indicated that most of them are clustered towards the N-terminal and middle region of the protein, with the C-terminal region exhibiting low preference. Another feature that emerges from this study is that some of these SD peptides can also play the roles of "fold boundaries" or "hinge nucleus" in the protein structure. The study indicates that these SD peptides may act as chain-reversal signatures, guiding the proteins to adopt appropriate folds. In some cases the invariant signature peptides may also act as folding nuclei (FN) of the proteins.

Keywords: invariant peptides; peptide flip; signatures: functional and chain reversal; structural determinants

Abbreviations used: SD, structural determinants; FS, functional signatures; IS, invariant signatures; PDB, Protein Data Bank; RMSD, root-mean-square deviation; FB, fold boundary; HN, hinge nuclei; FN, folding nucleus.

Introduction

The advances in high-throughput techniques for sequencing genomes, together with the development of a large number of sophisticated approaches for extracting functional information about these sequences, are an important breakthrough in functional genomics. Most of the methods for function elucidation of proteins are based on the criterion of homology of the newly obtained sequence with the sequences in the public domain databases.1 In fact, every protein sequence can be divided into two types of regions: (i) those that are conserved and share homology with other sequences, and (ii) those that are unique to that protein sequence and are not homologous to other proteins. In the light of this, it will be interesting to find the regions in proteins that are allowed to vary without loss of function, those that remain strictly conserved through evolution, and the roles, if any, that these invariant regions play in proteins.



In many instances, the invariant regions are found to constitute the active site of the protein. Conserved sequences in homologous proteins have thus generally been found to be regions directly involved in function.2 In addition, some residues in other binding sites of several proteins are also found to be conserved. Such regions are known to affect the active site indirectly and hence the function of the protein. These conserved regions are sometimes referred to as the structural determinants (SD) of the proteins.3 It has also been demonstrated that the residues involved in the folding nuclei of the proteins are evolutionarily conserved.4 However, significant controversy exists in this regard in the literature.5

Analyzing the role of such invariant regions in proteins using in silico methods is a challenging task and was almost impossible before the availability of a large repository of completed genomes. There have been several attempts to address this problem. To mention a few of many such approaches, the motif/pattern databases like PROSITE,6 PRINTS,7 Pfam,8 BLOCKS,9 CoPS10 etc. have provided a comprehensive resource of such conserved regions. These databases are used for suggesting the function of proteins harboring the motifs/patterns defined therein. However, why some unique sequences are invariant in proteins across different organisms has not been addressed so far. Also, the structural roles of such invariant regions, if any, have not been discussed.

In one of our earlier studies we demonstrated the functional significance of the invariant peptides identified from a set of completely sequenced genomes.10 In the course of that analysis, we observed that only a very small set of sequences shares identity among several organisms chosen from diverse bacterial genera. In addition, we established the roles of these invariant peptides as functional signatures (FS) of proteins. This prompted us to undertake an in-depth analysis of the structural features and roles, other than direct involvement in function, of these peptide sequences. Here, we present a systematic and extensive approach towards analyzing the structural features and the roles of a limited set of such invariant peptides, present in seven bacterial genomes, which are of significant functional relevance.

Results

In order to examine whether the 69 invariant signature (IS) peptides (for the definition of IS peptides, see Materials and Methods), which occur invariantly across seven different organisms and have lengths ranging from eight to 17 amino acid residues, adopt any specialized structural features, these were searched in the Protein Data Bank11 (PDB; release April 2004). This showed that examples could be found for 46 of these peptides, which mapped on 26 different proteins (Table 1). The codes for these IS peptides and their amino acid sequences are given in columns 2 and 3, respectively, of Table 1. These 46 peptides were found to occur in 194 protein crystal structures. Six of the 46 peptides showed only one hit; each of these peptide sequences, although conserved across the seven organisms, was found to occur only once in the PDB. The remaining 40 peptides exhibited multiple hits. Different types of situations occur in these examples. In some cases, multiple hits arise from crystallographically different chains with identical sequence in the same structure. In a few other cases, the peptides occur in proteins that are highly homologous. In several cases, however, the hits occur in proteins from different organisms with varying GC content, which have variable percentages of sequence homology. The number of hits for each of these peptides is given in column 4 of Table 1 and it can be seen that it varies widely, from as low as one to as high as 123. It should be noted that a large number of hits (and consequently proteins) does not mean that they arise from functionally different proteins. For example, in the case of the largest number of hits, viz. 123 for gap.1.9, which arise from 35 different protein entries, all the examples are from glyceraldehyde-3-phosphate dehydrogenase from various sources. In fact, the situation is the same for every peptide in Table 1. Though the examples that have either a single hit or a small number of hits are not statistically significant, these are included in the Table for complete coverage of the information. As will be seen in the next section, if the cases with a small number of hits exhibit different conformations, they are worth discussing in view of the rapid growth of the Protein Data Bank.
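The search step described above amounts to locating exact occurrences of each IS peptide in the sequences of PDB chains. The following is a minimal sketch of such a search, not the authors' actual procedure: the function name `find_peptide_hits` and the chain sequences are hypothetical toy data, and the reported positions are 1-based indices into the sequence string rather than PDB residue numbers.

```python
def find_peptide_hits(peptide, chains):
    """Return (chain_id, start, end) for every exact occurrence of `peptide`.

    `chains` maps a chain identifier to its one-letter amino acid sequence.
    Positions are 1-based indices into the sequence string (an assumption;
    real PDB residue numbering may differ).
    """
    hits = []
    for chain_id, seq in chains.items():
        pos = seq.find(peptide)
        while pos != -1:
            hits.append((chain_id, pos + 1, pos + len(peptide)))
            pos = seq.find(peptide, pos + 1)  # continue after this occurrence
    return hits


# Toy sequences for illustration only; these are not real PDB entries.
chains = {
    "toyA": "MKTGEPGVGKTALLA",
    "toyB": "MSLSGGQQQRVAIAR",
}
hits = find_peptide_hits("GEPGVGKTA", chains)  # clp.1.9 sequence from Table 1
print(hits)
```

Counting the entries returned for one peptide gives its number of hits in the sense of column 4 of Table 1.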

Conformational strings for the functional signature peptides

The main thrust of this work being conformational analysis, the conformational strings for these peptides, as defined in Materials and Methods, were examined. The occurrence of conformational string identity was looked for in those cases having more than one hit. The number of such distinct conformations and the conformational string for each of these peptides are given in columns 5 and 6, respectively, of Table 1. For a meaningful analysis, it was necessary to eliminate the redundancy in the multiple hits of any IS peptide and arrive at representative examples through a logical selection process. For this purpose, in those cases where there were multiple hits with identical conformational string, the one with the best resolution was selected. If this still resulted in degeneracy, then the example corresponding to the structure of the protein in the native state was selected. If the degeneracy was due to multiple chains, the first chain reported was chosen. The outcome of this exercise is given as the representative PDB code with chain identifier in column 7 of Table 1, along with the residue numbers corresponding to the beginning and end of the peptide segments (columns 8 and 9).

Table 1. Conformational details of IS peptides found as a part in protein crystal structures

S. No. | IS peptide code(a) | Amino acid sequence | Number of hits | Number of distinct conformations | Conformational string* and number of examples (within parenthesis) | Representative PDB code and chain ID | Start residue of segment | End residue of segment | Secondary structural string(c)

ABC transporter, ATP binding protein(b)
1 | ABCtr.1.8 | DEPTSALD | 1 | 1 | ELRRRREU (1) | 1B0U-A | 178 | 185 | BUHHHHUU
2 | ABCtr.4.8 | LSGGQQQR | 14 | 1 | EERRRRRR (14) | 1B0U-A | 154 | 161 | UUHHHHHH
3 | ABCtrH.1.8 | QRVAIARA | 7 | 1 | RRRRRRRR (7) | 1L2T-A | 152 | 159 | HHHHHHHH
4 | ABCtrH.2.8 | LGPSGCGK | 2 | 1 | EEEELRUR (2) | 1G29-1 | 35 | 42 | BBBBUUUH
5 | ABCtrH.3.8 | SGSGKSTL | 1 | 1 | UURURRRR (1) | 1MT0-A | 504 | 511 | UUUUHHHH
6 | ABCtrH.4.8 | NGAGKSTL | 2 | 1 | EURURRRR (2) | 1L7V-C | 35 | 42 | UUUUHHHH

ATP synthase beta chain (atpD)(b)
7 | atpD.1.11 | LFGGAGVGKTV | 31 | 3 | EEE*E*ELRL*RRR (27) | 1E1R-D | 154 | 164 | BBUUUUUUHHH
  |  |  |  |  | EEUUEUEURRR (3) | 1BMF-E | 154 | 164 | BBUUUUUUHHH
  |  |  |  |  | EEUUELULRRR (1) | 1E1R-E | 154 | 164 | BBUUUUUUHHH
8 | atpD.2.15 | SVFAGVGERTREGND | 29 | 4 | EEEEEELUERRRRRR (18) | 1H8E-D | 181 | 195 | BBBBBBUUUHHHHHH
  |  |  |  |  | EEEEEELEERRRRRR (9) | 1H8E-E | 181 | 195 | BBBBBBUUUHHHHHH
  |  |  |  |  | EEEEEEUUE*RRRRRR (1) | 1MAB-B | 181 | 195 | BBBBBBUUUHHHHHH
  |  |  |  |  | EEEEEELEURRRRRR (1) | 1SKY-E | 183 | 197 | BBBBBBLUUHHHHHH
9 | atpD.3.11 | PSAVGYQPTLA | 28 | 2 | EUELLEERRRR (27) | 1H8E-D | 276 | 286 | UUUUUUUHHHH
  |  |  |  |  | E*UE*L*L*UEUURR (1) | 1MAB-B | 276 | 286 | UUUUUUUUUHH

ATP-dependent Clp protease(b)
10 | clp.1.9 | GEPGVGKTA | 4 | 1 | EEELRURRR (4) | 1JBK-A | 206 | 214 | BBBUUUHHH
11 | clp.2.11 | LPDKAIDLIDE | 3 | 1 | URRRRRRRRRR (3) | 1QVR-A | 377 | 387 | UHHHHHHHHHH

Translation elongation factor G(b)
12 | efG.1.10 | IDTPGHVDFT | 6 | 3 | EEEURERRE*R (2) | 2EFG-A | 82 | 91 | BBBUUUUUUH
  |  |  |  |  | EEEUREEUER (1) | 1DAR | 82 | 91 | BBBUUUUUUH
  |  |  |  |  | UUERLUUUUR (3) | 1ELO | 82 | 91 | UUUUUUUUUU
13 | fusA.1.10 | AHIDAGKTTT | 6 | 2 | EEE*LRURRRR (3) | 1DAR | 19 | 28 | BBUUUUHHHH
  |  |  |  |  | UERRRURRRR (3) | 1ELO | 19 | 28 | UUUUUUHHHH

Cell division protein(b)
14 | ftsH.1.11 | GPPGTGKTLLA | 1 | 1 | UEULRLRRRRR (1) | 1LV7-A | 195 | 205 | UUULULHHHHH

Glyceraldehyde-3-phosphate dehydrogenase(b)
15 | gap.1.9 | INGFGRIGR | 123 | 5 | EEUE*URRRR (111) | 1GD1-O | 5 | 13 | BBUUUHHHH
  |  |  |  |  | EEUEEURRR (6) | 1GGA-A | 6 | 14 | BBUUUUHHH
  |  |  |  |  | EE*UE*URRRR (4) | 4GPD-1 | 5 | 13 | UUUUUHHHH
  |  |  |  |  | EELEUURRU (1) | 1GPD-G | 5 | 13 | BBUUUUUUU
  |  |  |  |  | EEUEURRRR (1) | 1GPD-R | 5 | 13 | BBUUUHHHH

Serine hydroxymethyltransferase(b)
16 | glyA.1.9 | TNKYAEGYP | 13 | 1 | RREEE*E*UEU (13) | 1KKJ-A | 52 | 60 | HHUUUUUUU

GroEL, class I heat shock protein (60 kDa chaperonin)(b)
17 | groel.1.9 | AGDGTTTAT | 113 | 2 | RLELRRRRR (106) | 1KP8-A | 85 | 93 | ULULHHHHH
  |  |  |  |  | UURRRRRRR (7) | 1IOK-A | 85 | 93 | UUHHHHHHH

GTP-binding protein(b)
18 | gtpH.1.11 | GIVGLPNVGKS | 2 | 1 | EEEE*E*RE*R*L*RR (2) | 1JAL-A | 6 | 16 | BBBUUUUUUHH

DNA gyrase subunit A(b)
19 | gyrA.1.10 | RDGLKPVHRR | 1 | 1 | RRLEERRRRR (1) | 1AB4 | 38 | 47 | UUUUUHHHHH

DNA gyrase subunit B(b)
20 | gyrB.1.10 | VRKRPGMYIG | 4 | 2 | RRRE*RRRRRU (2) | 1AJ6 | 19 | 28 | HHHUHHHHHU
  |  |  |  |  | RRRERRRRRU (2) | 1EI1-A | 19 | 28 | HHHUHHHHHU
21 | gyrB.2.8 | SGGLHGVG | 5 | 4 | EREURLUR (2) | 1EI1-A | 112 | 119 | UUUUUUUH
  |  |  |  |  | URUEELRU (1) | 1AJ6 | 112 | 119 | UUUUUUUU
  |  |  |  |  | ELUEUURU (1) | 1KIJ-A | 111 | 118 | UUUUUUUU
  |  |  |  |  | EUUEUURU (1) | 1KIJ-B | 111 | 118 | BUUUUUUU
22 | gyrB.3.8 | LHAGGKFD | 2 | 1 | ELEURURE (2) | 1EI1-B | 498 | 505 | UUUUUUUU

DnaK protein (heat shock protein 70)(b)
23 | hsp.1.9 | GIDLGTTNS | 1 | 1 | EEEEERREE (1) | 1DKG-D | 6 | 14 | BBBBBUUBB
24 | hsp.2.8 | DLGGGTFD | 19 | 2 | EEURUEEE (18) | 1BA1 | 199 | 206 | BBUUUBBB
  |  |  |  |  | EEELUEEE (1) | 1HX1-A | 199 | 206 | BBBUUBBB

GTP-binding protein(b)
25 | lepAH.1.8 | ERERGITI | 4 | 1 | RRRRLRER (4) | 1F60-A | 66 | 73 | HHHHUUUU

Lon ATP dependent protease(b)
26 | lonH.1.8 | GPPGVGKT | 10 | 1 | UEELRURR (10) | 1IXZ-A | 196 | 203 | UUULULHH

S-Adenosylmethionine synthetase(b)
27 | metk.1.13 | GLTGRKIIVDTYG | 16 | 3 | EEELRURRRRRUL (14) | 1RG9-A | 240 | 252 | BBBLUUHHHHHUL
  |  |  |  |  | EEULRERRRRELU (1) | 1FUG-A | 240 | 252 | UUUUUUHHHHUUU
  |  |  |  |  | UUELUERRRRREU (1) | 1FUG-B | 240 | 252 | UUUUUUHHHHHUU

Cytosolic aminopeptidase(b)
28 | pepA.1.8 | NTDAEGRL | 20 | 1 | ERRE*RRRR (20) | 1LAM | 330 | 337 | UUUUHHHH

Ribosomal protein(b)
29 | rpl14.1.8 | GTRIFGPV | 1 | 1 | EREEEUEE (1) | 1WHI | 95 | 102 | UUUUUUUU

RNA polymerase, beta prime subunit(b)
30 | rpoC.1.8 | NLLGKRVD | 4 | 2 | URUUEEUE (2) | 1HQM-D | 617 | 624 | UUUUUUUU
  |  |  |  |  | RRRUEEER (2) | 1IW7-D | 617 | 624 | HHHUUUUU
31 | rpoC.2.9 | LLNRAPTLH | 4 | 2 | EEEEREEEU (2) | 1I6V-D | 701 | 709 | BBBBUUUUU
  |  |  |  |  | EEEEREERE (2) | 1IW7-D | 701 | 709 | BBBBUUUUU
32 | rpoC.3.12 | NADFDGDQMAVH | 4 | 2 | LEEERUEEEEUU (2) | 1HQM-D | 737 | 748 | UUUUUUBBBBUU
  |  |  |  |  | EEEURUEEEEEE (2) | 1IW7-D | 737 | 748 | BBBUUUBBBBBB
33 | rpoC.4.13 | AAQSIGEPGTQLT | 2 | 1 | RRRRRRRRRRREE* (2) | 1IW7-D | 1225 | 1237 | HHHHHHHHHHHUU

30S Ribosomal protein S12(b)
34 | rps12.1.8 | KPNSALRK | 14 | 1 | UR*E*EUEEE (14) | 1N32-L | 47 | 54 | UUUUUUUU
35 | rps12.2.8 | GHNLQEHS | 14 | 3 | EE*UEE*RUE (12) | 1N32-L | 74 | 81 | UUUUUUUB
  |  |  |  |  | UEUEE*RUE* (1) | 1I94-L | 74 | 81 | UUUUUUUU
  |  |  |  |  | EE*UEE*RUR (1) | 1FJG-L | 74 | 81 | UUUUUUUU

Pre-protein translocase SecA subunit(b)
36 | secA.1.8 | GFDYLRDN | 6 | 1 | RRRRRRRR* (6) | 1NKT-A | 182 | 189 | HHHHHHHH
37 | secA.2.13 | LIDEARTPLIISG | 6 | 1 | RRRR*R*REEEEEEE (6) | 1NKT-A | 214 | 226 | HHHHHHBBBBBBU
38 | secA.3.12 | ESRRIDNQLRGR | 6 | 1 | RERRRRRRRRRR (6) | 1NKT-A | 559 | 570 | UUHHHHHHHHHH

Pseudouridine synthase(b)
39 | sfhB.1.10 | TGRTHQIRVH | 2 | 1 | EE*EE*LRRRRU (2) | 1PRZ-A | 230 | 239 | UUUULHHHHU

RNA polymerase, sigma factor(b)
40 | sigrp.1.10 | KFSTYATWWI | 5 | 1 | ERRRRRRRRR (5) | 1IW7-F | 234 | 243 | UHHHHHHHHH

Elongation factor (EF-Tu)(b)
41 | tuf.2.17 | KNMITGAAQMDGAILVV | 21 | 7 | RRRRRLREEEREEEEEE (11) | 1D8T-A | 89 | 105 | HHHHUUUUUUUUBBBBB
  |  |  |  |  | RRRRRRRRREREEEEEE (5) | 1EXM-A | 90 | 106 | HHHHHHHHHUUBBBBBB
  |  |  |  |  | RRRRRURUUUREEEEEE (1) | 1EFC-A | 89 | 105 | HHHHHUUUUUUBBBBBB
  |  |  |  |  | RRRRRUREEEREEEEEE (1) | 1EFC-B | 89 | 105 | HHHHHUUUUUUBBBBBB
  |  |  |  |  | RRRRRURUEEREEEEEE (1) | 1EFU-A | 89 | 105 | HHHHHUUUUUUBBBBBB
  |  |  |  |  | RRRRRLEULEREEEEEE (1) | 1EFU-C | 89 | 105 | HHHHHUUUUUUBBBBBB
  |  |  |  |  | RRRRRUUUUEREEEEEE (1) | 1ETU | 89 | 105 | HHHHHUUUUUUBBBBBB
42 | tuf.3.9 | IREGGRTVG | 20 | 2 | EEELLEERE (17) | 1EXM-A | 388 | 396 | BBBUUUUUU
  |  |  |  |  | EEEUR*EERE (3) | 1HA3-A | 388 | 396 | BBBUUUUUB

Excision nuclease subunit B (uvrB)(b)
43 | uvrB.1.10 | GINLLREGLD | 4 | 2 | ERERUEELR*E (2) | 1C4O-A | 496 | 505 | BUUUUUUUUU
  |  |  |  |  | ERERREE*L*RE (2) | 1D9X-A | 501 | 510 | BUUUUUUUUU

Valyl tRNA synthetase (valS)(b)
44 | valS.1.8 | DHAGIATQ | 4 | 1 | ERLURRRR (4) | 1IVS-A | 81 | 88 | UUUUHHHH

Class I tRNA synthetase (xS)(b)
45 | xS.1.8 | KMSKSKGN | 7 | 1 | EEERRRLR* (7) | 1JZS-A | 591 | 598 | BBBUUUUU
46 | xS.2.8 | KMSKSLGN | 5 | 2 | E*ERURRLU (2) | 1FFY-A | 595 | 602 | UUUUUUUU
  |  |  |  |  | E*EERRRLU (3) | 1LI5-A | 266 | 273 | UUUUUUUU

(a) The digit(s) after the last dot (.) is the number of amino acid residues in the IS peptide.
(b) Name of the proteins in which the IS peptides occur.
(c) For definition of conformational and secondary structural strings see Materials and Methods.
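The redundancy-removal rule described above (group hits by conformational string, then prefer the best resolution, then the native state, then the first chain reported) can be sketched as follows. This is an illustrative reading of the text, not the authors' code: the record fields (`resolution`, `native`, `chain_order`) and the toy entries are hypothetical.

```python
from collections import defaultdict


def pick_representatives(hits):
    """Group hits by conformational string and pick one representative each.

    Each hit is a dict with hypothetical keys: 'pdb', 'chain', 'chain_order'
    (position of the chain in the entry), 'conf' (conformational string),
    'resolution' (in Å, lower is better) and 'native' (bool).
    Returns {conf: (pdb, chain, number_of_examples)}.
    """
    groups = defaultdict(list)
    for hit in hits:
        groups[hit["conf"]].append(hit)

    representatives = {}
    for conf, members in groups.items():
        # Tie-breaking order taken from the text: best resolution first,
        # then native state, then the first chain reported.
        best = min(
            members,
            key=lambda h: (h["resolution"], not h["native"], h["chain_order"]),
        )
        representatives[conf] = (best["pdb"], best["chain"], len(members))
    return representatives


# Toy records for illustration only; these are not real PDB entries.
toy_hits = [
    {"pdb": "toy1", "chain": "A", "chain_order": 0, "conf": "EEELRURRR",
     "resolution": 2.0, "native": False},
    {"pdb": "toy1", "chain": "B", "chain_order": 1, "conf": "EEELRURRR",
     "resolution": 2.0, "native": False},
    {"pdb": "toy2", "chain": "A", "chain_order": 0, "conf": "EEELRURRR",
     "resolution": 2.0, "native": True},
    {"pdb": "toy3", "chain": "A", "chain_order": 0, "conf": "EEEURURRR",
     "resolution": 2.8, "native": True},
]
reps = pick_representatives(toy_hits)
print(len(reps), reps)
```

The number of keys in the result corresponds to the number of distinct conformations (column 5 of Table 1), and the third element of each value to the number of examples given in parentheses in column 6.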

As is evident from Table 1, 21 IS peptides that have more than one hit exhibited identical conformational string when occurring in different protein structures. Many peptides that have more than one hit exhibit distinct conformations. Eleven peptides exhibited only two distinct conformational states and four peptides showed three different conformations. Only two peptides occur in four different conformations. There is only a single example each for the peptides exhibiting five and seven distinct conformations. However, it is interesting to note that of the multiple conformations exhibited by most of these peptides, only one conformation was predominant. For example, gap.1.9 has 123 hits, of which 111 exhibit the same conformation. Similarly, for peptides like groel.1.9, atpD.3.11 and tuf.3.9, the majority of the hits exhibit the same conformation (Table 1).

Longer peptides like tuf.2.17 and atpD.2.15 show conformational variability, with many distinct conformations. However, the majority of the hits exhibit only one conformation (11 out of 21 in tuf.2.17 and 18 out of 29 in atpD.2.15). In general, longer peptides tend to exhibit more conformational variability as compared to shorter ones. It is a significant observation that this conformational variability does not seem to affect the total folding (discussed in the section Folding patterns of IS peptides for tuf.2.17).
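The predominance of one conformation can be quantified directly from the per-conformation counts in column 6 of Table 1. The sketch below uses the counts quoted in the text for four peptides; the helper name `dominant_fraction` is introduced here purely for illustration.

```python
# Number of examples per distinct conformation, copied from Table 1.
conformation_counts = {
    "gap.1.9": [111, 6, 4, 1, 1],
    "groel.1.9": [106, 7],
    "tuf.2.17": [11, 5, 1, 1, 1, 1, 1],
    "atpD.2.15": [18, 9, 1, 1],
}


def dominant_fraction(counts):
    """Fraction of hits that adopt the most populated conformation."""
    return max(counts) / sum(counts)


for code, counts in conformation_counts.items():
    print(f"{code}: {max(counts)} of {sum(counts)} hits "
          f"({100 * dominant_fraction(counts):.1f}%) share one conformation")
```

The totals recovered this way (123, 113, 21 and 29) agree with the number of hits listed in column 4 of Table 1 for these peptides.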

Secondary structural strings for the functional signature peptides

The two important secondary structural features of any protein being the α helix and the β strand, the occurrence of these can be looked for in the secondary structural string (represented by the symbols described in Materials and Methods), which is given in column 10 of Table 1. It can be seen that all the three symbols H, B and U occur in different IS peptide segments. Quantitatively, the percentage distribution among H, B and U is 32, 19 and 49, respectively, showing a preponderance of Us that are essential for bringing about folding in any protein structure.
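A percentage distribution of this kind can be obtained by pooling the secondary structural strings and counting symbols. The sketch below is illustrative only: it uses just three strings from Table 1 (rows 1 to 3), so it does not reproduce the 32/19/49 figures, and it assumes that only H, B and U enter the denominator (any other symbol, such as the L seen in a few Table 1 strings, is ignored).

```python
from collections import Counter


def symbol_percentages(strings, symbols="HBU"):
    """Percentage of each symbol over all positions of the pooled strings.

    Only the characters listed in `symbols` are counted; this filtering is an
    assumption made for this sketch.
    """
    counts = Counter("".join(strings))
    total = sum(counts[s] for s in symbols)
    return {s: 100.0 * counts[s] / total for s in symbols}


# Secondary structural strings of rows 1-3 of Table 1 (a small subset).
subset = ["BUHHHHUU", "UUHHHHHH", "HHHHHHHH"]
percentages = symbol_percentages(subset)
print(percentages)
```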

In many cases, the peptide region overlaps with the C-terminal end of the secondary structure preceding it and/or overlaps with the N-terminal end of the structure that follows it. The distribution of examples is shown in Table 2 and the following points are observed:

(1) There are two cases (ABCtrH.1.8 and secA.1.8) wherein the entire segment is fully helical. In addition, in the second example of groel.1.9 and in ABCtr.4.8, clp.2.11, gyrB.1.10, rpoC.4.13 and secA.3.12, there are perceptible helical segments. However, due to the small number of examples (eight out of 81), it can be safely assumed that the helix is not a preferred structure for the IS peptides.

(2) A total of 16 (20%) examples are found to occur at the C-terminal end of a helix, with the residue at the C-terminal end forming part of the peptide. There are 36 examples (44%) in which the IS peptide is followed by a helix. Thus, the IS peptide seems to possess both helix initiating and terminating properties.

(3) There are only five examples (6.1%) where the IS peptide acts as a linker between two strands. It is also interesting to note that there are no examples wherein the entire peptide forms a full β strand (near extended chain). However, 31 (38%) examples are found to occur at the C-terminal end of a strand, with the residue at the C-terminal end forming part of the peptide. There are 13 examples (16%) in which the IS peptide is followed by a strand. Thus, the IS peptide seems to possess both strand initiating and terminating properties.

(4) In 15 examples the string contains only U, indicating the occurrence of a non-standard structural region.

(5) In metk.1.13, short helical fragments (four and five residues long) occur in the middle of the peptide. In another case (rpoC.3.12), a short four-residue extended strand segment occurs in the middle.

Thus, the IS peptides exhibit significant conformational variability.

Table 2. Distribution of examples among the different types of secondary structures occurring at the N and C-terminal ends of the IS peptide

N-terminal side of peptide | C-terminal side: H | C-terminal side: B | C-terminal side: U | Total
H | 2 | 7 | 7 | 16
B | 16 | 5 | 10 | 31
U | 18 | 1 | 15 | 34
Total | 36 | 13 | 32 | 81

Folding patterns of IS peptides

As a first step, graphical representation has been used to find out and group the different folding patterns that occur in the peptides. A good pointer to the folding pattern of any chain can be its end-to-end distance. This varies from about 3.7 Å to 27 Å, indicating a wide variety. An exercise using this parameter shows a variety of folding patterns. These are pictorially shown in Figure 1(a)–(u) (with structures having nearly similar folding being grouped together and enclosed in boxes). Patterns ranging from a simple U-turn (or hairpin) to large snake-like (elongated) and helical patterns are seen. No direct correlation could be deduced between the length of the peptide and the folding pattern.
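The end-to-end distance used above as a pointer to the folding pattern is simply the distance between the two terminal positions of the segment. A minimal sketch follows, assuming (this is not stated in the text) that the distance is measured between the first and last Cα atoms; the coordinates are toy values, not taken from any structure.

```python
import math


def end_to_end_distance(ca_coords):
    """Distance (in Å) between the first and last points of a Cα trace."""
    return math.dist(ca_coords[0], ca_coords[-1])


# Toy Cα traces for an eight-residue segment (illustrative values only).
# A straight trace with 3.8 Å spacing gives the largest possible separation.
straight = [(3.8 * i, 0.0, 0.0) for i in range(8)]
# A hairpin-like trace that runs out and comes back next to its start.
hairpin = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0),
           (11.4, 3.8, 0.0), (7.6, 3.8, 0.0), (3.8, 3.8, 0.0), (0.0, 3.8, 0.0)]

print(round(end_to_end_distance(straight), 1))
print(round(end_to_end_distance(hairpin), 1))
```

A small value relative to the number of residues points to a chain reversal, while a value close to the fully stretched limit points to an elongated segment.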

In order to get a concrete quantitative picture of the similarity in those peptides that yielded more than one conformation, the backbone atoms in the representative examples were superposed in the best fitting position and the root-mean-square deviation (RMSD) was evaluated. It was found that the RMS value was less than 2.0 Å in most of the cases. For example:

(1) hsp.2.8: There are two different conformations for this peptide, although the majority of the hits (18 out of 19) prefer only one conformation. The sequence similarity between the sequences is as low as 63%; however, the two conformations are still very similar, with an RMS value of only 0.9 Å.

(2) xS.2.8: This example has two distinct conformations. Once again the sequence identity of the representative examples is significantly low, being only 23% between class I tRNA synthetase proteins 1LI5_A and 1FFY_A containing the IS peptide xS.2.8. Also, a visual examination of the structural alignment of these two proteins indicates that their structures differ significantly, as shown in Figure 5(b). This is substantiated by superposing the Cα atoms of these proteins and calculating the RMS deviation over 384 equivalent Cα atoms, which is as high as 28.30 Å. However, it is interesting to note that the structure of the invariant peptide is highly similar in the two proteins, with an RMS value of only 0.8 Å (Figure 1).

Two exceptions are found (gyrB.2.8 with four conformations, RMS 3.0 Å, and tuf.2.17 with seven conformations, RMS 3.9 Å).

(1) gyrB.2.8: Of the four conformations, two (1KIJ_A, 1KIJ_B) are very similar and the RMS deviation is quite low (0.65 Å). This is to be expected, since these are two different chains in the same protein. However, the RMS values of 1KIJ_A with 1EI1_A and 1AJ6 are ~2.8 Å and ~3.0 Å, respectively. This itself indicates that 1AJ6 and 1EI1_A should belong to a different folding pattern group and this is substantiated through the Figure (see Figure 1(m) and (u)). In addition, the RMS value between 1EI1_A and 1AJ6 is high (2.38 Å) and this is visible in Figure 1(u).

(2) tuf.2.17: For this example there are seven distinct conformations (Table 1). Among these, some are very similar in folding and differ from the rest. For example, the two chains A and C of 1EFU have very similar folding (Figure 1(l)), with an RMS deviation of 0.65 Å. Of the remaining five (Figure 1(h)), the A and B chains of 1EFC again are very similar, with an RMS deviation of 0.77 Å. Examples belonging to 1D8T_A and 1ETU are also close to these. However, 1EXM_A stands out, in that there is a much longer helix portion comprising nine residues at the N terminus of the peptide sequence (column 10 of Table 1) as compared to four to five residues in the others. This aspect is also clearly perceptible in Figure 1(h).
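The RMS values quoted above come from superposing equivalent backbone atoms in the best fitting position. The text does not say which superposition algorithm was used; the sketch below uses the Kabsch algorithm, a standard choice, implemented with NumPy (an assumed dependency), on toy coordinates.

```python
import numpy as np


def kabsch_rmsd(P, Q):
    """RMSD between two equally sized point sets after optimal superposition.

    P and Q are (N, 3) arrays of equivalent atoms. Both sets are centred,
    the optimal rotation is obtained from an SVD of the covariance matrix
    (Kabsch algorithm), and the RMSD is computed after rotating P onto Q.
    """
    P = np.asarray(P, dtype=float)
    Q = np.asarray(Q, dtype=float)
    Pc = P - P.mean(axis=0)
    Qc = Q - Q.mean(axis=0)
    H = Pc.T @ Qc                       # 3x3 covariance matrix
    U, _, Vt = np.linalg.svd(H)
    # Guard against an improper rotation (reflection).
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    P_rot = Pc @ R.T
    return float(np.sqrt(((P_rot - Qc) ** 2).sum() / len(P)))


# Toy check: Q is P rotated by 90 degrees about z and translated, so the
# RMSD after superposition should be numerically zero.
P = np.array([[0, 0, 0], [1, 0, 0], [0, 2, 0], [0, 0, 3], [1, 1, 1]], float)
Rz = np.array([[0, -1, 0], [1, 0, 0], [0, 0, 1]], float)
Q = P @ Rz.T + np.array([5.0, -2.0, 1.0])
print(kabsch_rmsd(P, Q) < 1e-8)
```

For a real comparison, P and Q would hold the coordinates of the equivalent backbone atoms of the same IS peptide taken from two PDB entries.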

Flip of peptide unit

In order to find out the exact difference between the distinct conformations, the arrangements of various atoms of the IS peptides were carefully examined. In an earlier study detailing the conformational analysis of Walker motif A,12,13 it was noticed that a flip of a peptide unit can lead to conformational differences without affecting the folding. In addition, it was also hypothesized that such a flip can be distantly related to F1-ATPase action. The present data were examined from this angle and it was found that such flips do occur in 25 cases. The examples in which this occurs are given in Table 3, from which it can be seen that 11 out of these 25 cases occur in the set of proteins having the same functional representation, tuf.2.17, which has the maximum number of distinct conformations (see Table 1). The flip occurs almost in the middle of the peptide segment and all the five PDB examples, viz. 1D8T, 1EXM, 1EFC, 1EFU and 1ETU, are involved in the flip. In the case of 1EFU, the flip, in fact, occurs among the two chains A and C. A similar observation has been made for F1-ATPase (1BMF).12

Two representative examples are shown in Figure 2, wherein the flipped peptide unit is enclosed in a box (the H atom at N of the flipped peptide unit is geometrically fixed at 1.0 Å from N along the external bisector of C–N–Cα).
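One way to look for such flips computationally is to compare the Ramachandran angles of the same peptide in two structures. The sketch below rests on an assumption that is not taken from the text: rotating the peptide unit between residues i and i+1 about the virtual Cα–Cα bond mainly changes ψ of residue i and φ of residue i+1, so a flip is flagged when both change by more than a threshold. The 90° threshold and the toy angles are hypothetical.

```python
def angle_diff(a, b):
    """Signed difference a - b in degrees, wrapped to [-180, 180)."""
    return (a - b + 180.0) % 360.0 - 180.0


def find_flips(phi_psi_1, phi_psi_2, threshold=90.0):
    """Return index pairs (i, i + 1) whose connecting peptide unit looks flipped.

    Each argument is a list of (phi, psi) tuples in degrees for the same
    peptide in two structures. The criterion (large change in psi of residue
    i together with a large change in phi of residue i + 1) is an assumed
    heuristic for this sketch.
    """
    flips = []
    for i in range(len(phi_psi_1) - 1):
        d_psi = abs(angle_diff(phi_psi_1[i][1], phi_psi_2[i][1]))
        d_phi = abs(angle_diff(phi_psi_1[i + 1][0], phi_psi_2[i + 1][0]))
        if d_psi > threshold and d_phi > threshold:
            flips.append((i, i + 1))
    return flips


# Toy angles: the second structure differs only in psi of residue 1 and
# phi of residue 2, each by 180 degrees.
structure_a = [(-60.0, -45.0)] * 4
structure_b = [(-60.0, -45.0), (-60.0, 135.0), (120.0, -45.0), (-60.0, -45.0)]
flips = find_flips(structure_a, structure_b)
print(flips)
```

The returned pairs are 0-based positions within the peptide; adding the start residue number from Table 1 would give residue pairs comparable to those listed in Table 3.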

Location of peptide in the protein

In order to examine the location of the residues in the peptides as they occur in the proteins from the structural data, the surface properties were determined using a geometrical method described in Materials and Methods. The result is shown as a histogram in Figure 3, from which it is clear that a very significant number of residues in the signature peptides occur inside the protein core. In some cases, the segment occurs in a cleft and in some others, at the junction of two domains. Two exceptions occur in ribosomal proteins, where the percentage of buried residues is small (15% for rps12.1.8 and 53% for rps12.2.8).

Discussion

With the aim of better understanding the structural roles of IS peptides in the context of the respective protein sequence, we carried out a systematic analysis of the structural features of these peptides. The search in the PDB revealed that the proteins in which these peptides occur do not necessarily belong just to the organisms used for the identification of these IS peptides in our earlier study.10 Interestingly, among the hits obtained, some of the peptides were found conserved in all the three kingdoms, viz. archaea, bacteria and eukaryotes. Since there are subtle, or in many cases very large, differences in the proteins of the organisms of every kingdom and across kingdoms, a significant amount of structural variability is to be expected. In order to determine the extent of this variability, we analyzed the structural conformations of the IS peptides in different proteins. This analysis revealed that many peptides having multiple hits exist in more than one distinct conformation. However, most of them exhibited only one predominant conformation. In contrast, longer peptides show a general tendency to exhibit different conformations. But the fact that this conformational variability has a negligible effect on the total folding of the peptides provides a clue towards the possibility of crucial roles these peptides play in the structure, activity and/or stability of the proteins. This aspect is elaborated further in this section.

Since the IS peptide is expected to be responsible for bringing about the required total folding, it is not surprising to find a preponderance of U in the secondary structural string. Also, it is evident from our analysis that the IS peptides do not prefer well-defined secondary structures, i.e. complete helix or full β strand. Instead, in most of the cases, these peptides possess both helix or strand initiating and/or terminating properties. In some examples,

Page 12: Conformational Analysis of Invariant Peptide Sequences in Bacterial Genomes

Figure 1. (a)–(u) Diagrams displaying different folding patterns exhibited by the IS peptides. The groups of peptideshaving roughly similar folding pattern (as decided visually) are enclosed in the same box.

948 Invariant Peptides as Structural Determinants

Page 13: Conformational Analysis of Invariant Peptide Sequences in Bacterial Genomes

Table 3. List of functional signatures exhibiting peptide unit flips

PDB codes_chain identifier Residue number at the ends of the peptides

Functional representative 1 2 1 2

atpD.1.11 1E1R_D 1E1R_E 156–157 156–1571BMF_E 1E1R_D 160–161 160–161

fusA.1.10 1DAR 1ELO 21–22 21–22groel.1.9 1KP8_A 1IOK_A 87–88 87–88gyrB.2.8 1AJ6 1KIJ_A 112–113 111–112

1KIJ_A 1KIJ_B 112–113 112–1131AJ6 1KIJ_B 112–113 111–1121AJ6 1KIJ_A 113–114 112–113

hsp.2.8 1BA1 1HX1_A 201–202 201–202metk.1.13 1FUG_A 1RG9_A 250–251 250–251

1FUG_A 1FUG_B 250–251 250–251rpoC.1.8 1HQM_D 1IW7_D 619–620 619–620tuf.2.17 1EFC_B 1ETU 93–94 93–94

1EFU_A 1EXM_A 93–94 94–951EFU_C 1EXM_A 93–94 94–951D8T_A 1EXM_A 93–94 94–951EFC_B 1ETU 95–96 95–961D8T_A 1ETU 95–96 95–961EFC_A 1ETU 95–96 95–961ETU 1EXM_A 96–97 97–98

1EFU_A 1ETU 96–97 96–971EFU_C 1EXM_A 96–97 97–981EFU_A 1EFU_C 96–97 96–97

tuf.3.9 1EXM_A 1HA3_A 391–392 391–392xS.2.8 1FFY_A 1LI5_A 597–598 268–269

Invariant Peptides as Structural Determinants 949

the helical fragment or extended strand segmentoccurs in the middle of the IS peptide. Thus, it canbe conveniently concluded that the middle part ofthe peptide might act as a linker between twosecondary structures.

It is a moot question to address, whether the multiple conformations obtained from different proteins, for an IS peptide with a given sequence, fold differently or not. For most of the examples, the overall folding patterns of the peptides remain the same even though the peptide exhibits multiple conformations. This is also evident from Figure 1(a)–(u). However, there are two exceptions to this general trend, viz. gyrB.2.8 and tuf.2.17, wherein a large number of distinct conformations exist for the peptides and the RMSD values among those are also large. Thus, it can be inferred that in these cases, the folding pattern, though not entirely different between the examples, has variations indicating flexibility in conformation. In Figure 1, the IS peptides have been grouped based on their folding patterns. A careful examination of Figure 1 indicates that these peptides exhibit a wide variety of unique folding patterns.

Figure 2. Figure displaying peptide flips in two representative examples. (a) metk.1.13, between 1FUG_A (250–251) and 1FUG_B (250–251) and (b) gyrB.2.8, between 1KIJ_A (112–113) and 1KIJ_B (112–113). Note that H and O atoms point in opposite directions in every pair of both the examples. There are 25 examples that exhibit flipped peptide units between various pairs.

Figure 3. Histogram displaying the distribution of percentage of average number of residues buried in the protein.

Figure 4. (a) The Figure displays a protein chain. The first 25% of the protein chain is considered as towards the N-terminal (TN) region, followed by the middle (M) (next 50% of the protein chain) and the last 25% is termed as towards the C-terminal (TC) region. The IS peptides occurring in the TN and M regions act as either FS or SD or both. However, the IS peptides occurring in the M region also act as FB or HN. The IS peptides present in the TC region only act as FS. (b) Histogram displaying the distribution of the IS peptides based on their location in the protein as described for (a). Large numbers of the peptides are clustered in the TN region (which is only 25% of the entire protein).

It is apparent from the results of our analysis that in those examples that had more than one distinct conformation, the folding was not very different. This raises the question as to what could be the exact difference between the conformations. On careful examination of the arrangement of atoms, it was observed that flipped peptide units occur within the IS peptides. From Figure 2, it is clear that in these cases the total folding is unaltered while the C=O and N–H groups, which are essential for H-bonding, point in different directions. The occurrence of a peptide flip between β-turns of type I and II is well known.14,15 The flips observed in various examples were examined for β-turn flipping using the Ramachandran angles. It comes as a surprise to note that only in two examples (fusA.1.10 and tuf.3.9) does the flip correspond to a β-turn flip. In the former, the flip is between types I and II (I↔II) and in the latter, between I′ and II′ (I′↔II′). The fact that the flip occurs in peptides that are occurring in different functional proteins suggests that this may have some role, direct or indirect, in their functioning. This, though very interesting, has not been addressed here as it falls beyond the scope of this conformational study and thus has to remain a hypothesis till proved or disproved otherwise.
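As an illustration of how a peptide flip shows up in Ramachandran angles, the following Python sketch flags a candidate flip of the peptide unit between residues i and i+1 when both ψ of residue i and φ of residue i+1 differ strongly between two structures. This is not the procedure used in this study: the function names and the 90° threshold are assumptions made for illustration, and the test values are the textbook ideal angles of type I and type II β-turns.

```python
def angle_difference(a, b):
    """Smallest signed difference b - a between two angles, in degrees."""
    return (b - a + 180.0) % 360.0 - 180.0


def is_candidate_flip(psi_i_a, phi_next_a, psi_i_b, phi_next_b, min_change=90.0):
    """Flag a possible flip of the peptide unit linking residues i and i+1.

    Structure A supplies (psi_i_a, phi_next_a) and structure B supplies
    (psi_i_b, phi_next_b). The threshold min_change is an assumed value,
    not one taken from the paper.
    """
    psi_change = abs(angle_difference(psi_i_a, psi_i_b))
    phi_change = abs(angle_difference(phi_next_a, phi_next_b))
    return psi_change >= min_change and phi_change >= min_change


# Ideal type I turn: psi(i+1) = -30, phi(i+2) = -90.
# Ideal type II turn: psi(i+1) = 120, phi(i+2) = 80.
print(is_candidate_flip(-30.0, -90.0, 120.0, 80.0))
```

A check of this kind only nominates pairs for inspection; whether the overall fold is retained still has to be confirmed from the coordinates, as was done visually for Figure 2.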

The IS peptides being invariant among various organisms, it is logical to assume that these sequences are highly sensitive to mutation and have withstood evolutionary selection pressure. This is indeed the case, as is evident from Figure 3, which shows that, in a significant number of peptides, a large percentage of residues are buried. Thus, these peptides are well insulated from the surrounding medium and must be indifferent to other secondary interactions in the immediate environment. It is known that buried residues, which are expected to contribute more to the thermodynamic stability, are usually more conserved than surface residues.16,17 This has also been experimentally demonstrated, for example, in the case of cholecystokinin receptor,18 ARM and HEAT protein repeats,19 bacterial multidrug transporters,20 etc.

Based on our entire analysis of the structural features, it is apparent that the IS peptides, which are highly conserved across various different organisms, retain their folding patterns and adopt similar conformations in different proteins irrespective of the neighboring sequences. Thus, the possibility of the roles of these peptides in maintaining the appropriate structures for the activity of the protein cannot be ruled out. As a result, we can safely conclude that these IS peptides can also act as structural determinants of the proteins, in addition to their role as functional signatures.

Location of IS peptides in the protein chain and their respective structural roles

We have analyzed the positional distribution of these IS peptides in 81 representative examples listed in Table 1. For this purpose, any protein chain is divided into three regions: (i) the first 25% of the protein chain from the N-terminal end is designated as TN region (towards N-terminal); (ii) the next 50% of the protein chain as M (middle region); and (iii) the last 25% as TC (towards C-terminal) (Figure 4(a)). The locations of the IS peptides are examined in terms of these three regions and the result is shown in Figure 4(b) as a histogram. It is clear from the Figure that a large number of IS peptides (40 out of 46, ~90%) lie in the TN and M regions. This suggests that the IS peptides might favor occurring in the N-terminal and middle regions and would perhaps exhibit somewhat lesser preference for the near C-terminal region. Our analysis shows that the IS peptides adopt unique folds. In vivo protein synthesis and the subsequent folding initiation take place from the N to C-terminal end. Thus, it is not surprising that the IS peptides, which adopt unique folds, are located predominantly in the N-terminal regions of the proteins. Presence of these peptides in the N-terminal regions could enable the proteins to adopt defined tertiary structures.
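The three-region scheme described above (first 25% of the chain as TN, next 50% as M, last 25% as TC) can be sketched in Python as follows. The text does not state which residue of a peptide decides its region, so this sketch assumes the peptide midpoint; the function name, its arguments and the example numbers are hypothetical.

```python
def chain_region(peptide_start, peptide_length, chain_length):
    """Assign a peptide to the TN, M or TC region of its protein chain.

    peptide_start is the 1-based residue number of the first residue.
    Using the peptide midpoint as the deciding position is an assumption.
    """
    midpoint = peptide_start + (peptide_length - 1) / 2.0
    fraction = midpoint / chain_length
    if fraction <= 0.25:
        return "TN"  # first 25% of the chain
    if fraction <= 0.75:
        return "M"   # next 50% of the chain
    return "TC"      # last 25% of the chain


# Hypothetical example: an octa-peptide starting at residue 21 of a 400-residue chain.
print(chain_region(21, 8, 400))
```

Counting the returned labels over all peptides would give a histogram of the kind shown in Figure 4(b).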

The IS peptides that lie in the M region can act as linkers connecting independently folded domains. It is well documented that folds are mainly made up of a number of simple local units of super-secondary structural motifs. These motifs are formed by a few secondary structures connected by boundaries, which are not expected to form well-defined regular secondary structures (like α helices, β strands etc.).21,22 Such boundaries are termed as fold boundaries (FB). It is evident from our results that the IS peptides do not prefer to adopt well-defined secondary structures, but have a tendency to act as linkers between them with a well-defined unique fold. Thus, these linker peptides might act as FB, which distinctly separate domains of the proteins. The examples of FB are shown in the case of RNA polymerase β′ subunit protein in Figure 5(a). This protein has four signatures, of which the residues of rpoC.1.8 are known to interact with the sigma-70 subunit23 and the role of rpoC.2.9 is unknown. Interestingly, since these two peptides act as boundaries between the two independently folded domains, they might be the potential FB. The amino acid residues of the third peptide, i.e. rpoC.3.12, comprise the active site of the protein and the role of the fourth peptide is unclear. It is known that this peptide lies adjacent to the G-loop of rpoC protein.23 Thus, this may be a specialized IS peptide, which might help the G-loop to adopt the proper orientation for its function.

The FB that divide the domains and act as hinges between them are termed hinge nuclei (HN). The HN peptides are expected to lie at the surface of the protein. This is displayed in Figure 5(b), which clearly indicates that in the case of xS.2.8, the IS peptide KMSKSLGN might act as a hinge that holds the two halves of the protein as it lies in the middle of the protein and also is exposed on the protein surface. Also, the second lysine residue of this peptide has been implicated in stabilizing the transition state for the first step of aminoacylation.24

IS peptides lying in the TN region may act as SD or FS or both. Figure 5(c) clearly shows one such peptide in the case of gap.1.9, which lies in the TN region and is known to participate in binding to an NAD molecule, in addition to forming a part of the multiple salt-bridge arrangement that links different structural elements.25 It has not escaped our notice that any structural distortion of these SD peptide sequences could result in the protein becoming non-functional. Thus, some of these SD sequences could be Achilles heels for pathogenic organisms and could act as potential drug targets.

Are IS peptides chain-reversal signatures (CRS) in the protein folding process?

The fact that these IS peptides are conserved through 400 million years of evolution, from bacteria to human in some cases, clearly shows their functional and/or structural significance in the proteins. Here, we have identified certain invariant peptide segments that can adopt unique folding patterns irrespective of their immediate environment, and can act as linkers between regular secondary structures. Thus, the structures of these sequences might be determined through short-range interactions and be indifferent to the long-range interactions in the proteins. In addition to this, the molecular dynamics simulation experiments in aqueous solutions and structure determination in non-aqueous solutions through circular dichroism technique of one of the peptides, tuf.3.9, suggest a β hairpin bend structure of this peptide (D.D. et al., unpublished results). The peptide forms a β hairpin in all the 20 protein crystal structures, in which it is found invariant. This provides evidence for the structures of these short IS peptides in solution, in the absence of the rest of the protein. Thus, the possibility that these peptides may form structures early in the folding cannot be ruled out. As a result, these sequences might enable reversal of polypeptide chain to make the protein fold and act as major structural determinants. Thus, the possible role for these IS peptides could be to act as potential chain-reversal signatures (CRS), guiding the protein chains to adopt the appropriate fold.

Figure 5. Figure displaying the functional signatures (FS), structural determinants (SD), fold boundaries (FB) and hinge nucleus (HN) in the proteins. (a) The four FB peptides of RNA polymerase β′ subunit protein (rpoC): rpoC.1.8 (orange), rpoC.2.9 (red), rpoC.3.12 (magenta) and rpoC.4.13 (yellow) in chain D of 1IW7. The different folded domains are shown in cyan, green, brown, gray and blue. N and C-terminal ends are shown in blue and light green, respectively. The different domains are clearly folded independent of each other and the four FB peptides act as boundaries between those. (b) The HN peptide of class I tRNA synthetase protein (xS.2.8) (yellow) in chain A of the two proteins 1FFY (upper) and 1LI5 (lower). The two folded domains are shown in different colors (1FFY: N-terminal half in magenta, C-terminal half in purple and HN in yellow. 1LI5: N-terminal half in blue, C-terminal half in cyan and HN in yellow). The HN peptide clearly acts as a hinge between the two folded domains in both the structures. N and C-terminal ends are shown in blue and light green, respectively, in both structures. Also, the structures of 1FFY and 1LI5 are noticeably different (with average RMS value of 28.30 Å over 384 equivalent Cα atoms). The structure of the HN peptide is highly similar in these two proteins with RMS value of only 0.8 Å. (c) The SD peptide of glyceraldehyde-3-phosphate dehydrogenase protein in 1GD1 (gap.1.9) (yellow). The N-terminal region is shown in cyan and the C-terminal region in blue. N and C-terminal ends are shown in blue and light green, respectively. The SD peptide lies adjacent to the N-terminal end of the protein.

Interestingly, there are a few coincidental similarities of these IS peptides with the folding nuclei of the proteins also. The conformation adopted by the protein chain in the folding transition state is called folding nucleus (FN), and is known to extensively accelerate the process of folding by playing a key role in the protein folding pathway.26–30 The residues involved in folding nuclei are distributed across the entire protein chain and are also known to form a large number of interactions during the early stages of protein folding. Our preliminary analysis of the available scientific literature of the FN of 17 proteins showed that some of them occur in the invariant regions of the protein also; however, this may not be always true (T.P. et al. unpublished results). Also, according to Mirny & Shakhnovich,4 the residues present in the FN are evolutionarily conserved and are largely hydrophobic too. In addition to this, the residues of FN are known to occupy topohydrophobic positions31 and are significantly more buried than the hydrophobic residues in the non-topohydrophobic positions.32 It has been observed that these residues might not have any functional role.33–35 It is clear from our analysis that the IS peptide sequences are not only highly conserved, but the residues occurring in them are well buried in the protein core. These 46 peptides have 72% hydrophobic residues and 17% glycine residues. These aspects of IS peptides, which could be potential CRS, underscore some similarity with that of FN residues and thus, these IS peptides can play a dual role of chain reversal and folding, the former being responsible for the latter. However, more evidence is to be sought to prove this hypothesis.

† Brahmachari, S. K., Dash, D. (2001). A computer based method for identifying peptides useful as drug targets. PCT international patent publication (WO 01/74130 A2, 11th October 2001).

Conclusion

It is seen in this study that the IS peptide retains its folding in spite of some conformational differences. This leads to a debatable issue, namely, whether the folding invariance is a sub-feature of a close similarity of the overall structure (with the signature peptide forming a minuscule part). We believe that these peptides, which are conserved in many organisms, have withstood evolutionary selection, are well buried and retain the same fold among proteins from various organisms, are the SD of the proteins. In many cases, these SD might form boundaries of the independently folded domains (FB), which could sometimes act as hinges between those domains (HN). However, it is interesting to note that all the IS peptides have been implicated in the function of the protein and act as FS. In addition to this, all the IS peptides are CRS for protein folding and in a few cases they include residues involved in the nucleation of the protein folding process. Though this concept is only in its nascent state, it is not without reasoning and more evidence is to be sought, which is reserved for our future studies. However, it has not escaped our notice that these IS peptides with defined folds could be used as linkers between secondary structures in de novo protein design.

Materials and Methods

Comparison of the sequences of seven completely sequenced bacterial proteomes viz. Bacillus subtilis, Escherichia coli, Haemophilus influenzae, Helicobacter pylori, Mycoplasma genitalium, Mycoplasma pneumoniae and Mycobacterium tuberculosis to identify the invariant peptide sequences was carried out using in-house developed software PLHOST (peptide library-based homology search tool†). PLHOST first fragments all the protein sequences of the selected genomes into their respective overlapping octa-peptide libraries. These octa-peptide libraries were then compared amongst each other and the peptides occurring in all the seven genomes were reported. Of the five million overlapping octa-peptides from seven bacterial genomes selected for this study, only 168 unique octa-peptides were found to be present across all of them. Several of these peptide sequences had overlaps. The redundancy was removed by merging overlapping invariant octa-peptides to report the longer conserved peptides. This resulted in 69 IS peptides ranging in length from eight to 17 amino acid residues, present invariantly across all the seven genomes. These IS peptides occur in functionally similar proteins and hence act as FS peptides (data not shown here). All the 69 IS peptides are mapped to 28 different proteins, with as many as five IS peptides in one protein.

In order to analyze the structural features of these IS peptides, the structures of the proteins invariantly harboring these sequences were explored. For this purpose, the occurrence of the 69 IS peptides was searched for in the sequences of the proteins in the PDB (April 2004 Release), using in-house developed Perl scripts. Of the hits that were obtained, only those structures determined through X-ray diffraction have been used in the present analysis. Those for which atomic positions have been obtained by modeling and NMR techniques have been excluded. In addition, the structures with poor resolution (>4 Å) were eliminated.
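The octa-peptide library comparison described at the start of this section can be sketched in Python as follows. This is only an illustrative sketch and not the PLHOST code: the function names are hypothetical, and the merging step assumes that invariant octa-peptides are merged when they start at consecutive positions of a chosen reference sequence, which the text does not specify.

```python
def kmer_library(sequences, k=8):
    """Return the set of all overlapping k-residue peptides in the sequences."""
    library = set()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            library.add(seq[i:i + k])
    return library


def invariant_kmers(proteomes, k=8):
    """Peptides of length k present in every proteome.

    proteomes maps a genome name to a list of protein sequences.
    """
    libraries = [kmer_library(seqs, k) for seqs in proteomes.values()]
    return set.intersection(*libraries)


def merge_on_reference(reference, invariant, k=8):
    """Merge invariant k-mers that start at consecutive positions of a
    reference sequence into longer peptides (assumed merging rule)."""
    merged = []
    run_start = None
    for i in range(len(reference) - k + 1):
        hit = reference[i:i + k] in invariant
        if hit and run_start is None:
            run_start = i
        if run_start is not None and not hit:
            # The last hit started at i - 1, so the run ends at i - 1 + k.
            merged.append(reference[run_start:i - 1 + k])
            run_start = None
    if run_start is not None:
        merged.append(reference[run_start:])
    return merged


# Toy example with two made-up sequences and k = 4 for brevity.
toy = {"genomeA": ["MKTAYIAKQR"], "genomeB": ["GGKTAYIAKQ"]}
shared = invariant_kmers(toy, k=4)
print(merge_on_reference("MKTAYIAKQR", shared, k=4))
```

Sets keep the comparison simple because a peptide is either present in a proteome or not; its copy number plays no part in the invariance criterion described above.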


The main analysis has been done using Ramachandran angles.36–38 In order to facilitate representation, the concept of conformational string has been introduced, which is based on the position of the conformational points in the Ramachandran map. Four symbols are used to denote the conformations, viz. R, E, L and U. The regions representing these symbols are shown in Figure 6. These correspond to the right-handed α-helical (R), extended strand (E), left-handed α-helical (L) and uncharacterized (U) regions of conformational space. These were first introduced and used for conformational analysis in our earlier studies.39 The conformational symbols are, in turn, used to define the secondary structures. Thus occurrence of four consecutive Rs leads to an α-helical segment represented by H and four consecutive Es leads to an extended strand (many of which eventually associate through hydrogen bonding to form β-sheets, both parallel and antiparallel) represented by B. The rest are denoted by U, representing random-coil regions. Thus, the secondary structural symbols are formed for the entire protein. The string corresponding to the peptide segment is denoted as secondary structural string.

Figure 6. The (φ, ψ) regions corresponding to right-handed α-helical (R), extended strand (E) and left-handed α-helical (L) conformations are shown as boxes (R = (φ: −120° to −10°; ψ: −120° to 20°), E = (φ: −180° to −60°; ψ: 60° to 180°), L = (φ: 0° to 90°; ψ: −10° to 90°)).

In order to analyze the conformations adopted by the IS peptides, the conformational string identity was looked at among the hits of each IS peptide having more than one hit in the PDB. Since strict rectangular boundaries are used for defining R, E and L, it is likely that some of the points might have occurred just outside the boundary and hence would have been designated as U in a very strict sense. For a meaningful analysis, these must be grouped with the regular structures (R, E or L). Hence, as a second step, visual examination of the points was made to find out such instances and wherever needed, the symbol U is replaced with either R* or E* or L* as the case may be (the symbol * denoting that the conformation occurs as a borderline case).

Pairwise alignment of the sequences of the protein crystal structures harboring the IS peptides was carried out using the Gap module of the Wisconsin Package Version 10.0, Genetics Computer Group Inc. (GCG) software. The residues of these peptides were classified as occurring on the surface of the protein or in the core, using a geometrical procedure involving three orthogonal projections40 (C.R. & N. Srinivasan, unpublished results). The correctness of this approach was verified using both visual and solvent-accessibility methods. INSIGHT II version 2000 software was used to draw the Figures.
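The assignment of conformational and secondary structural strings described earlier in this section can be sketched in Python as follows. The box boundaries are those read from the Figure 6 legend; the function names are hypothetical, runs longer than four residues are assumed to count as H or B throughout, and the borderline symbols (R*, E*, L*), which were assigned by visual examination, are not modeled.

```python
import re

# (phi range, psi range) in degrees for each box, as read from the Figure 6 legend.
REGIONS = {
    "R": ((-120.0, -10.0), (-120.0, 20.0)),
    "E": ((-180.0, -60.0), (60.0, 180.0)),
    "L": ((0.0, 90.0), (-10.0, 90.0)),
}


def conformational_symbol(phi, psi):
    """Return R, E or L if (phi, psi) falls inside a box, otherwise U."""
    for symbol, ((phi_lo, phi_hi), (psi_lo, psi_hi)) in REGIONS.items():
        if phi_lo <= phi <= phi_hi and psi_lo <= psi <= psi_hi:
            return symbol
    return "U"


def conformational_string(angles):
    """Build the conformational string from a list of (phi, psi) pairs."""
    return "".join(conformational_symbol(phi, psi) for phi, psi in angles)


def secondary_structure_string(conformation):
    """Mark runs of four or more R as H and four or more E as B; the rest U."""
    result = ["U"] * len(conformation)
    for symbol, label in (("R", "H"), ("E", "B")):
        for match in re.finditer(symbol + "{4,}", conformation):
            for i in range(match.start(), match.end()):
                result[i] = label
    return "".join(result)


# Illustrative (phi, psi) values, not taken from any structure in the study.
print(conformational_string([(-60.0, -45.0), (-120.0, 130.0), (60.0, 45.0), (60.0, 180.0)]))
print(secondary_structure_string("RRRRUEEEEEL"))
```

Comparing the conformational strings of the same IS peptide taken from different PDB entries then reduces to a string equality test, which is the identity check described in the text.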

Acknowledgements

We acknowledge Dr Y. Singh and Dr S. Ramachandran for invaluable discussions and Rohit Ghai, Pankaj Bhatnagar and Dipayan Dasgupta for technical support. T.P. is the recipient of a Senior Research Fellowship from CSIR. Funding support from Council of Scientific and Industrial Research (CSIR) through “In Silico Biology Task Force” project is duly acknowledged.

Supplementary Data

Supplementary data associated with this article can be found, in the online version, at doi:10.1016/j.jmb.2004.11.008

References

1. Hannenhalli, S. S. & Russell, R. B. (2000). Analysis and prediction of functional sub-types from protein sequence alignments. J. Mol. Biol. 303, 61–76.

2. Chothia, C. & Gerstein, M. (1997). Protein evolution. How far can sequences diverge? Nature, 385, 579–580.

3. Gobin, S., Thuillier, L., Jogl, G., Faye, A., Tong, L., Chi, M. et al. (2003). Functional and structural basis of carnitine palmitoyltransferase 1A deficiency. J. Biol. Chem. 278, 50428–50434.

4. Mirny, L. & Shakhnovich, E. (2001). Evolutionary conservation of the folding nucleus. J. Mol. Biol. 308, 123–129.

5. Larson, S. M., Ruczinski, I., Davidson, A. R., Baker, D. & Plaxco, K. W. (2002). Residues participating in the protein folding nucleus do not exhibit preferential evolutionary conservation. J. Mol. Biol. 316, 225–233.

6. Hulo, N., Sigrist, C. J., le Saux, V., Langendijk-Genevaux, P. S., Bordoli, L., Gattiker, A. et al. (2004). Recent improvements to the PROSITE database. Nucl. Acids Res. 32, D134–D137.

7. Attwood, T. K., Bradley, P., Flower, D. R., Gaulton, A., Maudling, N., Mitchell, A. L. et al. (2003). PRINTS and its automatic supplement, prePRINTS. Nucl. Acids Res. 31, 400–402.

8. Bateman, A., Coin, L., Durbin, R., Finn, R. D., Hollich, V., Griffiths-Jones, S. et al. (2004). The Pfam protein families database. Nucl. Acids Res. 32, D138–D141.

9. Henikoff, J. G., Greene, E. A., Pietrokovski, S. & Henikoff, S. (2000). Increased coverage of protein families with the Blocks Database servers. Nucl. Acids Res. 28, 228–230.

10. Prakash, T., Khandelwal, M., DasGupta, D., Dash, D. & Brahmachari, S. K. (2004). CoPS: Comprehensive Peptide Signature Database. Bioinformatics, 20, 2886–2888.


11. Berman, H. M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T. N., Weissig, H. et al. (2000). The Protein Data Bank. Nucl. Acids Res. 28, 235–242.

12. Ramakrishnan, C., Dani, V. S. & Ramasarma, T. (2002). A conformational analysis of Walker motif A [GXXXXGKT (S)] in nucleotide-binding and other proteins. Protein Eng. 15, 783–798.

13. Ramasarma, T. & Ramakrishnan, C. (2002). On the importance of backbone loop and peptide flip in the Walker sequence in F1–ATPase action. Indian J. Biochem. Biophys. 39, 5–15.

14. Venkatachalam, C. M. (1968). Stereochemical criteria for polypeptides and proteins. V. Conformation of a system of three linked peptide units. Biopolymers, 6, 1425–1436.

15. Gunasekaran, K., Gomathi, L., Ramakrishnan, C., Chandrasekhar, J. & Balaram, P. (1998). Conformational interconversions in peptide beta-turns: analysis of turns in proteins and computational estimates of barriers. J. Mol. Biol. 284, 1505–1516.

16. Mirny, L. A. & Shakhnovich, E. I. (1999). Universally conserved positions in protein folds: reading evolutionary signals about stability, folding kinetics and function. J. Mol. Biol. 291, 177–196.

17. Pintar, A., Carugo, O. & Pongor, S. (2003). Atom depth as a descriptor of the protein interior. Biophys. J. 84, 2553–2561.

18. Fourmy, D., Escrieut, C., Archer, E., Gales, C., Gigoux, V., Maigret, B. et al. (2002). Structure of cholecystokinin receptor binding sites and mechanism of activation/inactivation by agonists/antagonists. Pharmacol. Toxicol. 91, 313–320.

19. Andrade, M. A., Petosa, C., O’Donoghue, S. I., Muller, C. W. & Bork, P. (2001). Comparison of ARM and HEAT protein repeats. J. Mol. Biol. 309, 1–18.

20. Putman, M., Van Veen, H. W. & Konings, W. N. (2000). Molecular properties of bacterial multidrug transporters. Microbiol. Mol. Biol. Rev. 64, 672–693.

21. Salem, G. M., Hutchinson, E. G., Orengo, C. A. & Thornton, J. M. (1999). Correlation of observed fold frequency with the occurrence of local structural motifs. J. Mol. Biol. 287, 969–981.

22. Wood, T. C. & Pearson, W. R. (1999). Evolution of protein sequences and structures. J. Mol. Biol. 291, 977–995.

23. Vassylyev, D. G., Sekine, S., Laptenko, O., Lee, J., Vassylyeva, M. N., Borukhov, S. & Yokoyama, S. (2002). Crystal structure of a bacterial RNA polymerase holoenzyme at 2.6 Å resolution. Nature, 417, 712–719.

24. Newberry, K. J., Hou, Y. M. & Perona, J. J. (2002). Structural origins of amino acid selection without editing by cysteinyl-tRNA synthetase. EMBO J. 21, 2778–2787.

25. Duee, E., Olivier-Deyris, L., Fanchon, E., Corbier, C., Branlant, G. & Dideberg, O. (1996). Comparison of the structures of wild-type and a N313T mutant of Escherichia coli glyceraldehyde-3-phosphate dehydrogenases: implication for NAD binding and co-operativity. J. Mol. Biol. 257, 814–838.

26. Galzitskaya, O. V., Ivankov, D. N. & Finkelstein, A. V. (2001). Folding nuclei in proteins. FEBS Letters, 489, 113–118.

27. Fersht, A. R. (1995). Characterizing transition states in protein folding: an essential step in the puzzle. Curr. Opin. Struct. Biol. 5, 79–84.

28. Fersht, A. R. (1997). Nucleation mechanisms in protein folding. Curr. Opin. Struct. Biol. 7, 3–9.

29. Jackson, S. E. (1998). How do small single-domain proteins fold? Fold. Des. 3, R81–R91.

30. Dobson, C. M. & Karplus, M. (1999). The fundamentals of protein folding: bringing together theory and experiment. Curr. Opin. Struct. Biol. 9, 92–101.

31. Poupon, A. & Mornon, J. P. (1999). Predicting the protein folding nucleus from sequences. FEBS Letters, 452, 283–289.

32. Poupon, A. & Mornon, J. P. (1998). Populations of hydrophobic amino acids within protein globular domains: identification of conserved “topohydrophobic” positions. Proteins: Struct. Funct. Genet. 33, 329–342.

33. Shakhnovich, E., Abkevich, V. & Ptitsyn, O. (1996). Conserved residues and the mechanism of protein folding. Nature, 379, 96–98.

34. Michnick, S. W. & Shakhnovich, E. (1998). A strategy for detecting the conservation of folding-nucleus residues in protein superfamilies. Fold. Des. 3, 239–251.

35. Mirny, L. A., Abkevich, V. I. & Shakhnovich, E. I. (1998). How evolution makes proteins fold quickly. Proc. Natl Acad. Sci. USA, 95, 4976–4981.

36. Ramachandran, G. N., Ramakrishnan, C. & Sasisekharan, V. (1963). Stereochemistry of polypeptide chain configurations. J. Mol. Biol. 7, 95–99.

37. Ramakrishnan, C. & Ramachandran, G. N. (1965). Stereochemical criteria for polypeptide and protein chain conformations. II. Allowed conformations for a pair of peptide units. Biophys. J. 5, 909–933.

38. Ramachandran, G. N. & Sasisekharan, V. (1968). Conformation of polypeptides and proteins. Advan. Protein Chem. 23, 283–438.

39. Ramakrishnan, C. & Srinivasan, N. (1990). Glycyl residues in proteins and peptides: an analysis. Curr. Sci. 59, 851–861.

40. Nandi, T., Dash, D., Ghai, R., B-Rao, C., Kannan, K., Brahmachari, S. K. et al. (2003). A novel complexity measure for comparative analysis of protein sequences from complete genomes. J. Biomol. Struct. Dynam. 20, 657–668.

Edited by J. Thornton

(Received 9 June 2004; received in revised form 26 October 2004; accepted 5 November 2004)