


Article No. jmbi.1999.2911 available online at http://www.idealibrary.com on J. Mol. Biol. (1999) 00, 1–19

Universally Conserved Positions in Protein Folds: Reading Evolutionary Signals about Stability, Folding Kinetics and Function

Leonid A. Mirny and Eugene I. Shakhnovich*

Department of Chemistry and Chemical Biology, Harvard University, 12 Oxford Street, Cambridge, MA 02138, USA

E-mail address of the corresponding author: [email protected]

Abbreviations used: CoC, conservatism-of-conservatism; PDB, Protein Data Bank; OB, oligonucleotide-binding; EBN, endo-beta-N-acetylglucosaminidase; CLT, central limit theorem.

0022-2836/99/0000001–19 $30.00/0

Here, we provide an analysis of molecular evolution of five of the most populated protein folds: immunoglobulin fold, oligonucleotide-binding fold, Rossman fold, alpha/beta plait, and TIM barrels. In order to distinguish between "historic", functional and structural reasons for amino acid conservations, we consider proteins that acquire the same fold and have no evident sequence homology. For each fold we identify positions that are conserved within each individual family and coincide when non-homologous proteins are structurally superimposed. As a baseline for statistical assessment we use the conservatism expected based on the solvent accessibility. The analysis is based on a new concept of "conservatism-of-conservatism". This approach allows us to identify the structural features that are stabilized in all proteins having a given fold, despite the fact that actual interactions that provide such stabilization may vary from protein to protein. Comparison with experimental data on thermodynamics, folding kinetics and function of the proteins reveals that such universally conserved clusters correspond to either: (i) super-sites (common location of active site in proteins having common tertiary structures but not function) or (ii) folding nuclei whose stability is an important determinant of folding rate, or both (in the case of Rossman fold). The analysis also helps to clarify the relation between folding and function that is apparent for some folds.

© 1999 Academic Press

Keywords: protein evolution; protein folding; conservatism; super-site kinetics; protein stability

*Corresponding author

Introduction

The amount of data on protein structure, folding and kinetics is exploding. Progress in genomics (gene sequences) and proteomics (structure, function and expression) studies has created a new realm for bioinformatics in which a qualitatively different amount of biological information needs to be properly rationalized and used. Success in achieving this goal depends entirely on our understanding of the principles that govern protein stability, folding and function.

Such understanding has progressed over the last several years to the point that basic principles of folding begin to emerge from theoretical and experimental studies. Of particular importance are the foldability principle in thermodynamics and the discovery of nucleation in folding kinetics. The foldability principle states that protein-like sequences should have their native conformation as a pronounced energy minimum (separated by a large energy gap from the bulk of structurally unrelated misfolded conformations) (Goldstein et al., 1992; Shakhnovich & Gutin, 1993a; Sali et al., 1994; Govindarajan & Goldstein, 1995; Hao & Scheraga, 1994b). Sequences that satisfy this requirement are able to fold fast and have a cooperative folding transition (Shakhnovich & Gutin, 1993a; Shakhnovich, 1994; Hao & Scheraga, 1994a,b). Their native structures are stable against mutations (Tiana et al., 1998) as well as against variation in solvent conditions and temperature (Pande et al., 1995).

The modern concept of nucleation in protein folding emerged from several theoretical and experimental studies (Bryngelson & Wolynes, 1990; Abkevich et al., 1994; Shakhnovich et al., 1996; Mirny et al., 1998; Guo & Thirumalai, 1995; Pande et al., 1998; Itzhaki et al., 1995; Martinez et al., 1998) as a paradigm to describe the transition state ensemble of protein folding, especially for proteins that fold via simple two-state kinetics (Jackson, 1998). Of particular importance is the discovery of a specific folding nucleus in some proteins. The specific nucleus scenario of folding suggests that a number of obligatory contacts (the specific nucleus) should be formed in order for a protein chain to reach the transition state. The specific nucleus constitutes a spatially contiguous cluster in structure, but not necessarily in sequence: non-local contacts are always present in specific nuclei. After the specific nucleus is formed, subsequent transition occurs downhill in free energy and is fast (Abkevich et al., 1994; Shakhnovich, 1998a). Further, it was noted (Abkevich et al., 1994; Shakhnovich et al., 1996; Mirny et al., 1998; Shakhnovich, 1998a; Martinez et al., 1998) that the location of a specific nucleus depends on the structure to a greater extent than it does on sequence. The major implication of this finding is the prediction that different (even non-homologous) sequences that fold into the same structure may have similar folding nuclei. In other words, the location of a folding nucleus in a structure may serve as a fingerprint of a protein fold. The specific nucleus model of folding kinetics has direct implications for experimental results, predicting a substantial variance of kinetic effects of mutations at various locations in protein structure. Another important prediction is the robustness of the specific nucleus with respect to variation in solvent conditions, temperature and other mutations. These predictions are consistent with experiment (Itzhaki et al., 1995; Viguera et al., 1997).

Molecular evolution represents an invaluable natural laboratory to test and further develop our understanding of protein folding. Conversely, our understanding of protein folding and function is a key to rational analysis of signals sent by protein evolution. The fusion of theoretical understanding of protein folding with analysis of evolutionary information is the main aim of this study.

Molecular evolution sends us signals in the form of conservation patterns in multiple sequence alignments. However, those signals are hard to decipher because there may be many reasons for conservation: function, stability or maybe "historical" reasons (insufficient evolutionary time to diverge). Finally, there may be some evolutionary pressure towards fast folding (perhaps to exceed some rate threshold beyond which aggregation and/or proteolysis of partly folded species may present a problem). The kinetic factor may give rise to additional conservation in the kinetically important locations related to the folding nucleus.

How can one distinguish between different reasons for amino acid conservation? A possible approach is to use as much evolutionary information as possible. In particular, it is known that besides protein homologs, i.e. proteins that have a clear evolutionary connection and are often (but not always) functionally related, there exist analogs, i.e. structurally similar proteins that have non-homologous sequences, unrelated functions and no evident evolutionary relation (Branden & Tooze, 1998). Since in most cases analogs share a common fold but not function, a proper sequence comparison between them may emphasize positions where conservatism is related to structural stability and folding kinetics rather than function (except in the cases when folds contain functional super-sites (Russell et al., 1998a), see below).

However, comparison of sequences of protein analogs should be made with care: a simple sequence alignment between analogs may not always work due to the possibility of multi-amino acid correlated mutations. The easiest way to understand this is to consider a basic example where a certain element of structure needs to be stabilized. There are several ways to form strong attractive interactions (e.g. by forming hydrophobic contacts or disulfide bridges or in some cases salt bridges). Therefore, if the same element of structure is stabilized in analogs by different forces, the amino acid residues that deliver such stabilization may be of quite different types. This suggests that a simple sequence alignment between analogs may in some cases yield no indication of conservatism. In other words, energetics may be more conserved than the amino acid types that deliver it. On the other hand, within families of homologous proteins one can expect conserved amino acid residues to form stable substructures: the change of amino acid residues in these positions requires compensating mutations in several related positions. Such multi-amino acid correlated mutations are very rare. They can be found only in highly diverged or unrelated proteins rather than within protein families.

This analysis suggests that a factor that may point to a common structure-related property in all analogs may be the intrafamily conservation itself rather than actual amino acid residues at the positions in question. This leads to an important new concept of "conservatism-of-conservatism" (CoC) to analyze evolutionary signals that are specific to a given fold (Mirny et al., 1998). The principle of CoC calls for alignment of intrafamily conservatism profiles between analogs as a method to find and analyze evolutionary signals that reflect features that are characteristic of a particular fold: structural stability, folding kinetics or in some cases common function or common location of active sites between analogs (super-site).

In the following report, we first explain how CoC is computed and what controls and statistical tests we perform. Next, we consider the case of the immunoglobulin fold in detail, and show how CoC analysis helps to identify the evolutionary pressure towards fast folding and distinguish it from evolutionary pressure aimed at protein stabilization and functional pressure. This will help to identify a possible location of the folding nucleus for the immunoglobulin fold, which allows direct comparison with protein engineering experiments.

Next we carry out a similar analysis for all other folds for which sufficient structural and evolutionary data are available (oligonucleotide-binding (OB) fold, Rossman fold, alpha/beta plait and TIM barrel). Similar to the case of the immunoglobulin fold, the analysis of the observed CoC signal will allow us to identify (in some cases) common nucleation sites characteristic of a given fold and (in some cases) super-sites, i.e. a common location of the active site in proteins with similar structure but possibly different function.

The results of our analysis will be compared with experimental information about the function, thermodynamics and kinetics of studied proteins in cases when such information is available.

Results

Conservatism-of-conservatism (CoC)

As was stated earlier, the analysis of CoC aims to identify positions in a protein structure which are conserved within each family of homologous proteins that acquire this structure. To pursue this goal we need: (i) a large set of analogs, i.e. non-homologous proteins sharing the same fold (representative proteins); and (ii) for each representative protein a number of proteins homologous to it (a family).

When these data are available, the evaluation of CoC proceeds as follows: (i) make multiple sequence alignments of proteins homologous to each representative protein; (ii) identify positions which are conserved within each multiple alignment; (iii) structurally align families to each other; and (iv) identify sites where conserved positions coincide between the families.

Figure 1 outlines the major steps of this procedure.

We use the FSSP database (Holm & Sander, 1993) as a source of structural alignments of representative proteins and the HSSP database (Dodge et al., 1998) as a source of sequence alignments among homologous proteins. Some FSSP structural alignments were corrected using our Monte Carlo structural alignment algorithm (see Methods and Mirny & Shakhnovich, 1998).

The degree of evolutionary conservation within a family of homologous sequences is measured by sequence entropy:

$$ s(l) = -\sum_{i=1}^{6} p_i(l) \log p_i(l) $$

where p_i(l) is the frequency of each of the six classes i of residues at position l in the multiple sequence alignment. The six classes of residues are: aliphatic {AVLIMC}, aromatic {FWYH}, polar {STNQ}, positive {KR}, negative {DE}, and special (reflecting their special conformational properties) {GP}. A low value of the intrafamily conservatism s(l) indicates that this position was under an evolutionary pressure to keep a particular type of residue.
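The six-class sequence entropy can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' code: the function names are hypothetical, the natural logarithm is assumed (the text does not state the base), and gaps or unknown symbols are simply skipped, which is also an assumption.

```python
import math
from collections import Counter

# The six residue classes listed in the text.
RESIDUE_CLASSES = {
    "aliphatic": "AVLIMC",
    "aromatic": "FWYH",
    "polar": "STNQ",
    "positive": "KR",
    "negative": "DE",
    "special": "GP",
}
# Map each one-letter amino acid code to its class name.
CLASS_OF = {aa: name for name, members in RESIDUE_CLASSES.items() for aa in members}


def class_frequencies(column):
    """Return the class frequencies p_i(l) for one alignment column.

    Assumption: gap characters and unknown symbols are ignored.
    """
    counts = Counter(CLASS_OF[aa] for aa in column.upper() if aa in CLASS_OF)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("column contains no classifiable residues")
    return {name: counts.get(name, 0) / total for name in RESIDUE_CLASSES}


def sequence_entropy(column):
    """Six-class entropy s(l) of one column (natural log assumed)."""
    freqs = class_frequencies(column)
    return sum(-p * math.log(p) for p in freqs.values() if p > 0)
```

A column such as "AVLIM" falls entirely into the aliphatic class and gives zero entropy, whereas a column spread evenly over all six classes gives the maximum value, log 6.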

After representative proteins and their respective families are structurally superimposed we compute conservatism-of-conservatism (CoC):

$$ S(l) = \frac{1}{M} \sum_{m=1}^{M} s^{m}(l) \qquad (1) $$

where l is now the position in the structural alignment, and s^m(l) is the intrafamily conservatism in family m. A low value of S(l) indicates that position l was conserved in most of the protein families acquiring this fold. Note that the identities of these residues could be different in different families; what really matters is their conservatism within each family (see Figure 2).
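Equation (1) is a plain average over families. The sketch below assumes that the per-family entropy profiles have already been mapped onto common structural-alignment positions and that every family has a value at every position; how gaps in the structural alignment are treated is not specified in this section. The function name is hypothetical.

```python
def conservatism_of_conservatism(family_entropies):
    """Average intrafamily entropies s^m(l) over M families, as in equation (1).

    family_entropies[m][l] is the entropy of family m at structural-alignment
    position l. Assumption: all profiles have the same length and no gaps.
    """
    if not family_entropies:
        raise ValueError("at least one family is required")
    length = len(family_entropies[0])
    if any(len(profile) != length for profile in family_entropies):
        raise ValueError("all profiles must cover the same aligned positions")
    m = len(family_entropies)
    return [sum(profile[l] for profile in family_entropies) / m
            for l in range(length)]


# Hypothetical entropy profiles for three families over three aligned positions.
profiles = [
    [0.0, 1.0, 0.5],
    [0.0, 0.6, 1.5],
    [0.0, 0.2, 1.0],
]
coc = conservatism_of_conservatism(profiles)
```

In this toy input the first position is perfectly conserved in every family, so its S(l) is zero regardless of which residue class each family conserves there.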

CoC versus solvent accessibility: a stability factor

As with any observed quantity, the statistical significance of the obtained S(l) value has to be evaluated. Residues that are less exposed to solvent are known to be more evolutionarily conserved (Koshi & Goldstein, 1997; Branden & Tooze, 1998). The main reason appears to be the selection of thermodynamically stable sequences (see Discussion for more details). "Buriedness" in this context is an indicator of the degree to which an amino acid participates in intraprotein interactions. Obviously, amino acids that are more involved in such interactions are more important for stability and should be conserved under pressure towards some (not necessarily highest possible) thermodynamic stability.

Hence, when structures of the same fold are superimposed, buried residues of one protein match buried residues of the other. These buried residues tend to be more conserved in each family, giving rise to low values of S(l) for buried positions, i.e. a straightforward apparent CoC signal. Thus an important control has to be made: can higher conservatism of buried positions explain the observed values of S(l)?

In order to address this question we formulate the following statistical hypothesis: H0: Sequence entropy s^m(l) at a position l in a family m depends solely on the solvent accessibility a_l of this position, and observed CoC S(l) is fully accounted for by the dependence of amino acid conservation on solvent accessibility.

In formal mathematical language, H0 means that the intrafamily conservatisms at each position can be treated as independent random values with probability density f(s|a) that depends solely on the solvent accessibility a of a position. Since the CoC is given by equation (1) as an average over conservatism within each family, its probability distribution expected under H0 can be derived as a convolution of individual probability densities f(s|a) over all representative proteins. Furthermore, if the number of analogs is large, the probability density for CoC under H0 will be Gaussian according to the central limit theorem. Thus the statistical significance of non-trivial CoC may be estimated by comparing the observed CoC with the value predicted by H0, and expressing the deviation between the two in terms of the number of standard deviations of the probability density for CoC obtained under H0. This can be directly translated into the probability that the observed CoC is a trivial consequence of residue accessibility to solvent, as suggested by the null hypothesis H0.

Figure 1. Schematic representation of the procedure used to compute conservatism-of-conservatism S(l).

To carry out this program we first compute the probability f(s|a) of having sequence entropy s at a position with solvent accessibility a. These statistics are taken over all representative structures from the Protein Data Bank (PDB). Next, using f(s|a) and the central limit theorem we compute S_exp(l), i.e. the value of CoC expected according to H0. And finally, we compute the probability P(S) of observing S(l) according to hypothesis H0 (see Methods for details). This probability P(l) = P(S(l)) together with S(l) is used to assess each position l in the studied folds.
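The test against H0 can be illustrated with a minimal sketch. The actual procedure is described in Methods, which is not reproduced here, so the following is only one plausible reading: f(s|a) is summarized by a mean and a variance per accessibility bin, the entropies of the M families are treated as independent, and P is the lower tail of the resulting Gaussian. The bin names and the numbers in the table are hypothetical placeholders, not values from the paper.

```python
import math

# Hypothetical summary of f(s|a): (mean, variance) of the sequence entropy
# for three coarse solvent-accessibility bins. Illustrative numbers only.
ENTROPY_STATS_BY_BIN = {
    "buried": (0.4, 0.09),
    "intermediate": (0.8, 0.16),
    "exposed": (1.1, 0.16),
}


def expected_coc(accessibility_bins):
    """Return (S_exp, sigma) for one aligned position under H0.

    accessibility_bins holds the accessibility bin of this position in each
    of the M families. The mean of an average of independent variables is
    the average of the means; its variance is the summed variance over M^2.
    """
    m = len(accessibility_bins)
    if m == 0:
        raise ValueError("at least one family is required")
    mean = sum(ENTROPY_STATS_BY_BIN[b][0] for b in accessibility_bins) / m
    variance = sum(ENTROPY_STATS_BY_BIN[b][1] for b in accessibility_bins) / m ** 2
    return mean, math.sqrt(variance)


def lower_tail_probability(s_obs, s_exp, sigma):
    """Gaussian probability of seeing a CoC as low as s_obs, or lower, under H0."""
    z = (s_obs - s_exp) / sigma
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


# A position that is buried in all nine (hypothetical) families.
s_exp, sigma = expected_coc(["buried"] * 9)
p_value = lower_tail_probability(0.1, s_exp, sigma)
```

With these placeholder numbers the observed value lies three standard deviations below the expectation, so the probability is small. The Gaussian approximation is only as good as the central limit theorem allows for the available number of families.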

Figure 3 presents the expected S_exp(l) and the observed (from actual alignments according to the scheme outlined above) S_obs(l) for the immunoglobulin fold. Strikingly, S_exp(l) and S_obs(l) are in very good agreement. This fact suggests that most of the variation in S(l) is indeed explained by solvent accessibility (by the H0 hypothesis). The correlation between S_obs(l) and S_exp(l) in this example is r = 0.89. However, there are a few positions that exhibit CoC S(l) well below S_exp(l). This gives rise to low values of P(l), which indicate that at those positions factors other than solvent accessibility have contributed significantly to conservatism.

From the physical point of view, higher conservatism of buried residues reflects their role in the stabilization of the fold (buried residues are usually more hydrophobic and form more residue-residue interactions in a protein structure) (Koshi & Goldstein, 1997; Bahar & Jernigan, 1997; Gilis & Rooman, 1997). Remarkable values of correlation between S_exp(l) and S_obs(l), which vary around 0.9 for the studied folds, demonstrate that the dominant contribution to S_obs(l) arises from the requirement for thermodynamic stability of a protein structure. The signal which we are interested in, i.e. the deviation of S_obs(l) from S_exp(l), cannot be explained by the stability requirement alone. Hence, a low value of P(l) indicates some additional evolutionary pressure on a position l in the fold.

Conservatism across the families

Conservatism across the families (S_across(l)) addresses the following question: are there any positions in the proteins of the same fold that are frequently occupied by the same type of residues in different families? Similar to the CoC computations, we first align sequences within each family using sequence alignment, and then align families against each other using structural alignment. One should be careful to weigh large and small families equally. First, we compute p_i^m(l), the frequency of residue type i at position l within each family m. Next we compute the across-family frequency:

$$ P_i(l) = \frac{1}{M} \sum_{m=1}^{M} p_i^{m}(l) $$

and the across-family entropy:

$$ S_{\mathrm{across}}(l) = -\sum_{i=1}^{6} P_i(l) \log P_i(l) $$

A related quantity was analyzed by Ptitsyn (1998) for the cytochrome C family. Note that S(l) ≤ S_across(l) always holds, because entropy is a concave function of the frequencies. To understand the difference between S(l) and S_across(l) consider the following example. If position l is conserved in each family, then S(l) is low. If residues of the same type are conserved (e.g. this position is hydrophobic aliphatic in each family), then S_across(l) is also low. However, if a position is conserved within each family, but different families have different types of residues at this position, then S(l) is low, but S_across(l) is high. Such a situation occurs when one family has a conserved hydrophobic group and another has a conserved cysteine residue forming a disulfide bond or a conserved charged residue participating in a salt bridge. Another class of positions which have low S(l) and high S_across(l) corresponds to functional super-sites in protein folds. In this case each family has a conserved active site residue at position l, but since the function is different, the types of these residues are different in the different families. In this case a pronounced difference between S(l) and S_across(l) would indicate the presence of a super-site. We should note, however, that unlike CoC, which is a well-defined statistical quantity, S_across is less well-defined statistically since it is more dependent on the evolutionary history within individual families. However, we may consider a qualitative difference between S_across(l) and S(l) at some position as a heuristic indicator that amino acid residues at such positions have been under an "unusual" evolutionary pressure (often related to a super-site or in some cases to a folding nucleus, see below). Since S(l) and S_across(l) correspond to different patterns of evolutionary pressure we consider signals reflected in both these quantities.
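The contrast between S(l) and S_across(l) can be made concrete with a small sketch that computes both for a single aligned position. The helper names and toy columns are hypothetical, the natural logarithm is assumed, and each family contributes one frequency vector so that large and small families are weighted equally.

```python
import math

# The six residue classes from the text, in a fixed order.
CLASSES = ["AVLIMC", "FWYH", "STNQ", "KR", "DE", "GP"]


def class_freqs(column):
    """Class frequencies p_i^m(l) for one family's column at position l."""
    counts = [sum(1 for aa in column if aa in members) for members in CLASSES]
    total = sum(counts)
    if total == 0:
        raise ValueError("column contains no classifiable residues")
    return [c / total for c in counts]


def entropy(freqs):
    """Entropy of a frequency vector, skipping zero entries."""
    return sum(-p * math.log(p) for p in freqs if p > 0)


def coc_and_across(family_columns):
    """Return (S, S_across) for one aligned position.

    S averages the per-family entropies; S_across is the entropy of the
    family-averaged frequencies P_i(l).
    """
    per_family = [class_freqs(col) for col in family_columns]
    m = len(per_family)
    s = sum(entropy(f) for f in per_family) / m
    pooled = [sum(f[i] for f in per_family) / m for i in range(len(CLASSES))]
    return s, entropy(pooled)


# Each family conserves the same class (aliphatic): both quantities are low.
same_type = coc_and_across(["LLLL", "VVIV", "AAA"])
# Each family conserves a different class: S stays low, S_across is high.
different_type = coc_and_across(["LLLL", "WWWW", "KKKK"])
```

In the second case every family is perfectly conserved, yet the pooled frequencies are spread over three classes, which is the signature the text associates with super-sites or with alternative stabilization mechanisms.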

Importantly, both low S(l) and low P(l) are used as indicators of strong conservation in our analysis. In fact, low S(l) and high P(l) mean that although position l is conserved in several families, the degree of conservation does not exceed that expected from the solvent accessibility factor alone. Conversely, when S(l) is moderate and P(l) is low, position l may not be very conserved in the families, although even this weak conservatism is unusual for positions of such (usually high) solvent accessibility. This kind of weak but significant CoC could be a signature of a solvent-exposed common binding/active site. Examples of such situations will be discussed in more detail (see oligonucleotide-binding fold below).

Immunoglobulin fold

The immunoglobulin fold is the most populated one among known beta-proteins. Tenascin (1ten) is used as a representative protein for this fold. Figure 4 presents S(l), S_across(l) and P(l). Positions with low S(l) and low P(l) are the ones that exhibit high CoC. Positions with P ≤ 10^-3 in the immunoglobulin fold are marked with stars in Figure 4. There are six positions with P(l) ≤ P_c = 10^-3 and S(l) ≤ 0.2 in tenascin: Ala17(819), Ile19(821), Trp21(823), Leu33(835), Val69(871), and Leu71(873). (Here and below residues are counted from the beginning of the PDB file; residue numbers used in the PDB file are shown in parentheses.) Figure 5 shows those six residues in the structure of tenascin. Importantly, residues with high CoC form a dense cluster in the core of the protein, connecting strands B, C and F (strand notations are according to Branden & Tooze (1998)).
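The selection rule used here, together with the interpretation of S(l) and P(l) given earlier, can be written as a small classifier. This is a sketch under assumptions: the two cutoffs are the ones quoted for tenascin, the text does not define a separate boundary for "moderate" S(l), so a single S cutoff is reused, and the label strings and function name are hypothetical.

```python
P_CUTOFF = 1e-3   # P_c quoted in the text
S_CUTOFF = 0.2    # S(l) cutoff quoted for tenascin


def classify_position(s_value, p_value, s_cutoff=S_CUTOFF, p_cutoff=P_CUTOFF):
    """Label one aligned position from its CoC S(l) and probability P(l)."""
    low_s = s_value <= s_cutoff
    low_p = p_value <= p_cutoff
    if low_s and low_p:
        return "high CoC"
    if low_s:
        return "conserved, explained by solvent accessibility"
    if low_p:
        return "weak but significant CoC"
    return "not conserved"


def high_coc_positions(s_values, p_values):
    """Return 1-based positions that pass both cutoffs."""
    if len(s_values) != len(p_values):
        raise ValueError("S and P profiles must have the same length")
    return [index + 1
            for index, (s, p) in enumerate(zip(s_values, p_values))
            if classify_position(s, p) == "high CoC"]


# Hypothetical profiles for four aligned positions.
selected = high_coc_positions([0.1, 0.15, 0.9, 0.6], [1e-4, 0.3, 0.5, 1e-5])
```

Only the first of the four toy positions passes both cutoffs; the second is conserved but unremarkable for its accessibility, and the fourth is the kind of weakly conserved but significant position the text associates with exposed binding sites.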

What is the origin of the high CoC at some positions of the immunoglobulin fold? As stated in the Introduction, three factors can account for high CoC: (i) a super-site; (ii) key positions responsible for stabilization of the fold; and (iii) a folding nucleus, whose stability is required for fast folding. We consider all three possibilities for the immunoglobulin fold.


Figure 3. Observed S_obs(l) (red) and expected S_exp(l) (blue) CoC in the immunoglobulin fold. S_exp(l) is calculated based on the solvent accessibility. The remarkable correlation of 0.9 shows that solvent accessibility explains most of the conservatism in protein families. Error bars on S_exp(l) show one standard deviation σ(S).


Function

Most of the proteins of the immunoglobulin fold are extracellular domains responsible for specific binding and/or recognition of small ligands (Fab), peptides (MHC), DNA (p53 DNA-binding domain) or other proteins (hormones, proteins of extracellular matrix, other receptors). Different proteins use very different parts of the fold for specific binding, and the binding site is often located at the junction between the two immunoglobulin domains. For example, growth hormone receptor (1cfb) has two domains with the immunoglobulin fold which bind the hormone by their loops, but one domain uses loops at one end of the fold and the other uses the loops located on the opposite side of the fold. In general, there is no specific part of the immunoglobulin fold which is used for binding/active site placement. Hence, high CoC in this fold cannot be explained by conservation of functional residues.

Stability

The low values of P(l) show that high CoC cannot be explained by solvent accessibility. Hence, high CoC is hardly a result of conservation driven by a requirement for thermodynamic stability. These positions are under some additional evolutionary pressure. Some proteins of the immunoglobulin fold have disulfide bridges at the high CoC positions (see below). This observation points to some special role of the high CoC positions in the fold stabilization and/or initiation.

Figure 2. Structural alignment of families in the immunoglobulin fold. Each line corresponds to a single family and representative proteins for each family are indicated. The grayscale level shows conservatism within each family: dark, conserved; light, variable. For each family we present the PDB code and the sequence of the representative protein. Note dark vertical strips corresponding to positions with high CoC (positions 17, 19, 21, 33, 69 and 71).

Kinetics

Importantly, tenascin and some other proteins of the immunoglobulin fold (twitchin, FNIII-9, FNIII-10) are known to fold fast and by a two-state mechanism (Plaxco et al., 1996; Hamill et al., 1998). They are expected to have a stable folding nucleus (see above), i.e. a set of residues that interact with each other in the transition state. Stability of the nucleus provides rapid folding to the native state and, hence, residues contributing to the nucleus are conserved (Mirny et al., 1998). If the location of the folding nucleus in the protein structure depends primarily on the fold and not on the sequence, then positions belonging to the nucleus should exhibit high CoC.

Strong evidence in support of this view is provided in a recent experimental study by Lorch et al. (1999). These authors studied the N-terminal domain of rat CD2, which has no disulfide bonds. Making nine mutations in the core of this protein and measuring stability and folding kinetics associated with each mutation, they identified residues belonging to the folding nucleus as Ile18, Val30, Trp32 and Val78. When the structure of CD2 is superimposed with tenascin, the folding nucleus of CD2 maps onto positions Trp21(823), Ile31(833), Leu33(835) and Leu71(873) in tenascin, all of which exhibit statistically significant CoC (see Figure 4).

Different mechanisms of stabilization

As one can see from Figure 4(b), the positions that exhibit significant CoC have low values of across-conservatism S_across(l), indicating that these positions carry residues of the same type in most of the families. A substantial difference between S_across(l) and S(l) also shows that not all families have the same types of residues in the outlined positions. Figure 2 presents structural alignments of the proteins of the immunoglobulin fold. The grayscale level indicates the degree of conservation within each family (s^m(l)). Positions with high CoC can be seen as dark vertical strips on the diagram.

Figure 4. Conservatism in immunoglobulin fold. (a) Probability P(S < S_obs) of observing S(l) by chance. (b) Observed S(l) (circles) and S_across(l) (squares). Positions with P < 10^-3 are shown by filled circles.

Figure 5. Structure of tenascin (the immunoglobulin fold). Residues with high and significant CoC are shown by space-filling models. All protein structure cartoons are produced using Molscript (Kraulis, 1991) and Raster3D (Merritt & Bacon, 1997).

From the diagram (Figure 2) one can see that different families may have different residues at the high CoC positions. Importantly, pairs of interacting residues from the high CoC set of tenascin correspond to disulfide bridges in some immunoglobulins. In particular, Trp21 and Leu71 in tenascin correspond to a disulfide bridge Cys23-Cys92 in the beta chain of the 14.3.D T-cell antigen receptor (1bec) and to Cys23-Cys94 in CD8 (1cd8). The pair Leu33 and Leu71 corresponds to a disulfide bridge in the second domain of CD4 (3cdy) and to one of the bridges in the first domain of growth hormone receptor (1axi). The presence of the disulfide bridges indicates evolutionary pressure to stabilize interactions between these positions.

Interestingly, in many families of the immunoglobulin fold a strongly conserved tryptophan residue is found at one of the six outlined positions. For example, tenascin has Trp21, whereas CD8 (1cd8), Kb5-c20 T-cell antigen receptor (1kb5B), Igg2a intact antibody (1igtB) and myelin p0 protein fragment (1neu) have a tryptophan residue at the position corresponding to Leu33 in tenascin. CD4 (1cdy) has a tryptophan residue corresponding to Val69 in tenascin. This "circular permutation" of tryptophan illustrates a possibility of correlated mutations in the putative folding nucleus of very far diverged proteins.

Both examples demonstrate the distinction between S(l) and Sacross(l). Proteins of the immunoglobulin fold use different types of interaction (e.g. disulfide bonds, hydrophobic and aromatic stacking) to stabilize the same positions in the structure. Hence, different types of residues are conserved in different families. Such a diversity of conserved residues makes Sacross(l) high, while keeping S(l) low. However, at many positions of the immunoglobulin fold the dominant factor in stabilization of the folding nucleus is hydrophobic aliphatic interactions, and hence Sacross(l) has rather low values at positions where S(l) is low.
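The contrast between S(l) and Sacross(l) can be illustrated with a minimal sketch. It assumes that the within-family conservatism sm(l) is a Shannon entropy over residue classes at one aligned position, that S(l) is the average of sm(l) over families, and that Sacross(l) is the entropy of the class frequencies averaged over families. The residue-class labels, the input layout and the function names are hypothetical and serve only as an illustration.

```python
import math
from collections import Counter


def class_frequencies(column):
    """Return {residue class: frequency} for one family's alignment column."""
    counts = Counter(column)
    total = len(column)
    return {cls: n / total for cls, n in counts.items()}


def entropy(freqs):
    """Shannon entropy (natural logarithm) of a frequency dictionary."""
    return sum(-p * math.log(p) for p in freqs.values() if p > 0)


def coc_and_across(families):
    """families: one list of residue-class labels per family, all taken at
    the same structurally aligned position l (assumed input layout)."""
    per_family = [class_frequencies(col) for col in families]
    # Assumed S(l): mean within-family entropy (a low value means high CoC).
    s_coc = sum(entropy(f) for f in per_family) / len(per_family)
    # Assumed Sacross(l): entropy of class frequencies averaged over families.
    classes = {cls for f in per_family for cls in f}
    mean_freqs = {c: sum(f.get(c, 0.0) for f in per_family) / len(per_family)
                  for c in classes}
    return s_coc, entropy(mean_freqs)


# Three hypothetical families, each perfectly conserved at this position,
# but each using a different kind of residue.
diverse = [["aliphatic"] * 5, ["cysteine"] * 5, ["aromatic"] * 5]
# Three hypothetical families that all conserve the same kind of residue.
uniform = [["aliphatic"] * 5] * 3

s_diverse, across_diverse = coc_and_across(diverse)
s_uniform, across_uniform = coc_and_across(uniform)
```

In the first toy case S(l) is zero while Sacross(l) equals ln 3, which is the situation described for positions stabilized by different interaction types. In the second case both quantities are zero, as expected when all families use the same residue type.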

A previous study of sequence and structure conservation in the immunoglobulins (Bork et al., 1994) did not identify those key positions in the fold. Instead, the authors came to the conclusion that "no single interaction (or localized set of interactions) can be uniquely identified as a principal determinant of the Ig-like fold". Although conserved residues in each family were identified, the fact that different types of conserved residues were found at corresponding positions in different proteins obscured the analysis made by Bork et al. (1994).


Figure 7. Structure of major cold shock protein CspB (OB fold). Residues with high and significant CoC are shown by space-filling models.


Oligonucleotide-binding fold

The oligonucleotide-binding (OB) fold is a β-fold with a barrel topology. Proteins belonging to this fold have diverse sequences and functions. About 20 non-homologous protein families share this fold. Figure 6 presents S(l) and P(l) for the major cold shock (CspA) protein (1mjc in PDB), a typical OB fold protein. Positions with high CoC (S(l) ≤ 0.2 and P(l) < 1 %) in CspA are: Val8(9), Ile20(21), Val50(51) and Val66(67). These residues form a dense cluster located inside the β-barrel, closer to one open end of the barrel (see Figure 7). In contrast to the immunoglobulin fold, where the high CoC cluster is located in the center of the protein, residues with high CoC in the OB fold are asymmetrically grouped near "the bottom" of the barrel. Strands 1, 2, 4 and 5 are involved in the cluster. Hence, when the four outlined residues come together, the overall topology of the chain becomes well defined.

The across-conservatism Sacross has deep minima at the outlined positions, demonstrating that most of the analogs carry the same type of residues there.

What is the origin of the CoC in the OB fold: function, stability or kinetics? Proteins of the OB fold are (i) nucleotide binding (transcription signals, RNA binding, tRNA synthetase, staphylococcal nucleases); (ii) inorganic pyrophosphatases; (iii) tissue inhibitor of metalloproteinases; or (iv) toxins. The binding site of single-stranded DNA and RNA is localized mostly on the face of the molecule in strands 2, 3 and on the surface loops (Newkirk et al., 1994). Residues contributing to this site are different from the four residues belonging to the high CoC cluster shown in Figure 7. Therefore, functional conservation cannot account for the observed CoC.

Figure 6. Conservatism in OB fold. (a) Probability P(S < Sobs) of observing S(l) by chance. (b) Observed S(l) (circles) and Sacross(l) (squares). Positions with P < 10⁻² are shown by filled circles.

Thermodynamic properties of the cold shock proteins are very well studied (Perl et al., 1998; Reid et al., 1998; Schindler et al., 1998, 1999). This family of proteins serves as a clear example where thermodynamics are separated from the kinetics of folding: different proteins in the family have very different stability, but all fold very fast and by a two-state mechanism (Perl et al., 1998; Reid et al., 1998; Schindler et al., 1999). A great variation in stability with almost no changes in folding rates demonstrates that amino acid residues responsible for stability and fast folding are located in different regions of the structure of these proteins. Unfortunately, there is no study where stability and folding kinetics of a variety of CspA/CspB mutants are measured. Schindler et al. (1998) mutated surface-exposed phenylalanine residues in CspB and showed a substantial destabilization upon mutation of Phe15, Phe17 and Phe27 (corresponding to Phe17(18), Phe19(20) and Phe30(31) in CspA). Stabilization of the CspB structure by exposed phenylalanine residues is further evidence that different regions of the protein structure are responsible for stability and for kinetics.

Low values of P(l) for the four outlined positions in the OB fold indicate that CoC in those positions can hardly be explained by solvent accessibility alone. Thus we predict that Val8(9), Ile20(21), Val50(51) and Val66(67) constitute a folding nucleus in the OB fold proteins, and their conservation gives rise to the conservation of rapid folding.

Note that several positions in the OB fold (Phe11(12), Phe30(31), His32(33) and Gly47(48)) exhibit CoC that is moderate in absolute value (0.2 < S(l) ≤ 0.5) but is of high statistical significance (P < 1 %). Interestingly, these positions constitute a nucleotide/phosphate binding super-site. We examined several proteins having the OB fold and found that in all nucleotide binding proteins (nucleases, DNA and RNA binding, etc.) and inorganic pyrophosphatases the active site is localized at the same face of the barrel and involves these exposed aromatic residues (Newkirk et al., 1994; Schindelin et al., 1994). For example, Arg35 is central to the active site of staphylococcal nuclease (1snc). This position corresponds to His32(33) in CspB; ferredoxin-NADP⁺ reductase (1fnc) places its FAD binding site in the same location. On the other hand, toxins that have the OB fold do not use this face of the barrel for specific binding. Instead, they form large complexes where a helix located on the top of the barrel is involved in ATP binding (e.g. pertussis toxin S2/S3 subunits). Hence, these residues are conserved in nucleotide/phosphate binding proteins but not in the toxins, forming a "weak super-site". As a result, S(l) is not very low, but still very significant (low P(l)) for such exposed positions. It is an important feature of the CoC analysis that it allows us to identify a consensus between functionally conserved positions in non-homologous proteins of the same fold. Later we discuss possible biological implications of this peculiar interplay of function and stability in the cold shock proteins.

Figure 8. Conservatism in Rossman fold. (a) Probability P(S < Sobs) of observing S(l) by chance. (b) Observed S(l) (circles) and Sacross(l) (squares). Positions with P < 10⁻⁵ are shown by filled circles.

Figure 9. Structure of CheY protein (Rossman fold). Residues with high and significant CoC are shown by space-filling models.

Rossman fold

The Rossman fold is the most populated fold among α/β-folds. We use chemotactic protein CheY (3chy) as a representative of this fold. Figure 8 presents CoC and Sacross(l) for CheY. Residues with S(l) < 0.3 and P(l) < 10⁻⁵ are Phe7(8), Val9(10), Val10(11), Asp11(12), Asp12(13), Met16(17), Val53(54), Asp56(57), Trp57(58) and Ala87(88). The numbers are from the PDB file, and the numbers in parentheses are as reported by Lopez-Hernandez & Serrano (1996). In the protein structure (see Figure 9) residues Asp11(12), Asp12(13), Met16(17), Asp56(57), Trp57(58) and Ala87(88) form a dense, solvent-exposed cluster at the C termini of strands 1, 3 and 4 and the N terminus of helix 1. Residues Phe7(8), Val9(10), Val10(11) and Val53(54) are lined up on strands 2 and 3. The exposed cluster of Asp11(12), Asp12(13) and Asp56(57) is stabilized by a bound Mg²⁺.

Figure 10. Conservatism in alpha/beta plait. (a) Probability P(S < Sobs) of observing S(l) by chance. (b) Observed S(l) (circles) and Sacross(l) (squares). Positions with P < 2 % are shown by filled circles.

Figure 11. Structure of ADA2h protein (alpha/beta plait). Residues with high and significant CoC are shown by space-filling models.

The folding nucleus of CheY was identified by Lopez-Hernandez & Serrano (1996). They mutated several positions scattered through the whole protein and measured changes in stability and the (un)folding rate of the mutant proteins. Importantly, for most of the mutated positions φ-values are either close to 0 or above 0.5. Comparison of the CoC data with the measured φ-values gives evidence in support of a kinetic origin of the CoC in the Rossman fold. Positions where mutations had been made are shown on the bottom of Figure 8, with triangles marking those with φ > 0.5. Agreement between high CoC positions and those with φ > 0.5 is very good (both high CoC and φ > 0.5: Val9(10), Val10(11), Asp11(12), Asp12(13), Val53(54), and Asp56(57); there are no positions with high CoC and φ ≤ 0.5; residues Phe7(8), Met16(17), Ala87(88), no measurements; low CoC and φ > 0.5: Val32(33), Ala35(36), Ala41(42)). No mutations were made for Phe7(8). Residue Ala87(88), which has S(l) = 2.29 and P < 10⁻¹⁰, may also be important for kinetics, as an A87G mutation makes the protein fold much slower without affecting its stability (L. Serrano, personal communication). Being at the end of strand 5, Ala87 can be responsible for terminating this strand. Another explanation is its functional role in the proteins of the Rossman fold (see below). Positions Val32(33), Ala35(36), Ala41(42), which have φ > 0.5 but no CoC, probably belong to "an extended folding nucleus", a part of the nucleus which varies from family to family and hence exhibits no CoC signal. This body of data from protein engineering experiments strongly supports a link between CoC and folding kinetics.
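The agreement between the two classifications can be tabulated with simple set operations. The sketch below only restates the residue lists given in the text (PDB numbering); the variable names and the sorting helper are hypothetical.

```python
# Positions in CheY (PDB numbering), transcribed from the text above.
high_coc = {"Phe7", "Val9", "Val10", "Asp11", "Asp12",
            "Met16", "Val53", "Asp56", "Trp57", "Ala87"}
phi_above_half = {"Val9", "Val10", "Asp11", "Asp12", "Val53", "Asp56",
                  "Val32", "Ala35", "Ala41"}
not_measured = {"Phe7", "Met16", "Ala87"}

# Note: Trp57 is in the high-CoC list but the text assigns it to no
# category, so it is left unclassified here.
both = high_coc & phi_above_half            # high CoC and phi > 0.5
phi_only = phi_above_half - high_coc        # phi > 0.5 but no CoC signal
coc_unmeasured = high_coc & not_measured    # high CoC, no phi measurement


def residue_number(name):
    """Sort key: numeric part of a label such as 'Val10' (hypothetical helper)."""
    return int(name[3:])


print("high CoC and phi > 0.5:", sorted(both, key=residue_number))
print("phi > 0.5 only:", sorted(phi_only, key=residue_number))
print("high CoC, not measured:", sorted(coc_unmeasured, key=residue_number))
```

Six positions fall in the overlap, and the three positions with φ > 0.5 only are the ones the text attributes to an extended folding nucleus.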

The Rossman fold is known to have a super-site with functional residues located at the C termini of the β-strands (Branden & Tooze, 1998; Russell et al., 1998b). In CheY these positions are 11-12, 16, 56 and 87. Remarkably, the same positions in the fold are used to provide fast folding (as shown by protein engineering experiments) and to build the active/binding site. Residues Asp11(12), Asp12(13), Met16(17) and Asp56(57) form Mg²⁺ binding sites (Lopez-Hernandez & Serrano, 1996); residue Ala87(88) is in contact with Lys109 and is probably involved in the mechanism of allosteric transition in CheY (Welch et al., 1994; Bellsolell et al., 1996). There is an important relation between the function of CheY and its stability: cation binding substantially stabilizes the structure of CheY and coordinates the charged side-chains of Asp11(12), Asp12(13), and Asp56(57) (Wilcock et al., 1998). In the Discussion we examine possible biological implications of linking protein function with kinetics and stability. The importance of the high CoC positions in the Rossman fold for both function and folding kinetics has imposed a strong evolutionary pressure at those positions.

The alpha/beta plait

The alpha/beta plait has an antiparallel α + β topology and consists of a few helices and a four-stranded β-sheet. This fold is the third most populated after TIM barrels (see below) and Rossman folds. Proteins of this fold have very diverse functions, thermodynamic and kinetic properties (Villegas et al., 1989; van Nuland et al., 1998a,b; T. Ternstro et al., unpublished results).

Results for this fold are presented in Figure 10. We chose acylphosphatase (pdb:2acy) as a representative protein. In acylphosphatase, positions with high and significant CoC (S(l) < 0.25 and P(l) < 2.5 %) are: Tyr11, Thr26, Gly30, Gly49 and Leu65. Note that the statistical significance of these results is lower than for all other folds described above. This lower statistical significance comes from the smaller number of non-homologous families known to have this fold. Only 29 families were used in our analysis, in contrast to 51 families for the immunoglobulin fold and 166 families for the Rossman fold.

Similar to other cases, residues with high and significant CoC are all located close to each other in space (see Figure 11). This CoC cannot be attributed to functional conservatism, since there is no super-site for this fold (Branden & Tooze, 1998; Russell et al., 1998b). Functional/binding residues are typically located at either a solvent-exposed face of the beta-sheet (active site of the glutamine synthetase (Liaw et al., 1994); catalytic site of BIRA protein (Wilson et al., 1992); active site of the human DNA polymerase beta (Pelletier et al., 1996)) or at the loops (ligand binding site of the D-3-phosphoglycerate dehydrogenase (Schuller et al., 1995)).

Figure 12. Conservatism in TIM barrels. (a) Probability P(S < Sobs) of observing S(l) by chance. (b) Observed S(l) (circles) and Sacross(l) (squares). Positions with P < 10⁻⁴ are shown by filled circles.

Comparison of our results with those from protein engineering experiments (Villegas et al., 1998) indicates that CoC at some positions is related to the folding nucleus. For four proteins of the alpha/beta plait, human procarboxypeptidase A2 (ADA2h), spliceosomal protein U1A, acylphosphatase (AcP), and histidine-containing phosphocarrier protein (HPr), folding kinetics and thermodynamics have been studied (van Nuland et al., 1998a,b; T. Ternstro et al., unpublished results; Villegas et al., 1998). For two of these proteins (ADA2h and U1A), the transition state (and folding nucleus) has been characterized. Importantly, all four proteins exhibit a two-state folding transition. However, AcP and HPr fold slowly ($k_f^{\mathrm{H_2O}} = 0.23\ \mathrm{s}^{-1}$ and $k_f^{\mathrm{H_2O}} = 14.9\ \mathrm{s}^{-1}$, respectively), while ADA2h and U1A fold very fast ($k_f^{\mathrm{H_2O}} = 897\ \mathrm{s}^{-1}$ and $k_f^{\mathrm{H_2O}} = 316\ \mathrm{s}^{-1}$).
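To put these rate constants on a more intuitive scale, one can convert them to folding half-times. The sketch below assumes simple first-order (two-state) kinetics, for which the half-time is ln 2 divided by the folding rate constant; the dictionary and function names are illustrative only.

```python
import math

# Folding rate constants in water (s^-1), as quoted in the text.
folding_rates = {"AcP": 0.23, "HPr": 14.9, "ADA2h": 897.0, "U1A": 316.0}


def half_time(rate_constant):
    """Half-time (s) of a first-order process with rate constant in s^-1."""
    return math.log(2) / rate_constant


half_times = {name: half_time(k) for name, k in folding_rates.items()}

# List the proteins from fastest to slowest folder.
for name, t in sorted(half_times.items(), key=lambda item: item[1]):
    print(f"{name}: t1/2 = {t:.3g} s")
```

Under this assumption AcP needs about 3 s to reach half-folding, whereas ADA2h needs less than a millisecond, a difference of nearly four thousand-fold between proteins that share the same fold.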

When the structures of ADA2h, U1A and AcP are superimposed, some nucleation residues coincide with each other and with the high CoC residues. In particular, there is a clear consensus between two experimentally identified and one putative nucleus residue in positions Tyr11 and Thr26 in AcP (Ile14 and Leu30 in U1A; Ile15 and Leu26 in ADA2h). Another position that can also belong to the folding nucleus is Leu65 in AcP (Phe65 in ADA2h; Phe34 in U1A, or perhaps Met72 or Met82, which were not studied in experiment). Other nucleation residues in ADA2h and U1A do not coincide with each other or with the high CoC positions. These residues either constitute "an extended folding nucleus", which varies from family to family, or are under some other sort of evolutionary pressure. The difference between folding nuclei in U1A and ADA2h may be due to a very different twist of the U1A structure and a substantial angle between the first helices (M. Oliveberg, personal communication).

TIM barrel

TIM barrel is the third most populated α/β-fold. Sequences and functions of the proteins sharing this fold are very diverse. Very little is known about the stability of the TIM barrel proteins and no data are available regarding their folding kinetics. On the other hand, functions of the majority of TIM barrel proteins are well known. This fold has a distinctive super-site at the loops on the top of the barrel (Russell & Ponting, 1998; Branden & Tooze, 1998).

Figure 12 presents results of our analysis of the TIM barrel fold mapped onto the structure of endo-beta-N-acetylglucosaminidase (EBN, pdb:2ebn). Remarkably, solvent accessibility is a very good predictor of the CoC signal, yielding a correlation of 0.9 between Sobs(l) and Sexp(l) over the whole structure of about 300 residues. The only positions in EBN with high significant CoC (P(l) < 10⁻³ and S(l) < 0.3) are: Thr12(16), Ser42(46), Ser86(90), Leu88(92), Asp126(130), Asp127(131), Glu128(132), Tyr167(171), Asp194(198), Ser217(221), Gln218(222), and Phe245(249). All these positions belong to the super-site.
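A correlation of this kind between an observed and an expected conservatism profile can be computed as a Pearson coefficient. The sketch below is a generic implementation under that assumption; the function name is hypothetical and the two short profiles are made-up numbers, not data from this study.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Made-up toy profiles for five positions (NOT values from the paper):
# expected conservatism from solvent accessibility vs. observed conservatism.
s_expected = [0.2, 0.5, 0.9, 1.2, 1.5]
s_observed = [0.3, 0.4, 1.0, 1.1, 1.6]
r = pearson(s_expected, s_observed)
```

Positions whose observed S(l) lies well below the value expected from accessibility are the ones that stand out against such a strongly correlated background.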

In 19 TIM barrel representative structures, the active site is reported in the PDB file (record "SITE"). A total of 67 positions are reported. The distribution of these positions along the fold is the following. The vicinity (± three residues) of Thr12(16) contains four active site positions; Ser42(46): 4; Ser86(90)/Leu88(92): 6; Asp126(130)/Asp127(131)/Glu128(132): 16; Tyr167(171): 16; Asp194(198): 6; Ser217(221)/Gln218(222): 4. In total, 58 out of 67 reported active site residues are localized at the high CoC positions. All of them are located at the top loops connecting the β-barrel with helices (see Figure 13).

Importantly, the "super-site" induces no signal on Sacross, since different amino acid residues are used in the active sites of different families. The CoC, in contrast, very clearly identifies the super-site, since amino acid residues in the active site are very conserved within each family.

Discussion

Here, we report a detailed study of molecular evolution of five of the most populated protein folds. Out of ~2200 domains in known structures without evident sequence homology, 564 belong to the five dominant folds that were analyzed in this study. The data-demanding nature of the method of analysis limits it to folds that contain at least 20-30 non-homologous families. As the number of solved protein structures increases, this analysis can be extended to other folds.


Figure 13. Structure of endo-beta-N-acetylglucosaminidase (TIM barrel). Residues with high and significant CoC are shown by space-filling models. All those residues correspond to the active site positions in TIM barrel proteins.


For each of the folds studied, we identified positions that are conserved in most of the protein families sharing this fold. In all studied folds, residues that show high CoC form a dense cluster in the native structure. The location of this cluster and the nature of the interactions stabilizing it are, however, different in different folds and even in different families of the same fold. For example, in the immunoglobulin domain residues with high CoC form a cluster deeply buried in the fold. Some families of this fold stabilize this cluster by hydrophobic interactions, some by disulfide bonds. In contrast, proteins of the Rossman fold have high CoC residues mostly on the solvent-exposed helix-sheet loops. In different families having the Rossman fold this cluster is stabilized by either Coulomb interactions between aspartic acids and a bound metal ion, or by hydrophobic interactions, or, perhaps, by interactions with the ligand in flavodoxins. Similarly, in the OB fold the position of the CoC cluster is shifted to the bottom of the β-barrel.

Correlated mutations

Different interactions between high CoC residues in different families lead to the emergence of "correlated mutations" (Altschuh et al., 1988; Thomas et al., 1996). In fact, substitution of two hydrophobic residues by two cysteine residues forming a disulfide bridge can be considered as a clear example of a correlated mutation. Several substitutions of this kind can be observed when families of a particular fold are aligned with each other (by a sequence alignment within a family and by a structural alignment between the families). Deeper analysis of these cases leads to a very different picture for correlated mutations.

In the immunoglobulin folds, substitution of a hydrophobic pair in high CoC positions by a pair of cysteine residues is typical (see above). However, different pairs of residues are substituted by cysteine residues, e.g. position 71 in tenascin can be occupied by a cysteine forming a disulfide bridge with either position 21 or position 33. Hence, an analysis of any of these pairs (71-21 or 71-33) reveals very little or no correlation. A more striking example is the "circular permutation" of the tryptophan residue among the high CoC positions of the same fold (see above). This cluster of residues usually contains a single tryptophan residue, which can be at either position 19, 21, 33, 69 or 71. Clearly, no pair of these positions exhibits correlated mutations. It is the whole cluster, not any single pair, that exhibits correlated mutations. This analysis explains why observed correlated mutations are so rare: there are several ways to stabilize even a small cluster of residues and no pair of positions in this cluster is superior to the others. Another lesson is that correlated mutations do exist, but they involve more than two residues. Identification of such cases requires analysis of far diverged proteins and hence can only be done over alignments of several families sharing the same fold. These correlated mutations in distantly diverged homologous or analogous proteins are a manifestation of the fact that interactions are more conserved than residues in protein evolution.


Origins of CoC

The main goal of our study is to find, in each specific case, the physical and evolutionary reason for the observed conservatism. Apparently the most common selection pressure is thermodynamic stabilization of sequences, which should serve as a "noise baseline" for our analysis. One can expect (see below) that pressure towards thermodynamic stabilization will be stronger on amino acid residues that participate in a larger number of intraprotein interactions, i.e. the ones that are more buried in the structure. To this end one would predict a strong correlation between conservation and any measure of buriedness of an amino acid residue, such as solvent accessibility. Our analysis indeed reveals a (surprisingly) strong correlation between CoC and solvent accessibility.

A more detailed quantitative explanation of the correlation between CoC and solvent accessibility comes from statistical-mechanical theory (L.A.M. & E.I.S., unpublished results). The formal analysis is based on the detailed analogy between protein sequence selection and certain statistical-mechanical spin models (Shakhnovich & Gutin, 1993b; Shakhnovich, 1998b). Within this analogy the degree of evolutionary pressure towards stabilization is analogous to temperature in statistical mechanics; it can be shown that more buried amino acid residues are at a "lower" effective temperature, i.e. they are indeed under stronger evolutionary pressure towards stabilization. We would like to stress that solvent accessibility in our analysis serves as a measure of amino acid involvement in a protein structure. The statistical-mechanical explanation of the correlation between CoC and solvent accessibility does not imply or assume that interaction with solvent is the only or a dominant force in protein stabilization. Our analysis rather points to the integral effect of all interactions that lead to protein stabilization as a major reason for the correlation between solvent accessibility and CoC. Thus we conclude that the "baseline noise" level of our analysis, i.e. the correlation between CoC and solvent accessibility, actually accounts for the evolutionary selection of stable sequences (not necessarily the most stable ones but just at some acceptable level).

However, we found a number of universal positions in each fold whose CoC is much stronger than that expected from the stability pressure alone. These positions are obviously under some additional selective pressure. Since individual evolutionary histories and functions of analogs are very different, the only common feature that they share is their native structure (fold). Hence the origin of the stronger than expected CoC can be attributed primarily to the evolutionary pressure to preserve some structural features. Among these features two factors dominate in determining the CoC: the functional super-site and the folding nucleus. We also cannot exclude the possibility that some of the high CoC positions may play a somewhat special role in stabilizing the fold (serving as "anchor" positions), and hence are under stronger evolutionary pressure than expected from the solvent accessibility only (L.A.M. & E.I.S., unpublished results). A possible example of this kind may be helix or β-strand initiation and termination signals.

High CoC in most of the identified positions corresponds to either the super-site or the folding nucleus. Experimental evidence exists that for three out of five studied folds (immunoglobulin, Rossman fold and alpha/beta plaits) the high CoC is indeed related to the folding nucleus. However, analysis of the high CoC positions in different proteins brought us to a surprising conclusion that some positions in proteins contribute to both the active site and the folding nucleus, or contribute to the binding/active site and are essential for stabilizing the native structure. Consider this interplay between function, stability and folding kinetics in more detail.

In the immunoglobulin domain functional and structural load is clearly separated: loops are responsible for binding and recognition while interactions between several residues of the buried core provide stability and fast folding. Importantly, stability and kinetics seem to use the same set of interactions. As we noted above, the high CoC positions are located at strands B, C and F; those amino acid residues form the folding nucleus. NMR and HD labeling experiments, in addition, indicate that these strands have the highest protection index in different proteins of the immunoglobulin fold (Parker et al., 1998; Meekhof & Freund, 1999). Therefore, for proteins having an immunoglobulin fold one can expect a noticeable correlation between folding rate and stability.

In the major cold shock protein, different structural elements are responsible for stability and function on the one hand and for folding rate on the other. Proteins from the family of cold shock proteins have various stabilities (ΔG ranges from 2.7 to 6.3 kcal/mol), exhibiting, however, in all cases very fast folding ($k_f^{\mathrm{H_2O}} \approx 1000\ \mathrm{s}^{-1}$) (Perl et al., 1998; Reid et al., 1998; Schindler et al., 1999). This indicates that thermodynamic stability and folding kinetics for this fold are provided by different interactions (and different residues). Stability and binding (most of these proteins bind DNA or RNA) are, in turn, provided by the same set of solvent-exposed aromatic residues (Newkirk et al., 1994; Schindelin et al., 1994). This link between stability and binding has important biological implications. Marginally stable cold shock proteins get stabilized when bound to the DNA. The excess of these proteins that are not bound to the DNA is rapidly eliminated by proteolysis (Schindler et al., 1999). Clearly this kind of regulation favors selection of fast folding and marginally stable proteins. (Slow folding proteins will be eliminated by proteolysis before they bind the DNA; too stable proteins will not be removed by proteolysis and hence destroy the regulation.) It is possible that such evolutionary pressure may have been applied to other regulatory proteins. Very interestingly, the observed significant CoC in the putative folding nucleus positions of the OB fold provides direct evidence, for this fold, of special evolutionary pressure towards fast folding; the biological reason for such pressure is explained above.

This is in contrast to Rossman fold proteins, where the super-site and nucleus are close to each other. The reason for this is that most proteins having this fold are enzymes. Chemical catalysis performed by enzymes requires precise (up to a fraction of an Å) localization and spatial coordination of electrophilic and nucleophilic groups. Therefore, the most rigid part of a structure may be most suitable as a location of an enzymatic active site. The greatest conformational rigidity is in the nucleus: this is the part of the structure that forms first in the assembly of the native conformation and is least distorted by local unfolding fluctuations (Abkevich et al., 1994). This is due to the fact that nucleus contacts are formed at the folding transition state barrier. Therefore all local unfolding fluctuations that do not reach the top of the folding-unfolding barrier preserve the nucleus intact, making the nucleus the part of the structure most protected from thermal fluctuations.

To summarize this part of the discussion, we note that the relation between nucleus, stabilization and function depends very much on the dominant function that in several cases can be associated with a fold. Proteins having the immunoglobulin fold participate in specific macromolecular recognition as receptors, cell adhesion proteins, immunoglobulins, DNA-binding domains etc. A multitude of relatively weak non-bonded interactions results in strong and specific interactions between proteins. It is impossible to simultaneously orient and constrain a large number of interacting groups. Hence an induced fit principle may be operational in protein-protein recognition in immunoglobulin fold proteins. Their recognition sites should be located in flexible parts of the molecule (loops) far from the nucleus. Further, since different analogs have different specificity in protein-protein recognition, functional sites in such proteins may vary from protein to protein; hence proteins with such a range of functions (having the immunoglobulin or OB fold) may not have clearly detectable super-sites. In contrast, protein folds that are used primarily by enzymes (Rossman fold and TIM barrel) participate functionally in a very small number of strong chemical, covalent interactions, and the required spatial precision of the active site is much higher for such functions. In this case it becomes advantageous to place active sites near the folding nucleus; this ensures sufficient stability of the active site against thermal fluctuations.

Comparison with protein engineering analysis

The experimental approach to determining the folding nucleus was pioneered by Fersht and co-workers (Itzhaki et al., 1995). It is based on the idea of using site-specific mutagenesis to determine which positions are most important for folding kinetics. The results are usually expressed in terms of $\phi = \Delta\Delta G_{kin}/\Delta\Delta G_{eq}$, where $\Delta\Delta G_{kin}$ is the change of activation free energy upon mutation (derived from the transition-state assumption) and $\Delta\Delta G_{eq}$ is the change in stability upon mutation. The parameter $\phi$ reflects the degree of participation of a mutated residue in the transition state: $\phi = 1$ when a residue participates in the transition state and $\phi = 0$ otherwise.

The protein engineering analysis of the transition state for folding has been carried out for CD2 (immunoglobulin fold) (Lorch et al., 1999), CheY (Rossman fold) (Lopez-Hernandez & Serrano, 1996), and ADA2H (alpha/beta plait) (Villegas et al., 1998). In all cases the results are consistent with the CoC analysis, indicating that the residues that exhibit the strongest CoC belong to the folding nucleus as judged by high $\phi$-values. In some cases (specifically ADA2H) a slight discrepancy is observed: the highest-$\phi$ position is shifted by one $\beta$-register (two residues) from the highest CoC position (for the third $\beta$-strand). This discrepancy may be due to possible uncertainties in structural alignment with respect to register shifts. Another, perhaps more important, reason for such small discrepancies is that the nucleus always represents a cluster of residues. In other words, if a pair of residues forms a contact, it often also induces a contact between their neighbors along the chain. This makes the nucleus boundaries somewhat fuzzy, so that in any particular protein family the evolutionary pressure towards fast folding could have been applied to slightly different amino acids from the nucleating cluster (e.g. for "historical" reasons), which may give rise to a slight variation of the exact nucleus location (amino acids with highest $\phi$-values) between analogs. In this sense it may be more meaningful to compare the location of the nucleus positions between analogs (and with the CoC signal) with respect to their location on different elements of secondary structure. To this end the CoC analysis gives very accurate predictions in all cases where experimental information is available.

The $\phi$-value analysis often returns fractional values (Itzhaki et al., 1995; Martinez et al., 1998). This may mean either that contacts in the transition state are weaker than in the native state, or that each contact that an amino acid makes in the transition state is as strong as in the native state but only a fraction of the contacts that the amino acid forms in the native state are actually formed in the transition state. The latter explanation is more plausible in view of recent results of Serrano and co-workers, who showed that $\phi$-values are robust with respect to changes in solvent conditions (e.g. pH; Martinez et al., 1998). In this case, even for nucleus residues that show strong CoC one can expect only fractional $\phi$-values. We expect this to be especially the case for deeply buried nucleus clusters, as in the Ig fold. In this case each high CoC amino acid makes many contacts and not all of them are nucleus ones. An experimental way to address this issue is through multiple mutant cycles that probe specific interactions. A potential difficulty in carrying out this analysis is that some mutations would have a small effect on stability $\Delta\Delta G_{eq}$, resulting in a large uncertainty in $\phi$-values (Gutin et al., 1998).
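The sensitivity of $\phi$ to a small $\Delta\Delta G_{eq}$ can be illustrated with a minimal sketch. It assumes independent measurement errors on the two free energy changes and first-order error propagation for their ratio; the function name `phi_with_error` and all numerical values are hypothetical and are not taken from the experiments cited above.

```python
import math


def phi_with_error(ddg_kin, ddg_eq, err_kin, err_eq):
    """Return (phi, standard error of phi) for phi = ddG_kin / ddG_eq.

    Assumption: the two errors are independent, so the first-order
    propagation formula for a ratio applies.
    """
    if ddg_eq == 0:
        raise ValueError("phi is undefined when the mutation does not change stability")
    phi = ddg_kin / ddg_eq
    err = math.sqrt((err_kin / ddg_eq) ** 2 + (ddg_kin * err_eq / ddg_eq ** 2) ** 2)
    return phi, err


# Two hypothetical mutants with the same phi and the same measurement
# error (0.1 kcal/mol), differing only in how much they destabilize.
phi_large, err_large = phi_with_error(1.0, 2.0, 0.1, 0.1)
phi_small, err_small = phi_with_error(0.15, 0.3, 0.1, 0.1)
print(phi_large, err_large)
print(phi_small, err_small)
```

In this toy case both mutants have the same $\phi$, but the weakly destabilizing one carries a several-fold larger uncertainty, which is the difficulty described above.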

Evaluating the statistical errors

Finally we would like to comment on some mathematical details of our analysis. The decision as to which residues are considered to have high CoC depends on the choice of a cutoff probability $P_c$, so that when $P(l) < P_c$, position $l$ is deemed to have statistically significant CoC. Clearly, some freedom in the choice of $P_c$ can potentially introduce some arbitrariness in the identification of high CoC positions. A possible way to quantitatively estimate the errors originating from the choice of $P_c$ is to evaluate the probability of "false positives". We consider as false positives the positions that do not have any special nucleation-related or other significant CoC (i.e. whose level of conservatism can be entirely explained by their solvent accessibility) but apparently show some "signal" $P(l) < P_c$ due to a statistical fluctuation. Since $P_c \ll 1$, the probability of false positives follows the rare event statistics described by the Poisson distribution (Feller, 1970):

$$p_n(x) = \frac{e^{-x} x^n}{n!}$$

where $p_n$ is the probability that $n$ false positives are reported, and $x = P_c L$, where $L$ is the length of a sequence. Specifically, the probability that no false positives are reported is $p_0 = e^{-x}$. On the other hand, $P_c$ cannot be chosen too low, since in that case no position will be identified as having significant CoC. The choice of $P_c$ is outlined by shaded lines in Figures 4, 6, 8, 10 and 12. It can be seen that the probability of a false positive, $1 - p_0$, is very low for immunoglobulin, Rossman and TIM barrel folds and is substantial (0.50-0.7) for OB fold and alpha/beta plaits. In the latter two cases, two or three positions identified as having significant CoC in OB fold proteins and alpha/beta plaits are likely to be false positives, i.e. they may not belong to the folding nucleus or be functionally relevant.
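The false-positive estimate can be reproduced with a few lines of code. The sketch below evaluates the Poisson probabilities for a given cutoff and sequence length; the function name and the example values ($P_c = 10^{-2}$, $L = 100$) are illustrative assumptions, not the parameters used for any particular fold in this work.

```python
import math


def false_positive_probabilities(p_c, length, n_max=3):
    """Poisson probabilities p_n of reporting n false positives.

    x = p_c * length is the expected number of false positives.
    Returns (x, [p_0, ..., p_n_max], probability of at least one).
    """
    x = p_c * length
    probs = [math.exp(-x) * x ** n / math.factorial(n) for n in range(n_max + 1)]
    return x, probs, 1.0 - probs[0]


# Hypothetical example: cutoff 1e-2 applied to a 100-residue alignment.
x, probs, at_least_one = false_positive_probabilities(1e-2, 100)
print(x, probs, at_least_one)
```

With these assumed values $x = 1$, so at least one false positive is expected more often than not; lowering $P_c$ by orders of magnitude, as is possible for folds with more data, makes $1 - p_0$ correspondingly small.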

Conclusion

Here, we provided a detailed statistical analysis of the molecular evolution of the most common protein folds. Our results clearly point out that physical factors related to protein folding, such as stability and folding rate, have undergone considerable evolutionary optimization. In particular, we presented direct evidence for evolutionary pressure towards fast (but not necessarily the fastest) folding for several proteins.

One of the most striking discoveries to emerge from the growing data on protein folding kinetics is that proteins that have similar structures and comparable stabilities may fold via a two-state mechanism with rates that differ by as much as four orders of magnitude (Jackson, 1998). The only model of the transition state that is consistent with this observation is that of a specific nucleus (Abkevich et al., 1994; Itzhaki et al., 1995; Shakhnovich, 1997; Pande et al., 1998; Martinez et al., 1998), which implies that there exist particular nucleus positions in the structure that serve as "accelerator pedals" for folding. Stronger or weaker evolutionary pressure on those "accelerator pedals" (i.e. variation in the nucleus stability) in different proteins gives rise to substantial variation in folding rates. While a first-glance comparison of the sequences of slow and fast folders does not reveal any striking differences between them, a deeper analysis that compares interactions between amino acid residues at nucleus positions in different analogs provides a possible physical and evolutionary rationale for the surprisingly broad range over which the folding rate of analogs may vary. This also suggests an exciting experimental way to control folding rate by "transplanting" the nucleus of a fast folding protein into its slow folding analogs. Alpha/beta plait and OB fold proteins seem to be the best candidates for such protein surgery.

Finally, we identified two cases where conserved properties of a fold are linked to functionally important locations, the super-sites (Rossman fold and TIM barrel). For proteins having these folds, prediction of function from structure (and ultimately from sequence) may be a feasible goal.

Methods

Control for solvent accessibility

If $f(s|a)$ is the probability density function (pdf) of entropy $s$ given accessibility $a$ (normalized so that $\int_0^{\log(6)} f(s|a)\,ds = 1$ for all $a$), we can compute the pdf of $S(l)$ based on $H_0$. Assuming families are independent (see the note below), we can apply the central limit theorem (CLT) to compute the pdf of $S(l)$, since $S(l)$ is a sum of a large number of independent random variables $s_m(l)$. Hence, according to the CLT, $S(l)$ has a Gaussian distribution with the mean and the variance:

$$\bar{S}(l) = \frac{1}{M}\sum_{m=1}^{M} \bar{s}(a_m(l)); \qquad \sigma_S^2(l) = \frac{1}{M^2}\sum_{m=1}^{M} \sigma_s^2(a_m(l))$$

where $\bar{s}(a) = \int f(s|a)\,s\,ds$ and $\sigma_s^2(a) = \overline{s^2} - \bar{s}^2$ are the mean and the variance of the in-family entropy as a function of accessibility.

The probability of observing $S < S_{obs}(l)$ by chance is:

$$P(S < S_{obs}(l)) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{(S_{obs}(l) - \bar{S}(l))/\sigma_S(l)} \exp\left(-\frac{x^2}{2}\right) dx$$

Positions which exhibit $P(S < S_{obs}(l)) < P_c$ are said to have significant CoC. The threshold value $P_c$ depends on the amount of data available for a given fold. In our analysis it varies from $10^{-2}$ (OB fold) to $10^{-5}$ (Rossman fold).
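The Gaussian test described in this section can be sketched as follows. This is an illustration under stated assumptions rather than the code used in this work: the functions `toy_mean` and `toy_var`, which stand in for the accessibility-dependent mean and variance of the in-family entropy, and all numerical values are hypothetical. The lower-tail probability is evaluated as the standard normal cumulative distribution, including its $1/\sqrt{2\pi}$ normalization.

```python
import math
from typing import Callable, Sequence


def coc_probability(
    s_obs: float,
    accessibilities: Sequence[float],
    mean_entropy: Callable[[float], float],
    var_entropy: Callable[[float], float],
) -> float:
    """Lower-tail probability P(S < s_obs) under H0, using the CLT.

    accessibilities holds a_m(l), one value per family m at position l.
    """
    m = len(accessibilities)
    if m == 0:
        raise ValueError("at least one family is required")
    mean_s = sum(mean_entropy(a) for a in accessibilities) / m
    var_s = sum(var_entropy(a) for a in accessibilities) / m ** 2
    z = (s_obs - mean_s) / math.sqrt(var_s)
    # Standard normal cumulative distribution written with erf.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


# Hypothetical accessibility dependence, for illustration only.
def toy_mean(a: float) -> float:
    return 0.2 + 0.01 * a


def toy_var(a: float) -> float:
    return 0.04


accessibilities = [0.0, 10.0, 20.0, 30.0]
p = coc_probability(0.05, accessibilities, toy_mean, toy_var)
is_significant = p < 1e-2  # the cutoff P_c is chosen per fold
print(p, is_significant)
```

In this toy example the observed across-family entropy lies three standard deviations below the value expected from accessibility alone, so the position would pass a cutoff of $10^{-2}$ but not one of $10^{-5}$.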

Probability $P(S)$ can also be computed using a convolution. The CLT, however, makes computations easier and faster. Importantly, we assumed that families are independent, i.e. conservatism in one family does not change the probability of observing the same position conserved in another family. This assumption is motivated by the choice of families that are sufficiently distant in sequence (ID < 25%).

Solvent accessibility was taken from HSSP files (Dodge et al., 1998), where it is computed as the solvated residue surface area in Å² (number of contacting water molecules × 10). To compute $P(s|a)$ we binned accessibility into intervals of 1 Å². We then smoothed $P(s|a)$ with a window of width 31 for all intervals where fewer than town counts were observed in the PDB dataset.
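One possible reading of this binning and smoothing step is sketched below. It is an interpretation, not the original procedure: sparse bins are handled here by pooling the entropy samples of all bins inside the window, and only the mean and variance are returned. The function name, the `(accessibility, entropy)` input format and the `min_counts` parameter are hypothetical; the value passed in the example is a placeholder, since the count threshold is not legible in the text above.

```python
from collections import defaultdict
from statistics import mean, pvariance


def entropy_moments_by_accessibility(samples, min_counts, window=31):
    """Mean and variance of entropy in 1 A^2 accessibility bins.

    samples: non-empty iterable of (accessibility, entropy) pairs.
    Bins holding fewer than min_counts samples are replaced by the
    pooled samples of the `window` bins centred on them.
    """
    bins = defaultdict(list)
    for accessibility, entropy in samples:
        bins[int(accessibility)].append(entropy)  # 1 A^2 intervals

    half = window // 2
    moments = {}
    for b in range(min(bins), max(bins) + 1):
        values = bins.get(b, [])
        if len(values) < min_counts:
            values = [s for k in range(b - half, b + half + 1)
                      for s in bins.get(k, [])]
        if values:
            moments[b] = (mean(values), pvariance(values))
    return moments


# Hypothetical data; min_counts=2 is a placeholder threshold.
samples = [(0.2, 0.1), (0.7, 0.3), (5.0, 0.5)]
moments = entropy_moments_by_accessibility(samples, min_counts=2)
print(moments[0], moments[5])
```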

Selection of representative proteins

All our results were obtained using sequence alignments from HSSP (Dodge et al., 1998) and structural alignments (see below) from FSSP (Holm & Sander, 1993). The representative set of proteins from the FSSP Sep98 release was used as a basic set. From this set we removed all proteins that have sequence identity ID > 25% with any other protein in the set. This allowed us to eliminate some obvious homologs. Next we excluded from our analysis all the families where all positions are conserved, i.e., where $\frac{1}{L}\sum_{l=1}^{L} s_m(l) > 0.4$.

In this study we used sequence entropy $s(l)$ as a measure of evolutionary conservation in a protein family. However, multiple sequence alignments used to derive the sequence entropy can be biased in various ways and may not represent the divergent evolution of the family. For example, closely homologous sequences can be over-represented in a multiple sequence alignment and dominate over a few distant homologs with high variability. This effect was partially accounted for in our calculations by exclusion of families where all amino acid residues are conserved, as such families clearly represent insufficiently divergent sequences. By weighting sequences in a multiple alignment one can compensate for this bias (for a review, see Henikoff & Henikoff, 1994). Lack of weighting, however, does not affect our results. The control by solvent accessibility shows that a simple sequence entropy can be predicted from solvent accessibility with great accuracy when both are averaged over several families of the same fold. Hence, biases from individual families are "averaged out" over the very large number of families that we used. We plan to introduce sequence weighting in the future, which can make our method more sensitive and less data demanding.
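For concreteness, the sequence entropy of a single alignment column, with optional sequence weights, can be sketched as follows. The grouping of the 20 amino acids into six classes is an assumption made for this illustration (it is chosen only to be consistent with the maximal entropy of $\log(6)$ used in the normalization above), and the weights in the example are hypothetical rather than the output of any particular weighting scheme.

```python
import math
from collections import Counter

# Assumed six-class grouping of amino acids (illustrative).
CLASSES = {
    "aliphatic": "AVLIMC",
    "aromatic": "FWYH",
    "polar": "STNQ",
    "basic": "KR",
    "acidic": "DE",
    "special": "GP",
}
RESIDUE_TO_CLASS = {aa: name for name, members in CLASSES.items() for aa in members}


def column_entropy(column, weights=None):
    """Entropy -sum_i p_i log p_i over residue classes in one column.

    column: residues (one-letter codes) at one alignment position;
    gaps and unknown symbols are skipped. weights: optional per-sequence
    weights; unweighted counting is used when omitted.
    """
    if weights is None:
        weights = [1.0] * len(column)
    totals = Counter()
    for aa, w in zip(column, weights):
        cls = RESIDUE_TO_CLASS.get(aa)
        if cls is not None:
            totals[cls] += w
    norm = sum(totals.values())
    if norm == 0:
        raise ValueError("column contains no classified residues")
    return -sum((w / norm) * math.log(w / norm) for w in totals.values() if w > 0)


unweighted = column_entropy("AAAK")
# Hypothetical weights that down-weight three near-identical sequences.
weighted = column_entropy("AAAK", [1.0, 1.0, 1.0, 3.0])
print(unweighted, weighted)
```

Down-weighting the redundant sequences raises the entropy of the column in this example, which is the kind of correction that sequence weighting is meant to provide.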

Structural alignments

Some structural alignments in the FSSP were corrected using a Monte Carlo alignment algorithm (Mirny & Shakhnovich, 1998). In brief, each of the two aligned structures is represented by a distance matrix. We use the same similarity measure as Holm & Sander (1993) in their Dali program. A structural alignment is obtained by optimization of this function. In contrast to Holm and Sander, we optimized this function using Monte Carlo alignment, which gives systematically higher scores than the optimization protocol implemented in Dali (Mirny & Shakhnovich, 1998). The final alignments, however, are not very different from those obtained by Dali. This refinement of structural alignments was applied primarily to the alpha/beta plait, where the low degree of structural similarity made structural alignment a complicated problem. However, one should bear in mind that ambiguities in structural alignments are inevitable, since the results depend mostly on the choice of the similarity score. Two structures were considered similar if they have FSSP Z-score $Z_{FSSP} > 2.5$ and $DRMS_{C\beta} < 6$ Å.
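The similarity criterion can be illustrated with a short sketch. It assumes the common definition of the distance root-mean-square deviation, i.e. the RMS difference between corresponding inter-residue distances of the two aligned structures; whether this matches the exact DRMS definition used here is an assumption, and the function names and coordinates are hypothetical.

```python
import math
from itertools import combinations


def drms(coords_a, coords_b):
    """Distance RMS between two aligned sets of C-beta coordinates.

    coords_a[i] and coords_b[i] are the (x, y, z) positions of aligned
    residues; all residue pairs i < j contribute equally.
    """
    if len(coords_a) != len(coords_b) or len(coords_a) < 2:
        raise ValueError("need two equally long sets of at least two points")
    squared = [
        (math.dist(coords_a[i], coords_a[j]) - math.dist(coords_b[i], coords_b[j])) ** 2
        for i, j in combinations(range(len(coords_a)), 2)
    ]
    return math.sqrt(sum(squared) / len(squared))


def are_similar(z_score, drms_value):
    """Thresholds quoted in the text: Z > 2.5 and DRMS < 6 Angstrom."""
    return z_score > 2.5 and drms_value < 6.0


# Hypothetical three-residue fragments; the second is the first scaled by 2.
a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
b = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0), (0.0, 2.0, 0.0)]
print(drms(a, b), are_similar(3.1, drms(a, b)))
```

Because the measure compares internal distances only, it requires no superposition of the two structures.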

Treating gaps in alignments

Positions with gaps in structural alignments were neglected in computations of $S(l)$, $S_{across}(l)$ and $P(S)$. Therefore, summation in equation (1) is over those families $m$ which do not have a gap in the structural alignment at position $l$. Taking this into account, equation (1) turns into:

$$S(l) = \frac{\sum_{m=1}^{M} s_m(l)\,\delta_m^l}{\sum_{m=1}^{M} \delta_m^l}$$

where $\delta_m^l = 0$ if position $l$ in family $m$ has a gap in the structural alignment, and $\delta_m^l = 1$ otherwise.

This treatment of gaps, however, leads to a problem when all except a few families have gaps at position $l$. Then $S(l)$ and $P(l)$ fluctuate and are unreliable. To avoid this problem we deleted from our analysis fragments of a protein fold where more than 50% of structural alignments have gaps.
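The gap handling can be summarized in a few lines. In this sketch a gap is represented by `None`, and a position is dropped when more than half of the families have a gap there; the function name and the per-position application of the 50% rule are assumptions made for illustration.

```python
def gap_aware_mean_entropy(entropies, max_gap_fraction=0.5):
    """Average s_m(l) over families without a gap at position l.

    entropies: one entry per family; None marks a gap in the structural
    alignment. Returns None when the position is too gapped to be used.
    """
    if not entropies:
        raise ValueError("at least one family is required")
    present = [s for s in entropies if s is not None]
    gap_fraction = 1.0 - len(present) / len(entropies)
    if gap_fraction > max_gap_fraction:
        return None
    return sum(present) / len(present)


# Hypothetical position: one of three families has a gap.
print(gap_aware_mean_entropy([0.2, None, 0.4]))
# Hypothetical position: two of three families have gaps, so it is dropped.
print(gap_aware_mean_entropy([0.2, None, None]))
```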

Acknowledgments

This work is supported by NIH grant RO1 GM52126. We are grateful to Fabrizio Chiti, Jane Clarke, Chris Dobson, Alan Fersht, Stephen Hamil, Mikael Oliveberg and Luis Serrano for illuminating discussions of experimental results and for making many of them available to us prior to publication.

After this paper had been completed we heard the sad news that Oleg Ptitsyn had passed away. Oleg had been very excited about the emerging understanding of the deep relation between protein folding and evolution, which is the main topic of the present paper. In fact, Oleg's last paper is entirely devoted to this subject (Ptitsyn, 1998). E.I.S. benefited enormously from his insights over many years of our close collaboration and friendship.

References

Abkevich, V., Gutin, A. & Shakhnovich, E. (1994). Specific nucleus as the transition state for protein folding: evidence from the lattice model. Biochemistry, 33, 10026-10036.

Altschuh, D., Vernet, T., Berti, P., Moras, D. & Nagai, K. (1988). Coordinated amino acid changes in homologous protein families. Protein Eng. 2, 193-199.

Bahar, I. & Jernigan, R. (1997). Inter-residue potentials in globular proteins and the dominance of highly specific hydrophilic interactions at close separation. J. Mol. Biol. 266, 195-214.

Bellsolell, L., Cronet, P., Majolero, M., Serrano, L. & Coll, M. (1996). The three-dimensional structure of two mutants of the signal transduction protein CheY suggest its molecular activation mechanism. J. Mol. Biol. 257, 116-128.

Bork, P., Holm, L. & Sander, C. (1994). The immunoglobulin fold. Structural classification, sequence patterns and common core. J. Mol. Biol. 242, 309-320.

Branden, C. & Tooze, J. (1998). Introduction to Protein Structure, Garland Publishing, Inc., New York.

Bryngelson, J. & Wolynes, P. (1990). A simple statistical field theory of heteropolymer collapse with application to protein folding. Biopolymers, 30, 177-188.

Dodge, C., Schneider, R. & Sander, C. (1998). The HSSP database of protein structure-sequence alignments and family profiles. Nucl. Acids Res. 26, 313-315.

Feller, W. (1970). An Introduction to Probability Theory and its Applications, Wiley, New York.

Gilis, D. & Rooman, M. (1997). Predicting protein stability changes upon mutation using database-derived potentials: solvent accessibility determines the importance of local versus non-local interactions along the sequence. J. Mol. Biol. 272, 276-290.

Goldstein, R., Luthey-Schulten, Z. & Wolynes, P. (1992). Optimal protein-folding codes from spin-glass theory. Proc. Natl Acad. Sci. USA, 89, 4918-4922.

Govindarajan, S. & Goldstein, R. (1995). Why are some protein structures so common? Proc. Natl Acad. Sci. USA, 93, 3341-3345.

Guo, Z. & Thirumalai, D. (1995). Nucleation mechanism for protein folding and theoretical predictions for hydrogen-exchange labelling experiments. Biopolymers, 35, 137-139.

Gutin, A., Abkevich, V. & Shakhnovich, E. (1998). A protein engineering analysis of the transition state for protein folding: simulation in the lattice model. Fold. Design, 3, 183-194.

Hamill, S., Meekhof, A. & Clarke, J. (1998). The effect of boundary selection on the stability and folding of the third fibronectin type III domain from human tenascin. Biochemistry, 37, 8071-8079.

Hao, M.-H. & Scheraga, H. (1994a). Monte-Carlo simulation of a first order transition for protein folding. J. Phys. Chem. 98, 4940-4945.

Hao, M.-H. & Scheraga, H. (1994b). Statistical thermodynamics of protein folding: sequence dependence. J. Phys. Chem. 98, 9882-9886.

Henikoff, S. & Henikoff, J. (1994). Position-based sequence weights. J. Mol. Biol. 243, 574-578.

Holm, L. & Sander, C. (1993). Protein structure comparison by alignment of distance matrices. J. Mol. Biol. 233, 123-138.

Itzhaki, L., Otzen, D. & Fersht, A. (1995). The structure of the transition state for folding of chymotrypsin inhibitor 2 analyzed by protein engineering methods: evidence for a nucleation-condensation mechanism for protein folding. J. Mol. Biol. 254, 260-288.

Jackson, S. (1998). How do small single-domain proteins fold? Fold. Design, 3, R81-R91.

Koshi, J. & Goldstein, R. (1997). Mutation matrices and physical-chemical properties: correlations and implications. Proteins: Struct. Funct. Genet. 27, 336-344.

Kraulis, P. (1991). Molscript: a program to produce both detailed and schematic plots of protein structures. J. Appl. Crystallog. 24, 946-950.

Liaw, S., Jun, G. & Eisenberg, D. (1994). Interactions of nucleotides with fully unadenylylated glutamine synthetase from Salmonella typhimurium. Biochemistry, 33, 11184-11188.

Lopez-Hernandez, E. & Serrano, L. (1996). Structure of the transition state for folding of the 129 aa protein CheY resembles that of a smaller protein, CI2. Fold. Design, 1, 43-55.

Lorch, M., Mason, J., Clarke, A. & Parker, M. (1999). Effects of core mutations on the folding of a beta-sheet protein: implications for backbone organization in the I-state. Biochemistry, 38, 1377-1385.

Martinez, J., Pisabarro, T. & Serrano, L. (1998). Obligatory steps in protein folding and the conformational diversity of the transition state. Nature Struct. Biol. 5, 721-729.

Meekhof, A. & Freund, S. (1999). Probing residual structure and backbone dynamics on the milli- to picosecond timescale in a urea-denatured fibronectin type III domain. J. Mol. Biol. 286, 579-592.

Merritt, E. & Bacon, D. (1997). Raster3D: photorealistic molecular graphics. Methods Enzymol. 277, 505-524.

Mirny, L. & Shakhnovich, E. (1998). Protein structure prediction by threading. Why it works and why it does not. J. Mol. Biol. 283, 507-526.

Mirny, L., Abkevich, V. & Shakhnovich, E. (1998). How evolution makes proteins fold quickly. Proc. Natl Acad. Sci. USA, 95, 4976-4981.

Newkirk, K., Feng, W., Jiang, W., Tejero, R., Emerson, S., Inouye, M. & Montelione, G. (1994). Solution NMR structure of the major cold shock protein (CspA) from Escherichia coli: identification of a binding epitope for DNA. Proc. Natl Acad. Sci. USA, 91, 5114-5118.

Pande, V., Grosberg, A. & Tanaka, T. (1995). How accurate must potentials be for successful modeling of protein folding? J. Chem. Phys. 103, 1-10.

Pande, V., Grosberg, A., Rokhsar, D. & Tanaka, T. (1998). Pathways for protein folding: is a "new view" needed? Curr. Opin. Struct. Biol. 8, 68-79.

Parker, M., Dempsey, C., Hosszu, L., Waltho, J. & Clarke, A. (1998). Topology, sequence evolution and folding dynamics of an immunoglobulin domain. Nature Struct. Biol. 5, 194-198.

Pelletier, H., Sawaya, M., Wolfle, W., Wilson, S. & Kraut, J. (1996). Crystal structures of human DNA polymerase beta complexed with DNA: implications for catalytic mechanism, processivity, and fidelity. Biochemistry, 35, 12742-12761.

Perl, D., Welker, C., Schindler, T., Schroder, K., Marahiel, M., Jaenicke, R. & Schmid, F. (1998). Conservation of rapid two-state folding in mesophilic, thermophilic and hyperthermophilic cold shock proteins. Nature Struct. Biol. 5, 229-235.

Plaxco, K., Spitzfaden, C., Campbell, I. & Dobson, C. (1996). Rapid refolding of a proline-rich all β-sheet fibronectin type III module. Proc. Natl Acad. Sci. USA, 93, 10703-10706.

Ptitsyn, O. (1998). Protein folding and protein evolution: common folding nucleus in different subfamilies of c-type cytochromes? J. Mol. Biol. 278, 655-666.

Reid, K., Rodriguez, H., Hillier, B. & Gregoret, L. (1998). Stability and folding properties of a model beta-sheet protein, Escherichia coli CspA [published erratum appears in Protein Sci 1998]. Protein Sci. 7, 470-479.

Russell, R. & Ponting, C. (1998). Protein fold irregularities that hinder sequence analysis. Curr. Opin. Struct. Biol. 8, 364-371.

Russell, R., Sasieni, P. & Sternberg, M. (1998a). Supersites within superfolds. Binding site similarity in the absence of homology. J. Mol. Biol. 282, 903-918.

Russell, R., Sasieni, P. & Sternberg, M. (1998b). Supersites within superfolds. Binding site similarity in the absence of homology. J. Mol. Biol. 282, 903-918.

Sali, A., Shakhnovich, E. & Karplus, M. (1994). Kinetics of protein folding. A lattice model study for the requirements for folding to the native state. J. Mol. Biol. 235, 1614-1636.

Schindelin, H., Jiang, W., Inouye, M. & Heinemann, U. (1994). Crystal structure of CspA, the major cold shock protein of Escherichia coli. Proc. Natl Acad. Sci. USA, 91, 5119-5123.

Schindler, T., Perl, D., Graumann, P., Sieber, V., Marahiel, M. & Schmid, F. (1998). Surface-exposed phenylalanines in the RNP1/RNP2 motif stabilize the cold-shock protein CspB from Bacillus subtilis. Proteins: Struct. Funct. Genet. 30, 401-406.

Schindler, T., Graumann, P., Perl, D., Ma, S., Schmid, F. & Marahiel, M. (1999). The family of cold shock proteins of Bacillus subtilis. Stability and dynamics in vitro and in vivo. J. Biol. Chem. 274, 3407-3413.

Schuller, D., Grant, G. & Banaszak, L. (1995). The allosteric ligand site in the Vmax-type cooperative enzyme phosphoglycerate dehydrogenase. Nature Struct. Biol. 2, 69-76.

Shakhnovich, E. (1994). Proteins with selected sequences fold to their unique native conformation. Phys. Rev. Letters, 72, 3907-3910.

Shakhnovich, E. (1997). Theoretical studies of protein-folding thermodynamics and kinetics. Curr. Opin. Struct. Biol. 7, 29-40.

Shakhnovich, E. (1998a). Folding nucleus: specific or multiple? Insights from simulations and comparison with experiment. Fold. Design, 3, R108-R111.

Shakhnovich, E. (1998b). Protein design: a perspective from simple tractable models. Fold. Design, 3, R45-R58.

Shakhnovich, E. & Gutin, A. (1993a). Engineering of stable and fast-folding sequences of model proteins. Proc. Natl Acad. Sci. USA, 90, 7195-7199.

Shakhnovich, E. & Gutin, A. (1993b). A novel approach to design of stable proteins. Protein Eng. 6, 793-800.

Shakhnovich, E., Abkevich, V. & Ptitsyn, O. (1996). Conserved residues and the mechanism of protein folding. Nature, 379, 96-98.

Thomas, D., Casari, G. & Sander, C. (1996). The prediction of protein contacts from multiple sequence alignments. Protein Eng. 9, 941-948.

Tiana, G., Broglia, R., Roman, H., Vigezzi, E. & Shakhnovich, E. (1998). Folding and misfolding of designed protein like chains with mutations. J. Chem. Phys. 108, 757-761.

van Nuland, H., Chiti, F., Taddei, N., Raugei, G., Ramponi, G. & Dobson, C. (1998a). Slow folding of muscle acylphosphatase in the absence of intermediates. J. Mol. Biol. 283, 883-891.

van Nuland, H., Meijberg, W., Warner, J., Forge, V., Scheek, R., Robillard, G. & Dobson, C. (1998b). Slow cooperative folding of a small globular protein HPr. Biochemistry, 37, 622-637.

Viguera, A., Serrano, L. & Wilmanns, M. (1997). Different folding transition states may result in the same native structure. Nature Struct. Biol. 4, 939-946.

Villegas, V., Martinez, J., Aviles, F. & Serrano, L. (1998). Structure of the transition state in the folding process of human procarboxypeptidase A2 activation domain. J. Mol. Biol. 283, 1027-1036.

Welch, M., Oosawa, K., Aizawa, S. & Eisenbach, M. (1994). Effects of phosphorylation, Mg2+, and conformation of the chemotaxis protein CheY on its binding to the flagellar switch protein FliM. Biochemistry, 33, 10470-10476.

Wilcock, D., Pisabarro, M., Lopez-Hernandez, E., Serrano, L. & Coll, M. (1998). Structure analysis of two CheY mutants: importance of the hydrogen-bond contribution to protein stability. Acta Crystallog. sect. D, 54, 378-385.

Wilson, K., Shewchuk, L., Brennan, R., Otsuka, A. & Matthews, B. (1992). Escherichia coli biotin holoenzyme synthetase/bio repressor crystal structure delineates the biotin- and DNA-binding domains. Proc. Natl Acad. Sci. USA, 89, 9257-9261.

Edited by A. R. Fersht

(Received 26 March 1999; received in revised form 21 May 1999; accepted 27 May 1999)