statistical potentials extracted from protein structures: how accurate are they?



J. Mol. Biol. (1996) 257, 457–469

Statistical Potentials Extracted From Protein Structures: How Accurate Are They?

Paul D. Thomas¹ and Ken A. Dill²*

¹Graduate Group in Biophysics, University of California, San Francisco, CA 94143-0448, USA
²Department of Pharmaceutical Chemistry, University of California, San Francisco, CA 94143-0446, USA
*Corresponding author

“Statistical potentials” are energies widely used in computer algorithms to fold, dock, or recognize protein structures. They are derived from: (1) observed pairing frequencies of the 20 amino acids in databases of known protein structures, and (2) approximations and assumptions about the physical process that these quantities measure. Using exact lattice models, we construct a rigorous test of those assumptions and approximations. We find that statistical potentials often correctly rank-order the relative strengths of interresidue interactions, but they do not reflect the true underlying energies because of systematic errors arising from the neglect of excluded volume in proteins. We find that complex residue–residue distance dependences observed in statistical potentials, even those among charged groups, can be largely explained as an indirect consequence of the burial of non-polar groups. Our results suggest that current statistical potentials may have limited value in protein folding algorithms and wherever they are used to provide energy-like quantities.

© 1996 Academic Press Limited

Keywords: protein folding; knowledge-based potential; Boltzmann ensemble; residue partitioning; protein structure recognition

Introduction

Our purpose here is to evaluate “statistical potentials” and the premises that underlie them. Statistical potentials are widely used as empirical energy functions to judge the quality of proposed protein structure models (Lüthy et al., 1992; Wilmanns & Eisenberg, 1993), to identify the native fold or correct folding motif of an amino acid sequence among many incorrect alternatives (Hendlich et al., 1990; Jones et al., 1992; Bryant & Lawrence, 1993), to identify possible folds for a sequence of unknown structure (Bowie et al., 1991; Sippl & Weitckus, 1992), to predict docking of protein structures (Pellegrini & Doniach, 1993), to find amino acid sequences compatible with a desired structure (Godzik & Skolnick, 1992), and to simulate protein folding (Wilson & Doniach, 1989; Skolnick & Kolinski, 1990; Sun, 1993; Kolinski & Skolnick, 1994).

Statistical potentials are putative energies that are derived from amino acid pairing frequencies observed in known protein structures. The idea was first proposed by Tanaka & Scheraga (1976). Miyazawa & Jernigan (1985) took a major step forward in including terms to explicitly consider solvent effects. Sippl (1990) and others (Hendlich et al., 1990; Jones et al., 1992) extended these methods to include dependence on pairwise separation of residues in space and along the sequence. Bryant & Lawrence (1993) developed a log-linear statistical model to analyze protein structures separately, rather than using simple sums over distributions of residues in all proteins. More recently, statistical potentials have been refined by adding other statistical terms involving residue triplets (Godzik & Skolnick, 1992), dihedral angles (Nishikawa & Matsuo, 1993; Kocher et al., 1994), solvent accessibility and hydrogen-bonding (Nishikawa & Matsuo, 1993).

Abbreviations used: PDB, Protein Data Bank; 2D, two-dimensional; HP, hydrophobic-polar; PP, polar-polar; HH, hydrophobic-hydrophobic.

0022–2836/96/120457–13 $18.00/0 © 1996 Academic Press Limited

The basic idea behind statistical potentials is simple. We illustrate the idea using an idealized example. Suppose large numbers of the 20 amino acids were somehow to distribute themselves in a gas phase at temperature T. If the interactions are purely pairwise, the distributions can be described by the equilibrium pairwise density ρ_ij(r) between any two amino acid types i, j = 1, 2, . . . , 20 at distance r. In this case, the interaction free energy, w_ij(r), can be calculated from the observed densities by the Boltzmann relation:

$$w_{ij}(r) = -kT \ln\left(\frac{\rho_{ij}(r)}{\rho^{*}}\right) \qquad (1)$$

where k is Boltzmann’s constant and ρ* is the reference state pair density at infinite separation, where the particle interaction is zero. This example shows how to infer pairwise energies from the average spatial distributions of amino acids in this idealized gas phase example.

But protein crystal structures (and NMR structures) are not gas phases of amino acids in dynamic equilibrium. Certain assumptions and approximations are usually made to obtain energy-like quantities from protein structures. First, amino acid pair density functions ρ_ij(r) are constructed by summing the static densities observed in different proteins from the Brookhaven Protein Data Bank (PDB, Bernstein et al., 1977) rather than averaging different states of the same protein. Second, it is necessary to choose a reference state (the pair density corresponding to zero-energy). Miyazawa & Jernigan (1985) introduced the use of the “random-mixing approximation”, which assumes that in the absence of interactions, the amino acids and solvent molecules would be uniformly distributed throughout the available volume. In a random mixture the number of contacts between different monomers depends only on the relative concentrations of those monomer types. For example, because alanine residues are more common in proteins than methionine residues, a random mixture will have more Ala-Ala contacts than Met-Met. Finally, using the Boltzmann equation supposes an equilibrium between the observed pairing state and the reference state. Each amino acid pair is assumed to be independent of all the other pairs in the molecule. For weakly interacting particles in the gas phase, this is a reasonable approximation. However, one of the most remarkable features about proteins is the extremely close packing of residues (Richards, 1977). In addition, amino acids are covalently linked in specific sequences. These are the premises we test below. We do not test the assumption that interactions are pairwise additive, nor do we treat local interactions (e.g. dihedral angle potentials).

There are two main approaches to calculating amino acid pair potentials. In one approach, the interactions between amino acids are assumed to be short-ranged, and are approximated using a “contact potential” (Tanaka & Scheraga, 1976; Miyazawa & Jernigan, 1985). In the Miyazawa & Jernigan (1985) formulation, a contact energy w_ij is an average of amino acid pairings over distances shorter than some cutoff distance r_c:

$$w_{ij} = -kT \ln\left(\frac{\int_{0}^{r_c} \rho_{ij}(r)\,dr}{\int_{0}^{r_c} \rho^{*}_{ij}(r)\,dr}\right) \qquad (2)$$

Miyazawa & Jernigan (1985) recognized that protein folding involves desolvating two monomers i and j before forming a contact between them. To account for this approximately, Miyazawa and Jernigan invented a hypothetical two-step process of contact formation (Figure 1). In the first step, monomers i and j, which are regarded as being solvated in the denatured state, go through a self-pairing (i with i, j with j, solvent with solvent). In the second step, the i-i and j-j pair “bonds” are broken and i-j bonds are made. These steps involve applying equation (2) four times, for breaking two contacts and forming two contacts.

Figure 1. Hypothetical process for extracting contact energies between residues of type i and j from contact distributions in proteins. (a) “Desolvation” of two i-solvent contacts to form i-i and solvent-solvent contacts: extracted contact energy = w_ii + w_00 − 2w_i0, where 0 denotes solvent and w_xy are defined by equation (2) using a random mixture of solvent and residues as the reference state. (b) “Mixing” of i-i and j-j contacts to form two i-j contacts: extracted contact energy = 2w_ij − w_ii − w_jj, where w_xy are defined by equation (2) using a random mixture of residues, weighted according to average degree of burial in protein structures, as the reference state.

In the Miyazawa and Jernigan approach, each of the two steps, desolvation and mixing, is based on a different random mixture reference state. For desolvation, the reference state is a uniform mixture of solvent molecules and amino acid residues. For mixing, which involves moving amino acids within a compact globule, the reference state weights residue positions in terms of their degree of burial.

The second class of statistical pair potentials allows for distance-dependence of the interactions.


For such potentials, equation (1) has been applied to individual small distance intervals. In this case, another normalization is also needed. Sippl (1990) solved the problem of how to calculate the expected “uniform density” reference state for distance-dependent potentials. The reference density of pair distances at each distance interval depends not only on the frequencies of the residue pairs in question, but also on the total number of pair distances observed at that distance. For example, more pairs of amino acids are separated by 10 Å than by 80 Å, because of the small sizes of proteins. Because dividing up frequencies both by pair type and distance interval results in many parameters with few proteins to define them, Sippl (1990) developed a “sparse data correction”, which corrects the energies calculated using equation (1) by an uncertainty factor.
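A distance-binned Boltzmann inversion of this kind can be sketched in Python as follows. This is an illustration under stated assumptions, not Sippl's implementation: the function name `distance_potential`, the input layout (counts per distance bin for each pair type), and the specific form of the sparse-data weight, m·σ/(1 + m·σ) with m the number of observations for a pair type, are choices made here for illustration. The default σ = 1/50 is the value quoted in the caption of Figure 3.

```python
import math

def distance_potential(pair_counts, sigma=1 / 50, kT=1.0):
    """Distance-dependent energies from binned pair-distance counts.

    pair_counts maps a pair type (e.g. "HH") to a list of counts, one per
    distance bin. The reference state is the distance distribution summed
    over all pair types, so it reflects how many pairs of any kind are
    found in each bin.
    """
    n_bins = len(next(iter(pair_counts.values())))
    per_bin = [sum(c[b] for c in pair_counts.values()) for b in range(n_bins)]
    total = sum(per_bin)
    f_ref = [n / total for n in per_bin]

    energies = {}
    for pair, counts in pair_counts.items():
        m = sum(counts)
        # Assumed sparse-data weight: tends to 0 for few observations
        # (energy tends to 0) and to 1 for many (uncorrected value).
        weight = m * sigma / (1 + m * sigma)
        row = []
        for b in range(n_bins):
            if f_ref[b] == 0:
                row.append(None)  # no pairs of any type in this bin
                continue
            g = counts[b] / m if m else 0.0
            f_corrected = (1 - weight) * f_ref[b] + weight * g
            row.append(-kT * math.log(f_corrected / f_ref[b]))
        energies[pair] = row
    return energies
```

With few observations the weight is small and the energy in every bin shrinks toward zero; with many observations the result approaches the uncorrected value of equation (1) applied bin by bin.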

Testing the premises of statistical potentials

Statistical potentials are arguably intended to mimic the natural energies that drive amino acids to form contacts. How well do statistical potentials, “extracted” from native structure databases, reflect the “true” underlying energies? It is not known, because there is no independent knowledge of nature’s true underlying energies. Here we devise a test that circumvents that problem. We generate different “model PDBs” using an exact lattice model for which the underlying energies can be specified exactly. We extract putative energies from observing the monomer pairing frequencies in each model PDB, and compare the extracted energies to the true energies. We define the term “true energy” to mean the actual contact free energy that causes the protein to fold into its given native state, and the term “extracted energy” to mean the energy-like quantity that is obtained from observing the monomer pairing frequencies in the database of native structures and using the assumptions described above. It is not important that the lattice model is not a perfect mimic of real proteins. We are simply performing a consistency check of the methods that generate statistical potentials. We aim to learn how much error is introduced by their neglect of chain connectivity, amino acid sequence and excluded volume.

The “AB-model” consists of chains of two monomer types, A and B, having lengths L = 11 to 18 on two-dimensional (2D) square lattices. We specify a true contact potential, involving three interaction energies (for AA, AB, and BB contacts) which defines the total energy for any conformation of any AB sequence on the 2D lattice. Monomers are in contact if they are non-bonded nearest neighbors on the lattice. Since there are only three energies, we are able to explore all possible sets of unique contact potentials, and all the unique native structures for each given energy function.
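The energy function just described is simple enough to write down directly. The following Python sketch evaluates the total contact energy of one conformation; the function name `contact_energy`, the coordinate-list representation, and the dictionary layout of the three energies are illustrative choices, not taken from the paper.

```python
def contact_energy(sequence, coords, energies):
    """Total contact energy of one chain conformation on the 2D square lattice.

    sequence: string of monomer types, e.g. "ABBA".
    coords:   list of (x, y) lattice sites, one per monomer, in chain order.
    energies: dict keyed by the alphabetically sorted pair, e.g.
              {"AA": -1.0, "AB": 0.0, "BB": 0.0}.
    """
    index = {site: i for i, site in enumerate(coords)}
    if len(index) != len(coords):
        raise ValueError("conformation is not self-avoiding")
    total = 0.0
    for i, (x, y) in enumerate(coords):
        # Looking only right and up visits every lattice edge exactly once.
        for site in ((x + 1, y), (x, y + 1)):
            j = index.get(site)
            # A contact is a nearest-neighbour pair that is not bonded.
            if j is not None and abs(i - j) > 1:
                total += energies["".join(sorted(sequence[i] + sequence[j]))]
    return total

# Four monomers folded around a unit square: the two chain ends touch.
square = [(0, 0), (1, 0), (1, 1), (0, 1)]
print(contact_energy("ABBA", square, {"AA": -1.0, "AB": 0.0, "BB": 0.0}))
```

In the square conformation the only non-bonded neighbors are the first and last monomers, so the example evaluates a single AA contact.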

Our consistency check runs as follows. (1) We select a set of true contact free energies, which we denote with capital letters E_AA, E_AB and E_BB. (2) For each chain length, we perform an exhaustive search of conformational space, and find the native states (lowest true energy) of all 2^L sequences. (3) We make a “database” of the unique native structures. (4) We use two representative statistical potential extraction methods, one contact-based and one distance-dependent, to extract statistical energies from this lattice model database. We denote the extracted energies with lowercase e_AA, e_AB and e_BB. The contact potential is extracted by the method of Miyazawa & Jernigan (1985) and the distance-dependent potential by a simplified version of the method of Sippl (1990) that considers all residues in the short chains to be of the same “topological level” (i.e. there is no dependence on sequence separation). We selected these two methods as representative of the methodology of statistical potentials because they are the most widely referenced in the current literature.
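Steps (1) to (3) can be sketched for short chains as below. This is a minimal brute-force illustration, not the authors' program: the names `conformations`, `contacts`, `native_states` and `contact_counts` are hypothetical, a sequence is taken to have a native state only when exactly one conformation (up to rotation and reflection) attains the lowest energy, and nothing is done to make long chains fast.

```python
from itertools import product

MOVES = [(1, 0), (0, 1), (-1, 0), (0, -1)]

def conformations(length):
    """Self-avoiding walks of `length` monomers (length >= 2), one per
    rotation/reflection class of the 2D square lattice."""
    def grow(path, occupied, turned):
        if len(path) == length:
            yield tuple(path)
            return
        x, y = path[-1]
        for dx, dy in MOVES:
            if not turned and dy < 0:
                continue  # first turn must point up: removes mirror images
            site = (x + dx, y + dy)
            if site in occupied:
                continue
            path.append(site)
            occupied.add(site)
            yield from grow(path, occupied, turned or dy != 0)
            occupied.discard(site)
            path.pop()
    # The first bond is fixed along +x, which removes the four rotations.
    yield from grow([(0, 0), (1, 0)], {(0, 0), (1, 0)}, False)

def contacts(coords):
    """Index pairs (i, j), i < j, of non-bonded nearest neighbours."""
    index = {site: i for i, site in enumerate(coords)}
    found = []
    for i, (x, y) in enumerate(coords):
        for site in ((x + 1, y), (x, y + 1)):
            j = index.get(site)
            if j is not None and abs(i - j) > 1:
                found.append((min(i, j), max(i, j)))
    return found

def native_states(length, energies, alphabet="AB"):
    """Map every sequence with a unique lowest-energy conformation to it."""
    folds = [(c, contacts(c)) for c in conformations(length)]
    database = {}
    for letters in product(alphabet, repeat=length):
        seq = "".join(letters)
        best_energy, best_fold, ties = None, None, 0
        for coords, pairs in folds:
            energy = sum(energies["".join(sorted(seq[i] + seq[j]))]
                         for i, j in pairs)
            if best_energy is None or energy < best_energy:
                best_energy, best_fold, ties = energy, coords, 1
            elif energy == best_energy:  # exact ties; use integer energies
                ties += 1
        if ties == 1:
            database[seq] = best_fold
    return database

def contact_counts(database):
    """Observed number of contacts of each pair type in the model database."""
    counts = {}
    for seq, coords in database.items():
        for i, j in contacts(coords):
            pair = "".join(sorted(seq[i] + seq[j]))
            counts[pair] = counts.get(pair, 0) + 1
    return counts
```

The contact counts returned by `contact_counts` are the raw observations from which an extraction method would then compute its energies in step (4). The cost grows as the number of conformations times 2^L, so the sketch is practical only for the shortest chains.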

Our approach involves no sampling problems since we exhaustively enumerate the complete database of all sequences having a unique fold. The data are not sparse, as they are for real proteins. We have approximately 1200 to 60,000 contacts to define three parameters (400 to 20,000 observations per parameter). For comparison, Miyazawa and Jernigan used about 27,000 contacts to define 210 parameters, or about 130 observations per parameter. For the distance-dependent potentials, we include the sparse-data correction of Sippl (1990), though for most of the lattice model databases we consider there are enough pair distances to approach the uncorrected values. We do not address questions of database size or sampling problems. Rather, our purposes here are (1) to investigate the principles of statistical potentials, and (2) to assess how accurately current statistical potentials might reflect the real amino acid contact energies in proteins.

Results

First we explore the simplest version of the AB model, where the true potential involves only a single energy, namely the HP (hydrophobic-polar) model: E_HH : E_HP : E_PP = −1 : 0 : 0. In this model, contacts between H monomers are favorable relative to solvent contacts, while HP and PP contacts are energetically equivalent to solvent contacts. The folding properties of this model are known in some detail (Lau & Dill, 1989, 1990; Chan & Dill, 1991a,b; Lipman & Wilbur, 1991; Shortle et al., 1992; Miller et al., 1992; Unger & Moult, 1993; O’Toole & Panagiotopoulos, 1993; Camacho & Thirumalai, 1993a; Camacho & Thirumalai, 1993b; Chan & Dill, 1994; Chan et al., 1995; reviewed by Dill et al., 1995). Figure 2 shows that the Miyazawa and Jernigan procedure correctly determines that the HH contact interaction is dominant and attractive. However, the extracted energies are not equal to the true energies. Two of the main errors introduced by the extraction process are: (1) the extracted energies e_HP and e_PP are found to be non-zero, whereas the true energies E_HP and E_PP are zero, and (2) all the extracted interactions depend on chain length, whereas the true energies do not. These errors arise from the approximations made in the extraction procedures for statistical potentials, as described below.

Figure 2. Extracted statistical potentials for the HP model versus chain length: e_HH (circles, true potential was −ε, where ε > 0), e_HP (squares, compare with the true potential of 0), e_PP (triangles, compare with the true potential of 0).

The problem: interactions are not independent

For the HP model, the extraction process infers that the HP interaction is more favorable than the PP interaction even though the true energies are zero for both HP and PP. The problem is not that HP contacts are more common than PP in the database; the problem is the assumption that the pairwise interactions are independent. The HH interaction, which dominates in this model, indirectly affects HP and PP pairing frequencies.

In the HP lattice model, favorable HH contacts are made preferentially, which has two effects on the observed frequencies of other contacts: (1) because each structure makes only a limited number of interresidue contacts, HH contacts deplete the total number of contacts available for any other interresidue contact (HP and PP contacts), and (2) HH contacts deplete the supply of H contact surfaces that are available for any other contacts with H monomers (HP and H-solvent contacts). But the random mixture reference state supposes that all contact types are uniformly available. Thus in the model database, PP, HP and H-solvent pairs are underrepresented relative to the reference state, so forming these contacts appears as an unfavorable contribution to the extracted contact energies (breaking them will appear favorable). The extracted HP contact energy therefore includes two large terms of opposite sign, an unfavorable HP contact forming term and a favorable H-solvent contact breaking term (Figure 1). The extracted PP contact energy, on the other hand, is dominated only by the unfavorable term for forming a PP contact; P residues are about equally solvated in both the native and reference states. The net result is that, as a result of the true HH interaction, the extracted HP contact energy appears more favorable than the extracted PP interaction. The HP and PP interactions are “coupled” to the HH interaction.

The coupling of interactions among different types of residue pairs is also illustrated by considering distance-dependent potentials (Figure 3). While the true potential in the HP model is just a first-neighbor HH contact interaction (i.e. a favorable “spike” at a distance of one lattice unit), the extracted potentials erroneously give a distance dependence. The extracted interactions are favorable over some distance ranges and unfavorable over others. The reason for this incorrect and complex distance dependence is the assumed uniform distributions in the extraction procedure reference states, as described below.

Figure 3. Distance-dependent statistical potentials extracted from 2D HP structure databases using the method of Sippl (1990), including sparse data correction (σ = 1/50). Lines are polynomial fits to the data to guide the eye. The extracted potentials for chain lengths of 11 (open circles) and 18 (filled circles) are shown. (a) Extracted HH interaction (compare with the true potential, a favorable “spike” at a distance of one lattice unit, and 0 for other distances). (b) Extracted HP interaction (the true potential is 0 for all distances). (c) Extracted PP interaction (the true potential is 0 for all distances).

The incorrect extracted potentials come from two types of coupling. First, in both the distance-dependent and contact potentials, the extracted energy of a monomer pair at a given distance is influenced by other pairs at the same distance. For the HP lattice model, the high density of HH pairs at short distances causes a correspondingly low density (relative to uniform) of HP and PP pairs at those distances. As a result, the extracted HP and PP potentials are erroneously unfavorable at short range. Second, for distance-dependent extracted potentials, the energy of a monomer pair at a given distance of spatial separation is influenced by the same pair at different distances. For instance, there is a high density of HH pairs at short distances, due to the true HH attraction. The total density of distances between HH pairs is the same in both the database and the uniform distribution reference state, so the higher (than uniform) concentration of HH pairs at short distances causes a compensating depletion (relative to uniform) at longer distances. The concentrations at different distances are treated as independent by equation (1), but independence is a poor approximation.

In summary, the complex extracted distance dependence of HP and PP interactions follows from the HH interactions. H monomers cluster, giving many short HH distances and few long HH distances. The HH attraction drives P monomers to the protein surface, resulting in many long PP distances. The HH attraction similarly causes many intermediate HP distances, between interior Hs and exterior Ps. Thus the HP and PP pairing frequencies are not independent of the HH interaction.

Figure 4. Distance-dependent potentials extracted from real protein structures using only two monomer types, interior or exterior: (a) interior-interior (compare with Figure 5(a)), (b) interior-exterior, (c) exterior-exterior (compare with Figure 5(b) and (c)). Potentials are extracted for the same sequence separations in the same set of proteins, and using the same equations (including sparse data correction, scaling for three parameters rather than 210) as in Figure 5. Interior/exterior positions are determined using the program ACCESS (Lee & Richards, 1971). The full backbone and side-chain center of mass is used for the calculation, with a probe radius of 2.0 Å to compensate for the single-atom side-chain representation. If the accessible surface area of a carbon atom positioned at the side-chain center of mass (Cα for glycine) exceeds 30 Å², the residue is considered to be exposed.

This coupling of interactions is not an artifact of our model. A comparison of Figures 4 and 5 shows the same coupling in potentials extracted from the Protein Data Bank. We classified the residues in proteins into just two types, interior or exterior, and then used the method of Sippl (1990) to extract a potential between these two geometrically defined “residue types.” This test shows that the interior-interior pairing “energy” (Figure 4(a)) is nearly identical to that calculated by Hendlich et al. (1990) for the pair Ile-Val (Figure 5(a)), two hydrophobic residues. Apparently the 30 different extracted energy parameters for the different distance intervals are mainly just reflecting that isoleucine and valine residues tend to be in protein interiors. Moreover, Figure 3 shows that a simple nearest-neighbor HH attraction in the lattice model is sufficient to give the same functional form as that for Ile-Val pairs in proteins. Hence the apparently complex distance dependence among amino acids in proteins may reflect little more than that hydrophobic residues attract each other.
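The first kind of coupling discussed in this section can be made concrete with a toy calculation in Python. The sketch below inverts contact counts against a bare random-mixing reference; it deliberately omits the solvent (desolvation) terms and the burial weighting of the Miyazawa and Jernigan procedure, and both the function name `extract_contact_energies` and the counts in the example are hypothetical, not data from the paper.

```python
import math

def extract_contact_energies(counts, x_h, kT=1.0):
    """Boltzmann-invert HH, HP and PP contact counts against random mixing.

    counts: observed numbers of HH, HP and PP contacts (all must be > 0).
    x_h:    fraction of monomers that are H (assumed composition).
    """
    x_p = 1.0 - x_h
    total = sum(counts.values())
    # Random mixing: expected contact numbers depend only on composition.
    expected = {
        "HH": x_h * x_h * total,
        "HP": 2.0 * x_h * x_p * total,
        "PP": x_p * x_p * total,
    }
    return {pair: -kT * math.log(counts[pair] / expected[pair])
            for pair in counts}

# Hypothetical database in which HH contacts are enriched.
energies = extract_contact_energies({"HH": 60, "HP": 30, "PP": 10}, x_h=0.5)
for pair, value in energies.items():
    print(pair, round(value, 3))
```

Because observed and expected counts share the same total, an excess of HH contacts forces a deficit of HP and PP contacts, and the inversion reports both as unfavorable whether or not any HP or PP interaction exists.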

More strikingly, Figure 3 shows that the HHcontact interaction is also sufficient to give anextracted PP potential (Figure 3(c)) that hasroughly the same functional form as for both Lys-Asp (unlike charges) and Asp-Asp (like charges)

Page 6: Statistical Potentials Extracted From Protein Structures: How Accurate Are They?

Accuracy of Statistical Potentials462

Figure 5. Distance-dependent potentials extracted byHendlich et al. (1990) for residue pairs separated inprimary sequence by 61 to 100: (a) valine-isoleucine, (b)lysine-aspartate, (c) aspartate-aspartate. Note the simi-larity between (b) and (c).

surrounded by polar residues. To test this possibleredundancy, we performed the same threading testas Hendlich et al. (1990), but using only a single‘‘energy’’ parameter that accounts for contactsbetween hydrophobic residues. Table 1 comparesour single-parameter results with the threadingpotential of Hendlich et al. (1990). Table 1 shows thatthis single hydrophobic contact energy parametercorrectly identifies the native conformation ashaving the lowest energy in nearly as many cases(37/65) as the much more complex potential ofHendlich et al. (41/65). Most of the failures inthe simple model are also failures in the com-plex model. We conclude that: (1) most of theinformation about protein energetics contained incomplex statistical potentials is simply hydrophobicclustering propensity, as has been noted before(Casari & Sippl, 1992; Bryant & Lawrence, 1993),and (2) the threading test (with no insertions anddeletions) is not a particularly challenging test ofenergy functions. Our results for the threading testare consistent with those of Bryant & Amzel (1987),who found that counting hydrophobic contacts candistinguish between correctly and grossly mis-folded proteins.

Interior-exterior partitioning: effects of proteinsize and composition

Whereas the true potentials between amino acidscannot depend on the chain length, the extractedpotentials do (Figures 2 and 3). For the distance-de-pendent potentials extracted from our 2D HP latticemodel (Figure 3), the functional form is similar forchain lengths of 11 or 18, but the scale of thedistance dependence is different. The scale is ageometric description of the average location ofmonomers relative to the core or surface. Forexample, H residues are overrepresented in thecores of our model structures, which are larger forthe 18-mers than for the 11-mers; accordingly, theextracted HH energy changes from attractive torepulsive between a distance of Z5 and Z8 latticeunits for the 18-mers, but only between Z2 and twolattice units for the 11-mers. Similarly, the PPinteraction reaches a minimum at three lattice unitsfor the 11-mers, but about five lattice units for the18-mers; this corresponds to the difference inaverage conformational diameters. Analogous geo-metric properties may account for the resultsreported by Hendlich et al. (1990) that a potentialextracted from a database of smaller proteinsperforms slightly better at recognizing the nativeconformations of other small proteins than apotential extracted from a database of proteins of allsizes.

In the case of the extracted contact potentials, thechain length dependence (Figure 2) is due primarilyto the desolvation terms (Figure 1), which take aform similar to that of a transfer from a solvated toa buried state. As Janin (1979) first noted, apparentinterior-exterior ‘‘partition energies’’ (extractedfrom frequencies of amino acids in buried and

(Figure 5(b) and (c)) as extracted from the PDB byHendlich et al. (1990). That is, the extracted chargedresidue interactions in proteins are not mainly dueto electrostatics; it appears to be mainly becausecharged residues are driven to the protein surfaceby the non-polar attractions of other amino acids.This explains the observation that more Coulomb-like charge-charge interactions are extracted fromthe PDB when statistics are compiled only onsurface residues (Kocher et al., 1994). Consideringonly surface residues is equivalent to using adifferent reference state for the potentials that takesinto account the average effect of the hydrophobicresidues on the observed charge-charge distri-butions.

Some statistical potentials have tens of thousandsof parameters for pairing frequencies as a functionof distance and sequence separation. But most ofthose parameters may be redundant, all reflectingmainly that non-polar amino acids form a core

Page 7: Statistical Potentials Extracted From Protein Structures: How Accurate Are They?

Accuracy of Statistical Potentials 463

Table 1. A count of hydrophobic contacts succeeds in the ‘‘threading test’’ nearly as often as a more complex potential for identifying native structures

Protein    Extracted potential   Hydrophobic contact count
PDB code   Position              Position   $N_{HH}$   $\Delta N_{HH}$

4SBV A     1                     1          227        15
3ADK       1                     1          167        24
2STV       5                     1          154        2
1HMG B     71                    803        88         −34
1GCR       1                     1          188        43
2ALP       1                     1          202        30
3WGA A     1                     1          147        29
2SGA       6                     1          181        45
2LZM       1                     1          183        38
4DFR A     1                     1          178        14
1LH4       1                     1          189        10
1MBD       1                     1          161        20
2SOD O     1                     1          136        38
2LHB       1                     1          183        17
2HHB B     1                     1          172        26
2PKA B     5                     3          116        −9
2HHB A     1                     1          155        4
3FXN       1                     1          173        46
1LZ1       1                     1          174        42
2AZA A     1                     1          120        8
2CCY A     2                     2          140        −1
1RN3       1                     1          99         8
1BP2       1                     1          115        2
1PP2 R     1                     1          118        0
155C       1                     1          99         13
1PAZ       1                     1          152        29
2PAB A     1                     1          119        0
1HMQ A     1                     115        81         −16
2C2C       1                     16         92         −11
1CPV       1                     1          121        2
1ACX       2                     1          116        7
1REI A     1                     1          95         9
2CDV       1                     1548       41         −24
2SSI       1                     7          119        6
4CYT R     1                     53         59         −18
1WRP R     1                     160        77         −16
1PCY       1                     1          101        10
1HVP A     5                     1          89         3
3FXC       2                     1          102        13
2GN5       26                    108        65         −13
1HIP       11                    24         83         −9
2B5C       1                     1          62         7
1CC5       235                   7          86         −8
351C       1                     11         73         −5
2PKA A     1                     42         62         −10
3ICB       1                     1          66         19
2ABX A     65                    12         56         −9
1HOE       125                   4          69         −3
1CTF       71                    1          101        17
1SN3       1                     1          70         8
1CSE I     1                     3          60         −4
2EBX       1                     2          32         −1
2MT2       1                     1          71         10
4PTI       1                     9          52         −5
1OVO A     1                     35         39         −11
1FDX       64                    1          71         1
5RXN       2571                  1          44         0
1CRN       24                    8          45         −6
1BDS       8                     4          42         −2
2RHV 4     74                    9690       9          −28
1PPT       10                    360        17         −10
1INS B     301                   186        25         −13
1GCN       4563                  666        11         −10
1MLT A     107                   340        20         −11
1INS A     483                   686        14         −14

Position of the correct native conformation in a list of threaded conformations sorted by the total distance-dependent statistical energy of Hendlich et al. (1990) (column 2), or just by counting non-polar contacts (Ala, Cys, Ile, Leu, Met, Phe, Trp, Tyr, Val; column 3). A contact is defined as a Cβ–Cβ distance of less than 8 Å. $N_{HH}$ is the number of non-polar contacts in the native conformation, and $\Delta N_{HH}$ is the difference between the number of non-polar contacts in the native conformation and in the lowest energy non-native threaded conformation.
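The contact count used in column 3 can be sketched in a few lines. This is a minimal illustration, assuming each residue is given as a three-letter name plus one Cβ coordinate in Å; the function name, the input layout and the `min_seq_sep` parameter are hypothetical and are not taken from the paper, which does not say whether sequence neighbors are excluded.

```python
import math

# The nine residue types counted as non-polar in the Table 1 footnote.
NONPOLAR = {"ALA", "CYS", "ILE", "LEU", "MET", "PHE", "TRP", "TYR", "VAL"}

def count_nonpolar_contacts(residues, cutoff=8.0, min_seq_sep=1):
    """Count non-polar pairs whose C-beta atoms are closer than `cutoff` (in Angstrom).

    residues: list of (three_letter_name, (x, y, z)) in chain order.
    min_seq_sep: smallest sequence separation counted (an assumption of this sketch).
    """
    count = 0
    for i, (name_i, xyz_i) in enumerate(residues):
        if name_i not in NONPOLAR:
            continue
        for j in range(i + min_seq_sep, len(residues)):
            name_j, xyz_j = residues[j]
            if name_j in NONPOLAR and math.dist(xyz_i, xyz_j) < cutoff:
                count += 1
    return count

# Toy chain: only the LEU-VAL pair is both non-polar and within 8 Angstrom.
toy = [("LEU", (0.0, 0.0, 0.0)), ("LYS", (3.0, 0.0, 0.0)),
       ("VAL", (6.0, 0.0, 0.0)), ("PHE", (20.0, 0.0, 0.0))]
n_hh = count_nonpolar_contacts(toy)
```

Ranking threaded conformations by this count (largest first) would then give the position reported in column 3; the difference between the native count and a decoy's count corresponds to the last column.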

exposed positions in proteins) will depend on the surface-to-volume ratio of the protein because the greater volume inside larger proteins more readily accommodates non-polar monomers. This explains the systematic errors noted by Miyazawa & Jernigan (1985) in trying to predict surface-to-volume ratios using their extracted contact energies. Miyazawa and Jernigan assumed their extracted energies would not depend on the surface-to-volume ratios since the true energies do not depend on this quantity.

Extracted partition energies also depend on amino acid composition even though the true potentials cannot. For the lattice model, extracted energies become more positive with increasing content of H monomers (Figure 6). This is because there are more H monomers than needed to fill the core (2D lattice 18-mers can have at most five monomers completely sequestered from solvent), so additional H monomers must be at least partially exposed to solvent. Because of coupling, the extracted HP and PP energies also increase with increasing non-polar content; with more H residues in the sequence, the chances increase that a given conformation will make only HH contacts. For example, out of all 164 structures in the database of 18-mers with 14 H monomers, 162 make only HH contacts and no HP or PP contacts.

To account approximately for the effects of surface-to-volume ratio and composition on extracted energies, we define the ‘‘partition propensity’’ (p) of a given protein or set of proteins as:

$$p = \frac{2 n_c}{q_H n_H} \qquad (3)$$

where $n_c$ is the total number of contacts in a given protein, $q_H$ is the average coordination number of a hydrophobic (H) residue (taken from Miyazawa & Jernigan, 1985) and $n_H$ is the number of H residues in the protein. Physically, $2 n_c$ is the number of coordination sites involved in all residue-residue contacts (roughly proportional to the total buried surface of the molecule), and $q_H n_H$ is the total number of H coordination sites (roughly proportional to the total hydrophobic surface).

The partition propensity is therefore a crude measure of how effectively hydrophobic surface can be buried in a given structure. If a protein has a low partition propensity, it has more hydrophobic residues than are needed to fill a core, so some will not be able to partition effectively into a core. A high partition propensity means there is a large buried core, and few hydrophobic residues in the sequence to fill it. Figure 7(b) shows how the extracted HH contact energy depends on the partition propensity in the lattice model.
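As a small worked illustration of equation (3), the function below computes the partition propensity from the three quantities defined in the text. The numerical inputs are invented for the example; in particular the coordination number passed as `q_h` is a placeholder and not the value tabulated by Miyazawa & Jernigan (1985).

```python
def partition_propensity(n_contacts, n_hydrophobic, q_h):
    """Partition propensity p = 2*n_c / (q_H * n_H), following equation (3).

    n_contacts:    total number of residue-residue contacts (n_c)
    n_hydrophobic: number of hydrophobic residues (n_H)
    q_h:           average coordination number of a hydrophobic residue (q_H)
    """
    if n_hydrophobic <= 0 or q_h <= 0:
        raise ValueError("n_hydrophobic and q_h must be positive")
    return 2.0 * n_contacts / (q_h * n_hydrophobic)

# Hypothetical protein: 300 contacts, 60 hydrophobic residues, q_H taken as 6.0.
p_example = partition_propensity(n_contacts=300, n_hydrophobic=60, q_h=6.0)

# Doubling the hydrophobic content at a fixed number of contacts halves p,
# i.e. there is less buried surface available per hydrophobic site.
p_more_h = partition_propensity(n_contacts=300, n_hydrophobic=120, q_h=6.0)
```

Under these made-up inputs the first protein has the higher propensity, which in the language of the text means its hydrophobic residues can be buried more effectively.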

The same dependence of extracted energies on partition propensity is found for real proteins (Figure 8). We sorted a group of 346 non-homologous protein chains (Hobohm & Sander, 1994) from the PDB by the partition propensity of each protein, labeling each residue type as H (hydrophobic) or P (others). We made 326 sets of 20 proteins by sliding


Figure 6. Extracted potentials depend on composition. The extracted HH (circles), HP (squares) and PP (triangles) energies for 2D lattice chains, L = 18, versus the number of H residues ($n_H$) in the sequences.

Figure 8. Extracted energies depend on partition propensity for proteins in the PDB. Hydrophobic residues (H) are Ala, Cys, Ile, Leu, Met, Phe, Tyr, Trp and Val; others are classed as P. The HH energy is shown by the bold line, HP by the continuous line and PP by the broken line.

a window along the sorted list of proteins. That is, the ith set contains the ith through (i + 19)th proteins in the sorted list. Figure 8 plots the extracted contact energies of each set against the average partition propensity of the set. The extracted energy between hydrophobic groups becomes significantly more favorable with increasing partition propensity, from about −1.3kT to −6.9kT, while the extracted PP energy goes from about 0kT to −2.4kT. The statistical potentials depend systematically on the size and amino acid composition of the proteins in the given set.
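The sliding-window construction described above can be sketched as follows. This is only an illustration of the bookkeeping, assuming each protein is represented by some record from which a propensity can be read; the helper names and the toy data are hypothetical.

```python
from statistics import mean

def sliding_sets(proteins, propensity, width=20):
    """Sort proteins by partition propensity and return all windows of `width`.

    The i-th window holds the i-th through (i + width - 1)-th proteins of the
    sorted list, so N proteins give N - width + 1 overlapping sets.
    """
    ordered = sorted(proteins, key=propensity)
    return [ordered[i:i + width] for i in range(len(ordered) - width + 1)]

def mean_propensity(protein_set, propensity):
    """Average partition propensity of one set (the x-axis value for that set)."""
    return mean(propensity(p) for p in protein_set)

# Toy data: 25 (name, propensity) records supplied in reverse order.
toy = [("p%d" % k, 1.0 + 0.01 * k) for k in range(25)][::-1]
get_p = lambda record: record[1]
sets = sliding_sets(toy, get_p, width=20)
x_values = [mean_propensity(s, get_p) for s in sets]
```

Because neighboring windows share all but one protein, the resulting curve is a smoothed trend in propensity and the points on it are not statistically independent.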

Consider two different 69-protein databases: one containing proteins with an average partition propensity p = 1.8 and the other having p = 1.0. Figure 9 plots the 210 contact energies extracted from one database against the same ones from the other database. The correlation between the two energy sets is only modest (R = 0.62). Even the rank ordering of contact pairings can be affected by protein size and composition. For example, for p = 1.8 the extracted Phe-Lys contact energy is more

favorable than Ile-Val, whereas for p = 1.0 Phe-Lys is less favorable than Ile-Val. Coupling effects are different in these two protein sets, because of the different degree of burial of the hydrophobic residues. Phe is usually completely buried in the proteins having higher partition propensity, so a Phe-solvent contact is very unfavorable by equation (2), and any contact with Phe appears very favorable because it breaks a Phe-solvent contact. However, for proteins having more hydrophobic residues than their cores can accommodate, Phe is often sacrificed to the surface, and Phe-solvent contacts are less unfavorable. If the relative strengths of the extracted contact energies depend on average properties of the proteins in the database, such as protein size and composition, then how can we know which, if

Figure 7. Extracted energies for the 2D HP model depend on (a) chain length, (b) average partition propensity of the structures in the database.


Figure 9. Contact energies extracted from two different sets of 69 proteins in the PDB. The energies calculated for the set of highest partition propensity (p = 1.8) are plotted against the energies for the same pairs calculated for the set of lowest partition propensity (p = 1.0). The slope of the linear best fit is 2.2, with an intercept of −1.6kT.

frequencies of different ‘‘states’’ (e.g. cis- and trans-peptide conformations) observed in the PDB correlate with the frequencies expected from thermodynamic behavior. Observed frequencies in proteins are converted to energy differences using the Boltzmann relation, and then plotted against the energy differences (expected from either theory or experiment) between different states according to thermodynamic behavior at T = 300 K. If the correlation is linear, the slope can be adjusted by choosing the best-fit value for T in the Boltzmann relation (i.e. so the slope of the plot is 1). This yields an effective ‘‘temperature’’ for that substructure in proteins relative to the temperature of a true Boltzmann ensemble. This has been done for the following protein properties. Extracted partitioning energies correlate (Miyazawa & Jernigan, 1985; Rose et al., 1985; Lawrence et al., 1987; Miller et al., 1987) with oil/water transfer experiments (e.g. Nozaki & Tanford, 1971; Fauchere & Pliska, 1983). Also well predicted are distributions of φ/ψ dihedral angles (Pohl, 1971; Kolaskar & Prashanth, 1979), charged residues (Bryant & Lawrence, 1991), cis- and trans-proline residues (MacArthur & Thornton, 1991), the sizes of empty cavities (Rashin et al., 1986) and residue stabilization of secondary structure elements (Serrano et al., 1992). Remarkably, for all of these different protein substructures, the apparent temperature is of the same order of magnitude, between about 150 K and 600 K.
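The slope-fitting procedure just described can be written out explicitly. The short derivation below is a sketch under the idealized assumption that the observed frequencies of two states a and b really do follow a Boltzmann law at some temperature; the symbols $f_a$, $f_b$, $T_0$, $T_{\mathrm{eff}}$ and $m$ are introduced here for illustration and do not appear in the original text.

```latex
% Energies extracted from observed frequencies at the reference temperature T_0 = 300 K:
\Delta E_{\mathrm{ext}} = -k T_0 \ln\frac{f_a}{f_b}

% If the frequencies obey a Boltzmann law at an effective temperature T_eff
% with true energy difference \Delta E:
\frac{f_a}{f_b} = \exp\!\left(-\frac{\Delta E}{k T_{\mathrm{eff}}}\right)

% Substituting gives a straight line through the origin with slope m:
\Delta E_{\mathrm{ext}} = \frac{T_0}{T_{\mathrm{eff}}}\,\Delta E
\quad\Longrightarrow\quad
m = \frac{T_0}{T_{\mathrm{eff}}},
\qquad
T_{\mathrm{eff}} = \frac{T_0}{m}
```

A slope below 1 therefore corresponds to an apparent temperature above 300 K, and choosing $T = T_{\mathrm{eff}}$ in the Boltzmann relation is what rescales the plot to unit slope.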

Finkelstein et al. (1993) have proposed a theory to explain why substructures have about the same frequencies in proteins as they would have in thermodynamic equilibrium. They argue, based on a random heteropolymer model of proteins, that the number of random sequences having a native structure that contains any given substructure depends exponentially on the energy of that substructure. This model predicts that the temperature in the Boltzmann relation should be the same for all types of substructures, and roughly equal to room temperature.

We test here the Boltzmann distribution assumption for the interior-exterior partitioning of residues. We find that the apparent temperature for the extracted partition energies depends systematically on the average partition propensity (equation (3)) for the set of proteins. We divided the 346 PDB protein structures, sorted by partition propensity, into five sets. For each set, we define the extracted exterior-interior partition energy for each amino acid i:

$$\Delta G_i = G_{\mathrm{inside}} - G_{\mathrm{outside}} = -kT \ln\left(\frac{n_{ir}}{n_{i0}}\right) \qquad (4)$$

where $n_{ir}$ is the number of contacts between residue type i and other residues (these are ‘‘interior’’ sites), and $n_{i0}$ is the number of contacts with solvent (residue type 0). Both $n_{ir}$ and $n_{i0}$ are estimated as by Miyazawa & Jernigan (1985). Physically, $n_{ir}$ corresponds roughly to the average fractional surface area of residues of type i that is buried in protein

any, set of proteins gives extracted energies that are similar to nature’s underlying energies? Clearly, simply increasing the number of proteins in a database does not ‘‘average out’’ the coupling of the extracted energies. However, there may be ways to construct contact pair density functions or reference states that can take into account properties such as the partition propensity.

The Boltzmann distribution: does it apply?

Among the deepest questions underlying statistical potentials is whether the Boltzmann distribution law, equations (1) or (2), is appropriate for converting pair frequencies in proteins to energies. What is the meaning of temperature in these expressions? The Boltzmann distribution law applies to a single closed system held at fixed temperature that can populate different energy levels. In the gas-phase amino acid pairing example above, increasing the temperature would cause Ala-Trp contacts, for example, to resemble a random distribution. But the PDB is many proteins, not a single system. The database is fixed; each protein has no degrees of freedom that are affected by temperature changes. The amino acid pairings in lysozyme, for example, are the same at T = 300 K as at T = 0 K. Consequently, the extracted potentials contain no information about protein stability; extracted energies will be the same whether the native conformations of the proteins are stable by $10^{-5}$ or $10^{5}$ kcal/mol. Should we use T = 0 K, T = 300 K, or some other temperature?

Nevertheless, some evidence supports the use of a Boltzmann distribution. In particular, for some protein ‘‘substructures’’ (e.g. proline residues),


Figure 10. Extracted exterior-interior partition energies of the 20 amino acids, for protein sets having different propensities, versus experimental water-octanol transfer energies (Fauchere & Pliska, 1983). The lines are regression fits to (from top to bottom) p = 1.0 (filled diamonds, R = 0.90, m = 0.28), p = 1.3 (open squares, R = 0.92, m = 0.42), p = 1.5 (crosses, R = 0.92, m = 0.50), p = 1.6 (filled triangles, R = 0.92, m = 0.58), and p = 1.8 (open circles, R = 0.88, m = 0.79).

In a Boltzmann distribution, the amount of surface in the excited (exposed) state will increase with increasing temperature. But for a large protein having few enough hydrophobic monomers (high partition propensity) that it buries all its hydrophobic residues in the core (the ground state), the Boltzmann analogy gives an apparent temperature of 0 Kelvin with respect to hydrophobic residue partitioning. On the other hand, small proteins with many non-polar residues will be ‘‘hotter’’ because those proteins are ‘‘forced,’’ by surface-to-volume and composition constraints, to expose hydrophobic monomers. Hence proteins and databases can differ in their ‘‘temperatures’’ of interior-exterior partitioning.

Can statistical potentials correctly recognize native structures?

The results above indicate that statistical potentials may not quantitatively reflect the true energies that cause amino acid pairings in proteins. Here we ask if they succeed in a more modest test: do they correctly identify a native structure among a set of decoys? In this case, the value of the temperature is unimportant. The temperature just scales all interactions to the same degree, so while it affects the absolute stability, it does not affect the rank orderings of different structures. Therefore, for a statistical potential to succeed in correctly rank-ordering the energies of different structures, it only needs to be approximately correct to within an arbitrary scaling constant C > 0. In the AB lattice model we require only that: $e_{AA} \approx C E_{AA}$, $e_{AB} \approx C E_{AB}$ and $e_{BB} \approx C E_{BB}$.

To test the accuracy of the extracted energies in structure prediction, we now turn to the three-energy AB model. For chains of length L = 14 of two monomer types A and B, we create databases of minimum energy structures for different ‘‘true’’ contact potential functions, and extract statistical contact energies from each database. We then compare the extracted energies $e_{AA}$, $e_{AB}$ and $e_{BB}$ to the true energies $E_{AA}$, $E_{AB}$ and $E_{BB}$. The most important test is whether, for all sequences in a given database, the true native conformation is identified as having the lowest value of the extracted energy over all possible conformations of that chain sequence. Unlike structure recognition tests that have been performed for real proteins, here we can exhaustively explore the conformational space for each AB sequence.
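The recognition test can be illustrated with a toy enumeration on the 2D square lattice. This is a minimal sketch and not the authors' code: it assumes that a contact is any pair of non-bonded monomers on adjacent lattice sites, that conformations related by rotation or reflection are counted once, and that a sequence has a native state only when its lowest-energy conformation is unique. All function names are hypothetical, and the extraction step itself is not reproduced; the ‘‘extracted’’ potential is simply supplied by hand.

```python
from itertools import product

STEPS = [(1, 0), (-1, 0), (0, 1), (0, -1)]

def conformations(n):
    """Yield self-avoiding walks of n >= 2 monomers on the square lattice.

    The first bond is fixed along +x and the first turn must go to +y,
    which removes copies related by rotation or reflection.
    """
    def grow(path, turned):
        if len(path) == n:
            yield tuple(path)
            return
        x, y = path[-1]
        for dx, dy in STEPS:
            if not turned and dy == -1:
                continue  # a first turn downward is the mirror image of one upward
            nxt = (x + dx, y + dy)
            if nxt in path:
                continue  # excluded volume: one monomer per site
            path.append(nxt)
            yield from grow(path, turned or dy != 0)
            path.pop()
    yield from grow([(0, 0), (1, 0)], False)

def energy(seq, conf, potential):
    """Sum contact energies over non-bonded nearest-neighbor pairs."""
    site = {p: i for i, p in enumerate(conf)}
    total = 0.0
    for i, (x, y) in enumerate(conf):
        for dx, dy in ((1, 0), (0, 1)):  # visit each lattice edge once
            j = site.get((x + dx, y + dy))
            if j is not None and abs(i - j) > 1:
                total += potential["".join(sorted(seq[i] + seq[j]))]
    return total

def native(seq, potential, tol=1e-9):
    """Return the unique lowest-energy conformation, or None if it is degenerate."""
    best, best_e, unique = None, None, False
    for conf in conformations(len(seq)):
        e = energy(seq, conf, potential)
        if best_e is None or e < best_e - tol:
            best, best_e, unique = conf, e, True
        elif abs(e - best_e) <= tol:
            unique = False
    return best if unique else None

def prediction_success(n, true_pot, extracted_pot):
    """Return (hits, total) over all AB sequences that have a unique true native state."""
    hits = total = 0
    for letters in product("AB", repeat=n):
        seq = "".join(letters)
        target = native(seq, true_pot)
        if target is None:
            continue
        total += 1
        hits += native(seq, extracted_pot) == target
    return hits, total

true_pot = {"AA": -5.0, "AB": -4.0, "BB": -1.0}
scaled_pot = {pair: 0.5 * e for pair, e in true_pot.items()}  # same potential, C = 0.5
```

Calling `prediction_success(6, true_pot, scaled_pot)` should report every native state as recognized, since a positive scaling constant cannot change rank order. Replacing one energy by a value of the wrong sign is a quick way to see recognition fail even on very short chains. The enumeration grows exponentially with chain length, so this sketch is only practical for short chains.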

Table 2 shows the contact energies extracted from databases constructed using different true potentials. To facilitate comparison, both the true and extracted potentials are scaled relative to kT such that the strongest attractive contact interaction has an energy of −5 units. Table 2 shows that in all cases where the three true contact energies are different, the extracted contact energies have the same rank ordering as the true energies. For example, for the true potential $E_{AA}:E_{AB}:E_{BB}$ = −5:−4:−1, the extracted potential $e_{AA}:e_{AB}:e_{BB}$ = −5:−3:+0.8 correctly predicts

interiors, while $n_{i0}$ corresponds roughly to the average fractional exposed surface area. Figure 10 plots the partition energies extracted from each protein set against the experimental energies for transferring each amino acid from water to octanol.
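Equation (4) is simple enough to evaluate directly. The sketch below returns the partition energy in units of kT, assuming the two contact counts have already been estimated by whatever procedure is in use; the function name and the example counts are invented for illustration.

```python
import math

def partition_energy_kT(n_ir, n_i0):
    """Exterior-interior partition energy of residue type i, in units of kT.

    Implements Delta G_i / kT = -ln(n_ir / n_i0), following equation (4), where
    n_ir counts contacts with other residues and n_i0 counts contacts with solvent.
    """
    if n_ir <= 0 or n_i0 <= 0:
        raise ValueError("contact counts must be positive")
    return -math.log(n_ir / n_i0)

# Hypothetical counts: a mostly buried residue type and a mostly exposed one.
dg_buried = partition_energy_kT(n_ir=200.0, n_i0=100.0)   # negative: prefers the interior
dg_exposed = partition_energy_kT(n_ir=100.0, n_i0=200.0)  # positive: prefers the exterior
```

The sign convention follows the equation: a residue type that makes more residue contacts than solvent contacts gets a negative, favorable energy for moving inside.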

We find that (1) extracted energies correlate with experimental partition energies, consistent with the use of the Boltzmann expression, but (2) there is no single temperature that is relevant. The first point is supported by the high correlation (R = 0.88 to 0.92) for all five protein sets. The relevant temperature is determined by the slopes of these plots. In Figure 10 the temperature relevant to the set p = 1.8 is T = 300 K/[(0.59)(0.79)] = 640 K, and to the set p = 1.0 is T = 300 K/[(0.59)(0.28)] = 1800 K, based on slope factors of 0.79 and 0.28 relative to oil/water partitioning at 300 K. This result suggests that with respect to interior-exterior residue partitioning the proteins in the PDB may not be well-modeled by the random heteropolymer assumption of Finkelstein et al. (1993). Figure 10 shows that the effective temperature of interior-exterior partitioning depends on the length, composition and compactness of the proteins in the database, while the random heteropolymer model results are independent of protein length (due to cancellation of length-dependent terms) and composition (due to sequence averaging).
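The temperature arithmetic quoted above can be reproduced for all five regression slopes listed in the caption of Figure 10. This is a small sketch; reading the factor 0.59 as the value of kT at 300 K in kcal/mol is an interpretation made here, not a statement from the text, and the function name is hypothetical.

```python
KT_AT_300K = 0.59  # factor used in the text; assumed here to be kT at 300 K in kcal/mol

def apparent_temperature(slope, t_ref=300.0, kt_ref=KT_AT_300K):
    """Apparent temperature in kelvin, T = t_ref / (kt_ref * slope), as in the text."""
    return t_ref / (kt_ref * slope)

# Slopes m from the Figure 10 caption, keyed by average partition propensity p.
slopes = {1.0: 0.28, 1.3: 0.42, 1.5: 0.50, 1.6: 0.58, 1.8: 0.79}
temperatures = {p: apparent_temperature(m) for p, m in slopes.items()}

for p, t in sorted(temperatures.items()):
    print(f"p = {p:.1f}: T is about {t:.0f} K")
```

For the two sets discussed in the text, the function gives values that round to the quoted 640 K and 1800 K, and the intermediate sets fall monotonically between them.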

What is the physical meaning of the apparent ‘‘temperature’’ of a single protein structure or a database of structures? Here is an analogy, based on the buried/exposed partitioning of amino acids. Non-polar side-chain surface is buried in its ‘‘ground state’’ and exposed in its ‘‘excited state.’’


Table 2. AB model test

True                           Extracted                      Number of    Prediction
$E_{HH} : E_{HP} : E_{PP}$     $e_{HH} : e_{HP} : e_{PP}$     sequences    success (%)

−5 : −5 : −1                   −5 : −3.7 : +1.4               173          84
−5 : −4 : −1                   −5 : −3.0 : +0.8               1388         74
−5 : −3 : −1                   −5 : −2.4 : 0.0                1374         64
−5 : −2 : −1                   −5 : −2.1 : −0.5               1726         99
−5 : −1 : −1                   −5 : −1.5 : −1.0               913          97
−4 : −5 : −1                   −1.1 : −5 : +1.8               1059         93
−3 : −5 : −1                   −0.8 : −5 : +1.8               1046         95
−2 : −5 : −1                   −0.5 : −5 : +1.4               1060         95
−1 : −5 : −1                   +1.3 : −5 : +1.3               918          90
−5 : −1 : −5                   −5 : +1.2 : −5                 915          90
−5 : −1 : −4                   −5 : −0.2 : −3.6               2264         99
−5 : −1 : −3                   −5 : −0.6 : −2.5               1922         92
−5 : −1 : −2                   −5 : −1.1 : −2.1               2040         100
−5 : −3 : +1                   −5 : −2.6 : +2.5               1514         96
−5 : −3 : +2                   −5 : −2.6 : +4.7               1467         99
−5 : −3 : +3                   −5 : −2.6 : +6.8               1465         99

Left column is the scaled set of true energies used to generate the lattice model database for L = 14. Second column is the scaled set of energies found by the statistical potential extraction procedures. Column 3 is the number of unique folding sequences in the database for each true potential. Column 4 shows how often the extracted potential correctly identifies the native structure within the database (i.e. the structure having the lowest extracted energy is also the structure with the lowest true energy).

Figure 11. Percent correct structure prediction by the HP model statistical contact potential versus chain length for 2D model. The structure of a sequence is predicted correctly if the true native conformation is also lower in extracted energy than any other conformation. Each set of connected points represents a different potential in Table 2. For the 12-mer and 14-mer chains, sequence space searching is exhaustive; for the 16-mer chains, random AB sequences are sampled until 1000 sequences having unique native structures are found for each true potential.

that the AA interaction is the most favorable and the BB interaction is least favorable (although it incorrectly predicts a repulsive BB contact energy).

For the three-interaction model, the extracted potentials qualitatively reflect the true potential. But this is not always sufficient for structure prediction since the extracted contact energies are usually different from the true energies, and sometimes even opposite in sign. As a result, the total extracted contact energy does not always correctly predict the native conformations of AB sequences. For only one of the model databases are the native structures correctly predicted for 100% of the sequences. Sometimes statistical potentials are calculated using a compact reference state rather than the unfolded state of Miyazawa and Jernigan. But this reference state is still a random mixture, and statistical energies calculated using the compact reference are essentially just shifted by a constant positive amount from the values in Table 2. For our lattice model, we find that such a shift generally decreases the success of the potential in identifying correct native conformations.

The extraction procedures fail to reproduce the correct rank ordering of interactions for degenerate potentials, in which two interactions are equal (Figure 2, Table 2). In these cases, the interactions that are equal in the true potential are not equal in the extracted potential (except when the true AA and BB interactions are equal, since the three-interaction potential is symmetric and sequence space is symmetric). This incorrect ‘‘splitting’’ of degenerate energies results from the coupling between different interactions, as we noted for the HP model. For the true potential $E_{AA}:E_{AB}:E_{BB}$ = −5:−5:−1, the extracted energies are $e_{AA}:e_{AB}:e_{BB}$ = −5:−2:+1. The AA interaction appears stronger than the AB

interaction because few A residues are exposed to solvent (so any A contact is very favorable). Forming an AA contact breaks two A-solvent contacts while forming an AB contact breaks only one, so the extracted AA energy is more favorable. In addition, the sign of the extracted BB interaction is wrong because BB contacts are less common than in a random mixture since Bs prefer to form AB contacts.

For the 14-monomer AB model, the correct native conformations are predicted for 64 to 100% of the sequences in a given database. But Figure 11 shows that the success in structure prediction generally decreases with chain length, even over very short chain lengths from 12 to 16 monomers. Since real proteins are much longer than our lattice chains (having hundreds to thousands of interresidue contacts), Figure 11 suggests that extracted statistical energies may have limited success in identifying the native states of protein sequences among a set of reasonable alternative structures.

Conclusions

We test some of the principles that underlie statistical potentials. Statistical potentials are energy-like quantities that are extracted from protein structure databases, based on certain assumptions. They have been used to model the true energies that cause proteins to fold, dock with ligands, and


recognize other proteins. We test the premises behind statistical potentials using exact lattice models, and we verify our conclusions, where possible, with tests on the PDB.

We conclude that the principal weakness in all the current statistical potentials is their assumption that the frequencies of each type of amino acid pair, such as Ala-Leu, are independent of other types of pairs, such as Phe-Gly. In a relatively small, compact object such as a protein, the space taken by other amino acids is a strong constraint on the possible positions of each given pair. The clustering of hydrophobic amino acids is probably a stronger determinant of the statistical potentials among charged groups than electrostatics are. We find that, whereas true potentials cannot depend on chain length or composition, extracted potentials do. This too appears to be mainly a consequence of the burial of hydrophobic surface in small compact objects of differing sizes and compositions.

There are a few caveats in interpreting our lattice model results. First, excluded volume is a more stringent constraint in two dimensions than in three, simply because of the reduced number of possible spatial neighbors of each residue. Furthermore, sequence effects may be more pronounced in our short-chain model. As a control, we have constructed a database of low-energy configurations of long (60-monomer) HP chains on a 3D lattice, and found results that are similar to those from the 2D HP model. Second, there are only two monomer types, so that the dominant interaction (e.g. the HH interaction) will tend to be the primary determinant of all the observed pair distributions. The real energetics of protein folding are undoubtedly more complex.

For real protein structures, we demonstrate that the use of the Boltzmann distribution law to convert interior-exterior residue partitioning frequencies to energies, which defines a database temperature relative to octanol-water partition energies, is not firmly grounded. The choice of a relevant temperature is strongly dependent on the choice of proteins in the database. We define a quantity we call the partition propensity of a given protein, which determines the relevant temperature in a Boltzmann equation. It may be possible to use this quantity to weight the burial frequencies observed in each protein to obtain database-independent partition energies.

Acknowledgements

P.D.T. is a Howard Hughes Predoctoral Fellow. We thank Kai Yue for suggesting this approach.

References

Bernstein, F. C., Koetzle, T. F., Williams, G. J. B., Meyer, E. F., Brice, M. D., Rodgers, J. R., Kennard, O., Shimanouchi, T. & Tasumi, M. (1977). The protein data bank: a computer-based archival file for macromolecular structures. J. Mol. Biol. 112, 535–542.

Bowie, J. U., Luthy, R. & Eisenberg, D. (1991). A method to identify protein sequences that fold into a known three-dimensional structure. Science, 253, 164–170.

Bryant, S. H. & Amzel, L. M. (1987). Correctly folded proteins make twice as many hydrophobic contacts. Int. J. Peptide Protein Res. 29, 46–52.

Bryant, S. H. & Lawrence, C. E. (1991). The frequency of ion-pair substructures in proteins is quantitatively related to electrostatic potential: a statistical model for nonbonded interactions. Proteins: Struct. Funct. Genet. 9, 108–119.

Bryant, S. H. & Lawrence, C. E. (1993). An empirical energy function for threading protein sequence through the folding motif. Proteins: Struct. Funct. Genet. 16, 92–112.

Camacho, C. J. & Thirumalai, D. (1993a). Kinetics and thermodynamics of folding in model proteins. Proc. Natl Acad. Sci. USA, 90, 6369–6372.

Camacho, C. J. & Thirumalai, D. (1993b). Minimum energy compact structures of random sequences of heteropolymers. Phys. Rev. Letters, 71, 2505–2508.

Casari, G. & Sippl, M. J. (1992). Structure-derived hydrophobic potential. Hydrophobic potential derived from X-ray structures of globular proteins is able to identify native folds. J. Mol. Biol. 224, 725–732.

Chan, H. S. & Dill, K. A. (1991a). Sequence-space soup of proteins and copolymers. J. Chem. Phys. 95, 3775–3787.

Chan, H. S. & Dill, K. A. (1991b). Polymer principles in protein structure and stability. Annu. Rev. Biophys. Biophys. Chem. 20, 447–449.

Chan, H. S. & Dill, K. A. (1994). Transition states and folding dynamics of proteins and heteropolymers. J. Chem. Phys. 100, 9238–9257.

Chan, H. S., Bromberg, S. & Dill, K. A. (1995). Models of cooperativity in protein folding. Phil. Trans. Roy. Soc. Lond. ser. B, 348, 61–70.

Dill, K. A., Bromberg, S., Yue, K., Fiebig, K. M., Yee, D. P., Thomas, P. D. & Chan, H. S. (1995). Principles of protein folding - a perspective from simple exact models. Protein Sci. 4, 561–602.

Fauchere, J.-L. & Pliska, V. (1983). Hydrophobic parameters π of amino-acid side-chains from the partitioning of N-acetyl-amino acid amides. Eur. J. Med. Chem. Chim. Ther. 18, 369–375.

Finkelstein, A. V., Gutin, A. M. & Badretdinov, A. Ya. (1993). Why are the same protein folds used to perform different functions? FEBS Letters, 325, 23–28.

Finkelstein, A. V., Gutin, A. M. & Badretdinov, A. Ya. (1995). Boltzmann-like statistics of protein architectures. Origins and consequences. Sub-Cell. Biochem. 24, 1–26.

Godzik, A. & Skolnick, J. (1992). Sequence-structure matching in globular proteins: application to supersecondary and tertiary structure determination. Proc. Natl Acad. Sci. USA, 89, 12098–12102.

Hendlich, M., Lackner, P., Weitckus, S., Floeckner, H., Froschauer, R., Gottsbacher, K., Casari, G. & Sippl, M. J. (1990). Identification of native protein folds amongst a large number of incorrect models. The calculation of low energy conformations from potentials of mean force. J. Mol. Biol. 216, 167–180.

Hobohm, U. & Sander, C. (1994). Enlarged representative set of protein structures. Protein Sci. 3, 522–524.

Janin, J. (1979). Surface and inside volumes in globular proteins. Nature, 277, 491–492.


Jones, D. T., Taylor, W. R. & Thornton, J. M. (1992). A new approach to protein fold recognition. Nature, 358, 86–89.

Kocher, J. P., Rooman, M. J. & Wodak, S. J. (1994). Factors influencing the ability of knowledge-based potentials to identify native sequence-structure matches. J. Mol. Biol. 235, 1598–1613.

Kolaskar, A. S. & Prashanth, D. (1979). Empirical torsional potential functions from protein structure data. Int. J. Peptide Protein Res. 14, 88–98.

Kolinski, A. & Skolnick, J. (1994). Monte Carlo simulations of protein folding. I. Lattice model and interaction scheme. Proteins: Struct. Funct. Genet. 18, 338–352.

Lau, K. F. & Dill, K. A. (1989). A lattice statistical mechanics model of the conformational and sequence spaces of proteins. Macromolecules, 22, 3986–3997.

Lau, K. F. & Dill, K. A. (1990). Theory for protein mutability and biogenesis. Proc. Natl Acad. Sci. USA, 87, 638–642.

Lawrence, C., Auger, I. & Mannella, C. (1987). Distribution of accessible surfaces of amino acids in globular proteins. Proteins: Struct. Funct. Genet. 2, 153–161.

Lee, B. K. & Richards, F. M. (1971). The interpretation of protein structures: estimation of static accessibility. J. Mol. Biol. 55, 379–400.

Lipman, D. J. & Wilbur, W. J. (1991). Modelling neutral and selective evolution of protein folding. Proc. Roy. Soc. London, ser. B, 245, 7–11.

Luthy, R., Bowie, J. U. & Eisenberg, D. (1992). Assessment of protein models with three-dimensional profiles. Nature, 356, 83–85.

MacArthur, M. W. & Thornton, J. M. (1991). Influence of proline residues on protein conformation. J. Mol. Biol. 218, 397–412.

Miller, R., Danko, C. A., Fasolka, M. J., Balazs, A. C., Chan, H. S. & Dill, K. A. (1992). Folding kinetics of proteins and copolymers. J. Chem. Phys. 96, 768–790.

Miller, S., Janin, J., Lesk, A. M. & Chothia, C. (1987). Interior and surface of monomeric proteins. J. Mol. Biol. 196, 641–656.

Miyazawa, S. & Jernigan, R. L. (1985). Estimation of effective interresidue contact energies from protein crystal structures: quasi-chemical approximation. Macromolecules, 18, 534–552.

Nishikawa, K. & Matsuo, Y. (1993). Development of pseudoenergy potentials for assessing protein 3-D-1-D compatibility and detecting weak homologies. Protein Eng. 6, 811–820.

Nozaki, Y. & Tanford, C. (1971). The solubility of amino acids and two glycine polypeptides in aqueous ethanol and dioxane solutions. J. Biol. Chem. 246, 2211–2217.

O’Toole, E. M. & Panagiotopoulos, A. Z. (1993). Effect of sequence and intermolecular interactions on the number and nature of low-energy states for simple model proteins. J. Chem. Phys. 90, 3185–3190.

Pellegrini, M. & Doniach, S. (1993). Computer simulation of antibody binding specificity. Proteins: Struct. Funct. Genet. 15, 436–444.

Pohl, F. M. (1971). Empirical protein energy maps. Nature New Biol. 234, 277–279.

Rashin, A. A., Iofin, M. & Honig, B. (1986). Internal cavities and buried waters in globular proteins. Biochemistry, 25, 3619–3625.

Richards, F. M. (1977). Areas, volumes, packing and protein structure. Annu. Rev. Biophys. Bioeng. 6, 151–176.

Rose, G. D., Geselowitz, A. R., Lesser, G. J., Lee, R. H. & Zehfus, M. H. (1985). Hydrophobicity of amino acid residues in globular proteins. Science, 229, 834–838.

Serrano, L., Sancho, J., Hirshberg, M. & Fersht, A. R. (1992). Alpha-helix stability in proteins. I. Empirical correlations concerning substitution of side-chains at the N- and C-caps and the replacement of alanine by glycine or serine at solvent-exposed interfaces. J. Mol. Biol. 227, 544–559.

Shortle, D., Chan, H. S. & Dill, K. A. (1992). Modeling the effects of mutations on the denatured states of proteins. Protein Sci. 1, 201–215.

Sippl, M. J. (1990). Calculation of conformational ensembles from potentials of mean force. An approach to the knowledge-based prediction of local structures in globular proteins. J. Mol. Biol. 213, 859–883.

Sippl, M. J. & Weitckus, S. (1992). Detection of native-like models for amino acid sequences of unknown three-dimensional structure in a data base of known protein conformations. Proteins: Struct. Funct. Genet. 13, 258–271.

Skolnick, J. & Kolinski, A. (1990). Simulations of the folding of a globular protein. Science, 250, 1121–1125.

Sun, S. (1993). Reduced representation model of protein structure prediction: statistical potential and genetic algorithms. Protein Sci. 2, 762–785.

Tanaka, S. & Scheraga, H. A. (1976). Medium- and long-range interaction parameters between amino acids for predicting three-dimensional structures of proteins. Macromolecules, 9, 945–950.

Unger, R. & Moult, J. (1993). Genetic algorithms for protein folding simulations. J. Mol. Biol. 231, 75–81.

Wilmanns, M. & Eisenberg, D. (1993). Three-dimensional profiles from residue-pair preferences: identification of sequences with beta/alpha-barrel fold. Proc. Natl Acad. Sci. USA, 90, 1379–1383.

Wilson, C. & Doniach, S. (1989). A computer model to dynamically simulate protein folding: studies with crambin. Proteins: Struct. Funct. Genet. 6, 193–209.

Edited by F. Cohen

(Received 6 September 1995; accepted 30 November 1995)