A potential function for conformational analysis of proteins
Int. J. Peptide Protein Res. 24, 1984, 279-296
G.M. CRIPPEN and V.N. VISWANADHAN
Department of Chemistry, Texas A & M University, College Station, Texas, USA
Received 15 November 1983, accepted for publication 15 March 1984
We have devised a residue-residue potential function for low resolution protein conformational calculations. The interactions between residues near in sequence maintain correct secondary structure, while the long-range terms in the potential govern the larger packing features and overall globularity. The short-range terms were calculated by comparing the observed distributions of distances between Cα atoms in 35 protein crystal structures to the expected distributions and assigning the discrepancies to a Boltzmann distribution due to an effective potential. Long-range terms were adjusted to ensure that the crystal structure of bovine pancreatic trypsin inhibitor has a lower total energy than perturbed conformations of the same molecule. Thus the empirical potential function implicitly contains solvation and conformational entropy effects along with the usual van der Waals and electrostatic energies. Extensive testing of the potential on trypsin inhibitor and other proteins establishes that it is generally applicable to small proteins; that it does not attempt to compress or expand the conformations found by X-ray crystallography; that standard secondary structural features are maintained under the potential; and that there are so many local minima that local minimization can be trusted to return a perturbed structure to the native conformation only if they differ initially by less than 1 Å.
Key words: amino acid residue; conformation; energy embedding; potential energy; protein
One of the standard approaches for attempting to calculate the tertiary structure of a globular protein, given only its sequence, is to search for conformations of low calculated energy. Recent advances in the distance geometry approach, called energy embedding (1, 2), enable us to find very rare, low energy conformations possibly subject to additional distance constraints. For general application of energy embedding to protein conformational calculations, the energy function must have the following characteristics: (i) the molecule must be represented as a collection of points in space, where each point may represent a single atom or some group of atoms; (ii) the energy is calculated as a sum of pairwise interaction terms between all pairs of points; (iii) each term depends isotropically on the distance between the points and on which points they are; and (iv) each term's functional dependence on distance may have only a single minimum (unimodality). The need for each of these requirements is clear from an examination of the energy embedding algorithm (1), and will not be discussed here. In order for the potential to be applied successfully to many different proteins, we must further require that (v) the identity of the interacting points in the third requirement can only be specified on the basis of chemical type (atom type, residue type, etc.) and/or relative position in the amino acid sequence.
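As a minimal sketch of what requirements (i) to (v) imply for the shape of an energy routine, and not the potential derived in this paper, the total energy can be written as a plain sum over all pairs of points, with each term seeing only the two residue types, their sequence separation, and their distance. The function names `pair_term` and `total_energy` are hypothetical, and the harmonic well with its preferred distances is a placeholder chosen only so the example runs.

```python
import math
from itertools import combinations


def pair_term(type_i, type_j, seq_sep, dist):
    """Hypothetical unimodal pair term (placeholder, not the paper's function).

    A harmonic well has a single minimum, so it satisfies requirement (iv).
    The residue types are accepted but unused in this placeholder.
    """
    preferred = 3.8 if seq_sep == 1 else 6.0  # illustrative distances in Angstrom
    return (dist - preferred) ** 2


def total_energy(coords, residue_types, term=pair_term):
    """Sum isotropic pairwise terms over all pairs of residue points."""
    energy = 0.0
    for i, j in combinations(range(len(coords)), 2):
        dist = math.dist(coords[i], coords[j])  # only the distance enters
        energy += term(residue_types[i], residue_types[j], j - i, dist)
    return energy


# Three collinear points spaced 3.8 Angstrom apart (made-up coordinates).
example_coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
example_types = ["GLY", "ALA", "GLY"]
print(total_energy(example_coords, example_types))
```

Because only interpoint distances enter, any function of this form assigns the same energy to a conformation and to its mirror image.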
The purpose of this paper is to present a generally useful methodology for finding a potential of the above form given crystal structures of a number of proteins, and to examine the suitability of the potential we have produced for protein conformational calculations. A number of potential functions for protein energy calculations have been published, but none fulfills all the design criteria listed above. The one point per atom potentials, such as that of Momany et al. (3), are not strictly sums of pairwise interactions because of the torsional terms they employ. We are also interested in speeding up the calculations by adopting a considerably simplified model of the protein. The function of Pincus & Scheraga (4) takes a step in that direction by maintaining an all-atom representation only for residues near each other in space. Unfortunately, even the one point per residue polyvaline potential of McCammon (5) and its predecessor (6) have torsional terms about the virtual bonds. Our earlier effort along these lines (7) does not foster good secondary structure, and some of the terms are not unimodal functions of the distance between the interacting residues. Thus we have been forced to devise a new potential adhering strictly to the design criteria.
The first requirement concerns the precision of representation of the protein molecule. Since all pairwise interactions must be included in the calculation of the energy for a given conformation, it is obviously much faster to deal with fewer points. Also, the fewer the points, the fewer the local minima in the potential, as an empirical rule. Therefore we have chosen not an all-atom representation, or a sidechain point and a backbone point per residue, but simply one point per amino acid residue, centered at the Cα. The penalty is that fine details of structure will not be represented, and there may be errors on a scale even larger than the size of the atom clusters used. Without some explicit sidechain points, the isotropic interpoint terms depending only on distance (according to the third requirement) cannot distinguish between a right-handed and a left-handed α helix, for example. The potential could, however, show a preference for two right-handed helices packed together over the packing of a right-handed one and a left-handed one. Thus it would be possible to arrive at the native conformation except for an overall mirror inversion of the entire molecule, but the relative chirality of the various parts of the molecule could be correctly deduced, in principle. Since our present interest lies in calculating tertiary structure of proteins, we do not anticipate that lumping all the atoms of an entire residue together will be too great an approximation.
The fourth requirement forced us to abandon the Oobatake & Crippen (7) potential, because some terms had as many as three minima with respect to the distance between the interacting Cα's. Energy embedding with this potential nevertheless gave such encouraging results (2) that we felt obliged to devise a more suitable function.
The second requirement turns out to cause a great deal of trouble. We conjecture that whenever the number of pairwise unimodal interaction terms in the potential exceeds the number of conformational degrees of freedom, there will be many local minima. In our experience, the number of local minima rises very rapidly with the number of points (and hence energy terms). For example, consider points labeled 1 and 2 constrained to lie on the x-axis. If the potential F consists simply of quadratic terms, then

$$F = (1 - d_{12})^2, \qquad d_{ij} = |x_i - x_j|, \tag{1}$$

and there is clearly only a single minimum (at $|x_1 - x_2| = 1$), a single interaction term, and one degree of conformational freedom. If we now add a third point and change F to

$$F = (1 - d_{12})^2 + (1 - d_{23})^2 + (2 - d_{13})^2, \tag{2}$$

then there are three terms in F and two conformational degrees of freedom. In all there are three minima: F(0, 1, 2) = 0, F(0, 5/3, 4/3) = 4/3, and F(0, -1/3, 4/3) = 4/3. The situation quickly becomes more complicated in three dimensions, but we have observed a high density of local minima for systems of many points with even the simplest of unimodal interaction terms.

A second difficulty with having more interaction terms than degrees of freedom arises when trying to deduce the energy function causing observed conformations. Suppose the set of conformational parameters, $\{\phi_i\}$, i = 1, ..., N (representing bond lengths, angles, dihedral angles, etc.), are a necessary and sufficient set of parameters to completely describe a molecule's conformation. Further suppose that the $\phi$'s are mutually independent in the global sense that any particular $\phi_i$ can assume its full range of values regardless of the values of the other $\phi_{i+k}$. This would not be true, for example, if the $\phi$'s represented interpoint distances. Then suppose we can make a large number of observations on the molecule in an equilibrium system. Estimating for all i the probability of $\phi_i$ taking on its various possible values, $p(\phi_i)$, by the observed normalized frequency, we can use standard equilibrium statistical mechanics to conclude

$$p(\text{conformation}) = \prod_{i=1}^{N} p(\phi_i) = \frac{1}{Z} \exp\left[-\sum_{i=1}^{N} e_i(\phi_i)/RT\right], \tag{3}$$

where Z is the partition function. Because of the simple correspondence between statistically independent $p(\phi_i)$'s and additive energy terms, $e_i(\phi_i)$'s, it is easy to calculate the total probability of any particular conformation, p(conformation), and ascribe its frequency of occurrence to a Boltzmann distribution over the energy function. Given a sufficiently large, unbiased set of observations, each $e_i$ can be deduced for all values of $\phi_i$, for all i. Unfortunately, we are obliged to work with interpoint (interresidue) distances as our conformational variables. Not only are there fundamental geometric interdependencies on the values the distances may assume, but there are now many more pairwise interaction terms in the energy sum than there are degrees of conformational freedom. For a polypeptide chain at the single point per residue level, a necessary and sufficient set of conformational parameters would be the $d_{i,i+1}$, $d_{i,i+2}$, and $d_{i,i+3}$ (plus the chirality of sequentially adjacent quartets of residue points) for all i. Physically it is clear that there must be some sort of energy term for residues with greater sequence separation, although such effects could be built into possibly complicated functional forms for the short-range interaction terms, in principle. We instead chose to keep all interaction terms in the energy in spite of there being insufficient conformational degrees of freedom to make a one-to-one assignment between terms and conformational parameters. That left us with considerable ambiguity as to how the total energy should be apportioned among the many terms, and we have used that latitude to keep the functional forms of the terms simple and unimodal.
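The three-point example on the x-axis can be checked numerically. The sketch below assumes that the potential is the sum of three quadratic terms in the interpoint distances, with preferred distances 1, 1, and 2, and that point 1 is held at the origin; the function names, starting points, step size, and iteration count are illustrative choices. Plain gradient descent from three different starting arrangements should settle into three distinct minima, near (1, 2) with F = 0 and near (5/3, 4/3) and (-1/3, 4/3) with F = 4/3.

```python
def energy(x2, x3):
    """Three-point potential with point 1 fixed at the origin."""
    d12, d23, d13 = abs(x2), abs(x3 - x2), abs(x3)
    return (1 - d12) ** 2 + (1 - d23) ** 2 + (2 - d13) ** 2


def minimize(x2, x3, step=0.1, iterations=2000, h=1e-6):
    """Gradient descent using central-difference derivatives."""
    for _ in range(iterations):
        g2 = (energy(x2 + h, x3) - energy(x2 - h, x3)) / (2 * h)
        g3 = (energy(x2, x3 + h) - energy(x2, x3 - h)) / (2 * h)
        x2, x3 = x2 - step * g2, x3 - step * g3
    return x2, x3


# One start per ordering of the points along the axis (illustrative values).
starts = [(0.9, 2.1), (1.5, 1.2), (-0.5, 1.5)]
minima = [minimize(x2, x3) for x2, x3 in starts]

for x2, x3 in minima:
    print(f"x2 = {x2:.4f}, x3 = {x3:.4f}, F = {energy(x2, x3):.4f}")
```

Each ordering of the three points along the axis has its own basin, which is why a local minimizer cannot move between them even in this tiny system.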
We proceed from the assumptions that even 10 Å diameter regions in protein crystal structures are in thermal equilibrium and that large separation implies weak interaction. Examination of the X-ray diffraction data over the years has led many workers to notice a multitude of features in the structures of globular proteins. It is, however, not immediately obvious which are energetically important, which reveal the fundamental energetics, and which are more complicated consequences of the fundamental interactions. Our general plan of attack is to look at groups of residues that lie very close together in the crystal structures, assume they have a Boltzmann distribution of the unknown energy, and then deduce what the energy must be in order to give such a distribution. Then we can consider slightly larger groups of residues, asking whether the known parts of the potential are sufficient to give the observed clusters a Boltzmann distribution, and if not, deduce whatever additional energy contributions are needed.
Using the same data set of 35 protein crystal structures employed by Oobatake & Crippen (7), we begin with the obvious. As before, we are assuming we can represent each amino acid residue by a single point located at the Cα. The strongest difference between a protein and a gas is that the residues have sequence numbering, and the distance of closest approach is always found between all sequentially adjacent residues, i and i + 1. Given the resolution of the data, about all we can say is that $d_{i,i+1} = 3.8 \pm 0.07$ Å. Some residue type pairs have slight statistical tendencies to deviate more than the given overall standard deviation, but the only justifiable conclusion is to take the i to i + 1 distance to be fixed at 3.8 Å. As an illustration of our formal procedure, we could fit the observed distribution of distances to the gaussian
$$\exp\left[-0.5\,(1/\sigma^2)\,(d_{i,i+1} - 3.8)^2\right], \tag{4}$$

which is the form for a Boltzmann distribution on a harmonic potential with $1/\sigma^2 = k_f/RT$. That would amount to a force constant $k_f$ = 8.3 kcal/Å². In all that follows, we take the force constant to be so large that $d_{i,i+1}$ remains fixed at 3.8 Å.
The next largest cluster to consider is the three residues i, i + 1, and i + 2 for all residues i of all 35 proteins. We can already calculate the i to i + 1 and the i + 1 to i + 2 energies (and we presume that the distances $d_{i,i+1}$ and $d_{i+1,i+2}$ do not deviate significantly from their energetically minimal value of 3.8 Å). That leaves only the i to i + 2 interactions to be determined. This energy contribution presumably is connected with the secondary structure preferences of the residues in question, since short $d_{i,i+2}$'s would be seen in helical or bend segments, and relatively long distances would correspond to extended structure. Because $d_{i,i+2}$ is determined solely by the φ and ψ dihedral angles of residue i + 1 (assuming planar trans peptide bonds), we have grouped our observations according to the type of residue i + 1. Referring to Fig. 1, we see that the i to i + 2 distance is $d(\theta) = 3.8\,[2 - 2\cos\theta]^{1/2}$,
FIGURE 1 Relation of the virtual bond angle, θ, to the i to i + 2 interresidue distance, d, for fixed virtual bond lengths = 3.8 Å.
TABLE 1 Frequency of occurrence of a < d(i,i+2) ≤ b calculated for polypeptides according to the freely jointed chain model, and observed in the crystal structures, averaged over all residue types

a (Å)    b (Å)    P(a, b; calc.)    P(a, b; obs.)
0.0      5.0      0.457             0.029
5.0      5.4      0.046             0.154
5.4      5.8      0.050             0.253
5.8      6.2      0.054             0.138
6.2      6.6      0.063             0.153
6.6      7.0      0.075             0.157
7.0      7.6      0.255             0.116
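The calculated column of Table 1 can be approximated in a few lines, as a sketch under stated assumptions: the virtual bond angle θ is taken to be uniformly distributed on [0, π] (the freely jointed chain model named in the caption), the virtual bond length is 3.8 Å, and RT = 0.593 kcal/mol (about 298 K) is an assumed value that the text does not give. The names `theta`, `p_calc`, and `TABLE_1` are illustrative. The final quantity is a bin-by-bin Boltzmann inversion, -RT ln(P_obs/P_calc), which is defined only up to an additive constant and is not the fitting procedure used in this paper.

```python
import math

BOND = 3.8   # virtual bond length in Angstrom, from the text
RT = 0.593   # kcal/mol near 298 K; an assumed temperature

# (a, b, P_calc from Table 1, P_obs from Table 1)
TABLE_1 = [
    (0.0, 5.0, 0.457, 0.029),
    (5.0, 5.4, 0.046, 0.154),
    (5.4, 5.8, 0.050, 0.253),
    (5.8, 6.2, 0.054, 0.138),
    (6.2, 6.6, 0.063, 0.153),
    (6.6, 7.0, 0.075, 0.157),
    (7.0, 7.6, 0.255, 0.116),
]


def theta(d):
    """Virtual bond angle in radians for an i to i+2 distance d."""
    cosine = 1.0 - d * d / (2.0 * BOND * BOND)
    return math.acos(max(-1.0, min(1.0, cosine)))  # clamp rounding error


def p_calc(a, b):
    """Probability of a < d <= b when theta is uniform on [0, pi]."""
    return (theta(b) - theta(a)) / math.pi


computed = [p_calc(a, b) for a, b, _, _ in TABLE_1]
energies = [-RT * math.log(p_obs / pc)
            for (_, _, _, p_obs), pc in zip(TABLE_1, computed)]

for (a, b, tabulated, _), pc, e in zip(TABLE_1, computed, energies):
    print(f"{a:.1f} to {b:.1f}: calc {pc:.4f} (table {tabulated:.3f}), E = {e:+.2f}")
```

With these assumptions the computed probabilities agree with the tabulated calculated column to within about 0.001. The inverted energy is positive for the shortest-distance bin, which is strongly depleted in the crystal structures, and negative for the 5.4 to 5.8 Å bin, which is enriched.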
and conversely the virtual bond angle is $\theta(d) = \arccos[1 - d^2/(2l^2)]$, where l = 3.8 Å is the virtual bond length. Consequently P(a, b), the probability of observing a three residue segment with a < d ≤ b, is given by

$$P(a, b) = \frac{\int_{\theta(a)}^{\theta(b)} \exp[-E(d(\theta))/RT]\, d\theta}{\int_{0}^{\pi} \exp[-E(d(\theta))/RT]\, d\theta}. \tag{5}$$

The integration is over θ rather than d for unbiased sampling. We take as the null hypothesis at this level of our analysis that E(d) = 0, which amounts to assuming that the observed distribution of distances is merely the geometric consequence of a freely jointed chain. Then P(a, b; calc.) = (θ(b) - θ(a))/π. Table 1 shows clearly that there must be an effective energy of interaction between i and i + 2 in order to account for the observed distribution. It turns out that a simple quadratic function of d does not give a good fit to the observations; rather the potential well is much steeper at low separations than at high ones. The simplest polynomial having only one minimum and being asymmetric about the minimum is a function f(d) whose derivative is of the form
$$f'(d) = p_0\,(d - p_1)(d - p_2 + i p_3)(d - p_2 - i p_3). \tag{6}$$
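The behavior of a derivative of this form can be illustrated numerically. The sketch below assumes the derivative is a real scale factor times (d - p1) times a complex-conjugate pair of roots at p2 ± i·p3, so that the pair multiplies out to the real, positive factor (d - p2)² + p3². The name `p0` for the scale factor and all parameter values are assumptions made for illustration, not fitted values from this paper.

```python
def slope(d, p0, p1, p2, p3):
    """Derivative f'(d): the conjugate pair collapses to a real quadratic."""
    return p0 * (d - p1) * ((d - p2) ** 2 + p3 ** 2)


def potential(d, p0, p1, p2, p3):
    """Quartic f(d) obtained by integrating slope(); constant set to zero."""
    q = p2 ** 2 + p3 ** 2
    return p0 * (d ** 4 / 4
                 - (p1 + 2 * p2) * d ** 3 / 3
                 + (q + 2 * p1 * p2) * d ** 2 / 2
                 - p1 * q * d)


# Illustrative parameters only: minimum at 5.5, complex roots at 7.0 +/- 0.5i.
params = dict(p0=1.0, p1=5.5, p2=7.0, p3=0.5)

grid = [i * 0.01 for i in range(1001)]            # d from 0 to 10
values = [potential(d, **params) for d in grid]
d_min = grid[values.index(min(values))]

print("grid minimum at d =", round(d_min, 2))
print("slope 0.5 below minimum:", slope(params["p1"] - 0.5, **params))
print("slope 0.5 above minimum:", slope(params["p1"] + 0.5, **params))
```

Since the quadratic factor never changes sign when p3 is nonzero, the slope changes sign only at p1, giving a single minimum there for a positive scale factor. Placing p2 above p1 makes the well steeper on the short-distance side, the asymmetry described in the text.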
The coefficients of f and f' are real, but there is only a single minimum of f, located at $p_1$. The condition that the slope for sm...