

Int. J. Peptide Protein Res. 25, 1985, 487-509

Sidechain and backbone potential function for conformational analysis of proteins

G.M. CRIPPEN and V.N. VISWANADHAN

Department of Chemistry, Texas A & M University, College Station, Texas, USA

Received 17 April, accepted for publication 18 October 1984

An improved potential function has been devised for the calculation of protein conformations. Each amino acid residue is represented by two points. The mainchain is traced by the sequence of Cα atoms, and the details of sidechain structure and interactions are represented by a representative sidechain atom. This potential function has been developed from a data base of 22 high resolution protein crystal structures and includes the components of an earlier potential developed from a similar data base where each amino acid residue is represented by only its Cα atom. In virtually all aspects of testing, the present potential betters the previous single-point potential, and is shown to be useful in the simulation of protein folding.

Key words: conformational analysis; conformational energy; protein folding; protein structure; secondary structure

In our earlier work (1) we stated the requirements of an energy function for protein conformational calculations by energy embedding (2, 3): (i) the molecule must be represented as a collection of points in space, where each point may represent a single atom or groups of atoms; (ii) the energy is calculated as a sum of pairwise interaction terms between all pairs of points; (iii) each term depends isotropically on the distance between the points and the identity of points; (iv) each term's functional dependence on distance may have only a single minimum (unimodality); and (v) the identity of the interacting points in the third requirement can be specified only by chemical type (atom type, residue type, etc.) and the relative sequence positions of the points.

Many energy-like functions have been proposed for protein conformational calculations which economize by not considering all atom-atom interactions. The very simple potential of Kuntz et al. (4) consists of only hydrophobic interactions and makes no attempt to enforce correct secondary structure. That of Oobatake & Crippen (5) also lacks secondary structure terms. The detailed potentials of Momany et al. (6) (and to a lesser extent, that of Pincus & Scheraga (7)) require so much computer time per energy evaluation that we feel a coarser representation of the polypeptide chain is worth exploring at length. The most promising of these (Levitt (8) and subsequent modifications by McCammon et al. (9)) unfortunately have an explicit virtual bond torsional term, thus violating our third criterion. Only our previous potential satisfies all our requirements, and it sacrificed physical realism for computational speed by representing each amino acid residue by only a single point, centered at the Cα. In this paper we ask whether adding an extra sidechain point to each residue will improve the performance of the potential enough to compensate for the increased computer time. In the Methods section, we describe how we constructed such a potential function meeting all of the above criteria. Then in the Results section, we show how well it performs on a variety of tests compared to our earlier potential.

METHODS

Representation of protein structure. Here we have used a two-point-per-residue representation where the mainchain is represented by the sequence of α-carbon atoms, and the sidechain positions are indicated by the coordinates of representative sidechain atoms for the respective amino acid residues. These representative atoms, listed in Table 1, are approximately centrally located in the sidechain. However, for long sidechains the representative sidechain atom is taken to be close to the functional group that is likely to be involved in interactions. Also listed in Table 1 are the radii (average observed sidechain to mainchain distances in the present set of structures) for each residue type.

Protein data. We selected a set of 22 protein crystal structures from the more than 100 available coordinate sets (10) based on the following criteria: (i) the resolution of the electron density map be less than or equal to 2.8 Å and (ii) not more than one representative structure be taken from a class of closely related structures. The names of the proteins and the corresponding four-character protein data bank identifier codes are given in Table 2.

Nomenclature of interactions. For the ith amino acid residue along the chain, the sidechain position is indicated by $S_i$ and the mainchain position by $M_i$. In our previous work (1) we considered all possible mainchain-mainchain interactions (MM). Now we retain these same terms and add extra terms to represent the interactions involving the additional point per residue: sidechain-sidechain interactions (SS) and sidechain-mainchain interactions (SM).

In particular, we consider the following interactions.

(a) Sidechain-mainchain interactions: $S_i$ to $M_i$ (SM1); $S_i$ to $M_{i-1}$ (SM2); $S_i$ to $M_{i+1}$ (SM3); $S_i$ to $M_{i-2}$ (SM4); and $S_i$ to $M_{i+2}$ (SM5). All these are taken to be functions of the sidechain type.

(b) Sidechain-sidechain interactions: $S_i$ to $S_{i+1}$ (SS1); $S_i$ to $S_{i+2}$ (SS2); $S_i$ to $S_{i+3}$ (SS3); and $S_i$ to $S_j$ (SS4). All these are taken to be functions of the two residue types involved. A similar nomenclature describes mainchain-mainchain (MM) interactions.
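The nomenclature above can be made concrete with a short enumeration. The following is only an illustrative sketch, not code from the paper: the function name and the 1-based indexing are hypothetical, and the rule that SS4 collects all pairs at least four residues apart is an assumption taken from the sequence-separation cutoff used later for gathering long-range statistics.

```python
def enumerate_interactions(n_residues, min_long_range_sep=4):
    """Map each interaction label to its list of (i, j) index pairs (1-based).

    For SM labels the pair is (sidechain residue i, mainchain residue j);
    for SS labels both indices refer to sidechain points.
    """
    sm_offsets = {"SM1": 0, "SM2": -1, "SM3": +1, "SM4": -2, "SM5": +2}
    ss_offsets = {"SS1": 1, "SS2": 2, "SS3": 3}
    pairs = {label: [] for label in (*sm_offsets, *ss_offsets, "SS4")}
    for i in range(1, n_residues + 1):
        for label, offset in sm_offsets.items():
            j = i + offset
            if 1 <= j <= n_residues:  # skip partners that fall off the chain ends
                pairs[label].append((i, j))
        for label, offset in ss_offsets.items():
            if i + offset <= n_residues:
                pairs[label].append((i, i + offset))
        # Assumption: SS4 holds every sidechain pair at least four residues apart.
        for j in range(i + min_long_range_sep, n_residues + 1):
            pairs["SS4"].append((i, j))
    return pairs


if __name__ == "__main__":
    for label, index_pairs in enumerate_interactions(6).items():
        print(label, len(index_pairs))
```

For a six-residue chain this yields 6 SM1, 5 SM2, 5 SM3, 4 SM4, 4 SM5, 5 SS1, 4 SS2, 3 SS3 and 3 SS4 pairs. Glycine, which has no sidechain point, would have to be filtered out separately.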

TABLE 1. Representative sidechain atoms taken for the 20 residue types, along with their average Cα to sidechain distance (radius)

| Residue | Sidechain atom | Radius (Å) | Residue | Sidechain atom | Radius (Å) |
|---|---|---|---|---|---|
| Gly | none | - | His | Cγ | 2.55 |
| Ala | Cβ | 1.55 | Trp | Cε2 | 4.64 |
| Val | Cβ | 1.55 | Ser | Cβ | 1.55 |
| Leu | Cγ | 2.61 | Thr | Cβ | 1.53 |
| Ile | Cγ1 | 2.55 | Lys | Cδ | 3.83 |
| Cys | Sγ | 2.65 | Arg | Nε | 4.64 |
| Met | Sδ | 4.02 | Asp | Cγ | 2.55 |
| Phe | Cγ | 2.55 | Asn | Cγ | 2.55 |
| Pro | Cγ | 2.45 | Glu | Cδ | 4.21 |
| Tyr | Cζ | 5.13 | Gln | Cδ | 3.69 |

TABLE 2. List of proteins used in the study

| Code | Protein | Chain length | Code | Protein | Chain length |
|---|---|---|---|---|---|
| 1APR | Acid proteinase | 324 | 1LYZ | Lysozyme | 129 |
| 2ADK | Adenylate kinase | 194 | 3MBN | Myoglobin | 153 |
| 4ADH | Alcohol dehydrogenase | 374 | 1SN3 | Scorpion neurotoxin | 65 |
| 1ABP | L-Arabinose-binding protein | 306 | 8PAP | Papain | 212 |
| 3CPV | Ca-binding parvalbumin | 108 | 3PGK | Phosphoglycerate kinase | 415 |
| 5CPA | Carboxypeptidase A | 307 | 1PCY | Plastocyanin | 99 |
| 1CTX | α-Cobratoxin | 71 | 1RHD | Rhodanese | 293 |
| 2CNA | Concanavalin A | 237 | 3RXN | Rubredoxin | 52 |
| 4CYT | Cytochrome c | 103 | 2SNS | Staphylococcal nuclease | 141 |
| 2FD1 | Bacterial ferredoxin | 106 | 3TLN | Thermolysin | 316 |
| 4FXN | Flavodoxin | 138 | 4PTI | Trypsin inhibitor | 58 |

Estimation of empirical energy. Let a superscript M denote mainchain-mainchain interactions, while a superscript S denotes sidechain-sidechain interactions. Sidechain-mainchain interactions are indicated by the superscript

SM. In our previous work with the one-point-per-residue representation, the final form of the fitted potential consisted of the following terms:

$$E^{M}_{tot} = E^{M}_{s} + E^{M}_{l} \qquad (1)$$

where $E^{M}_{tot}$ is the total energy of the conformation, $E^{M}_{s}$ is the contribution of short-range interactions, and $E^{M}_{l}$ is the contribution of long-range interactions. These respective contributions are given by

$$E^{M}_{s} = E^{MM1} + E^{MM2} + E^{MM3} = \sum_{i=1}^{N-1} E^{M}_{i,i+1} + \sum_{i=1}^{N-2} E^{M}_{i,i+2} + \sum_{i=1}^{N-3} E^{M}_{i,i+3} \qquad (2)$$

and

$$E^{M}_{l} = E^{MM4} = \sum_{i=1}^{N-4} \sum_{j=i+4}^{N} E^{M}_{i,j} \qquad (3)$$

where N is the number of residues in the protein. In the present work, we retain the same terms for mainchain-mainchain interactions. The functional forms for these interactions are given in our earlier work and are not reiterated here. What we now take to be the total energy is

$$E_{tot} = E^{M}_{tot} + E^{SM}_{tot} + E^{SS}_{tot} \qquad (4)$$

In the following we explain how we obtained $E^{SM}_{tot}$ and $E^{SS}_{tot}$ from the protein crystal structures.

SM interactions: SM1 interactions determine the distances between the sidechain atom and the corresponding mainchain Cα. Since we have given Gly no sidechain atom, it has no SM1 interaction. Otherwise, depending on the residue type, the observed SM1 histograms in the protein crystal structures had either symmetric or asymmetric distributions. In the cases of residue types where the representative sidechain atom is taken close to the corresponding mainchain atom, the distributions generally are symmetric and very narrow, while in the cases of residue types where the representative sidechain atom lies farther from the corresponding Cα atom, the distributions are asymmetric (compare for example the nature of SM1 distributions for Phe and Tyr, which are physico-chemically close). Obviously, a greater range of distances is possible between the sidechain atom and the mainchain atom when the number of intervening rotatable bonds is larger. We fitted the asymmetric distributions to polynomials and the symmetric ones to Gaussians. The general form of the simplest polynomial having only one minimum and asymmetric about the minimum is given by

$$\cdots + 2(p_2^2 + p_3^2 + 2p_2p_1)\,d^2 - 4p_1(p_2^2 + p_3^2)\,d \qquad (5)$$
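The two terms of eqn. 5 that survive in this transcript are consistent with a quartic whose derivative vanishes only at $d = p_1$. The derivation below is offered as a plausible reading, not as the authors' stated construction: the overall scale $k$ and additive constant $c$ are placeholders introduced here, and no claim is made about how $p_4$ and $p_5$ enter the published expression.

```latex
\begin{align*}
E'(d) &= k\,(d - p_1)\left[(d - p_2)^2 + p_3^2\right] \\
      &= k\left[d^3 - (p_1 + 2p_2)\,d^2 + (p_2^2 + p_3^2 + 2p_2p_1)\,d - p_1(p_2^2 + p_3^2)\right], \\
E(d)  &= \frac{k}{4}\left[d^4 - \tfrac{4}{3}(p_1 + 2p_2)\,d^3
         + 2(p_2^2 + p_3^2 + 2p_2p_1)\,d^2 - 4p_1(p_2^2 + p_3^2)\,d\right] + c .
\end{align*}
```

For $k > 0$ and $p_3 \neq 0$ the bracket $(d - p_2)^2 + p_3^2$ is strictly positive, so $E'(d)$ changes sign only at $d = p_1$ and the curve has a single minimum there. It is symmetric about that minimum only in the special case $p_2 = p_1$.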

The empirical potential for SM1 interactions whose distributions are asymmetric is given by eqn. 5, and the parameters $p_1, \ldots, p_5$ were determined by fitting the calculated SM1 distributions for each residue type to the observed distributions. The deviation, $\delta$, between the observed, normalized seven-cell histogram, $h(d_i)$, $i = 1, \ldots, 7$, and the calculated histogram is

(6)

taking T = 298 K. The lower limit of the first cell always corresponds to zero and the lower limit of the last cell corresponds to the upper limit of the sixth cell. The upper limit of the last cell has been chosen to accommodate all observations beyond the sixth cell. The lower limit of the second cell and the upper limit of the sixth cell are given in Table 3. The five cells starting from the second to the sixth cell correspond to five equal distance ranges. As can be seen from Table 3, the fits are acceptable. The functional form used for asymmetric SM1 interactions is the same as that used to represent $E^{MM2}$ interactions.

The symmetric distributions (for Ala, Val, Ile, Cys, Phe, Pro, His, Ser, Asp, and Asn) are fitted to Gaussian potentials, so the energy contribution of such residue types is given by

$$E^{SM1}_{i}(d) = \frac{K_b}{2}\,(d_i - d_{i,0})^2 \qquad (7)$$

where $d_{i,0}$ is the optimal separation distance between the representative sidechain atom and the corresponding Cα of residue i (see the "radius" column in Table 1), and $d_i$ is the observed distance between the two atoms of residue i. A uniform value of 40 kcal/Å² was chosen for the force constant $K_b$ for all interactions represented by the Gaussian functional form. Though better fits would be obtained with higher values of the force constant, the relatively small value of 40 kcal/Å² was chosen in order to prevent the formation of very narrow channels in the energy surface.
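Equation 7 is simple enough to evaluate directly. The sketch below is illustrative only: the function and dictionary names are hypothetical, the optimal distances are the Table 1 radii of the ten residue types named above as having symmetric SM1 distributions, and the force constant is the 40 kcal/Å² quoted in the text.

```python
K_B = 40.0  # force constant in kcal/Å^2, the uniform value quoted in the text

# Optimal Cα-to-sidechain separations d_{i,0} in Å: the Table 1 radii of the
# residue types whose SM1 distributions are described as symmetric.
SYMMETRIC_RADII = {
    "Ala": 1.55, "Val": 1.55, "Ile": 2.55, "Cys": 2.65, "Phe": 2.55,
    "Pro": 2.45, "His": 2.55, "Ser": 1.55, "Asp": 2.55, "Asn": 2.55,
}


def sm1_harmonic_energy(residue, distance):
    """Evaluate eqn. 7, (K_b / 2) * (d - d0)^2, for a symmetric residue type."""
    d0 = SYMMETRIC_RADII[residue]
    return 0.5 * K_B * (distance - d0) ** 2


if __name__ == "__main__":
    # An Ala sidechain point stretched 0.1 Å beyond its optimal distance.
    print(sm1_harmonic_energy("Ala", 1.65))
```

A 0.1 Å displacement costs 0.2 kcal, which illustrates how soft the chosen force constant is. Residue types fitted to the polynomial form are deliberately absent from the lookup and would raise a `KeyError`.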

TABLE 3. Parameters for $S_i$-$M_i$ (SM1) distributions of the residue types fitted to polynomial curves, and the lower ($l_2$) and upper ($u_6$) bounds of the second and sixth cells of the seven-cell histograms used in deriving the parameters

| Residue | p1 | p2 | p3 | p4 | p5 | Deviation(a) | l2 (Å) | u6 (Å) |
|---|---|---|---|---|---|---|---|---|
| Leu | 2.61 | 4.86 | -3.41 | 1.35 | 1500 | 0.0541 | 2.3 | 2.8 |
| Met | 4.02 | 7.41 | -0.80 | 0.62 | 700 | 0.0235 | 3.0 | 4.5 |
| Tyr | 5.13 | 1.87 | -3.73 | 1.55 | 1200 | 0.00719 | 4.9 | 5.4 |
| Trp | 4.64 | 7.48 | -0.99 | 0.75 | 1170 | 0.0182 | 4.2 | 5.2 |
| Thr | 1.53 | 6.22 | -1.31 | 1.35 | 200 | 0.0214 | 1.35 | 1.85 |
| Lys | 3.83 | 7.50 | -0.98 | 0.76 | 870 | 0.0374 | 3.3 | 4.3 |
| Arg | 4.64 | 3.00 | -2.22 | 0.12 | 100 | 0.00539 | 4.0 | 5.5 |
| Glu | 4.21 | 7.84 | -1.16 | 1.45 | 2170 | 0.0891 | 3.9 | 4.4 |
| Gln | 3.69 | 7.19 | -0.68 | 0.17 | 170 | 0.048 | 3.0 | 4.5 |

(a) Deviation between observed and calculated distributions (eqn. 6).

TABLE 4. Parameters for the interaction potential (eqn. 5) accounting for $S_i$-$M_{i-2}$ (SM4) interactions

| Residue type | p1 | p2 | p3 | p4 | p5 | devn.(a) |
|---|---|---|---|---|---|---|
| Ala | 5.80 | 8.23 | -0.01 | 0.09 | 200.0 | 0.00276 |
| Val | 6.24 | 7.41 | -0.93 | 0.10 | 300.0 | 0.00171 |
| Leu | 6.87 | 6.86 | -0.29 | 0.17 | 300.0 | 0.00907 |
| Ile | 6.36 | 7.44 | -0.99 | 0.05 | 100.0 | 0.00315 |
| Cys | 5.69 | 7.00 | 0.00 | 0.43 | 700.0 | 0.00949 |
| Met | 8.27 | 6.42 | -1.12 | 0.04 | 100.0 | 0.00435 |
| Phe | 6.66 | 8.03 | -0.98 | 0.06 | 200.0 | 0.00563 |
| Pro | 7.64 | 5.92 | 0.00 | 0.20 | 300.0 | 0.00694 |
| Tyr | 7.92 | 5.81 | -1.42 | 0.01 | 100.0 | 0.02040 |
| His | 7.26 | 5.75 | -1.43 | 0.06 | 100.0 | 0.00826 |
| Trp | 9.22 | 6.12 | -1.73 | 0.01 | 100.0 | 0.02790 |
| Ser | 5.80 | 7.98 | -0.27 | 0.07 | 200.0 | 0.00233 |
| Thr | 5.87 | 8.06 | -0.32 | 0.08 | 200.0 | 0.00386 |
| Lys | 7.64 | 7.64 | -1.18 | 0.04 | 100.0 | 0.00743 |
| Arg | 9.49 | 7.46 | 0.00 | 0.02 | 100.0 | 0.00379 |
| Asp | 7.06 | 7.49 | -1.21 | 0.03 | 100.0 | 0.00370 |
| Asn | 7.63 | 6.15 | -1.34 | 0.03 | 100.0 | 0.01210 |
| Glu | 7.28 | 9.63 | -1.27 | 0.03 | 200.0 | 0.01070 |
| Gln | 7.11 | 9.38 | -0.72 | 0.05 | 300.0 | 0.00195 |

(a) Deviation between calculated and observed distributions (eqn. 6).

TABLE 5. Parameters (eqn. 5) for the interaction potential accounting for $S_i$-$M_{i+2}$ (SM5) interactions

| Residue type | p1 | p2 | p3 | p4 | p5 | devn.(a) |
|---|---|---|---|---|---|---|
| Ala | 6.60 | 7.34 | -0.71 | 0.62 | 1600.0 | 0.00409 |
| Val | 6.51 | 7.58 | -0.42 | 0.69 | 1800.0 | 0.00176 |
| Leu | 8.02 | 5.27 | -0.81 | 0.05 | 100.0 | 0.00443 |
| Ile | 8.10 | 5.12 | -0.45 | 0.05 | 100.0 | 0.00435 |
| Cys | 6.47 | 3.92 | -1.65 | 0.05 | 100.0 | 0.02720 |
| Met | 9.10 | 5.35 | 0.00 | 0.02 | 0.0 | 0.01970 |
| Phe | 7.79 | 5.41 | -1.18 | 0.04 | 100.0 | 0.00739 |
| Pro | 7.65 | 6.20 | -1.75 | 0.04 | 100.0 | 0.00509 |
| Tyr | 8.02 | 5.19 | -1.32 | 0.01 | 100.0 | 0.03490 |
| His | 7.83 | 4.32 | 0.00 | 0.03 | 100.0 | 0.00947 |
| Trp | 9.03 | 8.20 | -1.01 | 0.01 | 100.0 | 0.01140 |
| Ser | 6.44 | 8.66 | -1.25 | 0.11 | 400.0 | 0.03420 |
| Thr | 6.42 | 8.46 | -0.76 | 0.17 | 600.0 | 0.01160 |
| Lys | 9.03 | 5.72 | -0.61 | 0.03 | 0.0 | 0.00840 |
| Arg | 9.70 | 6.44 | -0.53 | 0.03 | 100.0 | 0.00735 |
| Asp | 8.00 | 4.52 | 0.00 | 0.02 | 100.0 | 0.01970 |
| Asn | 7.94 | 4.40 | 0.00 | 0.02 | 100.0 | 0.02030 |
| Glu | 8.94 | 5.65 | -0.44 | 0.03 | 0.0 | 0.00962 |
| Gln | 8.99 | 5.96 | -0.64 | 0.03 | 0.0 | 0.01630 |

(a) Deviation between calculated and observed distributions (eqn. 6).

The observed SM2 and SM3 distributions for each residue type differ significantly at the 95% confidence level by the χ² test. Similarly, the SM4 and SM5 distributions were found to differ even more than SM2 and SM3. This is the first evidence we have found in a survey of protein crystal structures for interactions which are not symmetric with respect to polypeptide chain direction. Since three distances would suffice to fix the sidechain position by virtue of SM interactions, we have taken the SM1, SM4, and SM5 interactions (neglecting SM2 and SM3 and all the higher order SM interactions). Including terms for more interactions would probably improve the accuracy and reliability of the potential, but at the cost of increased computer time for each energy evaluation. Furthermore,

a fourth and a fifth interaction affecting the sidechain position would be redundant, making it difficult to apportion the energy among the several terms. The energy terms corresponding to the SM4 and SM5 interactions were fitted to the polynomial form given in eqn. 5. In Tables 4 and 5, we list the parameters obtained for SM4 and SM5 interactions, along with the respective deviations of the calculated and observed distributions (eqn. 6). Clearly Gly is not involved. In Table 6, we present the lower bounds of the second cells and the upper bounds of the sixth cells. Much as for SM1 distributions, the lower bound of the first cell was 0 Å and the upper bound of the last cell was taken to accommodate most of the observations beyond the sixth cell. The expression for the contribution of SM interactions to the total energy is given by

$$E^{SM}_{tot} = \sum_{i=1}^{N} E^{SM1}_{i} + \sum_{i=3}^{N} E^{SM4}_{i} + \sum_{i=1}^{N-2} E^{SM5}_{i} \qquad (8)$$

TABLE 6. Lower bounds of the second cells ($l_2$) and upper bounds of the sixth cells ($u_6$) for the seven-cell histograms used in deriving the parameters for SM4 and SM5 interactions

| Residue | SM4 l2 (Å) | SM4 u6 (Å) | SM5 l2 (Å) | SM5 u6 (Å) |
|---|---|---|---|---|
| Ala | 4.5 | 7.5 | 4.5 | 7.5 |
| Val | 4.5 | 7.5 | 4.5 | 7.5 |
| Leu | 4.5 | 7.5 | 7.0 | 9.5 |
| Ile | 6.0 | 9.0 | 7.0 | 9.5 |
| Cys | 4.5 | 7.5 | 4.5 | 7.5 |
| Met | 6.0 | 8.5 | 7.0 | 9.5 |
| Phe | 6.0 | 8.5 | 7.0 | 9.5 |
| Pro | 5.0 | 8.0 | 6.0 | 8.5 |
| Tyr | 6.0 | 9.0 | 8.0 | 11.0 |
| His | 6.0 | 9.0 | 8.0 | 11.0 |
| Trp | 7.0 | 9.5 | 7.5 | 10.0 |
| Ser | 4.5 | 7.5 | 6.0 | 8.5 |
| Thr | 4.5 | 7.5 | 6.0 | 8.5 |
| Lys | 7.0 | 10.0 | 7.0 | 9.5 |
| Arg | 7.0 | 10.0 | 7.0 | 10.0 |
| Asp | 6.0 | 8.5 | 6.0 | 9.0 |
| Asn | 6.0 | 8.5 | 6.0 | 9.0 |
| Glu | 6.0 | 8.5 | 6.0 | 9.5 |
| Gln | 7.0 | 10.0 | 7.0 | 10.0 |

TABLE 7. χ² probabilities (P) that the observed SS1, SS2 and SS3 distance histograms differ from their counterparts when the proteins' amino acid sequences are reversed

| Class pair | P(SS1) | No. obsd. | P(SS2) | No. obsd. | P(SS3) | No. obsd. |
|---|---|---|---|---|---|---|
| (1, 1) | 0.0000 | 610 | 0.0000 | 601 | 0.0000 | 637 |
| (2, 1) | 0.7394 | 426 | 0.5042 | 463 | 0.9857 | 405 |
| (3, 1) | 0.9829 | 466 | 1.0000 | 434 | 1.0000 | 452 |
| (4, 1) | 0.9069 | 54 | 0.5616 | 51 | 0.9970 | 47 |
| (1, 2) | 0.7138 | 426 | 0.7666 | 447 | 0.9768 | 411 |
| (2, 2) | 0.0000 | 321 | 0.0000 | 312 | 0.0000 | 320 |
| (3, 2) | 0.9887 | 397 | 0.5134 | 364 | 1.0000 | 397 |
| (4, 2) | 0.1533 | 35 | 0.1853 | 49 | 0.2643 | 38 |
| (1, 3) | 0.8627 | 464 | 1.0000 | 443 | 1.0000 | 438 |
| (2, 3) | 0.9985 | 394 | 0.9762 | 366 | 1.0000 | 408 |
| (3, 3) | 0.0000 | 392 | 0.0000 | 449 | 0.0000 | 394 |
| (4, 3) | 0.8641 | 51 | 1.0000 | 37 | 1.0000 | 47 |
| (1, 4) | 0.6906 | 51 | 0.7345 | 57 | 0.9972 | 53 |
| (2, 4) | 0.1166 | 42 | 0.1667 | 36 | 0.0736 | 37 |
| (3, 4) | 0.7655 | 41 | 1.0000 | 44 | 1.0000 | 42 |
| (4, 4) | 0.0 | 3 | 0.0 | 4 | 0.0 | 9 |

SS interactions: In considering SS1, SS2, and SS3 as functions of residue type pairs i & i + 1, i & i + 2, and i & i + 3 respectively, we found the total number of observations for each residue type pair over the seven-cell histograms is less than 50 in most cases and less than 10 in many cases. Hence, the parametric fits obtained for such distributions would be subject to sample size limitations when considered as functions of individual residue type pairs. As a practical alternative, we classified residue types into four classes based on secondary structural preferences of the residue types: the first, helix favoring (Ala, Leu, Met, His, Gln, Glu, Lys, Cys); the second, sheet favoring (Val, Ile, Phe, Trp, Tyr, Thr); the third, turn favoring (Ser, Gly, Asp, Asn, Pro); and the fourth, the only amino acid that is neutral with respect to secondary structural preferences (Arg) (11). Furthermore, it was necessary to take account of the chain directionality (i.e. which of the two interacting residues has the lower sequence number). The reason for this is apparent from Table 7. In most cases of SS1, SS2 and SS3 distributions (with the obvious exclusion of residue class pairs which contain both residue types from the same residue class), the χ² probability that the histograms taken

normally differ from the histograms taken with reversed protein sequencing exceeds 0.70. By using the same functional form defined in eqn. 5, we have fitted the observed distribution of distances for SS1, SS2, and SS3 interactions as functions of residue class pair (1, 1), (2, 1), ..., (4, 4), where each pair of numbers within the parentheses refers to a pair of residue types each belonging to one of the four classes defined above. The parameters obtained for SS1, SS2, and SS3 distributions are given in Tables 8, 9 and 10 respectively, along with the deviations between observed and calculated histograms for the seven distance ranges. For the SS1 distribution, the lower limit of the second cell was 4 Å, and the upper limit of the sixth cell was 9 Å. For SS2 and SS3 distributions, the lower limit of the second cell was 4 Å, and the upper limit of the sixth cell was 14 Å. The lower limit of the first distance range was 0 Å, and the upper limit of the last cell was chosen to accommodate the maximum number of observations beyond the sixth range, much as for SM interactions.

TABLE 8. Parameters (eqn. 5) for $S_i$-$S_{i+1}$ (SS1) interactions as a function of residue classes at i and i + 1

| Class pair | p1 | p2 | p3 | p4 | p5 | devn.(a) |
|---|---|---|---|---|---|---|
| (1, 1) | 6.75 | 7.61 | -0.020 | 0.030 | 100.0 | 0.00357 |
| (2, 1) | 5.97 | 8.37 | -0.95 | 0.02 | 100.0 | 0.00587 |
| (3, 1) | 6.68 | 5.88 | -1.25 | 0.020 | 0.0 | 0.00682 |
| (4, 1) | 8.65 | 6.58 | -1.65 | 0.01 | 100.0 | 0.01750 |
| (1, 2) | 6.17 | 9.28 | -1.55 | 0.01 | 100.0 | 0.0270 |
| (2, 2) | 5.94 | 10.10 | -0.43 | 0.01 | 100.0 | 0.0449 |
| (3, 2) | 5.41 | 8.6 | -1.82 | 0.01 | 100.0 | 0.0358 |
| (4, 2) | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| (1, 3) | 6.50 | 5.37 | -1.68 | 0.02 | 100.0 | 0.00962 |
| (2, 3) | 5.11 | 7.62 | -0.55 | 0.03 | 0.0 | 0.00939 |
| (3, 3) | 4.93 | 6.20 | -1.39 | 0.03 | 0.0 | 0.00337 |
| (4, 3) | 7.81 | 9.23 | -2.58 | 0.02 | 100.0 | 0.00833 |
| (1, 4) | 8.64 | 8.20 | -1.87 | 0.01 | 100.0 | 0.0157 |
| (2, 4) | 7.31 | 9.18 | -1.48 | 0.01 | 100.0 | 0.0135 |
| (3, 4) | 7.16 | 10.01 | -0.86 | 0.03 | 200.0 | 0.00767 |
| (4, 4) | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |

(a) Deviation between calculated and observed distributions (eqn. 6).

TABLE 9. Parameters (eqn. 5) for $S_i$-$S_{i+2}$ (SS2) interactions as a function of residue classes at i and i + 2

| Class pair | p1 | p2 | p3 | p4 | p5 | devn.(a) |
|---|---|---|---|---|---|---|
| (1, 1) | 9.95 | 6.29 | -0.88 | 0.01 | 0.0 | 0.0252 |
| (2, 1) | 9.53 | 6.66 | -0.85 | 0.01 | 0.0 | 0.02390 |
| (3, 1) | 8.30 | 7.49 | -0.04 | 0.01 | 0.0 | 0.00966 |
| (4, 1) | 11.23 | 7.29 | 0.0 | 0.01 | 0.0 | 0.0730 |
| (1, 2) | 9.55 | 6.80 | -0.78 | 0.01 | 0.0 | 0.0219 |
| (2, 2) | 9.01 | 7.34 | 0.0 | 0.01 | 0.0 | 0.0330 |
| (3, 2) | 7.43 | 8.29 | -1.48 | 0.01 | 100.0 | 0.0131 |
| (4, 2) | 11.19 | 7.71 | -0.72 | 0.01 | 0.0 | 0.0268 |
| (1, 3) | 8.99 | 6.59 | -1.48 | 0.01 | 0.0 | 0.0143 |
| (2, 3) | 7.36 | 8.19 | -1.38 | 0.01 | 0.0 | 0.0143 |
| (3, 3) | 6.60 | 7.44 | -1.76 | 0.01 | 0.0 | 0.0178 |
| (4, 3) | 8.80 | 9.99 | -0.07 | 0.06 | 500.0 | 0.0164 |
| (1, 4) | 11.57 | 9.05 | -0.46 | 0.01 | 100.0 | 0.0116 |
| (2, 4) | 10.61 | 7.04 | 0.0 | 0.01 | 100.0 | 0.0457 |
| (3, 4) | 8.99 | 7.22 | 0.0 | 0.01 | 0.0 | 0.0233 |
| (4, 4) | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |

(a) Deviation between calculated and observed distributions (eqn. 6).

In the few cases where data are too sparse, a flat distribution has been

assumed, which amounts to defining all five parameters to be zero. In three cases (SS3 of (4, 3), (1, 4), and (3, 4) residue class pairs) the distributions were bimodal and could not be represented by asymmetric unimodal polynomial distributions. A flat distribution has been assumed in these cases also. With the above formulation for the short-range part of inter-sidechain interactions, we may write the corresponding contribution to total energy as

$$E^{SS}_{s} = \sum_{i=1}^{N-1} E^{SS1}_{i,i+1} + \sum_{i=1}^{N-2} E^{SS2}_{i,i+2} + \sum_{i=1}^{N-3} E^{SS3}_{i,i+3} \qquad (9)$$

TABLE 10. Parameters for $S_i$-$S_{i+3}$ (SS3) interactions as a function of residue classes at i and i + 3

| Class pair | p1 | p2 | p3 | p4 | p5 | devn.(a) |
|---|---|---|---|---|---|---|
| (1, 1) | 6.47 | 9.78 | 0.0 | 0.01 | 100.0 | 0.07760 |
| (2, 1) | 6.49 | 9.67 | 0.0 | 0.01 | 100.0 | 0.09560 |
| (3, 1) | 11.54 | 9.04 | -0.45 | 0.0 | 0.0 | 0.09730 |
| (4, 1) | 11.66 | 9.35 | -0.02 | 0.01 | 200.0 | 0.0435 |
| (1, 2) | 11.59 | 8.58 | 0.0 | 0.01 | 100.0 | 0.0816 |
| (2, 2) | 10.96 | 7.91 | 0.0 | 0.01 | 100.0 | 0.0605 |
| (3, 2) | 10.30 | 6.79 | 0.0 | 0.01 | 100.0 | 0.0684 |
| (4, 2) | 11.63 | 8.94 | 0.0 | 0.01 | 100.0 | 0.0712 |
| (1, 3) | 11.37 | 8.38 | 0.0 | 0.01 | 100.0 | 0.0434 |
| (2, 3) | 11.15 | 8.11 | -0.01 | 0.01 | 100.0 | 0.0425 |
| (3, 3) | 10.10 | 6.88 | 0.0 | 0.01 | 0.0 | 0.0419 |
| (4, 3) | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| (1, 4) | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| (2, 4) | 11.81 | 8.56 | -0.01 | 0.01 | 100.0 | 0.0632 |
| (3, 4) | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| (4, 4) | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |

(a) Deviation between calculated and observed distributions (eqn. 6).

TABLE 11. Type specificity of long-range interactions ($D_2$, eqn. 14) of the packing shells for MM4 and SS4 distributions

| Packing shell (Å) | D2 (nats) for MM4 | D2 (nats) for SS4 |
|---|---|---|
| 0-6 | 0.077 | 0.063 |
| 6-8 | 0.032 | 0.051 |
| 8-10 | 0.009 | 0.035 |
| 10-12 | 0.010 | 0.029 |
| 12-14 | 0.009 | 0.014 |
| 14-16 | 0.010 | 0.020 |
| > 16 | 0.022 | 0.021 |
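The four residue classes used for the SS1 to SS3 fits amount to a lookup from residue type to class number. The sketch below is illustrative: the names are hypothetical, and the convention that the first number in a class pair belongs to the residue with the lower sequence number is an assumption, since the text does not spell out the order.

```python
CLASS_MEMBERS = {
    1: ("Ala", "Leu", "Met", "His", "Gln", "Glu", "Lys", "Cys"),  # helix favoring
    2: ("Val", "Ile", "Phe", "Trp", "Tyr", "Thr"),                # sheet favoring
    3: ("Ser", "Gly", "Asp", "Asn", "Pro"),                       # turn favoring
    4: ("Arg",),                                                  # neutral
}

# Invert the membership table into residue -> class number.
RESIDUE_CLASS = {
    residue: cls for cls, members in CLASS_MEMBERS.items() for residue in members
}


def class_pair(residue_i, residue_k):
    """Ordered class pair; assumed order is (class at i, class at i + k)."""
    return (RESIDUE_CLASS[residue_i], RESIDUE_CLASS[residue_k])


def short_range_class_pairs(sequence):
    """Class pairs for the SS1, SS2 and SS3 terms along a residue sequence."""
    result = {"SS1": [], "SS2": [], "SS3": []}
    for separation, label in ((1, "SS1"), (2, "SS2"), (3, "SS3")):
        for i in range(len(sequence) - separation):
            result[label].append(class_pair(sequence[i], sequence[i + separation]))
    return result


if __name__ == "__main__":
    print(short_range_class_pairs(["Ala", "Val", "Ser", "Arg"]))
```

Because the pair is ordered, (1, 2) and (2, 1) select different parameter rows, which is how chain directionality enters the short-range SS terms.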

Comparison of SS4 and MM4 distributions. In the case of SS1, SS2, and SS3 distributions, which reflect short-range interactions, we have incorporated the observed preferences between residue type classes into the potential function by directly deriving the parameters from the distributions as functions of residue types at the termini of the segments considered. In contrast, MM1 interactions are defined by a bond-stretching potential, MM2 interactions by the residue type at the middle location of the three-residue segment, and MM3 interactions by residue types at the two middle locations of the four-residue segment. Thus, the basis for the incorporation of short-range SS interactions in the potential function differs from that of short-range MM interactions. However, the long-range parameters for MM interactions define the spatial separation of the α-carbons for a pair of interacting residues, and this limits the range of distances possible for SS4 interactions. Therefore one needs to compare the SS4 (long-range) interactions with MM4 interactions in order to ascertain whether a separate treatment is required to incorporate the SS4 interactions in the potential function. In the following, we describe an information-theoretic procedure (12) to compare the relative strengths of SS4 and MM4 interactions.

Around each amino acid residue in each protein, a series of concentric nonoverlapping shells is constructed, each shell bounded by a specific distance range from the central residue. The distance ranges of these seven shells are given in the first column of Table 11. For a given packing shell around every occurrence of a central residue having type i, we count the number of residues in the shell of type j. In other words, we obtain the number of residue pairs for each pair type where the two residues are separated by a given range of distances. This is repeated over the different distance ranges. From these data, we obtain the conditional probabilities of pairwise association for each residue pair type. For gathering statistics, we include only those pairs with a sequence separation of four residues or more. We can evaluate the conditional probability, $p(j|i)$, of occurrence of residue type j in a specific distance range from the residue type i by

$$p(j|i) = \frac{\sum_{\text{proteins}} \#\ \text{of type } j \text{ residues in shell around type } i}{\sum_{\text{proteins}} \#\ \text{of central residues of type } i} \qquad (10)$$

In other words, the conditional probability $p(j|i)$ defines the extent of pairwise association of residue type j with residue type i in a spherical shell around residue type i, bounded by the limits of the pre-defined distance range. The independent (unconditional) residue type probabilities are given by

$$p(i) = \frac{\sum_{\text{proteins}} \#\ \text{of type } i \text{ residues}}{\sum_{\text{proteins}} \text{total } \#\ \text{of residues}} \qquad (11)$$

where the number of type i residues are those considered in eqn. 10. The Shannon entropy, $H_1$, of residue type composition corresponding to a distance range is

$$H_1 = -\sum_{i=1}^{20} p(i) \ln p(i) \qquad (12)$$

and the Markov entropy, $H_M$, of the residue pair type distribution corresponding to the same distance range is

$$H_M = -\sum_{i=1}^{20} \sum_{j=1}^{20} p(i)\, p(j|i) \ln p(j|i) \qquad (13)$$

The deviation from randomness (or the extent of interdependence between various residue types) of the residue pair type distribution is a function of Shannon entropy and Markov entropy given by

$$D_2 = H_1 - H_M \qquad (14)$$

Table 11 presents the values of $D_2$ for various distance ranges in the cases of SS4 interactions and MM4 interactions. Each value of $D_2$ corresponds to the deviation from randomness of the entire residue pair type distribution for a packing shell, the bounds of which are defined by the distance range given in the first column of Table 11. It may be seen from the table that SS4 interactions deviate from independence substantially more than MM4 interactions for all but very short and very long separations, suggesting that we need to consider SS4 interactions separately in the potential function by adding extra terms, but without duplicating the MM4 interactions we already include.
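Equations 12 to 14 can be evaluated directly once the probabilities are in hand. The sketch below is a minimal illustration under stated assumptions: the function names are hypothetical, the inputs are taken to be already-estimated probabilities (a list `p` of $p(i)$ and a nested list `p_cond` with `p_cond[i][j]` equal to $p(j|i)$), natural logarithms are used so that results are in nats as in Table 11, and zero-probability terms are skipped. The two-type example is made up for checking purposes and is not data from the paper.

```python
import math


def shannon_entropy(p):
    """H_1 of eqn. 12 in nats; terms with p(i) = 0 contribute nothing."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0.0)


def markov_entropy(p, p_cond):
    """H_M of eqn. 13 in nats, with p_cond[i][j] holding p(j|i)."""
    total = 0.0
    for pi, row in zip(p, p_cond):
        for pji in row:
            if pi > 0.0 and pji > 0.0:
                total -= pi * pji * math.log(pji)
    return total


def deviation_from_randomness(p, p_cond):
    """D_2 = H_1 - H_M of eqn. 14."""
    return shannon_entropy(p) - markov_entropy(p, p_cond)


if __name__ == "__main__":
    p = [0.5, 0.5]
    independent = [[0.5, 0.5], [0.5, 0.5]]  # neighbors ignore the central type
    segregated = [[1.0, 0.0], [0.0, 1.0]]   # each type only neighbors itself
    print(deviation_from_randomness(p, independent))
    print(deviation_from_randomness(p, segregated))
```

In the toy example the independent case gives $D_2 = 0$ and the fully segregated case gives $D_2 = \ln 2$, the two extremes between which the small values in Table 11 fall.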

Geometrical model for SS4 interactions. We used the following geometrical model to examine SS4 interactions. For a fixed inter-Cα distance between two residue types, the probability distribution of distances between the sidechain positions of the two residue types can be calculated, assuming that the sidechain positions are isotropically distributed around the corresponding mainchain (Cα) positions (see Appendix) at distances equal to the average mainchain to sidechain distance for the corresponding residue type. This probability distribution is obtained for different ranges of inter-Cα distances, and a cumulative distribution of inter-sidechain distances expected from the above hypothetical model of random sidechain association is obtained for each residue pair type. In the following we describe the procedure for obtaining the cumulative expected distribution of inter-sidechain distances.
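The isotropic model can be explored numerically. The paper obtains the distribution analytically in its Appendix, which is not reproduced here, so the Monte Carlo sketch below is a swapped-in stand-in rather than the authors' method. All names are hypothetical, and the radii in the example are the Table 1 values for Ala and Phe with an arbitrarily chosen inter-Cα distance.

```python
import math
import random


def random_unit_vector(rng):
    """Uniform direction on the unit sphere (uniform z and azimuth)."""
    z = rng.uniform(-1.0, 1.0)
    phi = rng.uniform(0.0, 2.0 * math.pi)
    s = math.sqrt(1.0 - z * z)
    return (s * math.cos(phi), s * math.sin(phi), z)


def sample_sidechain_distances(a, b, x, n_samples=10000, seed=0):
    """Sample inter-sidechain distances for radii a, b and Cα separation x (Å)."""
    rng = random.Random(seed)
    distances = []
    for _ in range(n_samples):
        ux, uy, uz = random_unit_vector(rng)
        vx, vy, vz = random_unit_vector(rng)
        # First Cα at the origin, second Cα at (x, 0, 0).
        dx = (x + b * vx) - a * ux
        dy = b * vy - a * uy
        dz = b * vz - a * uz
        distances.append(math.sqrt(dx * dx + dy * dy + dz * dz))
    return distances


if __name__ == "__main__":
    d = sample_sidechain_distances(1.55, 2.55, 6.0)
    print(min(d), sum(d) / len(d), max(d))
```

Every sampled distance lies between x - a - b and a + b + x when x exceeds a + b, which matches the maximum separation quoted in the text that follows.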

Let a and b denote the radii of two residues (Table 1) whose α-carbons are x Å apart. The maximum possible distance between the sidechain atoms is (a + b + x). This distance is divided into 10 equal sub-ranges, the limits of which are given by

$$r_n - \Delta r < r < r_n + \Delta r \qquad (15)$$

where n varies from 1 to 10. The subrange half-width is

$$\Delta r = 0.5\,[(a + b + x)/10] \qquad (16)$$

and the subrange midpoints are

$$r_n = 2n\,\Delta r - \Delta r \qquad (17)$$
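Eqns. 15 to 17 can be sketched as a small helper. The function names `subranges` and `subrange_index` and the example values of a, b and x are hypothetical; only the half-width and midpoint formulas come from the text.

```python
def subranges(a, b, x, n_sub=10):
    """Return the half-width (eqn. 16) and the midpoints r_n (eqn. 17)."""
    half_width = 0.5 * (a + b + x) / n_sub
    midpoints = [2 * n * half_width - half_width for n in range(1, n_sub + 1)]
    return half_width, midpoints


def subrange_index(r, half_width, n_sub=10):
    """Return n (1-based) such that r falls in subrange n of eqn. 15."""
    n = int(r // (2 * half_width)) + 1
    return min(max(n, 1), n_sub)


# Hypothetical radii and C-alpha separation, in angstroms.
half_width, midpoints = subranges(a=1.5, b=2.0, x=6.5)
print(half_width, midpoints)
print(subrange_index(3.2, half_width))
```

The last subrange ends at r_10 + Δr = a + b + x, so the ten subranges tile the full range of possible inter-sidechain distances.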

Let p(r_n) be the a priori probability of occurrence of the two sidechain positions within a distance range r_n ± Δr. We can calculate p(r_n) by

where F(r) is derived in the Appendix. To obtain the expected inter-sidechain distribution, we initially partition the observed inter-Cα distribution into seven distance ranges that fall within eight successive distance limits d_0, d_1, ..., d_7. The first distance limit d_0 corresponds to 0.0 Å, so that the first distance range corresponds to all observations at distances less than d_1. The last distance limit d_7 corresponds to infinity, so that the last distance range corresponds to all observations beyond d_6. Let n_i1, n_i2, ..., n_i6, n_i7 represent the number of


observed inter-Cα distances in the seven distance ranges [d_0, d_1], [d_1, d_2], ..., [d_6, d_7], for a given residue pair type i. From eqns. 15 to 17 and the methods described in the Appendix we obtain the probability distribution of inter-sidechain distances p(r_k), over 10 equidistant intervals r_k corresponding to a given number of inter-Cα distances in a specific distance range [d_{j-1}, d_j] for a particular residue pair type i. The number n_ij of inter-Cα distances of pair type i in each distance range j corresponds to a distribution of expected inter-sidechain distances c_ijk according to the equation:

$$c_{ijk} = n_{ij}\,p(r_k) \qquad (19)$$

The reason for this expression is that even for a particular inter-Cα distance, j, the inter-sidechain distance may fall in various ranges, k, depending on the orientation of the sidechains. For an inter-sidechain distance, m, the expected (or calculated) distribution, N_{i,m,c}, is given by

$$N_{i,m,c} = \sum_{k=1}^{10} \sum_{j=1}^{7} \delta_{mk}\,c_{ijk} \qquad (20)$$

where

$$\delta_{mk} = \begin{cases} 1 & \text{if } r_k \text{ belongs to the range } [d_{m-1}, d_m] \\ 0 & \text{otherwise} \end{cases} \qquad (21)$$

Empirical energies for SS4 interactions. We obtain a set of empirical energies for each residue pair type over different distance ranges from the following equation

where E_i(r_j) is the energy difference between the actual distribution of SS atom pairs and that of the expected SS atom pairs for the residue type pair i. N_{i,j,s} is the observed number of sidechain positions with inter-sidechain distance between r_j and r_j + Δr, and N_{i,j,c} is the corresponding expected number of sidechain atoms in the same distance range. The incremental distance Δr was chosen to be 2.0 Å and the data were sampled every 0.25 Å to reduce the effect of noise in the data. For Gly, the "sidechain" atom was taken to be the Cα. Most of the interactions between sidechains are unimodal and could be represented by Lennard-Jones type of potentials. The fitted function for SS4 interactions is

(23)

where e_l is the minimum energy of the fitted function, and r_l is the distance at which it is minimum. We have taken the values of m = 6 and n = 4, just as for the MM4 interactions. The parameters e_l and r_l derived for longrange interaction between sidechain types are presented in Table 12. In the last column of Table 12, we show the χ² probability that each of the observed inter-sidechain distributions differs from the calculated one. The contribution of SS interactions to total energy is given by

EEt = E," +Ef (24)
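The bookkeeping of eqns. 19 to 21 can be sketched as follows. This is a minimal illustration under stated assumptions: the subrange probabilities and midpoints are passed in per inter-Cα range j (they depend on the Cα separation), the bins are treated as half-open so that no midpoint is counted twice, and the function name and the toy numbers are hypothetical. The paper uses seven ranges and ten subranges; the example uses fewer to stay readable.

```python
def expected_sidechain_counts(n_ij, p_jk, r_jk, limits):
    """Accumulate expected inter-sidechain counts N_{i,m,c} for one pair type i.

    n_ij[j]    observed number of inter-C-alpha distances in range j
    p_jk[j][k] probability of subrange k given range j
    r_jk[j][k] midpoint of subrange k given range j
    limits     distance limits d_0 < d_1 < ... (last one may be infinity)
    """
    n_bins = len(limits) - 1
    expected = [0.0] * n_bins
    for j, n in enumerate(n_ij):
        for p, r in zip(p_jk[j], r_jk[j]):
            c = n * p                               # eqn. 19
            for m in range(n_bins):
                if limits[m] <= r < limits[m + 1]:  # delta_mk of eqn. 21
                    expected[m] += c                # eqn. 20
                    break
    return expected


# Toy input (illustrative only): two C-alpha ranges, two subranges each.
limits = [0.0, 4.0, 8.0, float("inf")]
counts = expected_sidechain_counts(
    n_ij=[10, 20],
    p_jk=[[0.5, 0.5], [0.25, 0.75]],
    r_jk=[[2.0, 6.0], [5.0, 9.0]],
    limits=limits,
)
print(counts)
```

Because the probabilities for each range sum to one, the expected counts add up to the total number of observed inter-Cα distances for the pair type.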

TEST RESULTS

The present potential function has been extensively tested on BPTI by means similar to those described in our earlier work (1, 5): 1. ability of perturbed structures to return to the reference upon minimization (perturbation-minimization test); 2. maintaining the native radius of gyration; 3. preservation of secondary structure and longrange contacts; 4. applicability to other proteins; 5. comparison of the present potential with the previous single point per residue potential. Most of these tests involve comparing different conformations of the same molecule, which we do by calculating the root mean square difference between the interpoint distances of the one structure and the corresponding distances of the other.
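The comparison measure described above can be sketched as a root mean square difference over all interpoint distances. The function name `distance_rms` is hypothetical, and averaging over all distinct pairs is an assumption about the normalization.

```python
import math
from itertools import combinations


def distance_rms(coords_a, coords_b):
    """RMS difference between corresponding interpoint distances of two structures."""
    if len(coords_a) != len(coords_b):
        raise ValueError("structures must contain the same number of points")
    total = 0.0
    n_pairs = 0
    for i, j in combinations(range(len(coords_a)), 2):
        d_a = math.dist(coords_a[i], coords_a[j])
        d_b = math.dist(coords_b[i], coords_b[j])
        total += (d_a - d_b) ** 2
        n_pairs += 1
    return math.sqrt(total / n_pairs)


reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
shifted = [(x + 5.0, y, z) for x, y, z in reference]   # rigid translation
stretched = [(2.0 * x, y, z) for x, y, z in reference]  # uniform stretch
print(distance_rms(reference, shifted))
print(distance_rms(reference, stretched))
```

Because only internal distances enter, the measure needs no prior superposition of the two structures; it is also unchanged by mirror inversion.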

1. Perturbation-minimization test. In this test, we started with an energy minimized near-native conformation as the reference, obtained by minimizing the x-ray structure with respect to total energy. The reference has an energy


TABLE 12 Parameters for the longrange part of the potential function accounting for longrange sidechain-sidechain (SS4) interactions

Residue pair type   No. obsd.   e_l (kcal/mol)   r_l (Å)   Q^a
Gly Gly             5359         0.0             0.0       0.0
Gly Ala             8858        -0.024           2.75      1.0000
Gly Val             8156         0.150           5.0       1.0000
Gly Leu             7745         0.425           3.50      1.0000
Gly Ile             5873         0.075           3.75      1.0000
Gly Cys             1513         0.000           0.000     1.0000
Gly Met             1441        -0.644           3.75      1.0000
Gly Phe             4432        -0.018           4.51      1.0000
Gly Pro             4242        -0.220           3.50      1.0000
Gly Tyr             3961        -0.802           3.75      1.0000
Gly His             1719         0.153           3.50      0.9993
Gly Trp             1340        -0.419           3.75      1.0000
Gly Ser             7476        -0.212           2.75      1.0000
Gly Thr             6278        -0.209           2.25      0.999
Gly Lys             6917        -0.088           4.75      1.0000
Gly Arg             3375        -0.537           4.50      1.0000
Gly Asp             6296        -0.073           4.50      1.0000
Gly Asn             4254        -0.173           1.75      1.0000
Gly Glu             5388        -0.252           3.75      1.0000
Gly Gln             3394         0.078           3.50      1.0000
Ala Ala             3784        -1.914           1.75      0.9998
Ala Val             6937        -0.411           4.25      0.9999
Ala Leu             6847        -0.621           4.00      1.0000
Ala Ile             4919        -0.509           4.00      1.0000
Ala Cys             1114        -0.660           4.00      0.9212
Ala Met             1226        -1.010           3.75      1.0000
Ala Phe             3879        -0.830           4.00      1.0000
Ala Pro             3651        -1.028           2.00      1.0000
Ala Tyr             3100        -0.896           3.00      1.0000
Ala His             1665        -0.370           4.00      1.0000
Ala Trp             1126        -0.513           3.50      0.9997
Ala Ser             6289        -0.158           2.75      0.9994
Ala Thr             5247        -0.235           4.25      1.0000
Ala Lys             6381        -0.238           6.00      1.0000
Ala Arg             2920        -0.949           2.00      1.0000
Ala Asp             5489        -0.812           1.75      1.0000
Ala Asn             3538        -0.356           2.50      1.0000
Ala Glu             4850        -0.135           3.25      1.0000
Ala Gln             2696        -0.278           3.75      1.0000
Val Val             3227        -0.128           4.25      1.0000
Val Leu             6262        -0.393           5.00      1.0000
Val Ile             4592        -0.421           4.00      1.0000
Val Cys             1212        -0.534           4.00      1.0000
Val Met             1255        -0.732           3.75      1.0000
Val Phe             3554        -0.464           4.00      1.0000
Val Pro             3548        -0.273           5.00      1.0000
Val Tyr             2774        -0.865           4.25      1.0000
Val His             1474        -0.196           5.00      1.0000
Val Trp             1007        -0.655           4.25      1.0000
Val Ser             5808        -0.260           4.25      1.0000


TABLE 12 (Contd.)

Residue pair type   No. obsd.   e_l (kcal/mol)   r_l (Å)   Q^a
Val Thr             4797        -0.019           4.25      0.9997
Val Lys             5948        -0.218           6.00      1.0000
Val Arg             2819        -0.590           4.25      1.0000
Val Asp             4850        -0.023           3.75      1.0000
Val Asn             3108        -0.183           5.00      1.0000
Val Glu             4708        -1.367           1.50      1.0000
Val Gln             2430         0.246           3.50      1.0000
Leu Leu             2995        -0.479           4.75      1.0000
Leu Ile             4397        -0.652           6.00      1.0000
Leu Cys              990        -0.808           3.75      0.9997
Leu Met             1122        -0.585           3.00      1.0000
Leu Phe             3473        -0.599           6.00      1.0000
Leu Pro             3290        -0.093           4.75      1.0000
Leu Tyr             2651        -0.606           4.50      1.0000
Leu His             1531        -0.387           5.25      1.0000
Leu Trp             1005        -0.946           3.25      0.9947
Leu Ser             5643        -0.273           5.00      1.0000
Leu Thr             4618        -0.320           5.00      1.0000
Leu Lys             5388        -0.204           3.00      1.0000
Leu Arg             2669        -0.774           4.25      1.0000
Leu Asp             4689        -0.206           5.25      1.0000
Leu Asn             3007        -0.101           3.75      1.0000
Leu Glu             4559        -0.555           2.00      1.0000
Leu Gln             2318        -0.872           4.25      1.0000
Ile Ile             1630        -0.549           3.75      1.0000
Ile Cys              834        -0.972           3.75      0.9963
Ile Met              847        -1.120           4.25      1.0000
Ile Phe             2532        -0.495           5.50      1.0000
Ile Pro             2367        -1.448           2.00      1.0000
Ile Tyr             2124        -0.570           3.50      1.0000
Ile His             1052         0.328           3.50      1.0000
Ile Trp              747        -0.633           3.00      1.0000
Ile Ser             4306        -0.401           1.75      1.0000
Ile Thr             3600        -0.162           4.00      1.0000
Ile Lys             4021         0.478           3.50      1.0000
Ile Arg             1850        -0.757           3.25      1.0000
Ile Asp             3524        -0.024           3.75      1.0000
Ile Asn             2407        -0.235           3.75      1.0000
Ile Glu             3173        -0.256           2.00      1.0000
Ile Gln             1820        -0.328           4.00      1.0000
Cys Cys              276        -2.726           1.50      1.0000
Cys Met              237        -1.240           2.75      0.6932
Cys Phe              636        -0.686           3.75      0.9835
Cys Pro              761        -0.736           3.75      0.9939
Cys Tyr              484        -0.311           3.25      0.8704
Cys His              212        -1.339           3.75      0.9254
Cys Trp              225        -0.471           3.25      0.9617
Cys Ser              951        -0.501           3.50      0.9389
Cys Thr              878        -0.140           5.00      0.9994
Cys Lys             1026        -0.665           4.25      1.0000
Cys Arg              582        -1.937           1.75      0.3046
Cys Asp              830        -0.426           3.75      0.9919


TABLE 12 (Contd.)

Residue pair type   No. obsd.   e_l (kcal/mol)   r_l (Å)   Q^a
Cys Asn              599        -0.168           5.75      0.9914
Cys Glu              842        -0.714           3.25      0.9964
Cys Gln              428        -0.541           2.75      0.4799
Met Met              132        -0.778           5.00      0.9130
Met Phe              635        -1.421           4.25      0.9970
Met Pro              669        -1.620           4.25      0.9928
Met Tyr              432        -0.799           4.00      0.6319
Met His              269        -1.605           3.50      0.9921
Met Trp              193        -0.857           3.50      0.9932
Met Ser              988        -1.250           3.75      1.0000
Met Thr              884        -1.037           3.75      1.0000
Met Lys             1219        -1.030           3.25      0.9997
Met Arg              521        -0.829           3.75      1.0000
Met Asp              858        -0.361           4.25      1.0000
Met Asn              520        -1.144           4.25      0.9989
Met Glu              982        -0.642           5.50      0.9939
Met Gln              426        -1.193           3.00      0.5806
Phe Phe              951        -0.759           4.25      1.0000
Phe Pro             1901        -0.228           3.50      1.0000
Phe Tyr             1511        -0.744           5.00      1.0000
Phe His              842        -0.386           3.25      0.9989
Phe Trp              580        -0.754           4.25      1.0000
Phe Ser             3321         0.102           2.50      1.0000
Phe Thr             2704        -0.365           2.25      1.0000
Phe Lys             3114        -0.144           4.25      1.0000
Phe Arg             1473        -0.856           3.25      0.9999
Phe Asp             2746        -0.093           5.25      1.0000
Phe Asn             1738        -0.251           4.25      1.0000
Phe Glu             2479        -0.661           4.25      1.0000
Phe Gln             1319        -0.980           4.25      1.0000
Pro Pro              923        -1.166           3.25      0.9479
Pro Tyr             1472        -0.972           4.00      0.9877
Pro His              794        -0.979           4.00      0.9854
Pro Trp              584        -0.973           4.75      0.9987
Pro Ser             3065        -0.252           3.50      1.0000
Pro Thr             2536        -0.892           2.00      1.0000
Pro Lys             3122        -0.023           2.75      1.0000
Pro Arg             1541        -0.528           6.50      1.0000
Pro Asp             2545        -0.916           4.00      1.0000
Pro Asn             1631        -0.629           4.00      1.0000
Pro Glu             2491        -0.097           3.00      1.0000
Pro Gln             1276        -1.277           2.75      1.0000
Tyr Tyr              966        -0.655           4.25      0.9949
Tyr His              728        -1.264           3.50      0.9952
Tyr Trp              583        -1.028           5.25      0.9955
Tyr Ser             2908        -0.584           2.75      1.0000
Tyr Thr             2455        -1.152           4.25      1.0000
Tyr Lys             2247        -1.204           4.00      0.9996
Tyr Arg             1386        -1.357           2.25      0.7943
Tyr Asp             2246        -1.180           3.50      1.0000
Tyr Asn             1816        -0.324           3.50      1.0000
Tyr Glu             1792        -1.428           3.75      0.9999


TABLE 12 (Contd.)

Residue pair type   No. obsd.   e_l (kcal/mol)   r_l (Å)   Q^a
Tyr Gln             1381        -0.858           4.00      0.9955
His His              221        -0.531           5.00      0.9867
His Trp              262        -1.044           3.25      0.5262
His Ser             1417         0.027           2.75      1.0000
His Thr             1144        -0.536           4.00      0.9456
His Lys             1454        -0.602           4.25      1.0000
His Arg              692        -0.803           3.00      0.9913
His Asp             1090        -0.031           3.50      0.9996
His Asn              533        -0.484           4.00      0.9133
His Glu             1130        -0.849           4.25      1.0000
His Gln              553        -0.208           5.75      1.0000
Trp Trp              103        -0.581           5.25      0.3792
Trp Ser             1031        -0.644           4.25      0.9968
Trp Thr              829         0.125           3.50      1.0000
Trp Lys              851        -0.857           5.00      0.9996
Trp Arg              534        -1.072           4.00      0.4174
Trp Asp              790        -1.022           4.25      0.9964
Trp Asn              619        -0.713           3.25      0.6417
Trp Glu              724        -1.137           4.00      0.9589
Trp Gln              447        -1.152           3.25      0.8405
Ser Ser             2840        -0.770           2.75      0.9974
Ser Thr             4725        -0.421           4.25      0.9992
Ser Lys             4853        -0.372           3.75      1.0000
Ser Arg             2592        -0.641           2.75      1.0000
Ser Asp             4553        -0.347           3.00      1.0000
Ser Asn             3191        -0.323           3.50      1.0000
Ser Glu             3812        -1.043           4.00      1.0000
Ser Gln             2292        -0.457           3.75      1.0000
Thr Thr             1919         0.024           4.25      0.9918
Thr Lys             4203        -0.199           4.25      1.0000
Thr Arg             2090        -0.777           4.25      1.0000
Thr Asp             3813        -0.306           4.00      1.0000
Thr Asn             2592        -0.658           4.00      1.0000
Thr Glu             3179        -0.874           3.00      1.0000
Thr Gln             1976        -0.586           4.50      1.0000
Lys Lys             2995        -0.316           4.50      1.0000
Lys Arg             2476        -0.759           3.75      0.9992
Lys Asp             4260        -0.832           4.25      1.0000
Lys Asn             2652         0.188           2.75      1.0000
Lys Glu             4634        -1.992           1.75      1.0000
Lys Gln             2073        -0.297           3.00      0.9982
Arg Arg              686        -0.528           3.50      0.9911
Arg Asp             1957        -1.185           3.00      0.9962
Arg Asn             1422        -0.873           3.00      1.0000
Arg Glu             1996        -2.538           2.25      0.9656
Arg Gln             1061        -0.827           5.50      0.9916
Asp Asp             1943        -0.618           3.75      1.0000
Asp Asn             2543        -0.463           3.75      1.0000
Asp Glu             3349        -1.123           3.00      1.0000
Asp Gln             1929        -1.062           4.00      1.0000
Asn Asn              897        -0.316           3.75      0.9715
Asn Glu             2050        -0.712           3.00      1.0000


TABLE 12 (Contd.)

Residue pair type   No. obsd.   e_l (kcal/mol)   r_l (Å)   Q^a
Asn Gln             1374        -0.636           3.50      1.0000
Glu Glu             1831        -0.328           3.75      0.9998
Glu Gln             1587        -1.658           3.00      0.9938
Gln Gln              512        -1.064           3.25      0.9778

a The χ² probability that the observed distribution differs from the calculated one.
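To show how one pair of Table 12 parameters defines an energy curve, the sketch below evaluates a generalized Lennard-Jones function with exponents m = 6 and n = 4. The functional form used here is an assumption (a standard (m, n) form scaled so that its extremum has energy e_l at distance r_l) and is not necessarily the fitted function of eqn. 23; the function name `lj_mn` is hypothetical. The example parameters are the Gln Gln entry of Table 12.

```python
def lj_mn(r, e_min, r_min, m=6, n=4):
    """Assumed generalized Lennard-Jones form with extremum e_min at r = r_min.

    E(r) = e_min * (m * (r_min / r)**n - n * (r_min / r)**m) / (m - n)
    For negative e_min this is a single minimum, repulsive at short range.
    """
    x = r_min / r
    return e_min * (m * x ** n - n * x ** m) / (m - n)


# Gln Gln entry of Table 12: e_l = -1.064 kcal/mol, r_l = 3.25 angstroms.
for r in (2.5, 3.0, 3.25, 3.5, 5.0, 10.0):
    print(r, lj_mn(r, e_min=-1.064, r_min=3.25))
```

Under this assumed form, entries with a positive e_l would describe a maximum rather than a minimum, so the sketch is only meaningful for the attractive pairs.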

of 3095.52 kcal, compared to the x-ray structure's energy of 3306.06 kcal. Next, for several trials we randomly perturbed it and minimized the total energy. The perturbed structures were derived by adding a uniformly distributed random number to each of the mainchain coordinates. The perturbed sidechain positions were obtained by changing the direction of the sidechain vector (which is the difference vector of the mainchain position vector and the sidechain position vector) while preserving its magnitude and appending it to the corresponding mainchain position. The minimization of the total energy was done by demanding that the final energy gradient be small and that any alteration along any of the Cartesian coordinate axes by 0.1 Å should produce no improvement in energy. The minimizations were carried out using the MINOP minimizer, an unconstrained variable metric optimization algorithm using function and gradient values (13), taking an initial step size of 0.01 Å. In Table 13 we give the energy contributions of various components in the potential for a few of the minimized structures of BPTI. The large magnitudes of the first two components in Table 13 result from the relatively high positive values of the

asymmetric minima of some of the interactions. Appending a large negative constant would make these interaction potentials comparable to other interaction components. In Fig. 1 we plot the root mean square deviations (for the mainchain positions) of the starting structures vs. the root mean square deviations of the final minimized structures. In all cases except two, the final minimized structures lie closer to the reference than the starting perturbed structures, showing the efficacy of the potential in guiding the perturbed structures toward the native minimum. However, unlike the one-point-per-residue potential, we do not observe an effective radius of convergence, presumably due to the presence of many more interaction terms in the function. It is worth noting that minimization with the present potential takes less CPU time than with the single-point potential, indicating that the many more interaction terms in the function give rise to high energy local minima. Even when the magnitude of perturbation is large, the present potential improves the contacts, whereas with the previous potential, minimization sometimes moved the structure away from the near-native reference. Fig. 2 shows the total energy of the minimized structures plotted with respect to their rms deviations from the reference. In general, the greater the deviation, the larger the energy of the minimized structure.

FIGURE 1 Plot of rms deviations (in Å) between the Cα's of starting perturbed structures of trypsin inhibitor vs. that of total energy-minimized structures.

TABLE 13 Energy contributions (kcal/mol) of components in the potential function for a few minimized structures of BPTI

E m       Ess       EP       ESM      EP       Etot
2145.5    1034.3    -44.5    -21.8    -17.9    3095.5
2145.9    1041.5    -43.2    -20.9    -18.2    3105.1
2146.7    1042.6    -42.8    -20.9    -17.6    3108.0
2150.2    1050.7    -41.7    -18.8    -16.7    3123.8
2147.1    1044.4    -42.5    -38.1    -20.6    3110.9

2. Maintenance of native radius of gyration. In Table 14, we present the values of radii of gyration calculated for the mainchain and sidechain separately and for the entire molecule. Like our previous potential, the present potential does not attempt to compress or expand the structure.

3. Preservation of secondary structure and longrange contacts. In Table 15 we show how the average deviations from the energy-minimized x-ray structure of BPTI change for the i to i + 1, i to i + 2, i to i + 3, longrange (|i - j| > 4), and all distances, as a perturbed structure is minimized. The mainchain deviations of short and longrange distances are reduced upon minimization, except for the i to i + 1 distances, which increase only slightly. Thus, local structures, such as α-helix and β-strand, are well preserved; and the decrements in various shortrange and longrange deviations upon minimization occur in a balanced fashion without undue emphasis on any component over another, much as with the single-point potential.

4. Applicability to other proteins. We applied our potential to two other proteins, and the results are summarized in Table 16. The minimized structures maintain the radius of gyration of their x-ray structures and are reasonably close to the x-ray structures in conformation. This shows that the potential has general applicability. As in the previous

FIGURE 2 Plot of the energy value (kcal) of BPTI final structures minimized from the perturbed structures vs. rms deviations (in Å) of final minimized structures from the reference minimum.
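The perturbation step of the perturbation-minimization test can be sketched as below. Two details are assumptions rather than statements from the text: the uniform random numbers are drawn from a symmetric interval [-amplitude, +amplitude], and the new sidechain direction is drawn uniformly over the sphere. The function name `perturb_structure` and the toy coordinates are hypothetical.

```python
import math
import random


def perturb_structure(mainchain, sidechain, amplitude, seed=0):
    """Perturb mainchain points and re-attach sidechain points at their original length."""
    rng = random.Random(seed)
    new_main, new_side = [], []
    for m, s in zip(mainchain, sidechain):
        length = math.dist(m, s)  # magnitude of the sidechain vector, preserved below
        pm = tuple(c + rng.uniform(-amplitude, amplitude) for c in m)
        # New direction for the sidechain vector (assumed uniform over the sphere).
        z = rng.uniform(-1.0, 1.0)
        phi = rng.uniform(0.0, 2.0 * math.pi)
        t = math.sqrt(1.0 - z * z)
        direction = (t * math.cos(phi), t * math.sin(phi), z)
        ps = tuple(p + length * u for p, u in zip(pm, direction))
        new_main.append(pm)
        new_side.append(ps)
    return new_main, new_side


# Toy two-residue structure, coordinates in angstroms (illustrative only).
main = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0)]
side = [(0.0, 1.5, 0.0), (3.8, 0.0, 2.0)]
print(perturb_structure(main, side, amplitude=0.5))
```

The perturbed structure would then be handed to an energy minimizer; that step is not shown here.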

TABLE 14 Radii of gyration for a few minimized structures of BPTI for the mainchain Rg(M), the sidechain positions Rg(S) and for the entire structure (Rg)

     Rg(M)    Rg(S)    Rg
a    10.61    11.62    11.12
b    10.53    11.39    10.98
     10.52    11.37    10.96
     10.51    11.26    10.89
     10.61    11.17    10.90
     10.53    11.31    10.93

a Values in this row correspond to the x-ray structure.
b Values in this row correspond to the reference minimum.
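The quantity tabulated in Table 14 can be sketched as follows, assuming an unweighted radius of gyration in which every point (Cα or sidechain point) counts equally; the function name is hypothetical and the weighting is an assumption.

```python
import math


def radius_of_gyration(points):
    """Unweighted radius of gyration: rms distance of the points from their centroid."""
    n = len(points)
    centroid = tuple(sum(p[k] for p in points) / n for k in range(3))
    return math.sqrt(sum(math.dist(p, centroid) ** 2 for p in points) / n)


# Toy point sets (illustrative only).
pair = [(-1.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
cube = [(x, y, z) for x in (-1.0, 1.0) for y in (-1.0, 1.0) for z in (-1.0, 1.0)]
print(radius_of_gyration(pair))
print(radius_of_gyration(cube))
```

Calling the function on the mainchain points, the sidechain points, or their union would give the three kinds of values reported in the table.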


TABLE 15 Average deviations (in Å) of local and longrange distances between mainchain (M) and sidechain (S) positions of BPTI from the near-native reference for a few perturbed (initial) and energy-minimized (final) structures

Trial No.          i to i+1    i to i+2    i to i+3    Longrange   rms
                   deviation   deviation   deviation   deviation   deviation
1.  M  initial     0.027       0.686       0.861       0.848       0.831
       final       0.035       0.441       0.628       0.723       0.699
    S  initial     0.484       0.704       0.778       0.833       0.822
       final       0.580       0.593       0.702       0.833       0.832
2.  M  initial     0.027       0.474       0.593       0.594       0.581
       final       0.028       0.344       0.489       0.545       0.528
    S  initial     0.326       0.482       0.532       0.589       0.579
       final       0.402       0.402       0.473       0.577       0.573
3.  M  initial     0.018       1.564       1.681       2.671       2.630
       final       0.028       0.560       1.338       2.238       2.138
    S  initial     1.763       1.793       1.869       2.562       2.591
       final       1.474       1.322       1.881       2.499       2.419

test, changes are more pronounced for sidechain positions rather than mainchain points.

5. Comparison with the previous potential. One of the main advantages of using a two-point-per-residue potential over the one-point-per-residue potential lies in the ability of the former to realistically account for secondary structure formation through the additional information provided by preferred sidechain contacts. We have tested this by applying energy minimization with both potentials to the hairpin fragment of BPTI (residues 16 to 37). We have used the following criteria of Levitt & Greer (14) to verify the presence

of β-sheet contacts. For the first two residues i and j which are in proper register on opposite strands of an anti-parallel β-sheet containing n residues on each strand

(for k = 0, 1, 2, ..., n), where x(Cα) is the position vector of a Cα atom. It is also necessary for

IX(Ci">-X(Cia,1)I > lx(cP)-x(cp)I

IX(CiQ) -x(ci"-l)I > Ix(Ci"> -x(Ci">l (26)

(27)

TABLE 16 Average deviations (near-neighbor and longrange) in Å between the x-ray structure and the energy-minimized structure for three different proteins. Deviations are tabulated separately for mainchain (M) and sidechain (S) distances

Protein^a      i to i+1    i to i+2    i to i+3    Longrange   rms         ΔRg^b
               deviation   deviation   deviation   deviation   deviation
4PTI   M       0.044       0.392       0.493       0.577       0.569       0.23
       S       0.960       1.397       1.592       1.203       1.251       0.14
1CTX   M       0.199       0.622       0.798       0.919       1.048       0.42
       S       0.972       1.155       1.141       1.107       1.166       0.27
1PCY   M       0.130       0.576       0.756       0.724       0.755       0.18
       S       0.834       1.469       1.631       1.341       1.428       0.44

a See Table 3 for full protein name.
b Change in radius of gyration upon minimization.


TABLE 17 Average deviations (Å) of local and longrange distances for three trials of the hairpin fragment of BPTI for the initial perturbed structure, the energy-minimized conformation with respect to the single-point potential, and the energy-minimized conformation with respect to the two-point potential. Nβ is the number of β-sheet contacts maintained in each structure

Trial No.         i to i+1    i to i+2    i to i+3    Longrange   rms         Nβ
                  deviation   deviation   deviation   deviation   deviation
1.  initial       0.011       0.059       0.067       0.069       0.066
    one point     0.017       0.085       0.087       0.096       0.091
    two point     0.007       0.042       0.053       0.064       0.059
2.  initial       0.020       0.159       0.197       0.304       0.276
    one point     0.011       0.299       0.361       0.357       0.337
    two point     0.021       0.205       0.226       0.368       0.332
3.  initial       0.011       0.640       0.830       0.898       0.837
    one point     0.039       0.416       0.820       0.802       0.745
    two point     0.021       0.205       0.226       0.368       0.332

In all cases the two-point-per-residue potential performs better than the one-point-per-residue potential in maintaining and reforming the β-sheet contacts and in bringing the perturbed conformations closer to the native structure upon minimization, as seen from Table 17. However, when more than half the contacts were disrupted, even the present potential was not successful in regenerating the β-sheet from the perturbed structure.

A similar analysis was performed on the terminal helix of BPTI (residues 47 to 55). However, since this helix has many non-helical residues, the present potential did not give a significantly better result than the previous potential. To demonstrate the value of incorporating an extra point per residue, we generated a polyleucine helix by minimizing the total energy of the BPTI terminal helix and replacing all sidechain types by leucine. The coordinates of this helix were perturbed at random, and the energy of the structure was minimized both with respect to the present potential and the previous one-point-per-residue potential. Residues i through j + 4 are considered helical if the residues i to j satisfy the following distance criteria (14):

$$d_{i,i+3} < 6.0\ \text{\AA} \qquad (28)$$

$$d_{i,i+4} < 6.5\ \text{\AA} \qquad (29)$$

In addition, we also considered whether the size of the helix is maintained, to verify whether the potential has the effect of shrinking the helix. The results are summarized in Table 18. In both trials, the present potential brings the perturbed structure closer to the native structure upon minimization than the previous one-point-per-residue potential. When the magnitude of perturbation is small, as in the first trial, both potentials are successful in maintaining the helix, whereas with a larger perturbation in trial 2, only the present potential maintains the helix. These tests demonstrate that with respect to maintenance and formation of secondary structure, the present potential is better than the previous one-point-per-residue potential.
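The helix criteria can be sketched as a scan over a Cα trace. The sketch reads the first criterion as a limit on the i to i + 3 distance, which is an assumption; the idealized helix geometry (radius 2.3 Å, rise 1.5 Å, 100° per residue) is textbook geometry used only to build a test case, and the function names are hypothetical.

```python
import math


def helical_residues(ca, d3_max=6.0, d4_max=6.5):
    """Return sorted indices of residues flagged helical by the two distance criteria."""
    flagged = set()
    for i in range(len(ca) - 4):
        if math.dist(ca[i], ca[i + 3]) < d3_max and math.dist(ca[i], ca[i + 4]) < d4_max:
            flagged.update(range(i, i + 5))  # residues i through i + 4
    return sorted(flagged)


def ideal_helix(n, radius=2.3, rise=1.5, turn_deg=100.0):
    """C-alpha trace of an idealized alpha-helix (illustrative geometry)."""
    step = math.radians(turn_deg)
    return [(radius * math.cos(k * step), radius * math.sin(k * step), rise * k)
            for k in range(n)]


helix = ideal_helix(10)
strand = [(3.8 * k, 0.0, 0.0) for k in range(10)]  # fully extended toy chain
print(helical_residues(helix))
print(helical_residues(strand))
```

Taking the union over consecutive qualifying residues i to j reproduces the rule that residues i through j + 4 are counted as helical.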

Lastly, we studied whether the potential is capable of distinguishing between right and left handed helices. A single point potential is not capable of this distinction, as the inter-residue distances remain invariant upon transformation of a right handed helix to a left handed one. Starting with the terminal helix of BPTI (residues 47 to 55), we generated the left handed version by initially mirror-inverting all the coordinates and then deducing the coordinates of the sidechain point by an inversion of the mirror-inverted sidechain point i about the plane formed by Cα(i-1), Cα(i), and Cα(i+1). The original right handed helix had an energy of 63.39 kcal compared to its left handed counterpart with an energy of


TABLE 18 Average deviations of a polyleucine helix for the perturbed, minimized with respect to the single-point potential, and minimized with respect to the two-point potential structures

Trial No.         i to i+1    i to i+2    i to i+3    Longrange   rms
                  deviation   deviation   deviation   deviation   deviation
1.  initial       0.009       0.208       0.288       0.156       0.172
    one point     0.009       0.116       0.310       0.314       0.564
    two point     0.006       0.066       0.178       0.144       0.123
2.  initial       0.009       1.161       1.348       1.024       1.007
    one point     0.004       0.520       2.246       1.625       1.431
    two point     0.009       0.093       0.493       0.474       0.421

65.23 kcal, the difference in the energy between the two structures coming from Si - Mi-, , Si - Mi+, and Si - Si+3 interactions. This confirms the ability of the potential to account for chain chirality by preferring the right handed helix to the left handed one.
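The reason a one-point-per-residue potential cannot distinguish handedness can be checked numerically: mirror inversion leaves every interpoint distance unchanged while reversing the sign of a chirality measure. The sketch below uses an idealized helix and a signed volume as that measure; the helix parameters and function names are illustrative assumptions.

```python
import math
from itertools import combinations


def mirror(points):
    """Mirror-invert a structure through the yz-plane."""
    return [(-x, y, z) for x, y, z in points]


def signed_volume(p0, p1, p2, p3):
    """Scalar triple product of the edges from p0; its sign encodes handedness."""
    u = [p1[k] - p0[k] for k in range(3)]
    v = [p2[k] - p0[k] for k in range(3)]
    w = [p3[k] - p0[k] for k in range(3)]
    return (u[0] * (v[1] * w[2] - v[2] * w[1])
            - u[1] * (v[0] * w[2] - v[2] * w[0])
            + u[2] * (v[0] * w[1] - v[1] * w[0]))


def all_distances(points):
    return [math.dist(p, q) for p, q in combinations(points, 2)]


step = math.radians(100.0)
right = [(2.3 * math.cos(k * step), 2.3 * math.sin(k * step), 1.5 * k) for k in range(9)]
left = mirror(right)

print(max(abs(a - b) for a, b in zip(all_distances(right), all_distances(left))))
print(signed_volume(*right[:4]), signed_volume(*left[:4]))
```

Any energy that depends only on distances between one point per residue is therefore identical for the two helices; a second point per residue, placed off the local backbone plane, breaks this degeneracy.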

CONCLUSION

The present potential fulfills the requirements stated in the introduction, as did the one-point-per-residue potential. Moreover, the present potential is shown to be better than the previous one in virtually all aspects of testing. It is particularly more effective in the preservation and modelling of secondary and supersecondary structural elements such as α-helices and β-sheets, and consequently can guide incorrectly folded protein structures to the native minimum via preferred aggregates of secondary structural elements. Although the topography of the conformational hyperspace is far from smooth, the fact that energy-minimized perturbed structures have higher energy than the near-native minimum suggests that a strategy for the circumvention of the high energy local minima would enable the near-native minimum to be reached.

ACKNOWLEDGMENTS

This work has been supported by grants from the National Institutes of Health (5-R01-AM28140-03 and 5-R01-GM30561-02) and the Robert A. Welch Foundation. Acknowledgment is also made to the Donors of the Petroleum Research Fund, administered by the American Chemical Society, for the partial support of this research.

REFERENCES

1. Crippen, G.M. & Viswanadhan, V.N. (1984) Int. J. Peptide Protein Res. 24, 279-296

2. Crippen, G.M. (1982) J. Comp. Chem. 3, 471-476

3. Crippen, G.M. (1982) Biopolymers 21, 1933-1943

4. Kuntz, I.D., Crippen, G.M. & Kollman, P.A. (1979) Biopolymers 18, 939-957

5. Oobatake, M. & Crippen, G.M. (1981) J. Phys. Chem. 85, 1187-1197

6. Momany, F.A., Carruthers, L.M., McGuire, R.F. & Scheraga, H.A. (1974) J. Phys. Chem. 78, 1593-1620

7. Pincus, M.R. & Scheraga, H.A. (1977) J. Phys. Chem. 81, 1579-1583

8. Levitt, M. (1976) J. Mol. Biol. 104, 59-107

9. McCammon, J.A., Northrup, S.H., Karplus, M. & Levy, R.M. (1980) Biopolymers 19, 2033-2045

10. Bernstein, F.C., Koetzle, T.F., Williams, G.J.B., Meyer, E.F., Brice, M.D., Rodgers, J.R., Kennard, O., Shimanouchi, T. & Tasumi, M. (1977) J. Mol. Biol. 112, 535-542 [Brookhaven Protein Data Bank]

11. Levitt, M. (1978) Biochemistry 17, 4277-4285

12. Gatlin, L.L. (1972) Information Theory and the Living System. Columbia University Press, New York

13. Dennis, J.E. & Mei, H.H.W. (1975) Technical Report TR(75-246), Department of Computer Science, Cornell University, Ithaca, New York

14. Levitt, M. & Greer, J. (1977) J. Mol. Biol. 114, 181

Address: G.M. Crippen Department of Chemistry Texas A & M University College Station, Texas 77843 USA


APPENDIX

Distance distribution between points on two spherical surfaces. Let a be the radius of sphere A, b be the radius of sphere B, X be the distance between the centers of the two spheres, Pb be a general point on sphere B, Pa be a general point on sphere A, r be the distance between Pa and Pb and R be the distance between Pb and the center of A. Without loss of generality, we take the sphere with greater radius as B (Fig. 3). In order to calculate the distribution of r for a uniform sampling of all Pa and Pb, we first consider the case of a fixed point Pb on sphere B. Then the distribution of distances r lies within the limits R - a to R + a. Denote the distribution density function by f(r, R).

FIGURE 3 Definitions of X, R, r, a, b, Pa, and Pb for calculating the distribution of distances between two spheres.

Referring to Fig. 4, the locus of all points on sphere A with the same distance r from the fixed point is a circle, whose radius is given by r sin α. The frequency of distances which lie between r and r + dr, namely f(r, R)dr, is proportional to the surface area of the spherical segment occurring within the circles whose radii are given by r sin α and (r + dr) sin α₊. Furthermore, we see from Fig. 4

$$h = r\cos\alpha - (R - a) \qquad (1A)$$

$$h_+ = (r + dr)\cos\alpha_+ - (R - a) \qquad (2A)$$

and by the law of cosines,

$$\cos\alpha = \frac{R^2 + r^2 - a^2}{2rR} \qquad (3A)$$

$$\cos\alpha_+ = \frac{R^2 + r^2 - a^2 + 2r\,dr}{2R(r + dr)} \qquad (4A)$$

neglecting second order smalls. Then f(r, R)dr is given by the difference in area of two spherical segments:

$$f(r, R)\,dr = 2\pi a\,(h_+ - h) \qquad (5A)$$

Substituting the appropriate expressions from equations (1A), (2A), (3A) and (4A) on the right hand side of eqn. 5A, we get

$$f(r, R)\,dr = \frac{2\pi a r}{R}\,dr \qquad (6A)$$
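The substitution leading to eqn. 6A can be written out in one line. The worked step below only uses eqns. 1A to 5A; the term (R - a) cancels in the difference h₊ - h, and the products r cos α and (r + dr) cos α₊ follow directly from eqns. 3A and 4A.

```latex
\begin{aligned}
h_+ - h &= (r + dr)\cos\alpha_+ - r\cos\alpha \\
        &= \frac{R^2 + r^2 - a^2 + 2r\,dr}{2R} - \frac{R^2 + r^2 - a^2}{2R}
         = \frac{r\,dr}{R}, \\
f(r, R)\,dr &= 2\pi a\,(h_+ - h) = \frac{2\pi a r}{R}\,dr .
\end{aligned}
```

The density for a fixed point Pb is therefore linear in r over the interval R - a to R + a.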

FIGURE 4 Calculation of the spherical segment swept out when increasing r to r + dr (see eqns. 1A-5A).

To calculate the overall distribution density function, F(r)dr (when Pb can be any point on sphere B) from the function f(r, R)dr (when Pb is fixed), we set up a spherical polar coordinate system with the center of sphere B as origin, z-axis along the line joining the centers of spheres A and B, and the angle θ (see Fig. 5) representing the polar angle.

Let θ_min and θ_max represent the minimal and maximal values of the polar angle θ, defining a spherical section on sphere B that contains all possible locations of Pb such that the distance PaPb lies within r and r + dr. The function F(r)dr is given by

$$F(r)\,dr = \int_{\phi=0}^{2\pi} \int_{\theta=\theta_{min}}^{\theta_{max}} f(r, R)\, b^2 \sin\theta \, d\theta \, d\phi \, dr \qquad (7A)$$

The limits of cos θ are conveniently broken into the following cases:

(i) The case of non-overlapping spheres

(i1) For X - a - b < r < X + a - b (Fig. 5)

$$\cos\theta_{min} = 1 \qquad (8A)$$

$$\cos\theta_{max} = \frac{b^2 + X^2 - (a + r)^2}{2bX} \qquad (9A)$$

FIGURE 5 Case i1 for the range of r (see eqns. 8A and 9A).

(i2) For X + a - b < r < X - a + b (Fig. 6)

$$\cos\theta_{min} = \frac{b^2 + X^2 - (r - a)^2}{2bX} \qquad (10A)$$

$$\cos\theta_{max} = \frac{b^2 + X^2 - (a + r)^2}{2bX} \qquad (11A)$$

FIGURE 6 Case i2 for the range of r (see eqns. 10A and 11A).

(i3) For X - a + b < r < X + a + b (Fig. 7)

$$\cos\theta_{min} = \frac{b^2 + X^2 - (r - a)^2}{2bX} \qquad (12A)$$

$$\cos\theta_{max} = -1 \qquad (13A)$$

FIGURE 7 Case i3 for the range of r (see eqns. 12A and 13A).

(ii) The case of overlapping spheres:

(ii1) For 0 < r ≤ a + b - X (Fig. 8)

$$\cos\theta_{min} = \frac{X^2 + (b - r)^2 - a^2}{2X(b - r)} \qquad (14A)$$

$$\theta_{max} = \cos^{-1}\!\left(\frac{b^2 + X^2 - a^2}{2bX}\right) + 2\sin^{-1}\!\left(\frac{r}{2b}\right) \qquad (15A)$$

FIGURE 8 Case ii1 for the range of r (see eqns. 14A and 15A).

(ii2) For a + b - X < r < r_int (Fig. 9)

$$r_{int} = 2b\,\sin\!\left[\tfrac{1}{2}\cos^{-1}\!\left(\frac{b^2 + X^2 - a^2}{2bX}\right)\right]$$

$$\cos\theta_{min} = 1$$

FIGURE 9 Case ii2 for the range of r (see eqns. 16A and 17A).

FIGURE 10 Case ii3 for the range of r (see eqns. 18A and 19A).

The expressions for the last two sets of limiting values of r remain the same as for the case of non-overlapping spheres.
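For non-overlapping spheres the double integral for F(r) can be evaluated in closed form, which gives a convenient numerical check. The sketch below is a derived illustration rather than a formula quoted from the Appendix: with R² = b² + X² - 2bX cos θ one has sin θ dθ = R dR / (bX), so the integrand (2πar/R) b² sin θ dθ dφ integrates to 4π² a b r (R_hi - R_lo) / X, where R_lo and R_hi are the values of R at θ_min and θ_max for the three cases i1 to i3. The function names are hypothetical.

```python
import math


def overall_density(r, a, b, X):
    """F(r) for non-overlapping spheres, from the limits of cases i1 to i3."""
    if X <= a + b:
        raise ValueError("this closed form assumes non-overlapping spheres")
    r_lo = max(r - a, X - b)  # R at theta_min
    r_hi = min(r + a, X + b)  # R at theta_max
    if r_hi <= r_lo:
        return 0.0
    return 4.0 * math.pi ** 2 * a * b * r * (r_hi - r_lo) / X


def integrate_density(a, b, X, steps=20000):
    """Midpoint-rule integral of F(r) over X - a - b < r < X + a + b."""
    r0, r1 = X - a - b, X + a + b
    h = (r1 - r0) / steps
    return sum(overall_density(r0 + (k + 0.5) * h, a, b, X) for k in range(steps)) * h


# Hypothetical radii and center separation, in angstroms.
a, b, X = 1.5, 2.0, 6.0
print(integrate_density(a, b, X), 16.0 * math.pi ** 2 * a ** 2 * b ** 2)
```

The integral of F(r) over all r equals the product of the two sphere areas, (4πa²)(4πb²), so dividing F(r) by that product would give a normalized probability density.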