average core structures and variability measures for protein families: application to the...

15
JMB—MS 640 Cust. Ref. No. PEW 032/95 [SGML] J. Mol. Biol. (1995) 251, 161–175 Average Core Structures and Variability Measures for Protein Families: Application to the Immunoglobulins Mark Gerstein 1 and Russ B. Altman 2 * 1 Department of Structural A variety of methods are currently available for creating multiple align- Biology, Fairchild D109 ments, and these can be used to define and characterize families of related proteins, such as the globins or the immunoglobulins. We have developed 2 Section on Medical a method for using a multiple alignment to identify an average structural Informatics, MSOB X215 ‘‘core’’, a subset of atoms with low structural variation. We show how the Stanford University, Stanford means and variances of core-atom positions summarize the commonalities CA 94305, USA and differences with a family, making them particularly useful in compiling libraries of protein folds. We show further how it is possible to describe the rotation and translation relating two core structures, as in two domains of a multi-domain protein, in a consistent fashion in terms of a ‘‘mean’’ transfor- mation and a deviation about this mean. Once determined, our average core structures (with their implicit measure of structural variation) allow us to define a measure of structural similarity more informative than the usual root-mean-square (RMS) deviation in atomic position, i.e. a ‘‘better RMS.’’ Our average structures also permit straightforward comparisons between variation in structure and sequence at each position in a family. We have applied our core-finding methodology in detail to the immuno- globulin family. We find that the structural variability we observe just within the VL and VH domains anticipates the variability that others have observed throughout the whole immunoglobulin superfamily; that a core definition based on sequence conservation, somewhat surprisingly, does not agree with one based on structural similarity; and that the cores of the VL and VH domains vary about 5° in relative orientation across the known structures. 7 1995 Academic Press Limited Keywords: alignment, structural; classification; protein structure, substructures; statistical analysis; structure motif *Corresponding author Introduction The number of protein structures in the Protein Data Bank is now quite large and is rapidly increas- ing. A recent estimate puts the total number of chains within the databank at 3000, increasing by about one a day (Orengo, 1994). One way to deal with this huge amount of structural information is by grouping related proteins into families, such as the globins or the immunoglobulins (Levitt & Chothia, 1976; Chothia & Finkelstein, 1990; Richardson, 1981). Members of protein families have similar overall folds but differences in their detailed structure. The classification of the entire databank using protein families has recently been attempted by a number of groups (Johnson et al ., 1990; Sander & Schneider, 1991; Murzin et al ., 1995; Holm et al ., 1993; Orengo et al ., 1993; Pascarella & Argos, 1992; Orengo et al ., 1994), and it has been found that some protein families are quite large. The immunoglobulin family is a case in point. In the databank, there are currently over 25 structures of different antibody molecules, and when the whole immunoglobulin superfamily is considered (Bork et al ., 1994), there are at least 20 additional immunoglobulin-like structures, which include such disparate proteins as the enzyme myosin light chain kinase and the cell-surface receptor CD2. Thus, because of both the great numbers of struc- tures and of families, it has become desirable (even necessary) to summarize the common features within a family, whilst separating out the variable ones. One of the most basic commonalities shared by each member of a family is a set of atoms that occupy the same relative positions in space. Our focus here is in identifying these atoms, and then in charac- terizing them statistically. We show how to construct an average core structure for a protein family in such a way that the average is unbiased and the resulting structure has acceptable stereochemistry. This core 0022–2836/95/310161–15 $08.00/0 7 1995 Academic Press Limited

Upload: mark-gerstein

Post on 18-Oct-2016

214 views

Category:

Documents


1 download

TRANSCRIPT

Page 1: Average Core Structures and Variability Measures for Protein Families: Application to the Immunoglobulins

JMB—MS 640 Cust. Ref. No. PEW 032/95 [SGML]

J. Mol. Biol. (1995) 251, 161–175

Average Core Structures and Variability Measures forProtein Families: Application to the Immunoglobulins

Mark Gerstein 1 and Russ B. Altman 2*

1Department of Structural A variety of methods are currently available for creating multiple align-Biology, Fairchild D109 ments, and these can be used to define and characterize families of related

proteins, such as the globins or the immunoglobulins. We have developed2Section on Medical a method for using a multiple alignment to identify an average structuralInformatics, MSOB X215 ‘‘core’’, a subset of atoms with low structural variation. We show how theStanford University, Stanford means and variances of core-atom positions summarize the commonalitiesCA 94305, USA and differences with a family, making them particularly useful in compiling

libraries of protein folds. We show further how it is possible to describe therotation and translation relating two core structures, as in two domains ofa multi-domain protein, in a consistent fashion in terms of a ‘‘mean’’ transfor-mation and a deviation about this mean. Once determined, our average corestructures (with their implicit measure of structural variation) allow us todefine a measure of structural similarity more informative than the usualroot-mean-square (RMS) deviation in atomic position, i.e. a ‘‘better RMS.’’Our average structures also permit straightforward comparisons betweenvariation in structure and sequence at each position in a family.

We have applied our core-finding methodology in detail to the immuno-globulin family. We find that the structural variability we observe just withinthe VL and VH domains anticipates the variability that others have observedthroughout the whole immunoglobulin superfamily; that a core definitionbased on sequence conservation, somewhat surprisingly, does not agree withone based on structural similarity; and that the cores of the VL and VHdomains vary about 5° in relative orientation across the known structures.

7 1995 Academic Press Limited

Keywords: alignment, structural; classification; protein structure,substructures; statistical analysis; structure motif*Corresponding author

Introduction

The number of protein structures in the ProteinData Bank is now quite large and is rapidly increas-ing. A recent estimate puts the total number of chainswithin the databank at 3000, increasing by about onea day (Orengo, 1994). One way to deal with this hugeamount of structural information is by groupingrelated proteins into families, such as the globinsor the immunoglobulins (Levitt & Chothia, 1976;Chothia & Finkelstein, 1990; Richardson, 1981).Members of protein families have similar overallfolds but differences in their detailed structure. Theclassification of the entire databank using proteinfamilies has recently been attempted by a number ofgroups (Johnson et al., 1990; Sander & Schneider,1991; Murzin et al., 1995; Holm et al., 1993; Orengoet al., 1993; Pascarella & Argos, 1992; Orengo et al.,1994), and it has been found that some proteinfamilies are quite large. The immunoglobulin family

is a case in point. In the databank, there are currentlyover 25 structures of different antibody molecules,and when the whole immunoglobulin superfamily isconsidered (Bork et al., 1994), there are at least 20additional immunoglobulin-like structures, whichinclude such disparate proteins as the enzymemyosin light chain kinase and the cell-surfacereceptor CD2.

Thus, because of both the great numbers of struc-tures and of families, it has become desirable (evennecessary) to summarize the common featureswithin a family, whilst separating out the variableones. One of the most basic commonalities shared byeach member of a family is a set of atoms that occupythe same relative positions in space. Our focus hereis in identifying these atoms, and then in charac-terizing them statistically. We show how to constructan average core structure for a protein family in sucha way that the average is unbiased and the resultingstructure has acceptable stereochemistry. This core

0022–2836/95/310161–15 $08.00/0 7 1995 Academic Press Limited

Page 2: Average Core Structures and Variability Measures for Protein Families: Application to the Immunoglobulins

JMB—MS 640

Average Core Structure162

structure can then be used to characterize the struc-tural variability within a family, to define the averagerelative orientation of domains in multi-domain com-plexes, and to develop new measures of similaritybetween members of the same structural family. Weillustrate our ideas here through application to thearchetypal protein family: the all b-sheet immuno-globulins. Previously, we had demonstrated somepreliminary aspects of the core calculation on the alla-helical globin family (Altman & Gerstein, 1994).Our method for defining regions of low structuralvariation is also useful for the analysis of structuressolved by NMR spectroscopy and generated bymolecular dynamics since both techniques producean ensemble of structures, in a sense, a family of verysimilar structures.

For the purposes of this discussion, we define‘‘core atoms’’ as those having the same local confor-mation (i.e. secondary structure) and the same globalconformation relative to their non-bonded neighborsacross a family. This is similar to core-structureconcept developed extensively by Chothia and Lesk,who have used it for such applications as analyzingprotein motions (Lesk & Chothia, 1984; Chothia &Lesk, 1986; Lesk, 1991). However, Chothia and Leskconfine their core-structure calculations to compari-sons between pairs of proteins, while we aim togeneralize the calculations so that they are applicableto the many proteins in a family. Many other investi-gators have also used the term ‘‘core structure’’ butin a different sense from that used here. For instance,Bryant & Lawrence (1993) define a core in terms ofconserved secondary structure elements, and othersdefine a core structure based on measures of se-quence conservation or hydrophobicity (Swindells,1995). As will be discussed later, our results indicatethat a core based purely on structural considerationsis not the same as one based on sequence consider-ations, so, clearly, these definitions of ‘‘core’’ do notalways coincide.

Practically, our method directly builds upon thelarge amount of recent work on superposition offamilies of structures. This work has either focusedon finding the optimum superposition of a seriesof structures in which the corresponding atomshave been defined (Diamond, 1992; Gerber & Muller,1987; Kearsley, 1990; Shapiro et al., 1992) or on findinga structural alignment between a pair of structuresfor which the corresponding atoms have not beendefined (Taylor & Orengo, 1989; Sali & Blundell, 1990;Holm & Sander, 1993; Subbiah et al., 1993; Yee & Dill,1993). Our core finding builds on both of these foun-dations: structural alignments are refined iterativelyto include only low variance ‘‘core’’ atoms using thetechniques of series superposition.

Because the average cores we define have reason-able stereochemistry, coupled with a consensus se-quence, profile, or hidden Markov model (Gribskovet al., 1990; Bowie et al., 1991; Overington et al., 1992;Krogh et al., 1994), they may also be useful as startingpoints for a variety of homology modeling tasks.Furthermore, a library of carefully constructed corestructures could eventually prove useful for sum-

marizing the roughly 1000 folds that are thought tooccur in nature (Chothia, 1992). This is particularlytrue with regard to the nine superfold structures(Orengo et al., 1994). A library of core structurescould also help speed up threading calculations,which match a sequence against a collection ofstructures (Ponder & Richards, 1987; Sippl, 1990;Finkelstein & Reva, 1992; Jones et al., 1992; Bryant &Lawrence, 1993; Madej & Mossing, 1994), and makethem less sensitive to the non-essential details ofindividual structures.

Once a core for a family of structures has beencalculated, it is possible to use it to assess the simi-larity of two structures in the family in a better waythan the usual root-mean-square (RMS) deviationin atom positions after doing a fit. The problem withthe usual RMS value is that it weights each atomposition equally and it gives equal weight to devi-ations in any direction. That is, a poorly fitting atomin the core of a structure is given equal weight to apoorly fitting atom in a highly variable surface loop.We use the variances calculated as part of the averagecore structure to weight the deviations and then usethese ‘‘calibrated’’ deviations to calculate a betterRMS.

For a multi-domain protein, such as the antibodymolecule, one could compute an average core struc-ture for all the domains taken together. However, thisaverage structure would have the effect of any rigid-body motion between the domains spread through-out it in a correlated and highly redundant fashion.Consequently, we describe the average core structureof a multi-part protein in terms of the individual corestructures of its component parts and then theaverage rigid-body positioning of one part relative toanother. We, furthermore, show how this averagepositioning can be described in terms of a mean anda variance in a manner that is completely consistentwith the formalism we use to construct the averagecore structures.

Methods

The methods developed and applied in this workfall into five categories: finding an average core,characterizing structural variation using this core,calculating sequence variation in a way that can berelated to structure variation, using our calculation ofstructural variation to define a better RMS, anddefining the average positioning of two cores.

Core-finding algorithm

The core-finding algorithm has been described indetail (Altman & Gerstein, 1994). It can besummarized as a five-step process: (1) We start withan ensemble of aligned structures (e.g. all theimmunoglobulin structures after they have beenaligned according to the Kabat numbering). Initially,we consider each atom position that occurs in everyaligned structure to be a member of the of the‘‘putative core.’’ (2) We construct an unbiased

Page 3: Average Core Structures and Variability Measures for Protein Families: Application to the Immunoglobulins

JMB—MS 640

Average Core Structure 163

average structure from all these putative corepositions, using one of the available methods forsuperimposing a series of corresponding structures(Altman & Gerstein, 1994; Diamond, 1992; Gerber &Muller, 1987; Kearsley, 1990; Shapiro et al., 1992). (3)We determine the spatial variation of each group ofaligned atoms about the average structure in terms ofthe volume of an ‘‘error ellipsoid’’ (as described inthe next section). (4) We remove from the putativecore the position with the largest structuralvariation. We then return to step 2 with a smallercore. We repeat this process until all alignedpositions have been removed from the core. (This isa generalization of the ‘‘sieve-fit’’ proceduredescribed by Gerstein & Chothia, 1991.) The result isan ordered list of atomic positions, ranked accordingto structural variability, which we call the ‘‘throw-out’’ order. (5) Based on a variety of different criteria,discussed in detail in the legend to Figure 3, we pickone cycle in this overall core-finding procedure asbest representing the separation between core andnon-core atoms. The average structure calculated inthis cycle is what we call the ‘‘core structure.’’

Calculating the structural variation of agiven position

When all the structures in the ensemble are fit tothe average core structure, it is possible to quantifythe structural variation at each aligned positionusing an ‘‘error ellipsoid’’: We summarize thevariability of each aligned position i over allstructures in the ensemble of structures by using thevariance/covariance matrix C. Each element in this3 × 3 matrix, represented by cov(m, n), is thecovariance between two coordinates, m and n, wherem and n can be 1, 2 or 3 (representing the x, y and zcoordinates), over the ensemble at position i. Thus,for instance, cov(1,3) represents the covariancebetween the x and z coordinates, and the diagonalelements of the matrix, cov(m, m), contain thevariance in coordinate m at position i over theensemble. The variance/covariance matrix can betranslated into an ellipsoid of errors centered at themean position of the atom xi . To find the orientationand axes lengths of the ellipsoid, the matrix isdiagonalized in standard fashion to give:

Ci = Ri Si R − 1i (1)

where Ri is a rotation matrix that specifies theorientation of the principal axes of the ellipsoid, andSi is a diagonal matrix of eigenvalues. Thesquare-roots of the eigenvalues (denoted by sx , sy

and sz ) give the standard deviation of thedistribution in its three principal directions and areconventionally taken as the lengths of the semi-axesof the ellipsoid. Consequently, if the distribution ofatomic positions is a three-dimensional normaldistribution (for which we provide evidence inResults), then an ellipsoid drawn at one standarddeviation should contain approximately two-thirdsof the atoms in the sample. In order to get a singlescalar estimate of the amount of variation in atomic

position, we calculate the volume of the ellipsoidfrom the eigenvalues:

V = 43 psxsysz = 4

3 pzdet C (2)

Calculating the sequence variation of agiven position

One of the advantages of our ‘‘volumetric’’measure of structural variability is that it can bedirectly compared with measures of sequencevariability at each aligned position, to see whetherthey are correlated. We calculate sequence variabilitythrough the computation of an information-theoreticentropy (Schneider & Stephens, 1991; Schneider et al.,1986; Shenkin et al., 1991). In particular, we measurevariability of a given position i in the alignment interms of its entropy relative to that if the sequenceswere aligned randomly:

Rseq(i ) = s20

t = 1

f� (t)log2 f� (t) − s20

t = 1

f(i, t)log2 f(i, t) (3)

where the first and second terms represent thestandard Shannon entropy H for the random andactual alignments, f(i, t) is frequency of amino acidt at position i, and f� (t) is the average frequency ofresidue type t over the whole alignment.

Using the core to calculate a ‘‘better RMS’’

Once a core for a family of structures has beencalculated, it can be used to assess the similarity oftwo structures in the family in a better way than theusual RMS value in atom positions. The basic idea isthat one uses the amount of structural variationobserved at each position in the core to scale theinteratomic distances between two structures. If anatom position has low structural variation in the corestructure and if the difference in the position of thecorresponding atoms in two structures is largerelative to this variation, then this difference shouldcontribute more to the overall difference between thetwo structures. Conversely, if the difference inposition of corresponding points is small relative tothe variation, then it should contribute less,regardless of the absolute value of the difference.

In the usual RMS measure of similarity, DRMS, ateach residue position i, one takes the vectordifference between the coordinate positions in thefirst structure and the second structure

di = x1i − x2i (4)

Then one averages the squares of these differences(i.e. the Euclidean distance) and takes the square-root to get the RMS value:

DRMS = z�d2i � (5)

where the brackets denote averaging over theresidues position index i. It is obviously harder to fitmore residues well, and the overall RMS value, fromfitting two arbitrarily selected segments of proteinstructure of length M, increases proportionately with

Page 4: Average Core Structures and Variability Measures for Protein Families: Application to the Immunoglobulins

JMB—MS 640

Average Core Structure164

the square-root of M (McLachlan, 1984; Remington& Matthews, 1980).

The usual RMS measure weights the difference di

at each residue position equally. However, if a givenposition is highly variable in all the structures usedin the core-finding procedure, it might be morereasonable to down-weight the coordinate differenceat this position. Furthermore, the structuralvariability in a family may be oriented prefer-entially along a certain direction, so one would wantto weight down the variation in this direction morethan in the other directions. Using the ellipsoid oferrors representation for structural variabilitydeveloped in the previous sections, it is possible tomake such a compensation. The weighted coordinatedifference at each position wi is the normal distanceexpressed in the units of standard devi-ation along the principal axes of the errors ellipsoid(discussed above):

wi = S − 1/2i R − 1

i di (6)

The above formula expresses the operation oftranslating and rotating the atoms into the coordinatesystem of the errors ellipsoid and then scaling theirseparation by an amount inversely proportional tothe lengths of its principal axes. The root-mean-square (i.e. RMS) of this ‘‘standard-deviation’’distance can then simply be computed to give, whatwe call an SD-RMS: DSD = z�w2

i �.Because wi and the SD-RMS are expressed in units

of standard deviation, they can not simply be relatedback to normal RMS values, which are usuallyexpressed in units of A. Consequently, it is helpful tointroduce a conversion of the standard deviation‘‘unit’’ back into A. For each position, we define a‘‘calibrated-A’’ distance ai to be

ai = DRMS

DSDwi (7)

where the ratio DRMS/DSD is the ‘‘conversion factor’’between the average standard deviation unit foratom i and A. Thus, the calibrated-A distanceexpresses the coordinate difference between twoatoms in angstrom units that have be inflated ordeflated according to the structural variabilityobserved at position i in the family of structures.Note that the RMS value of the calibrated-Adistances (which is taken over all atoms) isnecessarily the same as that of the standard RMSvalue. However, at particular positions the cali-brated-A distance will be larger than the normalCartesian distance if there is little variability in thefamily; i.e. these are expensive ‘‘A.’’ Conversely, ifthere is much variability in the family, the calibrateddistance will be less than the normal distance.

Finally, it is possible to do this whole analysis withone of the two structures as the average structure, i.e.x2i = xi . This allows one to decide which of anensemble of structures is most like the meanstructure overall or at a given residue position.

Finding the average orientation of two rigid cores

For a multi-domain protein, such as theimmunoglobulins, one could compute an averagecore structure for all the domains taken together.However, this core structure would have the effect ofany rigid-body motion between the domains spreadthroughout all of its error ellipsoids in a correlatedand highly redundant fashion. This would obviouslycreate problems in the core-finding procedure, as onewould tend to throw out all the atoms of a mobiledomain together, and it would make detecting thesubtle structural variations within a mobile domainmore difficult. Consequently, it is useful to describethe average core structure of a multi-part protein interms of individual core structures of its componentparts and then the average rigid-body positioning ofone part relative to the other. The variation about thismean positioning would summarize the range oforientations that are adopted throughout theensemble of structures.

Describing the average relative position of tworigid bodies involves calculating an averagetranslation and an average rotation. The translation isstraightforward. We treat all the translation vectorsexactly the same way as the atom coordinates at onealigned position in a family. That is, to get a meantranslation we just average the vectors. Likewise, todescribe the variation, we can form a variance/co-variance matrix C from the translation vectors andthen compute an errors ellipsoid volume from thedeterminant of this matrix. This allows us to describea rigid-body translation in exactly the same languageas we use to describe the individual atom positions.

Describing an average rotation is not so simple. Asrotations do not commute, one can not simplycompute a normal (commutative) average of all thecomponents in a list of rotation matrices. However,for small rotations, it is possible to express eachrotation as a vector and then average these in astraightforward fashion. By a small rotation we meanone with a small rotation angle u, such that sin u1u.This is usually about 10°. The vector representationwe use for rotations are the four quaternionparameters, qT = (l m n s) (Altmann, 1986). These aresimply related to the rotation angle u and thedirection cosines of the rotation axis l, m and n:

qT = (ls ms ns c) (8)

where:

s = sin u2 and c = cos u

2

As opposed to the direction cosines, for which therotation angle is treated distinctly differently fromthe rotation axis, quaternions have the advantage thatall four components enter on essentially equalfooting. Consequently, in averaging rotations it isreasonable to average each component of series ofquaternions in an equivalent and symmetric fashion.Since each quaternion has unit length (i.e.q2 = l2 + m2 + n2 + s2 = 1), we treat the averagequaternion as a vector constrained to be on the unit

Page 5: Average Core Structures and Variability Measures for Protein Families: Application to the Immunoglobulins

JMB—MS 640

Average Core Structure 165

Table 1. Structures and ensemblesA. Structures used

PDB Immunoglobulin Species Reference Chains

4FAB 4-4-20 Mouse Herron et al. (1989) H k1HIL 17/9 Mouse Schulze-Gahmen et al. (1988) H k1NCA NC41 Mouse Colman et al. (1987) H k2FB4 KOL Human Marquart et al. (1980) H l1DBA DB3 Mouse Arevalo et al. (1993) H k1IGF B1312 Mouse Stanfield et al. (1990) H k2FBJ J539 Mouse Suh et al. (1986) H k1MCP McPC603 Human Satow et al. (1987) H k6FAB 36-71 Mouse Strong et al. (1991) H k1FDL D1.3 Mouse Amit et al. (1986) H k2HFL HyHEL-5 Mouse Sheriff et al. (1987) H k3HFM HyHEL-10 Mouse Padlan et al. (1989) H k1REI REI (VL dimer) Human Epp et al. (1975) k

B. Ensembles used

Number of Average, Min, and Max RMSaligned Number of between structures in ensemble

Ensembles atoms structures (A per atom)

VL (k) 90 12 0.79 0.44 1.33VH 99 12 1.13 0.49 1.74Common to VL and VH 89 24 1.43 0.44 2.16

All immunoglobulin structures are of uncomplexed antibodies except for 2HFL, 3HFM and 1FDL. All thestructures were taken from the protein databank (Bernstein et al., 1977).

sphere (in four dimensions), and we average anumber of orientations by vector summing theirquaternion vectors and then normalizing the result.That is:

q =Sqj

>Sqj> (9)

where the qj are individual rotations and q is theaverage rotation†.

The beauty of the quaternion representation is thatsince it uses vectors to describe orientation, it ispossible to describe the variation in orientationsusing the same formalism we used above to describethe variation of atom positions in core structures andof the translations. Since both the four-componentquaternion and the direction cosines (l m n) arenormalized to have a length of 1, we find that, forsmall rotations, the variation in orientation can bedescribed completely in terms of RMS variation inthe rotation angle, z�u2�.

In averaging rotations, one additional point mustbe considered. Often the rotation that one wants toaverage is conventionally described as large. Forinstance, the average orientation of one immunoglob-ulin domain relative to the other (i.e. VL to VH) isusually described in terms of a pseudo 2-fold axis ora 180° rotation. This is hardly a small rotation and sowould not appear amenable to the averag-ing scheme discussed above. However, in comparinga series of Fv fragments, the difference in orientation

of one VL relative to another VL is small.Consequently (as shown in Figure 8), it possible todo the averaging calculation in ‘‘bootstrap’’ fashion.One finds the difference in orientation of all VLdomains as compared with an arbitrarily chosen firstone and then averages these small differences. Thenone combines the large rotation of the first VL relativeto VH with average of the difference rotations to getthe correct average rotation of VL relative to VH.Differences relative to this unbiased average can thenbe computed to get the variation in orientations.

Results

Immunoglobulin core

We started with the 12 VL (k) domains and 12 VHdomains listed in Table 1A. As shown in Figure 1, weused the Kabat numbering scheme (Kabat et al., 1983)to align the VL and VH domains individually, and weused the structural alignment given by Chothia &Lesk (1987) to align the VL and VH frame-works with each other. After removing the threedifficult-to-align hypervariable loops, we started ourcore-finding with 90 aligned positions for VL, 99 forVH and 89 for the combined VL-VH. We found coreswith 70 Ca atoms for VL, 87 for VH, and 52 for VL-VHcombined. The error ellipsoids for cores are shown inFigure 2.

Because of the way we chose the core versus non-core threshold, all the core structures had acceptableCa stereochemistry. As discussed for Figure 3, theCa-Ca bond lengths were all nearly 3.8 A and theCa bond and torsion angles were within normalranges.

† Note that for small u, 2qT1(ul um un 2), soaveraging quaternions is very similar to averaging thevector representation for infinitesimal rotations(Goldstein, 1980).

Page 6: Average Core Structures and Variability Measures for Protein Families: Application to the Immunoglobulins

JMB—MS 640

Average Core Structure166

Figure 1. Outline of immunoglobulin structure. A, Anoutline of the structure of an immunoglobulin Fv fragment,looking down the pseudo-2-fold axis at the hypervariableloops. The nine principal strands in each domain areindicated by boxes, along with the standard strand names(A to G). (Loops 1 to 3 are labeled, and strand A' is omittedfor clarity.) B, A table that gives a detailed overview of thestructural role of each immunoglobulin residue in con-juction with the degree of structural variability found forit by our core-finding procedure. The leftmost two columns(a) show the Kabat numbering (Kabat et al., 1983) for theantibodies and the alignment we used between VL andVH. Black bars mark loops 1, 2 and 3, the three hyper-variable loops. The first annotation column (b) showsa canonical labeling scheme for the residues within theb-sheets, which is derived from the work of Chothia andcolleagues (Lesk & Chothia, 1982; Chothia et al., 1985, 1988,1989; Chothia & Lesk, 1987; Harpaz & Chothia, 1994). Theinner sheet that forms the VL-VH interface consists ofstrands A, C, C', C", F and G, and the outer sheet consistsof strands A, B, D and E. The next annotation column (c)gives additional information about the structural role of theresidues: x, W, and w are b-strand residues that, respect-ively, participate in the VL-VH interface (x), pack betweenthe 2 sheets (W), or are not in the two preceding categories(w); b are residues in a b-bulge; C and W are the conservedCys and Trp; p are the residues (with the conserved C andW) that form the conserved pin identified by Lesk &Chothia (1982) that holds the two sheets together; v and gare residues in interstrand loops on the side of the moleculenear or not-near the hypervariable loops. The columnsmarked throw-out order (d) give the order that residueswere thrown out in the core-finding procedure when it wasapplied only to the VL structures, only to the VH struc-tures, and to the VL and VH structures taken together. Forthe latter calculation, the 12 VL and 12 VH structures werealigned based on this table and then the core-findingroutine was applied to this whole group of 24 structures togenerate an average structure that is representative of bothVL and VH. The throw-out numbers marked with filledboxes correspond to residues that were not included as partof the core and those simply boxed correspond to the resi-dues that were the last 15 residues to be thrown out. It isnotable that the two conserved Cys residues and one con-served Trp were among the residues to be thrown out last.

A

B

The throw out order was very similar for all threeruns: from the original positions we first threw outloop regions: in particular, the long loop connectingthe C" strand to the D strand; and then A' and C"strands and the ends of G and C' strands. The secondgroup of strands to be thrown out included A and D;the third group included E and the rest of G and C';and the last group included B, C and F.

The core structures included all of strands A, B, C,D, E and F and most of the strands C' and G. Onlyone of the strand residues sandwiched between thetwo sheets and none of the strand residues formingthe VL-VH interface were thrown out in the VL orVH cores. (The sandwich and interface residues wereidentified by Chothia et al. (1988) and indicated by‘‘,’’ and ‘‘×’’ in Figure 1). The last residues to bethrown out include the absolutely conserveddisulfide bridge, the buried Trp (C4), and theresidues immediately surrounding the disulfidebridge (B5, F3, C4, B3, B4, E2, E3, E4 and F2, indicated

Page 7: Average Core Structures and Variability Measures for Protein Families: Application to the Immunoglobulins

JMB—MS 640

Average Core Structure 167

A

B

Figure 2. Average core structures of the immunoglobulins. A, The mean positions and ellipsoids are shown for the 89atoms in the common alignment of the combined VL-VH domain (left). The non-core Ca atoms (with ellipsoids at twostandard deviations) are shown (center). The core atoms are shown (right). The central portions of the b-strands make upthe core. B, The cores of the VL and VH domains positioned by the average transformation (discussed in the legend toTable 2) to yield an average Fv fragment. The view is the same as in the schematic in Figure 1, down the pseudo-2-fold axis.

by a ‘‘p’’ Figure 1). These residues constitute the‘‘pin’’ holding together the two immunoglobulinsheets (Lesk & Chothia, 1982).

We performed core calculations similar tothose described above, starting with a larger subsetof aligned atoms: i.e. we started with all back-

bone atoms or with all possible backbone atomsplus all alignable side-chain atoms. These calcu-lations resulted in essentially the same core structureand throw-out order, indicating that Ca atoms alonewere enough to define the essential features of thecore.

Page 8: Average Core Structures and Variability Measures for Protein Families: Application to the Immunoglobulins

JMB—MS 640

Average Core Structure168

Figure 3. Determining a coreversus non-core threshold. Depend-ing on the specific application, avariety of measures can be used toposition the threshold within thethrow-out order to separate corefrom non-core. Most of thesemethods focus on means and vari-ances derived from the distributionof core and non-core ellipsoidvolumes. The distribution of thesecore and non-core volumes is shownhere at cycle 17 in the core-findingprocedure. This cycle was pickedbecause it maximizes the differencebetween the average core and non-core volume. This often is a usefulcore versus non-core threshold. How-ever, for consistency with previouswork (Altman & Gerstein. 1994), we

used a slightly different threshold. We placed the threshold at a point that maximizes the variance in the ellipsoid volumesof the non-core atoms (i.e. the width of the thick-line distribution). This threshold implies that core atoms have a uniformlysmall ellipsoid size and this, in turn, makes it possible for the core structure to have particularly good Ca stereochemistry.Three parameters characterize the stereochemistry of an a-carbon structure (Levitt, 1976): (1) the distance between twoconnected Ca atoms, which should be 3.8 A; (2) the angle t between three connected Ca atoms, which can range betweenapproximately 80° and 135°; and (3) the pseudo-torsion angle a between four connected Ca atoms, which can acceptablyrange from −180° to +180° and so does not form a meaningful constraint on Ca structure. We tabulate statistics on the Ca-Ca

virtual bond length and the t angle below.

Bond length Bond angle t (deg.)Number Number

Standard deviation within outsideAverage acceptable acceptable

Core structure (A) (A) (%) range (5) range

Immunoglobulin 3.77 0.051 1.4 35 0(combined VL and VH)

Immunoglobulin VL 3.78 0.061 1.6 53 3Immunoglobulin VH 3.76 0.039 1.0 68 5

No correlation between sequence andstructure variation

Figure 4 shows the relationship between sequencevariation, measured by Shannon entropy, and struc-tural variation, measured by ellipsoid volume. Wefind there is no significant correlation between them(discussed more fully in the legend to Figure 4).This is true whether we consider all the alignedpositions (89), the core positions (52), or just thenon-core positions (27).

SD-RMS calculations

After calculating the VL-VH combined core, we fiteach of the 24 immunoglobulin variable domains toit and then computed RMS and SD-RMS distancesbetween all pairs. We then used these distances tocluster the immunoglobulins into two different trees,which are shown in Figure 5. The trees from thenormal RMS and SD-RMS calculations are distinctlydifferent, having a rank correlation of only 00.8. Thismeans that the SD-RMS is not a redundant calcu-lation and does indeed provide new and potentiallyuseful information to use in comparing structures.Furthermore, the comparison of the trees illustrates

well how SD-RMS de-emphasizes the usual sourcesof variation between structures and accentuates theunique differences of each individual structure.

Figure 6 illustrates the use of the calibrated-Ameasure: for two representative variable domains,we compare at each aligned positions the standardCa-Ca distance with the calibrated-A distance. Inregions of little structural variability, such as in theB strand, the calibrated distance is greater than thenormal distance. This is because even smalldifferences between structures are very significantin this highly conserved region. The contrastingsituation is observed in highly variable regions, suchas the C"D turn, where the calibrated distance isusually less than the normal distance.

In order to evaluate the degree to which a three-dimensional normal distribution accurately de-scribes the distribution of actual atom positionsaround the average core structure, we plotted thedisplacement of each atom in each variable domainstructure from the average core structure (after fittingall structures to the calculated core). To put all thedisplacements on the same scale, we expressed themin SD units as weighted coordinate differences wi .Figure 7 shows the distribution of these weightedcoordinate differences for three arbitrarily selected

Page 9: Average Core Structures and Variability Measures for Protein Families: Application to the Immunoglobulins

JMB—MS 640

Average Core Structure 169

Figure 4. Sequence variation versus structure variation.Graph of sequence variation versus structural variation foreach immunoglobin position in the combined VL-VHalignment. At a particular position, structural variation ismeasured by the volume of the covariance matrix ellipsoidrelative to that of the smallest ellipsoid, expressed on alog-scale (so the variation in core and non-core volumes canbe shown together). Sequence variation is measured in bitsper residue as the information content of a given positionin the alignment relative to that if the sequences werealigned randomly (as described in Methods). There are 89positions represented in total here and the overall Pearsoncorrelation coefficient is 0.35. The 54 core positions areshown by open squares, non-core as filled squares. Thecorrelation between information content and ellipsoidvolume for just the core positions is 0.19; and for just thenon-core positions, 0.10. If structural variation werecorrelated with sequence variation, one would expect thepoints to lie on a line such that small ellipsoids would beassociated with a large difference in information contentrelative to the random sequence (and vice versa for largeellipsoids).

Figure 5. Structure clustering determined by normalRMS versus by SD-RMS. We fit each of the 24 immuno-globulin variable domains onto the VL-VH combined core.Then, for each of the 24 × (23/2) pairs of structures, wecomputed the normal RMS measure of similarity andthe SD-RMS. Trees are a way of visually displaying thesepairwise distances. We made trees clustering the immuno-globulin structures using normal RMS values (top) and ourSD-RMS (bottom). Since these trees were made only on thebasis of structural similarity, they are not expected to bedirectly comparable with phylogenetic trees made on thebasis of sequences. The trees were made with the PHYLIPpackage (Felsenstein, 1989, 1993). The Spearman rankorder correlation between the ordering of the structuresbased on normal RMS and on SD-RMS is 0.80 (with aprobability of less than 0.00001 of the null hypothesis thatthe two orderings are the same).

a-carbon atoms and the aggregate distribution ofweighted differences derived from all a-carbonatoms. The distributions are unimodal, peaked atzero and nearly symmetric. Thus, to a good degreethey can be considered normal.

Average orientation

The average positioning of the VH domains rela-tive to a VL domains is described in Table 2. Weexpress this positioning as the rotation and trans-lation necessary to superimpose a VL domain ontoa VH domain. As described in the legend to the Table,we chose our coordinate system so that the averagetransformation could simply be described as a 173°rotation around the z-axis followed by a 24.4 Atranslation along the x-axis.

Once we found the mean transformation we couldcompute deviations about this mean for each struc-ture. These deviations are the incremental rotationand translation that need to be applied after themean transformation to correctly position the VLand VH domains in each Fv fragment. We find that,on average, the incremental rotation is 5.4° and the

incremental translation is 0.94 A. The incrementalrotations appear to be roughly equally spread amongthe three axis directions (x, y and z). However, mostof the incremental translation is in the z direction(parallel with the pseudo-2-fold), while the least is inthe x direction.

The amount of rotation and the amount oftranslation required are fairly well correlated (witha correlation coefficient of 0.79) so that structures thatrequire significant additional translation also requirefurther rotation. Structures that are close to the meanorientation are HyHEL-10, which has the smallestincremental translation (0.26 A) and a small incre-mental rotation (4.5°), and B13I2, which has a smalltranslation (0.43 A) and the smallest rotation (3.6°).The structure farthest from the mean is NC41, which

Page 10: Average Core Structures and Variability Measures for Protein Families: Application to the Immunoglobulins

JMB—MS 640

Average Core Structure170

Figure 6. Measuring structuraldeviation with calibrated Ang-stroms. The relationship betweennormal distance in A andcalibrated-A distance, which isscaled according to the size of theerrors ellipsoid. The thick line in thetop part of the graph shows normaldistance deviations (in A) betweentwo representative immuoglobulinstructures (the VH of 4-4-20 and theVH of NC41) after they both havebeen fit onto the combined VL-VHcore structure. The thin line at thebottom of the graph shows the aver-age structural variation assigned toeach position in computing the corestructure. This variation is expressedas the volume of an error ellipsoid (incubic A3) at each position. The thickline shows the same distance devi-

ations as the thin line, but now calibrated according to the amount of variation at each position. The RMS value of thesecalibrated-A deviations and the normal distance deviations are necessarily the same and are represented here by thehorizontal broken line. Note that in the B strand, which runs from 18 to 24 and is a region of the immunoglobulin structurewhere there is little variation within the family, the normal distances between the two structures are small and beneaththe overall RMS value, but the calibrated distances are large and above the line. The converse is true for positions in thehighly variable C"D turn (from 59 to 67), where the calibrated distances are less than the normal distances.

has both the largest rotation and largest translation(7.6° and 1.8 A). Note that even this maximumrotation is rather small, so the assumptionsunderlying the rotational averaging procedure arefully satisfied.

Discussion

Methodology

Our core-finding procedure requires an initialalignment between the structures in a family. It theniteratively refines this alignment, throwing out atomswith high spatial variation. It is designed to work onfamilies of relatively similar structures, such as theglobins or immunoglobulins, where it is possible toget an initial structural alignment by eye or bysequence alignment. We, clearly, depend on a high-quality alignment as a starting point, and the avail-ability of automatic structural alignment procedures(Taylor & Orengo, 1989; Sali & Blundell, 1990; Holm& Sander, 1993; Subbiah et al., 1993; Yee & Dill, 1993)increases the applicability of our method. Consider-ing the analogy with methods for multiple sequencealignment (Subbiah & Harrison, 1989), we believeour method for finding and refining an unbiasedaverage could extend these pairwise structurealignment procedures to allow them to performmultiple structural alignment.

The basic core-finding procedure we present isapplicable only to monomeric proteins. However, weshow how it is possible to extend it for use onmulti-domain proteins such as the immunoglobulinsin an intelligent and consistent fashion. Since theorientation averaging calculations are not that com-putationally expensive, we see no limitations in

applying our calculations to even larger assembliesthan immunoglobulin Fv fragments. Furthermore,our formalism for describing average positioning oftwo rigid cores has obvious applications for describ-ing the rigid-body motion of domains (Gerstein et al.,1993, 1994) and the rigid-body docking of macro-molecular complexes, such as a protein binding toDNA (Suzuki et al., 1994, 1995).

The core structures generated by our procedureexhibit acceptable stereochemistry. This is a naturalconsequence of the way we discard the most variablepositions and only average a-carbon atoms (and thusnever have to worry about averaging over a flippedpeptide). Moreover, because of their acceptablestereochemistry and their variability measures ateach site, our core structures might provide goodstarting points for model-building and threading,and this, in turn, suggests that they may provide auseful representation to use in building up a compactlibrary of folds.

Another application for our core structures is thatthey allow one to compare structures using theSD-RMS and calibrated-A. These measures of struc-tural similarity emphasize differences betweenstructures in regions of low variability and discountthem in regions of high variability. Consequently,they are useful in highlighting structures withparticularly distorted core geometry and in assessingwhether a particular structure differs from the othermembers of the family in a typical or unique fashion.

Unlike other aspects of our procedure, the waywe represent the structural variation of a particularsite with an error ellipsoid is somewhat arbitrary.Underlying such a representation is a normal(i.e. symmetric and unimodal) model for the distri-bution of actual atom coordinates around the average

Page 11: Average Core Structures and Variability Measures for Protein Families: Application to the Immunoglobulins

JMB—MS 640

Average Core Structure 171

Figure 7. Distribution of atomic positions about theaverage structure. The distribution of displacements fromthe average coordinates in the 24 immunoglobulin variabledomains. We fit each structure to the core and thendetermined the weighted coordinate differences from themean for each atom. The thin lines show a histogram of thex component of these differences for Ca atoms at positionsC2, A2, C"2 and F5 (using the strand numbering shown inFigure 1). The thick line shows a histogram of all theweighted coordinate differences aggregated together.Thus, this histogram was constructed from 6408 specificcoordinate differences (=3 coordinates × 89 atoms × 24structures).

criterion. Alternatively, one could choose a corecutoff based on minimizing the overlap between theellipsoid volume distributions of core and non-coreatoms.

While the particular dividing line between coreand non-core is arbitrary, the throw-out order gener-ated by our procedure is not. This ordering isessentially a ranking of the atoms by their structuralvariability in the family. The throw-out order doesnot depend on whether one starts core-finding withall possible aligned atoms, just main-chain atoms, orjust Ca atoms. Residues appear to be thrown out asunits. This suggests that for finding an average corestructure for a family one gets most of the relevantinformation by just using a-carbon atoms.

Immunoglobulin core

Our analysis of the immunoglobulins shows thatthe throw-out order can be quite biologically illumin-ating. The immunoglobulin superfamily has recentlybeen divided into four groupings on the basis ofsequences and structures: the V-set, the C1-set,the C2-set and the I-set (Harpaz & Chothia, 1994;Williams & Barclay, 1988). The V-set includes theimmunoglobulin variable domains (i.e. VL and VH)as well as parts of the T-cell receptor (e.g. domain 1of CD2). Each molecule in the V-set contains the tenb-strands found in VL and VH (A, A', B, C, C', C", D",E, F and G) with the exception of CD2 and CD4,which are missing strand A. The C1-set includesthe constant domains of the immunoglobulins andvarious parts of major histocompatibility complex(MHC) molecules. In comparison with the V-set, itis missing all of strands A' and C", and the endsof strands G and C'. The C2-set is similar to theC1-set but is also missing strand D. The I-set, whichincludes cell-adhesion molecules and surface recep-tors, is intermediate between the V-set and the C1-setin that it contains A' and the end of G but doesnot contain C" or the end of C'. All immunoglobulinmolecules contain a highly conserved pin holdingtogether the two sheets (Lesk & Chothia, 1982; Borket al., 1994): this consists of two pairs of interlockingstrands (B-C and E-F), which contain the conserveddisulfide and Trp, and the residues contacting them.

The throw-out order we found during the im-munoglobulin core-finding is very consistent withthe division of the immunoglobulin superfamily. Thefirst strands thrown out (C", A' and the ends of C' andG) were the most variable strands in the superfamily,determining whether a molecule is in the V-set, I-set,and so forth. The next grouping of strands thrownout included strands A and D. The presence orabsence of these strands separates VL and VH fromother members of the immunoglobulin family in aless fundamental way than C or A. Finally, the laststrands to be thrown out were B, C and F, whichcontain the conserved disulfide and make up thebulk of the pin.

Thus, a ranking of structural variability within thevariable domains is consistent with structural vari-ability within the whole immunoglobulin super-

structure. Our results show that, at least for theimmunoglobulins, this assumption of normality isreasonable. However, there are clearly cases (e.g.particular surface side-chains) in which the distri-butions may be multimodal and asymmetric. Inthese cases, neither our core-finding algorithm norour SD-RMS calculation would lose its applicabilityor validity since it is still completely valid to performcalculations based on the first two moments, i.e.mean and variance, of an asymmetric, multi-modaldistribution. However, certain common interpret-ations of our average core structures would not bevalid (e.g. that the average structure is the mostprobable location for atoms in a family or that a onestandard deviation contour contains two-thirds ofthe atoms in a family).

Another aspect of procedure that is somewhatarbitrary is the way we choose a threshold betweencore and non-core. We used the variance of the non-core atoms as a criterion because it was straight-forward to calculate and produced consistent results(as discussed in the legend to Figure 3). For specificapplications one might want to choose a different wayof drawing the line between core and non-coreatoms. For instance, one could use a particular maxi-mum ellipsoid volume (e.g. 1 A3) as a cutoff for coreatoms. Such a cutoff would have the advantageof being even more directly related to the stereo-chemistry of the resulting core than is our present

Page 12: Average Core Structures and Variability Measures for Protein Families: Application to the Immunoglobulins

JMB—MS 640

Average Core Structure172

Table 2. The average positioning of immunoglobulin VL and VH domainsTranslation (A) Rotation (deg)d

Components in Components ineach direction Magnitude each direction Magnitude

T(x) T( y) T(z) =T = R(x) R(y) R(z) u

Meana 24.4 0 −0.17 24.4 0 0 173 173

Deviation from meanb

HyHEL-10 0.18 −0.18 0.00 0.26 3.9 0.7 −2.2 4.5B1312 0.21 −0.35 0.13 0.43 3.4 −1.0 0.3 3.6McPC603 0.18 −0.29 0.28 0.44 −3.2 −1.7 0.5 3.617/9 0.30 −0.14 −0.51 0.61 1.3 2.9 −1.8 3.736-71 −0.12 0.43 0.63 0.77 −0.2 −2.7 3.3 4.3KOL −0.36 −0.64 −0.46 0.86 −2.0 2.3 −3.3 4.5HyHEL-5 −0.11 0.62 −0.63 0.89 4.2 0.8 −1.4 4.5DB3 0.51 0.23 0.78 0.96 1.3 −4.9 5.2 7.3D1.3 −0.38 −0.90 0.30 1.02 −2.3 1.4 −5.4 6.1J539 0.28 0.23 −1.03 1.09 −5.8 3.5 −3.0 7.44-4-20 −0.06 0.18 −1.11 1.13 −2.6 4.0 3.0 5.6NC41 0.25 0.97 1.49 1.80 2.1 −5.4 4.9 7.6

RMSDc 0.28 0.51 0.74 0.94 3.1 3.0 3.3 5.4

The results of applying our orientation averaging procedure to the immunoglobulins: the mean rotationand translation necessary to superimpose a VL domain onto a VH domain and deviations about this mean.

a The mean transformation was derived from fitting the VH domain of a particular structure to the averageVL domain (using the VL-VH alignment in Figure 1) and then refitting on VH. The VL domain waspositioned in the coordinate system so that its centroid is at the origin; the rotation, which is applied beforethe translation, is around the z-axis; and as much as possible of the translation is along the x-axis. With theseconventions, one would roughly expect the z-axis to be parallel with the b-strands in the immunoglobulins;the x-axis to be perpendicular to the plane of the b-sandwich in each domain; and the y-axis to be parallelwith the plane of the sandwich but perpendicular to the strand direction.

b Once the average transformation is applied one can determine the incremental rotation and translationnecessary to perfectly superimpose particular VH domains. These deviations from the mean transformationare shown for each Fv fragment.

c The average of deviations from the mean transformation b is zero. However, one can measure the spreadof each component deviation by computing its RMS, which is shown. =T = is the magnitude of the translationalcomponent. Analogously, u is the magnitude of the rotational component.

d The rotation is expressed in terms of three components and an overall rotatation angle. The threecomponents are the direction cosines scaled by the angle of rotation. They are the usual way to describesmall rotations (Goldstein, 1980; Altmann, 1986). Because sin u1u for small rotations, the four numbers areeasily related to the four quaternion parameters:

qT = (l, m, n, s) = 0R(x)2 ,

R(y)2 ,

R(z)2 , cos

u21

As should be apparent, the way the translations and rotation are expressed in this Table is completelyequivalent. This is one of the strengths of the quaternion representation.

family. This is quite a striking finding since nowherein our immunoglobulin core-finding did we incor-porate any information about variability in othermembers of the superfamily beside VL and VH,yet our procedure identified as variable thoseparts of immunoglobulin structure that vary greatlythroughout the superfamily.

Structural variability is clearly correlated withsequence variability at the level of the overall fold,i.e. similar sequences have the same fold. However,with regard to the immunoglobulins, we find thatsequence variation is not correlated with structuralvariation in terms of the detailed positioning ofatoms (as measured by our error ellipsoids). This istrue whether we consider just the core atoms or bothcore and non-core atoms. Our results are consistentwith the idea that structural accommodation isglobal: in responding to mutations, helices andsheets shift slightly, more or less as rigid bodies(reflecting their fixed hydrogen-bonded geometry),and spread the effect of the mutation throughout thewhole core. Such global accommodation has been

found in the structures of phage T4 lysozymemutants (Eriksson et al., 1992; Baldwin et al., 1993).

Previously, we demonstrated similar results re-garding sequence and structure variability for theglobins (Altman & Gerstein, 1994). Our work withthe globins also manifests the importance of thethrow-out order. In particular, we found that purelyon the basis of throw-out order the globins could bepartitioned into a more variable region (the F helix)and a conserved core, which turned out to haveessentially the same structural elements as therepressor protein.

For the immunoglobulins, as well as the globins,it appears that the parts of the protein thrown outfirst had fewer tertiary interactions than thosethrown out later. For the immunoglobulins this isobvious in comparing, say, the first strands thrownout (A' and C", on the edge of the protein) with thosethrown out last (B-C and E-F, in the center of thedomain). Thus, the core-finding procedure appearsto start at the outside and successively peel awaylayers of protein until it gets to the center.

Page 13: Average Core Structures and Variability Measures for Protein Families: Application to the Immunoglobulins

JMB—MS 640

Average Core Structure 173

Figure 8. Schematic showing orientation averaging. Thedetails of the orientation averaging procedure. As shown inStep I (top), we start with a number of instances of atwo-domain protein with the domains having differentrelative orientations (called instance 1, instance 2, and soforth). We then form the core structures of each domain (i.e.domain A and domain B), which are represented in Step IIpositioned on top of each other with centroids at the origin.Then in Step III, we fit the first core structure (denoted A)to first instance of the A domain. This involves a transform-ation A(1). Then we refit to superimpose the B corestructure onto the B domain. This involves transformationB(1). We would like to do the same thing for all otherinstances of domain A and domain B and then average allthe B transformations to get an average transformation B.However, this leads to a problem since the B transform-ations often involve large rotations, for instance, in the caseof the immunoglobulins, an almost 180° rotation is neededand large rotations are not compatible with our averagingprocedure. What is small and compatible with ourprocedure are the differences amongst the various Btransformations. Consequently, as shown in Step IV, for allinstances after the first instance, after applying the Atransformation, we apply B(1). Using this positioning wethen fit the B core structure to the B domain. This involvesa third transformation, C(i ), which is the differencebetween the transformation required to superpose the Bdomain for the first instance and for instance i. Aftercomputing the C transformations for all instances exceptthe first, we average them with our orientational averagingprocedure. Since the rotation in each C(i ) is small we canaverage the corresponding quaternion just like a normalvector. After finding the average quaternion rotation andaverage transformation, we recombine them to produce theaverage transformation C. The average transformation

to superpose the B core on the B domain after applying therelating the core structures of domains A and B isB = B(1)C. To calculate the variation about the mean average transformation. (Note that unlike the proceduretransformation B, we repeat Step IV (above), but this time depicted in Step IV, where C(1) was the identityusing B in place of B(1). The differences we calculate, transformation, it is necessary and possible to find awhich we now denote D(i ), give the transformation needed non-trivial value for D(1).)

Conclusion

We have presented a method for finding anunbiased, average structural core and for determin-ing the average orientation between two rigid cores.An integral part of our method is assigning ameasure of variability (i.e. the error ellipsoid) to eachposition in a family and ranking all the positionsaccording to their structural variability (i.e. thethrow-out order). Once calculated this measure ofvariability can be used to calculate a moreinformative measure of structural similarity than thenormal RMS difference atom positions, i.e. a betterRMS, and can be compared easily with sequencevariability. Furthermore, when applied to specificprotein families, our average core structures andmeasures of variability yield a number of biologi-cally significant results. For instance, by looking atvariability just within the variable domains of theimmunoglobulin family, we are able to see patternsof variability that occur throughout the wholesuperfamily.

As we have defined them, the core atoms representthe structurally invariant components of proteinfamilies. Since they adopt the same position in allmembers of the family, the core atoms are probablynot responsible for functional differences withinthese families. Instead, the atoms that are classifiedas non-core are logically those to which functionaldifferences can be assigned.

Availability of results on the Internet

We make available C and lisp source code for per-forming the core-finding and calculating the SD-RMS; alignments of the immunoglobulins; the actualcoordinates of the immunoglobulin cores; ProteanD,a program for displaying error ellipsoids on a SiliconGraphics workstation; and further documentationin hypertext form. These items can be retrievedby sending e-mail to [email protected] [email protected] or through anonymousftp to the following URL: ftp://camis. stanford.edu/pub/AvgCore/.

AcknowledgementsR.B.A. is a Culpeper Medical Scholar and is supported

by LM-05652. Computing environment was provided bythe CAMIS resource under NIH grant LM-05305. M.G. issupported by a Damon-Runyon Walter-Winchell fellow-ship (DRG-1272). We thank M. Levitt and A. Lesk foruseful discussions and suggestions, and R. Diamond forproviding his structure superposition program.

Page 14: Average Core Structures and Variability Measures for Protein Families: Application to the Immunoglobulins

JMB—MS 640

Average Core Structure174

References

Altman, R. & Gerstein, M. (1994). Finding an average corestructure: application to the globins. Proc. Second Int.Conf. Intell. Sys. Mol. Biol., pp. 19–27, AAAI Press,Menlo Park, CA.

Altmann, S. L. (1986). Rotations, Quaternions and DoubleGroups, Oxford University Press, New York.

Amit, A. G., Mariuzza, R. A., Phillips, S. E. V. & Poljak, R. J.(1986). Three dimensional structure of an antigen-antibody complex at 2.8 A resolution. Science, 233,747–753.

Arevalo, J. H., Stura, E. A., Taussig, M. J. & Wilson, I. A.(1993). Three-dimensional structure of an anti-steroidFab and progesterone-Fab complex. J. Mol. Biol. 231,103–118.

Baldwin, E. P., Hajiseyedjavadi, O., Baase, W. A. &Matthews, B. W. (1993). The role of backbone flexi-bility in accomodation of variants that repack the coreof T4 lysozyme. Science, 262, 1715–1718.

Bernstein, F. C., Koetzle, T. F., Williams, G. J. B., Meyer.E. F., Jr, Brice, M. D., Rodgers, J. R., Kennard, O.,Shimanouchi, T. & Tasumi, M. (1977). The protein databank: a computer-based archival file for macro-molecular structures. J. Mol. Biol. 112, 535–542.

Bork, P., Holm, L. & Sander, C. (1994). The immunoglobulinfold: structural classification, sequence patterns andcommon core. J. Mol. Biol. 242, 309–320.

Bowie, J. U., Luthy, R. & Eisenberg, D. (1991). A methodto identify protein sequences that fold into a knownthree-dimensional structure. Science, 253, 164–170.

Bryant, S. H. & Lawrence, C. E. (1993). An empirical energyfunction for threading protein sequence through thefolding motif. Proteins: Struct. Funct. Genet. 16, 92–112.

Chothia, C. (1992). Proteins, 1000 families for the molecularbiologist. Nature, 357, 543–544.

Chothia, C. & Finkelstein, A, V. (1990). The classificationand origins of protein folding patterns. Annu. Rev.Biochem. 59, 1007–1039.

Chothia, C. & Lesk, A. M. (1986). The relation between thedivergence of sequence and structure in proteins.EMBO J. 5, 823–826.

Chothia, C. & Lesk, A. M. (1987). Canonical structures forthe hypervariable regions of the immunoglobulins.J. Mol. Biol. 196, 901–917.

Chothia, C., Novotny, J., Bruccoleri, R. & Karplus, M.(1985). Domain association in immunoglobulin mol-ecules: the packing of variable domains. J. Mol. Biol.186, 651–663.

Chothia, C., Boswell, D. R. & Lesk, A. M. (1988). Theoutline structure of the T-cell ab-receptor. EMBO J. 7,3745–3755.

Chothia, C., Lesk, A. M., Tramontano, A., Levitt, M.,Smith-Gill, S. J., Air, G., Sheriff, S., Padlan, E. A.,Davies, D., Tulip, W. R., Colman, P. M., Spinelli, S.,Alzari, P. M. & Poljak, R. J. (1989). Conformations ofthe immunoglobulin hypervariable regions. Nature,342, 877–883.

Colman, P. M., Tulip, W. R., Varghese, J. N., Baker, A. T. &Tuloch, P. A. (1987). NC41 Complex structure deter-mination. Nature, 326, 358–363.

Diamond, R. D. (1992). On the multiple simultaneoussuperposition of molecular structures by rigid-bodytransformations. Protein Sci. 1, 1279–1287.

Epp, O., Latham, E., Schiffer, M., Huber, R. & Palm, W.(1975). The molecular structure of a dimer composedof the variable portions of the Bence-Jones protein REIrefined at 2.0 A resolution. Biochemistry, 14, 4943–4952.

Eriksson, A. E., Baase, W. A., Zhang, X. J., Heinz, D. W.,Blaber, M., Baldwin, E. P. & Matthews, B. W. (1992).Response of a protein structure to cavity creatingmutations and its relation to the hydrophobic effect.Science, 255, 178–183.

Felsenstein, J. (1989). PHYLIP (phylogeny inference pack-age version 3.2). Cladistics, 5, 164–166.

Felsenstein, J. (1993). PHYLIP (phylogeny inference pack-age) version 3.5c. Department of Genetics, Universityof Washington, Seattle.

Finkelstein, A. V. & Reva, B. A. (1993). A search for the moststable folds of protein chains. Nature, 351, 497–500.

Gerber, P. R. & Muller, K. (1987). Superimposing severalsets of atomic coordinates. Acta Crystallog. sect. A, 43,426–428.

Gerstein, M. & Chothia, C. H. (1991). Analysis of proteinloop closure: two types of hinges produce one motionin lactate dehydrogenase. J. Mol. Biol. 220, 133–149.

Gerstein, M., Lesk, A. M., Baker, E. N., Anderson, B.,Norris, G. & Chothia, C. (1993). Domain closure inlactoferrin: two hinges produce a see-saw motionbetween alternative close-packed interfaces. J. Mol.Biol. 234, 357–372.

Gerstein, M., Lesk, A. M. & Chothia, C. (1994). Structuralmechanisms for domain movements. Biochemistry, 33,6739–6749.

Goldstein, H. (1980). Classical Mechanics, Addison-Wesley,New York.

Gribskov, M., Luthy, R. & Eisenberg, D. (1990). Profileanalysis. Methods Enzymol. 183, 146–159.

Harpaz, Y. & Chothia, C. (1994). Many of the immuno-globulin superfamily domains in cell adhesionmolecules and surface receptors belong to a newstructural set which is close to that containing variabledomains. J. Mol. Biol. 238, 528–539.

Herron, J. N., He, X., Mason, M. L., Voss, E. W. &Edmundson, A. B. (1989). Three-dimensional struc-ture of a fluorescein-Fab complex crystallized in2-methyl-2.4-pentandiol. Proteins: Struct. Funct. Genet.5, 271–280.

Holm, L. & Sander, C. (1993). Protein structure comparisonby alignment of distance matrices. J. Mol. Biol. 233,123–128.

Holm, L., Ouzounis, C., Sander, C., Tuparev, G. & Vriend,G. (1993). A database of protein structure familieswith common folding motifs. Protein Sci. 1, 1691–1698.

Johnson, M. S., Sali, A. & Blundell, T. L. (1990). Phylo-genetic relationships from three-dimensional proteinstructures. Methods Enzymol. 183, 670–691.

Jones, D. T., Taylor, W. R. & Thornton, J. M. (1992). A newapproach to protein fold recognition. Nature, 358,86–89.

Kabat, E. A., Wu, T. T., Bilofsky, H., Reid-Milner. M. & Perry,H. (1983). Sequences of Proteins of Immunological Interest.National Institutes of Health, Washington, DC.

Kearsley, S. K. (1990). An algorithm for the simultaneoussuperposition of a structural series. J. Comp. Chem. 11,1187–1192.

Krogh, A., Brown, M., Mian, I. S., Sjolander, K. & Haussler.D. (1994). Hidden Markov models in computationalbiology: applications to protein modelling. J. Mol. Biol.235, 1501–1531.

Lesk, A. M. (1991). Protein Architecture: A PracticalApproach. IRL Press, Oxford.

Lesk, A. M. & Chothia, C. H. (1980). How different aminoacid sequences determine similar protein structures:the structure and evolutionary dynamics of theglobins. J. Mol. Biol. 136, 225–270.

Lesk, A. M. and Chothia, C. (1982). Evolution of proteins

Page 15: Average Core Structures and Variability Measures for Protein Families: Application to the Immunoglobulins

JMB—MS 640

Average Core Structure 175

formed by b-sheets, II. The core of the immuno-globulin domains. J. Mol. Biol. 160, 325–342.

Lesk, A. M. & Chothia, C. (1984). Mechanisms of domainclosure in proteins. J. Mol. Biol. 174, 175–191.

Levitt, M. (1976). A simplified representation of proteinconformations for rapid simulation of protein folding.J. Mol. Biol. 104, 59–107.

Levitt, M. & Chothia, C. (1976). Structural patterns inglobular proteins. Nature, 261, 552–558.

Madej, T. & Mossing, M. C. (1994). Hamiltonians forprotein tertiary structure prediction based onthree-dimensional environment principles. J. Mol.Biol. 233, 480–487.

Marquart, M., Deisenhofer, J., Huber, R. & Palm, W. (1980).Crystallographic refinement and atomic models of theintact immunoglobulin molecule KOL and its antigen-binding fragment at 3.0 A and 1.9 A resolution. J. Mol.Biol. 141, 369–391.

McLachlan, A. D. (1984). How alike are the shapes of tworandom chains? Biopolymers, 23, 1325–1331.

Murzin, A., Brenner, S. E., Hubbard, T. & Chothia, C.(1995). SCOP: a structural classification of proteins forthe investigation of sequences and structures. J. Mol.Biol. 247, 536–540.

Orengo, C. A. (1994). Classification of protein folds. Curr.Opin. Struct. Biol. 4, 429–440.

Orengo, C. A., Flores, T. P., Taylor, W. R. & Thornton,J. M. (1993). Identifying and classifying protein foldfamilies. Protein Eng. 6, 485–500.

Orengo, C. A., Jones, D. T. & Thornton, J. M. (1994). Proteinsuperfamilies and domain superfolds. Nature, 372,631–634.

Overington, J., Donnelly, D., Johnson, M. S., Sali, A. &Blundell, T. L. (1992). Environment-specific aminoacid substitution tables: tertiary templates and pre-dition of protein folds. Protein Sci. 1, 216–226.

Padlan, E. A., Silverton, E. W., Sheriff, S., Cohen, G. H.,Smith-Gill, G. S. & Davies, D. R. (1989). Structure ofan antibody-antigen complex: crystal structure of theHyHEL-10 Fab-lysozyme complex. Proc. Natl Acad.Sci. USA, 86, 5938–5942.

Pascarella, S. & Argos, P. (1992). A databank mergingrelated protein structures and sequences. Protein Eng.5, 121–137.

Ponder, J. W. & Richards, F. M. (1987). Tertiary templatesfor proteins: use of packing criteria in the enumerationof allowed sequences for different structural classes.J. Mol. Biol. 193, 775–791.

Remington, S. J. & Matthews, B. W. (1980). A systematicapproach to the comparison of protein structures.J. Mol. Biol. 140, 77–99.

Richardson, J. S. (1981). The anatomy and taxonomy ofprotein structure. Advan. Protein Chem. 34, 167–334.

Sali, A. & Blundell, T. L. (1990). The definition of generaltopological equivalence in protein structures: Aprocedure involving comparison of properties andrelationships through simulated annealing anddynamic programming. J. Mol. Biol. 212, 403–428.

Sander, C. & Schneider, R. (1991). Database of homology-derived protein structures and the structural meaningof sequence alignment. Proteins: Struct. Funct. Genet. 9,56–68.

Satow, Y., Cohen, G. H., Padlan, E. A. & Davies, D. R.(1987). Phosphocholine binding immunoglobulin FabMcPC603: an X-ray diffraction study at 2.7 A. J. Mol.Biol. 190, 593–604.

Schneider, T. D. & Stephens, R. M. (1991). Sequence logos:a new way to display consensus sequences. Nucl. AcidsRes. 18, 6097–6100.

Schneider, T. D., Stormo, G. D., Gold, L. & Ehrenfeucht, A.(1986). Information content of binding sites on nucleo-tide sequences. J. Mol. Biol. 188, 415–431.

Schulze-Gahmen, U., Rini, J. M., Arevalo, J., Stura, E. A.,Kenten, J. H. & Wilson, I. A. (1988). Preliminarycrystallographic data, primary sequence, and bindingdata for an anti-peptide Fab and its complex witha synthetic peptide from influenza virus hemag-glutinin. J. Biol. Chem. 263, 17100–17105.

Shapiro, A., Botha, J. D., Pastore, A. & Lesk, A. M. (1992).A method for the multiple superposition of structures.Acta Crystallog. sect. A, 48, 11–14.

Shenkin, P. S., Erman, B. & Mastrandrea, L. D. (1991).Information-theoretical entropy as a measure ofsequence variability. Proteins: Struct. Funct. Genet. 11,297–313.

Sheriff, S., Silverton, E. W., Padlan, E. A., Cohen, G. H.,Smith-Gill, S. J., Finzel, B. C. & Davies, D. R. (1987).Three-dimensional structure of an antibody-antigencomplex. Proc. Natl Acad. Sci. USA, 84, 8075–8079.

Sippl, M. J. (1990). Calculation of conformational en-sembles from potentials of mean force: an approach tothe knowledge-based prediction of local structures inglobular proteins. J. Mol. Biol. 213, 859–884.

Stanfield, R. L., Fieser, T. M., Lerner, R. A. & Wilson, I. A.(1990). Crystal structures of an antibody to a peptideand its complex with peptide antigen at 2.8 A. Science,248, 712–719.

Strong, R. K., Campbell, R., Rose, D. R., Petsko,G. A., Sharon, J. & Margolies, M. N. (1991). Three-dimensional structure of murine anti-p-azophenyl-arsonate Fab 36-71: 1. X-ray crystallography, site-directed mutagenesis, and modeling of the complexwith hapten. Biochemistry, 30, 3739–3748.

Subbiah, S. & Harrison, S. C. (1989). A method for multiplesequence alignment with gaps. J. Mol. Biol. 209,539–548.

Subbiah, S., Laurents, D. V. & Levitt, M. (1993). Structuralsimilarity of DNA-binding domains of bacterio-phage repressors and the globin core. Curr. Biol. 3,141–148.

Suh, S. W., Bhat, T. N., Navia, M. A., Cohen, G. H., Rao,D. N., Rudikoff, S. & Davies, D. R. (1986). Thegalactan-binding immunoglobulin Fab J539. An X-raydiffraction study at 2.6 A resolution. Proteins: Struct.Funct. Genet. 1, 74–80.

Suzuki, M., Gerstein, M. & Yagi, N. (1994). The stereo-chemical basis for DNA recognition by Zn fingers.Nucl. Acid. Res. 22, 3397–3405.

Suzuki, M., Yagi, N. & Gerstein, M. (1995). DNA-recognition and superstructure-formation by helix-turn-helix proteins. Protein Eng. 4, In the press.

Swindells, M. B. (1995). A procedure for the automaticdetermination of hydrophobic cores in protein struc-tures. Protein Sci. 4, 93–102.

Taylor, W. R. & Orengo, C. A. (1989). Protein structurealignment. J. Mol. Biol. 208, 1–22.

Williams, A. F. & Barclay, A. N. (1988). The immuno-globulin superfamily domains for surface recognition.Annu. Rev. Immunol. 6, 381–405.

Yee, D. P. & Dill, K. A. (1993). Families and the structuralrelatedness among globular proteins. Protein Sci. 2,884–899.

Edited by I. A. Wilson

(Received 17 February 1995; accepted in revised form 1 May 1995)