
J. theor. Biol. (2000) 202, 77-86
Article No. jtbi.1999.1043, available online at http://www.idealibrary.com

Do Proteins Learn to Evolve? The Hopfield Network as a Basis for the Understanding of Protein Evolution

LEIGHTON PRITCHARD* AND MARK J. DUFTON†

Department of Pure and Applied Chemistry, University of Strathclyde, 295 Cathedral Street, Glasgow, Scotland, G1 1XL, U.K.

(Received on 16 August 1999, Accepted on 5 November 1999)

Correlations between amino-acid residues can be observed in sets of aligned protein sequences, and the analysis of their statistical and evolutionary significance and distribution has been thoroughly investigated. In this paper, we present a model based on such covariations in protein sequences in which the pairs of residues that have mutual influence combine to produce a system analogous to a Hopfield neural network. The emergent properties of such a network, such as soft failure and the connection between network architecture and stored memory, have close parallels in known proteins. This model suggests that an explanation for observed characters of proteins such as the diminution of function by substitutions distant from the active site, the existence of protein folds (superfolds) that can perform several functions based on one architecture, and structural and functional resilience to destabilizing substitutions might derive from their inherent network-like structure. This model may also provide a basis for mapping the relationship between structure, function and evolutionary history of a protein family, and thus be a powerful tool for rational engineering.

© 2000 Academic Press

Introduction

A huge number and diversity of chemical operations are needed for the correct functioning of even a single cell, let alone a whole organism. These operations must be carried out with a high degree of accuracy, selectivity, specificity and speed. Proteins play crucial roles in most of these processes, and their function is dependent on the formation of specific molecular interactions. Despite their important role in life processes, and our recent gains in understanding, many elements of protein function and evolution are still only poorly understood.

*Present address: Institute of Biological Sciences, University of Wales, Aberystwyth, U.K.
†Author to whom correspondence should be addressed. E-mail: [email protected]

The necessary specificity and diversity of the molecular interactions formed by proteins require a large range of protein amino-acid sequences and three-dimensional structures. One recent estimate of the total number of protein folds indicates that there are probably around 700 distinct folds, of which only about half have been experimentally observed (Zhang & DeLisi, 1998). The existence of several protein "superfolds" that are capable of expressing more than one function has been noted, but not explained (Sander & Schneider, 1991; Orengo et al., 1993; Hegyi & Gerstein, 1999). Our theoretical understanding of how proteins fold still lacks an adequate physical model despite intense investigation (Baldwin & Rose, 1999a, b). We are unable, therefore, to predict protein function and/or structure from amino acid (and RNA or DNA) sequences alone, although a threading method has recently been reported to assign structure to approximately 30% of the Mycoplasma genitalium genome (Jones, 1999). The inability to directly interpret primary sequence still represents a major gap in knowledge given the increasing number of protein sequences that remain uncharacterized from the various experimental genome-mapping projects (Bork et al., 1998).

The expression of function in several well-characterized proteins has, however, been detailed in some depth (e.g. Scott & Sigler, 1994; Creighton, 1993), and the chemical mechanisms and principles by which enzyme activity is modified are thought to be quite well understood (Warshel, 1998). Even so, the ancestral origins of any particular protein and the paths by which its sequence and structure have evolved are generally unknown. The chemical and mechanical origins of function in many contemporary proteins are similarly obscure. For example, recent studies on ricin indicate that sections of structure that were not expected to play any role in functional expression do, in fact, make a functional contribution (e.g. Kitaoka, 1998). Other theoretical and experimental studies indicate that the same is true for other proteins (e.g. Pritchard & Dufton, 1999; Oue et al., 1999).

Nonetheless, some shared characteristics of proteins suggest that there are underlying behaviours or principles common to all of them. For example, the spontaneous refolding of denatured globular proteins to their functional conformations indicates that the information required for folding is often contained in a protein's primary sequence, at least when in the proper environmental context (Dobson et al., 1998). Despite this, attempts to predict even protein secondary structure from primary sequence are only around 60-70% effective, possibly indicating some subtlety of intramolecular interaction that is currently absent from structural prediction algorithms. Moreover, the ability of protein function and tertiary structure to remain stable in the face of most single-site substitutions (i.e., "soft failure") is potentially indicative of a subtle network of simple self-compensating interactions.

In this paper, we present a description of protein evolution based upon covariation analysis (e.g. Shindyalov et al., 1994; Taylor & Hatrick, 1994; Neher, 1994; Chelvanayagam et al., 1997; Pollock & Taylor, 1997; Pollock et al., 1999). A protein is considered to be a network of amino acids that have mutual, and variable, influence on each other by means of covalent and non-covalent interactions within the tertiary structure. This description is demonstrated to be very similar to the formal description of a Hopfield model neural network. Emergent, theoretical and dynamic aspects of the Hopfield network are considered in the context of protein structure, function and evolution. This paper hypothesizes that the intramolecular selective pressures on protein evolution can be described by a Hopfield network-like model.

PROTEIN EVOLUTION CAN BE DESCRIBED AS A PATH THROUGH SPACE

Any system that can be described by a number of variables, n, has a state described by the collective values for each variable. For example, a system comprising a tripeptide sequence has three variables, each one representing the choice of amino acid at a single position in the chain. If the tripeptide is Ala-Tyr-Asp, then the state of that peptide is "Ala-Tyr-Asp", and the state can be plotted on a set of three-dimensional axes as a single point (Fig. 1). If a substitution is made in the tripeptide, say Tyr to Trp, then the new state "Ala-Trp-Asp" can be plotted as a second point, distinct from the first, on the same axes. For a tripeptide based on the 20 common amino acids there are 8000 possible states, and each one can be plotted as an individual point on a set of three-dimensional axes.
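The counting argument above can be checked with a short sketch. The mapping of each residue to an integer coordinate (its index in an alphabetical list) is an arbitrary assumption made only so that a state can be written as a point; the text does not prescribe any particular axis scale.

```python
from itertools import product

# The 20 common amino acids as three-letter codes (alphabetical by name).
AMINO_ACIDS = [
    "Ala", "Arg", "Asn", "Asp", "Cys", "Gln", "Glu", "Gly", "His", "Ile",
    "Leu", "Lys", "Met", "Phe", "Pro", "Ser", "Thr", "Trp", "Tyr", "Val",
]

# Assumption: each residue's coordinate is simply its index in the list above.
INDEX = {aa: k for k, aa in enumerate(AMINO_ACIDS)}


def state_to_point(peptide):
    """Map a peptide to a point with one coordinate per chain position."""
    return tuple(INDEX[aa] for aa in peptide)


# Every possible tripeptide is one state, i.e. one point on three axes.
all_states = list(product(AMINO_ACIDS, repeat=3))
print(len(all_states))                              # 8000

print(state_to_point(("Ala", "Tyr", "Asp")))        # (0, 18, 3)
print(state_to_point(("Ala", "Trp", "Asp")))        # (0, 17, 3)
```

The two printed points differ in exactly one coordinate, which is the single Tyr to Trp substitution described in the text.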

It is a general principle that the state of a system described by n variables can be plotted as a unique point on a graph with n axes. If the state of such a system changes over time, then the progress of the system through a series of states can be plotted as a path drawn through each successive point (Fig. 1).

As a protein comprising n residues evolves from an ancestral to a contemporary sequence over time, this progression can also be described as a path drawn on n axes, i.e. in n-dimensional space, that flows through the points which represent each intermediate sequence.
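A path of this kind can be represented as an ordered list of sequences in which each step changes exactly one position. The sketch below uses the end points named in Fig. 1 (Ala-Tyr-Asp and Ile-Phe-Asn); the two intermediate sequences are hypothetical, since the caption does not specify which intermediates are visited.

```python
def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def is_single_substitution_path(path):
    """True if every step in the path changes exactly one position."""
    return all(hamming(a, b) == 1 for a, b in zip(path, path[1:]))


# End points from Fig. 1; the intermediates are illustrative assumptions.
path = [
    ("Ala", "Tyr", "Asp"),
    ("Ile", "Tyr", "Asp"),   # hypothetical intermediate
    ("Ile", "Phe", "Asp"),   # hypothetical intermediate
    ("Ile", "Phe", "Asn"),
]

print(is_single_substitution_path(path))   # True
print(hamming(path[0], path[-1]))          # 3
```

Any ordering of the three substitutions gives a valid path between the same two points, which is one reason many different routes through the space can lead to the same final sequence.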

FIG. 1. The state of the tripeptide sequence Ala-Tyr-Asp can be represented as a point on three-dimensional axes. Substitution of Tyr by Trp results in a change of this state to Ala-Trp-Asp, which can be plotted as a distinct point on the same axes. The transition between the two states can be represented as a line joining the two points. The gradual evolution of the Ala-Tyr-Asp sequence to Ile-Phe-Asn by means of single substitutions can be represented on these axes in this way, as a path that passes through the points representing the intermediate states. This description of evolution applies to any polypeptide or protein of chain length n residues, requiring a set of n axes. Hence, the evolution of a protein sequence over time can be represented as a path in n-dimensional space. The final "fit" sequence in such a pathway of evolution is an attractor, and selective pressures determine its state (i.e. its coordinates).

The process of evolutionary divergence, in which a set of proteins that share a single common ancestor have evolved into distinct, unique protein sequences, can be considered in this way as a series of divergent pathways emanating from a single point in n-dimensional space.

It is generally accepted that Natural Selection acts to optimize each individual protein for its function. If we assume that there is one contemporary "ideal" sequence (or a small number of equally good sequences) for a given function, then the "ideal" sequence can itself be described as a point (or cluster of points) in phase space. The protein will appear to be attracted to that point, and the point, or cluster, itself (the "ideal" sequence) is called an attractor. For a protein family with a single common structure and many different functional expressions, there will be several different attractors.

However, neutral evolution (Kimura, 1983) introduces "random" substitutions that are not beneficial to function. These have the effect of diverting the evolutionary pathway mildly off-course at every step. If we were to chart the evolutionary progress of many competing proteins in a population over time, we would see a "diffusion" into n-dimensional space of the main evolutionary pathway that leads to the "ideal" sequence. This "diffusion" is due to random variation within the population increasing the number of points around the pathway that are visited by competing sequences at the same time.

COVARIATION IN PROTEIN EVOLUTION

Covariation, also called correlated change, is an observable phenomenon in aligned homologous protein sequences. It is usually manifest in the observation that amino acid X is found at one position in the sequence when the residue at some other site is Y, and that the residue at the first site is A when the residue at the second site is B, and so on (Fig. 2). The detection of statistically significant covariation has been the subject of several analyses and various methods for identifying covariant pairs have been developed (e.g. Shindyalov et al., 1994; Taylor & Hatrick, 1994; Neher, 1994; Chelvanayagam et al., 1997; Pollock & Taylor, 1997; Pollock et al., 1999).

When a covariant pair is found by one of the above methods it is assumed that one residue must have some influence, either directly or indirectly, on the choice of accepted substitution at the other. In Pollock & Taylor (1997), a covariant pair was defined as follows: "When the probabilities of [accepted] substitution at [one] site change depending on the residue at [the other] site, the two sites are correlated".

In some of the analytical methods above, the strength of correlation is reported as a probability. This is usually interpreted as the probability that the residue at one position truly affects substitutions which are accepted at the other position.
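As a rough illustration of how covariation between two alignment columns can be quantified, the sketch below computes the mutual information between columns. This is a generic measure chosen for brevity, not the method of any of the papers cited above, and it yields a score in bits rather than the probability those methods report. The four pentapeptides are a hypothetical alignment constructed to match the pattern described for Fig. 2 (position 2 invariant, position 3 is D when 4 is E and K when 4 is R); they are not the sequences in the figure.

```python
from collections import Counter
from math import log2


def mutual_information(alignment, i, j):
    """Mutual information (bits) between columns i and j (0-based indices)."""
    n = len(alignment)
    col_i = Counter(seq[i] for seq in alignment)
    col_j = Counter(seq[j] for seq in alignment)
    pairs = Counter((seq[i], seq[j]) for seq in alignment)
    mi = 0.0
    for (a, b), count in pairs.items():
        p_ab = count / n
        mi += p_ab * log2(p_ab / ((col_i[a] / n) * (col_j[b] / n)))
    return mi


# Hypothetical alignment consistent with the description of Fig. 2.
alignment = ["AGDEL", "AGKRV", "SGDEV", "SGKRL"]

print(mutual_information(alignment, 2, 3))  # positions 3 and 4: 1.0
print(mutual_information(alignment, 0, 4))  # positions 1 and 5: 0.0
print(mutual_information(alignment, 1, 2))  # invariant position 2: 0.0
```

With only four sequences a high score can easily arise by chance, which is the problem of separating "true" from "false" covariation that the cited methods address statistically.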

With this in mind, for two positions i and j in a protein sequence, we can describe a connection strength between them as being $C_{ij}$, the probability that the nature of the residue at position j affects which substitutions can be accepted at position i. This leads us to formulate an expression for the influence of the entire primary sequence on substitutions at a single position i as a function of the connections between sites, and the side chains at those sites:

$$\sum_{i \neq j} f(C_{ij}, S_j) \qquad (1)$$

In the above expression, the influence (as measured by covariation analysis) on substitution at a single position, i, from every other position in the molecule, j, is considered over the entire structure. The variable $C_{ij}$ represents the probability that the residue at a single position j affects the substitutions accepted at i. The variable $S_j$ represents the side-chain present at position j and is required because the nature and extent of influence on position i will be dependent on the side-chain at j.

According to both selective and neutral evolutionary models, a substitution at position i will only be fixed in the population if there is no significant compromise of overall fitness. Representing the initial side-chain at i as $S_i$, and the attempted replacement as $S_i'$, we can formulate a state-change algorithm. $S_i$ will be substituted by $S_i'$ if, and only if, the influence of the rest of the primary sequence dictates that there will be no significant loss of fitness on substitution. Otherwise, the residue at position i is likely to be retained. The algorithm can be written as

$$
\left.\begin{array}{l} S_i \rightarrow S_i' \\ S_i \end{array}\right\} \text{ if } \left\{\begin{array}{l} \sum_{i \neq j} f(C_{ij}, S_j) \Rightarrow \text{no loss of fitness on substitution}, \\ \sum_{i \neq j} f(C_{ij}, S_j) \Rightarrow \text{loss of fitness on substitution}. \end{array}\right. \qquad (2)
$$
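The state-change algorithm of eqn (2) can be sketched in a few lines, but only by choosing a concrete form for f and for the fitness test, neither of which the text specifies. In the sketch below everything concrete is an assumption made for illustration: the relative side-chain sizes in `SIZE`, the product form of `f`, the connection matrix `C`, and the `capacity` threshold that stands in for "no loss of fitness".

```python
# Hypothetical relative side-chain sizes (arbitrary units, one-letter codes).
SIZE = {"G": 1, "A": 2, "L": 4, "F": 5}


def f(c_ij, s_j):
    """Assumed influence term: connection strength times neighbour size."""
    return c_ij * SIZE[s_j]


def influence(i, seq, C):
    """Eqn (1): sum of f(C_ij, S_j) over every position j other than i."""
    return sum(f(C[i][j], seq[j]) for j in range(len(seq)) if j != i)


def attempt_substitution(seq, i, new_residue, C, capacity=6.0):
    """Eqn (2): accept S_i -> S_i' only if the rest of the sequence allows it."""
    crowding = influence(i, seq, C)
    # Stand-in fitness test (an assumption): the new side-chain must fit.
    if crowding + SIZE[new_residue] <= capacity:
        return seq[:i] + new_residue + seq[i + 1:]
    return seq  # substitution rejected, S_i retained


# Hypothetical connection strengths for a three-residue chain.
C = [
    [0.0, 0.9, 0.1],
    [0.9, 0.0, 0.2],
    [0.1, 0.2, 0.0],
]

print(attempt_substitution("AFG", 0, "F", C))  # AFG (rejected)
print(attempt_substitution("AFG", 0, "G", C))  # GFG (accepted)
print(attempt_substitution("AFG", 2, "L", C))  # AFL (accepted)
```

Position 1 is strongly connected to the large residue at position 2, so a large replacement there is rejected, whereas the weakly connected position 3 tolerates one.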

More accurately, in terms of the neutral evolutionary model of Kimura (1983), we might also consider a correction for the probability (which is dependent on population size) that the substitution is fixed regardless of its fitness. Such a correction would have to be applied to individual residues based on a prior knowledge of each residue's contribution to structure or function. This correction has not been included here, partly in order to simplify comparison of the similarities between the covariation model and the network model presented below. More importantly, however, the network model that is presented would tend to suggest that most, if not all, residues in a protein could contribute towards the overall function in some way, and so are not easily distinguished as "neutral" in evolutionary terms.

THE HOPFIELD NETWORK MODEL

The Hopfield neural network model was first described by Hopfield (1982), and like all neural networks, it comprises a set of nodes (or neurones) and their connections (Fig. 3). Hopfield networks are often described as "general content-addressable memories" (GCMs), because they can be trained to recall a unique pre-determined state when presented with information associated with that state. These networks can be trained to distinguish between two closely related states, and this property is used for their application in studies of visual recognition.

In Fig. 3, a four-node Hopfield network is described. One of the defining characteristics of Hopfield neural networks is that they are fully recursive, i.e. each node is connected to every other node in the network. In the simplest, and usual, case, each node in such a network is allowed to be in one of two binary states, say $+1$ or $-1$. The choice of state at any given node is dependent on the states at every other node in the network and on the strengths of connection between the node and all other nodes.

FIG. 2. Two positions in a set of protein sequences are covariant if the allowed choice of side-chain at one position is influenced by the state at another position (see text). This can be identified by eye for a pair of sites in a set of aligned sequences when one site has state A while the second site has state B, but has state X when the second site has state Y. The letters in this figure, having their conventional one-letter-code meanings, represent aligned pentapeptide fragments of four protein sequences. There is an invariant residue at position 2, but positions 1, 3, 4 and 5 vary between sequences. Positions 3 and 4 are most obviously correlated, since the residue at 3 is D when 4 is E, and K when 4 is R. Positions 1 and 5 however show no clear correlation with any other position in this pentapeptide for these four sequences.

FIG. 3. A simple Hopfield network architecture: circles $S_1$-$S_4$ represent the nodes, while the arrows $C_{12}$, etc., represent the connection strengths between nodes. The state of the network as a whole is represented by the states of the nodes (the side chains at those sites, for proteins), and the activity of the network, i.e. what memories are recalled for which input, is determined by the connection strengths.

Essentially, every node i is considered to be individually connected to every other node j with connection strength $C_{ij}$ ($-1 < C_{ij} < +1$). The input to node i from node j is dependent on the state at node j and the strength of connection. For example, using a multiplicative rule, if the state of node j is $-1$, and the connection strength $C_{ij}$ is $-0.5$, then the input at i from that node is $+0.5$. The overall input at node i from all other nodes in the network can then be written, and calculated, as

$$\sum_{i \neq j} C_{ij} \cdot S_j, \qquad (3)$$

where $S_j$ represents the state at node j. Node i, given this total input, then applies a threshold condition such that it adopts one state if the overall input is greater than a certain value, $U_i$, but adopts the other state if the input value lies below $U_i$. This can be written in the form

$$
\left.\begin{array}{l} S_i \rightarrow +1 \\ S_i \rightarrow -1 \end{array}\right\} \text{ if } \left\{\begin{array}{l} \sum_{i \neq j} C_{ij} \cdot S_j > U_i, \\ \sum_{i \neq j} C_{ij} \cdot S_j < U_i. \end{array}\right. \qquad (4)
$$
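The input sum of eqn (3) and the threshold rule of eqn (4) translate directly into code. The sketch below assumes a zero threshold by default and leaves a node unchanged when its input exactly equals the threshold, a case the text does not cover; the two-node connection matrix is a made-up example reproducing the worked numbers above.

```python
import numpy as np


def node_input(C, S, i):
    """Eqn (3): weighted sum of the states of all nodes j other than i."""
    mask = np.arange(len(S)) != i
    return float(np.dot(C[i, mask], S[mask]))


def update_node(C, S, i, U=0.0):
    """Eqn (4): node i adopts +1 above threshold U and -1 below it.

    Assumption: the state is left unchanged if the input equals U.
    """
    h = node_input(C, S, i)
    new_S = S.copy()
    if h > U:
        new_S[i] = 1
    elif h < U:
        new_S[i] = -1
    return new_S


# Two-node example: state of node j is -1, connection strength is -0.5.
C = np.array([[0.0, -0.5],
              [-0.5, 0.0]])
S = np.array([-1, -1])

print(node_input(C, S, 0))    # 0.5
print(update_node(C, S, 0))   # [ 1 -1]
```

Here the negative connection and the negative neighbouring state combine to give a positive input, so node 0 switches to the $+1$ state.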

Discussion

It was stated by Hopfield (1982) that, "Any physical system whose dynamics in phase space is dominated by a substantial number of locally stable states to which it is attracted can [...] be regarded as a general content-addressable memory." We have deduced that protein evolution can be described as a set of paths through phase space that is attracted to "fit" (locally stable) sequences; hence, a protein may be similarly regarded as a general content-addressable memory.

There are also close similarities between the description of protein evolution derived from a covariation approach and the formal description of a Hopfield network, in terms of connectivities [eqns (1) and (3)] and state change algorithms [eqns (2) and (4)]. This similarity leads us to suggest that the process of protein evolution towards a particular function is akin to the way in which a Hopfield network "remembers" a trained state.

IS "MEMORY RECALL" THE SAME AS EVOLUTIONARY DEVELOPMENT OF FUNCTION?

Hopfield networks are typical of GCMs because they can recall a memory from a partial or corrupted input (e.g. being able to recall "Charles Dickens, Oliver Twist" when given only the information "C. Dickens, O. T."). We consider that an ancestral protein represents an imperfect input in the context of the current environment, and that the contemporary, functionally refined protein is thus the "complete" memory. The process of a GCM recalling memories from partial input is then reminiscent of the continual refinement of functional specificity and fitness we see in structurally related protein families.
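Recall from a corrupted input can be demonstrated with a very small network. The sketch below assumes the standard Hebbian (outer-product) storage rule and zero thresholds, neither of which is spelled out in the text, and stores two arbitrary eight-node patterns. One node of the first pattern is then flipped, and asynchronous updates are applied until the state stops changing.

```python
import numpy as np


def train_hebbian(patterns):
    """Assumed storage rule: Hebbian outer products, no self-connections."""
    P = np.asarray(patterns, dtype=float)
    C = P.T @ P / P.shape[1]
    np.fill_diagonal(C, 0.0)
    return C


def recall(C, state, max_sweeps=10):
    """Asynchronous recall: update one node at a time until nothing changes."""
    S = np.array(state, dtype=float)
    for _ in range(max_sweeps):
        changed = False
        for i in range(len(S)):
            h = C[i] @ S                      # zero diagonal: no self-input
            new = S[i] if h == 0 else np.sign(h)
            if new != S[i]:
                S[i] = new
                changed = True
        if not changed:
            break
    return S.astype(int)


# Two arbitrary, mutually orthogonal "memories" (illustrative only).
p1 = [1, 1, 1, 1, -1, -1, -1, -1]
p2 = [1, -1, 1, -1, 1, -1, 1, -1]
C = train_hebbian([p1, p2])

corrupted = [-1, 1, 1, 1, -1, -1, -1, -1]     # p1 with the first node flipped
recalled = recall(C, corrupted)
print(recalled.tolist() == p1)                # True
```

Note that the stored patterns appear nowhere in `recall`: the memory is retrieved purely from the connection strengths in `C`, which is the property the analogy with protein structure relies on.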

When a Hopfield network is trained to recall one of several "memories", it is the connectivities between nodes which, when given the incomplete input, dictate which memories can be retrieved, not the choice (or state) of input nodes or the current nodal state of the network (Lisboa, 1992). By the analogous description of a protein given above, the states of the nodes are considered to be the primary amino-acid sequence and the potential range of connectivities, in terms of covalent and non-covalent interactions between residues, is dominated by considerations of the protein's secondary and tertiary structure. The "memories" of a protein are the possible functions that can be expressed, each memory/function being associated with a unique primary sequence (i.e. set of nodal states).

This alone suggests that a protein's potential to develop and refine a particular functionality through selection is also dependent more on the connectivities between individual amino acids in its structure (i.e. its fold) than on its actual primary sequence. The importance of tertiary structure conservation for functional conservation has been noted in many studies (e.g. Bajaj & Blundell, 1984; Hegyi & Gerstein, 1999). It may be that some protein tertiary structures provide initial architectures that are better, in terms of structurally enforced interactions between residues, for the development of multiple functions (perhaps through a common mechanism of action) than other tertiary structures. This provides a partial explanation (in conjunction with considerations of lowest-energy structures and hierarchic rules of protein folding) for the extreme commonality of certain structural motifs despite low primary sequence identity (e.g. Orengo et al., 1997; Russell, 1998). Both the tendency to strongly retain tertiary, and less so primary, structure through evolution and the observed frequent occurrence of "superfolds" (in which a single tertiary structure, with small variations, is able to provide a framework for the development of substantially different function) could be an outcome of this mechanism.

WHERE IS THE MEMORY/FUNCTION LOCATED?

Hopfield networks, as well as being GCMs, are distributed associative memories (DAMs). The term "associative" is derived from the way in which the network associates a stored memory with the given input. The term "distributed" refers to the way in which the memory is stored, not in one place within the system, but collectively across all nodes and connections of the system.

If proteins share the GCM property with Hopfield networks, then they may also share the property of distributed memory. If we consider the function expressed by a protein to be a "memory", then generally, the responsibility for storage of a memory in a Hopfield network could be analogous to the responsibility for functional expression in a protein. By this reasoning, if the function of a protein is a distributed "memory", it may not be expressed by a single residue, or even by a small region of the protein, but by the protein as a whole.

If a protein's functional capability could be completely described by only three residues, with no contribution from the rest of the molecule, then the "memory" could be said to be stored in these three residues. Even for relatively small and simple proteins, it does not appear that a single localized site bears the entire responsibility for protein function. Studies on bovine pancreatic trypsin inhibitor (BPTI) have demonstrated that synthetic analogues with the same composition and geometry as the interfacial site are not effective as inhibitors (Kitchell & Dyckes, 1982). Similarly, attempts to engineer the specificity of proteins similar to BPTI have required modifications at regions that were expected to be unconnected with functional expression for success (Roberts et al., 1992; Kohfeldt et al., 1996; Kraunsoe et al., 1996). Studies on other proteins such as ricin and aspartate aminotransferase have also indicated that residues apparently unconnected with functional expression do, in fact, make some functional contribution (Kitaoka, 1998; Oue et al., 1999). These indicate that for proteins of a range of sizes (even for BPTI, which has an apparently straightforward interaction with trypsin) a large share of the responsibility for ensuring specificity and efficacy lies outside the interactive site. This is consistent with a distributed memory model of protein function.

PROTEINS EXHIBIT "SOFT FAILURE"

As individual nodes of a Hopfield network fail, the system's ability to recall the correct memory is compromised. As the error rate increases, the signal/noise ratio falls, but absolute failure to recall pre-trained states is avoided. This feature of the network is termed soft failure.

Proteins also exhibit this characteristic soft failure. The failure of an individual node in a Hopfield network is equivalent to an inappropriate side-chain substitution in a protein. It is observed that alanine-scanning mutagenesis at single sites rarely destroys folding capability or function, even in relatively small proteins (e.g. Dunwiddie et al., 1992; Yu et al., 1995; Gasparini et al., 1998). In terms of evolutionary intermediates, the less-fit substituted variant is presumed to "fall by the wayside", but usually only when a fitter version of the protein is present in the population. The unfit substitution is, by itself, assumed to be usually insufficient to cripple the protein and prevent its potential future evolution to a fitter variant.

ASYNCHRONOUS VS. SYNCHRONOUS UPDATE

The detailed behaviour of a Hopfield network is dependent on several attributes of the system. These attributes include whether all the nodes have their states changed at exactly the same time (synchronous update), one by one (asynchronous update), or somewhere in between, as well as the rate at which nodal states are changed and what the response of each individual node is to a set of inputs.

Protein evolution is usually considered (for simplicity) in most analyses to progress by means of a Markov process: a series of events where one state is substituted for another at a single site. This view of evolution largely ignores the various insertion, deletion and inversion events that are known to apply. These complicating factors can be avoided if we consider only the case of related sequences that share common tertiary structure, but have only a small degree of pairwise divergence (and, possibly, a relatively recent common ancestor). Such conditions are identical to those recommended for useful covariation analysis according to Pollock et al. (1999). Protein evolution in these circumstances corresponds to a completely asynchronous update of the nodes in a Hopfield network.

FIG. 4. This figure represents an arbitrary threshold device for a single node of a Hopfield network. The vertical axis conventionally indicates the output from the node, while the horizontal axis indicates the weighted sum of inputs from all other connected nodes. The labels for this graph are chosen to represent the influence of a single adjacent residue on the site undergoing substitution (see text). This corresponds to a very strong connection strength between the two considered positions, and a mutual influence dependent on the relative sizes of the two side-chains. In this specific case, the attempted substitution at the site (node) can either be rejected or accepted, depending on its effect on the overall fitness of the protein (and expressing organism). This is effectively, then, a two-state process. The case illustrated is one where there is functionally significant steric conflict between adjacent residues. The substitution of a large side-chain for an existing small side-chain will reduce the fitness of the protein with the result that if the side-chain is too large, the substitution is rejected.

HOPFIELD NETWORK DYNAMICS 2: PROPAGATION AND TRANSMISSION

An important factor in the behaviour of a Hopfield network is whether, once the state at a single node has changed, the rest of the network has sufficient time to respond to that change completely before the next nodal state change takes place. In the Hopfield network, a failure of the state-change information to reach all nodes in the system prior to the next change is referred to as transmission delay. In the protein evolution model, the consideration of potential transmission delay will involve the rate at which substitutions are attempted, and the rate at which fit substitutions are fixed in the population. If the rate of fixing marginally fitter substitutions is faster than the rate of substitution, then the protein has ample time to respond to each change individually, and transmission delays are negligible.

We define the mean attempt rate of substitution, W, as the reciprocal of the time period between successive attempted substitutions at a single site in the protein. This value can be estimated bearing in mind that not all attempted substitutions will be fixed. Natural rates for accepted amino-acid substitution vary from 0.01 to ten accepted changes per amino acid site every $10^9$ yr (Ridley, 1996). If we assume that each position eventually finds a single most suitable substitution, and that each accepted substitution is the last in a series of 19 attempted substitutions (the upper limit with 20 common amino acids), then the maximum rate of attempted substitutions is approximately 190 attempted substitutions every $10^9$ yr. The upper limit on W, the mean attempt rate at each site, is thus $1/(5.26 \times 10^6)$ yr, i.e. $1.9 \times 10^{-7}$ attempts per year. For a protein of 300 residues with this uniform, maximum rate of substitution at each site, we should expect one attempted substitution in the whole structure every 175 000 yr.

In a simple model of evolution, a substitution that confers a marginal fitness advantage of 1% on the expressing organism (if there is heterozygous advantage) can be fixed in 80% of the population after 1000 generations (Ridley, 1996) from an initial frequency of 1%. By this model the substitution is fixed in 93% of the population after 2000 generations. A species with a generation time of 20 yr will complete approximately 8800 generations in 175 000 yr; hence, we can expect that a protein has ample time to "respond" to any changes induced by each substitution.

From the above calculations it appears that transmission delays in this model are likely to be negligible, and so we can ignore their effects in a Hopfield-like description of protein evolution.
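The rate estimates in the transmission-delay argument can be reproduced with a few lines of arithmetic. The sketch below uses only the figures quoted in the text (ten accepted changes per site per $10^9$ yr, 19 attempts per accepted change, 300 residues, a 20 yr generation time); the variable and function names are illustrative.

```python
MAX_ACCEPTED_PER_SITE = 10   # accepted changes per site per 1e9 yr (upper value quoted)
ATTEMPTS_PER_ACCEPTED = 19   # upper limit with 20 common amino acids
PERIOD_YEARS = 1e9

# Upper limit on W: attempted substitutions per site per year.
attempts_per_site_per_year = (
    MAX_ACCEPTED_PER_SITE * ATTEMPTS_PER_ACCEPTED / PERIOD_YEARS
)
years_between_attempts_per_site = 1 / attempts_per_site_per_year


def years_between_attempts(n_residues):
    """Mean interval between attempts anywhere in a chain of n_residues."""
    return years_between_attempts_per_site / n_residues


def generations(years, generation_time_years):
    """Number of generations completed in the given number of years."""
    return years / generation_time_years


print(f"{attempts_per_site_per_year:.2e}")        # 1.90e-07
print(f"{years_between_attempts_per_site:.3g}")   # 5.26e+06
print(round(years_between_attempts(300)))         # 17544
print(generations(175_000, 20))                   # 8750.0
```

The per-site figures agree with the text. Dividing the per-site interval by 300 residues, however, gives roughly 17 500 yr rather than the 175 000 yr quoted, which at a 20 yr generation time corresponds to about 880 generations rather than 8800. If this reading of the calculation is right, the margin between attempt rate and fixation rate at the maximum substitution rate may be narrower than stated, although the lower natural rates quoted (down to 0.01 accepted changes per site per $10^9$ yr) would widen it considerably.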


NEURONE RESPONSE PATTERNS

A vital characteristic of Hopfield networks is the response of the individual nodes, or neurones, i.e. how they calculate whether or not to change state from the inputs they receive. In the original paper describing such networks (Hopfield, 1982), it is stated that a nonlinear response is required for the system to display emergent behaviour such as soft failure and memory recall (Fig. 4).

Consider the process that occurs at a single residue according to the state-change algorithm (2). A substitution occurs at a single residue in a protein in one organism of the population, for example a small residue, Ala say, is substituted by one with a large side-chain, such as Phe. In gross terms, this substitution will have either a deleterious effect, or a neutral/beneficial effect. If the substitution is beneficial, the new variant should be fixed in the population and, effectively, the substitution is "accepted". If, on the other hand, the substitution is deleterious, the variant will not be fixed. This represents a direct choice between the two states "accept the substituted residue" and "reject the substituted residue" (Fig. 4).

Now consider a second site on the protein structure that makes a strong connection to the site at which the first substitution is made. We shall arbitrarily state that the ideal relationship between these two sites is that where one side-chain is small and the other large (perhaps they are adjacent and must maintain some aspect of conformation). We then see that substituting the Phe (or any other large) side-chain for the smaller Ala will cause the substitution to be rejected, and the original variant remains the major component of the population. This response is essentially two-state, nonlinear (Fig. 4), and completely general for any relationship between two sites where the side-chain at one site affects the choice of side-chain at the other, regardless of its nature. As the choice between acceptance and rejection of the substitution is dependent on other sites in a nonlinear manner, amino-acid residues in a protein structure conform to the requirement of nonlinear response and are capable of producing emergent behaviours.
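The two-state response described above can be illustrated as a step function of a single input. In the sketch below the input is taken to be the combined size of the candidate side-chain and its strongly connected neighbour; the relative sizes, the unit connection strength and the threshold are all hypothetical values chosen so that the Ala and Phe example from the text falls on opposite sides of the step.

```python
# Hypothetical relative side-chain sizes (arbitrary units).
SIZE = {"Gly": 1.0, "Ala": 2.0, "Val": 3.5, "Phe": 5.0}


def steric_input(candidate, neighbour, strength=1.0):
    """Assumed input to the node: connection strength times combined size."""
    return strength * (SIZE[candidate] + SIZE[neighbour])


def decide(candidate, neighbour, threshold=8.0):
    """Two-state response: accept unless the input exceeds the threshold."""
    if steric_input(candidate, neighbour) <= threshold:
        return "accept"
    return "reject"


# The neighbouring site carries a large side-chain (Phe).
for candidate in ["Gly", "Ala", "Val", "Phe"]:
    print(candidate, steric_input(candidate, "Phe"), decide(candidate, "Phe"))
# Gly 6.0 accept
# Ala 7.0 accept
# Val 8.5 reject
# Phe 10.0 reject
```

The input rises smoothly with candidate size, but the output switches abruptly between the two states, which is the nonlinearity the argument requires.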

IMPLICATIONS FOR COVARIATION ANALYSIS: OPENING THE NEURAL NETWORK "BLACK BOX"

Nearly all applications of covariation analysis(e.g. Shindyalov et al., 1994; Taylor & Hatrick,1994; Neher, 1994; Chelvanayagam et al., 1997;Pollock & Taylor, 1997) have presumed that the


overriding relationship between covariant amino-acid residues in a protein structure derives fromtheir being physically adjacent. Covariant pairswere usually also expected to coevolve in a com-pensatory manner. The methods developed haveall been used in attempts to identify spatiallyclose residues in a three-dimensional structurefrom the primary sequence alone, and all havemet with very limited success, persisting in theidenti"cation of strong relationships between dis-tant side-chains. Only a single recent paper (Pol-lock et al., 1999) has seriously considered thatgeometrically distant residues on a structure dotruly coevolve, and that any detectable covari-ation in such pairs is not merely &&noise''.

Pollock & Taylor (1997) and Pollock et al. (1999) recognized that if covariation between non-adjacent residues were a completely general process, and if the number of covariant sites were large, then the use of covariation analysis to predict adjacent pairs from primary sequence would fail. If, as we argue in this paper, the process of protein evolution relies on a complex mesh of covariant evolution throughout the molecule, then covariation analysis cannot predict spatially close pairs from primary sequence.

Covariation analysis, though not useful in this predictive role, will be an invaluable method for investigating the network-like nature of protein evolution. Neural networks are often criticized in their application to protein structure and function analysis as being "black boxes" whose rules of operation cannot be deciphered (e.g. Jones, 1999). This criticism, that although the network works, the exact rules by which it works are unknowable, could be levelled at protein evolution itself, perhaps more so if it operates by the network model above. However, for the case of protein evolution we (1) know the output "memories", i.e. the functions, which are assumed to be associated with (fairly) unique primary sequences of contemporary proteins, (2) can reconstruct ancestral sequences, even if only imperfectly (Yang et al., 1995) and (3) can determine connection strengths using covariation analysis. It should, therefore, be possible to model the evolution of a protein family using a specially constructed Hopfield network model for that family.
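As a minimal sketch of what such a model involves, the following Python code builds a small textbook Hopfield network with Hebbian weights and asynchronous updates. It is not a family-specific model: the two stored patterns are hypothetical stand-ins for functional "memories", each node stands in for a residue position reduced to a two-state code, and in a real application the weights would be derived from covariation analysis rather than from the Hebbian rule assumed here.

```python
from typing import List

Pattern = List[int]  # each node state is +1 or -1


def train(patterns: List[Pattern]) -> List[List[int]]:
    """Build a symmetric weight matrix with the Hebbian rule (zero diagonal)."""
    n = len(patterns[0])
    weights = [[0] * n for _ in range(n)]
    for p in patterns:
        for i in range(n):
            for j in range(n):
                if i != j:
                    weights[i][j] += p[i] * p[j]
    return weights


def recall(weights: List[List[int]], state: Pattern, max_sweeps: int = 10) -> Pattern:
    """Update nodes one at a time until no node changes (or max_sweeps is reached)."""
    state = list(state)
    for _ in range(max_sweeps):
        changed = False
        for i in range(len(state)):
            # Summed input from all other nodes (the diagonal weight is zero).
            field = sum(w * s for w, s in zip(weights[i], state))
            # Two-state threshold response; a zero field leaves the node unchanged.
            new = state[i] if field == 0 else (1 if field > 0 else -1)
            if new != state[i]:
                state[i] = new
                changed = True
        if not changed:
            break
    return state


# Two hypothetical stored "memories" (stand-ins for functional sequences).
memory_a = [1, 1, 1, 1, -1, -1, -1, -1]
memory_b = [1, -1, 1, -1, 1, -1, 1, -1]
weights = train([memory_a, memory_b])

# A single "substitution": flip one node of memory_a, then let the network relax.
perturbed = [-1] + memory_a[1:]
print(recall(weights, perturbed) == memory_a)
```

In this toy case the perturbed state relaxes back to `memory_a`, which mirrors the rejection of a single disfavoured substitution by the rest of the network.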

The key stage in understanding the network-like mode of evolution will be the development of a "map" indicating connection strengths between individual residues on a single structure, and its interpretation in the light of experimental data. Prior to this, an effective means of distinguishing "true" covariation from "false" covariation due to random events must also be developed, as existing covariation analysis methods are still insufficient in this regard (Pollock & Taylor, 1997; Pollock et al., 1999).

It is worth noting that, in the model above, the connection strength Cij is not specified as either symmetrical or directional (i.e. Cij = Cji or Cij ≠ Cji; residue i may or may not influence position j as strongly as position j influences position i). Covariation analyses can vary in whether they report symmetrical or directional covariation, and the model presented here is capable of accommodating both possibilities.
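The distinction between symmetrical and directional covariation can be illustrated with two standard information-theoretic scores computed on alignment columns. This is a sketch under stated assumptions: mutual information and conditional entropy are chosen here only as familiar examples of a symmetrical and a directional measure, they are not the measures used in the covariation studies cited, and the four-sequence columns are invented toy data.

```python
from collections import Counter
from math import log2
from typing import Sequence


def mutual_information(col_i: Sequence[str], col_j: Sequence[str]) -> float:
    """Symmetric covariation score (bits) between two alignment columns."""
    n = len(col_i)
    p_i, p_j = Counter(col_i), Counter(col_j)
    p_ij = Counter(zip(col_i, col_j))
    return sum((c / n) * log2((c / n) / ((p_i[a] / n) * (p_j[b] / n)))
               for (a, b), c in p_ij.items())


def conditional_entropy(col_j: Sequence[str], col_i: Sequence[str]) -> float:
    """Directional score: uncertainty (bits) left in column j once column i is known."""
    n = len(col_i)
    p_i = Counter(col_i)
    p_ij = Counter(zip(col_i, col_j))
    return -sum((c / n) * log2(c / p_i[a]) for (a, _b), c in p_ij.items())


# Toy columns, one character per aligned sequence (hypothetical data).
site_a = "AAFF"
site_b = "GGGS"

# Symmetric: the score is the same whichever site is listed first.
print(mutual_information(site_a, site_b), mutual_information(site_b, site_a))

# Directional: knowing site a constrains site b more than the reverse.
print(conditional_entropy(site_b, site_a), conditional_entropy(site_a, site_b))
```

A symmetrical score of this kind would populate a matrix with Cij = Cji, whereas a directional score would in general give Cij ≠ Cji.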

If these investigations are restricted to large sets of related sequences that share common tertiary structure, but have only a small degree of pairwise divergence, then a fairly complete map of connection strengths and nodal states could be drawn. This map would locate those positions that have required modification to initiate functional divergence, and describe the interrelationships between these residues and any others that have undergone compensatory or conspiratorial substitution. Such a map could enable protein engineers to pinpoint key residues responsible for functional expression, specificity and selectivity, and their potential influence on the structure. Armed with this knowledge, rational engineering of proteins could become a large step closer.

L.P. would like to thank the University of Strathclyde for a studentship.

REFERENCES

BAJAJ, M. & BLUNDELL, T. (1984). Evolution and the tertiary structure of proteins. Ann. Rev. Biophys. Bioeng. 13, 453-492.

BALDWIN, R. L. & ROSE, G. D. (1999a). Is protein folding hierarchic? I. Local structure and peptide folding. Trends Biochem. Sci. 24, 26-33.

BALDWIN, R. L. & ROSE, G. D. (1999b). Is protein folding hierarchic? II. Folding intermediates and transition states. Trends Biochem. Sci. 24, 77-83.

BORK, P., DANDEKAR, T., DIAZ-LAZCOZ, Y., EISENHABER, F., HUYNEN, M. & YUAN, Y. (1998). Predicting function: from genes to genomes and back. J. Mol. Biol. 283, 707-725.


CHELVANAYAGAM, G., EGGENSCHWILLER, A., KNECHT, L., GONNET, G. H. & BENNER, S. A. (1997). An analysis of simultaneous variation in protein structures. Protein Eng. 10, 307-316.

CREIGHTON, T. E. (1993). Proteins: Structures and Molecular Properties, 2nd Ed., pp. 413-441. London: W. H. Freeman.

DOBSON, C. M., SALI, A. & KARPLUS, M. (1998). Protein folding: a perspective from theory and experiment. Angew. Chem. Int. Ed. 37, 869-893.

DUNWIDDIE, C. T., NEEPER, M. P., NUTT, E. M., WAXMAN, L., SMITH, D. E., HOFMANN, K. J., LUMMA, P. K., GARSKY, V. M. & VLASUK, G. P. (1992). Site-directed analysis of the functional domains in the factor Xa inhibitor tick anticoagulant peptide: identification of two distinct regions that constitute the enzyme recognition sites. Biochemistry 31, 12126-12131.

GASPARINI, S., DANSE, J.-M., LECOQ, A., PINKASFELD, S., ZINN-JUSTIN, S., YOUNG, L. C., DEMEDEIROS, C. C. L., ROWAN, E. G., HARVEY, A. L. & MENEZ, A. (1998). Delineation of the functional site of α-dendrotoxin. J. Biol. Chem. 278, 25393-25403.

HEGYI, H. & GERSTEIN, M. (1999). The relationship between protein structure and function: a comprehensive survey with application to the yeast genome. J. Mol. Biol. 288, 147-164.

HOPFIELD, J. J. (1982). Neural networks and physical systems with emergent collective computational abilities. Proc. Nat. Acad. Sci. U.S.A. 79, 2554-2558.

JONES, D. J. (1999). GenTHREADER: an efficient and reliable protein fold recognition method for genomic sequences. J. Mol. Biol. 287, 797-815.

KIMURA, M. (1983). The Neutral Theory of Molecular Evolution. Cambridge: Cambridge University Press.

KITAOKA, Y. (1998). Involvement of the amino acids outside the active site cleft in the catalysis of ricin A chain. Eur. J. Biochem. 257, 255-262.

KITCHELL, J. P. & DYCKES, D. F. (1982). A synthetic 13-residue peptide designed to resemble the primary binding site of the basic pancreatic trypsin inhibitor. Biochim. Biophys. Acta 701, 149-152.

KOHFELDT, E., GOHRING, W., MAYER, U., ZWECKSTETTER, M., HOLAK, T. A., CHU, M.-L. & TIMPL, R. (1996). Conversion of the Kunitz-type module of collagen VI into a highly active trypsin inhibitor by site-directed mutagenesis. Eur. J. Biochem. 238, 333-340.

KRAUNSOE, J. A. E., CLARIDGE, T. D. W. & LOWE, G. (1996). Inhibition of human leukocyte and porcine pancreatic elastase by homologues of bovine pancreatic trypsin inhibitor. Biochemistry 35, 9090-9096.

LISBOA, P. G. J. (1992). In: Neural Networks (Lisboa, P. G. J., ed.), pp. 11-15. London: Chapman & Hall.

NEHER, E. (1994). How frequent are correlated changes in families of protein sequences? Proc. Nat. Acad. Sci. U.S.A. 91, 98-102.

ORENGO, C. A., FLORES, T. P., JONES, D. T., TAYLOR, W. R. & THORNTON, J. M. (1993). Recurring structural motifs in proteins with different functions. Curr. Biol. 3, 131-139.

ORENGO, C. A., MICHIE, A. D., JONES, S., JONES, D. T., SWINDELLS, M. B. & THORNTON, J. M. (1997). CATH - a hierarchic classification of protein domain structures. Structure 5, 1093-1108.

OUE, S., OKAMOTO, A., YANO, T. & KAGAMIYAMA, H. (1999). Redesigning the substrate specificity of an enzyme by cumulative effects of the mutations of non-active site residues. J. Biol. Chem. 274, 2344-2349.

POLLOCK, D. D. & TAYLOR, W. R. (1997). Effectiveness of correlation analysis in identifying protein residues undergoing correlated evolution. Protein Eng. 10, 647-657.

POLLOCK, D. D., TAYLOR, W. R. & GOLDMAN, N. (1999). Coevolving protein residues: maximum likelihood identification and relationship to structure. J. Mol. Biol. 287, 187-198.

PRITCHARD, L. & DUFTON, M. J. (1999). Evolutionary trace analysis of the Kunitz/BPTI family: functional divergence may have been based on conformational adjustment. J. Mol. Biol. 285, 1589-1607.

RIDLEY, M. (1996). Evolution, 2nd Ed., pp. 102, 156. London: Blackwell Science.

ROBERTS, B. L., MARKLAND, W., LEY, A. C., KENT, R. B., WHITE, D. W., GUTERMAN, S. K. & LADNER, R. C. (1992). Directed evolution of a protein: selection of potent neutrophil elastase inhibitors displayed on M13 fusion phage. Proc. Nat. Acad. Sci. U.S.A. 89, 2429-2433.

RUSSELL, R. B. (1998). Detection of protein three-dimensional side-chain patterns: new examples of convergent evolution. J. Mol. Biol. 279, 1211-1227.

SANDER, C. & SCHNEIDER, R. (1991). Database of homology-derived protein structures and the structural meaning of sequence alignment. Proteins: Struct. Funct. Genet. 9, 56-68.

SCOTT, D. L. & SIGLER, P. B. (1994). Structure and catalytic mechanism of secretory phospholipases A2. Adv. Protein Chem. 45, 53-88.

SHINDYALOV, I. N., KOLCHANOV, N. A. & SANDER, C. (1994). Can three-dimensional contacts in proteins be predicted by analysis of correlated mutations? Protein Eng. 7, 349-358.

TAYLOR, W. R. & HATRICK, K. (1994). Compensating changes in protein multiple sequence alignments. Protein Eng. 7, 341-348.

WARSHEL, A. (1998). Electrostatic origin of the catalytic power of enzymes and the role of preorganised active sites. J. Biol. Chem. 42, 27035-27038.

YANG, Z., KUMAR, S. & NEI, M. (1995). A new method of inference of ancestral nucleotide and amino acid sequences. Genetics 141, 1641-1650.

YU, M.-H., WEISSMAN, J. S. & KIM, P. S. (1995). Contribution of individual side-chains to the stability of BPTI examined by alanine-scanning mutagenesis. J. Mol. Biol. 249, 388-397.

ZHANG, C. & DELISI, C. (1998). Estimating the number of protein folds. J. Mol. Biol. 284, 1301-1305.
