Constructive Induction and Protein Tertiary Structure Prediction *

Thomas R. Ioerger 1,4   Larry Rendell 1,3,4   Shankar Subramaniam 1,2,3,4

1Department of Computer Science

2Department of Physiology and Biophysics

3National Center for Supercomputing Applications

4The Beckman Institute, University of Illinois, 405 N. Mathews Ave., Urbana, IL 61801
[email protected], [email protected], [email protected]

Abstract

To date, the only methods that have been used successfully to predict protein structures have been based on identifying homologous proteins whose structures are known. However, such methods are limited by the fact that some proteins have similar structure but no significant sequence homology. We consider two ways of applying machine learning to facilitate protein structure prediction. We argue that a straightforward approach will not be able to improve the accuracy of classification achieved by clustering by alignment scores alone. In contrast, we present a novel constructive induction approach that learns better representations of amino acid sequences in terms of physical and chemical properties. Our learning method combines knowledge and search to shift the representation of sequences so that semantic similarity is more easily recognized by syntactic matching. Our approach promises not only to find new structural relationships among protein sequences, but also expands our understanding of the roles knowledge can play in learning via experience in this challenging domain.

Introduction

Predicting the tertiary structure of a protein is an important but very difficult problem. Previous machine learning approaches to this problem have been limited because of the complex relationship between the low-level descriptions in terms of amino acid sequences and the high-level similarities among three-dimensional folds. [Ragavan and Rendell, 1993] have shown that, in other difficult domains, constructive induction can increase the accuracy and comprehensibility of learning over traditional symbolic, connectionist, and statistical methods. Constructive induction generally makes patterns in data more explicit by finding better representations in terms of intermediate concepts.

* This research was supported in part by an NSF Graduate Fellowship (TRI) and the following grants: NSF ASC-89-02829 (SS), NSF IRI-88-22031 (LR), and ONR N00014-88K0124 (David C. Wilkins, for equipment).

Representation change can be facilitated and learning improved by the use of knowledge [Rendell and Seshu, 1990; Towell et al., 1990]. We are studying how molecular biologists' knowledge of amino acid properties can be incorporated to improve learning in this domain.

One of the ultimate goals of computational biology is to predict the tertiary structure of a protein from its primary amino acid sequence (for a review, see [Schulz and Schirmer, 1979]). Protein structure prediction is important because the rate at which new sequences are being generated far exceeds the rate at which structures are being experimentally determined. It can take years of laboratory work to crystallize a protein for X-ray crystallography [Richards, 1992], current NMR techniques are limited to solving structures of at most 200 residues (amino acids) [Bax, 1989], and methods based on molecular dynamics are so computationally intensive that simulations are highly unlikely to find conformations with globally minimum energy [McCammon and Harvey, 1987].

To date, the only approach that has been used successfully is to identify a similar sequence, based on degree of homology, whose structure is known [Subramaniam et al., 1992]. Currently, structures for about 500 proteins have been deposited in the Protein Data Bank [Bernstein et al., 1977], falling into classes of about 100 distinct folds [Chothia, 1992]. If a new protein is found to have significant sequence similarity to a protein whose structure is known, then the new protein is assumed to have a similar fold [Schulz and Schirmer, 1979]. This approach, called homology modeling, is distinct from methods for predicting secondary structure [King and Sternberg, 1990; Qian and Sejnowski, 1988; Cohen et al., 1986], which have not been successfully extended to predict three-dimensional conformations.

It has been observed that as many as one third of new sequences appear to be similar to known sequences; such statistics have been used to estimate that the number of folds used in biological systems is only on the order of 1000 [Chothia, 1992]. This redundancy suggests that there is a high degree of structural conservation; of all possible protein folds, only a small fraction have been selected, and these folds have been opportunistically adapted for a variety of purposes [Neidhart et al., 1990]. Thus the ability to identify a known sequence (with solved structure) similar to a new sequence will, with significant and increasing frequency, provide structural information for analyzing new sequences [Bowie et al., 1991].

To measure sequence similarity, the common method is to align two sequences and then compute their homology (percent of identical residues) [Needleman and Wunsch, 1970]. The significance of a particular homology value can be tested by comparing it to the distribution of homologies from alignments of random sequences with the same amino acid compositions [McLachlan, 1971]. If the homology between two sequences is high enough, they are probably evolutionarily related and hence are adaptations of the same fold. We call this method the alignment-homology approach.
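To make the alignment-homology test concrete, here is a minimal sketch (ours, not the authors') of the significance check in the spirit of [McLachlan, 1971]: the observed homology is compared to homologies of alignments against shuffled sequences of the same composition. The `align` routine is an assumed stand-in for any pairwise alignment algorithm (e.g. Needleman-Wunsch) that returns the two aligned strings; the gap symbol and z-score summary are our choices.

```python
import random

def percent_identity(aligned_a, aligned_b):
    """Homology of two aligned, equal-length strings: the fraction of
    positions carrying identical residues (gap positions never match)."""
    matches = sum(x == y and x != '-' for x, y in zip(aligned_a, aligned_b))
    return matches / len(aligned_a)

def homology_significance(a, b, align, trials=100):
    """Score the best alignment of a and b against the distribution of
    scores from random sequences with the same composition.
    Returns a z-score: standard deviations above the random mean."""
    observed = percent_identity(*align(a, b))
    scores = []
    for _ in range(trials):
        shuffled = ''.join(random.sample(b, len(b)))  # preserves composition
        scores.append(percent_identity(*align(a, shuffled)))
    mean = sum(scores) / trials
    sd = (sum((s - mean) ** 2 for s in scores) / trials) ** 0.5
    return (observed - mean) / sd
```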

An interesting limitation of this approach is related to the observation that, while proteins with similar sequences have similar structures, proteins with similar structures often do not have similar sequences. For example, mandelate racemase (MR) and muconate lactonizing enzyme (MLE) have an average structural similarity based on the r.m.s. of Cα distances of only 1.3 Å, yet their sequences show only 26% similarity [Neidhart et al., 1990]. Thus a new sequence can have the same fold as a known sequence, but the similarity is not detected by the alignment-homology method.

Incorporating Machine Learning

In this section, we examine the potential for machine learning to relieve the aforementioned limitation of the alignment-homology approach to the protein structure prediction problem. One common machine learning technique is to induce a classification scheme (or "classifier") from a set of pre-classified examples that will classify unseen examples with high accuracy (supervised induction; for a review, see [Michalski, 1983]). We will sketch a fairly straightforward feature-based learning method as a thought experiment. Based on an analysis of how the induction algorithm interacts with the alignment algorithm, we will argue that this first learning approach should not in fact improve protein structure prediction.

In general, it is believed that domain knowledge is needed to improve learning. However, the forms of knowledge and methods for utilizing it differ from domain to domain. We will suggest that molecular biologists have partial knowledge related to the amino-acid-sequence representation of examples. In other domains, knowledge of representation has been exploited by a learning approach called constructive induction. We propose a novel constructive induction technique that uses this knowledge to search for better representations of amino acid sequences that make structural similarities more obvious.

Feature-Based Learning and Sequences

An obvious application of machine learning is to try learning sequence patterns associated with distinct folds. In this scenario, the examples are sequences and the classifications are the fold identities. For example, the structure for MR would be given a distinct class name; the sequences for both MR and MLE would be classified by this class name since they each have such a fold. The learning goal would be to construct a classifier that mapped any new sequence to the correct fold class (or none, if its fold is truly unique) with high accuracy. Success must be measured relative to the predictive accuracy of the alignment-homology method, which already can, by definition, classify all hemoglobins together, all cytochromes together, etc. The real question that emerges is: will this application of machine learning improve recognition of structural similarity by sequence homology?

To further explore this proposal, we must consider how a set of sequences could be generalized. While some research has been done on generalizing sequences with rules for pattern completion (e.g. a nondeterministic finite automaton) [Dietterich and Michalski, 1986] or various kinds of grammars [Fu, 1982; Searls and Liebowitz, 1990], most effort in machine learning has focused on techniques for feature-based learning [Rendell, 1986]. The general assumption behind feature-based learning is that training and testing examples can be described by giving a vector (with a fixed number of dimensions) of feature values. To illustrate, a feature-based description of a protein might be constructed from some of its biochemical properties, such as molecular weight, isoelectric point, solubilities in specific salt solutions, etc. Given this general assumption for feature-based learning, a great number of algorithms and their properties are known for inducing generalizations.

To construct primitive feature-based descriptions of protein sequences, one might suggest treating position 1 of a sequence as feature 1, position 2 as feature 2, etc., with the feature values ranging over the 20 residue names. However, it would seem to be a problem that proteins vary in length, since the number of features must be the same for all examples. In fact, it is clear that insertions and deletions will cause positions that should correspond (based on structural comparison) to shift relative to one another. These observations suggest that, for generalizing a set of sequences, we should construct a multiple alignment [Needleman and Wunsch, 1970; Dayhoff, 1972], perhaps allowing GAP as a new residue value. After a multiple alignment has been constructed, the features are well-defined by positions in the multiple alignment.

Once features are defined by a multiple alignment, the potential for using feature-based learning to generalize sequences becomes apparent. A fold (or "concept") can be represented by the set of residues observed at each position (conjunctive normal form [Michalski, 1983]). Furthermore, generalization of residues at a position can be restricted to certain subsets based on our knowledge of likely amino acid replacements, such as 'hydrophobic' or 'bulky' (internal disjunction via tree-structured attributes). Some machine learning techniques can even identify correlations of values at multiple positions (for example by feature construction [Matheus, 1989; Rendell and Seshu, 1990]). The basis for these generalizations is taken from, and can extend, the molecular biologist's idea of a consensus sequence [Dayhoff, 1972].¹

However, we will now argue that this straightforward application of machine learning will unfortunately not improve the ability to recognize structural similarity. Consider how to use such a concept representation to classify new examples. To see if the feature values of a new sequence match the consensus pattern, we would have to construct the feature-based description of the sequence. But recall that the new sequence probably has a different length than the consensus. We could attempt to assign feature values for the new sequence by finding the best possible alignment of it with any of the sequences in the multiple alignment. Cases where there is a good alignment to some known sequence are uninteresting, since by definition the alignment-homology method would have detected this similarity and made the same classification.

In the cases of insignificant homology, however, there can be no confidence in the alignment. If there is no statistical evidence that the best alignment has higher homology than would an alignment of random sequences, then many alternative alignments might have the same homology [McLachlan, 1971]. So either the new sequence is obviously in the fold, or we cannot construct its feature-based description for analysis by comparison to the consensus. For similar reasons, it is even difficult to see how to generalize convergent sequences within a fold. For example, it is not clear how to construct a multiple alignment of the various TIM-barrel proteins because there are no reliable pairwise alignments [Chothia, 1988].

Even if we could use machine learning in some variation of the above proposal to learn sequence patterns for folds, this would not facilitate fold recognition in general, but only for certain folds for which convergent sequences are known. What we really need is a new way to apply machine learning to the protein structure prediction problem so that learning improves the ability to recognize structural similarity via homology more generally. Since we do not want to give up the advantages of using the alignment algorithm, which gives us a good method for analyzing global similarity between sequences of different lengths by finding and averaging local similarities [Needleman and Wunsch, 1970], we must look for some way of applying machine learning to the comparison process itself, rather than to the results of comparison.

¹ A consensus sequence is constructed from a multiple alignment by indicating the set of most frequent residues occurring at each sequence position. Often the sets are restricted to sets of residues with common properties, such as charge or hydrophobicity. For example, two sequences ...Gly Val Asp Phe... and ...Gly Ile Glu Glu... might be represented by the consensus sequence ...Gly hydrophobic negative-charge anything....

Learning by Shift of Representation

The previous argument suggests there is an interaction between alignment and learning which we are not exploiting in the right way. The interaction becomes clear when we observe that the alignment-homology method is a classifier itself. The alignment-homology method learns how to classify sequences from pre-classified examples by saving the sequences with their fold classifications (determined directly by NMR or X-ray analysis, or indirectly by significant homology with an already classified sequence). Then, given an unclassified example, the alignment-homology method compares it to all the saved examples and returns the classification of the example sequence to which it is most similar (provided the homology is significant). Clearly, the alignment-homology method itself is performing nearest-neighbor learning [Duda and Hart, 1973].

With respect to improving the performance of a nearest-neighbor learning algorithm, it is well known that this algorithm is highly sensitive to the metric used to compare examples [Kibler and Aha, 1987]. For this metric we are using the homology of the alignment, and one of the parameters of the alignment-homology algorithm is the residue distance function [Erickson and Sellers, 1983]. This function returns a real number indicating the degree of mismatch between two aligned residues. In the standard alignment-homology method, the function returns 0 for identical residues and 1 otherwise. However, one variation of the residue distance function that has proved useful for comparing sequences has been the inverse of observed substitution frequencies [McLachlan, 1971; Gribskov et al., 1987].
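The two residue distance functions just described might look as follows. The paper does not specify exactly how the frequencies are inverted, so the normalization below (and the toy frequency table) are our assumptions, sketched for illustration only.

```python
def identity_distance(r1, r2):
    """The standard residue distance: 0 for identical residues, 1 otherwise."""
    return 0.0 if r1 == r2 else 1.0

def make_substitution_distance(exchange_freq):
    """Turn observed substitution frequencies into a distance: residue
    pairs that exchange often in related proteins score as less distant."""
    top = max(exchange_freq.values())
    def distance(r1, r2):
        f = exchange_freq.get((r1, r2), exchange_freq.get((r2, r1), 0.0))
        return 1.0 - f / top   # frequent exchange -> small distance
    return distance

# Toy frequency table with illustrative numbers (not real Dayhoff counts):
toy = {('V', 'I'): 90.0, ('V', 'V'): 100.0, ('V', 'D'): 2.0}
d = make_substitution_distance(toy)
# d('V', 'I') -> 0.1 ; d('V', 'D') -> 0.98 ; unseen pairs -> 1.0
```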

The rationale behind inverse substitution frequencies as a residue distance function is that it should cause structurally related sequences to appear more similar than truly unrelated sequences [Schulz and Schirmer, 1979]. If two sequences have the same fold, substitutions between them generally must be restricted to chemically or physically similar residues in order to fulfill local roles in determining structure. This biases the observed substitution frequencies because residues that play similar roles exchange more often. By inverting the frequencies, we are counting residues that play similar roles in structure as least distant (because they exchange more frequently), and residues that play different roles as more distant. Sequences from different folds should have a uniform distribution of frequent and infrequent substitutions, canceling the effect of varying mismatch penalties. But sequences from the same fold should have more of the frequent substitutions, get penalized less, and hence appear more similar overall.

This explanation suggests that we should be looking at sequences in a special way: not as sequences of residue identities, but as sequences of physico-chemical properties. When we see ...Val Tyr Glu... in a sequence, we actually think "...small hydrophobic residue branched at Cβ, aromatic hydroxylated residue, small negatively charged residue that can form H-bonds...." Thus we could achieve the same effect of using a substitution-frequency-based residue distance function with the identity residue distance function by transforming (prior to alignment) the symbols at each sequence position from residue identity into a symbol representing local physico-chemical properties. Furthermore, such transformations could take context into account by including properties of neighboring residues, thus capturing conditional roles. A match would indicate that two residues could play the same role in determining local structure, which is a vast improvement over matches based solely on residue identity.

So we propose that machine learning can be applied to the protein structure prediction problem by learning how to transform amino acid residues to represent local properties involved in determining structure. In the machine learning literature, this approach is generally called constructive induction [Michalski, 1983; Rendell and Seshu, 1990]. To improve the performance of a fixed learning algorithm (e.g. the alignment-homology method), constructive induction shifts the representation of examples to provide a more appropriate learning bias [Mitchell, 1980]. Constructive induction is thought to be particularly useful for learning hard concepts in difficult domains by discovering intermediate concepts, which, when added to the representation, make significant patterns in the data more explicit [Rendell and Seshu, 1990]. Constructive induction has been found to significantly improve the accuracy and comprehensibility of learning over standard algorithms, such as decision tree builders (e.g. C4.5), neural nets (e.g. backpropagation), and statistics-based programs (e.g. MARS) [Ragavan and Rendell, 1993].

While constructive induction would seem in principle to be advantageous in the domain of protein structure prediction, current frameworks are not applicable because examples are represented by sequences rather than feature vectors [Matheus, 1989]. In the following sections, we propose a new method of constructive induction that utilizes molecular biologists' knowledge of likely relevant properties of amino acids to search for better representations of sequences, ultimately to make sequence comparisons better reflect structural relationships. Our approach to this important problem is unique in combining traditional statistical analysis with knowledge of amino acid properties, and could potentially discover new structural relationships among protein sequences.

Transformation Functions

As we have proposed above, our learning goal is to find an effective transformation function which will re-represent sequences so that structural similarity is easier to recognize with the alignment-homology method.² First we will define transformation functions, and then show how to construct a space of them.

Sequences are constructed in the usual way from finite alphabets and have finite, non-zero lengths: A ∈ Σ⁺, A = a₁a₂...aₙ, aᵢ ∈ Σ, where n ∈ ℕ is the length of A, denoted |A|. Σ_aa is the usual alphabet for protein sequences, consisting of the 20 amino acid symbols. We will be constructing other alphabets for describing local properties.

In order to transform an entire sequence, we perform a local transformation on each position in the sequence with local transformation functions. Then we extend the local transformation function into a sequence transformation function by applying it to each position in a sequence.

Definition 1  A local transformation function F maps a sequence A over one alphabet Σ₁ and an index i (1 ≤ i ≤ |A|) to a symbol in another alphabet Σ₂: F : Σ₁⁺ × ℕ → Σ₂.

Definition 2  A sequence transformation function F̄ is an extension of a local transformation function F that maps a sequence A over one alphabet Σ₁ to a sequence B of the same length (|B| = |A|) over another alphabet Σ₂ by applying F to each position in A: bᵢ = F(A, i), (1 ≤ i ≤ |A|).

The simplest examples of local transformation functions are identities. The function ID₀, when applied to position i of a sequence, simply returns the symbol at that position. Thus ID₀ copies one sequence to another: if B = ID₀(A), then bᵢ = aᵢ for 1 ≤ i ≤ |A|. Other identity functions return neighboring symbols, and their sequence-transformation-function extensions cause shifts. For example, if B = ID₋₁(A), then bᵢ = aᵢ₋₁ for 2 ≤ i ≤ |A|, and b₁ = a₁.³
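These definitions translate directly into code. Below is a minimal sketch (in Python, our choice; the paper gives no code) of the identity functions and the extension operator, using boundary clamping as one of the possible boundary conditions footnote 3 alludes to.

```python
def make_id(k):
    """Identity local transformation ID_k (Definition 1): return the symbol
    at position i + k. Boundaries are clamped to the nearest valid position,
    which reproduces the b1 = a1 rule for ID_-1; an 'undefined' symbol is
    an alternative (see footnote 3)."""
    def f(seq, i):                       # i is 1-based, as in the paper
        j = min(max(i + k, 1), len(seq))
        return seq[j - 1]
    return f

def extend(f):
    """Extension into a sequence transformation function (Definition 2):
    apply the local transformation at every position."""
    return lambda seq: [f(seq, i) for i in range(1, len(seq) + 1)]

ID0, ID_minus1 = make_id(0), make_id(-1)
print(extend(ID0)("GAVL"))        # ['G', 'A', 'V', 'L']  (a copy)
print(extend(ID_minus1)("GAVL"))  # ['G', 'G', 'A', 'V']  (shift; b1 = a1)
```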

Abstraction

Given this base class of local transformation functions, we can recursively construct more interesting transformations by two processes, the first of which is called abstraction. Intuitively, to abstract a position in a sequence is to replace the symbol with its class according to some partition of the alphabet.

² Transformations were also used in [Dietterich and Michalski, 1986] to facilitate sequence learning. Their operator for adding derived attributes subsumes our abstraction operator (section 3.1), and their blocking operator is subsumed by our crossing operator (section 3.2).

³ There is some freedom in defining boundary conditions; we alternatively might have extended the alphabet of B to contain a symbol for "undefined."

Definition 3  An abstraction function AB_P maps a sequence A over an alphabet Σ and an index i (1 ≤ i ≤ |A|) to a symbol in the alphabet Σ/P, the class names of the symbols in Σ under the partition P.

The effect of abstraction is that, when comparing two sequences via the alignment-homology method, some mismatches would be changed to matches because the symbols had been identified together. The most obvious abstraction function is the one that maps the amino acids into their classes heuristically derived from observed substitution frequencies: AB_Paa, where Paa = {{V,I,L,M}, {C}, {F,Y,W}, {K,R,H}, {S,T,D,N,G,A,E,Q,P}} [Dayhoff et al., 1972]. However, abstraction is general and can be used to identify any subset of symbols. For example, suppose we partitioned the amino acids into three classes: HYDROPHILIC, HYDROPHOBIC, and AMPHIPATHIC. Then we could single out the property of being HYDROPHOBIC by combining HYDROPHILIC with AMPHIPATHIC via the partition {{HYDROPHILIC, AMPHIPATHIC}, {HYDROPHOBIC}}. In terms of constructive induction, the abstraction process disjoins feature values [Michalski, 1983; Rendell and Seshu, 1990].
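A sketch of abstraction layered on top of a local transformation, reusing make_id from the previous sketch; representing class names by partition indices is our convention, not the paper's.

```python
# Paa, the partition of the amino acid alphabet from [Dayhoff et al., 1972]:
P_AA = [set("VILM"), set("C"), set("FYW"), set("KRH"), set("STDNGAEQP")]

def make_abstraction(partition, f):
    """AB_P over a local transformation f (Definition 3): replace the symbol
    f produces with the name of its class under the partition P (here the
    class index serves as the class name)."""
    cls = {sym: i for i, group in enumerate(partition) for sym in group}
    return lambda seq, i: cls[f(seq, i)]

ab = make_abstraction(P_AA, make_id(0))
print(ab("GAVL", 3))  # 0 -- Val falls in the class {V, I, L, M}
```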

Crossing

One problem with abstracting residues into classes is that there are multiple dimensions of similarity among amino acids which might get confounded in any single partition. For example, threonine is similar to valine in size and similar to tyrosine because its hydroxyl can participate in hydrogen bonds, but it is not meaningful to identify all three of these residues together. The substitution frequency matrix, although more flexible because of its scalar similarity values, also suffers from such confounding, since it must average out relative similarities based on any and all relevant properties [Dayhoff et al., 1972].

To alleviate this confounding effect, we observe that context often determines the primary roles played by a residue. For example, the important properties of Val in a β-sheet are its hydrophobicity and its branching at Cβ (for shielding the polar backbone), but not its smallness [Schulz and Schirmer, 1979]. If we could estimate the local environment, then we could abstract residues conditional on which properties are most likely being exploited. A method for approximating the local environment is to find patterns in neighboring residues and their properties [Schulz and Schirmer, 1979]. Thus we introduce crossing as a second process for constructing new transformation functions. Crossing takes symbols from two local transformation functions and forms a new symbol in the product of the two alphabets.

202 ISMB--93

Definition 4  The cross F₁ × F₂ of two local transformation functions F₁ (mapping into Σ₁) and F₂ (mapping into Σ₂) maps a sequence (over Σ) and a position index into the cross product of the alphabets of the two functions: F₁ × F₂ : Σ⁺ × ℕ → Σ₁ × Σ₂.

As a hypothetical example, suppose that normally hydrophobicity is the most important property, but in turns (say, when glycine is the previous residue) size is most important. We could implement this knowledge in a transformation function like AB(ID₋₁ × ID₀) that crossed the identity of position i-1 with the identity of position i and then abstracted the product symbols in the following way. When the first symbol in the product is glycine, form classes based on the size of the second symbol in the product, and otherwise, form classes based on the hydrophobicity of the second symbol in the product. Thus the symbol Gly × Val would get mapped into one class, perhaps called NEXT-TO-GLY-AND-SMALL, and Ser × Val would get mapped into another class, perhaps called NOT-NEXT-TO-GLY-AND-HYDROPHOBIC. In terms of constructive induction, the crossing process conjoins features [Rendell and Seshu, 1990; Michalski, 1983].
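The hypothetical turn-sensitive transformation from this paragraph can be sketched as follows, again reusing make_id. The SMALL and HYDROPHOBIC sets are illustrative guesses chosen only to reproduce the text's two examples; the paper does not fix these property classes.

```python
def make_cross(f1, f2):
    """The cross F1 x F2 (Definition 4): pair up the symbols produced by
    two local transformations at the same position."""
    return lambda seq, i: (f1(seq, i), f2(seq, i))

SMALL = set("GASTCV")         # illustrative property sets, not from the paper
HYDROPHOBIC = set("VILMFWC")

def turn_aware(seq, i):
    """An abstraction on top of ID_-1 x ID_0: classify position i by size
    after a glycine (a likely turn), and by hydrophobicity elsewhere."""
    prev, cur = make_cross(make_id(-1), make_id(0))(seq, i)
    if prev == 'G':
        return 'NEXT-TO-GLY-AND-' + ('SMALL' if cur in SMALL else 'LARGE')
    return 'NOT-NEXT-TO-GLY-AND-' + (
        'HYDROPHOBIC' if cur in HYDROPHOBIC else 'POLAR')

print(turn_aware("AGVK", 3))  # Gly x Val -> NEXT-TO-GLY-AND-SMALL
print(turn_aware("ASVK", 3))  # Ser x Val -> NOT-NEXT-TO-GLY-AND-HYDROPHOBIC
```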

Constructing the Space

Through the base cases (identity functions) and recursive cases (abstractions and crossings), a large space of functions can be constructed, similar to extending a set of features by the logical operators {∧, ∨} [Rendell and Seshu, 1990]. These functions formally capture the ability to compare sequence positions in terms of their local properties. What we hope is that, by finding an appropriate transformation function and applying it to a pair of sequences, the alignment-homology algorithm will be facilitated by a better correspondence between syntactic matching and semantic similarity based on physico-chemical roles played in determining local structure.

Perhaps the ultimate local transformation function would be one that maps sequence positions into secondary structure classes [King and Sternberg, 1990]. If sequences were compared this way, the alignment-homology method would be an extremely good classifier for folds. It is possible that two sequences could have similar secondary structure patterns and yet fold into distinct global conformations, but this seems highly improbable, especially considering that only on the order of 1000 folds are predicted to be used in biological systems. Since secondary structure is largely determined by properties of residues close in sequence, we expect the space of transformation functions to contain a function that expresses such a representation. Importantly, our approach surpasses secondary structure prediction methods [King and Sternberg, 1990; Qian and Sejnowski, 1988; Cohen et al., 1986] by using such local predictions to recover full three-dimensional conformations.

Page 6: Constructive Induction and Protein Tertiary … · Constructive Induction and Protein Tertiary Structure Prediction * Thomas R. Ioerger 1’4 Larry Rendell 1,3,4 Shankar 1,2,3,4 Subramaniam

Searching for Transformations

The constructions mechanize operations known to be important, imparting knowledge of the form of transformations, and relegate to the computer the task of searching to instantiate the details of transformations. The space of transformation functions is very large (consider all the possible partitions for abstracting an alphabet of size 20; consider the exponential growth of neighborhood conditions when crossing with more neighbors), so it must be searched intelligently. In order to measure progress, we must operationalize our goal into a method for evaluating transformation functions. Since we are looking for a transformation function that improves the ability of the alignment-homology method to recognize structural similarity, the basic test will be to observe the effect that pre-processing sequences with a transformation function has on the predictive accuracy of the alignment-homology method.

The predictive accuracy can be estimated by classifying a training set of sequence pairs with insignificant homology, some of which are known to be in the same fold (+ class: SAME-FOLD), and the others known to be in different folds (- class: DIFFERENT-FOLD). Without any transformation, the alignment-homology method would classify all these sequence pairs as DIFFERENT-FOLD, so the goal is to find transformation functions that reduce the false negatives while preserving the true negatives.

If we plot a histogram of homologies from alignments of unrelated sequences (pairs in the training set classified as DIFFERENT-FOLD), we get a distribution with a low mean of roughly 10-20%. If we were to plot a similar histogram for insignificantly homologous sequences classified as SAME-FOLD, we would expect to see a similar distribution, since this is the basis for the definition of insignificant homology. The overlap is precisely the reason that syntactic matching (homology using the "null" transformation function ID₀) is an inadequate method for recognizing structural similarity: there is poor separation of sequence-pair classifications at low homology. Thus an operational version of our goal is to find a transformation function that separates these two peaks. We can quantitatively evaluate a transformation function (relative to a training set of sequence pairs) by computing the degree of separation S based on average homologies μᵢ, variances σᵢ², and sample sizes nᵢ, where i is the class name: S = (μ₊ - μ₋) / (σ₊²/n₊ + σ₋²/n₋). This formula captures the notion that the distance between two distributions depends not only on the distance between the means, but also on how spread out they are.
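A direct transcription of the separation measure as reconstructed above. Note that, unlike a Welch t-statistic, no square root is taken of the pooled variance term, so S rewards distant means and penalizes spread but is not a calibrated test statistic.

```python
def separation(pos_scores, neg_scores):
    """Peak separation S = (mu+ - mu-) / (var+/n+ + var-/n-) between the
    SAME-FOLD (+) and DIFFERENT-FOLD (-) alignment-score distributions."""
    def stats(xs):
        n = len(xs)
        mu = sum(xs) / n
        var = sum((x - mu) ** 2 for x in xs) / n
        return mu, var, n
    mu_p, var_p, n_p = stats(pos_scores)
    mu_n, var_n, n_n = stats(neg_scores)
    return (mu_p - mu_n) / (var_p / n_p + var_n / n_n)
```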

This evaluation can be used to search the space for effective transformation functions. For example, perhaps a hill-climbing approach would incrementally build increasingly better representations through the crossing and abstracting constructions. Or a genetic algorithm [Booker et al., 1989] that recombines sub-functions of highly rated transformation functions might be effective. Since there are so many possible ways to abstract or cross a particular transformation function, it is clear that some domain knowledge (more than is already built into the transformation operators and alignment-homology algorithm [Matheus, 1989]) will be necessary [Michalski, 1983; Towell et al., 1990]. Fortunately, the molecular biologist's knowledge of physical and chemical properties that are likely to be involved can be used as suggestions for abstraction functions [Schulz and Schirmer, 1979]. Similarly, the knowledge that the local environment at a sequence position is largely determined by up to 5 residues in both directions is useful in restricting the crossing constructs [Schulz and Schirmer, 1979]. By searching the space of transformations near constructions consistent with such knowledge, the evaluation metric can guide us to better transformations, and we might be able to refine our knowledge of the principles determining protein structure by interpreting the search results [Towell et al., 1990]. Our research expands the roles knowledge can play in learning.

In summary, our approach to improving protein structure prediction is essentially to optimize the representation of amino acid sequences. In contrast to the nearest-neighbor approach described earlier, this learning method takes pairs of sequences as a training set (rather than single sequences), represents concepts in the form of transformation functions (instead of saved examples), and classifies unseen examples (sequence pairs) as SAME-FOLD or DIFFERENT-FOLD (instead of returning actual fold identities of single sequences). The learning is achieved by optimizing the pre-processing transformation of a training set of sequence pairs for predictive accuracy of the alignment-homology classifier. The evaluation is based on the separation of peaks from distributions of alignment scores between sequence pairs of same and of different structure. The shift of representation to fit a fixed learning bias makes this approach a kind of constructive induction, and promises to exploit and augment molecular biologists' knowledge of the physico-chemical roles played by amino acids in determining protein structure.

Preliminary Experimental Results

In this section, we demonstrate the potential for our constructive induction approach to facilitate recognition of structural similarity. Table 1 shows the set of pairs of sequences we used as data. Each pair represents dissimilar sequences with similar folds and will be used as a positive example (SAME-FOLD). Sequences not listed in the same pair are not only different in sequence, but also in structure; such pairs will be used as negative examples (DIFFERENT-FOLD).

Table 1: The data set used for these experiments: pairs of sequences with structural similarity but insignificant sequence homology. The names refer to PDB entries.

name   chain     description
256b             cytochrome b562
2mhr             myohemerythrin
1ald             aldolase A
4xia             xylose isomerase
1rhd   1-146     rhodanese
1rhd   152-293   rhodanese
1gox             glycolate oxidase
1wsy   beta      tryptophan synthase
1cd4             T-cell CD4
2fb4   light     immunoglobulin Fab
2hhb   alpha     hemoglobin
1ecd             erythrocruorin
2aza             azurin
1pcy             plastocyanin
2fb4   heavy     IgA Fab
1fc1             IgG1 Fc

To demonstrate that the alignment-homology method would have difficulty classifying this data set, we computed the best possible alignment scores for each example pair (using the algorithm of [Gotoh, 1982] with a gap start penalty of 3, a gap extension penalty of 0.1, and the identity residue distance function described above). It appeared that the alignment scores were linearly dependent on the minimum length of the sequences being aligned, so this variable was factored out of each score, leaving a number between 0 (no matches) and 1 (identical sequences). The distributions of scores for positive and negative examples are compared in Figure 1 (showing unweighted probability density functions for normal distributions given the means and standard deviations for each set of scores). The distributions are extremely overlapped; the peak separation S is 114.⁴
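A trivial sketch of the normalization step just described, assuming raw is the Gotoh alignment score for sequences a and b; the exact functional form used in the experiment is our reading of the text.

```python
def normalized_score(raw, a, b):
    """Factor the observed linear length dependence out of a raw alignment
    score, leaving a value between 0 (no matches) and 1 (identical)."""
    return raw / min(len(a), len(b))
```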

Figure 1: Distributions of alignment scores for sequence pairs (without any transformation) classified as SAME-FOLD (solid line) and DIFFERENT-FOLD (dashed line). The separation of the peaks is 114.

⁴ As defined in the section on searching for transformations, S is an abstract, unitless measure. The separation value of a transformation by itself is meaningless; it is only useful in comparison to the separation of another transformation. Separation can quantitate the observed differences among Figures 1, 2, 3, and 5, and is used for making decisions in our hill-climbing algorithm. Based on our figures, peaks are not noticeably separated until S is on the order of 1000; negative separation indicates that the mean of the positive peak is below the mean of the negative peak.

Interestingly, when amino acid residues are mapped into classes according to substitution frequencies by AB_Paa, the ability to distinguish structural similarity does not improve. Figure 2 shows the distributions of alignment scores for SAME-FOLD and DIFFERENT-FOLD sequence pairs. The scores have shifted higher because the abstraction causes more matching. However, the shift was independent of fold similarity; the peaks are still indistinguishable. The separation is -43. Clearly, residue class at a single position is not refined enough to capture secondary structure. The representation is apparently confounding the various roles amino acids can play, hence causing the substitution patterns among sequences with the same structure to appear random.

Figure 2: Distributions of alignment scores for sequence pairs when amino acids are transformed into residue classes (AB_Paa). The separation of the peaks is -43.

To find a more expressive representation, we crossed the residue class at a position with its neighbors, one each in the N-terminal and C-terminal directions. Simply crossing the residue classes at these three positions produces an alphabet of size 125, since the range of class values is 5 for each position. If such a transformation function were used, most sites would appear dissimilar, including those that should match based on structural comparison. Thus we would like to find a partition of the 125 product symbols that maps together some of the triples which are in fact found in the same secondary structures.

Figure 3: Distributions of alignment scores for sequence pairs when the residue classes at positions i-1, i, and i+1 are crossed, and then abstracted according to the random initial partition in the experiment. The separation of the peaks is -20.

To find such an abstraction function, we used the technique of hill-climbing [Winston, 1984]. First, we randomly partitioned the 125 values into 10 classes. This initial abstraction function, applied on top of the crossing function, did not separate the scores very well (see Figure 3); the initial separation value was -20 (nearly complete overlap, like Figure 2). Then we iteratively perturbed the partition and tested for improved separation over the data set. The perturbation was accomplished by randomly choosing one of the 125 values and switching it from its current class to a new one (sometimes naturally emptying a class; 10% of the time creating its own new class). The perturbed partition was evaluated by applying it with the crossing transformation to the sequence pairs and computing the separation of the positive and negative peaks as before. If the separation increased, the perturbed partition was kept for the next iteration.
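A sketch of this hill-climbing loop under our reading of the procedure. The evaluate callback is assumed, not given in the paper: it would re-represent each training pair with the candidate partition (via the crossing transformation), align the pairs, and return the peak separation S.

```python
import random

def perturb(partition):
    """Move one randomly chosen product symbol to a different class
    (10% of the time into a brand-new class); a class emptied by the
    move disappears naturally."""
    classes = [set(c) for c in partition]
    sym = random.choice([s for c in classes for s in c])
    for c in classes:
        c.discard(sym)
    classes = [c for c in classes if c]
    if random.random() < 0.1 or not classes:
        classes.append({sym})
    else:
        random.choice(classes).add(sym)
    return classes

def hill_climb(partition, evaluate, iterations=500):
    """Keep a perturbed partition whenever it increases the separation of
    the SAME-FOLD and DIFFERENT-FOLD peaks over the training pairs."""
    best_s = evaluate(partition)
    for _ in range(iterations):
        candidate = perturb(partition)
        s = evaluate(candidate)
        if s > best_s:
            partition, best_s = candidate, s
    return partition, best_s
```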

In our experiment, the initial random partition was perturbed over 500 iterations, and the hill-climbing procedure appeared to be quite successful. Figure 4 shows that the ability to separate the peaks steadily increased. The best partition found had 13 classes with sizes ranging from 2 to 25 product symbols; no interpretable pattern in triplets of residue classes had yet appeared. Figure 5 shows how well this partition separated the peaks; its evaluation was 1338. We suggest that such a transformation function has captured something about local physico-chemical properties that is related to secondary structure. As a consequence, the alignment-homology method has become a better classifier; homologies between sequences that do indeed have similar structure have become increasingly significant and identifiable. Extensions of this experiment should include cross-validation of the results with an independent data set, and might include an analysis of new biophysical principles implied in the learned transformations.

Figure 4: Increase in separation with each perturbation of the partition in the experiment.

Figure 5: Distributions of alignment scores for sequence pairs when the residue classes at positions i-1, i, and i+1 are crossed, and then abstracted according to the best partition found in the experiment. The separation of the peaks is 1338.

Conclusion

In this paper we considered two ways of applying machine learning to the protein structure prediction problem. We argued that the straightforward approach of applying feature-based learning to generalize sequences in a fold would not be effective, since the ability to construct multiple alignments would by itself classify sequences as well. We suggested that machine learning could be more appropriately applied in the form of constructive induction by shifting the representation of amino acid sequences before computing the alignments. We presented a language of transformations that should be able to express local physical and chemical properties that determine protein structure; re-representing sequences with such a function should improve the correspondence between syntactic and semantic similarity. Finding such a function is an immense search task, but offers many possibilities for incorporating and refining molecular biologists' knowledge. The novel learning method we developed for this unique domain will expand current constructive induction frameworks and help us better understand the roles knowledge can play in learning.

References

Bax, A. 1989. Two-dimensional NMR and protein structure. Annual Review of Biochemistry 58:223-256.

Bernstein, F.; Koetzle, T.; Williams, G.; Meyer, E.; Brice, M.; Rodgers, J.; Kennard, O.; Shimanouchi, T.; and Tasumi, M. 1977. The Protein Data Bank: A computer-based archival file for macromolecular structures. Journal of Molecular Biology 112:535-542.

Booker, L.; Goldberg, D.; and Holland, J. 1989. Classifier systems and genetic algorithms. Artificial Intelligence 40:235-282.


Bowie, J.; Luthy, R.; and Eisenberg, D. 1991. A method to identify protein sequences that fold into a known 3-dimensional structure. Science 253:164-170.

Chothia, C. 1988. The fourteenth barrel rolls out. Nature 333:598-599.

Chothia, C. 1992. One thousand families for the molecular biologist. Nature 357:543-544.

Cohen, F.; Abarbanel, R.; Kuntz, I.; and Fletterick, R. 1986. Turn prediction in proteins using a complex pattern-matching approach. Biochemistry 25:266-275.

Dayhoff, M.; Eck, R.; and Park, C. 1972. A model of evolutionary change in proteins. In Dayhoff, M., editor 1972, Atlas of Protein Sequence and Structure, volume 5. Silver Springs, MD: National Biomedical Research Foundation.

Dayhoff, M. 1972. Atlas of Protein Sequence and Structure, volume 5. Silver Springs, MD: National Biomedical Research Foundation.

Dietterich, T. and Michalski, R. 1986. Learning to predict sequences. In Michalski, R.; Carbonell, J.; and Mitchell, T., editors 1986, Machine Learning: An Artificial Intelligence Approach, II. San Mateo, CA: Morgan Kaufmann Publishers. 63-106.

Duda, R. and Hart, P. 1973. Pattern Classification and Scene Analysis. New York: Wiley.

Erickson, B. and Sellers, P. 1983. Recognition of patterns in genetic sequences. In Sankoff, D. and Kruskal, J., editors 1983, Time Warps, String Edits, and Macromolecules. Reading, MA: Addison-Wesley. 55-91.

Fu, K. 1982. Syntactic Pattern Recognition and Applications. Englewood Cliffs, NJ: Prentice-Hall.

Gotoh, O. 1982. An improved algorithm for matching biological sequences. Journal of Molecular Biology 162:705-708.

Gribskov, M.; McLachlan, A.; and Eisenberg, D. 1987. Profile analysis: Detection of distantly related proteins. Proceedings of the National Academy of Sciences USA 84:4355-4358.

Kibler, D. and Aha, D. 1987. Learning representative exemplars as concepts: An initial case study. In Proceedings of the Fourth International Workshop on Machine Learning. San Mateo, CA: Morgan Kaufmann Publishers. 24-30.

King, R. and Sternberg, M. 1990. Machine learning approach for the prediction of protein secondary structure. Journal of Molecular Biology 216:441-457.

Matheus, C. 1989. Feature Construction: An Analytic Framework and an Application to Decision Trees. Ph.D. Dissertation, University of Illinois, Department of Computer Science.

McCammon, J. and Harvey, S. 1987. Dynamics of Proteins and Nucleic Acids. New York: Cambridge University Press.


McLachlan, A. 1971. Test for comparing related amino acid sequences. Journal of Molecular Biology 61:409-424.

Michalski, R. 1983. A theory and methodology of inductive learning. Artificial Intelligence 20:111-161.

Mitchell, T. 1980. The Need for Biases in Learning Generalizations. Technical Report CBM-TR-117, Department of Computer Science, Rutgers University.

Needleman, S. and Wunsch, C. 1970. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology 48:443-453.

Neidhart, D.; Kenyon, J.; Gerlt, J.; and Petsko, G. 1990. Mandelate racemase and muconate lactonizing enzyme are mechanistically distinct and structurally homologous. Nature 347:692.

Qian, N. and Sejnowski, T. 1988. Predicting the secondary structure of globular proteins using neural network models. Journal of Molecular Biology 202:865.

Ragavan, H. and Rendell, L. 1993. Lookahead feature construction for learning hard concepts. In Proceedings of the Tenth International Machine Learning Conference. To appear.

Rendell, L. and Seshu, R. 1990. Learning hard concepts through constructive induction: Framework and rationale. Computational Intelligence 6:247-270.

Rendell, L. 1986. A general framework for induction and a study of selective induction. Machine Learning 1:177-226.

Richards, F. 1992. Folded and unfolded proteins: An introduction. In Creighton, T., editor 1992, Protein Folding. New York: Freeman. 1-58.

Schulz, G. and Schirmer, R. 1979. Principles of Protein Structure. New York: Springer-Verlag.

Searls, D. and Liebowitz, S. 1990. Logic grammars as a vehicle for syntactic pattern recognition. In Proceedings of the Workshop on Syntactic and Structural Pattern Recognition. International Association for Pattern Recognition. 402-422.

Subramaniam, S.; Tcheng, D.; Hu, K.; Ragavan, H.; and Rendell, L. 1992. Knowledge engineering for protein structure and motifs: Design of a prototype system. In Proceedings of the Fourth International Conference on Software Engineering and Knowledge Engineering. Washington, DC: IEEE Computer Society. 420-433.

Towell, G.; Shavlik, J.; and Noordewier, M. 1990. Refinement of approximate domain theories by knowledge-based neural networks. In Proc. Eighth Natl. Conf. on Artificial Intelligence. 861-866.

Winston, P. 1984. Artificial Intelligence. Reading, MA: Addison-Wesley.