within the twilight zone: a sensitive profile-profile comparison tool

doi:10.1006/jmbi.2001.5293 available online at http://www.idealibrary.com on J. Mol. Biol. (2002) 315, 1257±1275

Within the Twilight Zone: A Sensitive Profile-ProfileComparison Tool Based on Information Theory

Golan Yona* and Michael Levitt

Department of StructuralBiology, Fairchild Bldg. D-109Stanford UniversityCA 94305, USA

E-mail address of the [email protected]

Present address: G. Yona, DepartScience, Cornell University, Ithaca,

Abbreviations used: HMM, hidde

0022-2836/02/051257±19 $35.00/0

This paper presents a novel approach to pro®le-pro®le comparison. Themethod compares two input pro®les (like those that are generated byPSI-BLAST) and assigns a similarity score to assess their statistical simi-larity. Our pro®le-pro®le comparison tool, which allows for gaps, can beused to detect weak similarities between protein families. It has also beenoptimized to produce alignments that are in very good agreement withstructural alignments. Tests show that the pro®le-pro®le alignments areindeed highly correlated with similarities between secondary structureelements and tertiary structure. Exhaustive evaluations show that ourmethod is signi®cantly more sensitive in detecting distant homologiesthan the popular pro®le-based search programs PSI-BLAST andIMPALA. The relative improvement is the same order of magnitudeas the improvement of PSI-BLAST relative to BLAST. Our new tooloften detects similarities that fall within the twilight zone of sequencesimilarity.

# 2002 Elsevier Science Ltd.

Keywords: pro®le-pro®le comparison; PSI-BLAST; structural alignment;remote homologies
*Corresponding author
Introduction

Traditionally, database searches have been themeans by which new protein sequences are ana-lyzed. The query sequence is compared with eachindividual sequence of the database, one at a time,in what is termed a pairwise comparison. Signi®-cant sequence similarities with database sequencesmay imply a common evolutionary ancestry (hom-ology) between the corresponding proteins. Hom-ologous sequences usually have similar fold andclose or related biological function,1,2 therefore,detecting homology can help to assign a putativefunction to a new protein sequence.

Pairwise sequence comparison algorithms areuseful to detect similarities between sequences thathave not diverged greatly, beyond the twilightzone of sequence similarity (de®ned as 20 %-30%sequence identity3). However, in the coarse of evol-ution, sequences may have changed signi®cantlydue to mutations and insertions, and in manycases proteins may still have the same fold andclose biological function without a signi®cant

ing author:

ment of ComputerNY 14853-7501, USA.n Markov model.

sequence similarity.4 ± 6 Those similarities areusually missed by pairwise sequence comparisons.

A great deal of work has been done to developtools that can detect weak relationships amongsequences. Among these are search programs thatuse statistical representations of protein families,such as pro®les7 and hidden Markov models(HMM)8 and the latest generation of powerful iter-ated search programs such as PSI-BLAST9 andSAM-T98,10 that use signi®cant hits detected in the®rst iteration to create a pro®le or HMM. Themodel is then used to search the database again,repeatedly, until no more new hits are detected.The underlying idea in all these techniques is thatby integrating the information from multiple,related sequences, one can achieve a concise,robust and powerful statistical representation of aprotein family. These tools were able to enhancethe ability to detect relationships between distantlyrelated proteins and became the standard meansby which sequences are analyzed these days.

Nonetheless, even iterative tools such as PSI-BLAST may miss weak sequence similarities.Moreover, these tools are sensitive to parametertuning. For example, using a permissive thresholdfor inclusion in a pro®le may cause the inclusion ofunrelated sequences in the pro®le and lead todiversion from the original query sequence. There-

# 2002 Elsevier Science Ltd.

1258 A Sensitive Pro®le-Pro®le Comparison Tool

fore, users are often forced to use more stringentthresholds at the cost of sensitivity.

There are other related methods that can bequite effective in the twilight zone of sequencesimilarity. These methods are especially tuned todetect similarities with sequences of known struc-tures, and most of them integrate additional infor-mation (e.g. predicted secondary structure,structural information) in the process. Suchmethods were applied successfully in the latestCASP (critical assessment of structure prediction)meeting{. For example, PDB-BLAST11 is a variationon the PSI-BLAST program. PSI-BLAST is used tocollect the protein sequences belonging to a par-ticular family. Using the supplied query sequence,PSI-BLAST runs for ®ve iterations. A pro®le is gen-erated from this collection but in a different waythan done in PSI-BLAST. The sequence pro®le isthen saved and used to scan the database of pro-teins with known structure. SAM-T9912 is a vari-ation on SAM-T98 that builds a multiple alignmentby iterated search using hidden Markov models. Ituses the alignment to predict secondary structure(using various methods) and to build a HMM thatis then used to search the PDB for similar proteins.A library of HMMs built by similar methods fromPDB sequences is used to score the target sequence.INBGU13 is a combination of ®ve methods thatexploit sequence and structure information indifferent ways to produce one consensus predic-tion. It uses predicted versus observed secondarystructure and sequence pro®les for both the queryand for the folds in the library. GenTHREADER14

uses a combination of various methods includingsequence alignment with structure based scoringfunctions. It uses a neural network based jury sys-tem to calculate the ®nal score for the alignment.3D-PSSM15 is based on a threading approach using1D and 3D pro®les coupled with secondary struc-ture and solvation potentials. The current versionof this server uses a fold library, which is automati-cally updated every week.

Although it seems that PSI-BLAST and similarprocedures already exploit the maximum infor-mation that is encoded in the sequence, we believethat this information has not been fully utilizedyet. All the aforementioned methods compare asequence to a model as encoded in a template;their power stems from the ability of such a modelto better discern related from unrelated proteins.Our approach takes this idea one step further andcompares two models. Speci®cally, we use the pro-®le representation (as is generated by PSI-BLAST)as a statistical model of a protein family, and com-pare pro®les of different protein families, in searchof possible remote kinship.

The choice of the model can greatly affect theeffectiveness of the method. For example, a statisti-cal model for the sequences such as a HMM makescertain assumptions on the origin and diversity of

{ http://predictioncenter.unl.gov/casp4/Casp4.html

the sequences that are not always justi®ed.Methods that make minimal assumptions aboutthe nature of sequences are desirable, as suchmethods can be more robust. In our choice of themodel and the statistical similarity measure wehave tried to follow this guideline.

Several procedures to compare pro®les havebeen reported in the literature. Gotoh16 proposedan iterative alignment method to align two groupsof biological sequences, including pro®le-basedoperations. However, his method is based on opti-mizing a weighted sum-of-pairs score, and essen-tially compares pairs of sequences, with overallcomputation time that is proportional to the pro-duct of the numbers of sequences in the twogroups. Pietrokovski17 compared pro®les that weregenerated from multiple alignments of proteinfamilies in the Blocks database,18 but his methoddoes not allow gaps in the alignment. Lyngsoet al.19 used the co-emission probability of two pro-®le hidden Markov models to measure their simi-larity. Despite the mathematical elegance of theirapproach, the metrics that they propose are overlysensitive to the differences between the probabilitydistributions and to the size of the training data. Inother words, their metrics emphasize the differ-ences rather than the similarity of the two models.Therefore, it is not clear whether the method candetect subtle similarities between protein families.The most similar effort to ours was done byRychlewski, Godzik, and colleagues.20,21 Their pro-®le-pro®le comparison procedure is based on adynamic programming algorithm. Their algorithmuses the correlation of probability distributions asa measure of similarity between pro®le columns.Their method (called FFAS) was applied success-fully in the last CASP meeting.

Our algorithm for pro®le-pro®le comparison isbased on the classical dynamic programmingalgorithm ®rst used for protein sequences almost30 years ago21 with the modi®cation to allow forlocal similarities.22 The novel ingredient in our pro-cedure is the de®nition of pro®le similarity scores.Our scores are based on a powerful, informationtheory based, measure of similarity between prob-ability distributions. The similarity measure isbased only on the observed distributions andtherefore provides a model independent criterionfor comparing two statistical sources. The simi-larity score of two columns in two different pro-®les is de®ned as a combination of their statisticalsimilarity and the signi®cance of the statisticalsimilarity. A transformation is then applied tothese scores, so as to make them suitable fordetecting local sequence similarities.

The paper is organized as follow: we ®rstdescribe the methods and the similarity measuresthat we use. Then we evaluate the performance ofthe new tool by testing it on a large set of proteinfamilies. We end with selected examples thatdemonstrate the power of the pro®le-pro®le com-parison method.

A Sensitive Pro®le-Pro®le Comparison Tool 1259

Methods

Definition of a profile

A pro®le is a representation of a group of relatedprotein sequences, usually based on a multiplealignment of those sequences (reviews on algor-ithms for multiple alignment23 ± 25). Once themultiple alignment is de®ned, the pro®le is con-structed by counting the numbers of each aminoacid at each position along the multiple alignment.These counts are transformed into probabilities bynormalizing the counts by the total number ofamino acids and gaps observed at that position.These empirical probabilities re¯ect the likelihoodof observing any amino acid k at position i. Sincethe counts are based on a ®nite set of sequences itcan happen that not all 20 amino acids areobserved at each position. Therefore, pseudocounts are introduced so that no amino acid has azero probability to occur at position i. For moreinformation on pro®le generating techniques, seeGribskov & Veretnik.26

Iterative search procedures such as PSI-BLAST9

can also be used to generate multiple alignmentsand pro®les. PSI-BLAST, arguably the most popu-lar search method today, is an iterative version ofBLAST, with a position-speci®c scoring matrix,which is generated from signi®cant alignmentsfound in round i and used in round i � 1.

Once the probability distributions have beencalculated for each position along the multiplealignment the pro®le is de®ned as a series of prob-ability distributions (one per each position)P � p1p2 � �pn where n is the length of the multiplealignment, and pi is a probability distribution overthe 20 amino acids at position i. One can think ofthe pro®le as a (k,i) matrix of 20 rows and n col-umns. Each probability distribution is one columnin the matrix representation and hence is calledpro®le column.

The data set

There are several publicly available classi®-cations of protein architectures including SCOP,27

CATH28 and FSSP/DALI.29 These classi®cationsprovide excellent sets for testing proteinsequence and structure comparison algorithms.Whereas SCOP is built by the careful manualcuration of Dr Alexei Murzin, both CATH andFSSP are built more or less automatically fromstructural alignments. CATH has a rather simplehierarchy with just four fold classes and a fewtens of architectures in each class. SCOP has amuch more complicated hierarchy with sevenfoldclasses, some containing over a 100-folds. TheFSSP classi®cation is automatic with a hierarchybuilt by the Z-score similarity of proteins ineach branch of the tree. While the CATH andFSSP classi®cations use protein chains as theobject of interest, SCOP breaks proteins intodomains as required, thus eliminating the pro-

blem of placing multi-domain proteins in thehierarchy. Our choice of the SCOP classi®cationwas motivated by the high quality of this data-base, the use of domains instead of completeprotein chains, and our extensive experiencewith this database for sequence and structureclassi®cations.30 ± 32

We use the SCOP 1.50 classi®cation of proteinstructures; This manually curated database con-tains 23,780 protein domains classi®ed into 1287protein families, 814 superfamilies, 545 folds andseven classes. Each of the 1287 families is rep-resented by a pro®le that is generated using PSI-BLAST. We ®rst select as a seed the sequencewhose average distance from all the other mem-bers in the family is the smallest. Then we use PSI-BLAST to run this seed sequence against thesequences in the family, after eliminating identicalentries. Families for which there is only one mem-ber, or for which PSI-BLAST failed to generate apro®le, were represented by a pro®le generateddirectly from the seed sequence by using prob-abilities derived from the original BLOSUM62frequency matrix.33

It is well known that a general PSI-BLAST searchis very sensitive to the program parameters(threshold for inclusion in the pro®le, number ofiterations). An iterative PSI-BLAST search mayoverestimate the statistical signi®cance of matcheswith unrelated sequences, as a result of integratingunrelated sequences into the pro®le. Our PSI-BLAST results are somewhat cleaner than thosefrom a typical PSI-BLAST search. The seedsequence is searched only against the other mem-bers of the family (the ``database''). Since there areno unrelated sequences in the database, there is nodanger that false positives will be included in thepro®le. Therefore, our procedure creates a cleanpro®le that reliably represents the protein familyand is less error-prone than pro®les that are gener-ated by an iterative PSI-BLAST search against alarge sequence database.

For parameter optimization and performanceevaluation we used a subset of 456 families. Thoseare all families within superfamilies that contain atleast two other families (the largest superfamily,the ``Winged helix'' DNA-binding domain, con-tains 21 families). Approximately one-quarter ofthose families (120 families, each with at least sixmembers after eliminating redundancy, and with aseed sequence longer than 50 amino acids) werechosen as the training set for parameter optimiz-ation (see below).

Profile-profile comparison

The pro®le-pro®le comparison is performedusing dynamic programming algorithm, and thealignment is assigned a score that accounts formatches, insertions and deletions, much in thesame way sequence-sequence alignment is calcu-lated. The differences are in the scoring scheme.Unlike sequence-sequence comparison, where a

Figure 1. (a) Distributions of divergence scores and signi®cance scores (a) and of similarity scores (b). All distri-butions are based on the largest 100 families in the SCOP 1.50 database. For each family a pro®le was generated (seeMethods) and the divergence score DJS, the signi®cance score S and the similarity score were calculated for everypair of columns along the pro®le, giving a total of 2986,151 column pairs.


scoring matrix like BLOSUM62 gives the scorefor different pairs of aligned amino acids, pro-®le-pro®le comparison is more complicated. Thecore of our procedure is the de®nition of pro®lesimilarity scores, and the parameters used toquantify this measure of similarity. Our schemehas been described before, but in much lessdetail.34,35

The divergence score

Given two pro®les P � p1p2p3 � � �pn andQ � q1q2q3 � � �qm, where n and m are the lengths ofthe pro®les (the number of positions or columns)and pi, qj are probability distributions over the 20letter alphabet of amino acids, we de®ne the matchscore between two columns pi and qj based ontheir statistical similarity.

A commonly used measure of statistical simi-larity between two arbitrary probability distri-butions pi(x) and qj(x), is the Kullback-Leibler (KL)divergence36 de®ned as:

DKL�pijjqj� �X

k

pik log2

pik

qjk

This measure has the disadvantages of being asym-metric and unbounded. A better measure ofstatistical similarity is the Jensen-Shannon (JS)divergence between probability distributions.37

{ This in itself is not enough to determine commonancestry, the same way that similarity or identity of twoamino acids along a sequence alignment is not enoughto determine kinship. Only the overall similarity scoreof the alignment may indicate such kinship. In that casea pro®le that consists of the common sourcedistributions (along the alignment) can be considered asa faithful representation of the common ancestry.

Given two (empirical) probability distributions pand q, for every 0 4 l 4 1, the l-JS divergence isde®ned as:

DJSl �pjjq� � lDKL�pjjr� � �1ÿ l�DKL�qjjr�

where:

r � lp� �1ÿ l�qcan be considered as the most likely commonsource distribution of both distributions p and q,with l as a prior weight. Without a priori infor-mation, a natural choice is l � 1/2. We call thecorresponding measure the divergence score anddenote it by DJS. This measure is symmetric andranges between 0 and 1, where the divergence foridentical distributions is 0. In Figure 1(a) we plotthe distribution of DJS for actual amino acid distri-butions in pro®les of groups of related proteins.

An attractive feature of the DJS divergencemeasure is that it is proportional to minus logar-ithm of the probability that the two empirical dis-tributions represent samples drawn from the same(``common'') source distribution.38 This aspect ofthe similarity measure makes it appealing in thecontext of protein sequence comparison, since bycomparing pro®les we wish to detect an evolution-ary relationship, i.e. a common ancestry. The com-mon source distribution as de®ned above is thesource distribution most likely to produce the twodistributions that are actually observed for the twopro®les that are being compared. A small value ofDJS indicates that the two pro®le columns are clo-sely related, and may be well approximated by thedistribution of the common source{.

The significance score

While a statistical measure estimating the prob-ability that two distributions represent the same


source distribution seems appropriate for the com-parison of pro®les, a major ingredient is ignored;the a priori probability of the source distribution.Imagine that the two particular empirical distri-butions each resembles the overall distribution ofamino acids in the database (i.e. the distribution ofthe common source is similar to the backgrounddistribution). In this case, should they be con-sidered signi®cantly similar? Obviously not, asthey both match the distribution expected for ran-dom pro®les. Clearly this match is not as signi®-cant as a match of two probability distributionsthat both resemble a unique distribution (i.e. a dis-tribution different from the overall distribution ofamino acids in the database). In other words,the similarity of two random distributions is not assigni®cant as the similarity of two uniquedistributions.

To assess the signi®cance score S of a match wemeasure the JS divergence of the (common) sourcedistribution, r, from the base (background) distri-bution, P0 (de®ned as the overall amino acid distri-bution in a large sequence database, such asSWISSPROT � TrEMBL40):

S � DJS�rjjP0�This measure re¯ects the probability that thesource distribution, r, could have been obtained bychance. The higher it is, the more distinctive is thecommon source distribution, and the lower is theprobability that it could have been obtained bychance. The distribution of signi®cance scores isshown in Figure 1(a).

Combining the divergence score and thesignificance score in a single similarity score

Because our method uses dynamic programmingto compare pro®les, a match between two columnsneeds to be assigned a single score{. This scoreshould ideally re¯ect both the divergence scoreand the signi®cance score. Therefore, we de®ne thesimilarity score of two probability distributions pand q to be:

Score�p;q� � 1

2�1ÿD��1� S�

� 1

2�1ÿDJS�pjjq��1�DJS�rjjP0��

With this expression, the similarity score of twosimilar distributions (D! 0) whose common

{ One may think of a different framework, in whichthese two different scores are treated separately. Forexample, use the divergence scores to assess the overallscore of the alignment, and use the signi®cance scores toassess the overall signi®cance of the alignment.However, in such scenario, an alignment that isoptimized for the divergence score is not necessarilyoptimized for the signi®cance score.

source is far from the background distribution(S! 1), tends to one. On the other hand, the simi-larity score of two dissimilar distributions (D! 1)whose most likely common source distributionresembles the background distribution (S! 0)tends to zero. This scoring scheme also dis-tinguishes two distributions that each are similarto the background distribution (D! 0 and S! 0,giving Score � 1/2) from two dissimilar distri-butions, but whose common source is similar tothe background distribution (D! 1 and S! 0,giving Score � 0). The distribution of similarityscores observed in protein families is plotted inFigure 1(b).

Score transformation - creating a localalignment scoring scheme

The pro®le similarity scores de®ned in the pre-vious section range from zero to one. For localalignments, the similarity function Score(a,b) mustsatisfy two requirements: (i) mean(Score(a,b)) < 0and (ii) max{Score(a,b)} > 0. The ®rst requirementguarantees that the average score of a randommatch is negative (otherwise, an extension of a ran-dom match would tend to increase its score, con-tradicting the idea of local similarity). The secondcondition means that a match with a positive scoreis possible. These criteria are satis®ed by all stan-dard scoring matrices, such as the BLOSUM andthe PAM matrices.

Our pro®le similarity scores must be adjusted tomeet these requirements. A simple transformationwould be to subtract a constant offset from allsimilarity scores. Here we tested this transform-ation (named shift transformation) extensively. Wealso tested a more elaborate transformation(named mass-conserving transformation), whichhas been described elsewhere.35

The shift transformation

The average similarity score calculated for alarge set of pro®le column pairs helps determinethe constant offset for the shift transformation.Based on the largest 100 families in the SCOP data-base, this average is 0.42. Such a shift would give ascoring scheme whose average is zero. However,the optimum average may well be less than zero,as is the case for BLOSUM62 matrix, where theaverage is ÿ0.95. It has been shown that the stat-istical properties of matches between randomsequences are very sensitive to the exact averagescore value.40,41 Matches that are scored using ascoring function with zero average may haveunstable properties so that small ¯uctuations maygreatly affect the selectivity of the method. There-fore, the shift should be higher than 0.42 to insureselectivity, but it should not be so much higher asto reduce sensitivity.

To limit the search space we ®rst calculated fourdifferent distributions of similarity scores for alarge sample of pro®le columns (see Figure 2(a);

Figure 2. (a) Distribution of similarity scores for different column types. The distributions are based on the largest100 families in SCOP 1.50 database. The pairs of pro®le columns are divided into four categories depending on thenature of the seed amino acids. The categories are: (1) identical columns (a column with itself); (2) different columnswith similar or same seed amino acids; (3) different columns with neutral seed amino acids; and (4) different columnswith dissimilar seed amino acids. (b) Distribution of similarity scores for different columns of BLOSUM62. The dis-tributions for the BLOSUM62 matrix are derived by ®rst converting the frequencies to probability values. Each col-umn, which corresponds to the replacement probability of a particular amino acid with all others, is akin to acolumn in a pro®le and different columns can be compared using our similarity score. Four distributions are plottedas for (a). In this case, the ``similar aa'' category does not include the pairs with the same amino acid; these arecounted in the ``identical column'' category.


the four different distributions correspond todifferent column types). The same four distri-butions were calculated from the originalBLOSUM62 frequency matrix33 (Figure 2(b)).

As Figure 2 shows, around 0.5 there is a cleardistinction between distributions of identical col-umns (red line) and distributions with dissimilarseed amino acids (black line). All identical columnsscored at least 0.5, as well as a substantial part ofthe distributions with similar seed amino acid. Thesame is true for similarity scores of distributionsderived from the BLOSUM62 matrix. In addition,distributions with mutually neutral seed aminoacids peak at a similarity score around 0.45. Inlocal alignments one would expect a neutral aminoacid to score close to zero. This indicates that theshift should lie between 0.42 and 0.5.

To derive a sensitive scoring scheme, the shiftparameter should be de®ned more precisely. Inaddition, gap penalties can make the differencebetween a sensitive scoring scheme and a mediocreone, so they also need to be optimized. In Optimiz-ation of Parameters, below, we embarked uponextensive tests to de®ne the best set of parameters,and the best score transformation technique.

Optimization of Parameters

In search of maximal sensitivity

To optimize the parameters we used a test setfrom the SCOP database (see above). This set com-prises of 120 families, each of which has at leasttwo related families within the same SCOP super-family. We refer to family-family relationships

within the same superfamily as true relationshipsand to all other relationships as false relationships.Given a speci®c set of parameters, for each familyin the test set we calculate the pro®le-pro®le simi-larities with all 1287 families. For each family theresults are sorted and we count the number of truefamily-family relationships detected before the ®rstfalse relationship is detected. Finally, the resultsare summed over all families in the test set to givethe total number of true family-family relationshipsdetected.

A total of 6 � 4 � 4 sets of parameters weretested, with shift value between 0.43 and 0.5(values of 0.43, 0.44, 0.45, 0.46, 0.47, 0.5), a gapopening penalty between 1 and 4 (values of 1, 2, 3,4), and gap extension penalty between 0.1 and 0.4(values of 0.1, 0.2, 0.3, 0.4). In keeping with theBLOSUM62 matrix and the default gap parametersused by PSI-BLAST we set the gap opening pen-alty parameter to start at the maximal match value(here 1.0), and the gap extension penalties to beone order of magnitude smaller. The results, forselected sets of parameters, and for the scoringscheme based on the mass conserving transform-ation, are given in Figure 3.

The shift transformation did better than themass-conserving transformation, performing verywell for several sets of parameters. These sets allhave a shift parameter of 0.45, and have gap pen-alty pairs (opening, extending) of (2,0.2), (2,0.3),(2,0.4), (3,0.1), (3,0.2), (3,0.3) and (3,0.4). The bestset of parameters was selected using a second testdescribed next.

Figure 3. Performance for different sets of par-ameters. For each set of parameters, we give the num-ber of true family-family relationships that are detectedbefore the ®rst erroneous relationship (see the text fordetails). Parameter sets with a shift value of 0.5 per-formed worse than the others and were omitted forgreater clarity.


In search of maximal accuracy

With our pro®le-pro®le procedure we wish notonly to detect as many relationships as possible(maximum sensitivity). We would also like to pro-duce reliable alignments with the highest accuracy.To assess the quality of pro®le-pro®le alignmentswe compared them with structural alignments. Thestructural alignments are calculated between thestructures of the family seeds (the same seeds thatwere used to generate the pro®les) using twomethods, Structal30 and CE.41

Clearly, the test set of family-family align-ments should be the same for each set of par-ameters tested. However, alignments are notnecessarily reported at the same level of signi®-cance by different sets of parameters. Indeed,some alignments reported as signi®cant with oneparameter set may not be signi®cant withanother set. To create a consistent set of align-ments that are considered signi®cant by all setstested, one needs a clear de®nition of signi®-cance. We elaborate on this below. Using suchstatistical estimates, we are able to associatewith each raw alignment score an E-value(expectation value), which is the number oftimes such a score would be expected to beobtained by chance. In our ®nal set of test align-ments, we consider only alignments that arereported with E-value 4 1, and take the intersec-tion of the alignments of all parameter sets as acommon subset of 82 alignments.

Several different indices are used to measure thequality of a pro®le-pro®le alignment with respect toa structural alignment (see Figure 4). (a) Naligned isthe overlap index, which is the total number of pos-itions in the query sequence that are aligned by

both methods, excluding gaps. (b) Qshift is the shiftindex, which is the average shift of positions alongthe aligned region. (c)Nagreement is the quality index,which is the number of positions in both alignmentsthat are in agreement (number of pairs that arealigned identically in both alignments). This de®nesthe ``correctly aligned'' positions. (d) Qmodeler is thequality of the alignment from the modeler's point ofview;42 it is the fraction of correctly aligned pos-itions in the pro®le-pro®le alignment (number ofcorrectly aligned positions divided by total numberof aligned positions). If we were to build a 3Dmodel based on pro®le-pro®le alignment, this indexindicates the fraction of the model that would``agree'' with the structure of the target when itbecomes known. (e) Qdeveloper is the quality of thealignment from the developer's point of view;42 it isthe fraction of correctly aligned positions in thestructural alignment. All these indices are shown inFigure 4 for the seven best sets of parameters (seeabove).

Since the average length of the pro®le-pro®lealignment tends to decrease with increasing thegap opening penalty, Qmodeler clearly increases,while Qdeveloper decreases (see opposing trends seenin Figure 4(d) and (e)). A better quality indexshould take into account those parts of the struc-tural alignment that are missed by pro®le-pro®lealignment as well as parts that are added by thepro®le-pro®le alignment. We de®ne Qcombined asthe fraction of correctly aligned positions dividedby the total number of positions that are alignedby either Structal or prof_sim (Figure 4(f)).

Based on these graphs one set of parametersstands out; that with gap opening penalty of 2 andgap extension penalty of 0.2. Similar results wereobtained when we used CE instead of Structal. Itshould be noted that the differences in perform-ance between all seven candidate sets of par-ameters both in terms of accuracy and sensitivityare relatively marginal, and speci®cally ®ve sets ofparameters provide overall good performance:(2,0.2), (2,0.3), (2,0.4), (3,0.1) and (3,0.2). With largertest sets, the parameter pairs (3,0.1) and (2,0.3) per-formed slightly better in terms of overall sensitivity(number of true family-family relationshipsdetected), but produced shorter alignments. Werecommend using the set (2,0.2) because it isalmost as sensitive as the best of these sets, and itproduces relatively longer alignments with overallslightly better accuracy.

Statistical Significance of Profile-Profile Matches

In order to distinguish true similarities from ran-dom matches one need to use a statistical measurethat estimates the probability that a particularmatch could have been obtained by chance.Though statistically signi®cant similarity is neithernecessary nor suf®cient for a biological relation-

Figure 4. Accuracy of pro®le-pro®le alignments. For each candidate set of parameters (see Optimization of Par-ameters) we measured the quality of the pro®le-pro®le alignments with respect to structural alignments that weregenerated using Structal. Several indices of quality, described fully in the text, were used: (a) Naligned; (b) Qshift; (c)Nagreement; (d) Qmodeler; (e) Qdeveloper; (f) Qcombined.


ship, it may give us a good indication of such

relationship.

{ Local ungapped alignments were studied intensivelyand characterized mathematically and were shown tofollow the extreme-value distribution.40,47,48 However,introducing gaps in the alignments greatly complicatestheir mathematical tractability, and rigorous results havebeen obtained only for local alignments without gaps.Nevertheless, recent studies strongly suggest that thescore of local gapped alignments can be characterized inthe same manner as the score of local ungappedalignments.

Empirical studies43,44 have shown that the distri-bution of local gapped similarity scores can be wellapproximated by the extreme value distribution45

(though some correction factors may apply46){. Toassess the signi®cance of pro®le-pro®le matches weestablished two baseline empirical distributions.

The ®rst is based on matches of pro®les of unre-lated families (families that belong to differentSCOP classes and do not share signi®cant structur-al similarity); it can be used to assess the signi®-cance of a match for any two given pro®les,without further computations. This distribution isshown in Figure 5(a). Effectively, this distribution

Figure 5. (a) Distribution of pro®le similarity scores of random pro®les. The distribution is based on a large setof pro®le-pro®le similarity scores of unrelated families. (b) Distribution of pro®le similarity scores for a speci®cSCOP family (SCOP ID 1.3.1.3, the two-domain cytochrome c; the distributions for other families resemble this distri-bution). Both distributions follow an extreme value distribution.


re¯ects the distribution of similarity scores of ran-dom pro®les. An extreme-value distribution was®tted to this distribution, so as to estimate the stat-istical signi®cance (E-value) for any raw similarityscore.

The second distribution is based on matches ofpro®les of a speci®c family; it provides a betterapproach to assess the signi®cance of matches withthe particular pro®le. When calculating the simi-larity score of a given pro®le with all other pro-®les, one gets hundreds of scores from pro®les thatare unrelated to the query pro®le. These pro®lesare effectively random, and the correspondingscores provide a reliable baseline distribution. Wederive for each pro®le the similarity scores with allother pro®les and ®t an extreme-value distributionto the resulting distribution of scores, after exclud-ing high-scoring pro®les (see Figure 5(b)). Basedon the theoretical ®t we estimate the E-value ofraw similarity scores. This approach, which allowsfor any unusual properties of the query pro®le(e.g. unusual amino acid composition) is self-cali-brated, and has proved to be more accurate androbust.6,49,50

Performance Evaluation

To evaluate the sensitivity and selectivity of ouralgorithm in detecting weak relationships betweenprotein families, we have checked it againstrelationships implied by the SCOP classi®cation.Our test set is composed of 456 families that haveat least two related families within the same super-family (see above). This gives a total number of2492 family-family relationships.

The de®nition of a protein superfamily in theSCOP database is a collection of protein familiesthat are considered to be distantly related throughevolution. Proteins that belong to the same super-family have similar structures and usually haveclose or related biological functions,51,52 but

sequence similarity is often not detectable. Movingup the SCOP hierarchy, protein superfamilies thatbelong to the same fold are expected to share sub-structures. In some cases even proteins of differentfolds within the same class may show some struc-tural similarity. Even belonging to different classesdoes not necessarily imply that the proteins cannotbe structurally similar as most SCOP classes sharesecondary structure elements. Out of the sevenclasses in SCOP (all-alpha, all-beta, alpha/beta,alpha and beta, multi-domain proteins (alpha andbeta), membrane and cell surface proteins, andsmall proteins), only two classes (all-alpha and all-beta) are not expected to share any secondarystructure elements and therefore are not expectedto be similar structurally. Because similar second-ary structure will affect the allowed amino acids ina particular region, these preferences may bedetected by pro®le-pro®le comparison.

Two quality measures are used to assess theeffectiveness of parameters and evaluate the per-formance based on this hierarchy. The ®rst, whichis applied to each family individually, is the num-ber of true family-family relationships that aredetected before the ®rst false connection occurs (asabove). The de®nition of a false connection issubtle. One may say that any relationship besidestrue family-family relationship is false. Some mayargue that relationships between families thatbelong to the same fold may as well be true. Wede®ne a relationship between two protein familiesto be a true relationship if both families belong tothe same superfamily, a possible relationship ifboth families belong to the same SCOP class, anerror if one protein is all-alpha and the second isall-beta, and suspicious otherwise. For each familyin the test set we calculated the pro®le-pro®le simi-larities with all other 1286 families. The results aresorted by signi®cance (E-value) and we count thenumber of family-family relationships that aredetected before the ®rst possible, suspicious and

Table 1. Performance evaluation results

Number of true family-family relationships detected by:

Type of first false-connection Gapped-BLAST IMPALA PSI-BLAST prof_sim

Different superfamily, same class 163 168 189 (205) 231(``possible'' relationship)

Different class 174 189 205 (221) 253(``suspicious'' relationship)

Alpha$ Beta 709 810 690 (694) 1586(``error'' relationship)

Methods compared: Gapped-BLAST, PSI-BLAST, IMPALA and pro®le-pro®le similarity (prof_sim). For each method, the numberof true relationships that are detected before the ®rst false connection occurs, is given. Results are given for the following types offalse connections: possible, suspicious and error (see the text for details).


error relationship is reported{. The results,summed for all families in the test set are summar-ized in Table 1.

For comparison we provide the same numbersfor Gapped-Blast, PSI-BLAST and IMPALA.53 ForPSI-BLAST we use the family pro®les to search theset of 1287 seed proteins of all 1287 protein familiesin our benchmark. For each input pro®le, we runPSI-BLAST only for one iteration to score the seedsequences. This is essentially a single iterationBLAST search (the iterative phase of PSI-BLAST isin creating the pro®le, as described above). There-fore, the statistical estimates should be as reliableas in a standard BLAST search. IMPALA is differ-ent, since it compares a sequence with a library ofpro®les; it uses statistical estimates that are betterthan those used by PSI-BLAST as well as a rigor-ous dynamic programming algorithm. Our libraryof pro®les contains all 1287 family pro®les (seeabove). For each family in the test set the querysequence is selected to be the same seed sequencethat was used to generate the pro®le for thisfamily.

When comparing the performance of our meth-od to that of PSI-BLAST we tried to focus on thecomparison method, while keeping all other com-ponents of the evaluation procedure identical forall methods compared. In particular, the input (thepro®le) was ®xed. Iterative PSI-BLAST may intro-duce other, related or unrelated sequences in thepro®le, and the input pro®le is changing from oneiteration to the next one, thus violating the setup ofthe experiment, and creating a bias. Under this set-ting it is harder to compare PSI-BLAST and prof_-sim and draw decisive conclusions from theresults, since the two methods do not use the sameinput models anymore. Nevertheless, we have alsotried iterative PSI-BLAST with up to ten iterations.There is an improvement in PSI-BLAST perform-ance (see third column in Table 1, numbers in par-entheses), however, prof_sim is still more sensitive,even without incorporating the information fromthe sequences that were not part of the original

{ Our tests show that the results are almost identicalif we consider an erroneous match between families asbeing to a different fold rather than to a differentsuperfamily

input pro®le and were integrated into the PSI-BLAST pro®le in subsequent iterations. Usingprof_sim with the updated pro®le that is generatedin the last iteration of PSI-BLAST before conver-gence is expected to improve the performance ofprof_sim as well.

As Table 1 indicates, our program, prof_sim,detects more true similarities in all tests (i.e. for allpossible types of ®rst false-connection). Especiallyinteresting is the case where the ®rst false connec-tion is a connection between an all-alpha proteinand an all-beta protein. The number of true family-family relationships that are detected before the®rst such erroneous connection is reported is morethan twice the number of such relationships thatare detected by PSI-BLAST under the same test.This fact clearly indicates that beyond puresequence similarity, pro®le-pro®le similarityre¯ects the similarity in the secondary structurecontent of protein families. The preferences forspeci®c types of amino acids in every position thatare encoded in the pro®le representation, arestrongly correlated with the characteristic second-ary structure in that position. Different secondarystructures are likely to induce different preferenceswhich differ markedly between secondary struc-tures such as beta strands and alpha helix, thusresulting in a low or negative similarity score forthe corresponding distributions. This informationis not accounted for when comparing a sequencewith a sequence or a pro®le with a sequence.

The E-value that is associated with a similarityscore is usually used to set a threshold abovewhich similarities are suspected to re¯ect truerelationships. Table 2 lists the number of truerelationships that are detected by each of the fourmethods tested, with E-value 40.1 (one randommatch per ten searches, on average). As Table 2suggests, compared to all other methods, moretrue relationships are detected by prof_sim withE-value 40.1, whereas less errors and suspiciousconnections are reported (e.g. 166 true relation-ships and 21 suspicious connections are reportedwith prof_sim as opposed to 146 true relationshipsand 31 suspicious connections with PSI-BLAST).

The second quality measure is the receiver oper-ating characteristic (ROC), which is a common wayof assessing sensitivity and selectivity. Given a


Number of relationships with E-value40.1 detected by:

Relationship-type Gapped-BLAST IMPALA PSI-BLAST prof_sim

Same superfamily 116 115 146 166(true relationship)

Same fold 0 0 3 1(``possible'' relationship)

Same class 18 14 17 14(``possible'' relationship)

Different class 31 20 31 21(``suspicious'' relationship)

Alpha$ Beta 1 1 1 0(``error'' relationship)

Total (with E-value 4 0.1) 166 150 198 202

For each method, the number of true, possible, suspicious and error relationships that are detected with E-value 4 0.1 is reported.


sorted list of hits, the ROC curve plots the numberof true positives that are detected as a function ofthe number of false positives. In Figure 6 we plotthe ROC50 curves for BLAST, IMPALA, PSI-BLAST and prof_sim. To plot this curve for aspeci®c method, we ®rst sort the scores of all pair-wise family-family comparisons by their E-valueand then count the number of true positives thatare detected until 50 errors occur. A true positive isde®ned as a match between families within thesame superfamily; all other types of connectionsare de®ned as false positives. The idea behind thisplot is that in scanning a database search resultsone may be willing to overlook few errors, ifadditional meaningful similarities can be detected.The area under the curve can be used to comparethe overall performance of different methods. Thedistribution of connections and the area under theROC curves are given in Table 3. From thoseresults it is clear that the relative improvement ofprof_sim with respect to PSI-BLAST is as signi®cant


Number of

Relationship-type Gapped-BLAST IMPAL

Same superfamily 116 120(true relationship)

Same fold 0 0(``possible'' relationship)

Same class 18 21(``possible'' relationship)

Different class 31 28(``suspicious'' relationship)

Alpha$ Beta 1 1(``error'' relationship)

Total 166 170E-value 0.1 0.14ROC area 5155 5322 (3.

For each method, we report the number of true, possible, suspicioutions occur (a false connection is everything but a true relationship)the area under the ROC plot, with the relative improvement with reof relationships that are detected with Structal.

as the relative improvement of PSI-BLAST withrespect to BLAST.

Note that the E-value at which the 50th erroroccurs is about 0.1 for BLAST, PSI-BLAST (Table 3),and 0.14 for IMPALA and prof_sim. By de®nition, asearch with a threshold E-value of 0.1 will yieldone erroneous similarity per ten searches, on aver-age. Since we repeat the search 456 times, we mayexpect as many as 46 erroneous similarities (dis-tributed between the 456 searches). Indeed, theresults for BLAST, IMPALA, PSI-BLAST and prof_-sim are fairly consistent with this simple argument.BLAST and PSI-BLAST report total of 50 suppo-sedly false positives with E-value 40.1, a little bitmore than expected, but within a reasonable mar-gin (Table 3). The statistical estimates of IMPALAand prof_sim are more conservative. The 50th erroris detected with E-value of 0.14, while with thisE-value we should have expected 456 � 0.14 � 64false positives. The advantage of this is that itdecreases the chances of detecting chance similaritywith E-value <0.1. The agreement of the statistical

relationships detected by:

A PSI-BLAST prof_sim Structal

146 173 355

2 1 45

17 18 5

30 30 0

1 1 0

196 223 4050.1 0.14 4.93e-07

3 %) 6335 (23 %) 7266 (41 %) 13667 (165 %)

s and error relationships that are detected until 50 false connec-. Also given are the E-value at which the 50th error occurs, andspect to BLAST in parentheses. The last column lists the number

Figure 6. ROC50 curves. A true positive is de®ned asa connection between families within the same super-family. Note that the relative improvement of prof_simwith respect to PSI-BLAST is comparable to the relativeimprovement of PSI-BLAST with respect to BLAST (seealso Table 3).


estimates of IMPALA and prof_sim indicates thatour statistical estimates are reliable.

There is another factor to consider when asses-sing the performance. In the SCOP database,there is a great variation in the number offamilies in a superfamily, and larger superfami-lies may dominate the evaluation results. To ver-ify that our results are not biased toward thelarger superfamilies, we repeated the calculationswith normalized counts. Two normalizationschemes were used. In the ®rst scheme, we nor-malize the counts (number of families detected)by the total number of related families in thesuperfamily. The second scheme, we normalizethe counts by the total number of pairwiserelationships in the superfamily. With the ®rstnormalization each superfamily with n familiescan add n at the most to the total count. Withthe second normalization, each superfamily canadd one at the most to the total count. Withoutany normalization each superfamily can add upto n(n ÿ 1) to the count. All schemes gave thesame qualitative results: the relative improve-ment of prof_sim over PSI-BLAST is at least aslarge as the relative improvement of PSI-BLASTover BLAST.

It did not escape our attention that a few sup-posedly false positives are ranked by prof_simhigher than by PSI-BLAST. Consequently, whenaveraged over all families, PSI-BLAST (Figure 6)detects more true similarities than prof_sim up toeight false positives, but then prof_sim takes thelead and performs much better. We took a closerlook at the ®rst ten false positives. These simi-larities are intriguing. Four of these are betweenfamilies that are classi®ed to the same class andthree of these are supported by signi®cant struc-tural similarities (when using the pro®le-pro®le

alignment). Another four agree on about 50 % oftheir secondary structure elements at the level ofthe amino acid. Therefore we strongly believethat some of these similarities actually re¯ecttrue similarities (for a more detailed analysis, seebelow).

The last column in Table 3 lists the number offamily-family relationships that are detected withStructal using the same criteria. These numbersprovide an upper bound on the maximal sensi-tivity that we should expect with sequence-basedmethods. Note that most allegedly false structuralsimilarities are between families with the samefold, with a few from the same class. This is notsurprising as structural similarities are commonbetween families that adopt the same fold, andeven between families from the same class. Thesesimilarities are rarely due to a common evolution-ary origin, yet they may imply related biologicalfunctions. By relaxing the de®nition of a false posi-tive, many more signi®cant structural similaritiesare detected between families of the same super-family, fold and class. For example, when a truepositive is de®ned as a relationship betweenfamilies from either the same superfamily or thesame fold, 613 relationships are detected from the®rst type and 215 relationships are detected fromthe second type before the 50th error occurs (mostof those errors are between families within thesame class). The improvement in performance forthe sequence-based methods is not as signi®cant.

Alignment accuracy

Our goal in developing the pro®le-pro®le com-parison algorithm was to devise a sensitive as wellas an accurate algorithm, where accuracy ismeasured in terms of alignment accuracy. Optimiz-ing the accuracy of sequence alignment withrespect to structural alignments is especiallyimportant since one can expect that such optimizedalignments will provide good initial seeds forreliable 3D models, and they can help direct site-speci®c experiments.

We compared the accuracy of the prof_sim align-ments with the accuracy of BLAST, IMPALA andPSI-BLAST, using the same methodology that isdescribed above. The results are shown in Figure 7.Note that prof_sim performed better in almost alltests, producing more accurate alignments, withsmaller shift value, larger coverage and a higherpercent of correctly aligned residues.

Detecting Interesting Similaritiesbetween Protein Families

We were curious to see what kind of similaritiesare missed by PSI-BLAST and detected by prof_simand vice versa. Of all true family-family relation-ships that are detected by PSI-BLAST with E-value 4 1, 26 are missed by prof_sim (i.e. reportedwith E-value > 1). Of these, ten are reported by

Figure 7. Accuracy of sequence-based alignments. For each sequence-based method (BLAST, IMPALA, PSI-BLAST and prof_sim) we measured the quality of the alignments with respect to structural alignments that were gen-erated using Structal. Several indices of quality were used: (a) Naligned; (b) Qshift; (c) Nagreement; (d) Qmodeler; (e) Qdevelo-

per; (f) Qcombined. See Optimization of Parameters for details.


PSI-BLAST with E-value 4 0.1 (the thresholdde®ned by the ROC50 curve; see Table 3). Themost signi®cant hit of those is between SCOPfamilies 1.110.1.2 and 1.110.1.1 (ARM repeat super-family). PSI-BLAST reports this similarity with E-value of 2e-05 (see Figure 8). Our method, prof_sim,reports a slightly longer similarity with E-value of5.8. The percent identity in this alignment (12 %) ismuch lower than the percent identity in the PSI-BLAST alignment (24 %). Strikingly, this alignmentcorresponds to a better match in terms of structure:aligning both structures based on the PSI-BLAST

and the prof_sim alignments, gives RMS values of5.7 AÊ and 4.1 AÊ , respectively (Figure 8). This indi-cates that prof_sim produces alignments that aredriven by structural similarities more than othersequence-based methods. The second most signi®-cant miss is between SCOP families 1.97.4.3 and1.97.4.2 (terpenoid cylases/protein prenyltrans-ferases superfamily). It is reported by PSI-BLASTwith E-value of 9e-04 and corresponds to structuralsimilarity with RMS value of 7.0 AÊ . Again, prof_simreports a different (tough shorter) alignment, withE-value 7.1, but with RMS value of 4.1 AÊ . The

Figure 8. PSI-BLAST and prof_sim alignment of SCOP families 1.110.1.2 and 1.110.1.1. The SCOP identi®ers ofthe seeds are d1b3ua and d1qgra, respectively. The prof_sim alignment is more accurate in terms of structure (lowerRMS).


other similarities are reported by PSI-BLAST withE-values between 0.001 and 0.1. Our programreports similarities between the other eight familypairs with E-values that range between 1.3 and10.8. In three cases prof_sim reports a fragment ofthe same alignment reported by PSI-BLAST, and inthe other ®ve cases prof_sim reports an alignmentwhich differs from the PSI-BLAST alignment(though usually shorter). However, the improve-ment in the RMS value is signi®cant. In ®ve out ofthe eight pairs, the RMS values of the PSI-BLASTalignments ranges between 6.7 and 13.2 while theprof_sim alignments have RMS values between 2.06and 6.4 (improvement ranges between 1.5 AÊ and10.5 AÊ ). In one case both alignments have highRMS values (though prof_sim alignment improvesthe RMS from 13.2 to 12.45), and in two othercases both have low and almost the same RMSvalues (within 0.22 AÊ ), in one case the PSI-BLASTalignment is longer, and in the other case theprof_sim alignment is longer.

There are 95 true family-family relationships forwhich prof_sim reports an alignment withE-value 4 1 while PSI-BLAST miss them. Fortyalignments are reported with E-value 4 0.14 (thethreshold as de®ned by the ROC50). The most sig-

ni®cant hit has an E-value of 2.7e ÿ 04. Of these,18 are reported by PSI-BLAST with E-valuebetween 1 and 10, and in three more cases PSI-BLAST reports shorter similarities with E-values42, 49 and 259. All the rest, 19 pairs, are notdetected withE-value 4 500. Five examples are shown inFigure 9; they correspond to alignments with RMSvalues between 1.7 AÊ and 5.9 AÊ .

In all other cases both methods reported a simi-larity usually with a different signi®cance value.Those differences are expected due to the differentstatistical estimates or different alignments. Inmost cases the same or almost the same alignmentsare detected with prof_sim but because of the moreconservative statistical estimates of prof_sim theyare sometimes reported as less signi®cant. In othercases prof_sim reports a different alignment.

An interesting case of homology that is noteasily detected by sequence analysis methods isthe actin/kinase/hsp70 homology. These threefamilies share the same fold and are considered ashomologous families.54 In SCOP, the actins andheat shock proteins are classi®ed into the samefamily (designated 3.50.1.1). The mammalian type I

Figure 9. Signi®cant similarities that are detected by prof_sim but missed by PSI-BLAST. The SCOP identi®ers ofthe seeds: d1aoic (1.23.1.1); d1a7w (1.23.1.2); d1aj5a (1.41.1.7); d4icb (1.41.1.1); d1dcfa (3.16.2.2); d3chy (3.16.2.1);d1cqxa3 (3.18.1.4); d1fnc_2 (3.18.1.1); d1maac (3.64.1.1); and d1jkmb (3.64.1.2).


hexokinase is classi®ed into family 3.50.1.2 and theglycerol kinase is classi®ed to 3.50.1.3.

None of these similarities are detected byBLAST. PSI-BLAST detects the similarity between3.50.1.1 and 3.50.1.3 with E-value of 4.9 (thirdmatch after two false positives). prof_sim detectsthis similarity with E-value of 0.31. This is the ®rstmatch and there are no false positives. Moreover,the prof_sim alignment is longer than the PSI-BLAST alignment by 33 amino acid pairs.

When the seed sequence of family 3.50.1.3 isused as a query, PSI-BLAST misses the similaritywith 3.50.1.1 (up to E-value of 500), while prof_simdetects a similarity with E-value of 1.0 (it is thesecond match after one false positive). Interest-ingly, the false positive (d2dik_3, family 4.121.1.5)is classi®ed as a member of another class (class 4of a � b proteins with segregated alpha and betaregions, while family 3.50.1.3 consists of a/b pro-teins), but both share similar secondary structure


elements, and to some extent, similar arrangementsof these elements. None of the methods detect thesimilarity with family 3.50.1.2 as signi®cant.Further tuning and integration of additional infor-mation (see Discussion) is probably needed todetect this similarity.

Analysis of errors

The SCOP classi®cation is an excellent resource,which is based on extensive expert knowledge.Nonetheless, one should keep in mind that thede®nitions of the family/superfamily/fold levelsin the SCOP hierarchy are based in large part onobservations made by human experts, rather thanon a quantitative measure. One may argue thatthis hierarchy is biased by human perceptions, anddoes not truly re¯ect nature hierarchy. In otherwords, the de®nitions of domains, folds andclasses do not necessarily conform to ``Nature'sde®nitions''. Therefore, any assessment usingSCOP may be biased due to errors in that referenceclassi®cation.

To see if some interesting similarities might havebeen missed by SCOP experts, we checked themost signi®cant similarities between families fromdifferent folds. Most of the suspicious connectionsare due to similarity along a relatively smallsubstructure (few common secondary structureelements). Such similarities are often observedbetween proteins from different SCOP classes andare not necessarily false similarities. Such simi-larities are usually too short to maintain the sametopology, and within the context of the whole pro-tein structure they may get different interpret-ations.

Out of the ®rst 50 errors (see Table 3), nine arerelatively short fragments (average length of 30)with very high agreement in secondary structurecontent along the aligned residues (at least 70 %per alignment, and 83 % on average). Additional11 alignments (with average length of 88) agree onmore than 50 % of the aligned residue in terms ofthe secondary structure content. Therefore, webelieve that at least some of the supposedly falsepositives are indeed true relationships.

Discussion

With the rapid increase in the number ofsequences that cannot be characterized by existingtools for sequence analysis, there is a growing needfor more powerful tools that can detect weak butsigni®cant sequence similarities. The latest gener-ation of iterated search methods, such as PSI-BLAST improved sensitivity of sequence analysissigni®cantly.55 Yet, the vast majority of remoterelationships cannot be detected by these methods.

Here, we have presented a new tool for pro®le-pro®le comparison that can be used to detect suchweak relationships between protein families. Whatdistinguishes sequence-sequence comparison frompro®le-pro®le comparison is the scoring scheme

used to assess the similarity/dissimilarity of two``atomic'' objects in the alignment (pairs of aminoacids in sequence-sequence alignment and pairs ofprobability distributions in pro®le-pro®le compari-son). The power of the pro®le-pro®le comparisonlies in the de®nition of those similarity scores. Theinformation that is coded in the multiple alignmentthat was used to generate the pro®le can decipherambiguities or insigni®cant similarities that are fre-quently observed in sequence-sequence compari-son, and a sensitive scoring scheme that is wiselydesigned can help detect those subtle similarities.Our scoring scheme was designed to obtain maxi-mal sensitivity by using an information theorybased measure of similarity between probabilitydistributions. The signi®cance of the pro®le-pro®lesimilarity scores is assessed using the same statisti-cal framework as for sequence analysis, andextreme value distributions are ®tted to the empiri-cal distributions.

A collection of protein families from the SCOPdatabase is used as a benchmark. This benchmarkprovides extensive test of the selectivity, sensitivityand accuracy of the new method we propose. Ourtests show that this tool outperforms pairwise com-parison algorithms such as BLAST, as well as themore powerful iterated PSI-BLAST (surprisingly,IMPALA did not perform as well as PSI-BLAST).The relative improvement of prof_sim over to PSI-BLAST is the same order of magnitude or even lar-ger than the relative improvement of PSI-BLASTover BLAST. Therefore, we believe that the pro®le-pro®le measure of similarity can be used to detectweak relationships between protein families thathave diverged much. Our method for parameterselection is based on a two-phase optimization pro-cedure. This procedure is especially relevant wherethere are several sets of parameters that performfairly well using one criterion (such as classi®cationquality), but some may perform poorly when con-sidering some other criterion (such as alignmentaccuracy). As our method was also optimized toproduce alignments in a very good agreement withstructural alignments, it had higher accuracy thanall other methods, and can be expected to producebetter three-dimensional homology models.

In this work we compare our method to PSI-BLAST. Of all publicly available software, PSI-BLAST is an advanced method and is probably themost popular program used today to compare pro-tein sequences. There are other methods that areused via publicly accessible web-servers. The mostnotable of these include: PDB-BLAST and FFASfrom Godzik's group;11 SAM-T99 from Karplus'group;12 INBGU from Fischer's group;13 Gen-THREADER Jones';14 and 3D-PSSM from Stern-berg's group.15

While these servers are ideally suited for use inblind prediction schemes like critical assessment ofstructure prediction (CASP), they cannot be easilycompared with the present method as there is noway to know exactly what known relationships arebuilt into the server. Therefore, it is hard to predict


how well our method would perform in CASP.Moreover, in the CASP meeting, methods wereranked based on the quality of the model theygenerated, and many methods used optimizationprocedures to optimize the position of side-chainsand gaps, or were manually calibrated.

Unfortunately, our method was not available intime to be tested at CASP4, which closed on Sep-tember 2001. We recognize that fact that our use ofPSI-BLAST and pro®le comparison is in fact ratherclose to the service provided by the FFAS server.11

This server performed well at CASP and webelieve that our method performs as well{. Weintend to convert our prof_sim method to a web-based server so that it can be tested both at CASP5to be held in 2002 as well as by the automatic ser-ver testing schemes like LiveBench.13

We have integrated our method in a large-scaleeffort to map the protein space and create hierarch-ical organization of protein families and superfami-lies (the BioSpace system35). In this study a multi-stage analysis was carried that ®rst identi®edstructure-based clusters (clusters with structuralrepresentatives) and sequence-based clusters (clus-ters with no structural representative). Clusters arecompared using either a structure metric (when 3Dstructures are known) or our pro®le-pro®le metric.These scores are used to de®ne a uni®ed and con-sistent metric between all clusters, and clusters areorganized in a meta-organization of super-clusters,using the uni®ed metric. Many of the super-clus-ters contain both structure-based clusters andsequence-based clusters, due to cluster similaritiesthat were detected only by the pro®le-pro®lemetric. Based on this meta-organization we caninfer plausible conformations for families, whichwere not structurally characterized, and are clus-tered into the same super-cluster as a structure-based cluster. In some cases we can also providehints about the possible functionality of the family.We, therefore, believe that this tool can extendstructure and function prediction beyond what ispossible with current means of sequence analysis.

The disparity in performance between sequence-based methods and the structure-based method (asin Table 3) challenges for future developers ofsequence-based methods. Further improvements of

{ We could not implement their method, since wecould not reproduce the parameters used by Rychlewskiet al.11 However, preliminary results suggest that thecorrelation scores used in FFAS to compare probabilitydistributions are less sensitive than our measures, whichare based on information theory principles. Speci®cally,we calculated the distributions of correlation scores ofpro®le columns for different column types (the sameway as in Figure 2). According to these distributions(not shown) the correlation scores are less successful indistinguishing related columns from columns which arelikely to be unrelated. In general the distributions highlyoverlap, and the tail of the second distribution falls wellwithin the ®rst distribution. We believe that this mayaffect the performance signi®cantly.

the pro®le-pro®le comparison method should con-sider the topology of the protein, if its structure isknown, or the propensity of amino acids to be partof loops, alpha helices or beta strands. The propen-sities of amino acids to form loops would beespecially useful as it can be used to assign vari-able gap penalties. These enhancements will beintegrated in future versions of our method (Yona& Yeh, work in progress).

Acknowledgments

We thank Patrice Koehl for providing us with his pro-grams that generate structural alignments based onsequence alignments. G.Y. is supported by a Burroughs-Welcome Fellowship from the Program in Mathematicsand Molecular Biology (PMMB). This work was sup-ported by DOE award DE-FG03-95ER62135 (to M.L.).

References

1. Sander, C. & Schneider, R. (1991). Database ofhomology-derived protein structures and the struc-tural meaning of sequence alignment. Proteins:Struct. Funct. Genet. 9, 56-68.

2. Hilbert, M., Bohm, G. & Jaenicke, R. (1993).Structural relationships of homologous proteins as afundamental principle in homology modeling.Proteins: Struct. Funct. Genet. 17, 138-151.

3. Doolittle, R. F. (1992). Reconstructing history withamino acid sequences. Protein Sci. 1, 191-200.

4. Murzin, A. G. (1993). OB(oligonucleotide/oligosac-charide binding)-fold: common structural and func-tional solution for non-homologous sequences.EMBO J. 12, 861-867.

5. Pearson, W. R. (1997). Identifying distantly relatedprotein sequences. Comp. App. Biosci. 13, 325-332.

6. Brenner, S. E., Chothia, C. & Hubbard, T. J. P.(1998). Assessing sequence comparison methodswith reliable structurally identi®ed distant evol-utionary relationships. Proc. Natl Acad. Sci. USA, 95,6073-6078.

7. Gribskov, M., Mclachlen, A. D. & Eisenberg, D.(1987). Pro®le analysis: detection of distantly relatedproteins. Proc. Natl Acad. Sci. USA, 84, 4355-4358.

8. Krogh, A., Brown, M., Mian, I. S., SjoÈ lander, K. &Haussler, D. (1996). Hidden Markov models in com-putational biology: application to protein modeling.J. Mol. Biol. 235, 1501-1531.

9. Altschul, S. F., Madden, T. L., Schaffer, A. A.,Zhang, J., Zhang, Z., Miller, W. & Lipman, D. J.(1997). Gapped BLAST and PSI-BLAST: a new gen-eration of protein database search programs. Nucl.Acids Res. 25, 3389-3402.

10. Karplus, K., Barrett, C. & Hughey, R. (1998). HiddenMarkov models for detecting remote protein hom-ologies. Bioinformatics, 14, 846-856.

11. Rychlewski, L., Jaroszewski, L., Li, W. & Godzik, A.(2000). Comparison of sequence pro®les. Strategiesfor structural predictions using sequence infor-mation. Protein Sci. 9, 232-241.

12. Karplus, K., Barrett, C., Cline, M., Diekhans, M.,Grate, L. & Hughey, R. (1999). Predicting proteinstructure using only sequence information. Proteins:Struct. Funct. Genet. 37, 121-125.


13. Bujnicki, J. M., Elofsson, A., Fischer, D. &Rychlewski, L. (2001). LiveBench-1: continuousbenchmarking of protein structure predictionservers. Protein Sci. 10, 352-361.

14. Jones, D. T. (1999). GenTHREADER: an ef®cient andreliable protein fold recognition method for genomicsequences. J. Mol. Biol. 287, 797-815.

15. Kelley, L. A., MacCallum, R. M. & Sternberg, M. J.(2000). Enhanced genome annotation using struc-tural pro®les in the program 3D-PSSM. J. Mol. Biol.299, 499-520.

16. Gotoh, O. (1993). Optimal alignment betweengroups of sequences and its application to multiplesequence alignment. Comput. Appl. Biosci. 9, 361-370.

17. Pietrokovski, S. (1996). Searching databases ofconserved sequence regions by aligning proteinmultiple-alignments. Nucl. Acids Res. 24, 3836-3845.

18. Henikoff, J. G., Henikoff, S. & Pietrokovski, S.(1999). New features of the Blocks Database servers.Nucl. Acids Res. 27, 226-228.

19. Lyngso, R. B., Pedersen, C. N. S. & Nielsen, H.(1999). Metrics and similarity measures for hiddenMarkov models. In The Proceedings of ISMB 1999,pp. 178-186, AAAI Press, Menlo Park.

20. Rychlewski, L., Zhang, B. & Godzik, A. (1998). Foldand function predictions for Mycoplasma genitaliumproteins. Fold. Des. 3, 229-238.

21. Needleman, S. B. & Wunsch, C. D. (1970). A generalmethod applicable to the search for similarities inthe amino acid sequence of two proteins. J. Mol.Biol. 48, 443-453.

22. Smith, T. F. & Waterman, M. S. (1981). Comparisonof biosequences. Adv. App. Math. 2, 482-489.

23. Waterman, M. S. (1995). Introduction to ComputationalBiology, Chapman & Hall, London.

24. Setubal, J. C. & Meidanis, J. (1996). Introduction toComputational Molecular Biology, PWS Publishing Co.,Boston.

25. Taylor, W. R. (1996). Multiple protein sequencealignment: algorithms and gap insertion. MethodsEnzymol. 266, 343-367.

26. Gribskov, M. & Veretnik, S. (1996). Identi®cation ofsequence patterns with pro®le analysis. MethodsEnzymol. 266, 198-211.

27. Murzin, A. G., Brenner, S. E., Hubbard, T. &Chothia, C. (1995). SCOP: a structural classi®cationof proteins database for the investigation ofsequences and structures. J. Mol. Biol. 247, 536-540.

28. Orengo, C. A., Michie, A. D., Jones, S., Jones, D. T.,Swindells, M. B. & Thornton, J. M. (1997). CATH - ahierarchic classi®cation of protein domain structures.Structure, 5, 1093-1108.

29. Holm, L. & Sander, C. (1997). Dali/FSSP classi®-cation of three-dimensional protein folds. Nucl. AcidsRes. 25, 231-234.

30. Levitt, M. & Gerstein, M. (1998). A uni®ed statisti-cal framework for sequence comparison and struc-ture comparison. Proc. Natl Acad. Sci. USA, 95,5913-5920.

31. Gerstein, M. & Levitt, M. (1998). Comprehensiveassessment of automatic structural alignment againsta manual standard, the SCOP classi®cation of pro-teins. Protein Sci. 7, 445-456.

32. Brenner, S. E., Koehl, P. & Levitt, M. (2000). Theastral compendium for protein structure andsequence analysis. Nucl. Acids Res. 28, 255-256.

33. Henikoff, S. & Henikoff, J. G. (1992). Amino acidsubstitution matrices from protein blocks. Proc. NatlAcad. Sci. USA, 89, 10915-10919.

34. Yona, G. & Levitt, M. (2000). A uni®ed sequence-structure classi®cation of proteins: combiningsequence and structure in a map of protein space. InThe proceedings of RECOMB 2000, pp. 308-317, ACMPress, New York.

35. Yona, G. & Levitt, M. (2000). Towards a completemap of the protein space based on a uni®edsequence and structure analysis of all known pro-teins. In The Proceedings of ISMB 2000, pp. 395-406,AAAI Press, Menlo Park.

36. Kullback, S. (1959). Information Theory and Statistics,John Wiley and Sons, New York.

37. Lin, J. (1991). Divergence measures based on theShannon entropy. IEEE Trans. Info. Theory, 37, 145-151.

38. El-Yaniv, R., Fine, S. & Tishby, N. (1997). Agnosticclassi®cation of markovian sequences. Advan. NeuralInformat. Proc. Syst. 10, 465-471.

39. Bairoch, A. & Apweiler, R. (1999). The SWISS-PROTprotein sequence data bank and its supplementTrEMBL in 1999. Nucl. Acids Res. 27, 49-54.

40. Karlin, S. & Altschul, S. F. (1990). Methods for asses-sing the statistical signi®cance of molecular sequencefeatures by using general scoring schemes. Proc. NatlAcad. Sci. USA, 87, 2264-2268.

41. Shindyalov, I. N. & Bourne, P. E. (1998). Proteinstructure alignment by incremental combinatorialextension (CE) of the optimal path. Protein Eng. 11,739-747.

42. Sauder, J. M., Arthur, J. W. & Dunbrack, R. L., Jr.(2000). Large-scale comparison of protein sequencealignment algorithms with structure alignments.Proteins: Struct. Funct. Genet. 40, 6-22.

43. Smith, T. F., Waterman, M. S. & Burks, C. (1985).The statistical distribution of nucleic acid simi-larities. Nucl. Acids Res. 13, 645-656.

44. Waterman, M. S. & Vingron, M. (1994). Rapid andaccurate estimates of statistical signi®cance forsequence data base searches. Proc. Natl Acad. Sci.USA, 91, 4625-4628.

45. Gumbel, E. J. (1958). Statistics of Extremes, ColumbiaUniversity Press, New York.

46. Altschul, S. F. & Gish, W. (1996). Local alignmentstatistics. Methods Enzymol. 266, 460-480.

47. Dembo, A. & Karlin, S. (1991). Strong limit theoremsof empirical functionals for large exceedances ofpartial sums of i.i.d variables. Ann. Prob. 19, 1737-1755.

48. Dembo, A., Karlin, S. & Zeitouni, O. (1994). Limitdistribution of maximal non-aligned two-sequencesegmental score. Ann. Prob. 22, 2022-2039.

49. Pearson, W. R. (1998). Empirical statistical estimatesfor sequence similarity searches. J. Mol. Biol. 276,71-84.

50. Yona, G., Linial, N. & Linial, M. (1999). ProtoMap:automatic classi®cation of protein sequences, ahierarchy of protein families, and local maps of theprotein space. Proteins: Struct. Funct. Genet. 37,360-378.

51. Martin, A. C., Orengo, C. A., Hutchinson, E. G.,Jones, S., Karmirantzou, M., Laskowski, R. A. et al.(1998). Protein folds and functions. Structure, 6, 875-884.

52. Hegyi, H. & Gerstein, M. (1999). The relationshipbetween protein structure and function: a compre-hensive survey with application to the yeast gen-ome. J. Mol. Biol. 288, 147-164.

53. Schaffer, A. A., Wolf, Y. I., Ponting, C. P., Koonin,E. V., Aravind, L. & Altschul, S. F. (1999). IMPALA:


matching a protein sequence against a collection ofPSI-BLAST-constructed position-speci®c scorematrices. Bioinformatics, 15, 1000-1011.

54. Holmes, K. C., Sander, C. & Valencia, A. (1993). Anew ATP-binding fold in actin, hexokinase andHsc70. Trends Cell Biol. 3, 53-59.

55. Park, J., Karplus, K., Barrett, C., Hughey, R.,

Haussler, D., Hubbard, T. & Chothia, C. (1998).

Sequence comparisons using multiple sequences

detect three times as many remote homologues as

pairwise methods. J. Mol. Biol. 284, 1201-1210.

Edited by B. Honig

(Received 8 November 2000; received in revised form 18 May 2001; accepted 21 November 2001)

within the twilight zone: a sensitive profile-profile comparison tool

Documents