protein clustering to assemble families of homeomorphic proteins chris elsik [email protected]

Protein Clustering to Assemble Families of Homeomorphic Proteins

Chris Elsik

[email protected]

Outline

• Using evolution to infer protein function

• The problem of automatic assembling of homeomorphic (identical domain organization) protein families

• Agglomerative clustering

• Delineating protein domain boundaries

Evolution Allows us to Infer Function• The most powerful method for inferring function of a gene or

protein is by similarity searching a sequence database.

• Our ability to characterize biological properties of a protein using sequence data alone stems from properties conserved through evolutionary time.

• Homologous (evolutionarily related) proteins always share a common 3-dimensional folding structure.

• They often contain common active sites or binding domains.

• They frequently share common functions.

• Predictions made using similar, but non-homologous proteins are much less reliable.

Orthologs• Homologs = genes that are evolutionarily related

• There are two kinds of homologs:

• Orthologs = genes in different species that have diverged from a common gene in an ancestral species.

• Paralogs = genes that have diverged due to gene duplication.

• Orthologs are more likely than paralogs to have conserved function.

• Orthologs cannot be identified using BLAST or FASTA sequence comparison alone.

• Reliable ortholog identification requires phylogenetic methods.

Rice-2b

Rice-2a

Maize-2

Wheat-2

Sorghum-2

Barley-1

Wheat-1

Maize-1

Sorghum-1

Arabidopsis

Example Gene Tree (with plant genes)

The outgroup, Arabidopsis is a dicot. The cereals are monocots. Dicots and monocots diverged ~230 million years ago. These monocots diverged from each other ~60 mya.

orthologs

paralogs

paralogs

Why shouldn’t we depend on inferences based on paralogs?

• Paralogs emerge after a gene duplication.

• Possible fates of duplicated genes:

– Loss of function for one of the duplicates - lack of selective pressure allows gene to mutate beyond recognition

– Emergence of new functional paralogs - one duplicate aquires a new function, so selection favors its maintenance in the genome

– Sub-functionalization - both duplicates are required to maintain the function of the original

How do people identify “orthologs”?• “Best Hit”

– Problems - 1) What if the ortholog is not present in the database - there is always a best hit. 2) For multidomain proteins, the best hit may be a local domain hit.

• “Reciprocal Best Hit”– Problem - this is usually based on E-value, but E-value is affected by the

length of the match.– Predicted genes are often gene fragments (not the true length of the

gene)– This method is a phenetic, not phylogenetic, approach. It does not take

advantage of a model of protein evolution.• Synteny - infer orthology by comparing locations on chromosomes

– Can be applied to closely related species, such as mammals, but not distantly related species

– difficult to distinguish tandemly duplicated genes• Phylogenetics - build gene trees to distinguish orthologs and paralogs

1>>>CG2830-PA - 648 aa vs fly_test.lseg library

The best scores are: length bits E(18717) %_id alenCG7235-PC ( 576) 474.6 9.4e-134 0.590 581CG7235-PA ( 576) 474.6 9.4e-134 0.590 581CG7235-PB ( 576) 474.6 9.4e-134 0.590 581 CG12101-PB ( 573) 474.4 1.1e-133 0.620 597CG12101-PA ( 573) 474.4 1.1e-133 0.620 597CG2830-PA-short ( 240) 339.2 2.2e-93 1.000 240CG16954-PB ( 558) 298.3 1.1e-80 0.450 536 CG16954-PA ( 558) 298.3 1.1e-80 0.450 536

Reciprocal Best Hit & incorrect gene lengthA gene fragment was simulated by replacing a protein in the library dataset with a truncated sequence. The library was searched with the full-length protein. The best match is a paralog. The identical, but truncated, match has a less significant E-value. We would not be able to identify the ortholog using the reciprocal best hit method.

Identifying Orthologs Using Phylogenetics

• Challenge - we need to create gene families prior to creating phylogenetic gene trees

• One approach is automated sequence clustering

Reasons for Clustering Protein Sequences

• Classifying proteins - structural and functional annotation

• Identifying errors in gene prediction

• Selecting representative proteins

• Creating models of protein domain families to aid identification of distantly related proteins

• Studying protein family evolution

Protein Clustering - General Method

• Perform a protein similarity search of a protein database against itself using FASTA or BLAST

• Use E-value or score as a similarity measure for automated clustering

– E-value is not the optimal linkage criterion due to the length effect

Problems in Clustering Proteins

• Multidomain proteins can cause unrelated proteins to group together– This problem is amplified in large proteomes

• Cluster validation is difficult -– We do not know the correct number of clusters– We do not have a good test set for large

proteomes

Src Tyrosine Kinase

MGSNKSKPKDASQRRRSLEPAENVHGAGGGAFPASQTPSKPASADGHRGPSAAFAPAAAEPKLFGGFNSSDTVTSPQRAGPLAGGVTTFVALYDYESRTETDLSFKKGERLQIVNNTEGDWWLAHSLSTGQTGYIPSNYVAPSDSIQAEEWYFGKITRRESERLLLNAENPRGTFLVRESETTKGAYCLSVSDFDNAKGLNVKHYKIRKLDSGGFYITSRTQFNSLQQLVAYYSKHADGLCHRLTTVCPTSKPQTQGLAKDAWEIPRESLRLEVKLGQGCFGEVWMGTWNGTTRVAIKTLKPGTMSPEAFLQEAQVMKKLRHEKLVQLYAVVSEEPIYIVTEYMSKGSLLDFLKGETGKYLRLPQLVDMAAQIASGMAYVERMNYVHRDLRAANILVGENLVCKVADFGLARLIEDNEYTARQGAKFPIKWTAPEAALYGRFTIKSDVWSFGILLTELTTKGRVPYPGMVNREVLDQVERGYRMPCPPECPESLHDLMCQCWRKEPEERPTFEYLQAFLEDYFTSTEPQYQPGENL

Agglomerative Clustering Methods

• Single linkage - similarity between closest neighbors meets a threshold (BLASTCLUST)

• Complete linkage - similarity between furthest neighbors meets a threshold

• Average linkage - average pairwise similarity meets a threshold

• Fractional linkage - fraction of pairwise similarities meet a threshold

Single linkage

“If A is related to B, and B is related to C, then A is related to C.”

This is true only if proteins are homologous over their entire length. Multidomain proteins pull unrelated proteins into the same cluster

A

B

C

Complete linkage

Each protein must be homologous to every other protein in the cluster.

In the analysis of the human genome, Celera used the criterion that the list of matches for each protein must be equivalent.

Query: A B C DMatches: A

BC

ABC

ABCD

CD

Clusters:A B C D

Problem: Does the domain structure of D differ from A and B, or is it homologous, but too distantly related to be detected by the BLAST search? Complete linkage generates many small clusters.

Fractional linkageProteins must have a fraction of matches in common.

Query: A B C DMatches: A

BC

ABC

ABCD

CD

Clusters: A B C

In analysis of the human genome, Celera used the Lek metric with a cutoff of .75:

Lek = 2 x (MatchesA MatchesB) / (MatchesA + MatchesB)

LekAB = 2 x 3 / (3 + 3) = 1

LekAC = 2 x 3 / (3 + 4) = 0.85 DLekCD = 2 x 2 / (4 + 2) = 0.67

LekAD = 2 x 1 / (3+2) = 0.40

Average linkage

The average pairwise similarity between all proteins must meet a threshold.

In hierarchical average linkage, the most closely related proteins are grouped first. Then the threshold is relaxed, allowing clusters to merge.

Which is method is best for clustering full-length protein sequences?

• Objectives

– Group full-length proteins with similar domain organization

– Minimize the problem caused by multidomain proteins

– Minimize number of singletons

Comparing Protein Clustering Methods

• Four methods - single, complete, average and fractional linkage

• Thresholds range from E 10-10 (permissive) to E 10-200 (stringent)

• Look at cluster number, number of singletons, largest cluster size, CDC score for each cluster set

• Test set: the Arabidopsis proteome (~25,000 proteins)

Comparison of cluster sets assembled using linkage threshold E 10-10

Method Clusters(includingSingletons)

Singletons LargestCluster Size

SL 7702 4940 5852

AL

FL

CL

9607 5943 385

12072 8248 428

17639 14151 60

Comparing protein cluster sets: CDC Score

• Cluster Domain Consistency (CDC) score - a metric for testing similarity in domain organization among clustered proteins

• CDC = 0: proteins have no domains in common

• CDC = 1: proteins have identical domain organization

• The proteins first must be searched against a domain database (Pfam) to identify domains

Step 4. Calculate CDC score for the cluster set:

Mean protein score for proteins in clusters of 1.

CDC = (.278 + .250 + .334 + .056 + 1 + 1)/6 = = 0.486

Step 1. List domains in each cluster.

Protein 1

Protein 2

Protein 3

Protein 4All Domains

Cluster 1

Cluster 2

Protein 5

Protein 6All Domains

npjdi = number of proteins in cluster j with domain i

npj = number of proteins in cluster j

Step 2. Domain score = number of pairwise domain matches between proteins as fraction of total possible

Sdij = domain score for domain i in cluster j = npjdi(npjdi-1)/npj(npj-1)

2 4 0.167

npjdi npj Sdij

3332

2

4444

4

0.50.50.50.1670.167

npjdi npj Sdij

2

222

11

Step 3. Calculate protein score (Spk) for each protein.

Spk = ∑ Sdijk / ndj where

Sdik = domain score of domain i in protein k

nd = number of domains in cluster i

Sp1 = (.5 + .5 + .5 +.167) / 6 = 0.278

Protein Spk

1 0.278 2 0.250 3 0.334 4 0.056 5 1 6 1

Pfam CDC Score

0

0.2

0.4

0.6

0.8

1

6000 10000 14000 18000 22000 26000

Number of Clusters

CLFLALSL

Linkage Criteria for Protein Clustering

• The linkage criterion should discern full- and partial-length matches

– If proteins are similar in domain organization, we expect them to match over their entire length

Linkage Criteria: E-value

• E-value may not be the best linkage criterion because it is length-dependent

• Matches to longer proteins have lower E-values (more significant)

• Length dependency makes it difficult to discern partial and full-length protein matches

Linkage Criteria: Global Alignment Score

• Partial matches will have low scores and full-length matches will have high scores

• Disadvantage: global alignments for each protein pair must be generated using the ALIGN program

Linkage Criteria: Weighted Local Alignment Score

• The local Smith-Waterman score for a pair of similar proteins is divided by the self-match score for the larger of the two proteins.

• The weighted score is low for partial matches.

• The advantage over global score is that global alignments are not required.

Linkage Criteria Comparison

• Two clustering methods: single and average linkage

• Three linkage criteria: E()-value, weighted local score, global score

• Range of thresholds tested for each criterion

• Look at cluster number, number of singletons, largest cluster size, CDC score for each cluster set

Cluster set comparison using most permissive thresholds that resulted in biologically meaningful family sizes (< 600 members).

Method Single Linkage Average Linkage

E()-value E()-valueWeightedLocal Score

WeightedLocal Score

Global Score

LinkageParameter

Threshold

Clusters (includingsingletons)

Singletons

Largest Cluster

E() 10-40

14570

11560

250

0.25 1.1

Global Score

0.05 0.5E() 10-10

9607

5943

385

808711469

7780 4643

225538

6195

10073

171563

10905

7429

Pfam CDC Score

0

0.2

0.4

0.6

0.8

1

6000 10000 14000 18000 22000 26000

Number of Clusters

AL, E()-valueSL, E()-valueAL, localSL, localAL, globalSL, global

Protein Clustering - Conclusions

• Single linkage using E-value is not appropriate for clustering eukaryote proteomes.

• Weighted local score and global score are better linkage criteria than E-value.

• Single linkage using a weighted local score above 0.40 performs as well as average linkage and generates a moderate number (15,746) of Arabidopsis clusters.

• Average linkage generates the smallest cluster numbers and reasonable cluster sizes, and should be the useful for grouping distantly related proteins for function and structure prediction.

Which clustering method is best prior to phylogenetic analysis of predicted proteins?

• Complete linkage? - Too many singletons

• Fractional linkage? - Also a large number of singletons

• Average linkage?

– Problem caused by truncated or overextended gene predictions. Average linkage would likely separate orthologs with incorrectly predicted protein lengths.

• Single linkage with a strict alignment criterion?

– Also has problem caused by truncated or overextended gene predictions:

– A strict alignment criterion separates orthologs due to incorrectly predicted protein lengths

• Single linkage without a strict alignment criterion

– Problem associated with multidomain proteins

SL-AL: a compromise

• We have developed the SL-AL algorithm, which uses a combination of single linkage and average linkage

• First sequences are clustered using single linkage and a permissive alignment criterion

• Percent identity for each protein pair in a single linkage cluster is recorded

• Any cluster with < a threshold percent identity is reclustered using average linkage

Advantages of SL-AL

• The multidomain problem is reduced.

• The algorithm is more tolerant to incorrect gene predictions, and less likely to separate incorrectly predicted orthologs into separate clusters. The cluster set can be used to identify gene prediction problems.

• A minimum pairwise percent identity can be set, making multiple alignment possible.

Another Alternative for Protein Clustering

• Overcome the multidomain problem by dissecting proteins into domains prior to clustering

• Prodom uses protein alignments to identify domain families (~129,000 domain families)

• Problem - partial sequences in protein databases cause the overfragmenting of domains.

• Our work toward a solution: use an alternative approach to identify domain boundaries

Identifying protein linkers to delineate domain boundaries

• Protein structural domain - a unit of a protein that can independently fold into a stable tertiary structure

• Evolutionary domain - a conserved modular unit that can be identified by aligning protein sequences

• Often structural domains are also evolutionary domains (not always)

• Linker - a peptide that connects two protein domains

Src Tyrosine Kinase

MGSNKSKPKDASQRRRSLEPAENVHGAGGGAFPASQTPSKPASADGHRGPSAAFAPAAAEPKLFGGFNSSDTVTSPQRAGPLAGGVTTFVALYDYESRTETDLSFKKGERLQIVNNTEGDWWLAHSLSTGQTGYIPSNYVAPSDSIQAEEWYFGKITRRESERLLLNAENPRGTFLVRESETTKGAYCLSVSDFDNAKGLNVKHYKIRKLDSGGFYITSRTQFNSLQQLVAYYSKHADGLCHRLTTVCPTSKPQTQGLAKDAWEIPRESLRLEVKLGQGCFGEVWMGTWNGTTRVAIKTLKPGTMSPEAFLQEAQVMKKLRHEKLVQLYAVVSEEPIYIVTEYMSKGSLLDFLKGETGKYLRLPQLVDMAAQIASGMAYVERMNYVHRDLRAANILVGENLVCKVADFGLARLIEDNEYTARQGAKFPIKWTAPEAALYGRFTIKSDVWSFGILLTELTTKGRVPYPGMVNREVLDQVERGYRMPCPPECPESLHDLMCQCWRKEPEERPTFEYLQAFLEDYFTSTEPQYQPGENL

Amino acid propensity

• Our goal is to predict protein linker regions using amino acid sequence alone, in the absence of known homologs

• We take advantage of amino acid propensity - the preference of some amino acids for linkers or domains

Our Dataset• Pfam-A - a collection of domain families created using profile

HMMs built on multiple alignments of homologous proteins. (Boeckmann et al. Nucleic Acids Research 2003)

• Linker = a sequence segment of 4 to 20 residues that connects two adjacent regions identified by Pfam as domains.

• Non-linker regions = sequence segments excluding linker regions.

• Used only protein sequences whose entire length can be classified as linker or domain by our criteria, except we allowed up to 20 non-domain residues at the N- and C-termini.

• Results - 11968 sequences with at least one linker region (14339 linker, 28726 corresponding domains and 824 unique domain regions).

Removing Redundancy in the Dataset• 11968 proteins were first grouped into homeomorphic families

(identical domain organization).• All by all sequence comparison of the 11968 sequences using

FASTA • Single-linkage clustering using criteria of E-value10-6 and at least

80% alignment coverage. • Some of the resulting clusters contained sequences with different

domain organizations, due to the transitive nature of single-linkage clustering, so we selected one sequence from each domain organization within each cluster.

• Result - 802 sequences with at least one linker region. These 802 sequences contained 993 linkers and 1988 corresponding domain regions from 376 unique Pfam-A domain families.

• The average length of linkers and domains was 11.24 and 141.38, respectively.

Linker Index• Linker index, yl, reflects the preference of amino acids in the linkers

relative to the domain region from Suyama and Ohara, 2003, Bioinformatics.

• yl = - ln ( fllinker / fl

domain )• Where fl

linker = relative frequency of the amino acid l in the linker (region in the data set.

• yl will be negative if the relative frequency of amino acid l in the linker region is greater than its relative frequency in the domain region.

• To calculate the smoothed linker index, average the linker index within each window size w and assign this averaged linker index value y to the center amino acid of the window by sliding from the N-terminus to the C-terminus of a protein sequence.

• Window size, w = 9, provided the greatest discrimination between linker and non-linker regions among the window sizes from 3 to 20.

Amino Linker Domain yl Acid A 7.97 (7.94) 8.10 0.0166 C 0.89 (1.24) 1.50* 0.5724 D 6.32 (5.28) 5.60* -0.1278 E 7.97 (6.89) 6.60* -0.1794 F 2.74 (4.34) 4.03* 0.3561 G 7.74 (6.14) 7.37 -0.0442 H 1.91 (2.32) 2.27 0.1643 I 4.73 (5.13) 6.37* 0.2758 K 6.97 (5.72) 5.81* -0.2134 L 7.51 (9.60) 9.54* 0.2523 M 2.13 (2.15) 2.24 0.0197 N 4.22 (4.12) 4.08 -0.0786 P 6.63 (6.07) 4.30* -0.4188 Q 3.90 (4.05) 3.33* -0.1051 R 5.77 (5.79) 5.24 -0.0762 S 7.20 (5.55) 6.13* -0.1629 T 5.80 (5.66) 5.35 -0.0701 V 6.24 (6.64) 7.34* 0.1782 W 0.81 (1.24) 1.32* 0.3836 Y 2.46 (3.47) 3.38* 0.2500

Relative Amino Acid Frequency

Numbers in parentheses are from the linker database of George and Heringa, 2002, Protein Engineering.

Hidden Markov Model to predict linker residues

• Sequences are assumed to have a structure composed of regions that are homogeneous within a region but may differ between regions.

• We assume protein sequence data is produced by a hidden Markov model and compositional variation is likely to reflect functional or structural differences between regions.

• Each region is classified into one of a finite number of states (linker and non-linker); we wish to estimate the states given the observed protein sequence.

• Instead of recognizing the protein sequence as a string of amino acids (categorical variables), we recognize the protein sequence as a string of linker index values (continuous variables).

Predicting linker state

• Our objective is to identify linker index values that discriminate linker and non-linker regions.

• Used Gibbs sampling to overcome the problem of missing data.

• Calculate the probability state (linker or non-linker) for each residue i in a protein sequence.

• Predict the state of an amino acid using the classification variable

• CV = 1 if the probability of state k is x.• CV = 0 if the probability of state k is x.

Training and Testing

• We applied our model to the protein sequence dataset constructed from Pfam-A.

• 5-fold cross validation was applied to the dataset - we divided the dataset into the training dataset and the test dataset randomly with the ratio of 4:1.

• We trained the model with the training dataset of 642 sequences and tested the trained model with the test dataset of 160 sequences.

• We ran Gibbs sampling with 40,000 iterations and 10,000 burn-

in to train the model.

Examples of good predictions

Examples of Over-prediction

Sensitivity and Selectivity

• Sensitivity = the percentage of actual linker residues that were predicted to be linker. Sn=Tp(Tp+Fn)

• Specificity = the percentage of predicted linker residues that were truly linker. Sp=Tp/(Tp+Fp)

Sensitivity and Selectivity plotted against cut-off for classification variable

Sensitivity and selectivity are each 67% at their intersection.

Conclusions• This method will be useful in defining linkers for evolutionary

domains. • Our method appears to do well at distinguishing linkers from

intra-domain loops.• We must test our approach using a structurally defined

dataset to fully understand its ability to distinguish structural linkers from intra-domain loops.– Since Pfam-A identifies domains as evolutionarily conserved units,

non-conserved intra-domain loops can cause structural domains to be annotated as multiple Pfam-A domains. Thus, some of our Pfam-A defined linkers may actually be loops in structural domains.

– Conversely, two structural domains that are always found together may be defined by Pfam-A as a single evolutionary domain; some of our false positives may actually be structural linkers.

Acknowledgements• William Pearson - University of Virginia

• Bani Mallick - Texas A&M University

• Elsik Lab– Kyounghwa Bae– Justin Reese– Anand Venkatraman– Shreyas Murthi– Michael Dickens– Juan Anzola

protein clustering to assemble families of homeomorphic proteins chris elsik [email protected]

Documents

orthologs paralogs

common gene

gene duplication

gene trees

gene fragments

protein clustering

conserved function

orthologs homologs