The PIRSF Protein Classification System as a Basis for Automated UniProt Protein Annotation



Page 1: The PIRSF Protein Classification System  as a Basis for Automated UniProt Protein Annotation

1

The PIRSF Protein Classification System as a Basis for Automated UniProt Protein Annotation

Darren A. Natale, Ph.D.
Project Manager and Senior Scientist, PIR
Research Assistant Professor, GUMC

Icobicobi 2004
Angra Dos Reis, RJ, Brasil

Page 2: The PIRSF Protein Classification System  as a Basis for Automated UniProt Protein Annotation

2

Major Topics

1) UniProt Overview

2) PIRSF Protein Classification System

3) Family-Driven Protein Annotation

Page 3: The PIRSF Protein Classification System  as a Basis for Automated UniProt Protein Annotation

3

UniProt: Universal Protein Resource
• Central Resource of Protein Sequence and Function
• International Consortium: PIR, EBI, SIB
• Unifies PIR-PSD, Swiss-Prot, TrEMBL

http://www.uniprot.org

Page 4: The PIRSF Protein Classification System  as a Basis for Automated UniProt Protein Annotation

4

UniProt Databases
• UniParc: Comprehensive Sequence Archive with Sequence History
• UniProt: Knowledgebase with Full Classification and Functional Annotation
• UniRef: Condensed Reference Databases for Sequence Search

[Diagram labels: source databases (Swiss-Prot, PIR-PSD, TrEMBL, RefSeq, GenBank/EMBL/DDBJ, Ensembl, PDB, Patent Data, Other Data); UniParc (Archive); Merging; UniProt (Knowledgebase) with Classification, Literature-Based & Automated Annotation; Clustering at 100, 90, 50% Identity; UniRef100 (NREF), UniRef90, UniRef50]

Page 5: The PIRSF Protein Classification System  as a Basis for Automated UniProt Protein Annotation

5

UniParc: An archive for tracking protein sequences
• Comprehensive: All published protein sequences
• Non-Redundant: Merge identical sequence strings
• Traceable: Versioned, with ‘Active’ or ‘Obsolete’ status tag
• Concise: No annotation of function, species, tissue, etc.

2.5 million unique entries from 6 million source-database entries
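The UniParc properties listed here (merge identical sequence strings, keep every source record traceable with a status tag) can be sketched as a small data structure. This is only an illustration under stated assumptions: the class names, the accessions `ACC_A`/`ACC_B`/`ACC_C`, and the toy sequences are hypothetical and do not reflect how UniParc is actually implemented.

```python
from dataclasses import dataclass, field


@dataclass
class ArchiveEntry:
    """One unique sequence string plus every source record that carries it."""
    sequence: str
    # (database, accession) -> "Active" or "Obsolete"
    sources: dict = field(default_factory=dict)


class SequenceArchive:
    """Hypothetical non-redundant archive keyed by the sequence string."""

    def __init__(self):
        self._by_sequence = {}

    def add(self, database, accession, sequence):
        """Register a source record; identical strings share one entry."""
        entry = self._by_sequence.setdefault(sequence, ArchiveEntry(sequence))
        entry.sources[(database, accession)] = "Active"
        return entry

    def retire(self, database, accession):
        """Mark a source record obsolete without deleting its history."""
        for entry in self._by_sequence.values():
            if (database, accession) in entry.sources:
                entry.sources[(database, accession)] = "Obsolete"

    def __len__(self):
        return len(self._by_sequence)


archive = SequenceArchive()
archive.add("Swiss-Prot", "ACC_A", "MKTAYIAKQR")
archive.add("TrEMBL", "ACC_B", "MKTAYIAKQR")  # same string, second source
archive.add("RefSeq", "ACC_C", "MSLNPEQW")
archive.retire("TrEMBL", "ACC_B")
```

Three source records collapse to two archive entries, and the retired record stays attached to its entry with an ‘Obsolete’ tag rather than disappearing.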

Page 6: The PIRSF Protein Classification System  as a Basis for Automated UniProt Protein Annotation

6

UniProt Knowledgebase
• Annotated: Fully manually curated (Swiss-Prot section) and automatically annotated based on family-driven rules (TrEMBL section)
• Cross-referenced: Links to over 50 external databases (classification, domain, structure, genome, functional, boutique)
• Non-redundant: Merge in a single record all protein products derived from a certain gene in a given species

High Information Content:
• Isoform Presentation: Alternatively Spliced Forms, Proteolytic Cleavage, and Post-Translational Modification (each with FTid)
• Nomenclature: Gene/Protein Names (Nomenclature Committees)
• Family Classification and Domain Identification: InterPro and PIRSF
• Functional Annotation: Function, Functional Site, Developmental Stage, Catalytic Activity, Modification, Regulation, Induction, Pathway, Tissue Specificity, Sub-cellular Location, Disease, Process

Page 7: The PIRSF Protein Classification System  as a Basis for Automated UniProt Protein Annotation

7

UniProt Report (modified Swiss-Prot “NiceProt” view)
• ID & Accession
• Name & Taxon
• References
• Activity
• Pathway
• Disease
• Cross-Refs

Page 8: The PIRSF Protein Classification System  as a Basis for Automated UniProt Protein Annotation

8

UniProt Report (II)

Position-specific features:
• Active sites
• Binding sites
• Modified residues
• Sequence variations

• Additional Info
• Expanded detail

Page 9: The PIRSF Protein Classification System  as a Basis for Automated UniProt Protein Annotation

9

UniRef Databases

Non-Redundant: Merge sequences and subsequences
• UniRef100: 100% sequence identity from all species, including sub-fragments
• Superset of Knowledgebase: Includes splice variants and selected UniParc sources (e.g. EnsEMBL, IPI, and patent data)

Optimized: For Faster Searches using Reduced Data Sets
• UniRef90: 90% sequence identity (36% size reduction)
• UniRef50: 50% sequence identity (63% size reduction)
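The UniRef100 idea of merging identical sequences and sub-fragments can be illustrated with plain substring containment. This is a minimal sketch, assuming that a "sub-fragment" is an exact substring of a longer sequence; the function name, accessions, and sequences are hypothetical, and the real UniRef pipeline is not described in these slides.

```python
def merge_identical_and_fragments(records):
    """Group records whose sequence equals, or is a substring of, a longer one.

    records: dict of accession -> sequence.
    Returns dict of representative accession -> sorted list of members.
    """
    # Longest first, so every potential parent is seen before its fragments.
    ordered = sorted(records.items(), key=lambda kv: (-len(kv[1]), kv[0]))
    clusters = {}
    for acc, seq in ordered:
        for rep in clusters:
            if seq in records[rep]:  # identical string or exact sub-fragment
                clusters[rep].append(acc)
                break
        else:
            clusters[acc] = [acc]  # no containing sequence: new representative
    return {rep: sorted(members) for rep, members in clusters.items()}


records = {
    "P1": "MKTAYIAKQRQISFVK",
    "P2": "MKTAYIAKQRQISFVK",  # identical to P1
    "F1": "AYIAKQR",           # sub-fragment of P1
    "Q1": "MSLNPEQWLT",        # unrelated
}
merged = merge_identical_and_fragments(records)
```

UniRef90 and UniRef50 would need a real sequence-identity measure in place of the substring test, which is outside the scope of this sketch.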

Page 10: The PIRSF Protein Classification System  as a Basis for Automated UniProt Protein Annotation

10

UniRef100 Report
• 100% sequence identity from all species, including sub-fragments
• Splice variants as separate entries

[Screenshot callouts: Splice variants; Sub-fragments]

Page 11: The PIRSF Protein Classification System  as a Basis for Automated UniProt Protein Annotation

11

UniRef90/50 Reports

[Screenshot callouts: Representative sequence; 90%: merged sequences likely have the same function; 50%: Phenylalanine hydroxylase & Tryptophan hydroxylase]

Page 12: The PIRSF Protein Classification System  as a Basis for Automated UniProt Protein Annotation

12

UniProt Web Site

Publicly available Dec. 15, 2003

Text/Sequence Searches against UniProt, UniRef, UniParc

Links to Useful Tools

Download UniProt, UniRefs

FAQs and Information

User Help/feedback forms

http://www.uniprot.org

Page 13: The PIRSF Protein Classification System  as a Basis for Automated UniProt Protein Annotation

13

The Need for Classification

Problem:
• Most new protein sequences come from genome sequencing projects
• Many have unknown functions
• Large-scale functional annotation of these sequences based simply on BLAST best hit has pitfalls; results are far from perfect

Solution:
• Highly curated and annotated protein classification system

Facilitates:
• Automatic annotation of sequences based on protein families
• Systematic correction of annotation errors
• Name standardization in UniProt
• Functional predictions for uncharacterized proteins

This all works only if the system is optimized for annotation

Page 14: The PIRSF Protein Classification System  as a Basis for Automated UniProt Protein Annotation

14

Levels of Protein Classification

Level | Example | Similarity | Evolution
Class | / | Structural elements | No relationships
Fold | TIM-Barrel | Topology of backbone | Possible monophyly
Domain Superfamily | Aldolase | Recognizable sequence similarity (motifs); basic biochemistry | Monophyletic origin
Family | Class I Aldolase | High sequence similarity (alignments); biochemical properties | Evolution by ancient duplications
Orthologous group | 2-keto-3-deoxy-6-phosphogluconate aldolase | Orthology for a given set of species; biochemical activity; biological function | Traceable to a single gene in LCA
Lineage-specific expansion (LSE) | PA3131 and PA3181 | Paralogy within a lineage | Recent duplication

Page 15: The PIRSF Protein Classification System  as a Basis for Automated UniProt Protein Annotation

15

Protein Evolution
• Sequence changes: With enough similarity, one can trace back to a common origin
• Domain shuffling: What about these?

Page 16: The PIRSF Protein Classification System  as a Basis for Automated UniProt Protein Annotation

16

Consequences of Domain Shuffling

CM = chorismate mutase; PDH = prephenate dehydrogenase; PDT = prephenate dehydratase; ACT = regulatory domain

Domain architectures shown:
• PIRSF001500: CM (AroQ type), PDT, ACT
• PIRSF001501: CM (AroQ type)
• PIRSF006786: PDH
• PIRSF001499: CM (AroQ type), PDH
• PIRSF005547: PDH, ACT
• PIRSF001424: PDT, ACT

[Figure callouts: PDT? CM/PDH? PDH? CM/PDT? CM?]

Page 17: The PIRSF Protein Classification System  as a Basis for Automated UniProt Protein Annotation

17

Whole Protein = Sum of its Parts?

PIRSF006256 domains: Peptidase M22, Acylphosphatase, ZnF, YrdC, ZnF

On the basis of domain composition alone, biological function was predicted to be:
● RNA-binding translation factor
● maturation protease

Actual function:
● [NiFe]-hydrogenase maturation factor, carbamoyl phosphate-converting enzyme

Page 18: The PIRSF Protein Classification System  as a Basis for Automated UniProt Protein Annotation

18

Classification Goals

We strive to reconstruct the natural classification of proteins to the fullest possible extent

BUT
Domain shuffling rapidly degrades the continuity in the protein structure (faster than sequence divergence degrades similarity)

THUS
The further we extend the classification, the finer is the domain structure we need to consider

SO
We need to compromise between the depth of analysis and protein integrity

OR

Credit: Dr. Y. Wolf, NCBI

Page 19: The PIRSF Protein Classification System  as a Basis for Automated UniProt Protein Annotation

19

Complementary Approaches

Domain Classification
• Allows a hierarchy that can trace evolution to the deepest possible level, the last point of traceable homology and common origin
• Can usually annotate only general biochemical function

Whole-protein Classification
• Cannot build a hierarchy deep along the evolutionary tree because of domain shuffling
• Can usually annotate specific biological function (preferred to annotate individual proteins)

• Can map domains onto proteins
• Can classify proteins even when domains are not defined

Page 20: The PIRSF Protein Classification System  as a Basis for Automated UniProt Protein Annotation

20

The Ideal System…
• Comprehensive: each sequence is classified either as a member of a family or as an “orphan” sequence
• Hierarchical: families are united into superfamilies on the basis of distant homology, and divided into subfamilies on the basis of close homology
• Allows for simultaneous use of the whole protein and domain information (domains mapped onto proteins)
• Allows for automatic classification/annotation of new sequences when these sequences are classifiable into the existing families
• Expertly curated membership, family name, function, background, etc.
• Evidence attribution (experimental vs predicted)

Page 21: The PIRSF Protein Classification System  as a Basis for Automated UniProt Protein Annotation

21

PIRSF Classification System

PIRSF:
• A network structure from superfamilies to subfamilies
• Reflects evolutionary relationships of full-length proteins

Definitions:
• Homeomorphic Family: Basic Unit
• Homologous: Common ancestry, inferred by sequence similarity
• Homeomorphic: Full-length similarity & common domain architecture
• Network Structure: Flexible number of levels with varying degrees of sequence conservation; allows multiple parents

Advantages:
• Annotate both general biochemical and specific biological functions
• Accurate propagation of annotation and development of standardized protein nomenclature and ontology

Page 22: The PIRSF Protein Classification System  as a Basis for Automated UniProt Protein Annotation

22

PIRSF Classification System

Levels:
• Domain Superfamily: one common Pfam domain
• PIRSF Superfamily: 0 or more levels; one or more common domains
• PIRSF Homeomorphic Family: exactly one level; full-length sequence similarity and common domain architecture
• PIRSF Homeomorphic Subfamily: 0 or more levels; functional specialization

Examples:

PF02153: Prephenate dehydrogenase (PDH)
  PIRSF001499: Bifunctional CM/PDH (T-protein)
  PIRSF006786: PDH, feedback inhibition-insensitive
  PIRSF005547: PDH, feedback inhibition-sensitive

PF01817: Chorismate mutase (CM)
  PIRSF017318: CM of AroQ class, eukaryotic type
  PIRSF001501: CM of AroQ class, prokaryotic type
  PIRSF026640: Periplasmic CM
  PIRSF001500: Bifunctional CM/PDT (P-protein)
  PIRSF001499: Bifunctional CM/PDH (T-protein)

PF02735: Ku70/Ku80 beta-barrel domain
  PIRSF800001: Ku70/80 autoantigen
    PIRSF003033: Ku70 autoantigen
    PIRSF016570: Ku80 autoantigen
  PIRSF006493: Ku, prokaryotic type

PF00219: Insulin-like growth factor binding protein (IGFBP)
  PIRSF001969: IGFBP
    PIRSF500001: IGFBP-1
    …
    PIRSF500006: IGFBP-6
  PIRSF018239: IGFBP-related protein, MAC25 type

A protein may be assigned to only one homeomorphic family, which may have zero or more child nodes and zero or more parent nodes. Each homeomorphic family may have as many domain superfamily parents as its members have domains.
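The two constraints stated here (one homeomorphic family per protein, but any number of parents per node) can be captured with a small graph structure. The sketch below is an illustration only: the class and method names are hypothetical, the protein label `example_T_protein` is made up, and the Ku edges reflect one reading of the slide's diagram. PIRSF001499 is used because the slide lists it under both PF02153 and PF01817.

```python
from collections import defaultdict


class PirsfNetwork:
    """Hypothetical container for a classification network with multiple parents."""

    def __init__(self):
        self.parents = defaultdict(set)  # node id -> set of parent node ids
        self.family_of = {}              # protein -> homeomorphic family id

    def add_edge(self, child, parent):
        self.parents[child].add(parent)

    def assign(self, protein, family):
        """Enforce that a protein belongs to exactly one homeomorphic family."""
        current = self.family_of.get(protein)
        if current is not None and current != family:
            raise ValueError(f"{protein} is already assigned to {current}")
        self.family_of[protein] = family

    def ancestors(self, node):
        """All nodes reachable by following parent links upward."""
        seen, stack = set(), [node]
        while stack:
            for parent in self.parents.get(stack.pop(), ()):
                if parent not in seen:
                    seen.add(parent)
                    stack.append(parent)
        return seen


net = PirsfNetwork()
# A bifunctional family has one domain-superfamily parent per domain.
net.add_edge("PIRSF001499", "PF02153")
net.add_edge("PIRSF001499", "PF01817")
# A family under a PIRSF superfamily, which in turn sits under a Pfam domain.
net.add_edge("PIRSF003033", "PIRSF800001")
net.add_edge("PIRSF800001", "PF02735")
net.assign("example_T_protein", "PIRSF001499")
```

Storing parents as a set per node is what makes this a network rather than a strict tree; a second `assign` of the same protein to a different family raises an error.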

Page 23: The PIRSF Protein Classification System  as a Basis for Automated UniProt Protein Annotation

23

Variable Domain Architecture

Domain architecture cannot be strictly followed in every case without making small and meaningless PIRSFs that preclude automatic member addition. Therefore, define a “core” and allow:

1. Variable number of repeats

Page 24: The PIRSF Protein Classification System  as a Basis for Automated UniProt Protein Annotation

24

Variable Domain Architecture

Domain architecture cannot be strictly followed in every case without making small and meaningless PIRSFs that preclude automatic member addition. Therefore, define a “core” and allow:

2. Presence/absence of auxiliary domains
• Easily lost or acquired
• Usually small mobile domains
• Different versions of domain architecture arising many times

Page 25: The PIRSF Protein Classification System  as a Basis for Automated UniProt Protein Annotation

25

Variable Domain Architecture

Domain architecture cannot be strictly followed in every case without making small and meaningless PIRSFs that preclude automatic member addition. Therefore, define a “core” and allow:

3. Domain duplication

Page 26: The PIRSF Protein Classification System  as a Basis for Automated UniProt Protein Annotation

26

Classification Tool: BlastClust
• Curator-guided clustering
• Retrieve all proteins sharing a common domain
• Single-linkage clustering using BlastClust
• Fixed-length coverage enforces homeomorphicity
• Iterative procedure allows tree view
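Single-linkage clustering with a length-coverage requirement can be sketched with a union-find structure. This is not BlastClust itself: the function name, the hit tuples, and the threshold values are assumptions for illustration, and real hits would come from pairwise sequence comparisons.

```python
def single_linkage_clusters(ids, hits, min_identity, min_coverage):
    """Cluster sequences, linking any pair that passes both thresholds.

    hits: iterable of (id_a, id_b, percent_identity, coverage_fraction).
    Returns a sorted list of sorted member lists.
    """
    parent = {i: i for i in ids}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b, identity, coverage in hits:
        # The coverage test rejects matches confined to a shared domain,
        # which is how full-length (homeomorphic) similarity is enforced.
        if identity >= min_identity and coverage >= min_coverage:
            parent[find(a)] = find(b)

    clusters = {}
    for i in ids:
        clusters.setdefault(find(i), []).append(i)
    return sorted(sorted(members) for members in clusters.values())


ids = ["A", "B", "C", "D"]
hits = [
    ("A", "B", 45.0, 0.95),
    ("B", "C", 40.0, 0.92),  # A and C end up together through B
    ("C", "D", 60.0, 0.40),  # high identity but only a partial-length match
]
clusters = single_linkage_clusters(ids, hits, min_identity=30.0, min_coverage=0.9)
```

Rerunning the same function at progressively stricter identity thresholds yields nested clusters, which is one way to read the "iterative procedure allows tree view" bullet.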

Page 27: The PIRSF Protein Classification System  as a Basis for Automated UniProt Protein Annotation

27

PIRSF Family Report (I)

Curated family name

Description of family

Sequence analysis tools

Phylogenetic tree and alignment view allows further sequence analysis

Taxonomic distribution of PIRSF can be used to infer evolutionary history of the proteins in the PIRSF

Page 28: The PIRSF Protein Classification System  as a Basis for Automated UniProt Protein Annotation

28

PIRSF Family Report (II)

Integrated value-added information from other databases

Mapping to other protein classification databases

Page 29: The PIRSF Protein Classification System  as a Basis for Automated UniProt Protein Annotation

29

Family-Driven Protein Annotation

PIRSF Protein Classification provides a platform for UniProt protein annotation

Improve Annotation Quality
• Annotate biological function of whole proteins
• Annotate uncharacterized hypothetical proteins (functional predictions helped by newly-detected family relationships)
• Correct annotation errors
• Improve under- or over-annotated proteins

Standardize Protein Names in UniProt

Site annotation

Page 30: The PIRSF Protein Classification System  as a Basis for Automated UniProt Protein Annotation

30

Enhanced Annotations in UniProt

Corrections
UniProt ID | OLD name | NEW (proposed) name | PIRSF
P38678 | Glucan synthase-1 | Cell wall assembly and cell proliferation coordinating protein | PIRSF017023
Q05632 | Decarboxylase | Probable cobalt-precorrin-6Y C(15)-methyltransferase [decarboxylating] | PIRSF019019
P72117 | PAO substrain OT684 pyoverdine gene transcriptional regulator PvdS | Thioesterase, type II | PIRSF000881

Upgraded underannotations
UniProt ID | OLD name | NEW (proposed) name | PIRSF
P37185 | Hydrogenase-2 operon protein hybG | [NiFe]-hydrogenase maturation chaperone | PIRSF005618
P40360 | Hypothetical 65.6 kDa protein in SMC3-MRPL8 intergenic region | Amino-acid acetyltransferase, fungal type | PIRSF007892
Q98FY9 | CobT protein | Aerobic cobaltochelatase, CobT subunit | PIRSF031715

Predicted functions for “hypothetical” proteins
UniProt ID | OLD name | NEW (proposed) name | PIRSF
Q57948 | Hypothetical protein MJ0528 | Predicted [NiFe]-hydrogenase-3-type complex Eha, membrane protein EhaA | PIRSF005019
Q58527 | Hypothetical protein MJ1127 | Predicted metal-dependent hydrolase | PIRSF004961
O28300 | Hypothetical protein AF1979 | Predicted nucleotidyltransferase | PIRSF005928

Page 31: The PIRSF Protein Classification System  as a Basis for Automated UniProt Protein Annotation

31

Family-Driven Protein Annotation
Objective: Optimize for protein annotation

PIRSF Classification Name
• Reflects the function when possible
• Indicates the maximum specificity that still describes the entire group
• Standardized format
• Name tags: validated, tentative, predicted, functionally heterogeneous

Hierarchy
• Subfamilies increase specificity (kinase -> sugar kinase -> hexokinase)

Name Rules
• Define conditions under which names propagate to individual proteins
• Enable further specificity based on taxonomy or motifs
• Names adhere to Swiss-Prot conventions (though we may make suggestions for improvement)

Site Rules
• Define conditions under which features propagate to individual proteins

Page 32: The PIRSF Protein Classification System  as a Basis for Automated UniProt Protein Annotation

32

PIR Name Rules

Account for functional variations within one PIRSF, including:
• Lack of active site residues necessary for enzymatic activity
• Certain activities relevant only to one part of the taxonomic tree
• Evolutionarily-related proteins whose biochemical activities are known to differ
Monitor such variables to ensure accurate propagation

Propagate other properties that describe function: EC, GO terms, misnomer info, pathway

Name Rule types:
• “Zero” Rule
  - Default rule (only condition is membership in the appropriate family)
  - Information is suitable for every member
• “Higher-Order” Rule
  - Has requirements in addition to membership
  - Can have multiple rules that may or may not have mutually exclusive conditions

Page 33: The PIRSF Protein Classification System  as a Basis for Automated UniProt Protein Annotation

33

Example Name Rules

Rule ID | Rule Conditions | Propagated Information
PIRNR000881-1 | PIRSF000881 member and vertebrates | Name: S-acyl fatty acid synthase thioesterase; EC: oleoyl-[acyl-carrier-protein] hydrolase (EC 3.1.2.14)
PIRNR000881-2 | PIRSF000881 member and not vertebrates | Name: Type II thioesterase; EC: thiolester hydrolases (EC 3.1.2.-)
PIRNR025624-0 | PIRSF025624 member | Name: ACT domain protein; Misnomer: chorismate mutase

Note the lack of a zero rule for PIRSF000881

Page 34: The PIRSF Protein Classification System  as a Basis for Automated UniProt Protein Annotation

34

Name Rule in Action at UniProt

Current:
• Automatic annotations (AA) are in a separate field
• AA only visible from www.ebi.uniprot.org

Future:
• Automatic name annotations will become the DE line if the DE line will improve as a result
• AA will be visible from all consortium-hosted web sites

Page 35: The PIRSF Protein Classification System  as a Basis for Automated UniProt Protein Annotation

35

Name Rule Propagation Pipeline

1. Affiliation of Sequence: Homeomorphic Family or Subfamily (whichever PIRSF is the lowest possible node)
2. Name rule exists?
   No: Nothing to propagate
   Yes: continue to step 3
3. Protein fits criteria for any higher-order rule?
   Yes: Assign name from Name Rule 1 (or 2 etc)
   No: continue to step 4
4. PIRSF has zero rule?
   Yes: Assign name from Name Rule 0
   No: Nothing to propagate
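The decision flow of the propagation pipeline can be written as a short function. This is a minimal sketch under stated assumptions: the `NameRule` class, the `is_vertebrate` field, and the identifier `PIRSF999999` are hypothetical, while the rule IDs and names are taken from the Example Name Rules slide.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class NameRule:
    rule_id: str
    name: str
    # None marks a zero rule: family membership is the only condition.
    condition: Optional[Callable[[dict], bool]] = None


def propagate_name(protein, rules_by_pirsf):
    """Return (rule_id, name) for the protein, or None if nothing propagates."""
    rules = rules_by_pirsf.get(protein["pirsf"], [])
    if not rules:  # Name rule exists? No
        return None
    for rule in rules:  # Higher-order rules are tried first
        if rule.condition is not None and rule.condition(protein):
            return rule.rule_id, rule.name
    for rule in rules:  # Otherwise fall back to the zero rule, if any
        if rule.condition is None:
            return rule.rule_id, rule.name
    return None  # No zero rule: nothing to propagate


RULES = {
    "PIRSF000881": [
        NameRule("PIRNR000881-1", "S-acyl fatty acid synthase thioesterase",
                 lambda p: p["is_vertebrate"]),
        NameRule("PIRNR000881-2", "Type II thioesterase",
                 lambda p: not p["is_vertebrate"]),
    ],
    "PIRSF025624": [NameRule("PIRNR025624-0", "ACT domain protein")],
}

vertebrate = propagate_name({"pirsf": "PIRSF000881", "is_vertebrate": True}, RULES)
bacterial = propagate_name({"pirsf": "PIRSF000881", "is_vertebrate": False}, RULES)
act = propagate_name({"pirsf": "PIRSF025624", "is_vertebrate": False}, RULES)
unruled = propagate_name({"pirsf": "PIRSF999999", "is_vertebrate": False}, RULES)
```

Checking higher-order rules before the zero rule mirrors the flowchart: the more specific name wins whenever its extra conditions are met.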

Page 36: The PIRSF Protein Classification System  as a Basis for Automated UniProt Protein Annotation

36

PIR Site Rules

Position-Specific Site Features:
• active sites
• binding sites
• modified amino acids

Current requirements:
• at least one PDB structure
• experimental data on functional sites: CATRES database (Thornton)

Rule Definition:
• Select template structure
• Align PIRSF seed members with structural template
• Edit MSA to retain conserved regions covering all site residues
• Build Site HMM from concatenated conserved regions

Page 37: The PIRSF Protein Classification System  as a Basis for Automated UniProt Protein Annotation

37

Site Rule Algorithm

Match Rule Conditions
• Membership Check (PIRSF HMM threshold): ensures that the annotation is appropriate
• Conserved Region Check (site HMM threshold)
• Site Residue Check (all position-specific residues in HMMAlign)

Propagate Information
• Feature annotation using controlled vocabulary
• Evidence attribution (experimental vs. computational prediction)
• Attribute sources and strengths of evidence

Page 38: The PIRSF Protein Classification System  as a Basis for Automated UniProt Protein Annotation

38

Match Rule Conditions

Only propagate site annotation if all rule conditions are met
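The all-conditions-must-pass logic can be sketched as three sequential checks. Everything concrete in this example is an assumption made for illustration: the dictionary layout, the score and threshold values, and the residue positions are hypothetical, and the HMM scoring and alignment steps are represented only by precomputed numbers.

```python
def propagate_sites(protein, rule):
    """Return predicted site features, or [] if any rule condition fails."""
    # 1. Membership check against the family-level HMM threshold.
    if protein["family_hmm_score"] < rule["family_threshold"]:
        return []
    # 2. Conserved region check against the site HMM threshold.
    if protein["site_hmm_score"] < rule["site_threshold"]:
        return []
    # 3. Site residue check: every template position must align to the
    #    expected residue in the query protein.
    aligned = protein["aligned_residues"]  # template pos -> (query pos, residue)
    features = []
    for template_pos, expected in sorted(rule["site_residues"].items()):
        query_pos, residue = aligned.get(template_pos, (None, None))
        if residue != expected:
            return []
        features.append({
            "position": query_pos,
            "residue": residue,
            "feature": rule["feature"],
            "evidence": "predicted",  # computational, not experimental
        })
    return features


rule = {
    "family_threshold": 100.0,
    "site_threshold": 20.0,
    "site_residues": {57: "H", 102: "D"},
    "feature": "Active site",
}
matching = {
    "family_hmm_score": 250.0,
    "site_hmm_score": 35.0,
    "aligned_residues": {57: (61, "H"), 102: (110, "D")},
}
# Same protein, but one site residue is substituted.
substituted = dict(matching, aligned_residues={57: (61, "H"), 102: (110, "N")})

good_features = propagate_sites(matching, rule)
bad_features = propagate_sites(substituted, rule)
```

Returning an empty list on any failure means a single substituted residue blocks all site features for that protein, not only the affected position.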

Page 39: The PIRSF Protein Classification System  as a Basis for Automated UniProt Protein Annotation

39

PIRSF Family Report (III)

Defined rules for annotation

Site rules allow precise annotation of features for UniProt proteins within the PIRSF

Page 40: The PIRSF Protein Classification System  as a Basis for Automated UniProt Protein Annotation

40

Site Rules Feed Name Rules

Functional variation within one PIRSF: binding sites with different specificity drive choice of applicable rule to ensure appropriate annotation

Functional Site rule: tags active site, binding, other residue-specific information

Functional Annotation rule: gives name, EC, other activity-specific information

Page 41: The PIRSF Protein Classification System  as a Basis for Automated UniProt Protein Annotation

41

PIR Team

Dr. Cathy Wu, Director

Curation team: Dr. Winona Barker; Dr. Darren Natale; Dr. CR Vinayaka; Dr. Zhangzhi Hu; Dr. Anastasia Nikolskaya; Dr. Xianying Wei; Dr. Raja Mazumder; Dr. Sona Vasudevan; Dr. Lai-Su Yeh

Informatics team: Dr. Leslie Arminski; Yongxing Chen, M.S.; Jian Zhang, M.S.; Dr. Hsing-Kuo Hua; Sehee Chung, M.S.; Amar Kalelkar; Dr. Hongzhan Huang; Baris Suzek, M.S.

Students: Jorge Castro-Alvear; Vincent Hormoso; Rathi Thiagarajan; Christina Fang; Natalia Petrova

UniProt Collaborators: Dr. Rolf Apweiler/EBI; Dr. Amos Bairoch/SIB

Page 42: The PIRSF Protein Classification System  as a Basis for Automated UniProt Protein Annotation

42

Curator’s Decision Maker