European Bioinformatics Institute
The Gene Ontology Annotation (GOA) Database and
enhancement of GO annotations through InterPro2GO
Nicky Mulder
European Bioinformatics Institute
Contents
• Introduction to GOA
• Manual GOA annotation
• Electronic annotation: InterPro2GO
• GOA data flow
• Uses of GOA
• Future plans
What is GO annotation?
An annotation is a statement that a gene product
• has a particular molecular function
• is involved in a particular biological process
• is located within a certain cellular component
…as determined by a particular method
…as described in a particular reference.
[Diagram: GO term and ID, evidence code, reference]
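The parts of an annotation described on this slide can be pictured as a small record. The following is a minimal sketch in Python, not GOA's actual schema: the class name, field names, placeholder accession and placeholder PubMed ID are all assumptions made for illustration. IDA (Inferred from Direct Assay) is used as an example GO evidence code.

```python
from dataclasses import dataclass

# The three GO aspects named on the slide.
ASPECTS = {"molecular_function", "biological_process", "cellular_component"}


@dataclass(frozen=True)
class Annotation:
    """One GO annotation: gene product + GO term + evidence + reference."""
    gene_product: str   # e.g. a protein accession (placeholder value below)
    go_id: str          # GO term identifier
    go_term: str        # GO term name
    aspect: str         # one of ASPECTS
    evidence_code: str  # how the association was determined
    reference: str      # where the association is described

    def __post_init__(self):
        # Reject records whose aspect is not one of the three GO aspects.
        if self.aspect not in ASPECTS:
            raise ValueError(f"unknown aspect: {self.aspect}")


# Illustrative record only: the accession and PubMed ID are placeholders.
example = Annotation(
    gene_product="P00000",
    go_id="GO:0004022",
    go_term="alcohol dehydrogenase activity",
    aspect="molecular_function",
    evidence_code="IDA",
    reference="PMID:0000000",
)
```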
Gene Ontology Annotation (GOA) Database
• GOA’s priority is to annotate the human, mouse and rat proteomes
• Largest open-source contributor of annotations to GO
• Provides 10 million annotations for more than 111,000 species
• Shares and integrates GO annotations
How do we annotate GO terms?
Manual Annotation
Electronic Annotation
All annotations must:
• be attributed to a source
• indicate what evidence was found to support the GO term-gene/protein association
Manual annotation
• High quality
• Specific gene or gene product associations made using:
– Peer-reviewed papers
– Evidence codes
• BUT:
– Time-consuming
– Requires trained biologists
Manual GO annotation
[Flow diagram: Read papers → Find GO term → Annotate to protein (PubMed ID, evidence code). Other elements shown: GOA association file, Oracle RDBMS, GO and EBI FTP sites]
Protein2GO tool Online
Information captured by GOA
Fields: Source, GO ID, Term, Evidence, Ref DB, Ref ID, With DB, With ID, Qualifier
How successful is manual-GOA?
Source              No. of annotations   No. of distinct proteins
Proteome Inc.       22054                6568
UniProt             67910                13697
IntAct              22002                11013
MGI                 124919               29837
SGD                 21761                5076
FlyBase             52386                8775
RGD                 8036                 3369
HGNC                3699                 798
GeneDB              5502                 1384
TAIR/TIGR           3367                 1895
ZFIN                1012                 334
Roslin Institute    14                   6
AgBase              889                  173
Reactome            15                   12
WormBase            893                  443
TIGR                139                  79
Gramene             139                  2812
GDB                 165                  103
TOTAL MANUAL        336237               70728
July 2006; 111740 taxa
Electronic Annotation
• Large-scale assignment of GO terms to UniProtKB entries using existing information within database entries and manual mappings
• Annotations receive the IEA (Inferred from Electronic Annotation) evidence code
Curated mapping, e.g. EC:1.1.1.1 > GO:alcohol dehydrogenase activity ; GO:0004022
[Diagram: UniProt entry information (Keyword, HAMAP, InterPro, EC) → curated or electronic rule-based mappings → GO, giving high-quality electronic protein-to-GO associations]
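The mapping step on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the regular expression handles just the line form shown in the EC example above, the function names are invented here, and "P00000" is a placeholder accession rather than a real entry.

```python
import re

# Pattern for one mapping line of the form shown on the slide:
#   EC:1.1.1.1 > GO:alcohol dehydrogenase activity ; GO:0004022
MAPPING_LINE = re.compile(
    r"^(?P<external>\S+)\s*>\s*GO:(?P<term>.+?)\s*;\s*(?P<go_id>GO:\d{7})\s*$"
)


def parse_mapping_line(line):
    """Return (external_id, go_term_name, go_id), or None if the line does not match."""
    match = MAPPING_LINE.match(line.strip())
    if match is None:
        return None
    return match.group("external"), match.group("term"), match.group("go_id")


def annotate(protein_xrefs, mapping):
    """Transfer GO IDs to proteins via their external cross-references.

    Every association produced this way is tagged with the IEA evidence code.
    """
    annotations = []
    for protein, xrefs in protein_xrefs.items():
        for xref in xrefs:
            for go_id in mapping.get(xref, []):
                annotations.append((protein, go_id, "IEA"))
    return annotations


parsed = parse_mapping_line(
    "EC:1.1.1.1 > GO:alcohol dehydrogenase activity ; GO:0004022"
)
mapping = {parsed[0]: [parsed[2]]}
# "P00000" is a placeholder accession carrying the EC number above.
result = annotate({"P00000": ["EC:1.1.1.1"]}, mapping)
print(result)
```

Keeping the mapping separate from the proteins is what makes the approach scale: one curated line annotates every entry that carries the same cross-reference.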
www.uniprot.org/
http://www.geneontology.org/GO.indices.shtml
Mappings of external concepts to GO
InterPro2GO mapping
• InterPro is a resource that integrates protein signature databases, e.g. Pfam, PRINTS, PROSITE, ProDom, SMART, TIGRFAMs etc.
• It provides a means of classifying proteins into families and identifying domains.
• Each InterPro entry groups proteins belonging to the same family and potentially having the same function
InterPro2GO mapping
• Done manually, but using tools
• Look at InterPro and protein annotation
• For all Swiss-Prot proteins that are true matches to the entry:
– Get stats on DE lines, keywords, comments
– Check how conserved the common annotation is
– Find the appropriate GO term at the most specific level that applies to all proteins (not necessarily domains)
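The "check how conserved the common annotation is" step can be illustrated with a small Python sketch. It is not the curators' tool: the function names, the placeholder accessions and the keyword sets are all made up, and the final choice of GO term remains a manual decision.

```python
from collections import Counter


def keyword_conservation(proteins):
    """Return {keyword: fraction of proteins carrying it}."""
    counts = Counter()
    for keywords in proteins.values():
        counts.update(set(keywords))  # count each keyword once per protein
    total = len(proteins)
    return {kw: n / total for kw, n in counts.items()}


def conserved_keywords(proteins, threshold=1.0):
    """Keywords present in at least `threshold` of the proteins, sorted by name."""
    stats = keyword_conservation(proteins)
    return sorted(kw for kw, frac in stats.items() if frac >= threshold)


# Toy input: placeholder accessions with made-up keyword sets.
matches = {
    "PROT_A": ["Oxidoreductase", "NAD", "Zinc"],
    "PROT_B": ["Oxidoreductase", "NAD"],
    "PROT_C": ["Oxidoreductase", "Zinc", "Cytoplasm"],
}
print(conserved_keywords(matches))       # ['Oxidoreductase']
print(conserved_keywords(matches, 0.6))  # ['NAD', 'Oxidoreductase', 'Zinc']
```

Only an annotation shared by every true match supports a GO term for the whole entry, which is why the default threshold is 1.0.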
Tools used: "SQUID"
Statistics options: keyword, description, gene name, organism, comments, etc.
SQUID statistics output
InterPro2GO mapping in entry
InterProScan output with GO terms
InterPro2GO sanity checks
• Run weekly
• Reports:
– Obsolete GO terms
– Obsolete (deleted) IPRs
– Secondary IPRs
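A check of this kind could look roughly like the following Python sketch. The function name, the report layout and all identifiers in the example are hypothetical. The slide does not say how the weekly job is implemented, so this only shows the three report categories listed above.

```python
def sanity_check(mapping, obsolete_go, deleted_iprs, secondary_iprs):
    """Report stale entries in an InterPro-to-GO mapping.

    mapping: {ipr_accession: [go_id, ...]}
    The other three arguments are sets of identifiers known to be stale.
    """
    report = {"obsolete_go_terms": [], "deleted_iprs": [], "secondary_iprs": []}
    for ipr, go_ids in sorted(mapping.items()):
        if ipr in deleted_iprs:
            report["deleted_iprs"].append(ipr)
        if ipr in secondary_iprs:
            report["secondary_iprs"].append(ipr)
        for go_id in go_ids:
            if go_id in obsolete_go:
                report["obsolete_go_terms"].append((ipr, go_id))
    return report


# Placeholder identifiers, not real InterPro or GO accessions.
toy_mapping = {
    "IPR_A": ["GO_1", "GO_2"],
    "IPR_B": ["GO_3"],
    "IPR_C": ["GO_1"],
}
report = sanity_check(
    toy_mapping,
    obsolete_go={"GO_2"},
    deleted_iprs={"IPR_B"},
    secondary_iprs={"IPR_C"},
)
print(report)
```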
Quality of GO mapping
• BioCreAtIvE test set: 635 GO annotations through InterPro2GO (Camon et al., 2005, BMC Bioinformatics)
Category                      No.    %
Exact term                    151    24%
Same lineage < granularity    273    43%
Same lineage > granularity    24     4%
New lineage                   187    29%
Minimal correct               424    67%
Potentially incorrect         211    33%
Precision                     67-100%
• Manually checked 44 proteins, 107 predictions:
– 97 correct (90%): 40 exact, 57 same lineage
– 10 new lineage (unknown)
– 0 incorrect
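The precision range follows from the counts by simple arithmetic, worked through below in Python. The grouping of categories into "minimal correct" and "potentially incorrect" is inferred from the numbers adding up (151 + 273 = 424 and 24 + 187 = 211), and the dictionary keys are labels chosen here for illustration.

```python
# Counts from the BioCreAtIvE comparison (635 annotations in total).
counts = {
    "exact": 151,
    "same_lineage_lt_granularity": 273,
    "same_lineage_gt_granularity": 24,
    "new_lineage": 187,
}
total = sum(counts.values())

minimal_correct = counts["exact"] + counts["same_lineage_lt_granularity"]
potentially_incorrect = (
    counts["same_lineage_gt_granularity"] + counts["new_lineage"]
)

# Lower bound: only the "minimal correct" predictions count as right.
lower_bound = 100 * minimal_correct / total
# Upper bound: every "potentially incorrect" prediction turns out to be right.
upper_bound = 100 * (minimal_correct + potentially_incorrect) / total

print(total, minimal_correct, potentially_incorrect)   # 635 424 211
print(f"precision: {lower_bound:.0f}-{upper_bound:.0f}%")  # precision: 67-100%
```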
InterPro2GO mapping statistics
Total no. IPRs mapped to GO                        7126
% of IPRs mapped to at least 1 GO term             54%
No. IPRs mapped to molecular function              5741
No. IPRs mapped to biological process              5543
No. IPRs mapped to cellular component              3426
No. GO terms mapped                                2811
No. UniProt proteins mapped through InterPro2GO    2006489 (61%)
% UniProt covered by InterPro                      77.6%
How successful is IEA-GOA in general?
• Provides large coverage
• High quality
• However, these annotations often use high-level GO terms and provide little detail.
IEA method       No. of annotations   No. of distinct proteins
InterPro2GO      6281916              2006489
HAMAP2GO         199904               85814
SP Keyword2GO    3613883              1287830
EC2GO            207540               202657
TOTAL            10303243             2167001
Jun 2006. Manual annotations for comparison: 336237 annotations, 70728 distinct proteins
Total GO statistics
Total no. GO annotations                       10639480
% GO associations manual                       3.16%
% GO associations electronic                   96.84%
% GO associations InterPro2GO                  59%
Total no. proteins annotated to GO             2168717
% UniProt GO annotated in total                68.2%
% UniProt GO annotated manually                2.2%
% UniProt GO annotated electronically          66%
% UniProt GO annotated through InterPro2GO     61%
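The annotation-level percentages can be reproduced from the manual and electronic totals reported in the deck. The short Python check below is only a worked calculation; the variable and function names are chosen here for illustration.

```python
manual = 336237        # total manual annotations
electronic = 10303243  # total IEA annotations
interpro2go = 6281916  # IEA annotations made through InterPro2GO

total = manual + electronic


def pct(part, whole):
    """Percentage of `whole` represented by `part`."""
    return 100 * part / whole


print(total)                               # 10639480
print(round(pct(manual, total), 2))        # 3.16
print(round(pct(electronic, total), 2))    # 96.84
print(round(pct(interpro2go, total)))      # 59
```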
GOA data flow
Gene association files
Gene Association file format
http://www.geneontology.org/GO.annotation.shtml
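A gene association file is tab-delimited with one annotation per line, and lines starting with "!" are comments. The Python sketch below reads the 15 columns of the original format; the short column names are labels chosen here, and the example record uses placeholder values (accession, symbol, PubMed ID), not real data. Treat it as a reading aid and consult the format page above for the authoritative definition.

```python
# Short labels for the 15 tab-separated columns, in file order.
GAF_COLUMNS = [
    "db", "db_object_id", "db_object_symbol", "qualifier", "go_id",
    "db_reference", "evidence_code", "with_from", "aspect",
    "db_object_name", "db_object_synonym", "db_object_type",
    "taxon", "date", "assigned_by",
]


def parse_gaf(lines):
    """Yield one dict per annotation line, skipping blank and '!' comment lines."""
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("!"):
            continue
        fields = line.split("\t")
        if len(fields) < len(GAF_COLUMNS):
            raise ValueError(f"expected at least 15 columns, got {len(fields)}")
        # zip stops at the 15th column, so any extra trailing columns are ignored.
        yield dict(zip(GAF_COLUMNS, fields))


# Placeholder record for illustration; taxon 9913 is Bos taurus (cow).
example_line = "\t".join([
    "UniProt", "P00000", "EXAMPLE", "", "GO:0004022", "PMID:0000000",
    "IDA", "", "F", "Example protein", "", "protein",
    "taxon:9913", "20060701", "UniProt",
])
records = list(parse_gaf(["!example comment line", example_line]))
print(records[0]["go_id"], records[0]["evidence_code"], records[0]["aspect"])
```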
Example GOA cow file
Output from the GOA database
[Diagram: GOA output files. Labels shown: GOA Cow, New, Redundant, Non-redundant (based on IPI), GA slim for UniProt + GO slims]
Data also available in SRS, UniProt, QuickGO, MODs, Ensembl etc.
GA Files for Non-redundant species
• Non-redundant complete protein set for each proteome is identified (>25% GO coverage)
• Includes UniProt, IPI and MOD-specific IDs, e.g. mouse (MGI), rat (RGD), zebrafish (ZFIN) etc.
• Xref files available with identifiers from: UniProt, IPI, RefSeq, Ensembl, UniGene etc.
ftp://ftp.ebi.ac.uk/pub/databases/GO/goa
ftp://ftp.ebi.ac.uk/pub/databases/integr8
Uses of GOA data
• Access protein functional information
• Look at relationships between proteins, e.g. IntAct
• Connect biological information to gene expression data
• Determine functional composition of a proteome using GO slim
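The last use, summarising a proteome with a GO slim, amounts to counting proteins per high-level category. The Python sketch below assumes the mapping from specific GO terms to slim terms has already been computed from the ontology. The function name and all identifiers in the toy data are hypothetical.

```python
from collections import Counter


def slim_composition(protein_terms, term_to_slim):
    """Count proteins per GO slim category.

    protein_terms: {protein: [go_id, ...]}
    term_to_slim:  {go_id: slim_id}, assumed precomputed from the ontology
    """
    counts = Counter()
    for terms in protein_terms.values():
        slims = {term_to_slim[t] for t in terms if t in term_to_slim}
        counts.update(slims)  # each protein counts once per slim category
    return counts


# Toy data with placeholder identifiers.
term_to_slim = {"TERM_1": "SLIM_X", "TERM_2": "SLIM_X", "TERM_3": "SLIM_Y"}
proteome = {
    "PROT_A": ["TERM_1", "TERM_2"],  # two terms, same slim: counted once
    "PROT_B": ["TERM_3"],
    "PROT_C": ["TERM_1", "TERM_3"],
}
composition = slim_composition(proteome, term_to_slim)
print(composition["SLIM_X"], composition["SLIM_Y"])  # 2 2
```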
Find functional information on proteins
http://www.ebi.ac.uk/ego
Uses of GOA
Find functional information on interaction proteins (IntAct)
http://www.ebi.ac.uk/intact
Overview proteome with GO Slim
http://www.ebi.ac.uk/integr8
Analysis of high-throughput data according to GO
• Microarray data analysis → GO classification
• Proteomics data analysis → GO classification
Kislinger T et al, Mol Cell Proteomics, 2003
Larkin JE et al, Physiol Genomics, 2004
Cunliffe HE et al, Cancer Res, 2003
Future plans
• Continue deep level annotation of human, mouse and rat
• Manually annotate splice variants
• Outreach and inclusion of new datasets e.g. grape
• New electronic mappings, e.g. unipathway2go
• Ortholog prediction for electronic GO annotation
• Develop tools for annotation training
Acknowledgements
• Evelyn Camon, GOA Coordinator
• Daniel Barrell, GOA Programmer
• Emily Dimmer, GOA Curator
• Rachael Huntley, GOA Curator
• David Binns & John Maslen, QuickGO, GOA tools
• Rolf Apweiler, Head of sequence database group
• All EBI UniProtKB Curators, HAMAP (SIB), IntAct, GO Editorial Office @ EBI
• All GO Consortium & associate members