The Gene Ontology Annotation (GOA) Database and enhancement of GO annotations through InterPro2GO

Nicky Mulder [email protected]

European Bioinformatics Institute



Contents

• Introduction to GOA

• Manual GOA annotation

• Electronic annotation:

– InterPro2GO

• GOA data flow

• Uses of GOA

• Future plans


What is GO annotation?

An annotation is a statement that a gene product

• has a particular molecular function

• is involved in a particular biological process

• is located within a certain cellular component

(recorded as a GO term ID)

…as determined by a particular method (recorded as an evidence code)

…as described in a particular reference (recorded as a reference).
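The three parts of an annotation (GO term ID, evidence code, reference) can be pictured as one small record. The following is only a minimal sketch: the class name, field names and the example values (accession, PubMed ID) are assumptions made for illustration and are not taken from the GOA database schema.

```python
from dataclasses import dataclass

# The three kinds of statement listed on the slide.
ASPECTS = ("molecular_function", "biological_process", "cellular_component")


@dataclass(frozen=True)
class GOAnnotation:
    gene_product: str   # identifier of the annotated gene product
    go_id: str          # GO term ID: "GO:" followed by seven digits
    aspect: str         # one of ASPECTS
    evidence_code: str  # how the association was determined, e.g. "IDA" or "IEA"
    reference: str      # where it is described, e.g. a PubMed ID

    def __post_init__(self):
        digits = self.go_id[3:]
        if not (self.go_id.startswith("GO:") and len(digits) == 7 and digits.isdigit()):
            raise ValueError(f"malformed GO ID: {self.go_id!r}")
        if self.aspect not in ASPECTS:
            raise ValueError(f"unknown aspect: {self.aspect!r}")
        # Every annotation must carry its evidence and its source.
        if not self.evidence_code or not self.reference:
            raise ValueError("an annotation needs an evidence code and a reference")


# Hypothetical record: the accession and the PubMed ID are placeholders.
example = GOAnnotation("P12345", "GO:0004022", "molecular_function",
                       "IDA", "PMID:0000000")
print(example)
```

Rejecting a record without an evidence code or reference at construction time mirrors the rule that every annotation must state its evidence and be attributed to a source.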


Gene Ontology Annotation (GOA) Database

• GOA’s priority is to annotate the human, mouse and rat proteomes

• Largest open-source contributor of annotations to GO

• Provides 10 million annotations for more than 111,000 species

• Share and integrate GO annotation


How do we annotate GO terms?

Manual Annotation

Electronic Annotation

All annotations must:

• be attributed to a source

• indicate what evidence was found to support the GO term-gene/protein association


Manual annotation

• High quality

• Specific gene or gene product associations made using:

– Peer reviewed papers

– Evidence codes

• BUT:

– Time-consuming

– Requires trained biologists


Manual GO annotation

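Read papers → Find GO term → Annotate to protein (recording PubMed ID and evidence code)

Annotations are held in an Oracle RDBMS and released as the GOA association file

The file is distributed through the GO and EBI ftp sites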


Protein2GO tool Online


Information captured by GOA

Columns: Source, GO ID, Term, Evidence, Ref DB, Ref ID, With DB, With ID, Qualifier


How successful is manual-GOA?

Source              No. of annotations   No. of distinct proteins
Proteome Inc.       22054                6568
UniProt             67910                13697
IntAct              22002                11013
MGI                 124919               29837
SGD                 21761                5076
FlyBase             52386                8775
RGD                 8036                 3369
HGNC                3699                 798
GeneDB              5502                 1384
TAIR/TIGR           3367                 1895
ZFIN                1012                 334
Roslin Institute    14                   6
AgBase              889                  173
Reactome            15                   12
WormBase            893                  443
TIGR                139                  79
Gramene             139                  2812
GDB                 165                  103
TOTAL MANUAL        336237               70728

July 2006

111740 taxa


Electronic Annotation

• Large-scale assignment of GO terms to UniProtKB entries using existing information within database entries and manual mappings

• Annotations receive the IEA (Inferred from Electronic Annotation) evidence code

• Information used from UniProt entries: keywords, HAMAP, InterPro, EC numbers

• Each is linked to GO through curated or electronic rule-based mappings, e.g. the curated mapping EC:1.1.1.1 > GO:alcohol dehydrogenase activity ; GO:0004022

• Result: high-quality electronic protein-to-GO associations
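The way a curated mapping turns existing cross-references into IEA annotations can be sketched as a simple table lookup. This is an illustration under stated assumptions, not the GOA pipeline: the function name, the tuple layout and the protein identifiers are hypothetical, and the only mapping entry is the EC:1.1.1.1 example from the slide.

```python
# Curated mapping table: external identifier -> list of (GO ID, GO term name).
# The single entry is the example given on the slide.
EC2GO = {
    "EC:1.1.1.1": [("GO:0004022", "alcohol dehydrogenase activity")],
}


def electronic_annotations(protein_xrefs, mapping):
    """Return (protein, GO ID, evidence code, supporting xref) tuples.

    protein_xrefs maps a protein identifier to the external identifiers
    (EC numbers here) already present in its database entry.
    """
    annotations = []
    for protein, xrefs in protein_xrefs.items():
        for xref in xrefs:
            # Cross-references without a curated mapping yield nothing.
            for go_id, _term_name in mapping.get(xref, []):
                annotations.append((protein, go_id, "IEA", xref))
    return annotations


# Hypothetical entries: PROT_A carries a mapped EC number, PROT_B does not.
proteins = {"PROT_A": ["EC:1.1.1.1"], "PROT_B": ["EC:9.9.9.9"]}
annots = electronic_annotations(proteins, EC2GO)
print(annots)
```

The same lookup shape would apply to keyword, HAMAP or InterPro mappings; only the mapping table changes.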


www.uniprot.org/


Mappings of external concepts to GO

http://www.geneontology.org/GO.indices.shtml


InterPro2GO mapping

• InterPro is a resource that integrates protein signature databases, e.g. Pfam, PRINTS, PROSITE, ProDom, SMART, TIGRFAMs etc.

• It provides a means of classifying proteins into families and identifying domains.

• Each InterPro entry groups proteins belonging to the same family and potentially having the same function


InterPro2GO mapping

• Done manually, but using tools

• Look at InterPro and protein annotation

• For all Swiss-Prot proteins that truly match the entry:

– Get statistics on DE lines, keywords, comments

– Check how conserved the common annotation is

– Find the appropriate GO term at the most specific level that applies to all proteins (not necessarily domains)
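The last step above, choosing the most specific term that still applies to every matching protein, can be sketched as a set intersection over ancestor closures. The mapping itself is done manually by curators; this is only an illustration of the idea. The term graph is a toy with made-up labels, not the real GO structure, and all function and variable names are assumptions.

```python
# Toy is_a graph: term -> list of parent terms. Labels are invented.
PARENTS = {
    "function": [],
    "catalytic": ["function"],
    "binding": ["function"],
    "oxidoreductase": ["catalytic"],
    "alcohol_dh": ["oxidoreductase"],
    "aldehyde_dh": ["oxidoreductase"],
}


def ancestors(term):
    """Return the term together with all of its ancestors."""
    seen, stack = set(), [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(PARENTS[current])
    return seen


def most_specific_common(protein_terms):
    """Terms shared by every protein that have no shared descendant."""
    closures = []
    for terms in protein_terms.values():
        closure = set()
        for term in terms:
            closure |= ancestors(term)
        closures.append(closure)
    if not closures:
        return set()
    common = set.intersection(*closures)
    # Drop any term that is a proper ancestor of another common term.
    return {t for t in common
            if not any(t != u and t in ancestors(u) for u in common)}


# Hypothetical proteins matching one InterPro entry.
PROTEIN_TERMS = {
    "p1": ["alcohol_dh"],
    "p2": ["aldehyde_dh"],
    "p3": ["alcohol_dh", "binding"],
}
print(most_specific_common(PROTEIN_TERMS))
```

In this toy case the two dehydrogenase terms are each too specific to cover all three proteins, so the shared parent is the deepest term that can be mapped to the entry.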


Tools used: "SQUID"

Statistics options: keyword, description, gene name, organism, comments, etc.


SQUID statistics output



InterPro2GO mapping in entry


InterProScan output with GO terms


InterPro2GO sanity checks

• Run weekly

• Reports:

– Obsolete GO terms

– Obsolete (deleted) IPRs

– Secondary IPRs
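The three reports listed above amount to comparing the mapping against current lists of obsolete GO terms, deleted InterPro entries and secondary InterPro accessions. A minimal sketch follows, assuming the mapping is held as a dictionary. All identifiers are invented placeholders, and the function and report names are hypothetical rather than those of the actual weekly checks.

```python
def sanity_check(mapping, obsolete_go, deleted_iprs, secondary_iprs):
    """Report mapping entries that need curator attention.

    mapping        : InterPro accession -> list of GO IDs
    obsolete_go    : set of GO IDs that have been made obsolete
    deleted_iprs   : set of InterPro accessions that no longer exist
    secondary_iprs : secondary accession -> primary accession it merged into
    """
    report = {"obsolete_go": [], "deleted_ipr": [], "secondary_ipr": []}
    for ipr, go_ids in sorted(mapping.items()):
        if ipr in deleted_iprs:
            report["deleted_ipr"].append(ipr)
        if ipr in secondary_iprs:
            report["secondary_ipr"].append((ipr, secondary_iprs[ipr]))
        for go_id in go_ids:
            if go_id in obsolete_go:
                report["obsolete_go"].append((ipr, go_id))
    return report


# Placeholder data; none of these identifiers is real.
mapping = {
    "IPR900001": ["GO:9000001"],
    "IPR900002": ["GO:9000002", "GO:9000003"],
    "IPR900003": ["GO:9000001"],
}
report = sanity_check(
    mapping,
    obsolete_go={"GO:9000003"},
    deleted_iprs={"IPR900003"},
    secondary_iprs={"IPR900001": "IPR900002"},
)
print(report)
```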


Quality of GO mapping

• BioCreAtIvE test set: 635 GO annotations through InterPro2GO

Camon et al., 2005, BMC Bioinformatics

Category                      No.   %
Exact term                    151   24%
Same lineage < granularity    273   43%
Same lineage > granularity    24    4%
New lineage                   187   29%
Minimal correct               424   67%
Potentially incorrect         211   33%

Manually checked 44 proteins, 107 predictions:

• 97 correct (90%): 40 exact, 57 same lineage

• 10 new lineage (unknown)

• 0 incorrect

Precision 67-100%
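The percentages on this slide follow from the category counts. The short calculation below reproduces them as a check. The reading of the precision range is an interpretation rather than something the slide states outright: 67% is taken to be the "minimal correct" share, and 100% the value reached if none of the "potentially incorrect" predictions is actually wrong, as the manual check (0 incorrect) suggests. Variable names are illustrative.

```python
# Category counts from the BioCreAtIvE test set on the slide.
counts = {
    "exact": 151,
    "same_lineage_lt_granularity": 273,
    "same_lineage_gt_granularity": 24,
    "new_lineage": 187,
}

total = sum(counts.values())
minimal_correct = counts["exact"] + counts["same_lineage_lt_granularity"]
potentially_incorrect = (counts["same_lineage_gt_granularity"]
                         + counts["new_lineage"])

# Lower bound of the precision range: only "minimal correct" counts as right.
lower_bound = minimal_correct / total

# Manual check: 40 exact + 57 same lineage out of 107 predictions.
manual_correct = 40 + 57
manual_share = manual_correct / 107

print(f"total predictions: {total}")
print(f"minimal correct: {minimal_correct} ({lower_bound:.1%})")
print(f"potentially incorrect: {potentially_incorrect}")
print(f"manually confirmed: {manual_correct} of 107 ({manual_share:.1%})")
```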


InterPro2GO mapping statistics

Total no. IPRs mapped to GO: 7126

% of IPRs mapped to at least 1 GO term: 54%

No. IPRs mapped to molecular function: 5741

No. IPRs mapped to biological process: 5543

No. IPRs mapped to cellular component: 3426

No. GO terms mapped: 2811

No. UniProt proteins mapped through InterPro2GO: 2006489 (61%)

% UniProt covered by InterPro: 77.6%


How successful is IEA-GOA in general?

• Provides large coverage

• High quality

• However, these annotations often use high-level GO terms and provide little detail.

IEA Method          No. of annotations   No. of distinct proteins
InterPro2GO         6281916              2006489
HAMAP2GO            199904               85814
SP Keyword2GO       3613883              1287830
EC2GO               207540               202657
TOTAL               10303243             2167001

Jun 2006

Manual ones: 336237 annotations, 70728 distinct proteins


Total GO statistics

Total no. GO annotations: 10639480

% GO associations manual: 3.16%

% GO associations electronic: 96.84%

% GO associations InterPro2GO: 59%

Total no. proteins annotated to GO: 2168717

% UniProt GO annotated in total: 68.2%

% UniProt GO annotated manually: 2.2%

% UniProt GO annotated electronically: 66%

% UniProt GO annotated through InterPro2GO: 61%
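The annotation shares above can be recomputed from the manual and electronic totals given on the earlier slides. The snippet below is a plain arithmetic check with illustrative variable names. The protein-level percentages are not recomputed because the total number of UniProt proteins is not given in the slides.

```python
# Totals taken from the manual and IEA tables in this presentation.
manual_annotations = 336237
electronic_annotations = 10303243
interpro2go_annotations = 6281916

total_annotations = manual_annotations + electronic_annotations

pct_manual = 100 * manual_annotations / total_annotations
pct_electronic = 100 * electronic_annotations / total_annotations
pct_interpro2go = 100 * interpro2go_annotations / total_annotations

print(f"total annotations: {total_annotations}")
print(f"manual: {pct_manual:.2f}%")
print(f"electronic: {pct_electronic:.2f}%")
print(f"InterPro2GO: {pct_interpro2go:.0f}%")
```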


GOA data flow

Gene association files


Gene Association file format

http://www.geneontology.org/GO.annotation.shtml
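Gene association files are tab-separated text with one annotation per line and comment lines starting with "!". The sketch below reads such lines into dictionaries. It assumes the 15-column layout of the gene association format in use at the time (later versions add further columns), so the column list should be checked against the format documentation linked above. The key names are chosen here for illustration, and every value in the sample line is a placeholder.

```python
# Assumed 15-column layout; key names are illustrative, not official.
GAF_COLUMNS = [
    "db", "db_object_id", "db_object_symbol", "qualifier", "go_id",
    "db_reference", "evidence", "with_from", "aspect", "db_object_name",
    "synonym", "db_object_type", "taxon", "date", "assigned_by",
]


def parse_gaf(lines):
    """Yield one dict per annotation line, skipping comments and blanks."""
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("!"):
            continue
        fields = line.split("\t")
        if len(fields) < len(GAF_COLUMNS):
            raise ValueError(f"expected {len(GAF_COLUMNS)} columns, "
                             f"got {len(fields)}")
        # Extra trailing columns (newer format versions) are ignored.
        yield dict(zip(GAF_COLUMNS, fields))


# Placeholder annotation line; taxon:9913 is the NCBI taxon ID for cow.
sample = [
    "!comment line\n",
    "\t".join([
        "UniProt", "P00000", "EXAMPLE", "", "GO:0004022",
        "REF:placeholder", "IEA", "InterPro:IPR000000", "F",
        "Example protein", "", "protein", "taxon:9913", "20060701",
        "UniProt",
    ]) + "\n",
]
records = list(parse_gaf(sample))
print(records[0]["go_id"], records[0]["evidence"], records[0]["taxon"])
```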


Example GOA cow file


Output from the GOA database

GOACow

New

Redundant

Non-Redundant: based on IPI

Data also available in SRS, UniProt, QuickGO, MODs, Ensembl etc.

GA slim for UniProt + GO slims


GA Files for Non-redundant species

• Non-redundant complete protein set for each proteome is identified (>25% GO coverage)

• Includes UniProt, IPI and MOD-specific IDs, e.g. mouse (MGI), rat (RGD), zebrafish (ZFIN) etc.

• Xref files available with identifiers from: UniProt, IPI, RefSeq, Ensembl, UniGene etc.

ftp://ftp.ebi.ac.uk/pub/databases/GO/goa

ftp://ftp.ebi.ac.uk/pub/databases/integr8


Uses of GOA data

• Access protein functional information

• Look at relationships between proteins, e.g. IntAct

• Connect biological information to gene expression data

• Determine functional composition of a proteome using GO slim
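The GO slim idea in the last bullet can be sketched as follows: each protein's specific terms are mapped up to a small set of high-level slim terms, and proteins are counted per slim term. The term graph and slim set below are toys with invented labels, not real GO or a real GO slim, and the function names are assumptions.

```python
from collections import Counter

# Toy is_a graph: term -> list of parent terms. Labels are invented.
PARENTS = {
    "function": [],
    "catalytic": ["function"],
    "binding": ["function"],
    "oxidoreductase": ["catalytic"],
    "alcohol_dh": ["oxidoreductase"],
    "aldehyde_dh": ["oxidoreductase"],
    "dna_binding": ["binding"],
}

# Hypothetical slim: the high-level terms used to summarise the proteome.
SLIM = {"catalytic", "binding"}


def ancestors(term):
    """Return the term together with all of its ancestors."""
    seen, stack = set(), [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(PARENTS[current])
    return seen


def slim_composition(protein_terms):
    """Count proteins per slim term; each protein counts once per term."""
    counts = Counter()
    for terms in protein_terms.values():
        slim_hits = set()
        for term in terms:
            slim_hits |= ancestors(term) & SLIM
        counts.update(slim_hits)
    return counts


# Hypothetical proteome with three annotated proteins.
proteome = {
    "p1": ["alcohol_dh"],
    "p2": ["alcohol_dh", "dna_binding"],
    "p3": ["aldehyde_dh"],
}
composition = slim_composition(proteome)
print(dict(composition))
```

Counting each protein once per slim term keeps a protein with several specific terms under the same slim category from inflating that category.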


Uses of GOA

Find functional information on proteins

http://www.ebi.ac.uk/ego


Uses of GOA

Find functional information on interacting proteins (IntAct)

http://www.ebi.ac.uk/intact


Uses of GOA

Overview of a proteome with GO Slim

http://www.ebi.ac.uk/integr8


Uses of GOA

Analysis of high-throughput data according to GO

• Microarray data analysis: GO classification

• Proteomics data analysis: GO classification

Kislinger T et al, Mol Cell Proteomics, 2003

Larkin JE et al, Physiol Genomics, 2004

Cunliffe HE et al, Cancer Res, 2003


Future plans

• Continue deep level annotation of human, mouse and rat

• Manually annotate splice variants

• Outreach and inclusion of new datasets e.g. grape

• New electronic mappings, e.g. unipathway2go

• Ortholog prediction for electronic GO annotation

• Develop tools for annotation training


Acknowledgements

Evelyn Camon, GOA Coordinator

Daniel Barrell, GOA Programmer

Emily Dimmer, GOA Curator

Rachael Huntley, GOA Curator

David Binns & John Maslen, QuickGO, GOA tools

Rolf Apweiler, Head of sequence database group

All EBI UniProtKB Curators, HAMAP (SIB), IntAct, GO Editorial Office @ EBI

All GO Consortium & associate members