Ontologies for Informatics. Infrastructure for Systems Biology. Oxford, October 19 2004



To provide structured controlled vocabularies for the representation of biological knowledge in biological databases.

Manifesto of Liberation Bioinformatics

• Be open source
• Use open standards
• Make data & code available without constraint
• Involve your community

Gene Ontology - 1998

• FlyBase - Drosophila - Cambridge, EBI, Harvard, Berkeley & Bloomington
• SGD - Saccharomyces - Stanford
• MGI - Mus - Jackson Labs., Bar Harbor

Gene Ontology - 2004

• Fruitfly - FlyBase
• Budding yeast - Saccharomyces Genome Database (SGD)
• Mouse - Mouse Genome Database (MGD & GXD)
• Rat - Rat Genome Database (RGD)
• Weed - The Arabidopsis Information Resource (TAIR)
• Worm - WormBase
• Dictyostelium discoideum - Dictybase
• InterPro/UniProt at EBI - InterPro
• Fission yeast - Pombase
• Human - UniProt, Ensembl, NCBI, Incyte, Celera, Compugen
• Parasites - Plasmodium, Trypanosoma, Leishmania - GeneDB - Sanger
• Microbes - Vibrio, Shewanella, B. anthracis, … - TIGR
• Grasses - rice & maize - Gramene database
• Zebrafish - ZFIN
• Coming: Xenopus, Chlamydomonas, Tetrahymena, Gallus & more.

GO: Three (Orthogonal) Ontologies

Biological Process
• Goal or objective within cell, tissue ..

Molecular Function
• Elemental activity or task

Cellular Component
• Location or complex

Content of GO

• molecular function 7422 terms
• biological process 8972 terms
• cellular component 1472 terms
• all 17,866 terms

definitions 16,600 (93%)

What data structure to use?

What is the least complex data structure that is sufficient?

• Key word list?
• Hierarchical tree?
• Directed acyclic graph?
• Other?

Directed Acyclic Graph

[Figure: a tree compared with a directed acyclic graph]

Classes of parent-child relationship

ISA (hypernymy/hyponymy)
• as in: an elephant is a mammal

PARTOF (meronymy/holonymy)
• as in: a trunk is part of an elephant

REGULATES
• as in: regulation of carbohydrate metabolism regulates carbohydrate metabolism

Cellular component
  %membrane
    %vacuolar membrane
    %nuclear membrane
  %intracellular
  %cell
    <cytoplasm
      <vacuole
        <vacuolar membrane
        <vacuolar lumen
    <nucleus
      <nuclear membrane

[Figure: the cellular component terms drawn as a graph. Nodes: cell, membrane, intracellular, cytoplasm, nucleus, vacuole, vacuolar membrane, vacuolar lumen, nuclear membrane. Key: ISA (%), PARTOF (<)]

Structure of the Ontologies
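The point of the graph structure can be sketched in a few lines of Python. The edges below are one reading of the flattened listing on the slide and should be treated as an assumption; the names `EDGES`, `build_parents` and `ancestors` are hypothetical and are not part of any GO tool. The sketch shows that a term such as vacuolar membrane has two parents (one ISA, one PARTOF), which a strict tree cannot express.

```python
from collections import defaultdict

# (child, relationship, parent) triples. Assumed reading of the slide's
# listing: "%" is ISA and "<" is PARTOF.
EDGES = [
    ("vacuolar membrane", "is_a", "membrane"),
    ("nuclear membrane", "is_a", "membrane"),
    ("cytoplasm", "part_of", "cell"),
    ("nucleus", "part_of", "cell"),
    ("vacuole", "part_of", "cytoplasm"),
    ("vacuolar membrane", "part_of", "vacuole"),
    ("vacuolar lumen", "part_of", "vacuole"),
    ("nuclear membrane", "part_of", "nucleus"),
]


def build_parents(edges):
    """Map each child term to its list of (relationship, parent) pairs."""
    parents = defaultdict(list)
    for child, rel, parent in edges:
        parents[child].append((rel, parent))
    return parents


def ancestors(term, parents):
    """Collect every term reachable by following parent links upwards."""
    seen = set()
    stack = [term]
    while stack:
        node = stack.pop()
        for _rel, parent in parents.get(node, []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


PARENTS = build_parents(EDGES)
print(PARENTS["vacuolar membrane"])  # two parents, so this is not a tree
print(sorted(ancestors("vacuolar membrane", PARENTS)))
```

An annotation to vacuolar membrane is therefore found by a query on membrane as well as by a query on vacuole, cytoplasm or cell.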

GO terms are defined & have unique IDs

term: chloroplast
go_id: GO:0009507
definition: A chlorophyll-containing plastid with thylakoids organized into grana and frets, or stroma thylakoids, and embedded in a stroma.
definition_reference: ISBN:0471245208

term: ketone catabolism
go_id: GO:0042182
definition: The breakdown into simpler components of ketones, a class of organic compounds that contain the carbonyl group, CO, and in which the carbonyl group is bonded only to carbon atoms. The general formula for a ketone is RCOR', where R and R' are alkyl or aryl groups.
definition_reference: GO:curators

Annotation of GO terms to gene products

• literature curation:
  • Inferred from Mutant Phenotype
  • Inferred from Direct Assay
  • Inferred from Genetic Interaction
  • Inferred from Physical Interaction
  • Inferred from Expression Pattern
  • Traceable Author Statement
  • Non-traceable Author Statement
• “homologies”:
  • Inferred from Sequence Similarity
• computed annotation:
  • Inferred from Electronic Annotation
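The evidence types above are usually written as three-letter codes in annotation files (IDA, NAS, TAS, ISS and IMP all appear in the gene association rows in this deck). A minimal lookup table, assuming the standard GO abbreviations and using the slide's three groupings as category labels, might look like this. The names `EVIDENCE_CODES` and `codes_in` are hypothetical.

```python
# code -> (full name from the slide, category from the slide)
EVIDENCE_CODES = {
    "IMP": ("Inferred from Mutant Phenotype", "literature curation"),
    "IDA": ("Inferred from Direct Assay", "literature curation"),
    "IGI": ("Inferred from Genetic Interaction", "literature curation"),
    "IPI": ("Inferred from Physical Interaction", "literature curation"),
    "IEP": ("Inferred from Expression Pattern", "literature curation"),
    "TAS": ("Traceable Author Statement", "literature curation"),
    "NAS": ("Non-traceable Author Statement", "literature curation"),
    "ISS": ("Inferred from Sequence Similarity", "homologies"),
    "IEA": ("Inferred from Electronic Annotation", "computed annotation"),
}


def codes_in(category):
    """Return the evidence codes belonging to one of the slide's categories."""
    return sorted(code for code, (_name, cat) in EVIDENCE_CODES.items()
                  if cat == category)


print(codes_in("literature curation"))
print(EVIDENCE_CODES["ISS"][0])
```

A table like this makes it easy to filter annotations, for example to set aside computed (IEA) annotations when only curated evidence is wanted.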

GO Gene Association Tables

• Herpes viruses
• Vibrio cholerae, B. anthracis, Coxiella burnetii, Pseudomonas syringae, Shewanella oneidensis …
• Dictyostelium discoideum
• Saccharomyces cerevisiae, Schizosaccharomyces pombe
• Trypanosoma brucei, Leishmania major, Plasmodium falciparum
• Caenorhabditis elegans
• Drosophila melanogaster, Glossina morsitans
• Danio rerio
• Mus “domesticus”, Rattus norvegicus, Homo sapiens bioinformaticus
• Arabidopsis thaliana, Oryza sativa

FB FBgn0015567 &agr;-Adaptin GO:0005886 FB:FBrf0093110|PMID:9118220 IDA C
FB FBgn0015567 &agr;-Adaptin GO:0007269 FB:FBrf0108281|PMID:10218159 NAS P
FB FBgn0015567 &agr;-Adaptin GO:0016192 FB:FBrf0124164 NAS P
FB FBgn0015567 &agr;-Adaptin GO:0030122 FB:FBrf0115359 NAS C
FB FBgn0015567 &agr;-Adaptin GO:0030122 FB:FBrf0124164 NAS C
FB FBgn0015567 &agr;-Adaptin GO:0006901 FB:FBrf0108281|PMID:10218159 TAS P
FB FBgn0015567 &agr;-Adaptin GO:0008021 FB:FBrf0108281|PMID:10218159 TAS C
FB FBgn0015567 &agr;-Adaptin GO:0016181 FB:FBrf0141528|PMID:11697879 TAS P
FB FBgn0015567 &agr;-Adaptin GO:0016183 FB:FBrf0108281|PMID:10218159 TAS P
FB FBgn0015567 &agr;-Adaptin GO:0030135 FB:FBrf0108281|PMID:10218159 TAS C
FB FBgn0010215 &agr;-Cat GO:0003779 FB:FBrf0132100 ISS F
FB FBgn0010215 &agr;-Cat GO:0007016 FB:FBrf0129868|PMID:10908592 ISS P
FB FBgn0010215 &agr;-Cat GO:0008092 FB:FBrf0132100 ISS F
FB FBgn0010215 &agr;-Cat GO:0016342 FB:FBrf0129868|PMID:10908592 ISS C
FB FBgn0010215 &agr;-Cat GO:0016343 FB:FBrf0129868|PMID:10908592 ISS F
FB FBgn0010215 &agr;-Cat GO:0005912 FB:FBrf0151280|PMID:12147138 NAS C

SGD S0004660 AAC1 GO:0005743 SGD_REF:12031|PMID:2167309 TAS C
SGD S0004660 AAC1 GO:0006854 SGD_REF:12031|PMID:2167309 IDA P
SGD S0004660 AAC1 GO:0005471 SGD_REF:12031|PMID:2167309 IDA F
SGD S0000289 AAC3 GO:0005743 SGD_REF:13606|PMID:1915842 TAS C
SGD S0000289 AAC3 GO:0006854 SGD_REF:13606|PMID:1915842 IMP P

SGD S0000289 AAC3 GO:0005471 SGD_REF:13606|PMID:1915842 IMP F ADP/ATP translocator YBR085W|ANC3 gene taxid:4932 20010213 SGD

go/gene_associations/
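Rows like those above can be read with a few lines of Python. This is a sketch for the abbreviated, whitespace-separated rows as they appear on the slide (database, object ID, symbol, GO ID, references, evidence code, aspect). The distributed gene association files are tab-delimited and carry additional columns, so this is not a parser for the real files. The names `Association`, `ASPECTS` and `parse_row` are hypothetical.

```python
from typing import NamedTuple, Tuple


class Association(NamedTuple):
    db: str
    object_id: str
    symbol: str
    go_id: str
    references: Tuple[str, ...]
    evidence: str
    aspect: str


# One-letter aspect codes used in the last column of the rows on the slide.
ASPECTS = {
    "P": "biological process",
    "F": "molecular function",
    "C": "cellular component",
}


def parse_row(row):
    """Parse one abbreviated association row as shown on the slide."""
    fields = row.split()
    db, object_id, symbol, go_id, refs, evidence, aspect = fields[:7]
    # Several references for one annotation are joined with "|".
    return Association(db, object_id, symbol, go_id,
                       tuple(refs.split("|")), evidence, aspect)


row = "SGD S0004660 AAC1 GO:0005743 SGD_REF:12031|PMID:2167309 TAS C"
assoc = parse_row(row)
print(assoc.symbol, assoc.go_id, assoc.evidence, ASPECTS[assoc.aspect])
print(assoc.references)
```

Splitting on whitespace only works here because none of the seven fields on the slide contains a space. A gene name such as "ADP/ATP translocator" in the longer row would need the tab-delimited form.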

Curated GO Annotations

                1.12.2001   1.12.2003
Gene products       42421      253962
GO terms             4262        7741

Expression studies:
• Human ontogenic tumor gene expression
• Human breast cancer gene expression
• Human endothelial cell gene expression
• Human fibrosarcoma cell cDNAs
• Human osteoblast progenitor cell gene expression
• Human fibrosarcoma cell gene expression
• Mouse cDNAs - FANTOM/FANTOM2 Projects
• Mouse lung gene expression
• Mouse dendritic cell gene expression
• Mouse hepatic and hippocampal gene expression
• Mouse liver tumor gene expression
• Drosophila gene expression during aging
• Drosophila embryo gene expression
• Affymetrix Probe Sets

Protein annotation:
• Vertebrate nuclear proteins
• Human GPCR proteins
• Mouse proteome
• PANTHER protein families

EST collections:
• Cattle ESTs, Pig ESTs, Dog ESTs
• Paracoccidioides brasiliensis ESTs
• Plasmodium falciparum ESTs
• Honey bee ESTs
• Schizophyllum commune ESTs
• Meloidogyne incognita ESTs
• Plasmodium vivax ESTs
• Amblyomma variegatum ESTs

Genomic annotation:
• Drosophila melanogaster genome
• Caenorhabditis briggsae genome
• Anopheles gambiae genome
• Schizosaccharomyces pombe genome
• Plasmodium yoelii genome
• Plasmodium falciparum genome
• Dictyostelium genome
• Rice genome
• Plant alternatively spliced genes
• Human pseudogenes

http://www.geneontology.org/GO.biblio.html

SGD: Dwight et al. 2002

Database annotations

Meloidogyne incognita: McCarter et al. 2003

Annotation summaries

The combinatorial nightmare

Combinatoric explosion

Process, Body part, Regulation (Negative or Positive)

2 * 1 * (# of processes - 1)

Induction

2 * 2 * (# of processes - 2)

2 * 2 * (# of processes - 2) * (# of body parts)
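To get a feel for the sizes involved, the three expressions above can be evaluated directly. The slide does not say exactly which class of terms each expression counts, so the function names below simply follow the order in which the expressions appear. The process count is the 8972 biological process terms quoted earlier in this deck; the body part count of 100 is a made-up figure for illustration only.

```python
def first_expression(n_processes):
    """2 * 1 * (# of processes - 1), as written on the slide."""
    return 2 * 1 * (n_processes - 1)


def second_expression(n_processes):
    """2 * 2 * (# of processes - 2), as written on the slide."""
    return 2 * 2 * (n_processes - 2)


def third_expression(n_processes, n_body_parts):
    """2 * 2 * (# of processes - 2) * (# of body parts), as written on the slide."""
    return 2 * 2 * (n_processes - 2) * n_body_parts


N_PROCESSES = 8972   # biological process terms, from the "Content of GO" slide
N_BODY_PARTS = 100   # hypothetical figure, not from the slide

print(first_expression(N_PROCESSES))                 # 17942
print(second_expression(N_PROCESSES))                # 35880
print(third_expression(N_PROCESSES, N_BODY_PARTS))   # 3588000
```

Even with a modest number of body parts, the last expression runs into the millions, which is the explosion the slide title refers to.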

OBOL - Open Biological Ontologies Language

Chris Mungall

The OBOL System

• Approach: annotation-time term composition vs tools for maintenance of large directed acyclic graphs
• Requires new generalization hierarchies
• Term decomposition using grammars
• Generating computable logical definitions
• Using logical definitions – term creation and error checking

A Formal Grammar for OBO terms

All GO terms are NOUN-PHRASES. A NOUN-PHRASE is (recursively) made from:

• a NOUN (includes inflected verbs; eg binding)
• an ADJECTIVE followed by a NOUN-PHRASE
• a NOUN-PHRASE preceded by a NOUN-PHRASE acting as ADJECTIVE; eg clathrin coat
• a NOUN-PHRASE then PREPOSITION then NOUN-PHRASE; eg regulation of transcription
• an (optional) NOUN-PHRASE then a RELATIONAL ADJECTIVE then a NOUN-PHRASE; eg clathrin-coated vesicle

Precedence rules are also required to prune the parse forest.
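The need for precedence rules can be illustrated with a toy recognizer that counts how many ways a term can be read as a NOUN-PHRASE under the five rules listed. This is a sketch in Python and not the OBOL implementation. The hand-tagged `LEXICON` and the function `count_parses` are hypothetical, and treating "clathrin-coated" as a single RELATIONAL ADJECTIVE token is a simplifying assumption.

```python
from functools import lru_cache

# Tiny hand-tagged lexicon (assumption, for illustration only).
LEXICON = {
    "regulation": "NOUN", "transcription": "NOUN", "binding": "NOUN",
    "clathrin": "NOUN", "coat": "NOUN", "vesicle": "NOUN",
    "negative": "ADJ", "of": "PREP", "clathrin-coated": "RELADJ",
}


def count_parses(term):
    """Count the distinct NOUN-PHRASE parses of a space-separated term."""
    tags = tuple(LEXICON[word] for word in term.split())

    @lru_cache(maxsize=None)
    def np(i, j):
        """Number of ways tags[i:j] can be read as a NOUN-PHRASE."""
        n = 0
        if j - i == 1 and tags[i] == "NOUN":
            n += 1                                # NP -> NOUN
        if j - i >= 2 and tags[i] == "ADJ":
            n += np(i + 1, j)                     # NP -> ADJ NP
        if j - i >= 2 and tags[i] == "RELADJ":
            n += np(i + 1, j)                     # NP -> RELADJ NP (no leading NP)
        for k in range(i + 1, j):
            n += np(i, k) * np(k, j)              # NP -> NP NP
            if k + 1 < j and tags[k] == "PREP":
                n += np(i, k) * np(k + 1, j)      # NP -> NP PREP NP
            if k + 1 < j and tags[k] == "RELADJ":
                n += np(i, k) * np(k + 1, j)      # NP -> NP RELADJ NP
        return n

    return np(0, len(tags))


for term in ("clathrin coat", "regulation of transcription",
             "clathrin-coated vesicle", "negative regulation of transcription"):
    print(term, "->", count_parses(term))
```

The first three terms have a single parse. "negative regulation of transcription" has two, since "negative" can attach either to "regulation" alone or to "regulation of transcription" as a whole; a precedence rule is what would select one of them.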

Gene Ontology Software

• Browsers - AmiGO
• Database - MySQL
• Editor - DAG-Edit

geneontology.sourceforge.net

• Third party software (e.g. Spotfire; TreeMap; GoFish; FatiGO)

OBO-Edit - a powerful editor for directed acyclic graphs.

• data adaptors
• multiple edits on same graph
• define your own relationship types
• plug in architecture - e.g. add an external in-line dictionary

The importance of community feedback

Everyone can suggest new terms for GO and tell us what errors we have made.

geneontology.sourceforge.net