A Sequence Ontology Suzanna Lewis Berkeley Drosophila Genome Project http://www.geneontology.org/doc/gobo.html

Post on 31-Dec-2015

Page 1: A Sequence Ontology Suzanna Lewis Berkeley Drosophila Genome Project

A Sequence Ontology

Suzanna Lewis Berkeley Drosophila Genome Project

http://www.geneontology.org/doc/gobo.html

Page 2: A Sequence Ontology Suzanna Lewis Berkeley Drosophila Genome Project

Semantic Interpretation: What is communication?

• An information transmission from a source to a receiver by means of encoding-decoding processes (including language).

• But what is meant, what is said, what is heard, and what is understood are not always the same thing.

• This has a simple consequence: it is only possible to communicate to the extent that we share rules of usage and have reciprocal understanding of the meaning.

Page 3: A Sequence Ontology Suzanna Lewis Berkeley Drosophila Genome Project

Working towards a shared language for the description of sequence.

Hey, know what I figured out? The meaning of words isn’t a fixed thing! Any word can mean anything! By giving words new meanings, ordinary English can become an exclusionary code! Two researchers can be divided by the same language! To that end, we’re inventing new definitions for common words, so we’ll be unable to communicate. Don’t you think that is totally excellent?

Page 4: A Sequence Ontology Suzanna Lewis Berkeley Drosophila Genome Project

How to best describe biology?

• natural language

  • highly expressive

  • ambiguous

  • hard to compute on

• why would I want to compute on it?

  • database searching

  • data mining

  • knowledge transfer

Page 5: A Sequence Ontology Suzanna Lewis Berkeley Drosophila Genome Project

The aims of SO

1. Develop a shared set of terms and concepts to annotate biological sequences.

2. Apply these in our separate projects to provide consistent query capabilities between them.

3. Provide a software resource to assist in the application and distribution of SO.

4. Meet the GOBO criteria.

Page 6: A Sequence Ontology Suzanna Lewis Berkeley Drosophila Genome Project

SO: Phase I

• To provide a structured controlled vocabulary for the description of primary annotations of nucleic acid sequence

• Useful for the annotations shared by a DAS server.

• To provide a structured representation of these annotations within genomic databases.

• Making it possible to query for, say, all genes whose transcripts are edited, or trans-spliced, or bound by a particular protein.

Page 7: A Sequence Ontology Suzanna Lewis Berkeley Drosophila Genome Project

Simple GFF: What is a transcript?

Source    Type  Group
Sanger    exon  Sequence Em:AP000546.C22.2.mRNA
EBI       exon  dJ68O2.C22.1.mRNA
WUSTL     CDS   gene_id "001"; transcript_id "001.1";
Gadfly    exon  genegrp=CG18090; transgrp=CG18090-RA;
Wormbase  exon  Sequence "C27C7.7"

Page 8: A Sequence Ontology Suzanna Lewis Berkeley Drosophila Genome Project

What is a pseudogene?

• Human

  • Sequence similar to a known protein but containing frameshift(s) and/or stop codons that disrupt the ORF.

• Neisseria

  • A gene that is inactive but may be activated by translocation (e.g. by gene conversion) to a new chromosome site.

  • Note: such a gene would be called a “cassette” in yeast.

Page 9: A Sequence Ontology Suzanna Lewis Berkeley Drosophila Genome Project

SO so far

• 1280 terms

• Top levels

  • Structural variation

  • Locatable features

  • Other sequence attributes

Page 10: A Sequence Ontology Suzanna Lewis Berkeley Drosophila Genome Project

Approach

• Determine the top level orthogonal categories

  • Domain, site, sequence type, location

• Specify the specializations

  • homeo domain, phosphorylation site, DNA/RNA/AA

• Define inter-relationships between orthogonal categories

  • ison, defines

Page 11: A Sequence Ontology Suzanna Lewis Berkeley Drosophila Genome Project

[Figure: ontology diagram. Node labels: sequence, nucleic acid sequence, DNA, RNA, transcript, primary transcript, processed transcript, sequence region, nucleic acid sequence region, DNA region, RNA region, gene region, transcript region, exon. Edge labels: ison, defines.]

Page 12: A Sequence Ontology Suzanna Lewis Berkeley Drosophila Genome Project

GFF After

Source    Type  Group
Sanger    exon  transcript "Em:AP000546.C22.2"
EBI       exon  transcript "dJ68O2.C22.1"
WUSTL     CDS   gene "001"; transcript "001.1";
Gadfly    exon  gene "CG18090"; transcript "CG18090-RA";
Wormbase  exon  transcript "C27C7.7"
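The step from the source-specific Group column to the shared "gene" and "transcript" vocabulary can be sketched in Python. This is only an illustration: the function name `normalize_group` and the parsing rules are assumptions inferred from the five example rows on the slides, not part of GFF or of SO.

```python
import re


def normalize_group(group: str) -> dict:
    """Map a source-specific Group string to {'gene': ..., 'transcript': ...}.

    Hypothetical rules, inferred only from the example rows on the slides.
    """
    result = {}
    # Style 1: semicolon-terminated pairs, e.g.  key "value";  or  key=value;
    for key, value in re.findall(r'(\w+)\s*[= ]\s*"?([^";]+)"?\s*;', group):
        if key.startswith("gene"):
            result["gene"] = value
        elif key.startswith("trans"):
            result["transcript"] = value
    if result:
        return result
    # Style 2: a bare transcript name, optionally prefixed by "Sequence",
    # optionally quoted, optionally suffixed by ".mRNA".
    name = re.sub(r"^Sequence\s+", "", group.strip()).strip('"')
    name = re.sub(r"\.mRNA$", "", name)
    return {"transcript": name}


if __name__ == "__main__":
    for raw in ("Sequence Em:AP000546.C22.2.mRNA",
                "genegrp=CG18090; transgrp=CG18090-RA;",
                'Sequence "C27C7.7"'):
        print(raw, "->", normalize_group(raw))
```

Once every source is mapped to the same two keys, the question "what is a transcript?" has one answer regardless of which group produced the annotation.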

Page 13: A Sequence Ontology Suzanna Lewis Berkeley Drosophila Genome Project

SO long(term)

1. Formalize the current phrase-based ontology to a description logic.

2. Provide DAML+OIL/OWL representations.

3. Add declarative rules and constraints to ensure consistency of annotations and aid annotation.

4. Extend the ontology so that it can be used as a full sequence knowledge base.

Page 14: A Sequence Ontology Suzanna Lewis Berkeley Drosophila Genome Project

Description logics will make the ontology easier to maintain

• For example, it will enable cross-products within the ontology.

• Now: "tRNA alanyl", "tRNA coding gene alanyl", "tRNA primary transcript alanyl".

• tRNA class has a ‘slot’ for “amino-acid” and a slot for anticodon.

• ‘restrictions’ effectively say “any instance of class tRNA that has the amino-acid slot value of alanine is of the class ‘tRNA alanyl’”.

• ‘checks’ for inconsistency between anticodon, amino-acid and class.
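The slot, restriction and check ideas can be illustrated outside a description logic with a small Python sketch. Everything named here is an assumption made for illustration: the `TRNA` class, the `classify` and `check` functions, the class name "tRNA glycyl", and the two-entry anticodon table (anticodons written 5' to 3', taken from the standard genetic code). A real description-logic reasoner would derive the class from the restriction rather than from a lookup table.

```python
from dataclasses import dataclass
from typing import List

# Restriction table: amino-acid slot value -> derived class (illustrative subset).
AMINO_ACID_TO_CLASS = {"alanine": "tRNA alanyl", "glycine": "tRNA glycyl"}

# Anticodon (5'->3') -> amino acid; AGC reads codon GCU (Ala), GCC reads GGC (Gly).
ANTICODON_TO_AMINO_ACID = {"AGC": "alanine", "GCC": "glycine"}


@dataclass(frozen=True)
class TRNA:
    amino_acid: str  # the "amino-acid" slot
    anticodon: str   # the "anticodon" slot


def classify(trna: TRNA) -> str:
    """Apply the restriction: the amino-acid slot value determines the class."""
    return AMINO_ACID_TO_CLASS[trna.amino_acid]


def check(trna: TRNA, asserted_class: str) -> List[str]:
    """Report inconsistencies between anticodon, amino-acid and asserted class."""
    problems = []
    if classify(trna) != asserted_class:
        problems.append(f"class {asserted_class!r} does not match amino acid {trna.amino_acid!r}")
    expected = ANTICODON_TO_AMINO_ACID.get(trna.anticodon)
    if expected is not None and expected != trna.amino_acid:
        problems.append(f"anticodon {trna.anticodon} implies {expected}, not {trna.amino_acid}")
    return problems


if __name__ == "__main__":
    print(classify(TRNA("alanine", "AGC")))
    print(check(TRNA("alanine", "GCC"), "tRNA alanyl"))
```

With slots and restrictions in place, cross-product terms such as "tRNA alanyl" no longer need to be maintained by hand for every amino acid.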

Page 15: A Sequence Ontology Suzanna Lewis Berkeley Drosophila Genome Project

Computable definitions

• Human-readable text definitions are always desirable. But lengthy text definitions will always be open to interpretation.

• …besides, much of the data will be provided by programmers, and programmers never read the instructions.

• If programmers write their own code for assigning these, this opens the possibility of inconsistent interpretations of the concept.

• Computable definitions/constraints are essential wherever possible to provide a set of declarative rules for checking and inference.

Page 16: A Sequence Ontology Suzanna Lewis Berkeley Drosophila Genome Project

A SO Knowledge Base?

• SO could eventually be used not just as a way of categorizing sequence features, but as the data model for storing sequence and sequence feature data.

• Accomplish this by adding a few slots to the top level feature class, for instance for start and end coordinates.

• One could then have an entire sequence database in DAML+OIL/OWL format.

Page 17: A Sequence Ontology Suzanna Lewis Berkeley Drosophila Genome Project

Declarative representation for spatial definitions

• Rules involving mathematical constructs cannot usually be expressed in a Description Logic.

• There needs to be a declarative representation of these rules, because enforcing the rules using a program written in an imperative language is difficult to sustain.

• Declarative languages specify *what* is to be done, rather than *how* it should be done.

Page 18: A Sequence Ontology Suzanna Lewis Berkeley Drosophila Genome Project

Give me 500 bases upstream of all 5’ exons.

• Define 5’ exon as being the first exon on the five prime end of a transcribed region. It would be very tedious for a curator to have to specifically annotate exons as being “5’ exon” as opposed to the more general “exon”. There is no need for them to do this, as this is computable from rules.
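A minimal Python sketch of how the 5’ exon, and the 500 bases upstream of it, could be computed rather than curated. The `Exon` and `Transcript` classes and both functions are hypothetical; the sketch assumes 1-based inclusive coordinates (as in GFF) and does not know the sequence length, so minus-strand regions are not clipped at the far end.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Exon:
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive


@dataclass(frozen=True)
class Transcript:
    strand: str               # "+" or "-"
    exons: Tuple[Exon, ...]   # annotated only as generic exons


def five_prime_exon(t: Transcript) -> Exon:
    """Rule: the 5' exon is the first exon at the five prime end."""
    if t.strand == "+":
        return min(t.exons, key=lambda e: e.start)
    return max(t.exons, key=lambda e: e.end)


def upstream_region(t: Transcript, n: int = 500) -> Optional[Tuple[int, int]]:
    """Return (start, end) of up to n bases upstream of the 5' exon."""
    exon = five_prime_exon(t)
    if t.strand == "+":
        if exon.start == 1:
            return None  # nothing lies upstream of position 1
        return (max(1, exon.start - n), exon.start - 1)
    return (exon.end + 1, exon.end + n)


if __name__ == "__main__":
    plus = Transcript("+", (Exon(1500, 1600), Exon(1000, 1200)))
    print(five_prime_exon(plus), upstream_region(plus))
```

The curator only ever records "exon"; the more specific term falls out of the rule and the strand.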

Page 19: A Sequence Ontology Suzanna Lewis Berkeley Drosophila Genome Project

Give me all the dicistronic genes

• Define a dicistronic gene in terms of the cardinality of the transcript to open-reading-frame relationship and their spatial arrangement.
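One possible reading of this definition, sketched in Python: a transcript is treated as dicistronic when it carries exactly two open reading frames (the cardinality part) and those two do not overlap (the spatial part). This interpretation, and the names `ORF`, `TranscriptWithORFs` and `is_dicistronic`, are assumptions made for illustration, not the definition adopted by SO.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class ORF:
    start: int  # inclusive, in transcript coordinates
    end: int    # inclusive, in transcript coordinates


@dataclass(frozen=True)
class TranscriptWithORFs:
    name: str
    orfs: Tuple[ORF, ...]


def is_dicistronic(t: TranscriptWithORFs) -> bool:
    """Cardinality rule (exactly two ORFs) plus spatial rule (no overlap)."""
    if len(t.orfs) != 2:
        return False
    first, second = sorted(t.orfs, key=lambda o: o.start)
    return first.end < second.start


def dicistronic(transcripts: Iterable[TranscriptWithORFs]) -> List[str]:
    """Answer the query: names of all transcripts satisfying the rule."""
    return [t.name for t in transcripts if is_dicistronic(t)]


if __name__ == "__main__":
    data = [
        TranscriptWithORFs("t1", (ORF(10, 300), ORF(350, 900))),
        TranscriptWithORFs("t2", (ORF(10, 300),)),
    ]
    print(dicistronic(data))
```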

Page 20: A Sequence Ontology Suzanna Lewis Berkeley Drosophila Genome Project

Give me all 3’ exons that overlap 5’ untranslated regions.

• Define “exons with overlapping UTRs” as a spatial relationship coupled with being “partsof” different genes and being non-coding.
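The spatial rule can be sketched in Python as an interval-overlap test combined with a "different genes" condition. The `Feature` class, the kind labels "three_prime_exon" and "five_prime_UTR", and the reading of "non-coding" as a flag on the UTR feature are all assumptions for illustration; the sketch also assumes every feature lies on the same reference sequence and ignores strand.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class Feature:
    gene: str     # the gene this feature is part of
    kind: str     # hypothetical labels: "three_prime_exon", "five_prime_UTR"
    start: int    # inclusive
    end: int      # inclusive
    coding: bool


def overlaps(a: Feature, b: Feature) -> bool:
    """Two inclusive intervals overlap when each starts before the other ends."""
    return a.start <= b.end and b.start <= a.end


def exons_overlapping_utrs(features: Iterable[Feature]) -> List[Tuple[Feature, Feature]]:
    """Pairs of (3' exon, 5' UTR) from different genes that overlap."""
    features = list(features)
    exons = [f for f in features if f.kind == "three_prime_exon"]
    utrs = [f for f in features if f.kind == "five_prime_UTR" and not f.coding]
    return [(e, u) for e in exons for u in utrs
            if e.gene != u.gene and overlaps(e, u)]


if __name__ == "__main__":
    exon = Feature("geneA", "three_prime_exon", 900, 1100, True)
    utr = Feature("geneB", "five_prime_UTR", 1050, 1200, False)
    print(exons_overlapping_utrs([exon, utr]))
```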

Page 21: A Sequence Ontology Suzanna Lewis Berkeley Drosophila Genome Project

Loose and Flexible

• Rules are meant purely to ensure consistency.

• There will always be fuzzy areas where we want to allow freedom, because normal biology is like that.

• Constraints are NOT meant to perform any predictive function; they just provide a consistent definition.

Page 22: A Sequence Ontology Suzanna Lewis Berkeley Drosophila Genome Project

A single framework that integrates other biological ontologies

• One could have a 'knowledge base' centered around the genome. This KB would be amenable to reasoning.

• This is a significant change from relational, OO, or XML modeling; however, it is compatible with all of these.

• SO could be a framework for integrating data with other ontologies.

  • product features would have slots for standard GO annotations,

  • variation features would have slots into phenotypic ontologies.

Page 23: A Sequence Ontology Suzanna Lewis Berkeley Drosophila Genome Project

Build in a Bayesian Belief Network

• Probabilities may be assigned to annotations or used to suggest new annotations.

• Define a model for binding sites and regulatory regions based on weight matrices, proximity to starts of genes, and so forth.

• A curator can interactively explore and ask questions like “OK, I have evidence for such and such a binding site here; what if I alter the priors, how does that affect other nodes in the network (statements in the knowledge base) pertaining to pathways?”
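The "what if I alter the priors" question can be illustrated with a single application of Bayes' rule in Python. This is not a belief network, only the update at one node; the function name `posterior` and the probabilities in the usage example are made-up values for illustration.

```python
def posterior(prior: float, p_evidence_if_site: float, p_evidence_if_no_site: float) -> float:
    """P(binding site | evidence) by Bayes' rule for a single yes/no node."""
    numerator = p_evidence_if_site * prior
    denominator = numerator + p_evidence_if_no_site * (1.0 - prior)
    if denominator == 0.0:
        raise ValueError("evidence has zero probability under both hypotheses")
    return numerator / denominator


if __name__ == "__main__":
    # Same evidence, two different priors: the curator's "what if" question.
    for prior in (0.5, 0.1):
        print(prior, round(posterior(prior, 0.8, 0.2), 3))
```

In a full network, the updated belief at this node would in turn propagate to connected statements in the knowledge base.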

Page 24: A Sequence Ontology Suzanna Lewis Berkeley Drosophila Genome Project

To paraphrase Brunelleschi on the importance of tools, circa 1425

I am accustomed to think about and construct in my mind some unheard of invention making it possible to create great and wonderful things.

Page 25: A Sequence Ontology Suzanna Lewis Berkeley Drosophila Genome Project

GOBO Criteria

1. The ontologies are "open" and can be used by all without any constraint other than that their origin must be acknowledged.

2. The ontologies are in, or can be instantiated in, the GO syntax, extensions of this syntax, or DAML+OIL.

3. The ontologies are orthogonal to other ontologies already lodged with gobo.

4. The ontologies share a unique identifier space.

5. The ontologies include definitions of their terms.

Page 26: A Sequence Ontology Suzanna Lewis Berkeley Drosophila Genome Project

Giving it a go

• Sanger Institute

  • Richard Durbin, Tim Hubbard

• EBI

  • Michael Ashburner, Ewan Birney

• Mouse Genome Database

  • Judith Blake, Carol Bult

• BDGP

  • Chris Mungall, Brad Marshall, John Richter, ShengQiang Shu

• Wormbase

  • Lincoln Stein