
Ontology annotation: mapping genomic regions to biological function

Paul D Thomas1, Huaiyu Mi1 and Suzanna Lewis2

With numerous whole genomes now in hand, and experimental data about genes and biological pathways on the increase, a systems approach to biological research is becoming essential. Ontologies provide a formal representation of knowledge that is amenable to computational as well as human analysis, an obvious underpinning of systems biology. Mapping function to gene products in the genome consists of two somewhat intertwined enterprises: ontology building and ontology annotation. Ontology building is the formal representation of a domain of knowledge; ontology annotation is the association of specific genomic regions (which we refer to simply as ‘genes’, including genes and their regulatory elements and products such as proteins and functional RNAs) with parts of the ontology. We consider two complementary representations of gene function: the Gene Ontology (GO) and pathway ontologies. GO represents function from the gene’s eye view, in relation to a large and growing context of biological knowledge at all levels. Pathway ontologies represent function from the point of view of biochemical reactions and interactions, which are ordered into networks and causal cascades. The more mature GO provides an example of ontology annotation: how conclusions from the scientific literature and from evolutionary relationships are converted into formal statements about gene function. Annotations are made using a variety of different types of evidence, which can be used to estimate the relative reliability of different annotations.

Addresses

1 Evolutionary Systems Biology Group, Artificial Intelligence Center, SRI International, 333 Ravenswood Ave, Menlo Park, CA 94025, USA

2 Berkeley Bioinformatics and Ontology Project, Lawrence Berkeley National Laboratory, 1 Cyclotron Road Mailstop 64–121, Berkeley, CA 94720, USA

Corresponding author: Thomas, Paul D ([email protected])

Current Opinion in Chemical Biology 2007, 11:4–11

This review comes from a themed issue on Proteomics and genomics

Edited by Matthew Bogyo and Benjamin F Cravatt

Available online 5th January 2007

1367-5931/$ – see front matter

© 2006 Elsevier Ltd. All rights reserved.

DOI 10.1016/j.cbpa.2006.11.039

Introduction

Interpreting the results of biological experiments, particularly at a genomic scale, requires a systems approach. The genome of an organism can contain tens of thousands of genes and even more gene regulatory elements; these genes and their products interact in a complex network that defines biology at the molecular level. These molecular networks tend to be modular, with closely interacting molecules forming a composite unit that itself has a function. Furthermore, these modules have been combined in a hierarchical fashion over evolutionary time to generate even higher level functions. Representing the function(s) of a single protein in this context is already a daunting task; representing function on a genome-wide scale is even more so.

Perhaps the most valuable tool for managing this complexity is the computer. But for a computer to be useful in such a task, we must first provide a structured, ‘formal’ model of biological function. Ontologies have been used for years in computer science to provide a structure for knowledge and, with the advent of the Gene Ontology (GO) [1••], are entering widespread use in the domain of biology. Here, we first give a brief general background of ontologies. We then describe the ontology structures currently being used in the biological knowledge domain, focusing on the GO, as well as the rise of ‘pathway ontologies’ that can represent mechanistic and temporal properties of molecular networks. Finally, we review current efforts in ‘ontology annotation’ (the methods by which specific genes are associated with gene function ontology terms), with an emphasis on helping users to distinguish reliable from less reliable annotations.

What is an ontology?

An ontology is a formal structuring of knowledge. In its purest form, it is meant to represent reality, which means some part of the world as we currently understand or interpret it. An ontology consists of ‘universals’ (also referred to variously as ‘entities’, ‘classes’, ‘concepts’, ‘types’ and ‘terms’) and the relationships between them [2,3]. A universal is simply a type, or category, of things in the real world. Universals are often divided into two main subtypes: ‘continuants’ (things that exist), and ‘occurrents’ (things that occur in time, or ‘events’). So, for example, a particular molecule of trypsin in this test tube is a continuant of the type called serine protease, and the reaction it is catalyzing at that moment is an occurrent of the type called proteolysis. Universals have relationships to other universals in the ontology; for example, proteolysis is a subtype of ‘protein processing and modification’. In formal ontology representation, this would be coded as follows: proteolysis is_a protein processing and modification.
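As a minimal sketch of how such is_a statements might be held in a program, the Python fragment below stores each universal with its direct parents and follows the is_a links upward. The dictionary layout and the function name `ancestors` are illustrative assumptions and not part of any GO software. Only the first statement is taken from the text; the parent ‘protein metabolism’ is a hypothetical term added so that the traversal has more than one step.

```python
# Direct is_a links: child term -> set of parent terms.
IS_A = {
    # Statement taken from the text.
    "proteolysis": {"protein processing and modification"},
    # Hypothetical parent, assumed for illustration only.
    "protein processing and modification": {"protein metabolism"},
}


def ancestors(term, is_a=IS_A):
    """Return every term reachable from `term` by following is_a links."""
    found = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in is_a.get(current, ()):
            if parent not in found:
                found.add(parent)
                stack.append(parent)
    return found


print(sorted(ancestors("proteolysis")))
```

Because is_a is transitive, anything asserted about a parent type also holds for every subtype found by this walk.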


Ontologies for gene function: the Gene Ontology and pathway ontologies

The GO was designed as a formal representation of biological knowledge, as it relates to genes and gene products (primarily proteins, but also functional RNA molecules) [1••,4]. It consists of three knowledge domains: molecular function, biological process and cellular component. These terms were meant to describe the biological functions of an individual gene product: its functions at the molecular level, which higher level processes those functions help to accomplish, and where in the cell those functions are typically carried out.

Cellular components and molecular functions are continuants. Cellular components are defined as ‘a component of a cell. . . that is part of some larger object’, such as an organelle or molecular ‘machine’ made of multiple gene products (e.g. the proteasome or spliceosome) (http://www.geneontology.org/GO.doc.shtml). A molecular function is defined as the potential capacity to carry out an activity, such as catalysis at the molecular level, which the gene product possesses. Biological processes are occurrents. A biological process is defined as ‘a series of events accomplished by one or more ordered assemblies of molecular functions’. One way to think about the difference between these two ontology domains is that molecular function covers biology at the local, individual molecular level, whereas biological process covers biology at all higher levels, from metabolic pathways to organism-level physiology and even behavior.

The computational representation of pathways using ontologies actually has an even longer history than GO, although the use of these ontologies has not been as widespread among biologists. One of the first pathway representations was encoded as the back-end database schema for the EcoCyc database [5,6]. Pathway ontologies naturally include classes that cover molecules and different states of those molecules (e.g. phosphorylated and unphosphorylated forms of a protein), but the primary ‘atomic’ class is the ‘generalized reaction’. The most important relationships are those between molecule classes and reaction classes. For example, a molecule can be related to a classic chemical reaction as a reactant, product or catalyst. Reactions are generalized to include covalent reactions, such as transfer of a phosphate group, and noncovalent interactions, such as protein binding events. A key property of these ontologies is that they enable molecular events to be ordered into a pathway, either implicitly (products of one reaction are reactants in another) or explicitly (by representing dependencies or temporal ordering relations between reactions). For pathway ontologies covering eukaryotic cells, which have a high degree of structure for localizing and compartmentalizing reactions, the relationships between reactions and cellular components must also be represented (typically using GO cellular component terms). There are two standard formats for representing reaction and pathway ontologies: Systems Biology Markup Language (SBML) [7] and Biological Pathways Exchange (BioPAX) [8••]. One of the main differences between these two standards is that SBML is used primarily for quantitative modeling of biological processes and includes relevant terms and attributes such as rate constants and equilibrium constants, whereas BioPAX emerged from the genomics community and includes relevant terms and attributes such as protein sequences and gene identifiers.
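The idea of a ‘generalized reaction’ and of implicit ordering can be illustrated with a short Python sketch. It is not SBML or BioPAX, and the class name `Reaction`, the function `implicit_order` and the placeholder molecule names are all assumptions made for this example. Each reaction relates molecule classes as reactants, products or catalysts, and one reaction is taken to precede another when it produces something the other consumes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reaction:
    """A generalized reaction: a covalent change or a noncovalent binding event."""
    name: str
    reactants: frozenset
    products: frozenset
    catalysts: frozenset = frozenset()


def implicit_order(reactions):
    """Return (upstream, downstream) name pairs in which a product of the
    upstream reaction is a reactant of the downstream reaction."""
    pairs = []
    for up in reactions:
        for down in reactions:
            if up is not down and up.products & down.reactants:
                pairs.append((up.name, down.name))
    return pairs


# Hypothetical two-step cascade; the molecule names are placeholders.
r1 = Reaction("phosphorylation", frozenset({"A", "ATP"}),
              frozenset({"A-P", "ADP"}), frozenset({"kinase"}))
r2 = Reaction("binding", frozenset({"A-P", "B"}), frozenset({"A-P:B"}))

print(implicit_order([r1, r2]))
```

Explicit ordering would instead need a separate relation between reactions, which this sketch deliberately leaves out.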

It is important to note that even though some GO biological process terms represent pathways, this representation is limited to types of processes, and does not describe the process itself. The relationship types in GO are either ‘is_a’ (i.e. ‘is a subclass of’), or ‘part_of’, which cannot represent the temporal or biochemical relationships between the different molecular steps in the pathway. One example is the Notch signaling pathway. Figure 1 compares the representation of this pathway in GO with a more detailed pathway representation. The advantages of the GO representation include (i) its simplicity, and (ii) its focus on representing the structure and context of general biological knowledge [3]. The Notch pathway is represented as a single concept, with three explicit ‘parts’ that capture the general concept of Notch pathway regulation, Notch receptor processing and Notch target gene activation. An example of the greater context it provides is the rich set of connections between regulation of Notch signaling pathway and other ontology terms. By contrast, the advantages of the pathway representation include (i) the capability to represent details including molecular mechanisms, and (ii) representation of temporal ordering of events. The visual pathway representation shown in Figure 1b has a precise mapping into an ontology (SBML), and represents 15 reactions involving 23 distinct molecule classes.

Ontology annotation of genes

GO and pathway ontologies provide a structure for representing the functions of proteins in light of our current understanding of biology, but they do not provide information about actual genes. In ontology jargon, ontologies describe the ‘types’ of entities that exist and the ‘relations’ that exist between these entities, but do not specify ‘instances’ [3]. From a practical standpoint, this means that the ontological types provide ‘bins’ into which individual genes can be classified, but they do not actually place genes into these bins. The process of associating (‘annotating’) a biological molecule with an ontological type is called ontology annotation.
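To make the distinction between ‘bins’ and the act of filling them concrete, the following Python sketch treats an annotation as a small record that links an identifier to an ontological type. The record fields, the function `bin_by_term` and the `UniProt:` and `PMID:` prefixes are assumptions chosen for illustration; the example values are the ones quoted in the Figure 1 legend (molecule class NIcs, UniProt P46531, PubMed 15123653, evidence code IGI).

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    gene_id: str        # database identifier of the gene product
    term: str           # ontological type (the 'bin')
    evidence_code: str  # how the association was inferred
    reference: str      # pointer to the supporting evidence


def bin_by_term(annotations):
    """Group identifiers by the ontological type they are annotated to."""
    bins = defaultdict(set)
    for annotation in annotations:
        bins[annotation.term].add(annotation.gene_id)
    return dict(bins)


# Values taken from the Figure 1 legend; the prefixes are a convention
# assumed for this sketch.
example = Annotation("UniProt:P46531", "NIcs", "IGI", "PMID:15123653")
print(bin_by_term([example]))
```

The ontology itself would only define the term; the list of `Annotation` records is what places genes into it.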

Sources of ontology annotations for genes and gene products

The primary source for GO annotations is the Gene Ontology database, available at http://www.geneontology.org. The database contains GO annotations deposited by each of the contributing GO Consortium members [9,10••,11–19]. Currently, each member database is responsible for a single organism, or set of organisms (often taxonomically related) for which they contribute GO annotations. GO annotations for organisms not covered directly by a model organism database (see URL for full list) are provided for UniProt sequences by the GOA group [10••]. GO annotations are not simply attached to text strings representing, for example, gene symbols such as ‘AKT’; they are attached to stable database identifiers that can be accurately tracked and updated as our knowledge of genes and genomes increases.

Unfortunately, there is currently no centralized resource for pathway ontologies, including pathway ontology annotations. It has been estimated [20••] that more than 190 sources of pathway data are currently available on the internet. However, only a small fraction of these sources are both explicitly ontology based and linked to stable sequence identifiers, and therefore approach the GO standard for ontology annotation (Table 1).

Table 1

Pathway databases that are ontology based and linked to stable sequence identifiers.

Database name | BioCyc | KEGG | PANTHER pathway | Reactome
Pathway type | Metabolic | Mostly metabolic, a few signal transduction | Metabolic, regulatory, signal transduction | Metabolic, regulatory, signal transduction
Standard format | SBML, BioPAX | SBML, BioPAX | SBML | SBML, BioPAX
Sequences linked to pathway | Yes | Yes | Yes | Yes
Literature evidence | Linked to reactions | Not available | Linked to sequences and reactions | Linked to reactions
Homology inference | Enzyme commission number matching, Bayesian ‘hole filling’ [52] | Orthologous clusters | PANTHER phylogenetic tree, hidden Markov model | Orthologous clusters from OrthoMCL [23]
URL | www.biocyc.org/ | www.genome.jp/kegg/ | www.pantherdb.org/pathway/ | www.reactome.org/
Reference | [49••] | [50] | [47••] | [48••]

Using evidence types to distinguish more from less reliable annotations

All ontology annotations ultimately derive from experimental findings published in the scientific literature, through one or more steps of inference. However, not all ontology annotations are inferred by equally reliable methods. The reliability of an ontology annotation depends on how those inferences were made (i.e. the ‘evidence’ for the annotation).

Figure 1. The Notch signaling pathway as represented in (a) the Gene Ontology (GO) and (b) the PANTHER pathway ontology. (a) In GO, the representation of the pathway is relatively simple and provides high-level biological concepts for context; for example, the term ‘regulation of Notch signaling pathway’ is a child of ‘regulation of signal transduction’, which is itself a child of both ‘regulation of cellular process’ and ‘regulation of signal transduction’. Figure (a) created using the AMIGO program browser available at www.godatabase.org/. (b) In the PANTHER pathway ontology, there are 15 reactions; for example, one reaction class is NIcs + Ub (catalyzed_by) Sel10 → NIcs-Ub, relating the molecular classes NIcs (reactant), Ubiquitin (reactant), Sel10 (catalyst), NIcs-Ub (product). NIcs is linked to UniProt P46531 with PubMed 15123653 [51] as evidence, and IGI as evidence code. Abbreviations: IGI, inferred from genetic interaction; NIcs, Notch intracellular soluble fragment; NIct, Notch intracellular tethered fragment; Ub, ubiquitin. Figure (b) created from www.pantherdb.org/pathway/pathwayDiagram.jsp?catAccession=P00045.

GO provides several evidence codes for describing the type of evidence used (http://www.geneontology.org/GO.evidence.shtml), as well as a link to the piece of evidence itself. Broadly speaking, there are three main types of evidence: literature-based evidence, homology-based evidence and other computational evidence. The fraction of GO annotations of a given evidence type is shown in Figure 2.

Figure 2. Fraction of GO annotations with different types of evidence. Evidence has been divided into three broad groups: literature-based (both primary-source GO evidence codes IDA, IEP, IGI, IMP, IPI, and secondary-source codes TAS, NAS, IC), homology-based (ISS code, plus IEA codes and RCA codes that rely primarily on homology data, such as InterPro), and other (IEA codes and RCA codes that are not primarily homology-based). For most organisms, homology-based annotations predominate. Abbreviations: IC, inferred by curator; IDA, inferred from direct assay; IEA, inferred from electronic annotation; IEP, inferred from expression pattern; IGI, inferred from genetic interaction; IMP, inferred from mutant phenotype; IPI, inferred from physical interaction; ISS, inferred from sequence similarity; NAS, nontraceable author statement; RCA, inferred from reviewed computational annotation; TAS, traceable author statement.

Literature-based evidence

Literature-based evidence comes from direct experimental results used to draw an inference about the function of a particular gene or its gene product. This can be primary-source evidence, such as a paper that describes an actual experiment, or secondary-source evidence, such as a statement made in a review paper. Literature-based GO annotations made by GO curators are considered to be the most reliable. Some of the ‘model organisms’ that are particularly well studied, such as yeast, fruit fly and Caenorhabditis elegans, provide a large number of GO annotations with direct, literature-based evidence.

Homology-based evidence

For most organisms, experimental data do not currently exist for most gene products. As a result, the great majority of GO annotations rely on evidence of homology (Figure 2). For homology-based annotations, two separate inferences are made. An example of homology-based annotation is shown in Figure 3, where a human protein is annotated by homology to a Drosophila protein. The first inference is that a particular paper provides literature-based evidence for the function of the Drosophila Ether-a-go-go (EAG), resulting in the direct annotation: FBgn0000535 molecular function: ion channel with PubMed ID 10798390. The second inference is often simply called ‘homology’ but is actually a statement about the function of the most recent common ancestor (or an even more ancient ancestor), and about the inheritance of function from that ancestor. This is clear in Figure 3. To annotate potassium channel activity of the human EAGs based on homology with Drosophila EAG, one implicitly assumes that (i) the most recent common ancestor possessed potassium channel activity, and (ii) this function was inherited by its extant descendants (here, Drosophila EAG and the human EAGs, as well as descendants in many other organisms, as shown in Figure 3).

Figure 3. Annotation of human KCNH1 and KCNH5 using homology to Drosophila EAG (KCNAE_DROME), experimentally determined to be a potassium channel. For this inference to be valid, the most recent common ancestor (MRCA) of all these sequences (represented by the blue diamond) must have had potassium channel activity, and this function was preserved among all the descendants shown, including not only the human sequences, but also the Anopheles (mosquito), and the other vertebrate sequences.

The reliability of the homology-based annotation depends on the reliability of the two links in the inference chain: the literature-based inference for the function of one gene, and the inference of descent from a common ancestor that had that same function, which was then preserved in its descendants. Either of these links can be human curated or computationally inferred. However, the putative homology relationship, even if reviewed by a curator, is always the result of a computational algorithm, such as BLAST (basic local alignment search tool), hidden Markov models such as Pfam [21], sequence clustering algorithms [22,23] or phylogenetic tree building algorithms [24–27]. Curator-reviewed BLAST searching has been shown to result in less reliable GO annotations than phylogenetic tree building algorithms and curated subfamily hidden Markov models [3]. Because information about the algorithm is not currently linked to the annotation, the reliability of even curator-reviewed homology-based annotations is difficult to estimate from the current set of GO evidence codes, although further work in this area is in progress.
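The second link in this inference chain can be sketched in Python as a walk over a gene tree: find the most recent common ancestor of the experimentally annotated gene and the gene to be annotated, assume that ancestor had the function, and assume every leaf beneath it kept the function. The tree below is a toy example. The leaf names follow Figure 3, but the internal node names, the tree shape and all function names are assumptions made for illustration, and no real phylogenetic method is implemented here.

```python
# Toy gene tree as child -> parent links (internal nodes are invented).
PARENT = {
    "KCNAE_DROME": "insect_node",
    "Anopheles_EAG": "insect_node",
    "KCNH1": "vertebrate_node",
    "KCNH5": "vertebrate_node",
    "insect_node": "root",
    "vertebrate_node": "root",
}


def path_to_root(node, parent=PARENT):
    """List `node` followed by each of its ancestors up to the root."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path


def mrca(a, b, parent=PARENT):
    """Most recent common ancestor of two nodes, or None if unrelated."""
    seen = set(path_to_root(a, parent))
    for node in path_to_root(b, parent):
        if node in seen:
            return node
    return None


def leaves_under(node, parent=PARENT):
    """All leaves that descend from `node`."""
    internal = set(parent.values())
    return sorted(leaf for leaf in parent
                  if leaf not in internal and node in path_to_root(leaf, parent))


def propagate(annotated_leaf, target_leaf, function, parent=PARENT):
    """Assume (i) the MRCA had `function` and (ii) all descendants kept it."""
    ancestor = mrca(annotated_leaf, target_leaf, parent)
    return {leaf: function for leaf in leaves_under(ancestor, parent)}


print(propagate("KCNAE_DROME", "KCNH1", "potassium channel activity"))
```

The sketch makes the cost of the assumption visible: annotating one human gene from the Drosophila gene implicitly annotates every other descendant of the same ancestor, which is only as reliable as the tree and the assumption of conserved function.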

Other evidence

Several other computational methods have been developed for GO annotation. Because of space requirements, here we briefly mention only those methods that have already contributed to the annotations currently in the GO database. Groups partly or entirely outside the Gene Ontology Consortium (GOC) are largely responsible for developing these computational methods; the remarkable success of the GOC is due in part to its emphasis on openness and collaboration. Not including homology-based annotations [28–34,35••], these other computational methods represent a wide variety of approaches: text mining [36•,37•,38], gene expression analysis [39], protein–protein interaction data analysis [40], small nucleolar RNA predictions [41–43], signal peptide predictions [44], membrane-spanning region predictions [45] and a method based on a clustering of genes with similar GO annotations [46]. In general, these approaches are all labeled with the IEA (inferred from electronic annotation) evidence code, indicating the lowest level of reliability, but as with homology-based annotations the reliability is dependent on the methodology used. Curators often review these electronic annotations, resulting in upgrading of text-mining results to literature-based annotations, or upgrading of other results to RCA (reviewed computational analysis).
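A user who wants to separate more reliable from less reliable annotations could start from the three broad groups used in the Figure 2 legend. The Python sketch below maps an evidence code to one of those groups. The function name and the `homology_method` flag are assumptions made for this example: as noted above, the evidence code alone does not say whether an IEA or RCA annotation came from a homology-based method, so that information has to be supplied separately.

```python
# Groupings follow the Figure 2 legend.
PRIMARY_SOURCE = {"IDA", "IEP", "IGI", "IMP", "IPI"}
SECONDARY_SOURCE = {"TAS", "NAS", "IC"}
ELECTRONIC_OR_REVIEWED = {"IEA", "RCA"}


def evidence_group(code, homology_method=False):
    """Map a GO evidence code to 'literature', 'homology' or 'other'.

    `homology_method` is a hypothetical flag stating whether an IEA or RCA
    annotation was produced by a method that relies primarily on homology.
    """
    if code in PRIMARY_SOURCE or code in SECONDARY_SOURCE:
        return "literature"
    if code == "ISS":
        return "homology"
    if code in ELECTRONIC_OR_REVIEWED:
        return "homology" if homology_method else "other"
    raise ValueError(f"unknown evidence code: {code}")


for code in ("IDA", "TAS", "ISS", "IEA"):
    print(code, evidence_group(code))
```

Filtering a set of annotations to those whose group is ‘literature’ would then keep only the curator-made, literature-based annotations that the text describes as the most reliable.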

Pathway ontology annotation evidence

Some pathway ontology resources also use a controlled set of evidence codes for estimating the reliability of different ontology annotations. The PANTHER pathway database uses the GO evidence codes for direct evidence and links to ancestral nodes in phylogenetic trees to trace homology inferences [47••]. The Reactome database has two types of evidence [48••]: direct evidence and orthology evidence. Orthology is a particularly stringent homology relationship, implying only speciation events since the last common ancestor of two genes (i.e. the ‘same’ gene in two different species). The BioCyc database has its own evidence ontology [49••]. By contrast, the KEGG database [50] makes use of orthology clusters but does not provide information about which genes have direct evidence from literature, and which genes are inferred orthologs.

Conclusions

Ontologies for gene function have a crucial role in translating genomic data into models of biological function. Ontologies are constructed to represent a domain of knowledge using concepts and relationships; the biological knowledge domain is an area of extremely active research and, therefore, the corresponding ontologies will continue to evolve rapidly. The Gene Ontology is being constantly expanded and revised in a collaborative manner. Pathway ontologies provide detailed biochemical relationships between molecular types; these relationships are complementary to the representation in the Gene Ontology, and, indeed, can be explicitly connected to Gene Ontology terms. However, pathway ontologies have not reached the maturity of the Gene Ontology; there is as yet no centralized repository for these ontologies, nor a coordinated process for reaching community consensus on representing specific pathways.

Biological ontology annotation is the process by which specific genomic regions, that is, genes (in the broadest sense) and their products, are linked to concepts in biological ontologies. Each annotation is an inference that rests on specific evidence. The type of evidence is crucial for estimating the relative reliability of different annotations, although few users of Gene Ontology data have taken full advantage of evidence codes. Most gene function ontology annotations rely on homology inferences, and the reliability of a given homology-based annotation depends crucially on the computational methodology used to infer the inheritance of function from a common ancestor.

Gene Ontology annotations have been used widely in several application areas, but by far the most common use has been in forming biological hypotheses from large-scale genomic data such as gene expression studies (http://www.ebi.ac.uk/GOA/users.html). In most cases, these hypotheses derive from statistically significant overlaps between groups of coregulated genes and GO terms. The continued development of protein function ontologies, such as the more detailed representation in pathway ontologies, will enable even more sophisticated analysis of large-scale biological experiments. Experiments, in turn, will lead to revisions in the ontologies, paving the way toward systems biology.
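The text does not say which statistical test is used to judge the overlap between a group of coregulated genes and a GO term. One common choice, assumed here purely for illustration, is the one-sided hypergeometric test: given a population of genes, some of which carry the term, it gives the probability of seeing at least the observed number of annotated genes in a group drawn at random. The function name, the parameter names and the numbers in the example are all hypothetical.

```python
from math import comb


def hypergeom_tail(population, annotated, sample, overlap):
    """P(X >= overlap) when `sample` genes are drawn without replacement
    from `population` genes, `annotated` of which carry the GO term."""
    upper = min(annotated, sample)
    total = comb(population, sample)
    hits = sum(comb(annotated, k) * comb(population - annotated, sample - k)
               for k in range(overlap, upper + 1))
    return hits / total


# Hypothetical numbers: 20 genes, 5 annotated to the term,
# a coregulated group of 4 genes, 3 of which carry the term.
p_value = hypergeom_tail(population=20, annotated=5, sample=4, overlap=3)
print(round(p_value, 4))
```

In a real analysis many GO terms are tested at once, so a correction for multiple testing would be needed, and the result inherits whatever unreliability the underlying annotations carry.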

Acknowledgements

We thank Anish Kejariwal for generating the data in Figure 2, and for helpful comments on the manuscript.

References and recommended reading

Papers of particular interest, published within the annual period of review, have been highlighted as:

• of special interest

•• of outstanding interest

1.��

Gene ontology consortium: The Gene Ontology (GO) project in2006. Nucleic Acids Res 2006, 34:D322-D326.

An excellent update of GO project status, including applications of theGO.

2. Smith B: Onthology. In Blackwell Guide to the Philosophy ofComputing and Information. Edited by Floridi L. Oxford: Blackwell;2003:155-166.

3. Smith B, Williams J, Schulze-Kremer S: The ontology of the geneontology. AMIA Annu Symp Proc 2003:609-613.

4. Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM,Davis AP, Dolinski K, Dwight SS, Eppig JT et al.: Gene ontology:tool for the unification of biology. The Gene OntologyConsortium. Nat Genet 2000, 25:25-29.

5. Karp PD, Riley M, Paley SM, Pelligrini-Toole A: EcoCyc: anencyclopedia of Escherichia coli genes and metabolism.Nucleic Acids Res 1996, 24:32-39.

6. Keseler IM, Collado-Vides J, Gama-Castro S, Ingraham J, Paley S,Paulsen IT, Peralta-Gil M, Karp PD: EcoCyc: a comprehensivedatabase resource for Escherichia coli. Nucleic Acids Res 2005,33:D334-D337.

7. Hucka M, Finney A, Sauro HM, Bolouri H, Doyle JC, Kitano H,Arkin AP, Bornstein BJ, Bray D, Cornish-Bowden A et al.: Thesystems biology markup language (SBML): a medium forrepresentation and exchange of biochemical network models.Bioinformatics 2003, 19:524-531.

8. •• Luciano JS: PAX of mind for pathway researchers. Drug Discov Today 2005, 10:937-942.

BioPAX is an emerging standard pathway exchange format that combines different concepts of biological pathways into one consistent ontology. This article describes in detail the progress of BioPAX, as well as the challenges ahead of it. See also http://www.biopax.org.

9. Blake JA, Eppig JT, Bult CJ, Kadin JA, Richardson JE: The Mouse Genome Database (MGD): updates and enhancements. Nucleic Acids Res 2006, 34:D562-D567.

10. •• Camon E, Magrane M, Barrell D, Lee V, Dimmer E, Maslen J, Binns D, Harte N, Lopez R, Apweiler R: The Gene Ontology Annotation (GOA) database: sharing knowledge in Uniprot with Gene Ontology. Nucleic Acids Res 2004, 32:D262-D266.

GOA provides high-quality electronic and manual annotations to the UniProt Knowledgebase using the standardized vocabulary of the Gene Ontology. This article describes the GO annotation process implemented for the project.

11. Drysdale RA, Crosby MA: FlyBase: genes and gene models. Nucleic Acids Res 2005, 33:D390-D395.

12. Hirschman JE, Balakrishnan R, Christie KR, Costanzo MC, Dwight SS, Engel SR, Fisk DG, Hong EL, Livstone MS, Nash R et al.: Genome Snapshot: a new resource at the Saccharomyces Genome Database (SGD) presenting an overview of the Saccharomyces cerevisiae genome. Nucleic Acids Res 2006, 34:D442-D445.

13. Rhee SY, Beavis W, Berardini TZ, Chen G, Dixon D, Doyle A, Garcia-Hernandez M, Huala E, Lander G, Montoya M et al.: The Arabidopsis Information Resource (TAIR): a model organism database providing a centralized, curated gateway to Arabidopsis biology, research materials and community. Nucleic Acids Res 2003, 31:224-228.

14. de la Cruz N, Bromberg S, Pasko D, Shimoyama M, Twigger S, Chen J, Chen CF, Fan C, Foote C, Gopinath GR et al.: The Rat Genome Database (RGD): developments towards a phenome database. Nucleic Acids Res 2005, 33:D485-D491.

15. Schwarz EM, Antoshechkin I, Bastiani C, Bieri T, Blasiar D, Canaran P, Chan J, Chen N, Chen WJ, Davis P et al.: WormBase: better software, richer content. Nucleic Acids Res 2006, 34:D475-D478.

16. Sprague J, Bayraktaroglu L, Clements D, Conlin T, Fashena D, Frazer K, Haendel M, Howe DG, Mani P, Ramachandran S et al.: The Zebrafish Information Network: the zebrafish model organism database. Nucleic Acids Res 2006, 34:D581-D585.

17. Jaiswal P, Ni J, Yap I, Ware D, Spooner W, Youens-Clark K, Ren L, Liang C, Zhao W, Ratnapu K et al.: Gramene: a bird’s eye view of cereal genomes. Nucleic Acids Res 2006, 34:D717-D723.

18. Chisholm RL, Gaudet P, Just EM, Pilcher KE, Fey P, Merchant SN, Kibbe WA: DictyBase, the model organism database for Dictyostelium discoideum. Nucleic Acids Res 2006, 34:D423-D427.

19. Haft DH, Selengut JD, White O: The TIGRFAMs database of protein families. Nucleic Acids Res 2003, 31:371-373.

20. •• Bader GD, Cary MP, Sander C: Pathguide: a pathway resource list. Nucleic Acids Res 2006, 34:D504-D506.

A carefully compiled and categorized list of available pathway resources.

21. Finn RD, Mistry J, Schuster-Bockler B, Griffiths-Jones S, Hollich V, Lassmann T, Moxon S, Marshall M, Khanna A, Durbin R et al.: Pfam: clans, web tools and services. Nucleic Acids Res 2006, 34:D247-D251.

22. Servant F, Bru C, Carrere S, Courcelle E, Gouzy J, Peyruc D, Kahn D: ProDom: automated clustering of homologous domains. Brief Bioinform 2002, 3:246-251.

23. Li L, Stoeckert CJ Jr, Roos DS: OrthoMCL: identification of ortholog groups for eukaryotic genomes. Genome Res 2003, 13:2178-2189.

24. Thomas PD, Kejariwal A, Campbell MJ, Mi H, Diemer K, Guo N, Ladunga I, Ulitsky-Lazareva B, Muruganujan A, Rabkin S et al.: PANTHER: a browsable database of gene products organized by biological function, using curated protein family and subfamily classification. Nucleic Acids Res 2003, 31:334-341.

25. Mi H, Lazareva-Ulitsky B, Loo R, Kejariwal A, Vandergriff J, Rabkin S, Guo N, Muruganujan A, Doremieux O, Campbell MJ et al.: The PANTHER database of protein families, subfamilies, functions and pathways. Nucleic Acids Res 2005, 33:D284-D288.

26. Li H, Coghlan A, Ruan J, Coin LJ, Heriche JK, Osmotherly L, Li R, Liu T, Zhang Z, Bolund L et al.: TreeFam: a curated database of phylogenetic trees of animal gene families. Nucleic Acids Res 2006, 34:D572-D580.

27. Dehal PS, Boore JL: A phylogenomic gene cluster resource: the Phylogenetically Inferred Groups (PhIGs) database. BMC Bioinformatics 2006, 7:201.

28. Mi H, Vandergriff J, Campbell M, Narechania A, Majoros W, Lewis S, Thomas PD, Ashburner M: Assessment of genome-wide protein function classification for Drosophila melanogaster. Genome Res 2003, 13:2118-2128.

29. Okazaki Y, Furuno M, Kasukawa T, Adachi J, Bono H, Kondo S, Nikaido I, Osato N, Saito R, Suzuki H et al.: Analysis of the mouse transcriptome based on functional annotation of 60,770 full-length cDNAs. Nature 2002, 420:563-573.

30. Carninci P, Kasukawa T, Katayama S, Gough J, Frith MC, Maeda N, Oyama R, Ravasi T, Lenhard B, Wells C et al.: The transcriptional landscape of the mammalian genome. Science 2005, 309:1559-1563.

31. Hayete B, Bienkowska JR: GOtrees: predicting GO associations from protein domain composition using decision trees. Pac Symp Biocomput 2005:127-138.

32. Ogawa T, Ueda Y, Yoshimura K, Shigeoka S: Comprehensive analysis of cytosolic Nudix hydrolases in Arabidopsis thaliana. J Biol Chem 2005, 280:25277-25283.

33. Auldridge ME, Block A, Vogel JT, Dabney-Smith C, Mila I, Bouzayen M, Magallanes-Lundback M, DellaPenna D, McCarty DR, Klee HJ: Characterization of three members of the Arabidopsis carotenoid cleavage dioxygenase family demonstrates the divergent roles of this multifunctional enzyme family. Plant J 2006, 45:982-993.

34. Lutfiyya LL, Xu N, D’Ordine RL, Morrell JA, Miller PW, Duff SM: Phylogenetic and expression analysis of sucrose phosphate synthase isozymes in plants. J Plant Physiol 2006, Epub ahead of print, doi:10.1016/j.jplph.2006.04.014.

35. •• Maeda N, Kasukawa T, Oyama R, Gough J, Frith M, Engstrom PG, Lenhard B, Aturaliya RN, Batalov S, Beisel KW et al.: Transcript annotation in FANTOM3: mouse gene catalog based on physical cDNAs. PLoS Genet 2006, 2:e62.

Describes FANTOM3, the third large-scale mouse transcript annotation; the process contains several improvements over that of FANTOM1.

36. • Muller HM, Kenny EE, Sternberg PW: Textpresso: an ontology-based information retrieval and extraction system for biological literature. PLoS Biol 2004, 2:e309.

A description of the text retrieval system used by many model organism GO annotation projects.

37. • Liu Y, Navathe SB, Civera J, Dasigi V, Ram A, Ciliax BJ, Dingledine R: Text mining biomedical literature for discovering gene-to-gene relationships: a comparative study of algorithms. IEEE/ACM Trans Comput Biol Bioinform 2005, 2:62-76.

A good description of several algorithms for text mining in the context of functional annotation.

38. Hoffmann R, Valencia A: Implementing the iHOP concept for navigation of biomedical literature. Bioinformatics 2005, 21(Suppl 2):ii252-ii258.

39. Wade CH, Umbarger MA, McAlear MA: The budding yeast rRNA and ribosome biosynthesis (RRB) regulon contains over 200 genes. Yeast 2006, 23:293-306.

40. Samanta MP, Liang S: Predicting protein functions from redundancies in large-scale protein interaction networks. Proc Natl Acad Sci USA 2003, 100:12579-12583.

41. Schattner P, Decatur WA, Davis CA, Ares M Jr, Fournier MJ, Lowe TM: Genome-wide searching for pseudouridylation guide snoRNAs: analysis of the Saccharomyces cerevisiae genome. Nucleic Acids Res 2004, 32:4281-4296.

42. Lowe TM, Eddy SR: A computational screen for methylation guide snoRNAs in yeast. Science 1999, 283:1168-1171.

43. Kiss-Laszlo Z, Henry Y, Bachellerie JP, Caizergues-Ferrer M, Kiss T: Site-specific ribose methylation of preribosomal RNA: a novel function for small nucleolar RNAs. Cell 1996, 85:1077-1088.

44. Emanuelsson O, Nielsen H, Brunak S, von Heijne G: Predicting subcellular localization of proteins based on their N-terminal amino acid sequence. J Mol Biol 2000, 300:1005-1016.

45. Krogh A, Larsson B, von Heijne G, Sonnhammer EL: Predicting transmembrane protein topology with a hidden Markov model: application to complete genomes. J Mol Biol 2001, 305:567-580.

46. King OD, Foulger RE, Dwight SS, White JV, Roth FP: Predicting gene function from patterns of annotation. Genome Res 2003, 13:896-904.

47. •• Mi H, Guo N, Kejariwal A, Thomas PD: PANTHER version 6: protein sequence and function evolution data with expanded representation of biological pathways. Nucleic Acids Res 2007, 35:D247-D252.

A high-level description of a pathway ontology, summarized in a figure.

48. •• Joshi-Tope G, Gillespie M, Vastrik I, D’Eustachio P, Schmidt E, de Bono B, Jassal B, Gopinath GR, Wu GR, Matthews L et al.: Reactome: a knowledgebase of biological pathways. Nucleic Acids Res 2005, 33:D428-D432.

An excellent, succinct description of a pathway ontology (data model).

49. •• Karp PD, Ouzounis CA, Moore-Kochlacs C, Goldovsky L, Kaipa P, Ahren D, Tsoka S, Darzentas N, Kunin V, Lopez-Bigas N: Expansion of the BioCyc collection of pathway/genome databases to 160 genomes. Nucleic Acids Res 2005, 33:6083-6089.

A description of the BioCyc collection of pathway/genome databases with an emphasis on community contribution and more accurate pathway prediction.

50. Kanehisa M, Goto S, Hattori M, Aoki-Kinoshita KF, Itoh M, Kawashima S, Katayama T, Araki M, Hirakawa M: From genomics to chemical genomics: new developments in KEGG. Nucleic Acids Res 2006, 34:D354-D357.

51. Das I, Craig C, Funahashi Y, Jung KM, Kim TW, Byers R, Weng AP, Kutok JL, Aster JC, Kitajewski J: Notch oncoproteins depend on γ-secretase/presenilin activity for processing and function. J Biol Chem 2004, 279:30771-30780.

52. Green ML, Karp PD: A Bayesian method for identifying missing enzymes in predicted metabolic pathway databases. BMC Bioinformatics 2004, 5:76.
