Ontology annotation: mapping genomic regions to biological function
Paul D Thomas1, Huaiyu Mi1 and Suzanna Lewis2
With numerous whole genomes now in hand, and experimental
data about genes and biological pathways on the increase, a
systems approach to biological research is becoming
essential. Ontologies provide a formal representation of
knowledge that is amenable to computational as well as human
analysis, an obvious underpinning of systems biology. Mapping
function to gene products in the genome consists of two,
somewhat intertwined enterprises: ontology building and
ontology annotation. Ontology building is the formal
representation of a domain of knowledge; ontology annotation
is association of specific genomic regions (which we refer to
simply as ‘genes’, including genes and their regulatory
elements and products such as proteins and functional RNAs)
to parts of the ontology. We consider two complementary
representations of gene function: the Gene Ontology (GO) and
pathway ontologies. GO represents function from the gene’s
eye view, in relation to a large and growing context of biological
knowledge at all levels. Pathway ontologies represent function
from the point of view of biochemical reactions and
interactions, which are ordered into networks and causal
cascades. The more mature GO provides an example of
ontology annotation: how conclusions from the scientific
literature and from evolutionary relationships are converted into
formal statements about gene function. Annotations are made
using a variety of different types of evidence, which can be used
to estimate the relative reliability of different annotations.
Addresses
1 Evolutionary Systems Biology Group, Artificial Intelligence Center, SRI
International, 333 Ravenswood Ave, Menlo Park, CA 94025, USA
2 Berkeley Bioinformatics and Ontology Project, Lawrence Berkeley
National Laboratory, 1 Cyclotron Road Mailstop 64–121, Berkeley, CA
94720, USA
Corresponding author: Thomas, Paul D ([email protected])
Current Opinion in Chemical Biology 2007, 11:4–11
This review comes from a themed issue on
Proteomics and genomics
Edited by Matthew Bogyo and Benjamin F Cravatt
Available online 5th January 2007
1367-5931/$ – see front matter
© 2006 Elsevier Ltd. All rights reserved.
DOI 10.1016/j.cbpa.2006.11.039
Introduction
Interpreting the results of biological experiments,
particularly at a genomic scale, requires a systems
approach. The genome of an organism can contain tens
of thousands of genes and even more gene regulatory
elements; these genes and their products interact in a
complex network that defines biology at the molecular
level. These molecular networks tend to be modular, with
closely interacting molecules forming a composite unit
that itself has a function. Furthermore, these modules
have been combined in a hierarchical fashion over evolutionary
time to generate even higher level functions.
Representing the function(s) of a single protein in this
context is already a daunting task; representing function
on a genome-wide scale is even more so.
Perhaps the most valuable tool for managing this complexity
is the computer. But for a computer to be useful in
such a task, we must first provide a structured, ‘formal’
model of biological function. Ontologies have been used
for years in computer science to provide a structure for
knowledge and, with the advent of the Gene Ontology
(GO) [1••], are entering widespread use in the domain of
biology. Here, we first give a brief general background of
ontologies. We then describe the ontology structures
currently being used in the biological knowledge domain,
focusing on the GO, as well as the rise of ‘pathway
ontologies’ that can represent mechanistic and temporal
properties of molecular networks. Finally, we review
current efforts in ‘ontology annotation’ — the methods
by which specific genes are associated with gene function
ontology terms — with an emphasis on helping users to
distinguish reliable from less reliable annotations.
What is an ontology?
An ontology is a formal structuring of knowledge. In its
purest form, it is meant to represent reality, which means
some part of the world as we currently understand or
interpret it. An ontology consists of ‘universals’ (also
referred to variously as ‘entities’, ‘classes’, ‘concepts’,
‘types’ and ‘terms’) and the relationships between them
[2,3]. A universal is simply a type, or category, of things in
the real world. Universals are often divided into two
main subtypes: ‘continuants’ (things that exist), and
‘occurrents’ (things that occur in time, or ‘events’). So,
for example, a particular molecule of trypsin in this test
tube is a continuant of the type called serine protease, and
the reaction it is catalyzing at that moment is an occurrent
of the type called proteolysis. Universals have
relationships to other universals in the ontology; for
example, proteolysis is a subtype of ‘protein processing
and modification’. In formal ontology representation,
this would be coded as follows: proteolysis is_a protein processing and modification.
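As a minimal sketch of how such is_a statements can be stored and queried computationally, the following Python fragment records is_a relationships between universals and follows them transitively. The class and method names are illustrative assumptions and do not come from GO or any ontology toolkit; the second relationship is a hypothetical parent added only to show transitivity.

```python
class Ontology:
    """A toy ontology: universals (terms) linked by is_a relationships."""

    def __init__(self):
        # Maps each term to the set of its direct is_a parents.
        self._parents = {}

    def add_is_a(self, child, parent):
        """Record the statement `child is_a parent`."""
        self._parents.setdefault(child, set()).add(parent)

    def ancestors(self, term):
        """Return every term reachable from `term` by following is_a links."""
        seen = set()
        stack = [term]
        while stack:
            current = stack.pop()
            for parent in self._parents.get(current, ()):
                if parent not in seen:
                    seen.add(parent)
                    stack.append(parent)
        return seen

    def is_a(self, term, candidate):
        """True if `term` is `candidate` or a (possibly indirect) subtype of it."""
        return term == candidate or candidate in self.ancestors(term)


onto = Ontology()
# The example relationship from the text.
onto.add_is_a("proteolysis", "protein processing and modification")
# Hypothetical parent, added only to demonstrate transitivity.
onto.add_is_a("protein processing and modification", "biological process")
```

Because is_a is transitive, anything asserted about a parent type holds for its subtypes, which is what allows annotations made to a specific term to be retrieved through a more general one.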
Ontologies for gene function: the Gene Ontology and pathway ontologies
The GO was designed as a formal representation of
biological knowledge, as it relates to genes and gene
products (primarily proteins, but also functional RNA
molecules) [1••,4]. It consists of three knowledge
domains: molecular function, biological process and cellular
component. These terms were meant to describe the
biological functions of an individual gene product: its
functions at the molecular level, which higher level
processes those functions help to accomplish, and where
in the cell those functions are typically carried out.
Cellular components and molecular functions are continuants.
Cellular components are defined as ‘a component
of a cell ... that is part of some larger object’, such as an
organelle or molecular ‘machine’ made of multiple gene
products (e.g. the proteasome or spliceosome)
(http://www.geneontology.org/GO.doc.shtml). A molecular
function is defined as the potential capacity to carry
out an activity, such as catalysis at the molecular level,
which the gene product possesses. Biological processes
are occurrents. A biological process is defined as ‘a series
of events accomplished by one or more ordered assemblies
of molecular functions’. One way to think about the
difference between these two ontology domains is that
molecular function covers biology at the local, individual
molecular level, whereas biological process covers biology
at all higher levels, from metabolic pathways to organism-
level physiology and even behavior.
The computational representation of pathways using
ontologies actually has an even longer history than GO,
although the use of these ontologies has not been as
widespread among biologists. One of the first pathway
representations was encoded as the back-end database
schema for the EcoCyc database [5,6]. Pathway ontologies
naturally include classes that cover molecules and
different states of those molecules (e.g. phosphorylated
and unphosphorylated forms of a protein), but the
primary ‘atomic’ class is the ‘generalized reaction’. The
most important relationships are those between molecule
classes and reaction classes. For example, a molecule can
be related to a classic chemical reaction as a reactant,
product or catalyst. Reactions are generalized to include
covalent reactions, such as transfer of a phosphate group,
and noncovalent interactions, such as protein binding
events. A key property of these ontologies is that they
enable molecular events to be ordered into a pathway,
either implicitly (products of one reaction are reactants in
another) or explicitly (by representing dependencies or
temporal ordering relations between reactions). For pathway
ontologies covering eukaryotic cells, which have a
high degree of structure for localizing and compartmentalizing
reactions, the relationships between reactions and
cellular components must also be represented (typically
using GO cellular component terms). There are two
standard formats for representing reaction and pathway
ontologies: Systems Biology Markup Language (SBML)
[7] and Biological Pathways Exchange (BioPAX) [8••].
One of the main differences between these two standards
is that SBML is used primarily for quantitative modeling
of biological processes and includes relevant terms and
attributes such as rate constants and equilibrium constants,
whereas BioPAX emerged from the genomics
community and includes relevant terms and attributes
such as protein sequences and gene identifiers.
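The ‘generalized reaction’ and its implicit ordering can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the schema of SBML, BioPAX or any pathway database: the `Reaction` class and `implicit_successors` function are hypothetical names, the first reaction is the Notch example given in the Figure 1 legend (with the ubiquitinated product written here as "NIcs-Ub"), and the second reaction is invented purely so that there is something to order.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reaction:
    """A generalized reaction relating molecule classes by role."""
    name: str
    reactants: frozenset
    products: frozenset
    catalysts: frozenset = frozenset()


def implicit_successors(reactions):
    """Order reactions implicitly: B follows A if B consumes a product of A."""
    order = {}
    for upstream in reactions:
        order[upstream.name] = [
            downstream.name
            for downstream in reactions
            if downstream is not upstream
            and upstream.products & downstream.reactants
        ]
    return order


# Reaction class from the Figure 1 legend: NIcs + Ub, catalyzed by Sel10.
ubiquitination = Reaction(
    name="NIcs ubiquitination",
    reactants=frozenset({"NIcs", "Ub"}),
    products=frozenset({"NIcs-Ub"}),
    catalysts=frozenset({"Sel10"}),
)
# Hypothetical downstream step, included only to demonstrate ordering.
degradation = Reaction(
    name="NIcs-Ub degradation (hypothetical)",
    reactants=frozenset({"NIcs-Ub"}),
    products=frozenset({"peptides"}),
)

order = implicit_successors([ubiquitination, degradation])
```

Explicit ordering would instead store dependency or ‘precedes’ relations directly between reactions; the implicit form shown here needs no extra relations but can only order steps that share a molecule class.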
It is important to note that even though some GO
biological process terms represent pathways, this representation
is limited to types of processes, and does not
describe the process itself. The relationship types in GO
are either ‘is_a’ (i.e. ‘is a subclass of’), or ‘part_of’, which
cannot represent the temporal or biochemical relationships
between the different molecular steps in the pathway.
One example is the Notch signaling pathway.
Figure 1 compares the representation of this pathway
in GO with a more detailed pathway representation. The
advantages of the GO representation include (i) its simplicity,
and (ii) its focus on representing the structure and
context of general biological knowledge [3]. The Notch
pathway is represented as a single concept, with three
explicit ‘parts’ that capture the general concept of Notch
pathway regulation, Notch receptor processing and Notch
target gene activation. An example of the greater context
it provides is the rich set of connections between regulation
of Notch signaling pathway and other ontology
terms. By contrast, the advantages of the pathway representation
include (i) the capability to represent details
including molecular mechanisms, and (ii) representation
of temporal ordering of events. The visual pathway
representation shown in Figure 1b has a precise mapping
into an ontology (SBML), and represents 15 reactions
involving 23 distinct molecule classes.
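The contrast can be made concrete with a small sketch of the GO view as typed edges. The term names below paraphrase the three ‘parts’ described above rather than quoting exact GO term names, and the `Edge` tuple and `children` helper are hypothetical. The point of the sketch is that only two relation types are available, so no edge can state that one part happens before another.

```python
from collections import namedtuple

# The only relationship types available in GO, as described in the text.
RELATION_TYPES = {"is_a", "part_of"}

Edge = namedtuple("Edge", ["child", "relation", "parent"])

# Paraphrased term names; exact GO term names may differ.
EDGES = [
    Edge("regulation of Notch signaling pathway", "part_of", "Notch signaling pathway"),
    Edge("Notch receptor processing", "part_of", "Notch signaling pathway"),
    Edge("Notch target gene activation", "part_of", "Notch signaling pathway"),
    Edge("regulation of Notch signaling pathway", "is_a", "regulation of signal transduction"),
]


def children(term, relation):
    """Return the terms linked to `term` by the given relation, sorted by name."""
    if relation not in RELATION_TYPES:
        raise ValueError(f"GO has no relation type {relation!r}")
    return sorted(e.child for e in EDGES if e.parent == term and e.relation == relation)


parts = children("Notch signaling pathway", "part_of")
```

The three parts come back as an unordered set (sorted here only by name); recovering which step precedes which requires a pathway representation with reactions, reactants and products.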
Ontology annotation of genes
GO and pathway ontologies provide a structure for representing
the functions of proteins in light of our current
understanding of biology, but they do not provide information
about actual genes. In ontology jargon, ontologies
describe the ‘types’ of entities that exist and the
‘relations’ that exist between these entities, but do not
specify ‘instances’ [3]. From a practical standpoint, this
means that the ontological types provide ‘bins’ into which
individual genes can be classified, but they do not actually
place genes into these bins. The process of associating
(‘annotating’) a biological molecule with an ontological
type is called ontology annotation.
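A minimal sketch of this distinction: the ontology supplies the bins (types), and an annotation is a separate record that places one identified gene or gene product into one bin, together with its supporting evidence. The `Annotation` fields and the `bin_by_term` helper are assumptions made for illustration, not the GO database schema. The two records use identifiers that appear elsewhere in this article; the evidence code of the first is left unset because the text does not state it.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Annotation:
    """One instance-to-type association with its supporting evidence."""
    gene_id: str                          # stable database identifier
    term: str                             # ontological type (the 'bin')
    reference: str                        # literature reference
    evidence_code: Optional[str] = None   # None when not stated in the text


def bin_by_term(annotations):
    """Group annotated identifiers into the bins defined by ontology terms."""
    bins = defaultdict(set)
    for annotation in annotations:
        bins[annotation.term].add(annotation.gene_id)
    return dict(bins)


annotations = [
    # Drosophila EAG, molecular function 'ion channel' (homology section).
    Annotation("FBgn0000535", "ion channel", "PMID:10798390"),
    # Pathway molecule class NIcs, from the Figure 1 legend.
    Annotation("P46531", "NIcs", "PMID:15123653", evidence_code="IGI"),
]

bins = bin_by_term(annotations)
```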
Sources of ontology annotations for genes and gene products
The primary source for GO annotations is the Gene
Ontology database, available at http://www.geneontology.org.
The database contains GO annotations deposited by
each of the contributing GO Consortium members
[9,10••,11–19]. Currently, each member database is
responsible for a single organism, or set of organisms (often
taxonomically related), for which it contributes GO annotations.
GO annotations for organisms not covered directly
by a model organism database (see URL for full list) are
provided for UniProt sequences by the GOA group [10••].
GO annotations are not simply attached to text strings
representing, for example, gene symbols such as ‘AKT’;
they are attached to stable database identifiers that can be
accurately tracked and updated as our knowledge of genes
and genomes increases.
Unfortunately, there is currently no centralized resource
for pathway ontologies, including pathway ontology annotations.
It has been estimated [20••] that more than 190
sources of pathway data are currently available on the
internet. However, only a small fraction of these sources
are both explicitly ontology based and linked to stable
sequence identifiers, and therefore approach the GO
standard for ontology annotation (Table 1).
Table 1
Pathway databases that are ontology based and linked to stable sequence identifiers.
Database name | BioCyc | KEGG | PANTHER pathway | Reactome
Pathway type | Metabolic | Mostly metabolic, a few signal transduction | Metabolic, regulatory, signal transduction | Metabolic, regulatory, signal transduction
Standard format | SBML, BioPAX | SBML, BioPAX | SBML | SBML, BioPAX
Sequences linked to pathway | Yes | Yes | Yes | Yes
Literature evidence | Linked to reactions | Not available | Linked to sequences and reactions | Linked to reactions
Homology inference | Enzyme commission number matching, Bayesian ‘hole filling’ [52] | Orthologous clusters | PANTHER phylogenetic tree, hidden Markov model | Orthologous clusters from OrthoMCL [23]
URL | www.biocyc.org/ | www.genome.jp/kegg/ | www.pantherdb.org/pathway/ | www.reactome.org./
Reference | [49••] | [50] | [47••] | [48••]
Using evidence types to distinguish more from less reliable annotations
All ontology annotations ultimately derive from experimental
findings published in the scientific literature,
through one or more steps of inference. However, not all
ontology annotations are inferred by equally reliable
methods. The reliability of an ontology annotation
depends on how those inferences were made (i.e. the
‘evidence’ for the annotation).
Figure 1
The Notch signaling pathway as represented in (a) the Gene Ontology (GO) and (b) the PANTHER pathway ontology. (a) In GO, the
representation of the pathway is relatively simple and provides high-level biological concepts for context; for example, the term ‘regulation of Notch
signaling pathway’ is a child of ‘regulation of signal transduction’, which is itself a child of both ‘regulation of cellular process’ and ‘regulation of signal
transduction’. Figure (a) created using the AMIGO program browser available at www.godatabase.org/. (b) In the PANTHER pathway ontology, there
are 15 reactions; for example, one reaction class is NIcs + Ub (catalyzed_by) Sel10 → NIcs·Ub, relating the molecular classes NIcs (reactant), Ubiquitin
(reactant), Sel10 (catalyst), NIcs·Ub (product). NIcs is linked to UniProt P46531 with PubMed 15123653 [51] as evidence, and IGI as evidence code.
Abbreviations: IGI, inferred from genetic interaction; NIcs, Notch intracellular soluble fragment; NIct, Notch intracellular tethered fragment; Ub,
ubiquitin. Figure (b) created from www.pantherdb.org/pathway/pathwayDiagram.jsp?catAccession=P00045.
GO provides several evidence codes for describing the
type of evidence used (http://www.geneontology.org/GO.evidence.shtml),
as well as a link to the piece of
evidence itself. Broadly speaking, there are three main
types of evidence: literature-based evidence, homology-based
evidence and other computational evidence. The
fraction of GO annotations of a given evidence type is
shown in Figure 2.
Figure 2
Fraction of GO annotations with different types of evidence. Evidence has been divided into three broad groups: literature-based (both primary-source
GO evidence codes IDA, IEP, IGI, IMP, IPI, and secondary-source codes TAS, NAS, IC), homology-based (ISS code, plus IEA codes and RCA codes
that rely primarily on homology data, such as InterPro), and other (IEA codes and RCA codes that are not primarily homology-based). For most
organisms, homology-based annotations predominate. Abbreviations: IC, inferred by curator; IDA, inferred from direct assay; IEA, inferred from
electronic annotation; IEP, inferred from expression pattern; IGI, inferred from genetic interaction; IMP, inferred from mutant phenotype; IPI, inferred
from physical interaction; ISS, inferred from sequence similarity; NAS, nontraceable author statement; RCA, inferred from reviewed computational
annotation; TAS, traceable author statement.
Literature-based evidence
Literature-based evidence comes from direct experimental
results used to draw an inference about the function of
a particular gene or its gene product. This can be primary-source
evidence, such as a paper that describes an actual
experiment, or secondary-source evidence, such as a
statement made in a review paper. Literature-based
GO annotations made by GO curators are considered
to be the most reliable. Some of the ‘model organisms’
that are particularly well studied, such as yeast, fruit fly
and Caenorhabditis elegans, provide a large number of GO
annotations with direct, literature-based evidence.
Homology-based evidence
For most organisms, experimental data do not currently
exist for most gene products. As a result, the great
majority of GO annotations rely on evidence of homology
(Figure 2). For homology-based annotations, two separate
inferences are made. An example of homology-based
annotation is shown in Figure 3, where a human protein
is annotated by homology to a Drosophila protein. The
first inference is that a particular paper provides literature-based
evidence for the function of the Drosophila
Ether-a-go-go (EAG), resulting in the direct annotation:
FBgn0000535 molecular function: ion channel with PubMed
ID 10798390. The second inference is often simply called
‘homology’ but is actually a statement about the function
of the most recent common ancestor (or an even more
ancient ancestor), and about the inheritance of function
from that ancestor. This is clear in Figure 3. To annotate
potassium channel activity of the human EAGs based on
homology with Drosophila EAG, one implicitly assumes
that (i) the most recent common ancestor possessed
potassium channel activity, and (ii) this function was
inherited by its extant descendants (here, Drosophila
EAG and the human EAGs, as well as descendants in
many other organisms, as shown in Figure 3).
Figure 3
Annotation of human KCNH1 and KCNH5 using homology to Drosophila EAG (KCNAE_DROME), experimentally determined to be a potassium
channel. For this inference to be valid, the most recent common ancestor (MRCA) of all these sequences (represented by the blue diamond)
must have had potassium channel activity, and this function was preserved among all the descendants shown, including not only the human
sequences, but also the Anopheles (mosquito), and the other vertebrate sequences.
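The two assumptions behind a homology-based annotation can be made explicit in a short sketch. The tree below is a hypothetical simplification loosely following Figure 3 (the topology, the internal node names and the label "Anopheles EAG" are assumptions), and the function names are illustrative. The sketch finds the most recent common ancestor (MRCA) of the experimentally characterized gene and the target gene, assigns the function to that ancestor, and then propagates it to every extant descendant.

```python
# Hypothetical gene tree as child -> parent links (None marks the root).
PARENT = {
    "KCNAE_DROME": "insect ancestor",
    "Anopheles EAG": "insect ancestor",
    "KCNH1_HUMAN": "vertebrate ancestor",
    "KCNH5_HUMAN": "vertebrate ancestor",
    "insect ancestor": "MRCA",
    "vertebrate ancestor": "MRCA",
    "MRCA": None,
}


def path_to_root(node):
    """Return the node followed by all of its ancestors, up to the root."""
    path = []
    while node is not None:
        path.append(node)
        node = PARENT[node]
    return path


def mrca(a, b):
    """Most recent common ancestor of two nodes in the tree."""
    ancestors_of_a = set(path_to_root(a))
    for node in path_to_root(b):
        if node in ancestors_of_a:
            return node
    raise ValueError("nodes do not share an ancestor")


def leaves_under(node):
    """Extant sequences (leaves) descended from the given node."""
    kids = [child for child, parent in PARENT.items() if parent == node]
    if not kids:
        return {node}
    found = set()
    for kid in kids:
        found |= leaves_under(kid)
    return found


def infer_by_homology(source, target, term):
    """Assume the MRCA had `term` and that every descendant inherited it."""
    ancestor = mrca(source, target)
    return {
        leaf: {"term": term, "evidence_code": "ISS", "via_ancestor": ancestor}
        for leaf in leaves_under(ancestor)
        if leaf != source
    }


inferred = infer_by_homology("KCNAE_DROME", "KCNH1_HUMAN", "potassium channel activity")
```

Although only KCNH1 was requested, the result also covers KCNH5 and the Anopheles sequence: annotating one descendant by homology is logically a claim about all descendants of the same ancestor.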
The reliability of the homology-based annotation
depends on the reliability of the two links in the inference
chain: the literature-based inference for the function of
one gene, and the inference of descent from a common
ancestor that had that same function, which was then
preserved in its descendants. Either of these links can be
human curated or computationally inferred. However, the
putative homology relationship, even if reviewed by a
curator, is always the result of a computational algorithm,
such as BLAST (basic local alignment search tool), hidden
Markov models such as Pfam [21], sequence clustering
algorithms [22,23] or phylogenetic tree building
algorithms [24–27]. Curator-reviewed BLAST searching
has been shown to result in less reliable GO annotations
than phylogenetic tree building algorithms and curated
subfamily hidden Markov models [3]. Because information
about the algorithm is not currently linked to
the annotation, the reliability of even curator-reviewed
homology-based annotations is difficult to estimate from
the current set of GO evidence codes, although further
work in this area is in progress.
Other evidence
Several other computational methods have been developed
for GO annotation. Because of space requirements,
here we briefly mention only those methods that have
already contributed to the annotations currently in the
GO database. Groups partly or entirely outside the Gene
Ontology Consortium (GOC) are largely responsible for
developing these computational methods; the remarkable
success of the GOC is due in part to its emphasis on
openness and collaboration. Not including
homology-based annotations [28–34,35••], these other computational
methods represent a wide variety of approaches:
text mining [36•,37•,38], gene expression analysis [39],
protein–protein interaction data analysis [40], small
nucleolar RNA predictions [41–43], signal peptide predictions
[44], membrane-spanning region predictions [45]
and a method based on a clustering of genes with similar
GO annotations [46]. In general, these approaches are all
labeled with the IEA (inferred from electronic annotation)
evidence code, indicating the lowest level of
reliability, but as with homology-based annotations the
reliability is dependent on the methodology used. Curators
often review these electronic annotations, resulting in
upgrading of text-mining results to literature-based annotations,
or upgrading of other results to RCA (reviewed
computational analysis).
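For users who want to filter annotations by reliability, the three broad evidence groups used in this section and in the Figure 2 legend can be expressed as a small lookup. This is a sketch, not an official GO mapping: the function name and the `homology_based` flag are assumptions. The flag is needed because, as noted above, an IEA or RCA code alone does not reveal whether the underlying method relied primarily on homology.

```python
# Evidence code groups as listed in the Figure 2 legend.
PRIMARY_LITERATURE = {"IDA", "IEP", "IGI", "IMP", "IPI"}
SECONDARY_LITERATURE = {"TAS", "NAS", "IC"}
COMPUTATIONAL = {"IEA", "RCA"}


def evidence_group(code, homology_based=False):
    """Classify a GO evidence code as 'literature', 'homology' or 'other'.

    `homology_based` is consulted only for IEA and RCA, whose group depends
    on the method that produced the annotation.
    """
    if code in PRIMARY_LITERATURE or code in SECONDARY_LITERATURE:
        return "literature"
    if code == "ISS":
        return "homology"
    if code in COMPUTATIONAL:
        return "homology" if homology_based else "other"
    raise ValueError(f"unknown evidence code: {code}")


def literature_only(annotations):
    """Keep (gene_id, term, code) tuples backed by literature-based evidence."""
    return [a for a in annotations if evidence_group(a[2]) == "literature"]
```

A caller would typically apply `literature_only` before an analysis that needs the most reliable annotations, and relax the filter when coverage matters more than reliability.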
Pathway ontology annotation evidence
Some pathway ontology resources also use a controlled set
of evidence codes for estimating the reliability of different
ontology annotations. The PANTHER pathway database
uses the GO evidence codes for direct evidence and links
to ancestral nodes in phylogenetic trees to trace homology
inferences [47••]. The Reactome database has two types of
evidence [48••]: direct evidence and orthology evidence.
Orthology is a particularly stringent homology relationship,
implying only speciation events since the last common
ancestor of two genes (i.e. the ‘same’ gene in two different
species). The BioCyc database has its own evidence
ontology [49••]. By contrast, the KEGG database [50]
makes use of orthology clusters but does not provide
information about which genes have direct evidence from
literature, and which genes are inferred orthologs.
Conclusions
Ontologies for gene function have a crucial role in translating
genomic data into models of biological function.
Ontologies are constructed to represent a domain of
knowledge using concepts and relationships; the biological
knowledge domain is an area of extremely active
research and, therefore, the corresponding ontologies will
continue to evolve rapidly. The Gene Ontology is being
constantly expanded and revised in a collaborative manner.
Pathway ontologies provide detailed biochemical
relationships between molecular types; these relationships
are complementary to the representation in the
Gene Ontology, and, indeed, can be explicitly connected
to Gene Ontology terms. However, pathway ontologies
have not reached the maturity of the Gene Ontology;
there is as yet no centralized repository for these ontologies,
nor a coordinated process for reaching community
consensus on representing specific pathways.
Biological ontology annotation is the process by which
specific genomic regions — genes (in the broadest sense)
and their products — are linked to concepts in biological
ontologies. Each annotation is an inference that rests on
specific evidence. The type of evidence is crucial for
estimating the relative reliability of different annotations,
although few users of Gene Ontology data have taken full
advantage of evidence codes. Most gene function ontology
annotations rely on homology inferences, and the
reliability of a given homology-based annotation depends
crucially on the computational methodology used to infer
the inheritance of function from a common ancestor.
Gene Ontology annotations have been used widely in
several application areas, but by far the most common use
has been in forming biological hypotheses from large-
scale genomic data such as gene expression studies
(http://www.ebi.ac.uk/GOA/users.html). In most cases,
these hypotheses derive from statistically significant overlaps
between groups of coregulated genes and GO terms.
The continued development of protein function ontologies,
such as the more detailed representation in pathway
ontologies, will enable even more sophisticated analysis
of large-scale biological experiments. Experiments, in
turn, will lead to revisions in the ontologies, paving the
way toward systems biology.
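The text does not name a specific statistical test for these overlaps. One common choice, shown here purely as an assumption, is the hypergeometric test: given a genome of N genes of which K carry a GO term, it gives the probability that a group of n coregulated genes contains at least k genes with that term by chance. The function and parameter names are illustrative, and the sketch omits the multiple-testing correction that would be needed when many GO terms are tested.

```python
from math import comb


def hypergeom_upper_tail(population, annotated, sample, overlap):
    """P(X >= overlap) when drawing `sample` genes without replacement from
    `population` genes, of which `annotated` carry the GO term."""
    max_overlap = min(annotated, sample)
    total = comb(population, sample)
    favorable = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(overlap, max_overlap + 1)
    )
    return favorable / total


# Toy numbers: 10 genes in total, 3 annotated with the term, and a
# coregulated group of 3 genes that are all annotated.
p_value = hypergeom_upper_tail(population=10, annotated=3, sample=3, overlap=3)
```

With these toy numbers only one of the 120 possible groups of three genes consists entirely of annotated genes, so the probability is 1/120.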
Acknowledgements
We thank Anish Kejariwal for generating the data in Figure 2, and for
helpful comments on the manuscript.
References and recommended reading
Papers of particular interest, published within the annual period of
review, have been highlighted as:
• of special interest
•• of outstanding interest
1.•• Gene Ontology Consortium: The Gene Ontology (GO) project in 2006. Nucleic Acids Res 2006, 34:D322-D326.
An excellent update of GO project status, including applications of the GO.
2. Smith B: Ontology. In Blackwell Guide to the Philosophy of Computing and Information. Edited by Floridi L. Oxford: Blackwell; 2003:155-166.
3. Smith B, Williams J, Schulze-Kremer S: The ontology of the gene ontology. AMIA Annu Symp Proc 2003:609-613.
4. Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT et al.: Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet 2000, 25:25-29.
5. Karp PD, Riley M, Paley SM, Pelligrini-Toole A: EcoCyc: an encyclopedia of Escherichia coli genes and metabolism. Nucleic Acids Res 1996, 24:32-39.
6. Keseler IM, Collado-Vides J, Gama-Castro S, Ingraham J, Paley S, Paulsen IT, Peralta-Gil M, Karp PD: EcoCyc: a comprehensive database resource for Escherichia coli. Nucleic Acids Res 2005, 33:D334-D337.
7. Hucka M, Finney A, Sauro HM, Bolouri H, Doyle JC, Kitano H, Arkin AP, Bornstein BJ, Bray D, Cornish-Bowden A et al.: The systems biology markup language (SBML): a medium for representation and exchange of biochemical network models. Bioinformatics 2003, 19:524-531.
8.•• Luciano JS: PAX of mind for pathway researchers. Drug Discov Today 2005, 10:937-942.
BioPAX is an emerging standard for pathway exchange format that combines different concepts of biological pathways into one consistent ontology. This article describes in detail the progress of BioPAX, as well as the challenge ahead of it. See also http://www.biopax.org.
9. Blake JA, Eppig JT, Bult CJ, Kadin JA, Richardson JE: The Mouse Genome Database (MGD): updates and enhancements. Nucleic Acids Res 2006, 34:D562-D567.
10.•• Camon E, Magrane M, Barrell D, Lee V, Dimmer E, Maslen J, Binns D, Harte N, Lopez R, Apweiler R: The Gene Ontology Annotation (GOA) database: sharing knowledge in Uniprot with Gene Ontology. Nucleic Acids Res 2004, 32:D262-D266.
GOA provides high-quality electronic and manual annotations to the UniProt Knowledgebase using the standardized vocabulary of the Gene Ontology. This article describes the GO annotation process implemented for the project.
11. Drysdale RA, Crosby MA: FlyBase: genes and gene models. Nucleic Acids Res 2005, 33:D390-D395.
12. Hirschman JE, Balakrishnan R, Christie KR, Costanzo MC, Dwight SS, Engel SR, Fisk DG, Hong EL, Livstone MS, Nash R et al.: Genome Snapshot: a new resource at the Saccharomyces Genome Database (SGD) presenting an overview of the Saccharomyces cerevisiae genome. Nucleic Acids Res 2006, 34:D442-D445.
13. Rhee SY, Beavis W, Berardini TZ, Chen G, Dixon D, Doyle A, Garcia-Hernandez M, Huala E, Lander G, Montoya M et al.: The Arabidopsis Information Resource (TAIR): a model organism database providing a centralized, curated gateway to Arabidopsis biology, research materials and community. Nucleic Acids Res 2003, 31:224-228.
14. de la Cruz N, Bromberg S, Pasko D, Shimoyama M, Twigger S, Chen J, Chen CF, Fan C, Foote C, Gopinath GR et al.: The Rat Genome Database (RGD): developments towards a phenome database. Nucleic Acids Res 2005, 33:D485-D491.
15. Schwarz EM, Antoshechkin I, Bastiani C, Bieri T, Blasiar D, Canaran P, Chan J, Chen N, Chen WJ, Davis P et al.: WormBase: better software, richer content. Nucleic Acids Res 2006, 34:D475-D478.
16. Sprague J, Bayraktaroglu L, Clements D, Conlin T, Fashena D, Frazer K, Haendel M, Howe DG, Mani P, Ramachandran S et al.: The Zebrafish Information Network: the zebrafish model organism database. Nucleic Acids Res 2006, 34:D581-D585.
17. Jaiswal P, Ni J, Yap I, Ware D, Spooner W, Youens-Clark K, Ren L, Liang C, Zhao W, Ratnapu K et al.: Gramene: a bird’s eye view of cereal genomes. Nucleic Acids Res 2006, 34:D717-D723.
18. Chisholm RL, Gaudet P, Just EM, Pilcher KE, Fey P, Merchant SN, Kibbe WA: DictyBase, the model organism database for Dictyostelium discoideum. Nucleic Acids Res 2006, 34:D423-D427.
19. Haft DH, Selengut JD, White O: The TIGRFAMs database of protein families. Nucleic Acids Res 2003, 31:371-373.
20.•• Bader GD, Cary MP, Sander C: Pathguide: a pathway resource list. Nucleic Acids Res 2006, 34:D504-D506.
A carefully compiled and categorized list of available pathway resources.
21. Finn RD, Mistry J, Schuster-Bockler B, Griffiths-Jones S, Hollich V, Lassmann T, Moxon S, Marshall M, Khanna A, Durbin R et al.: Pfam: clans, web tools and services. Nucleic Acids Res 2006, 34:D247-D251.
22. Servant F, Bru C, Carrere S, Courcelle E, Gouzy J, Peyruc D, Kahn D: ProDom: automated clustering of homologous domains. Brief Bioinform 2002, 3:246-251.
23. Li L, Stoeckert CJ Jr, Roos DS: OrthoMCL: identification of ortholog groups for eukaryotic genomes. Genome Res 2003, 13:2178-2189.
24. Thomas PD, Kejariwal A, Campbell MJ, Mi H, Diemer K, Guo N, Ladunga I, Ulitsky-Lazareva B, Muruganujan A, Rabkin S et al.: PANTHER: a browsable database of gene products organized by biological function, using curated protein family and subfamily classification. Nucleic Acids Res 2003, 31:334-341.
25. Mi H, Lazareva-Ulitsky B, Loo R, Kejariwal A, Vandergriff J, Rabkin S, Guo N, Muruganujan A, Doremieux O, Campbell MJ et al.: The PANTHER database of protein families, subfamilies, functions and pathways. Nucleic Acids Res 2005, 33:D284-D288.
26. Li H, Coghlan A, Ruan J, Coin LJ, Heriche JK, Osmotherly L, Li R, Liu T, Zhang Z, Bolund L et al.: TreeFam: a curated database of phylogenetic trees of animal gene families. Nucleic Acids Res 2006, 34:D572-D580.
27. Dehal PS, Boore JL: A phylogenomic gene cluster resource: the Phylogenetically Inferred Groups (PhIGs) database. BMC Bioinformatics 2006, 7:201.
28. Mi H, Vandergriff J, Campbell M, Narechania A, Majoros W, Lewis S, Thomas PD, Ashburner M: Assessment of genome-wide protein function classification for Drosophila melanogaster. Genome Res 2003, 13:2118-2128.
29. Okazaki Y, Furuno M, Kasukawa T, Adachi J, Bono H, Kondo S, Nikaido I, Osato N, Saito R, Suzuki H et al.: Analysis of the mouse transcriptome based on functional annotation of 60,770 full-length cDNAs. Nature 2002, 420:563-573.
30. Carninci P, Kasukawa T, Katayama S, Gough J, Frith MC,Maeda N, Oyama R, Ravasi T, Lenhard B, Wells C et al.:The transcriptional landscape of the Mamm Genome.Science 2005, 309:1559-1563.
31. Hayete B, Bienkowska JR: Gotrees: predicting go associationsfrom protein domain composition using decision trees.Pac Symp Biocomput 2005:127-138.
32. Ogawa T, Ueda Y, Yoshimura K, Shigeoka S: Comprehensiveanalysis of cytosolic Nudix hydrolases in Arabidopsis thaliana.J Biol Chem 2005, 280:25277-25283.
33. Auldridge ME, Block A, Vogel JT, Dabney-Smith C, Mila I,Bouzayen M, Magallanes-Lundback M, DellaPenna D,McCarty DR, Klee HJ: Characterization of three members of theArabidopsis carotenoid cleavage dioxygenase familydemonstrates the divergent roles of this multifunctionalenzyme family. Plant J 2006, 45:982-993.
34. Lutfiyya LL, Xu N, D’Ordine RL, Morrell JA, Miller PW, Duff SM:Phylogenetic and expression analysis of sucrose phosphatesynthase isozymes in plants. J Plant Physiol 2006, Epub ahead ofprint, doi:10.1016/j.jplph.2006.04.014.
35. •• Maeda N, Kasukawa T, Oyama R, Gough J, Frith M, Engstrom PG, Lenhard B, Aturaliya RN, Batalov S, Beisel KW et al.: Transcript annotation in FANTOM3: mouse gene catalog based on physical cDNAs. PLoS Genet 2006, 2:e62.
The third large-scale mouse transcript annotation, FANTOM3; the process contains several improvements over that of FANTOM1.
36. • Muller HM, Kenny EE, Sternberg PW: Textpresso: an ontology-based information retrieval and extraction system for biological literature. PLoS Biol 2004, 2:e309.
A description of the text retrieval system used by many model organism GO annotation projects.
37. • Liu Y, Navathe SB, Civera J, Dasigi V, Ram A, Ciliax BJ, Dingledine R: Text mining biomedical literature for discovering gene-to-gene relationships: a comparative study of algorithms. IEEE/ACM Trans Comput Biol Bioinform 2005, 2:62-76.
A good description of several algorithms for text mining in the context of functional annotation.
38. Hoffmann R, Valencia A: Implementing the iHOP concept for navigation of biomedical literature. Bioinformatics 2005, 21(Suppl 2):ii252-ii258.
39. Wade CH, Umbarger MA, McAlear MA: The budding yeast rRNA and ribosome biosynthesis (RRB) regulon contains over 200 genes. Yeast 2006, 23:293-306.
40. Samanta MP, Liang S: Predicting protein functions from redundancies in large-scale protein interaction networks. Proc Natl Acad Sci USA 2003, 100:12579-12583.
41. Schattner P, Decatur WA, Davis CA, Ares M Jr, Fournier MJ, Lowe TM: Genome-wide searching for pseudouridylation guide snoRNAs: analysis of the Saccharomyces cerevisiae genome. Nucleic Acids Res 2004, 32:4281-4296.
42. Lowe TM, Eddy SR: A computational screen for methylation guide snoRNAs in yeast. Science 1999, 283:1168-1171.
43. Kiss-Laszlo Z, Henry Y, Bachellerie JP, Caizergues-Ferrer M, Kiss T: Site-specific ribose methylation of preribosomal RNA: a novel function for small nucleolar RNAs. Cell 1996, 85:1077-1088.
44. Emanuelsson O, Nielsen H, Brunak S, von Heijne G: Predicting subcellular localization of proteins based on their N-terminal amino acid sequence. J Mol Biol 2000, 300:1005-1016.
45. Krogh A, Larsson B, von Heijne G, Sonnhammer EL: Predicting transmembrane protein topology with a hidden Markov model: application to complete genomes. J Mol Biol 2001, 305:567-580.
46. King OD, Foulger RE, Dwight SS, White JV, Roth FP: Predicting gene function from patterns of annotation. Genome Res 2003, 13:896-904.
47. •• Mi H, Guo N, Kejariwal A, Thomas PD: PANTHER version 6: protein sequence and function evolution data with expanded representation of biological pathways. Nucleic Acids Res 2007, 35:D247-D252.
A high-level description of a pathway ontology, summarized in a figure.
48. •• Joshi-Tope G, Gillespie M, Vastrik I, D’Eustachio P, Schmidt E, de Bono B, Jassal B, Gopinath GR, Wu GR, Matthews L et al.: Reactome: a knowledgebase of biological pathways. Nucleic Acids Res 2005, 33:D428-D432.
An excellent, succinct description of a pathway ontology (data model).
49. •• Karp PD, Ouzounis CA, Moore-Kochlacs C, Goldovsky L, Kaipa P, Ahren D, Tsoka S, Darzentas N, Kunin V, Lopez-Bigas N: Expansion of the BioCyc collection of pathway/genome databases to 160 genomes. Nucleic Acids Res 2005, 33:6083-6089.
A description of the BioCyc collection of pathway/genome databases with an emphasis on community contribution and more accurate pathway prediction.
50. Kanehisa M, Goto S, Hattori M, Aoki-Kinoshita KF, Itoh M, Kawashima S, Katayama T, Araki M, Hirakawa M: From genomics to chemical genomics: new developments in KEGG. Nucleic Acids Res 2006, 34:D354-D357.
51. Das I, Craig C, Funahashi Y, Jung KM, Kim TW, Byers R, Weng AP, Kutok JL, Aster JC, Kitajewski J: Notch oncoproteins depend on γ-secretase/presenilin activity for processing and function. J Biol Chem 2004, 279:30771-30780.
52. Green ML, Karp PD: A Bayesian method for identifying missing enzymes in predicted metabolic pathway databases. BMC Bioinformatics 2004, 5:76.
Current Opinion in Chemical Biology 2007, 11:4–11