Principles of Ontology Construction

Overview of tutorial

Biological data must be readily accessible, comparable, and correlated to efficiently provide relevant answers to scientific inquiries and thus enable discoveries. Well-principled ontological frameworks can provide a means to accomplish this. The caveat is that the ontologies must be both well formed and biologically intuitive. An increasing number of groups are developing ontologies in assorted biological domains. However, these efforts will only be beneficial and aid biological data integration if certain criteria are met: the ontologies must be non-overlapping, they must be accepted and used by the community, and they must be well-principled. The methods and approach required for the creation of usable ontologies are the focus of this tutorial.

Organization

This tutorial handout contains relevant reading material (described below) and the presentation itself. The presentation is organized into four sections:

1. The sociology of ontology building (Michael Ashburner)

2. The fundamental principles of ontology construction (Barry Smith)

3. Case studies of errors and corrections based on these principles (David Hill and Rama Balakrishnan)

4. A debate on the counter-tensions between pragmatics and purity.

Reading Material

Two pieces that provide a historical perspective:

1. Ashburner M, Lewis SE. 2002. On ontologies for biologists: the Gene Ontology - untangling the web. Novartis Found Symp 247: 66-80.

2. Lewis SE. 2005. Gene Ontology: looking backwards and forwards. Genome Biology 6: 103.

A philosopher’s critique of some representative biomedical ontologies:

3. Smith B. 2005. Ontologies in Biomedicine: The Good, the Bad, and the Ugly. Personal communication.

A small assortment of active ontology projects for illustration:

4. Gkoutos GV, Green ECJ, Mallon A-M, Hancock JM and Davidson D. 2004. Using ontologies to describe mouse phenotypes. Genome Biology, 6:R8.

5. Bard J, Rhee SY, Ashburner M. 2005. An ontology for cell type. Genome Biology, 6:R21.

6. Eilbeck K, Lewis SE, Mungall CJ, Yandell M, Stein L, Durbin R and Ashburner M. 2005. The Sequence Ontology: a tool for the unification of genome annotations. Genome Biology, 6:R44.

7. Rosse C and Mejino JeLV. 2003. A reference ontology for biomedical informatics: the Foundational Model of Anatomy. Journal of Biomedical Informatics 36:478–500.

The group used for the case study evaluation:

8. The Gene Ontology Consortium. 2004. The Gene Ontology (GO) database and informatics resource. Nucleic Acids Res 32: D258-D261.

Methodology for defining ontological relationships:

9. Smith B, Ceusters W, Klagges B, Köhler J, Kumar A, Lomax J, Mungall C, Neuhaus F, Rector AL and Rosse C. 2005. Relations in biomedical ontologies. Genome Biology, 6:R46.

Novartis Symposium – November 2001.

On ontologies for biologists: The Gene Ontology – untangling the web.

Michael Ashburner, Department of Genetics, University of Cambridge and EMBL – European Bioinformatics Institute, Hinxton, Cambridge, UK.

and

Suzanna Lewis, Berkeley Drosophila Genome Project, Lawrence Berkeley National Laboratory, University of California, Berkeley, CA, USA.

Department of Genetics, University of Cambridge, Downing Street, Cambridge CB2 3EH

The European Bioinformatics Institute, The Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD

Berkeley Drosophila Genome Project, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA.

[email protected]; [email protected]

Abstract. The mantra of the “post-genomic” era is “gene function”. Yet surprisingly little attention has been given to how functional and other information concerning genes is to be captured, made accessible to biologists or structured in a computable form. The aim of the Gene Ontology Consortium is to provide a framework for both the description and the organisation of such information. The GO Consortium is presently building three structured controlled vocabularies, one for each of three discrete biological domains, which can be used to describe the molecular function, biological roles and cellular locations of gene products.

Keywords: Gene function; ontologies; controlled vocabularies; databases

Introduction and status.

The GO Consortium’s work is motivated by the need of both biologists and bioinformaticists for a method for rigorously describing the biological attributes of gene products (GO Consortium 2000, 2001). A comprehensive lexicon (with mutually understood meanings) describing those attributes of molecular biology that are common to more than one life form is essential to enable communication, in both computer and natural languages. In this era, when new sequenced genomes are rapidly being completed, all needing to be discussed, described, and compared, the development of a common language is crucial.

The most familiar of these attributes is that of “function”. Indeed, as early as 1993 Monica Riley (Riley 1993) attempted a hierarchical functional classification of all the then known proteins of Escherichia coli. Since then, there have been other attempts to provide vocabularies and ontologies [1] for the description of gene function, either explicitly or implicitly (e.g. Dure 1991, Commission of Plant Gene Nomenclature 1994, Fleischmann et al 1995, Overbeek et al 1997, Takai-Igarashi, Nadaoka, Kaminuma 1998, Baker et al 1999, Mewes et al 1999, Overbeek et al 2000, Stevens et al 2000; see Riley 1988, Rison et al 2000, Sklyar 2001 for reviews, Karp et al. 2002). Riley has recently updated her classification for the proteins of E. coli (Serres et al 2001).

[1] Philosophically speaking an ontology is “the study of that which exists” and is defined in opposition to “epistemology”, which means “the study of that which is known or knowable”. Within the field of artificial intelligence the term ontology has taken on another meaning: “A specification of a conceptualization that is designed for reuse across multiple applications and implementations” (Karp 2000) and it is in this sense that we are using it.

One problem with many (though not all: e.g. Schulze-Kremer 1997, 1998, Karp et al 2002a, 2002b) efforts prior to that of the GO Consortium is that they lacked semantic clarity due, to a large degree, to the absence of definitions for the terms used. Moreover, these previous classifications were usually not explicit concerning the relationships between different (e.g. “parent” and “child”) terms or concepts. A further problem with these efforts was that, by and large, they were developed as one-off exercises, with little consideration given to revision and implementation beyond the domain for which they were first conceived. They generally also lacked the
apparatus required for both persistence and consistent use by others, i.e. versioning, archiving and unique identifiers attached to their concepts.

The GO vocabularies distinguish three orthogonal domains (vocabularies); the concepts within one vocabulary do not overlap those within another. These domains are molecular_function, biological_process and cellular_component, defined as follows:

molecular_function: An action characteristic of a gene product.

biological_process: A phenomenon marked by changes that lead to a particular result, mediated by one or more gene products.

cellular_component: The part, or parts, of a cell of which a gene product is a component; for this purpose includes the extracellular environment of cells.

The initial objective of the GO Consortium is to provide a rich structured vocabulary of terms (concepts) for use by those annotating gene products within an informatics context, be it a database of the genetics and genomics of a model organism, a database of protein sequences or a database of information about gene products, such as might be obtained from a DNA microarray experiment.

In GO the annotation of gene products with GO terms follows two guidelines: (i) that all annotations include the evidence upon which that assertion is based and, (ii) that the evidence provided for each annotation includes attribution to an available external source, such as a literature reference.

Databases using GO for annotation are widely distributed. Therefore an additional task of the Consortium is to provide a centralized holding site for their annotations. GO provides a simple format for contributing databases to submit their annotations to a central annotation database maintained by GO. The annotation data submitted includes the association of gene products with GO terms as well as ancillary information, such as evidence and attribution. These annotations can then form the basis for queries – either by an individual or a computer program.
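The two annotation guidelines above (evidence and attribution) can be sketched as a small record type. This is only an illustration, not the Consortium's submission format: the class name, the field names, the `EX:` term identifiers and the `REF:` references are all hypothetical placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    """One gene product / term association with its supporting evidence."""
    gene_product: str  # database:object form, e.g. "SWISS-PROT:P12345"
    term_id: str       # identifier of the vocabulary term (placeholder here)
    evidence: str      # short phrase describing the basis of the assertion
    reference: str     # attribution to an external source

    def __post_init__(self):
        # Guideline (i): every annotation states its evidence.
        if not self.evidence.strip():
            raise ValueError("annotation lacks evidence")
        # Guideline (ii): the evidence is attributed to an external source.
        if not self.reference.strip():
            raise ValueError("annotation lacks attribution")


def products_for_term(annotations, term_id):
    """Answer a simple query: which gene products carry this term?"""
    return sorted({a.gene_product for a in annotations if a.term_id == term_id})


# Hypothetical records; identifiers and references are placeholders.
annotations = [
    Annotation("SWISS-PROT:P12345", "EX:0000001", "inferred from direct assay", "REF:1"),
    Annotation("SWISS-PROT:P67890", "EX:0000001", "inferred from direct assay", "REF:2"),
    Annotation("SWISS-PROT:P12345", "EX:0000002", "inferred from direct assay", "REF:1"),
]
print(products_for_term(annotations, "EX:0000001"))
```

Rejecting a record at construction time is one possible design choice; a central collection point could equally accept the record and flag it for review.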
At present gene product associations are available for several different organisms, including two yeasts (S. pombe and S. cerevisiae), two invertebrates (Caenorhabditis elegans and Drosophila melanogaster), two mammals (mouse and rat) and a plant, Arabidopsis thaliana. In addition, the first bacterium (Vibrio cholerae) has now been annotated with GO and efforts are now underway to annotate all 60 or so publicly available bacterial genomes. Over 80% of the proteins in the SWISS-PROT protein database have been annotated with GO terms (the majority by automatic annotation, see below); these include the SWISS-PROT to GO annotations of over 16,000 human proteins (available at www.geneontology.org/gene-associations/gene_association.goa). Some 7,000 human proteins were also annotated with GO by Proteome Inc. and are available from LocusLink (Pruitt, Maglott 2001). A number of other organismal databases are in the process of using GO for annotation, including those for Plasmodium falciparum (and other parasitic protozoa) (M. Berriman, personal communication), Dictyostelium discoideum (R. Chisholm, personal communication) and the grasses (rice, maize, wheat, etc) (L. Vincent, personal communication).

The availability of these sets of data has led to the construction of GO browsers which enable users to query them all simultaneously for genes whose products serve a particular function, play a role in a particular biological process or are located in a particular sub-cellular part (AmiGO 2001). These associations are also available as tab-delimited tables (www.geneontology.org/gene-associations/) or with protein sequences. GO thus achieves de facto a degree of database integration (see Leser 1998), one holy grail of applied bioinformatics.

Availability.

The products of the GO Consortium’s work can be obtained from their web home page: www.geneontology.org. All of the efforts of the GO Consortium are placed in the public domain and can be used by academia or industry alike without any restraint, other than that they cannot be modified and then passed off as the products of the Consortium.
This is true for all major classes of the GO Consortium’s products: the controlled vocabularies, the gene-association tables, and software for browsing and editing the GO vocabularies and gene association tables (AmiGO 2001, DAG Edit 2001). Thus the GO Consortium’s work is very much in the spirit of the Open Source tradition in software development (DiBona, Ockman, Stone 1999; OpenSource 2001). The GO ontologies and their associated files are available as text files, in XML or as tables for a MySQL database.

The structure of the GO ontologies.

All biologists are familiar with hierarchical graphs – the system of classification introduced by Linnaeus has been a bedrock for biological research for some 250 years. In a Linnean taxonomy the nodes of the graphs are the names of taxa, be they phyla or species; the edges between these nodes represent the relationship “is a member of” between parent and child nodes. Thus the node “species:Drosophila melanogaster” “is a member of” its parent node “genus:Drosophila”. Useful as hierarchies are, they suffer from a serious limitation: each node has one and only one parental node – no species is a member of two (or more) genera, no genus a member of two (or more) families.

Yet in the broader world of biology an object may well have two or more parents. Consider, as a simple example, a protein that both binds DNA and hydrolyses ATP. It is equally correct to describe this as a “DNA binding protein” as it is to describe it as a “catalyst” (or enzyme); therefore it should be a child of both within a tree structure. Not all DNA binding proteins are enzymes, not all enzymes are DNA binding proteins, yet some are and we need to be able to represent these facts conceptually. For this reason GO uses a structure known as a directed acyclic graph (DAG), a graph in which nodes can have many parents but in which cycles – that is, a path which starts and ends at the same node – are not allowed. All nodes must have at least one parent node, with the exception of the root of each graph.

Alice replies to Humpty Dumpty’s inquiry as to the meaning of her name: “Must a name mean something?” “Of course it must,” replies Humpty Dumpty (Heath 1974:188). This is as true in the real world as in that through the looking glass.
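The directed acyclic graph described above can be sketched with a plain child-to-parents mapping. This is a minimal illustration rather than the Consortium's software; the function names are assumptions, and the node “ATP-hydrolysing DNA binding protein” is a hypothetical label for the protein in the example.

```python
def has_cycle(parents):
    """Return True if some path starts and ends at the same node.

    `parents` maps each node to a list of its parent nodes.
    """
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    state = {}

    def visit(node):
        state[node] = GREY
        for parent in parents.get(node, ()):
            colour = state.get(parent, WHITE)
            if colour == GREY:
                return True  # walked back onto the current path
            if colour == WHITE and visit(parent):
                return True
        state[node] = BLACK
        return False

    return any(state.get(n, WHITE) == WHITE and visit(n) for n in parents)


def roots(parents):
    """Nodes without any parent; a well-formed graph here has exactly one."""
    nodes = set(parents) | {p for ps in parents.values() for p in ps}
    return {n for n in nodes if not parents.get(n)}


# A node with two parents, which a strict hierarchy could not express.
parents = {
    "molecular_function": [],
    "DNA binding protein": ["molecular_function"],
    "catalyst": ["molecular_function"],
    "ATP-hydrolysing DNA binding protein": ["DNA binding protein", "catalyst"],
}
print(has_cycle(parents), roots(parents))
```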
The nodes in the GO controlled vocabularies are concepts, concepts that describe the molecular function, biological role or cellular location of gene products. The terms used by GO are simply a shorthand way of referring to these concepts, concepts that are restricted by their natural language definitions. (At present only 20% of the 10,000 or so GO terms are defined but a major effort to correct this situation will be launched early in 2002). Each and every GO term has a unique identifier consisting of the prefix GO: and an integer, for example, GO:0036562.

But what happens if a GO term changes? A change may be as trivial as correcting a spelling error or as drastic as replacing the lexical string entirely. If the change does not change the meaning of the term then there is no change to the GO identifier. If the meaning is changed, however, then the old term, its identifier and definition are retired (they are marked as “obsolete”; they never disappear from the database) and the new term gets a new identifier and a new definition. Indeed this is true even if the lexical string is identical between old and new terms; thus if we use the same words to describe a different concept then the old term is retired and the new is created with its own definition and identifier. This is the only case where, within any one of the three GO ontologies, two or more concepts may be lexically identical; all except one of them must be flagged as being obsolete. Because the nodes represent semantic concepts (as described by their definitions) it is not strictly necessary that the terms are unique, but this restriction is imposed in order to facilitate searching. This mechanism helps with maintaining and synchronizing other databases that must track changes within GO, which is always rapidly changing by design. Keeping everything and everyone consistent is a difficult problem that we had to solve in order to permit this dynamic adaptability of GO.

The edges between the nodes represent the relationships between them. GO uses two very different classes of semantic relationship between nodes: isa and partof. Both the isa and partof relationships within GO should be fully transitive. That is to say an instance of a concept is also an instance of all of the parents of that concept (to the root); a part concept that is partof a whole concept is a partof all of the parents of that concept (to the root). Both relationships are reflexive (see below). The isa relationship is one of subsumption, a relationship that permits refinement in concepts and definitions and thus enables annotators to draw coarser or finer distinctions, depending on the present degree of knowledge.
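The identifier policy described above (same meaning keeps the identifier; a changed meaning retires the old term and mints a new one) can be sketched as follows. The class and method names are hypothetical, and the `EX:` prefix is used so that the identifiers are not mistaken for real GO identifiers.

```python
class Vocabulary:
    """Toy vocabulary illustrating stable identifiers and obsoletion."""

    def __init__(self, prefix="EX"):
        self.prefix = prefix
        self.terms = {}    # identifier -> {"name", "definition", "obsolete"}
        self._counter = 0

    def add(self, name, definition):
        # Names must be unique among live terms; obsolete terms may share one.
        for term in self.terms.values():
            if term["name"] == name and not term["obsolete"]:
                raise ValueError(f"a live term named {name!r} already exists")
        self._counter += 1
        term_id = f"{self.prefix}:{self._counter:07d}"
        self.terms[term_id] = {"name": name, "definition": definition, "obsolete": False}
        return term_id

    def reword(self, term_id, new_name):
        # Meaning unchanged (e.g. a spelling fix): the identifier is kept.
        self.terms[term_id]["name"] = new_name

    def change_meaning(self, term_id, name, definition):
        # Meaning changed: retire the old term (never delete it), then
        # create a new term with its own identifier and definition.
        self.terms[term_id]["obsolete"] = True
        return self.add(name, definition)


vocab = Vocabulary()
old_id = vocab.add("protien binding", "first definition")
vocab.reword(old_id, "protein binding")  # same identifier
new_id = vocab.change_meaning(old_id, "protein binding", "revised definition")
print(old_id, new_id, vocab.terms[old_id]["obsolete"])
```

Databases that store only identifiers can then detect a retired term by checking its obsolete flag, rather than by comparing names.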
This class of relationship is known as hyponymy (and its reflexive relation hypernymy) to the authors of the lexical database WordNet (Fellbaum 1998). Thus the term DNA binding is a hyponym of the term nucleic acid binding; conversely nucleic acid binding is a hypernym of DNA binding. DNA binding is the more specific of the two terms, and hence is the child. It has been argued that the isa relationship, both generally (see below) and as used by GO (P. Karp, personal communication; S. Schulze-Kremer, personal communication) is complex and that further information describing the nature of the relationship should be captured. Indeed this is true, because the precise connotation of the isa relationship is dependent upon each unique pairing of terms and the meanings of these terms. Thus the isa relationship is not a relationship between terms, but rather is a relationship between particular concepts. Therefore the isa relationship is not a single type of relationship; its precise meaning is dependent on the parent and child terms it connects. The relationship simply describes the parent as the more general concept and the child as the more precise concept and says nothing about how the child specifically refines the concept.

The partof relationship (meronymy and its reflexive relationship holonymy) (Cruse 1986, cited in Miller 1998) is also semantically complex as used by GO (see: Wierzbicka 1984 (cited in Miller 1998), Miller 1998, Priss 1998, Rogers and Rector 2000). It may mean that a child node concept “is a component of” its parent concept. (The reflexive relationship (holonymy) would be “has a component”). The mitochondrion “is a component of” the cell; the small ribosomal subunit “is a component of” the ribosome. This is the most common meaning of the partof relationship in the GO cellular_component ontology. In the biological_process ontology, however, the semantic meaning of partof can be quite different: it can mean “is a subprocess of”; thus the concept amino acid activation “is a subprocess of” the concept protein biosynthesis. It remains for the GO Consortium to clarify these semantic relationships while, at the same time, not making the vocabularies too cumbersome and difficult to maintain and use.

Meronymy and hyponymy cause terms to “become intertwined in complex ways” (Miller 1998:38). This is because one term can be a hyponym with respect to one parent, but a meronym with respect to another.
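Since both relationships are meant to be fully transitive, the set of all ancestors of a concept can be collected by following every edge to the root, whatever its type. A minimal sketch, assuming a mapping from each child to its typed parent edges; the edge list is illustrative only (in particular the placement of ribosome and cell is an assumption, not taken from the GO files).

```python
def ancestors(edges, term):
    """Collect every concept reachable from `term` by isa or partof edges.

    `edges` maps a child to a list of (relationship, parent) pairs.
    """
    seen = set()
    stack = [term]
    while stack:
        node = stack.pop()
        for _relationship, parent in edges.get(node, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


# Illustrative cellular_component fragment with typed edges.
edges = {
    "small ribosomal subunit": [("partof", "ribosome")],
    "ribosome": [("partof", "cell")],
    "mitochondrion": [("partof", "cell")],
    "cell": [("isa", "cellular_component")],
}
print(sorted(ancestors(edges, "small ribosomal subunit")))
```

The sketch deliberately ignores the relationship type while walking upward; a stricter version might track which relationship types occur along each path, since the text notes that their precise meanings differ.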
For example, the concept cytosolic small ribosomal subunit is both a meronym of the concept cytosolic ribosome and a hyponym of the concept small ribosomal subunit, since there also exists the concept mitochondrial small ribosomal subunit.

The third semantic relationship represented in GO is the familiar relationship of synonymy. Each concept defined in GO (i.e. each node) has one primary term (used for identification) and may have zero or many synonyms. In the sense of the WordNet noun lexicon a term and its synonyms at each node represents a synset (Miller 1998); in GO, however, the relationship between synonyms is strong, and not as context dependent as in WordNet’s synsets. This means that in GO all members of a synset are completely interchangeable in whatever context the terms are found. That is to say, for example, that “lymphocyte receptor of death” and “death receptor 3” are equivalent labels for the same concept and are conceptually identical. One consequence of this strict usage is that synonyms are not inherited from parent to child concepts in GO.

The final semantic relationship in GO is a cross-reference to some other database resource, representing the relationship “is equivalent to”. Thus the cross-reference between the GO concept alcohol dehydrogenase and the Enzyme Commission’s number EC:1.1.1.1 is an equivalence (but not necessarily an identity; these cross-references within GO are for a practical rather than theoretical purpose). As with synonyms, database cross-references are not inherited from parent to child concept in GO.

As we have expressed, we are not fully satisfied that the two major classes of relationship within GO, isa and partof, are yet defined as clearly as we would like. There is, moreover, some need for a wider agreement in this field on the classes of relationship that are required to express complex relationships between biological concepts. Others are using relationships that, at first sight, appear to be similar to these: for example within the aMAZE database (van Helden et al 2001) the relationships ContainedCompartment and SubType appear to be similar to GO’s partof and isa, respectively. Yet ContainedCompartment and partof have, on closer inspection, different meanings (GO’s partof seems to be a much broader concept than aMAZE’s ContainedCompartment).

The three domains now considered by the GO Consortium, molecular_function, biological_process and cellular_component, are orthogonal. They can be applied independently of each other to describe separable characteristics.
A curator can describe where some protein is found without knowing what process it is involved in. Likewise, it may be known that a protein is involved in a particular process without knowing its function. There are no edges between the domains, although we realize that there are relationships between them. This constraint was made because of problems in defining the semantic meanings of edges between nodes in different ontologies (see Rogers and Rector (2000) for a discussion of the problems of transitivity met within an ontology that includes different domains of knowledge). This structure is, however, to a degree, artificial. Thus all (or, certainly most) gene products annotated with the GO function term transcription factor will be involved in the process transcription, DNA-dependent and the majority will have the cellular location nucleus. This really becomes important not so much within GO itself, but at the level of the use of GO for annotation. For example, if a curator were annotating genes in FlyBase, the genetic and genomic database for Drosophila, then it would be an obvious convenience for a gene product annotated with the function term transcription factor to inherit both the process transcription, DNA-dependent and the location nucleus. There are plans to build a tool to do this, but one that allows a curator to say to the system “in this case do not inherit” where to do so would be misleading or wrong.

Annotation using GO.

There are two general methods for using GO to annotate gene products within a database. These may be characterised as the ‘curatorial’ and ‘automatic’ methods. By ‘curatorial’ we mean that a domain expert annotates gene products with GO terms as the result of either reading the relevant literature or evaluating a computational result. Automated methods rely solely on computational sequence comparisons such as the result of a BLAST (Altschul et al 1990) or InterProScan (Zdobnov and Apweiler 2001) analysis of a gene product’s known or predicted protein sequence.
Whatever method is used, the basis for the annotation is then summarised, using a small controlled list of phrases (www.geneontology.org/GO.evidence); perhaps “inferred from direct assay” if annotating on the evidence of experimental data in a publication or “inferred from sequence comparison with database:object” (where database:object could be, for example, SWISS-PROT:P12345, where P12345 is a sequence accession in the SWISS-PROT database of protein sequences), if the inference is made from a BLAST or InterProScan analysis which has been evaluated by a curator.

The incorrect inference of a protein’s or predicted protein’s function from sequence comparison is well known to be a major problem and one that has often contaminated both databases and the literature (Kyrpides and Ouzounis 1998, for one example among many). The syntax of GO annotation in databases allows curators to annotate a protein as NOT having a particular function despite impressive BLAST data. For example, in the genome of Drosophila melanogaster there are at least 480 proteins or predicted proteins that any casual curation of BLASTP output would assign the function peptidase (or one of its child concepts) yet, on closer inspection, at least 14 of these lack residues required for the catalytic function of peptidases (D. Coates, personal communication). In FlyBase these are curated with the “function” NOT peptidase. What is needed is a comprehensive set of computational rules to allow curators, who cannot be experts in every protein family, to automatically detect the signatures of these cases, cases where the transitive inference would be incorrect (Kretschmann, Fleischmann, Apweiler 2001). It is also conceivable that triggers to correct dependent annotations could be constructed because GO annotations track the identifiers of the sequence on which annotation is based.

The quality of curatorial annotation will be proportional both to the extent of the available evidence for annotation and to the human resources available for annotation. Potentially, its quality is high but at the expense of human effort. For this reason several ‘automatic’ methods for the annotation of gene products are being developed. These are especially valuable for a first-pass annotation of a large number of gene products, those, for example, from a complete genome sequencing project. One of the first to be used was M. Yandell’s program LoveAtFirstSight developed for the annotation of the gene products predicted from the complete genome of Drosophila melanogaster (Adams et al 2000). Here, the sequences were matched (by BLAST) to a set of sequences from other organisms that had already been curated using GO.
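The general idea of transferring terms from an already curated match, while respecting a curator's NOT, can be sketched as below. This is not the LoveAtFirstSight algorithm, whose details are not given here: the function name and data layout are assumptions, the sequence matches are taken as already computed, and the query identifiers CG0001 and CG0002 are hypothetical.

```python
def transfer_annotations(hits, curated, negated):
    """Copy terms from each query's matched curated sequence.

    hits:    query id -> id of the curated sequence it matched
    curated: curated sequence id -> set of terms assigned by curators
    negated: set of (query id, term) pairs a curator has marked NOT
    """
    result = {}
    for query, subject in hits.items():
        for term in sorted(curated.get(subject, ())):
            if (query, term) in negated:
                continue  # the curator's NOT overrides the sequence match
            evidence = f"inferred from sequence comparison with {subject}"
            result.setdefault(query, []).append((term, evidence))
    return result


hits = {"CG0001": "SWISS-PROT:P12345", "CG0002": "SWISS-PROT:P12345"}
curated = {"SWISS-PROT:P12345": {"peptidase"}}
negated = {("CG0002", "peptidase")}  # e.g. lacks the catalytic residues
print(transfer_annotations(hits, curated, negated))
```

Recording the matched sequence in the evidence string keeps the dependency visible, so a later correction to the curated sequence could be traced to the annotations derived from it.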
Three other methods, DIAN (Pouliot et al 2001), PANTHER (Kerlavage et al 2002) and GO Editor (Xie et al 2002), also rely on a comprehensive database of sequences or sequence clusters that have been annotated with GO terms by curation, albeit with a large element of automation in the early stages of the process. PANTHER is a method in which proteins are clustered into “phylogenetic” families and sub-families, which are then annotated with GO terms by expert curators. New proteins can then be matched to a cluster (in fact to a Hidden Markov Model describing the conserved sequence patterns of that cluster) and transitively annotated with appropriate GO terms. In a recent experiment PANTHER performed well in comparison with the curated set of GO annotations of Drosophila genes in FlyBase (Mi et al in preparation). DIAN matches proteins to a curated set using two algorithms: one is vocabulary based and is only suitable for sequences that already have some attached annotation; the other is domain based, using Pfam Hidden Markov Models of protein domains.

Even simpler methods have also been used. For example, much of the first-pass GO annotation of mouse proteins was done by parsing the KEYWORDs attached to SWISS-PROT records of mouse proteins, using a file that semantically mapped these KEYWORDs to GO concepts (see www.geneontology.org/external2go/spkw2go) (Hill et al 2001).

Automatic annotations have the advantage of speed, essential if large protein data sets are to be analysed within a short time. Their disadvantage is that the accuracy of annotation may not be high and the risk of errors by incorrect transitive inference is great. For this reason, all annotations made by such methods are tagged in GO gene-association files as being “inferred by electronic annotation”. Ideally, all such annotations are reviewed by curators and subsequently replaced by annotations of higher confidence.

The problems of complexity and redundancy.

There are in the biological_process ontology many words or strings of words that have no business being there. The major examples of offending concepts are chemical names and anatomical parts. There are two reasons why this is problematic, one practical and the other of more theoretical importance. The practical problem is one of maintainability. The number of chemical compounds that are metabolised by living organisms is vast. Each one deserves its own unique set of GO terms: carbohydrate metabolism (and its children carbohydrate biosynthesis, carbohydrate catabolism), carbohydrate transport and so on.
In the ideal world there would exist a public domain ontology for natural (and xenobiotic) compounds:

carbohydrate
  simple carbohydrate
    pentose
    hexose
      glucose
      galactose
  polysaccharide

and so on. Then we could make the cross-product between this little DAG (a DAG because a carbohydrate could also be an acid or an alcohol, for example) and this small biological_process DAG:

metabolism
  biosynthesis
  catabolism

to produce automatically:

carbohydrate metabolism
carbohydrate biosynthesis
carbohydrate catabolism
simple carbohydrate metabolism
simple carbohydrate biosynthesis
simple carbohydrate catabolism
pentose metabolism
pentose biosynthesis
pentose catabolism
hexose metabolism
hexose biosynthesis
hexose catabolism
glucose metabolism
glucose biosynthesis
glucose catabolism
galactose metabolism
galactose biosynthesis
galactose catabolism
polysaccharide metabolism
polysaccharide biosynthesis
polysaccharide catabolism

Such cross-product DAGs may often have compound terms that are not appropriate. For example, the GO concepts 1,1,1-trichloro-2,2-bis-(4'-chlorophenyl)ethane metabolism and 1,1,1-trichloro-2,2-bis-(4'-chlorophenyl)ethane catabolism are appropriate, yet 1,1,1-trichloro-2,2-bis-(4'-chlorophenyl)ethane biosynthesis is not; organisms break down DDT but do not synthesise it. For this reason any cross-product tree would need pruning by a domain expert subsequent to its computation (or rules for selecting sub-graphs that are not to be cross-multiplied).
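The cross-product and pruning step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the GO Consortium's tooling: the dictionaries `COMPOUNDS` and `PROCESSES`, the function `cross_product`, and the `excluded` parameter are all hypothetical names, and the parent links in the compound DAG (for example glucose placed under hexose) are assumed for the example. Each compound term is paired with each process term, and a compound term inherits parents along both axes, which is why the result is a DAG and not a tree.

```python
from itertools import product

# Hypothetical mini-DAGs: each term maps to the list of its parent terms.
# The parent links are assumptions made for illustration only.
COMPOUNDS = {
    "carbohydrate": [],
    "simple carbohydrate": ["carbohydrate"],
    "pentose": ["simple carbohydrate"],
    "hexose": ["simple carbohydrate"],
    "glucose": ["hexose"],
    "galactose": ["hexose"],
    "polysaccharide": ["carbohydrate"],
}
PROCESSES = {
    "metabolism": [],
    "biosynthesis": ["metabolism"],
    "catabolism": ["metabolism"],
}


def cross_product(compounds, processes, excluded=frozenset()):
    """Build compound terms such as 'glucose biosynthesis'.

    Returns a dict mapping each generated term to its sorted parent terms.
    Pairs listed in `excluded` (the expert's pruning rules) are skipped,
    both as terms and as parents of other terms.
    """
    terms = {}
    for compound, process in product(compounds, processes):
        if (compound, process) in excluded:
            continue
        # Parents along the compound axis: same process, broader compound.
        parents = [
            f"{parent} {process}"
            for parent in compounds[compound]
            if (parent, process) not in excluded
        ]
        # Parents along the process axis: same compound, broader process.
        parents += [
            f"{compound} {parent}"
            for parent in processes[process]
            if (compound, parent) not in excluded
        ]
        terms[f"{compound} {process}"] = sorted(parents)
    return terms


terms = cross_product(COMPOUNDS, PROCESSES)

# Pruning: organisms break down DDT but do not synthesise it.
with_ddt = dict(COMPOUNDS, DDT=[])
pruned = cross_product(with_ddt, PROCESSES, excluded={("DDT", "biosynthesis")})
```

Here `terms` holds the 21 compound terms listed above, and `pruned` additionally holds "DDT metabolism" and "DDT catabolism" but not "DDT biosynthesis". Expressing the pruning as an explicit exclusion set keeps the expert's decisions separate from the mechanical computation, so the cross-product can be regenerated whenever either source ontology changes.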

Unfortunately, as no suitable ontology of compounds yet exists in the public domain, there is no alternative to the present method of maintaining this part of the biological_process ontology by hand.

A very similar situation exists for anatomical terms, in effect used as anatomical qualifiers to terms in the biological_process ontology. An example is eye morphogenesis, a term that can be broken up into an anatomical component, eye, and a process component, morphogenesis. This example illustrates a further problem: we clearly need to be able to distinguish the morphogenesis of a fly eye from that of a murine eye, or a Xenopus eye, or an acanthocephalan eye (were they to have eyes). Such is not the way to maintain an ontology. Far better would be to have species- (or clade-) specific anatomical ontologies and then to generate the required terms for biological_process as cross-products. This is indeed the way in which GO will proceed (D. Hill, in preparation) and anatomical ontologies for Drosophila and Arabidopsis are already available (www.geneontology.org/anatomy/), with those for mouse and C. elegans in preparation (see Bard and Winter 2001 for a discussion). The other advantage of this approach is that these anatomical ontologies can then be used in other contexts, for example for the description of expression patterns or mutant phenotypes (Hamsey 1997).

gobo: global open biological ontologies. Although the three controlled vocabularies built by the GO Consortium are far from complete, they are already showing their value (e.g. Venter et al 2001, Jenssen et al 2001, Laegreid et al 2002, Pouliot et al 2001). Yet, as discussed in the preceding paragraphs, the present method of building and maintaining some of these vocabularies cannot be sustained.
Both for its own use and in the belief that it will be useful to the community at large, the GO Consortium is sponsoring gobo (global open biological ontologies) as an umbrella for structured controlled vocabularies for the biological domain. A small ontology of such ontologies might look like this:

gobo
gene
gene_attribute
gene_structure
gene_variation
gene_product
gene_product_attribute
molecular_function
biological_process
cellular_component
protein_family
chemical_substance
biochemical_substance class
biochemical_substance_attribute
pathway
pathway_attribute
developmental_timeline
anatomy
gross_anatomy
tissue
cell_type
phenotype
mutant_phenotype
pathology
disease
experimental_condition
taxonomy

Some of these already exist (e.g. Taxman for taxonomy (Wheeler et al 2000)) or are under active development (e.g. the MGED ontologies for microarray data description (MGED 2001), a trait ontology for grasses (GRAMENE 2002)); others are not. There is everything to be gained if these ontologies could (at least) all be instantiated in the same syntax (e.g. that used now by the GO Consortium or in DAML+OIL (Fensel et al 2001)); for then they could share software, both tools and browsers, and be more readily exchanged. There is also everything to be gained if these are all open source and agree on a shared namespace for unique identifiers.

GO is very much a work in progress. Moreover, it is a community rather than individual effort. As such, it tries to be responsive to feedback from its users so that it can improve its utility to both biologists and bioinformaticists, a distinction, we observe, that is growing harder to make every day.

References.

Adams M et al 2000 The genome sequence of Drosophila melanogaster. Science 287:2185-2195

Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ 1990 Basic local alignment search tool. J Mol Biol 215:403-410 AmiGO 2001 url: www.godatabase.org/cgi-bin/go.cgi Baker PG, Goble CA, Bechhofer S, Paton NW, Stevens R, Brass A 1999 An ontology for bioinformatics applications. Bioinformatics 15:510-520 Bard J, Winter R 2001 Ontologies of developmental anatomy: Their current and future roles. Briefings Bioinformatics 2:289-299 Commission of Plant Gene Nomenclature 1994 Nomenclature of sequenced plant genes. Plant Molec Biol Reporter 12:S1-S109 Cruse DA 1986 Lexical semantics. New York, Cambridge University Press DAG Edit 2001 url: sourceforge.net/projects/geneontology/ DiBona C, Ockman S, Stone M (Editors) 1999 OpenSources. O’Reilly, Sebastopol CA Dure L. III 1991 On naming plant genes. Plant Molec Biol Reporter 9:220-228 Fellbaum C (editor) 1998 WordNet. An Electronic Lexical Database. MIT Press, Cambridge MA Fensel D, van Harmelen F, Horrocks I, McGuinness D, Patel-Schneider PF 2001 OIL: An ontology infrastructure for the semantic web. IEEE Intelligent Systems 16:38-45; url: www.daml.org Fleischmann RD, Adams MD et al 1995 Whole-genome random sequencing and assembly of Haemophilus influenzae Rd. Science 269:496-512 GO Consortium 2000 Gene Ontology: Tool for the unification of biology. Nature Genetics 25:25-29

GO Consortium 2001 Creating the gene ontology resource: design and implementation. Genome Res 11:1425-1433 GRAMENE 2002 url: www.gramene.org/plant_ontology Hamsey M 1997 A review of phenotypes of Saccharomyces cerevisiae. Yeast 1:1099-1133. Heath P 1974 The Philosopher’s Alice. Alice’s Adventures in Wonderland & Through a Looking Glass, by Lewis Carroll. Introduction and notes by Peter Heath. Academy Editions, London Hill DP, Davis AP, Richardson JE, Corradi JP, Ringwald M, Eppig JT, Blake JA 2001 Strategies for biological annotation of mammalian systems: implementing gene ontologies in mouse genome informatics. Genomics 74:121-128 Jenssen TK, Laegreid A, Komorowski J, Hovig 2001 A literature network of human genes for high-throughput analysis of gene expression. Nature Genetics 28:21-28 Karp P 2000 An ontology for biological function based on molecular interactions. Bioinformatics 16:269-285 Karp P, Paley S 1994 Representations of metabolic knowledge. Proc 2nd Internat Conf Intelligent Systems Bioinformatics, pp 203-211 Karp P, Riley M, Saier M, Paulsen IJ, Collado-Vides J, Paley SM, Pellegrini-Toole A, Bonavides C, Gama-Castro S 2002a The EcoCyc database. Nucleic Acids Res 30:56-58 Karp P, Riley M, Paley SM, Pellegrini-Toole A 2002b The MetaCyc database. Nucleic Acids Res 30:59-61 Kerlavage A, Bonazzi V, di Tommaso M, Lawrence C, Li P, Mayberry F, Mural R, Nodell M, Yandell M, Zhang J, Thomas PD 2002 The Celera Discovery system. Nucleic Acids Res 30:129-136

Kretschmann E, Fleischmann W, Apweiler R 2001 Automatic rule generation for protein annotation with the C4.5 data mining algorithm applied on SWISS-PROT. Bioinformatics 17:920-926 Kyrpides NC, Ouzounis CA 1998 Whole-genome sequence annotation ‘going wrong with confidence’. Molec Microbiol 32:886-887 Laegreid A, Hvidsten TR, Midelfart H, Komorowski J, Sandvik AK 2002 Supervised learning used to predict biological functions of 196 human genes. [In press] Leser U 1998 Semantic mapping for database integration – making use of ontologies. url: cis.cs.tu-berlin.de/~leser/pub_n_pres/ws_ontology_final98.ps.gz MGED 2001 url: www.mged.org Mewes HW, Heumann K, Kaps A, Mayer K, Pfeiffer F, Stocker S, Frishman D 1999 MIPS: a database for genomes and protein sequences. Nucleic Acids Res 27:44-48 Miller GA 1998 Nouns in WordNet. Chapter 1 in Fellbaum 1998 OpenSource 2001 www.opensource.org/ Overbeek R, Larsen N, Punsch GD, D’Souza M, Selkov E Jr, Kyrpides N, Fonstein M, Maltsev N, Selkov E 2000 WIT: Integrated system for high-level throughput genome sequence analysis and metabolic reconstruction. Nucleic Acids Res 28:123-125 Overbeek R, Larsen N, Smith W, Maltsev N, Selkov E 1997 Representation of function: the next step. Gene 191:GC1-GC9 Pouliot Y, Gao J, Su QJ, Liu GG, Ling YB 2001 DIAN: A novel algorithm for genome ontological classification. Genome Res 11:1766-1779 Priss UE 1998 The formalization of WordNet by methods of relational concept analysis. Chapter 7 in Fellbaum 1998

Pruitt KD, Maglott DR 2001 RefSeq and LocusLink: NCBI gene-centered resources. Nucleic Acids Res 29:137-140 Riley M 1993 Functions of the gene products of Escherichia coli. Microbiol Revs 57:862-952 Riley M 1988 Systems for categorizing functions of gene products. Curr Opin Struct Biol 8:388-392 Rison SCG, Hodgman TC, Thornton JM 2000 Comparison of functional annotation schemes for genomes. Funct Integr Genomics 1:56-69 Rogers J, Rector A 2000 GALEN’s model of parts and wholes: Experience and comparisons. Proc Amer Medical Informatics Assn Symp 2000:714-718 (editor JM Overhage). Hanley & Belfus Inc, Philadelphia PA Schulze-Kremer S 1997 Integrating and exploiting large-scale, heterogeneous and autonomous databases with an ontology for molecular biology. pp. 43-46 in: Hofestaedt R, Lim H (editors) Molecular bioinformatics – The human genome project. Shaker Verlag, Aachen. Schulze-Kremer S 1998 Ontologies for molecular biology. Proc Pacific Symp Biocomput 3:695-706 Serres MH, Gopal S, Nahum LA, Liang P, Gaasterland T, Riley M 2001 A functional update of the Escherichia coli K-12 genome. GenomeBiology 2001:2/9/research/0035.1 Sklyar N 2001 Survey of existing Bio-ontologies. url: http://dol.uni-leipzig.de/pub/2001-30/en Stevens R, Baker P, Bechhofer S, Ng G, Jacoby A, Paton NW, Goble CA, Brass A 2000 Transparent Access to Multiple Bioinformatics Information Sources. Bioinformatics 16:184-186 Takai-Igarashi T, Nadaoka Y, Kaminuma T 2000 A database for cell signaling networks. J Comp Biol 5:747 Venter JC et al 2001 The sequence of the human genome. Science 291:1304-1351

Wheeler DL, Chappey C, Lash AE, Leipe DD, Madden TL, Schuler GD, Tatusova TA, Rapp BA 2000 Database resources of the National Center for Biotechnology Information. Nucleic Acids Res 28:10-14 Wierzbicka A 1984 Apples are not a “kind of fruit”. Amer Ethnologist 11:313-328 Xie H, Wasserman A., Levine L, Novik A, Grebinsky V, Shoshan A, Mintz L 2002 Automatic large scale protein annotation through Gene Ontology. [In press] Zdobnov EM, Apweiler R 2001 InterProScan - an integration platform for the signature-recognition methods in InterPro. Bioinformatics 17:847-848

Acknowledgements. The Gene Ontology Consortium is supported by a grant to the GO Consortium from the National Institutes of Health (HG02273), a grant to FlyBase from the Medical Research Council, London (G9827766) and by donations from AstraZeneca Inc and Incyte Genomics. The work described in this review is that of the Gene Ontology Consortium and not the authors, who are just the raconteurs; they thank all of their colleagues for their great support. They also thank Robert Stevens, a user-friendly artificial intelligencer, for his comments and for providing references that would otherwise have evaded them; MA thanks Donald Michie for introducing him to WordNet, albeit over a rather grotty Chinese meal in York.

The Gene Ontology, looking backwards and forwards

5/10/05 1

The Gene Ontology: looking backwards and forwards

Suzanna E. Lewis

The Gene Ontology Consortium (GO) was initiated six years ago when a group of scientists, including myself, decided that the most direct way to connect our data was to share the same language for describing it. Looking back over what has happened since then, all of us feel that the most significant achievement of the GO is in uniting the many independent biological database efforts into a cooperative force.

Long ago, in the pre-genome era, biological databases were coming to terms with a formidable amount of work. After Crick and Watson elucidated the structure of DNA, the field of molecular biology exploded and an ever-increasing amount of information needed to be carefully managed and organized. This was particularly so after the invention of methods to sequence DNA in the late 1970s [1, 2] and, consequently, the initiation of the genome sequencing programs in the late 1980s, all of which led to an even faster acceleration of work in this field.

Keeping pace with molecular developments were biological data management efforts. These first began emerging in the 1960s when Margaret Dayhoff [3] published the Atlas of Protein Sequence and Structure [4], which later went on-line as the Protein Identification Resource (PIR [5]). More than 30 years ago, in the 1970s, the first structure database, Protein Data Bank (PDB [6]), was founded [7] and Jackson Laboratory developed the first mammalian genetics database [8]. A few years later the first depositories for nucleotide sequences were established, with the EMBL Data Library [9] beginning in 1981 [10] at Heidelberg, Germany and GenBank [11] in 1982 [12] at Los Alamos, New Mexico, followed shortly by the formal establishment of the PIR in 1984 [13] for proteins. By the late 1980s and 1990s biological databases were popping up everywhere: 1986, SwissProt [14]; 1989, C. elegans ACeDB [15]; 1991, Arabidopsis AAtDB [16]; 1992 [17], The Institute for Genomic Research [18] (TIGR); 1993, FlyBase [19]; and in 1994 [20], Saccharomyces Genome Database [21] (SGD). These groups all took advantage of concurrent technological advances and pioneered the use of the Internet, the web, and relational database management systems (RDBMS) and SQL when these technologies first became available during the 1980s and 1990s [22]. Thus, many biological databases bloomed, flourished and, until the late 1990s, all of them primarily operated autonomously.

Having many independent genome databases made a large number of researchers very happy but there were shortcomings. The most important research limitation was that the full potential of these isolated data sets would not be realized until they were as integrated as possible. However, there is a practical constraint, which is that biological databases are inherently distributed because the specialized biological expertise that is required for data capture is spread around the globe at the sites where the data originates. Whatever the solution to biological integration was, it would have to acknowledge that the primary sources of data are these distributed investigators.

The community initially was very small and these pioneer database developers largely knew one another. They made many attempts to work together towards an integrated solution either by facilitating the transfer of knowledge between databases or by merging them. The annual ACeDB workshops are one example of these efforts. In the early 1990s these two-week sessions brought together participants from many organisms, such as pine trees, tomatoes, bovines, flies, weeds, worms, and others. Unfortunately, ACeDB was dependent upon what became outmoded technology and did not adapt to the web or RDBMS quickly enough to survive as a general solution. There were also a number of meetings organized in vain attempts to design the ultimate biological database schema, such as the Meeting on the Interconnection of Molecular Biology Databases held at Clare College, Cambridge in 1995. Creating a federated system failed for reasons too numerous to list, but the biggest impediment was getting the many people involved to agree on virtually everything. It would have created a technological behemoth that would be unable to respond to new requirements when they inevitably occurred. Even small-scale collaborations between two databases failed (SGD and Berkeley Fly Database, my personal experience). While we decided to share technology, the RDBMS and programming language, this commonality was moot because we did not also share a common focus. SGD had a finished
genome while Berkeley was managing EST and physical mapping data. The central point is that the solution to biological database integration does not lie in particular technologies.

At the same time, an approximate solution to this problem was being demanded by the research communities whom the model organism databases (MODs) served. These communities increasingly included not just organism-specific researchers, but also pharmaceutical companies, human geneticists, and biologists interested in many organisms, not just one. Another contributing factor was the recent maturation of DNA microarray technology [23, 24]. The implication of this development was that functional analysis would be done on a large scale and the community risked losing the capability of fully leveraging the power of these new data if they were poorly integrated. For those orchestrating a genome database this was not merely an intellectual exercise; we had to find a solution or risk losing funding. In a word, we were highly motivated.

The most fundamental questions for the biologists the MODs served revolve around the genes. What genes are there, what are their mRNA and peptide sequences, where are they on the genome, when are they expressed and how is their activity controlled, in what tissue, organ, and part of the cell are they expressed, what function do they carry out and what role does this play in the organism’s biology? Both pragmatically and biologically then, it made sense for the solution to likewise revolve around the genes. One essential aspect of this, that everyone agreed was necessary, was systematically recording the molecular functions and biological roles of every gene.

One of the first functional classification systems was created in 1993 by Monica Riley for E. coli [25]. Building primarily upon this system, Michael Ashburner began assembling what became proto GO, originally to serve the requirements of FlyBase. Similarly, TIGR created its functional classification system around this time. These early efforts were systematic, in that they were using a well-defined set of concepts for the descriptions, but they were limited because they were not shared between organisms. SGD, FlyBase, TIGR, Mouse Genome Informatics [26] (MGI), and others, all independently realized that we could essentially solve a significant portion of the data integration issue if a cross-species functional classification system were created. In our ideal world, sequence (nucleic acid, protein), organism, and other specialty biological databases would all agree on how this should be done.

In 1998, it simply was imperative for those responsible for community model organism databases to act, as the number of completely sequenced genomes and large-scale functional analysis experiments was growing. Our correspondence that spring contained many messages such as these: “I'm interested in being involved in defining a vocabulary that is used between the model organism databases. These databases must work together to produce a controlled vocabulary.” (Personal communication); and “It would be desirable if the whole genome community was using one role/process scheme. It seems to me that your list and the TIGR list are similar enough that generation of a common list is conceivable.” (Personal communication). In July of that year, Michael Ashburner presented a proposal at the Montreal ISMB bio-ontologies workshop to use a simple hierarchical controlled vocabulary that was dismissed by other participants as naïve. However, later, in the hotel bar, representatives of FlyBase (Suzanna Lewis), SGD (Steve Chervitz), and MGI (Judith Blake) embraced this proposal and agreed to jointly apply the same vocabulary to describe the molecular functions and biological roles for every gene in their respective databases and thereby founded the Gene Ontology Consortium.

It is now six years later and the GO has grown enormously. There are many measures demonstrating its success. Publications: at present there are close to 300 articles in PubMed referencing the GO. Support from large institutional databanks: SwissProt now uses GO for annotating the peptide sequences they maintain. Participation: the number of organism groups has grown every quarter from the initial three to roughly two dozen. Acceptance: every conference has talks and posters either referencing or utilizing the GO, and within the genome community it is the accepted standard for functional annotation. While it is impossible in hindsight to pinpoint exactly why it has succeeded, there are certain definite factors involved that are listed below.

We already had ‘market-share’.
Our careers were such that we could take risks.
We were and are practical and experienced engineers.
We have always worked at the leading edge of technology.
It was in our own self-interest.
We had domain knowledge.
We are open.

A significant advantage that we (those managing biological databases) had, though it is not often considered, is our stewardship of key data sets. The commencement of GO also coincided with the completion of many key genomes that, once sequencing is finished, these database groups annotate, manage and maintain. These facts put us in the right position to succeed because of the influence these data have. The decisions we make in our management of these data have a great deal of downstream effect. Every researcher, both bench and informaticist, who utilizes the genomic data of mouse, Drosophila, yeast, and other organisms is influenced by our choices in how the data are described and organized. In contrast to broad-spectrum archival repositories, these data are annotated by specialists in the biology of a given organism who have a detailed understanding of its idiosyncratic biological phenomena. This expertise anchors the captured knowledge in experimental data. As other organism specialists joined (the Arabidopsis Information Resource [27] (TAIR) soon after the start, as well as microbial and pathogen databases [28]), the impact of GO increased. Given the large established constituency of biologists that FlyBase, SGD, MGI, and TAIR are accepted by, it is unsurprising that our decision to jointly develop the GO was influential.

In addition to our holding a majority share of these critical research resources, the careers of the people involved are built on successful collaborative efforts. The professionals who are responsible for the biological databases fall roughly into two classes. They are either tenured principal investigators who wish to contribute to their community or PhD level researchers (both biologists and computer scientists) who have especially chosen a non-academic career track. They do not have much to gain by, for example, publishing papers as individuals. Papers are published, of course, about the content of the database or techniques for managing these data, but an individual’s personal publication record is not a primary criterion upon which their career is evaluated. Rather, careers are measured by the success of the project and the strength of an individual’s contribution to the project’s goals. This attitude allowed us to remove our egos and concern for individual recognition from the search for a solution to the data interconnection problem.

Apart from the preceding organizational and social factors, each GO consortium scientist had a successful background in producing large information resources. Everyone possessed institutional knowledge of the requirements for biology and proven experience in engineering management and development. They knew how to decompose a large and complex project into smaller readily measurable milestones, an extremely difficult thing to do. Understanding the theoretical requirements of a problem is necessary, but insufficient. The experience and practical skill to effectively direct the development and implement a solution were also essential.

Complementing our existing skills was our willingness to use new technologies. A key characteristic of the scientists who initiated the GO is that they are “early adopters”. There is a definite behavior pattern in this group of exploring technological innovations. We had always sought new strategies to solve our problems, for example the Internet, the web, RDBMSs, new languages (such as Perl and Java), and ontologies, all of which we began to work with before the methodologies were mature and well-established. In short, we have a tradition of experimentation. It is not very surprising that scientists are willing to experiment, but this mind set extends to computer science as well and enables us to exploit advances in that field to address the needs of biology. Anything that will help us get the job done we will take advantage of.

The GO consortium is inherently collaborative and collaborations are hard, very hard, because of geography, misunderstandings, and the length of time it takes to get anything resolved and completed. Within the consortium, it is made even more difficult because we must discuss and
agree upon mental concepts and definitions in addition to concrete issues such as data syntax and exchange. Still, we actively sought collaboration, because it was in our own self-interest. Our users, whose support we depended upon, were demanding the ability to ask the same query of different genomic databases and receive comparable answers. Every biological database would gain through cooperation. One of the most significant contributing factors is our deep knowledge of the domain of biology. No problem can be solved successfully if you do not understand its nuances. The consortium succeeded by utilizing knowledge from many disparate fields: selectively exploiting what has been learned in the field of artificial intelligence (AI) and the study of ontologies; constrained by practical engineering considerations and incremental development; all whilst bearing in mind the niceties of the biology being represented. Domain knowledge is essential to GO’s success, without which we could not maintain biological fidelity. Last, and perhaps most important, is that we have always been open. All of the vocabularies, the annotations, and software tools are available for others to use. Our success is best illustrated by how much they are used29. This openness is essential in the scientific environment we work in. To provide a technology without a willingness to reveal all source code and data is tantamount to throwing away the lab notebook. Providing outside researchers with the ability to completely understand the methods that are used is mandatory for scientific progress. The GO is not perfect, but its success is primarily due to revealing everything. The feedback we receive from others is what is enabling the consortium to improve with age. Our plan for the future is to build on this base. We are actively seeking ways and building tools to help new biological databases utilize the GO and thus extend our data coverage to include more organisms. 
We will remain pragmatic in our choice of technologies and remain flexible enough to exploit new advances. We will incrementally advance the sophistication of the underlying software architecture, one example of this is shown by our collaboration with Reactome30, a project generating formal representations of biological pathways. We will seek out domain experts as the biological coverage of the GO extends into new areas so that biological fidelity is kept high. Likewise, we will work with experts to extend the scope of the ontologies to cover other critical areas of biological description, such as anatomies, cell types, and phenotypes, as illustrated by the Open Biological Ontologies31 project. Finally, we will continue to work cooperatively and remain open as this has shown to be the most scientifically productive approach. The GO succeeded because it was not a technical solution per se. Technology is more than just an implementation detail of course, but will never be a silver bullet either. It comes down to the fact that we want to continue integrating our knowledge forever and technologies are short-lived. Therefore, the solution must be able to adopt new technologies as they arise while the primary focus remains on cooperative development of semantic standards. It’s about the content, not the container. Perhaps ironically, the impact of shifting the focus away from a technical solution to our biological data integration problem is that we have begun sharing technology. Once the mechanism for a dialog was in place we have discovered many other areas where our interests coincided. There are now organized meetings for professional biological curators to meet and discuss standard methodologies32. The Generic Model Organism Database33 (GMOD) effort makes these common tools available to the community and serves as a forum for a wide spectrum of interests. 
It is this unforeseen outcome, consolidating the disparate databases into a cooperative community engaged in productive dialogs, which is the single largest impact and achievement of the Gene Ontology consortium.

1 Sanger F., Coulson A.R., A rapid method for determining sequences in DNA by primed synthesis with DNA polymerase. J Mol Biol. 1975 May 25;94(3):441-8.
2 Maxam A.M., Gilbert W., A new method for sequencing DNA. Proc Natl Acad Sci U S A. 1977 Feb;74(2):560-4.


3 http://www.dayhoff.cc/index.html
4 Dayhoff MO, Eck RV, Chang MA, Sochard MR. Atlas of Protein Sequence and Structure. National Biomedical Research Foundation, 1965
5 http://pir.georgetown.edu/home.shtml
6 http://www.rcsb.org/pdb/
7 http://www.rcsb.org/pdb/holdings.html
8 www.jax.org/about/milestones.html
9 http://www.ebi.ac.uk/embl/index.html
10 http://www.embl.org/aboutus/generalinfo/history.html
11 http://www.ncbi.nlm.nih.gov/Genbank/index.html
12 www.ncbi.nlm.gov/Education/BLASTinfo/milestones.html
13 http://pir.georgetown.edu/pirwww/aboutpir/history.html
14 www.ebi.ac.uk/swissprot
15 http://www.acedb.org/
16 http://weedsworld.arabidopsis.org.uk/Vol3ii/Cherry-Flanders-Petel.WW.html
17 www.tigr.org/about/history.shtml
18 http://www.tigr.org/
19 www.flybase.org
20 www.yeastgenome.org/aboutsgd.shtml
21 www.yeastgenome.org/
22 1970: Relational database model specified (Ted Codd: www.nap.edu/readingroom/books/far/ch6.html); 1983: The internet is defined as networks using TCP/IP (www.historyoftheinternet.com/chap4.html); 1985: SQL defined; 1990: First ever web page at CERN (www.w3.org/History.html)
23 Fodor SP, Rava RP, Huang XC, Pease AC, Holmes CP, Adams CL., Multiplexed biochemical assays with biological chips. Nature. 1993 Aug 5;364(6437):555-6.
24 Schena M., Shalon D., Davis R.W., Brown P.O., Quantitative monitoring of gene expression patterns with a complementary DNA microarray. Science. 1995 Oct 20;270(5235):467-70.
25 Riley M. Functions of the gene products of Escherichia coli. Microbiol Rev. 1993 Dec;57(4):862-952.
26 www.informatics.jax.org/
27 www.arabidopsis.org/
28 www.genedb.org/
29 http://www.geneontology.org/GO.biblio.html
30 www.reactome.org
31 http://obo.sf.net
32 http://tesuque.stanford.edu/biocurator.org/
33 http://gmod.sourceforge.net/


Ontologies in Biomedicine: The Good, the Bad, and the Ugly http://ontology.buffalo.edu/bio/OntologiesGBU.html


Ontologies in Biomedicine: The Good, the Bad, and the Ugly

Compiled for internal use. Very First Draft. Comments, Corrections and Extensions Welcome (to: [email protected])

1. Caveats:

1. Not everything on this list is described by its authors as an ontology.
2. The list has been prepared for illustrative purposes, as a preliminary guide to the sorts of pitfalls that we face in the building of ontologies. Its goal is to draw attention primarily to what is wrong with ontologies. Thus it should be used in conjunction with lists of ontologies in biomedicine such as those prepared at:

http://www.cs.man.ac.uk/~stevensr/ontology.html
http://anil.cchmc.org/Bio-Ontologies.html
http://lsdis.cs.uga.edu/~cthomas/bio_ontologies.html

and of course with the OBO (Open Biomedical Ontologies Consortium) ontology library: http://obo.sourceforge.net

2. First Draft List of Criteria to be satisfied by Good Ontologies

(NB think of these as rules of thumb, or goals to keep constantly in mind – the world is too messy to support them all simultaneously)

a. Each ontology should have as its backbone a taxonomy based on the is_a relation (for ‘is a subtype of’). This should be as far as possible a true hierarchy (single inheritance).
b. The taxonomy should have one root, with a suite of high-level children of the root of a sort which yield a top-down view of the structure of the whole ontology. (One does not have this e.g. in SNOMED, or in the Cell Ontology.)
c. The expressions corresponding to the constituent nodes of the taxonomy and to its relations (is_a, part_of, etc.) should be explicitly defined in both human-readable and computable formats. The latter should be formalized versions of the former. Such definitions should then provide the rationale for establishing the class subsumption inheritance hierarchy.
d. There should be clear rules governing how definitions are formulated.
e. The ontology should distinguish between the types (classes, universals) represented by this taxonomy, and the tokens (individuals, particulars, instances) instantiated by these types on the side of reality.
f. The relationships should be used consistently to ensure valid inferences both within and between ontologies. One should be able to reliably query on instance data, computationally.
g. Classification systems that have existed for centuries have been human interpretable, but never computable. So, being able to compute on an ontology is important.
h. An ontology should accommodate change in knowledge. It should have clear procedures for adding new terms, and clear procedures for correcting erroneous entries. All prior versions should be easily accessible.
i. There should be clear rules governing how to select terms and how to resolve problems in case of difficult terms.
j. The different types of problem cases in the treatment of terms, relations and definitions should be carefully documented, and best practices for the resolution of these problems tested and promulgated.
k. The scope of an ontology should be clearly specified, both in terms of the domain of instances over which it applies and in terms of the types of relations in that domain (and thus to the pertinent type of scientific inquiry). The family of terms in a given ontology should then have a natural unity, which should also be reflected in the name of the ontology. This criterion is not satisfied, e.g., by the various so-called 'tissue' ontologies discussed in the last section below. Indeed the family of tissue terms is one important example of a problem area, reflecting the fact that the term 'tissue' is ambiguous as between KIND of tissue and PORTION of tissue. (A similar ambiguity applies e.g. to ‘substance’.)
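Criteria (a) and (b) are the kind of rule that can be checked mechanically, which is part of the point of criterion (g). The following is a minimal sketch in Python, assuming the is_a backbone is available as (child, parent) pairs; the function name, the input format, and the toy terms are illustrative assumptions and do not come from any of the ontologies discussed here.

```python
from collections import defaultdict


def check_taxonomy(is_a_edges):
    """Report violations of criteria (a) and (b) in an is_a taxonomy.

    is_a_edges: iterable of (child, parent) pairs; this input format is an
    assumption made for the sketch.
    Returns (roots, multi_parent): the sorted terms that have no parent, and
    a dict mapping each term with more than one parent to its sorted parents.
    """
    parents = defaultdict(set)
    terms = set()
    for child, parent in is_a_edges:
        parents[child].add(parent)
        terms.update((child, parent))
    # A root is any term that never appears as a child.
    roots = sorted(t for t in terms if t not in parents)
    # Single inheritance means no term has two or more is_a parents.
    multi_parent = {t: sorted(p) for t, p in parents.items() if len(p) > 1}
    return roots, multi_parent


# Hypothetical toy terms, not taken from any real ontology.
edges = [
    ("neuron", "cell"),
    ("cell", "anatomical structure"),
    ("gland", "organ"),
    ("gland", "tissue"),
]
roots, multi_parent = check_taxonomy(edges)
print(roots)         # more than one root violates criterion (b)
print(multi_parent)  # any entry here violates criterion (a)
```

In this toy input the check reports three roots and one term with two parents, so both criteria fail. Since the criteria are offered as rules of thumb, such a report is best read as a list of places to review rather than as a verdict.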

3. The Rankings

3.1 Very Good

The Foundational Model of Anatomy (FMA): http://sig.biostr.washington.edu/projects/fm/
Very clear statement of scope (structural human anatomy, at all levels of granularity, from the whole organism to the biological macromolecule); very powerful treatment of definitions (from which the entire FMA hierarchy is generated); very quick turn-around time for correction of errors; very few unfortunate artifacts in the ontology deriving from its specific computer representation (Protégé).

3.2 The Good

GALEN
Motivation: to find ways of storing detailed clinical information in a computer system so that both (1) clinicians are able to store and review information at a level of detail relevant to them and (2) computers can manipulate what is stored, for retrieval, abstraction, display, comparison.
Very powerful (Description Logic-based) formal structure, thus tight organization and careful treatment of terms; unfortunately remains only partially developed after some years of lying fallow. Now in some respects outdated.

3.3 The Intermediate (= still need many modifications)

Gene Ontology
Open source; very useful; poor treatment of the relations between the entities covered by its three separate ontologies.

Reactome http://www.reactome.org/
A rich knowledgebase of biological process, but with incoherent treatment of top-level categories. Thus ReferentEntity (embracing e.g. small molecules) is treated as a sibling of PhysicalEntity (embracing complexes, molecules, ions and particles). Similarly CatalystActivity is treated as a sibling of Event.

SNOMED http://www.snomed.org/

Swissprot http://us.expasy.org/sprot/
Protein knowledgebase

Sequence Ontology http://song.sourceforge.net/

Cell Ontology http://www.xspan.org/obo/

Zebrafish Anatomy and Development Ontology http://obo.sourceforge.net/cgi-bin/detail.cgi?zfishanat

NANDA International Taxonomy http://www.nanda.org/html/taxonomy.html
A conceptual system that guides the classification of nursing diagnoses in a taxonomy.

ICNP International Classification for Nursing Practice http://www.icn.ch/icnp.htm
A combinatorial terminology for nursing practice that facilitates crossmapping of local terms and existing vocabularies and classifications.

National Cancer Institute Thesaurus http://www.mindswap.org/2003/CancerOntology/
Top-level structure recognizes the existence of three (disjoint) classes of cells: cells, normal cells, abnormal cells. Recognizes three (disjoint) classes of plants: vascular plants, non-vascular plants, other plants. Inherits many of the problematic features from other terminologies in the UMLS.

UMLS http://www.nlm.nih.gov/research/umls/

ICD-10 http://www.icd10.ch/index.asp?lang=EN

3.4 The Bad

UMLS Semantic Network http://semanticnetwork.nlm.nih.gov/
Recognizes only one subtype of plant – algae (which are not plants).
Treats the digestive system as a conceptual part of the organism.

Clinical Terms Version 3 (The Read Codes): http://www.nhsia.nhs.uk/terms/pages/publications/v3refman/chap2.pdf
(Early?) versions classify chemicals into: chemicals whose name begins with ‘A’, chemicals whose name begins with ‘B’, chemicals whose name begins with ‘C’, ...
Incorporated into SNOMED-CT.

LOINC Logical Observation Identifiers Names and Codes: http://www.regenstrief.org/loinc
Goal: to facilitate the exchange and pooling of results, such as blood hemoglobin, serum potassium, or vital signs, for clinical care, outcomes management, and research.
Problem: reveals its origins in the punchcard era; typical string:
12189-7 | CREATINE KINASE.MB/CREATINE KINASE.TOTAL | CFR | PT | SET/PLAS | QN | CALCULATION
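To make the "punchcard" remark about LOINC concrete, here is a small sketch that splits the typical string quoted above into named fields. The field names follow the commonly described six-part LOINC name (component, property, time, system, scale, method) preceded by the code; they are an assumption of this sketch and are not given in the text, and the function name is hypothetical.

```python
# Field names are assumptions for this sketch, not taken from the source.
LOINC_FIELDS = ("code", "component", "property", "time", "system", "scale", "method")


def parse_loinc_string(line):
    """Split a pipe-delimited LOINC-style string into named fields."""
    parts = [part.strip() for part in line.split("|")]
    if len(parts) != len(LOINC_FIELDS):
        raise ValueError(f"expected {len(LOINC_FIELDS)} fields, got {len(parts)}")
    return dict(zip(LOINC_FIELDS, parts))


# The string is copied as printed in the list above.
record = parse_loinc_string(
    "12189-7 | CREATINE KINASE.MB/CREATINE KINASE.TOTAL | CFR | PT | SET/PLAS | QN |CALCULATION"
)
print(record["code"], record["method"])
```

The meaning of each value is carried only by its position in the string, which is the weakness the entry points at: nothing in the record itself says what relation holds between, say, the component and the system.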

Health Level 7 Reference Information Model (HL7 RIM): http://www.hl7.org/Library/data-model/RIM/modelpage_mem.htm
HL7 is a standard for exchange of information between clinical information systems (has proved very crumbly as a standard; every hospital has its own version of HL7); the RIM is designed to overcome this problem by defining the world of healthcare data (a consensus view of the entire healthcare universe); one problem with the RIM is that very many entities in the healthcare universe (e.g. disorders, genes, ribosomes) are identified as documents; because of the counterintuitive nature of this identification, RIM documentation is itself highly counterintuitive, and the RIM community itself is subject to constant fights.

Medical Entities Dictionary (MED): http://med.dmi.columbia.edu/
Semantic network style.

MedDRA v. 3: http://www.meddramsso.com/NewWeb2003/medra_overview/index.htm
Has hierarchies, but you can’t tell by browsing through the hierarchies whether different terms represent the same thing or not.
MedDRA v. 3 does not assign unique codes to its terms, but rather works with unique terms collected from various sources which are left unchanged for reasons of ‘compatibility’. Some source terminologies, such as WHO-ART (World Health Organization Adverse Reaction Terminology), had all terms in uppercase, some not. So a unique term might be “COLD”, but also “cold”, and “Cold”, and “cOLd”, .... Each unique term in MedDRA v. 3 must be assigned a single meaning, but MedDRA does this in a haphazard way. Thus the 4-character string “COLD” might be assigned the meaning common cold or cold temperature or (as is in fact the case <check>) chronic obstructive lung disease. Suppose, now, that a medical doctor in a pharmaceutical company has the task of coding into MedDRA handwritten reports received from practising physicians engaged in clinical studies. She must then, according to the coding rules set up by her department, either code a sentence such as “patient coughing and sneezing, ... diagnosis: COLD” as referring to chronic obstructive lung disease (which is obviously wrong), or make a phone call to the physician to ensure that he in fact meant “cold” and not “COLD”.

MEDCIN: http://www.medicomp.com/index_html.htm
Mixes up everything that can be mixed up.
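The MedDRA “COLD” problem described above comes from identifying concepts by their term strings. The sketch below, in Python, first groups terms that differ only in letter case, then shows the alternative of attaching terms to coded concepts so that an ambiguous string surfaces as several candidates instead of one arbitrary meaning. The concept codes and the table are hypothetical; per the text, MedDRA v. 3 has no such codes.

```python
from collections import defaultdict


def case_variants(terms):
    """Group terms that are identical except for letter case."""
    groups = defaultdict(set)
    for term in terms:
        groups[term.casefold()].add(term)
    return {key: sorted(group) for key, group in groups.items() if len(group) > 1}


# Hypothetical concept codes and synonym sets, invented for this sketch.
CONCEPTS = {
    "C0001": {"label": "common cold", "terms": {"cold", "Cold", "COLD"}},
    "C0002": {"label": "chronic obstructive lung disease", "terms": {"COLD"}},
}


def candidate_concepts(term):
    """Return the codes of every concept that lists this exact term."""
    return sorted(code for code, concept in CONCEPTS.items() if term in concept["terms"])


print(case_variants(["COLD", "cold", "Cold", "cOLd", "fever"]))
print(candidate_concepts("COLD"))  # two candidates: the coder must choose
print(candidate_concepts("cold"))  # one candidate
```

With coded concepts the coder in the example records a code chosen from the candidates, so “diagnosis: COLD” no longer forces the wrong reading.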


International Classification of Primary Care (ICPC): http://www.ulb.ac.be/esp/wicc/icpc2.html
Tries to explain general medicine (family medicine) by means of about 800 classes.

MeSH

MGED

eVoc

PATO

Mouse Pathology

3.5 The Ugly

ICD-10-PCS: http://www.cms.hhs.gov/paymentsystems/icd9/icd10.asp
Based on good principles but worked out using an ugly representation.

UMLS Semantic Network

Special Mention: Ugly Tissue Ontologies

1. TissueDB (http://tissuedb.ontology.ims.u-tokyo.ac.jp:8082/tissuedb/) has nothing to do with tissues. What they call tissue is basically all the structures one can identify histologically.

2. Brenda Tissue Ontology http://www.brenda.uni-koeln.de/ontology/tissue/tree/update/update_files/BrendaTissue has nothing to do with tissue. Or rather, here, basically everything is a tissue. Thus it contains statements like: arm is-a limb.

3. Auckland Anatomy Ontology Tissue Class View http://n2.bioeng5.bioeng.auckland.ac.nz/ontology/anatomy/ontology_class_view?class_uri=http%3A//physiome.bioeng.auckland.ac.nz/anatomy/all%23Tissue
Classifies tissue into: Connective tissue, Epithelial tissue, Glandular tissue, Muscle tissue, Nervous tissue; but proceeding further down the hierarchy we find not tissues but organs and organ parts such as SimpleTubularGland, SimpleAcinarGland, etc. Moreover EndocrineGland is asserted to have two ‘instances’ (we presume they mean subclasses): EndocrineGland (!), and FollicularEndocrineGland. Among the ‘instances’ of ConnectiveTissue are listed: Left Humerus, Right Tibia, and so on. So nonsense, here, too.
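Two of the defects noted in item 3 can be caught by a simple audit: a class listed under itself, and a term filed under a parent of a different kind. The sketch below is in Python; the `kind` labels (organ, tissue) are assigned by hand for illustration and are an assumption of this sketch, as is the function name.

```python
def audit_subclass_edges(edges, kind):
    """Flag self-subsumption and kind mismatches in (child, parent) pairs.

    kind: dict mapping each term to a coarse category label; these labels
    are assumptions supplied by whoever runs the audit.
    """
    # A class can never be a proper subclass (or instance) of itself.
    self_loops = sorted({child for child, parent in edges if child == parent})
    # A child whose category differs from its parent's is suspect.
    mismatched = sorted(
        (child, parent)
        for child, parent in edges
        if child != parent and kind.get(child) != kind.get(parent)
    )
    return self_loops, mismatched


# Edges paraphrase the statements quoted above; kind labels are illustrative.
edges = [
    ("EndocrineGland", "EndocrineGland"),
    ("FollicularEndocrineGland", "EndocrineGland"),
    ("Left Humerus", "ConnectiveTissue"),
    ("Epithelial tissue", "Tissue"),
]
kind = {
    "EndocrineGland": "organ",
    "FollicularEndocrineGland": "organ",
    "Left Humerus": "organ",
    "ConnectiveTissue": "tissue",
    "Epithelial tissue": "tissue",
    "Tissue": "tissue",
}
self_loops, mismatched = audit_subclass_edges(edges, kind)
print(self_loops)
print(mismatched)
```

The mismatch check is only as good as the hand-assigned labels, so it is a prompt for human review, not a proof of error.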


www.elsevier.com/locate/yjbin

Journal of Biomedical Informatics 36 (2003) 478–500

A reference ontology for biomedical informatics: the Foundational Model of Anatomy

Cornelius Rosse* and José L. V. Mejino Jr.

Departments of Biological Structure, and Medical Education & Biomedical Informatics, Structural Informatics Group,

University of Washington, Seattle, WA 98195, USA

Received 7 November 2003

Abstract

The Foundational Model of Anatomy (FMA), initially developed as an enhancement of the anatomical content of UMLS, is a domain ontology of the concepts and relationships that pertain to the structural organization of the human body. It encompasses the material objects from the molecular to the macroscopic levels that constitute the body and associates with them non-material entities (spaces, surfaces, lines, and points) required for describing structural relationships. The disciplined modeling approach employed for the development of the FMA relies on a set of declared principles, high level schemes, Aristotelian definitions and a frame-based authoring environment. We propose the FMA as a reference ontology in biomedical informatics for correlating different views of anatomy, aligning existing and emerging ontologies in bioinformatics and providing a structure-based template for representing biological functions.

© 2003 Elsevier Inc. All rights reserved.

Keywords: Ontology; Knowledge representation; Bioinformatics; Biomedical informatics; Anatomy; Mereotopology; Embryology; Developmental biology; UMLS

* Corresponding author. Fax: 1-206-543-1524.
E-mail address: [email protected] (C. Rosse).
1532-0464/$ - see front matter © 2003 Elsevier Inc. All rights reserved.
doi:10.1016/j.jbi.2003.11.007

1. Introduction

Ontology design is becoming increasingly recognized as central to medical informatics [1] and even more so to bioinformatics. New ontologies continue to appear in diverse areas of the biomedical sciences with a particular emphasis on biological macromolecules and the processes in which these molecules participate. The importance of relating such new information resources to medical terminologies (or vocabularies) is illustrated by the recent incorporation of the Gene Ontology [2] in the Unified Medical Language System (UMLS) [3]. UMLS, designed, maintained and distributed by the National Library of Medicine, provides a unified knowledge representation system for correlating a large number of biomedical terminologies. Like most UMLS terminologies, the Gene Ontology and other application ontologies in biomedical informatics are compiled in diverse contexts with distinct user groups in mind; consequently their correlation and mapping to one another pose a considerable challenge. The challenge is enhanced by the need for aligning these ontologies with evolving, computable information resources in the classical, basic, biomedical sciences (e.g., anatomy, physiology, and pathology), as well as with those in clinical medicine. Such correlations will be critical for the development of knowledge-based applications that will need to rely on inference in order to support clinical research and decision making based on the knowledge of molecular biology.

A raison d’être of UMLS is to facilitate the establishment of correspondences in the meaning of terms among its constituent vocabularies. This correlation is largely achieved through assigning the same concept occurring in different terminologies to high level semantic types encompassed within the UMLS Semantic Network [4]. It is more problematic, however, to reconcile divergences in the semantic structure of these sources and other ontologies at levels higher than leaf concepts and discrete terms. For example, while there is considerable correspondence in the meaning of anatomical terms in UMLS sources that include substantial amounts of anatomy, there is very little similarity in the schemes these sources use for arranging their anatomical terms into a coherent representation of anatomical knowledge. While such correspondences may support the correlation of the meaning of terms, the underlying semantic structure of these abstractions must also be aligned if problem solving calls for inference across the boundaries of related ontologies.

It is particularly important to assure coherence of knowledge domains that generalize to a number of other fields where they will be reused. Such is the case with the classical, basic, biomedical sciences and also with more modern disciplines, such as neuroscience and developmental biology. All these fields are embraced by bio- and biomedical informatics, which deal not only with human biology but also with observations and experimental data derived from non-human species. In order to support the generation of knowledge-based applications that will be increasingly needed in basic science and clinical research, as well as in the delivery of health care, computable knowledge sources must be established not only in the modern but also in the classical disciplines of basic science. Such a widening focus in bioinformatics is inevitable in the post-genomic era, and the process has in fact already begun. Distinct from the large clinical terminologies (e.g., SNOMED RT [5], GALEN [6], Medical Entities Dictionary [7]), a number of ontologies are emerging that represent knowledge in discrete fields of the basic biomedical sciences. One of these ontologies is the Digital Anatomist Foundational Model of Anatomy (Foundational Model or FMA, for short) [8,9]. The FMA symbolically represents the structural organization of the human body from the macromolecular to macroscopic levels.

The initial development of the FMA was supported by UMLS with the intent of enhancing the anatomical content of UMLS source vocabularies and ultimately facilitating the correlation of anatomical concepts represented in these vocabularies. We present a status report on the FMA, major components of which are included in UMLS as the Digital Anatomist vocabulary (known in previous editions of UMLS as UWDA). With this report we wish to promote the evaluation of the FMA with respect to realizing its intended role in UMLS and, in a broader sense, bring the FMA to the attention of the biomedical, and particularly the bioinformatics communities.

The purpose of this paper is to describe the FMA and propose it as a reference ontology for biomedical informatics. Our rationale for this proposal is based on the fact that the FMA’s concept domain embraces all material objects, substances and spaces that result from the coordinated expression of structural genes. In their aggregate these anatomical entities constitute the fully formed body and assume the role of “actors” in all physiological and disease processes. Therefore, we contend that a coherent domain ontology of anatomical entities is the best candidate for serving as a foundation and reference for the correlation of other ontologies in biomedical informatics. Our second objective is to illustrate the process of disciplined modeling we pursued in establishing the FMA. We believe that this approach could also serve well the authors of emerging knowledge sources in bioinformatics, in that it synergizes with and enhances broader guidelines and desiderata that have been proposed for the construction of terminologies and knowledge bases [10,11].

1.1. Organization of this paper

We first define the FMA and then illustrate the disciplined modeling approach by focusing on the establishment of the Anatomy Taxonomy (AT) and the other two components of the FMA, which relate to structural and developmental attributes of the entities to which concepts in the AT refer. The next sections are devoted to accessing, scaling, and evaluating the FMA, before we discuss the FMA’s relevance to UMLS and comment on its potential as a reference ontology for biomedical informatics, which leads to our conclusions. Different typographies used in the text have the following associations: Names of concepts represented in the FMA are in Courier New font, which distinguishes, for example, Organ, a class in the AT, from the term ‘organ’ used in a general context; relationships between concepts are in italics enclosed by hyphens, e.g., -part of-; italics are also used for emphasis and for Latin terms; abbreviations of the components of the FMA are in bold capitals, e.g., AT.

2. The Foundational Model of Anatomy

The Foundational Model of Anatomy is an evolving ontology for biomedical informatics; it is concerned with the representation of entities and relationships necessary for the symbolic modeling of the structure of the human body in a computable form that is also understandable by humans [8,9]. Specifically, the FMA is an abstraction that explicitly represents a coherent body of declarative knowledge about human anatomy as a domain ontology (defined below). The ontology is implemented in a frame-based system and is stored in a relational database. The FMA is intended as a reusable and generalizable resource of deep anatomical knowledge, which can be filtered to meet the needs of any knowledge-based application that requires structural information. It is distinct from application ontologies in that it is not intended as an end-user application and does not target the needs of any particular user group. We regard this model as foundational for two reasons: (1) anatomy is fundamental to all biomedical domains; and (2) the anatomical concepts and relationships encompassed by the FMA generalize to all these domains. By ‘anatomical concept’ we mean a unit of thought that refers to an anatomical entity (defined in section 3.2.1).

The Foundational Model currently contains 70,000 distinct anatomical concepts, representing structures ranging in size from some macromolecular complexes and cell components to major body parts. These concepts are associated with more than 110,000 terms, and are related to one another by more than 1.5 million instantiations of over 170 kinds of relationships. We developed and instantiated this large and complex model through an approach we call disciplined modeling.

3. Disciplined modeling

We first describe the elements of disciplined modeling that have guided the establishment of the three major components of the FMA and then deal with each of these components: the Anatomy Taxonomy, the Anatomical Structural Abstraction, and the Anatomical Transformation Abstraction.

3.1. Elements of disciplined modeling of anatomy

We borrow the term ‘disciplined modeling’ from Perl et al. [12,13], who proposed a methodology for restructuring existing vocabularies in order to introduce clarity into their representation scheme. We on the other hand have employed a disciplined approach for the de novo creation of a new knowledge base. The elements of our approach consist of a set of declared foundational principles, a high level scheme for representing the referents of concepts and relationships in the anatomy domain, Aristotelian definitions and a knowledge modeling environment that assures implementation of the principles and the inheritance of definitional and non-definitional attributes.

3.1.1. Foundational principles

Principles are assertions that provide the basis for reasoning and action. The nature of the principles we declare is dictated by the definition of the domain we intend to model. This domain is anatomy. We have previously distinguished and defined two concepts for which the term ‘anatomy’ is a homonym: anatomy (science) and anatomy (structure) [8]. As its definition in a preceding section specifies, the Foundational Model of Anatomy is an abstraction of anatomy (structure), which is the ordered aggregate of material objects and physical spaces filled with substances that together constitute a biological organism. The instantiated symbolic model itself is a concrete manifestation of anatomy (science), which is a biological science concerned with the discovery, analysis and representation of anatomy (structure). We declared the following principles for guiding the formulation and instantiation of the FMA abstraction [8,9]:

1. Unified context principle. The abstraction should conform to a strictly structural context. Although anatomical discourse in education and various biomedical fields embraces diverse contexts (e.g., functional, surgical, radiological, and biomechanical), it is the analysis and description of an organism’s structure that distinguishes the science of anatomy from other biological sciences. We have found that only in a structural context is it possible to establish a single inheritance hierarchy that subsumes all anatomical concepts. As stated earlier, it is our contention that such a structure-based representation can serve as a reference ontology for correlating other (e.g., functional, clinical) contexts and views of anatomy.

2. Abstraction level principle. The abstraction should model canonical anatomy and provide a framework for anatomical variants, but should exclude instantiated anatomy.

We have previously distinguished canonical and instantiated anatomy [8]. Canonical anatomy is a field of anatomy (science) that comprises the synthesis of generalizations based on anatomical observations that describe idealized anatomy (structure). These generalizations have been implicitly sanctioned by their usage in anatomical discourse. Instantiated anatomy is the field of anatomy (science) which comprises anatomical data pertaining to instances (i.e., individuals) of organisms and their parts. Although we exclude instantiated anatomy from the FMA, our intent is for the FMA to serve as a foundation for the representation of the anatomy of individuals and to provide an organizational framework for anatomical data, including images. Thus, the FMA should represent classes, which are multiply located anatomical entities (i.e., universals) that exist in the instances (or particulars) that they subsume.

3. Species specificity principle. The initial iteration of the abstraction should model the anatomy of Homo sapiens, but at the same time it should serve as a framework for the anatomy of other mammalian and, eventually, other vertebrate species. Although clinical medicine is concerned with the human, animal models of human disease, as well as veterinary medicine in its own right, call for a symbolic representation of anatomy. The highly conserved groups of structural genes that dictate the vertebrate body phenotype provide a rationale for eventually modeling species-specific anatomy as specializations of a generalizable vertebrate body plan [14]. Therefore, the high level abstract classes of the FMA should accommodate the generalized “Bauplan” of vertebrates.

4. Definition principle. Defining attributes of a class in the model should be specified in terms of the physical and other structural (i.e., anatomical) attributes of the anatomical entities that the class subsumes (see Section 3.1.3).

Footnote 1: In previous publications this was called the Ao (Anatomy ontology); we renamed it as AT in order to distinguish it from the entire FMA, which is more appropriately regarded as an ontology.

5. Dominant concept principle. An ontology�s domi-

nant class is the class in reference to which other classes

in the ontology are defined. Anatomical structure

(defined in Section 3.2) shall be the dominant class in the

FMA (see Section 3.2.2.2).

6. Organizational unit principle. The abstraction shall have two units in terms of which subclasses of Anatomical structure are defined: Cell and Organ.

Other subclasses of Anatomical structure shall

constitute cells or organs, or be constituted by cells or

organs.

7. Content constraint principle. The largest anatomical

structure represented shall be the whole organism (in the

current iteration, the human body) and the smallest Biological macromolecule. Should the need arise, molecules not synthesized through the expression of the organism's own genes shall be represented in separate

ontologies. Within these constraints, the abstraction

shall model both concepts and relationships at the most

refined level of granularity.

8. Relationship constraint principle. The abstraction

shall model three types of relationships that occur between anatomical entities: (1) class subsumption relationships; (2) static physical relationships; and (3)

relationships that describe the transformation of ana-

tomical entities during the ontogeny of an organism.

Dynamic physical relationships between anatomical

entities (e.g., those relating to physiological function and

the pathogenesis of abnormalities and disease) shall be

modeled in separate ontologies.

9. Coherence principle. The abstraction shall have one root, Anatomical entity, which subsumes all entities relating to the structural organization of the body;

concepts referring to these entities shall be arranged in a

single and comprehensive inheritance class subsumption

hierarchy.

10. Representation principle. The abstraction shall be

modeled as an ontology of anatomical concepts and should accommodate all naming conventions associated

with these concepts.

Because of the diverse and implied meanings associated with the term 'ontology' (some of which are reviewed by Burgun and Bodenreider [11]), we prefer to

refer to the abstraction of the FMA as a symbolic

model, rather than an ontology. We define a symbolic

model as a conceptualization of a domain of discourse represented with non-graphical symbols in a computable

form that supports inference. We designate such a

symbolic model as a foundational model, when it declares

the principles for including concepts and relationships

that are implicitly assumed when knowledge of the do-

main is applied in diverse contexts, and explicitly defines

the concepts and relationships necessary for consistently

modeling the structure of the coherent knowledge domain. In order to justify its designation as foundational,

such a model should serve as a reference in terms of

which other views (contexts) of the domain can be cor-

related. Moreover, the concepts represented in a foun-

dational model should be indispensable for the symbolic

modeling of, and discourse in, a number of other domains. The Foundational Model of Anatomy is a foundational model of the physical organization of the human body, i.e., anatomy (structure), and its coherent

knowledge domain is anatomy (science). Other domains

for which anatomy is indispensable include physiology,

pathology, clinical medicine, and molecular and devel-

opmental biology.

These principles provide the rationale for proposing a

high level scheme for the FMA.

3.1.2. High level scheme

A high level scheme encapsulates the concept domain

and scope of a symbolic model and defines its main

components; in effect it serves as a hypothesis that is

tested by the instantiation of the model and may be

modified during this process. We have previously pro-

posed such a high level scheme for the Foundational

Model of Anatomy [9]:

FMA = (AT, ASA, ATA, Mk),   (1)

where AT is the Anatomy Taxonomy, which specifies the taxonomic relationships of anatomical entities and assigns them to classes (defined in next section) according

to defining attributes which they share with one another

and by which they can be distinguished from one an-

other;1 the ASA, or Anatomical Structural Abstraction

describes the partitive (meronymic) and spatial rela-

tionships of the concepts represented in the taxonomy;

the ATA, or Anatomical Transformation Abstraction

describes the time-dependent morphological transfor-

mations of the concepts represented in the taxonomy

during the human life cycle, which includes prenatal

development, post-natal growth and aging; and Mk re-

fers to Metaknowledge, which comprises the principles

and sets of rules, according to which the relationships

are represented in the model's other three component abstractions.

This abstraction captures the information that is necessary for describing the anatomy of not only the whole

body, but also that of any structure (physical object) or

space that constitutes the body. Indeed, in practical terms,

the foundational model of the whole body must be gen-

erated stepwise through aggregating the symbolic models

of discrete classes of physical anatomical entities. The

foundational model for the anatomy of the entire body


(FMA_BODY) may, therefore, be conceived of as the aggregate of the foundational models of physical anatomical entities ({FMA_PHYSICAL ANATOMICAL ENTITY}) that constitute the body. Thus,

FMA_BODY = {FMA_PHYSICAL ANATOMICAL ENTITY}.   (2)

The FMA's high level scheme identifies the anatomy

taxonomy as one of the component abstractions of the

symbolic model or ontology, a distinction that is rarely

made clear in discussions of ontologies. The AT forms

the backbone of the FMA, and Aristotelian definitions,

a third element of principled modeling, play a key role in its establishment.

3.1.3. Aristotelian definitions

In dictionaries the unit of information is a term, and

the purpose of the definitions is to define all meanings

associated with a given term. For example, the term 'organ' may refer, among other things, to a musical instrument, or a part of the human body. In an ontology or foundational model, as we define it above, the unit of

information is a concept and the purpose of definitions

is to align all concepts in the ontology's domain in a

coherent inheritance type hierarchy or taxonomy. This

objective imposes a set of requirements that are not

satisfied by the majority of dictionary definitions. We

have found that, unlike a number of controlled medical

terminologies, we could not adopt dictionary definitions for establishing the Anatomy Taxonomy. Therefore,

guided by the foundational principles we declared, and

relying on precedent set by Aristotle [15], we formulated

ten desiderata that definitions must satisfy in order to

support the creation of an inheritance type hierarchy,

such as the AT [16].

In brief, these desiderata specify that definitions

should be consistent with the declared context and principles of an ontology. Rather than stating the

meaning of terms, definitions should state the essence of

anatomical entities in terms of their characteristics,

consistent with the ontology's context. Paraphrasing

Aristotle, the essence of an entity is constituted by two

sets of defining attributes; one set, the genus, necessary

to assign an entity to a class and the other set, the differentiae, necessary to distinguish the entity from other entities also assigned to the class. A collection of entities

that share the same set of essential characteristics con-

stitutes a class of the ontology. The defining attribute/s

shared by all entities within the selected domain should

specify the root of the ontology. To assure transitive

inheritance of essential characteristics, classes that may

not have been explicitly identified in existing sources of

domain knowledge should be defined.

Provided these desiderata are satisfied, the hierarchical sequence of classes in the taxonomy will be dictated by the properties shared by collections of entities.

The soundness of this hierarchy will then depend on the

explicit specification of the properties (attributes) that define the essence of entities, providing the basis on

which they may be grouped together or distinguished

from one another. Unlike dictionary definitions, which

bear no relationship to their neighbors in the alphabet-

ized list of terms, the definition of a concept in a tax-

onomy is enriched by the definition of all of its parents

within the hierarchy. Thus, a definition of a concept

within an ontology is incomplete without that of all of its parents.
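The genus and differentia structure described above can be sketched in a few lines of Python. This is only an illustration, not the FMA's implementation: the `ClassDefinition` type and its method names are hypothetical, and the differentia strings paraphrase class definitions that the paper introduces in Section 3.2.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class ClassDefinition:
    """Hypothetical genus-differentia definition of one taxonomy class."""
    name: str
    genus: Optional["ClassDefinition"]   # parent class; None only for the root
    differentiae: Tuple[str, ...]        # what sets the class apart within its genus

    def full_definition(self) -> List[str]:
        """Differentiae accumulated from the root down to this class."""
        inherited = self.genus.full_definition() if self.genus else []
        return inherited + list(self.differentiae)

    def lineage(self) -> List[str]:
        """Class names from the root down to this class."""
        above = self.genus.lineage() if self.genus else []
        return above + [self.name]


# Differentia wording is paraphrased for illustration only.
anatomical_entity = ClassDefinition(
    "Anatomical entity", None,
    ("pertains to the structural organization of a biological organism",),
)
physical = ClassDefinition(
    "Physical anatomical entity", anatomical_entity, ("has spatial dimension",)
)
material = ClassDefinition(
    "Material physical anatomical entity", physical, ("has mass",)
)

print(material.lineage())
print(material.full_definition())
```

Calling `full_definition()` on a class walks every ancestor, which mirrors the point that a definition within the taxonomy is incomplete without the definitions of all of its parents.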

Therefore, in creating the Anatomy Taxonomy, two

challenges need to be met: a conceptual one, which is to

identify the structural attributes in terms of which entities that constitute the human body may be grouped

together and distinguished from one another, and a

practical one, which is to identify an authoring program

that not only supports but also enforces the implementation of foundational principles and definitional desiderata that are to guide the creation of the FMA. We

first describe the knowledge modeling environment we

selected, which is the fourth element of disciplined

modeling.

3.1.4. Knowledge modeling environment

We have analyzed the challenges posed by the seemingly simple task of formally representing declarative anatomical knowledge and found them to be surprisingly complex [17]. We selected the Protégé-2000 ontology editing and knowledge acquisition environment

[18] for encoding the FMA, because its frame-based

architecture, which is compatible with the Open

Knowledge Base Connectivity (OKBC) protocol [19],

provides for an expressive, scalable and tractable representation of anatomical entities and the complex relationships that exist between them. We briefly describe

and illustrate with examples how (1) frames are used in Protégé-2000 to represent anatomical concepts; (2) frames allow for distinguishing between classes and instances; (3) Protégé-2000 provides for selective inheritance of attributes; and (4) Protégé enhances the specificity and expressivity of attributes through assigning to them their own attributes.

3.1.4.1. Frames, slots, slot values, and facets. Anatomical

concepts are represented as frames in Protégé-2000. A frame is a data structure that contains all the information in the ontology about a given concept. This information includes the properties of the entity to which that concept refers and also the relationships of that entity to other entities. In the context of the FMA, a

frame is a named anatomical entity, such as vertebra.

With each frame is associated a defined set of attributes;

each of these attributes has a value. Thus each frame

consists of a concept and a set of attribute/value pair-

ings. Fig. 1 shows the frame Vertebra; the concept

highlighted in the left hand pane (the AT) and some of the attribute/value pairings in the right hand pane of the Protégé graphical user interface (GUI).

Fig. 1. The frame of the concept Vertebra.

Fig. 2. A variety of terms associated with the concept Uterine tube.

Fig. 3. Attributed adjacency and continuity relationships of the Esophagus.

Attributes (properties) and relationships of the entity

associated with the concept are expressed as slots of the

frame. Slots correspond to such non-structural attri-

butes as preferred name, synonyms, and numerical

identifiers (UWDA-ID), as well as such structural attributes or relationships as -has part-, -part of-, -has dimension-, -bounded by-, etc. Slots remain empty unless filled with one or more values. In Fig. 1 the synonyms

slot is empty because Vertebra has no synonyms,

whereas the same slot in the frame of Uterine tube in

Fig. 2 is filled with two values.

Protégé-2000 allows different binary relationships for

slots. Some slots, like -has dimension- and -has inherent

3D shape-, have a binary relationship with atomic values

like Boolean "true" or "false"; for slots that describe binary relationships between frames, the values are derived from established classes of the AT or the FMA's other associated taxonomies. For example, the Dimensional Ontology provides the values for the slot -has

shape- (e.g., cylinder, polyhedron, which are subclasses

of 3-D volume), whereas the values for the part and

adjacency slots in the frame are derived from the AT.

In Protégé-2000, facets impose constraints on the values that a slot can have. For example, the facets of

the -part of- slot in the frame of Organ specify that

there can be multiple values for the slot and that the

values can be derived only from AT classes Organ

System, Organ system subdivision, Body

part and Body part subdivision. Thus the value

Vertebral column in the -part of- slot of Vertebra

is allowed, because Vertebral column is a subdivision of the skeletal system. Another example is the restriction for the -nerve supply- slot; values for this slot

may only be derived from AT classes Cranial nerve,

Spinal nerve, and Peripheral nerve.
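To make the roles of frames, slots, and facets concrete, the following Python sketch mimics a facet that restricts the values of the -part of- slot to the AT classes listed above. It is not the Protégé-2000 API; the names `Frame`, `Facet`, `add_slot_value`, and the tiny `IS_A` table are hypothetical and stand in for the real class hierarchy.

```python
from dataclasses import dataclass, field
from typing import Dict, FrozenSet, List, Optional

# Hypothetical fragment of the is-a hierarchy (child -> parent), for illustration.
IS_A: Dict[str, str] = {
    "Vertebral column": "Organ system subdivision",
    "Thoracic cavity": "Anatomical space",
}


@dataclass(frozen=True)
class Facet:
    """Constraint on the values a slot may hold."""
    allowed_classes: FrozenSet[str]
    multiple: bool = True


@dataclass
class Frame:
    """A named concept with slots; slots stay empty until filled."""
    name: str
    slots: Dict[str, List[str]] = field(default_factory=dict)


def is_kind_of(name: str, allowed: FrozenSet[str]) -> bool:
    """True if `name` or one of its ancestors is among the allowed classes."""
    current: Optional[str] = name
    while current is not None:
        if current in allowed:
            return True
        current = IS_A.get(current)
    return False


def add_slot_value(frame: Frame, slot: str, value: str, facet: Facet) -> None:
    """Fill a slot only if the facet permits the value."""
    if not is_kind_of(value, facet.allowed_classes):
        raise ValueError(f"{value!r} is not an allowed value for {slot!r}")
    if frame.slots.get(slot) and not facet.multiple:
        raise ValueError(f"{slot!r} accepts a single value only")
    frame.slots.setdefault(slot, []).append(value)


PART_OF_FACET = Facet(frozenset({
    "Organ system", "Organ system subdivision", "Body part", "Body part subdivision",
}))

vertebra = Frame("Vertebra")
add_slot_value(vertebra, "part of", "Vertebral column", PART_OF_FACET)
print(vertebra.slots)
```

In this sketch an attempt to add Thoracic cavity (an Anatomical space) as a -part of- value raises a `ValueError`, because none of its ancestors is among the classes the facet allows.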

3.1.4.2. Classes and instances. In Protégé-2000 a frame

may represent a class or an instance. As far as most users

of the Foundational Model will be concerned, however (and as explained below), all the nodes of the Anatomy

Taxonomy hierarchy may be regarded as classes.

A class in the AT is a collection of anatomical entities

or collections of collections. For example, the class

Vertebra represents such a collection of collections. It

subsumes different collections of vertebrae like cervical,

thoracic, and lumbar vertebrae (Fig. 1). Moreover, the

members of each of these collections, which in Protégé are represented as subclasses of Vertebra, are likewise

further grouped into more specialized collections. This is

true even of the leaves of the Vertebra tree, which

have no subclasses in the AT. The Fifth lumbar

vertebra, for example, is a class to which the fifth

lumbar vertebrae of individuals like a John or a Jane

Doe belong. Therefore, unlike the higher classes, Fifth

lumbar vertebra, as currently implemented, does not subsume collections of collections; rather it subsumes concrete anatomical entities, which, however, are

not represented in the AT. Should a need arise, this

representation allows us to elaborate the AT by intro-

ducing subclasses of Fifth lumbar vertebra speci-

fied by gender or race, for example, without having to

redefine this class and its ancestors.

Since concrete, real-world objects, such as the vertebrae of a John or a Jane Doe, represent anatomical data, in concurrence with the 'abstraction level principle,' they are excluded from the FMA; they belong in the field of

instantiated anatomy. By contrast, concepts in the class

hierarchy of the Anatomy Taxonomy refer to collections

and collections of collections; they belong in the field of

canonical anatomy.

Although the above explanation suggests that all concepts of anatomical structures in the AT are classes,

in fact, we had to assign the role of instance as well to

the frames of these concepts. In the frame-based system

of Protégé, this was the technical solution for enabling

the selective inheritance of attributes, discussed in the

next section. This solution required the establishment of

a metaclass hierarchy and assigning the frames of AT

classes as instances of the corresponding metaclasses (see below). Thus, except for its root, each concept in the AT is a subclass of a superclass and also an instance of

a metaclass. These dual assignments integrate the AT

and the metaclass hierarchy. Class-to-class relationships

in the integrated AT and metaclass hierarchies are encoded in Protégé as -direct superclass- and -direct subclass- links, whereas the inverse relationship between a class and its instances in the metaclass hierarchy is -direct type- and -direct instance-. We distinguish the integrated Anatomy Taxonomy and metaclass hierarchy

from other hierarchies (e.g., part-of, branch-of) by

calling it the -is a- hierarchy. This technical contrivance

is of interest to the authors of the FMA and to other

knowledge modelers; it can, however, remain opaque to

other users of the ontology.

3.1.4.3. Selective inheritance of attributes. The purpose

of the Anatomy Taxonomy is to assure the propagation

or inheritance of attributes. It is necessary, however, to

distinguish between the attributes that should and

should not be propagated. As intimated above, the de-

sired selective inheritance is achieved operationally, in a

seemingly contradictory way, by assigning a dual role to

each frame: in Protégé each AT frame is modeled both as a class and as an instance. Its role as a class allows it

to propagate its set of attributes to its subclasses, but in

its role as an instance it is prevented from doing so.

The insertion of new slots at appropriate levels of the

ontology provides for introducing definitional and other

attributes that should be inherited by descendants of a

class. Such a class has been designated as a property

introduction class [20], whereas in Protégé-2000 new attributes (slots) are introduced in metaclasses. Metaclasses function as templates, and serve to define new

classes. Newly created classes in the AT are assigned as

instances of corresponding metaclasses. Thus an AT

class is a subclass of its ancestor classes in the AT and its

frame is an instance of its metaclass. For example, the

AT class Vertebra is a subclass of Irregular bone

and an instance of Vertebra metaclass.

This arrangement allows for discriminating between

slots that should and should not be propagated. The

definitional attributes are propagated to descendants of

the class as template slots; they specify which slots each

member of the class shall have and what the restrictions

(facets) on the values of these slots shall be. Instances of

the class, on the other hand, inherit such template slots

as own slots and assign specific values to them (own slot values). Own slots are not propagated. For example,

Vertebra metaclass has a template slot -part of-,

which its instance Vertebra inherits as its own slot,

and assigns the slot value Vertebral column. Cervical vertebra is a subclass of Vertebra and inherits the template slot -part of- but not the slot value

Vertebral column. Instead it converts the template

slot into its own slot, and assigns its own slot value Cervical vertebral column. Template slots dictate what attributes or slots a class must impose on its

descendants. The example illustrates the principle of

modeling at the most refined level of granularity.

Although Cervical vertebra is part of Vertebral

column, the most specific relationship holds for Cervical vertebral column, which is also a subdivision of the skeletal system and is in turn a part of the Vertebral column. It is the role of intelligent query

interfaces, described in Section 4, to concatenate such

relationships and allow the result Cervical verte-

bra -part of- Vertebral column.
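The distinction between template slots, which propagate to subclasses, and own slot values, which do not, can be sketched as follows. This is a simplified illustration rather than the Protégé-2000 mechanism: it folds the metaclass into a single hypothetical `Frame` class, and `part_of_chain` is an assumed helper that shows how a query interface might concatenate -part of- relationships.

```python
from typing import Dict, Iterable, List, Optional, Set


class Frame:
    """Simplified frame: template slots are inherited, own values are not."""

    def __init__(self, name: str, parent: Optional["Frame"] = None,
                 template_slots: Iterable[str] = ()) -> None:
        self.name = name
        self.parent = parent
        self._introduced: Set[str] = set(template_slots)
        self.own_values: Dict[str, str] = {}

    def template_slots(self) -> Set[str]:
        """Slots introduced here plus those propagated from ancestors."""
        inherited = self.parent.template_slots() if self.parent else set()
        return inherited | self._introduced

    def set_own(self, slot: str, value: str) -> None:
        """Assign an own slot value; only declared template slots may be filled."""
        if slot not in self.template_slots():
            raise KeyError(f"{self.name} has no slot {slot!r}")
        self.own_values[slot] = value


def part_of_chain(name: str, frames: Dict[str, Frame]) -> List[str]:
    """Follow -part of- values upward, concatenating the relationships."""
    chain: List[str] = []
    seen = {name}
    current = frames.get(name)
    while current is not None:
        whole = current.own_values.get("part of")
        if whole is None or whole in seen:
            break
        chain.append(whole)
        seen.add(whole)
        current = frames.get(whole)
    return chain


vertebra = Frame("Vertebra", template_slots={"part of"})
vertebra.set_own("part of", "Vertebral column")

cervical = Frame("Cervical vertebra", parent=vertebra)
# The slot is inherited, the value is not; the subclass assigns its own.
cervical.set_own("part of", "Cervical vertebral column")

cervical_column = Frame("Cervical vertebral column", template_slots={"part of"})
cervical_column.set_own("part of", "Vertebral column")

frames = {f.name: f for f in (vertebra, cervical, cervical_column)}
print(part_of_chain("Cervical vertebra", frames))
```

Before `cervical.set_own(...)` is called, `cervical.own_values` is empty even though -part of- is among its template slots, which is the selective inheritance described above. The chain returned for Cervical vertebra ends at Vertebral column, the result that the query interface is expected to derive.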

3.1.4.4. Attributed relationships. The FMA is particu-

larly rich in relationships, which, in addition to defining

attributes, describe the part-whole, location, and other spatial associations of anatomical entities. However, for

the precise and comprehensive description of the struc-

ture of the body, it is not sufficient to state, for example,

that the esophagus is continuous with the pharynx and

stomach, or that it is adjacent to the vertebral column. It

is necessary to specify that the esophagus is continuous

with the pharynx superiorly and with the stomach inferiorly; and its adjacency relationship with the vertebral column is posterior, whereas with the fibrous pericardium, it is anterior, on both the right and the left. Thus

the continuity and adjacency attributes need to be as-

sociated with additional attributes in order to express

additional elements of knowledge involved in the rela-

tionships. Such attributed relationships are the rule ra-

ther than the exception in anatomy. Their representation

in any knowledge-modeling environment is a challenge. The solution we developed in the frame-based environment of Protégé-2000 may seem complex, but it captures

the necessary knowledge [17].

The solution is to attach to a slot (e.g., -continuous

with-, -adjacency-) a value that includes not only the

simple adjacency relationship between referenced structures but also the additional attributes of that relationship (e.g., superiorly, inferiorly, or anterior, posterior, left and right). Attribution of the slot value is called

reification. This can be achieved by assigning the slot

value as an instance frame of a class which specifies or

describes the additional attributes for the relationship.

For example, in the case of the slot -adjacency-, the

slot value is an instance of a class Anatomical ad-

jacency coordinate. This class carries the template

slots that describe the adjacent structure (-related part-) and its relative position or coordinate (-coordinate- and

-laterality-) that qualify its adjacency to the reference

anatomical structure. As shown in the frame of

Esophagus (Fig. 3), one value of its -adjacency- slot is

an instance that shows the related part Fibrous

pericardium as being anterior and to the right and

left (coordinate and laterality, respectively) of the

esophagus, which is the reference anatomical structure.

This rather complex reification process allows us to

not only comprehensively represent structural relations

but also to qualify relations with additional attributes in

order to describe the structure of the body with accuracy

at the highest level of granularity. The process also il-

lustrates that the challenges of modeling anatomical

knowledge push the envelope of available methods [17]

and require the collaboration of anatomists and knowledge engineers.
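The reification of an adjacency relationship can be illustrated with a small Python sketch in which each slot value is itself a structured object. The type and field names below are hypothetical renderings of Anatomical adjacency coordinate and its -related part-, -coordinate-, and -laterality- slots. The laterality of the vertebral column entry is left empty because the text does not state it.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class AdjacencyCoordinate:
    """Hypothetical reified value of the -adjacency- slot."""
    related_part: str                   # the adjacent structure
    coordinate: str                     # e.g. "anterior" or "posterior"
    laterality: Tuple[str, ...] = ()    # e.g. ("right", "left"); empty if unstated


# Values of the -adjacency- slot in the frame of Esophagus, taken from the text.
esophagus_adjacency: List[AdjacencyCoordinate] = [
    AdjacencyCoordinate("Fibrous pericardium", "anterior", ("right", "left")),
    AdjacencyCoordinate("Vertebral column", "posterior"),
]


def adjacent_parts(values: List[AdjacencyCoordinate], coordinate: str) -> List[str]:
    """Names of the structures adjacent in the given direction."""
    return [v.related_part for v in values if v.coordinate == coordinate]


print(adjacent_parts(esophagus_adjacency, "anterior"))
print(adjacent_parts(esophagus_adjacency, "posterior"))
```

Because the slot value is an object rather than a bare class name, a query can filter on the qualifying attributes, which a plain list of adjacent structures would not allow.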

3.2. Anatomy taxonomy

Anatomical discourse in educational, research and

clinical contexts proceeds at the level of discrete ana-

tomical structures and spaces, which correspond to leaf

concepts of a taxonomy. Although attempts to standardize anatomical terminology are more than a century

old, time-honored sources of the domain contain only

implied and contradictory schemes for classifying ana-

tomical entities, which are not supported by explicit

definitions. The officially sanctioned term list, Termin-

ologia Anatomica [21] (and its predecessor Nomina An-

atomica), compiled by an international group of

anatomists, has a number of shortcomings for supporting the establishment of an inheritance hierarchy

[22]. Chief among these shortcomings is the lack of ab-

stract classes that could subsume more and more specific

collections of anatomical entities on the basis of their

shared essential properties. As a consequence, controlled

medical terminologies and emerging ontologies in bio-

informatics have no choice but to establish their own

scheme for aligning anatomical concepts in a computable representation. Since these sources target the needs

of diverse user groups, they represent anatomy in het-

erogeneous contexts; therefore their anatomy content is

hard to generalize to domains beyond their own.

In this section we present the rationale for the class

structure of the AT in the context of foundational

principles, starting with the selection of its root. Next we

illustrate the inheritance of definitional and other attributes through the class subsumption hierarchy and

comment on the derivation of terms.

3.2.1. Root of the AT

Since our intent is to represent knowledge about

anatomical structure, the Anatomy Taxonomy must

accommodate not only the physical entities (substances, objects, spaces, surfaces, lines, and points) that constitute the body, but also the descriptors of these

entities that we want to model. Terms, coordinates,

relationships, developmental stages and other non-

physical concepts that form an indispensable part of

anatomical discourse must also be included in the AT.

A more restricted concept than 'entity' will not subsume these concepts. Therefore, we declared Anatomical entity as the root of the AT and, in order to satisfy requirements for its Aristotelian definition,

we considered the essential properties of this concept.

Anatomical entities can be conceptualized only in re-

lation to biological organisms, and they are unique

among biological concepts in that they pertain to the

structural organization of these organisms. Therefore,

the genus of 'anatomical entity' is the primitive 'biological entity,' because it manifests the essence of all biological entities (namely that they pertain only to

biological organisms), and the differentia is the re-

striction to structure. The definition may therefore be

written as:

Anatomical entity

is a biological entity,

which constitutes the structural organization

of a biological organism, or

is an attribute of that organization.

Fig. 4. Schematic representation of the principal classes of the Anat-

omy Taxonomy.

We use this first definition of the FMA to illustrate

the process of formulating such definitions. The con-

ceptualization and insertion of such a new class in the

AT is paralleled by establishing the template slots in its

metaclass that will be inherited by all of its descen-

dants. Every concept to be entered in the FMA will

have a preferred name and a specific, randomly assigned numerical identifier. Therefore slots for these

attributes are inserted in the Anatomical entity

metaclass. This template will also have other slots.

For example, all anatomical entities, including ana-

tomical terms, have parts. Therefore the -has part- slot,

and its inverse, -part of-, are introduced at the root of

the AT.

3.2.2. The inheritance class subsumption hierarchy

3.2.2.1. High level classes. The rationale for selecting the

root of the AT makes reference to two major types of

anatomical entities in terms of whether or not they are

physical in nature. Therefore we designated the imme-

diate descendants of Anatomical entity as the

classes Physical anatomical entity and Non-physical anatomical entity (Fig. 4). The genus for both is Anatomical entity, and in structural

terms the differentia that distinguishes these two classes

is the structural attribute of spatial dimension: All

physical entities have spatial dimension, because they

are volumes, surfaces, lines or points, whereas non-

physical entities have no spatial dimension. Therefore

the attribute and its corresponding slot 'spatial dimension' are introduced at this level; the value of the slot in the frame of Physical anatomical entity will be 'true.' Not only the slot, but also its value will be inherited by all descendants of this class.

Physical anatomical entities may be further specified

on the basis of whether or not they have mass, which

serves as the differentia of the classes Material

physical anatomical entity and Non-material physical anatomical entity. Subclasses of the latter are Anatomical space, Anatomical

surface, Anatomical line, and Anatomical

point, none of which have mass [23]. These classes are

distinguished from one another by the number of spatial

dimensions they have.

Even without presenting the definitions of these

classes and listing their defining differential attributes,

the logic and rationale for establishing these high level abstract classes should become apparent. Although anatomical texts and medical terminologies with an anatomical content deal only superficially, if at all, with

anatomical surfaces, lines, and points, it is nevertheless

necessary to represent these entities explicitly and comprehensively in the FMA in order to describe boundary

and adjacency relationships of material physical ana-

tomical entities and spaces.

The class of Material physical anatomical

entity may be subdivided into two major types on the

basis of the differentia of inherent 3D shape. We desig-

nate the collection that lacks this attribute as Body

substance; its descendants include Secretion, Excretion, Blood, etc., all of which have mass and

accommodate to the shape of their container. The

members of the collection that have their own inher-

ent 3D shape constitute the class Anatomical

structure.
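The successive differentiae introduced in this section (spatial dimension, mass, inherent 3D shape) amount to a small decision procedure, sketched below in Python. The function `classify` and its parameter names are hypothetical. The mapping of Anatomical space, surface, line, and point to 3, 2, 1, and 0 dimensions is an assumption, since the text says only that these classes differ in their number of spatial dimensions.

```python
from typing import Optional

# Assumed dimension counts; the text states only that the counts differ.
NON_MATERIAL_BY_DIMENSIONS = {
    3: "Anatomical space",
    2: "Anatomical surface",
    1: "Anatomical line",
    0: "Anatomical point",
}


def classify(has_spatial_dimension: bool,
             has_mass: bool = False,
             has_inherent_3d_shape: bool = False,
             dimensions: Optional[int] = None) -> str:
    """Assign a high level AT class from the three differentiae."""
    if not has_spatial_dimension:
        return "Non-physical anatomical entity"
    if not has_mass:
        # Falls back to the parent class when the dimension count is unknown.
        return NON_MATERIAL_BY_DIMENSIONS.get(
            dimensions, "Non-material physical anatomical entity")
    if not has_inherent_3d_shape:
        return "Body substance"
    # Inherent 3D shape alone is treated as sufficient here; this is a simplification.
    return "Anatomical structure"


print(classify(True, has_mass=True))            # e.g. Blood
print(classify(True, dimensions=2))             # e.g. Surface of heart
print(classify(True, True, True))               # e.g. Vertebra
```

The sketch stops at the level of Anatomical structure and does not apply the further differentiae (gene-directed generation and arrangement of parts) that the full definition of that class requires.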

3.2.2.2. Dominant concept. The dominant class principle

declares Anatomical structure as the dominant class in the FMA; therefore its definition is of particular

importance.

Anatomical structure

is a material physical anatomical entity

which has inherent 3D shape;

is generated by coordinated expression

of the organism's own structural genes;

consists of parts that are anatomical structures;

spatially related to one another in patterns

determined by coordinated gene expression.

The definition illustrates that inherent 3D shape is a

necessary, but not a sufficient, differentia for defining the

class Anatomical structure. We have to exclude

from this class, for example, manufactured objects used

as prostheses and biological organisms such as parasites and bacteria that are introduced into an individual, as

well as space-occupying lesions such as neoplasms and

granulomas. The differentiae in the class definition that

exclude such foreign and abnormal structures are spec-

ified by constraining the class to biological objects

generated by the coordinated expression of groups of

the organism's own structural genes and thereby distinguishing these structures from those that result from perturbed or abnormal biological processes. Moreover,

by introducing the differentia of the genetically deter-

mined arrangement of the parts of an anatomical

structure, the definition also excludes from the class such

cell aggregates as a rouleau or a sediment of blood cells.

The dominant role of Anatomical structure is

reflected by the fact that non-material physical anatomical entities (e.g., spaces, surfaces) and body substances (e.g., blood, cytosol) are conceptualized in the

FMA, and also in anatomical discourse in general, in

terms of their relationship to anatomical structures. For

example, Thoracic cavity (an Anatomical

space) can only be conceptualized in terms of the

Anatomical structure (the Thorax) of which it is

a part; Surface of heart cannot exist without

Heart, the Anatomical structure, which the surface bounds; Cytoplasm, a Cell substance, can be

conceptualized only in reference to Cell, an Ana-

tomical structure.

The definition of Anatomical structure implements the 'content constraint principle' of the FMA, in

that it implies that the largest anatomical structure is the

organism itself, and the smallest are biological macromolecules assembled from smaller non-biological molecules through the mediation of the organism's genes. In this sense, the definition also distinguishes, in a broader

context, animate and inanimate objects.

3.2.2.3. Units of structural organization. The organiza-

tional unit principle designates Cell and Organ as

organizational units of the FMA; these are two of the

subclasses of Anatomical structure. All but two of the other subclasses of Anatomical structure are

conceptually derived from cell or organ, in that they are

either parts of cells and organs or are constituted by cells

and organs. We discuss these derivative classes in the

next section. The exceptions are Acellular ana-

tomical structure (e.g., elastic and collagen fiber

and otolith) and Biological macromolecule. Such

molecules exist in association with cell parts and also independent of cells in body substances. It may be argued that Biological macromolecule qualifies as

an organizational unit within the FMA. Although we

include a substantial number of macromolecules in the

FMA, our intent is to link to other ontologies when the

need arises for representing the molecular composition

and associations of cell parts and body substances.

Cell. With respect to Cell, the organizational unit principle is consistent with the cell theory of Schleiden

[24] and Schwann [25]. However, notwithstanding some

unique exceptions, a cell is a microscopic structure; in

practical terms, it is meaningful to consider it as a unit

of organization only at the microscopic level. No orga-

nizational unit existed at the macroscopic level until we

proposed 'organ' to fill this role [8]. It is hard to find satisfactory definitions of cell and organ in dictionaries. Our definitions of these two concepts conform to the

definition principle. We first define Cell and discuss its

subclasses.

Cell

is an anatomical structure

which consists of cytoplasm surrounded by a

plasma membrane

with or without the cell nucleus.

This class subsumes all cell types of the human body

and can accommodate those of other metazoan organ-

isms. One may find up to 10 different implied classifi-

cations of cells in the literature. However, these

classifications are unsupported by explicit definitions.

The most consistent scheme was proposed by Lovtrup

[26], and is based on such structural properties as the


488 C. Rosse, J.L.V. Mejino Jr. / Journal of Biomedical Informatics 36 (2003) 478–500

connectivity of cells to one another and the type of appendages they possess. We have adopted these properties as the differentia for the largest collections of cells

[27], and found it necessary to further subdivide these

classes based on embryonic derivation (Fig. 5). We

recognize that this classification introduces transforma-

tional rather than structural attributes as differentiae.

However, until the necessary gene expression data become available, the representation of cell lineages cannot be accomplished on the basis of structural attributes

alone. Cell classification is a topic that merits further

discussion in a separate publication.

Organ. Dictionary and textbook definitions of organ

are satisfied by such anatomical structures as the hand

or knee, as well as by the liver or the thymus. There are

also a large number of macroscopic anatomical structures, which are known by their specific name, but have not been designated as any particular higher level type.

For example, by what criteria is the skin generally re-

garded as an organ, but the underlying layer of super-

ficial fascia is never referred to as such, or as any other

type of entity? What are nerves and blood vessels? It has,

in fact been suggested that it is not possible to define

organ, because the meaning of the term varies so widely.

The definition we have proposed for Organ resolves these problems.

Organ

is an anatomical structure,

which consists of the maximal set of organ parts

so connected to one another that together

they constitute a self-contained unit of

macroscopic anatomy

morphologically distinct from other such units.

The definition is contingent on the definition of

Organ part.

Organ part

is an anatomical structure,

which consists of two or more types of tissues,

spatially related to one another

in patterns determined by coordinated gene

expression;

together with other contiguous organ parts

it constitutes an organ.

Tissue is another concept with a variety of meanings

in general discourse. Its dictionary and textbook defi-

nitions are violated by regarding such concepts as blood

and gingiva as tissues. Before discussing Organ, we also

define tissue.

Tissue

is an anatomical structure,

which consists of similarly specialized cells

and intercellular matrix,

aggregated according to genetically determined

spatial relationships.

The differentia of genetically determined spatial re-

lationships among the constituent cells excludes from

this class blood, lymph, semen, and cerebrospinal fluid, all of which meet the definition of Body substance.

Likewise, gingiva and many other entities convention-

ally referred to as tissue consist of more than one tissue

in terms of the FMA definition. The definition implies,

furthermore, that in the fully formed organism tissues

do not exist independent of organs. In the embryo,

however, tissues are definable before bona fide organs

are formed.

The definition of Organ part links the microscopic

and macroscopic units of structural organization to one

another and eliminates any circular element from the

definition of Organ. In terms of the definition, the liver

qualifies as an organ, because it is constituted by a

maximal set of anatomical structures that are composed

of tissues, and these structures are connected to one

another to form a discrete morphological entity. Although the right lung is composed of the same set of

connected organ parts as the left lung, the two sets are

not continuous with one another; hence the two lungs

are separate organs. The entire skin qualifies as an organ

in terms of the definition, and so does the superficial

fascia that underlies it. On the other hand, the brain and

spinal cord cannot be regarded as two separate organs,

since both are made of the same types of organ parts, which are continuous with one another and together

constitute a morphological whole. In fact a real

boundary between the two cannot be determined.

Therefore, the definition mandates that brain and spinal

cord be regarded as organ parts and that together they

be classified as one organ. We have named and defined it

as the Neuraxis [28].
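The connectivity criterion in the definition of Organ can be sketched as a search for maximal connected sets. The fragment below is a simplified illustration, not the FMA's implementation: the function name `maximal_connected_sets`, the list of organ parts, and the continuity pairs are assumptions chosen to mirror the lung and neuraxis examples, and the sketch ignores the further requirement of morphological distinctness.

```python
from collections import defaultdict


def maximal_connected_sets(parts, continuities):
    """Group organ parts into maximal sets linked by continuity."""
    adjacency = defaultdict(set)
    for a, b in continuities:
        adjacency[a].add(b)
        adjacency[b].add(a)

    seen, groups = set(), []
    for start in parts:
        if start in seen:
            continue
        stack, group = [start], set()
        while stack:  # depth-first walk over continuous parts
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            group.add(node)
            stack.extend(adjacency[node] - seen)
        groups.append(frozenset(group))
    return groups


# Toy data (assumed): parts and the continuities between them.
parts = [
    "Brain", "Spinal cord",
    "Upper lobe of right lung", "Middle lobe of right lung", "Lower lobe of right lung",
    "Upper lobe of left lung", "Lower lobe of left lung",
]
continuities = [
    ("Brain", "Spinal cord"),
    ("Upper lobe of right lung", "Middle lobe of right lung"),
    ("Middle lobe of right lung", "Lower lobe of right lung"),
    ("Upper lobe of left lung", "Lower lobe of left lung"),
]

organs = maximal_connected_sets(parts, continuities)
```

Under these assumed continuities the brain and spinal cord fall into one set, whereas the two lungs remain separate sets because no continuity links them.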

It follows from the definition of Organ that differentiae for distinguishing organ subclasses must be based

on the kinds of continuous organ parts of which organs

are constituted. Even without presenting definitions,

Fig. 6 illustrates the employment of elementary structural attributes, on the basis of which types of organs

are grouped together and distinguished from one an-

other. These essential properties (e.g., organ cavity, wall,

parenchyma, cortex, medulla, lobe, etc.) are introduced in the corresponding metaclasses and are inherited by

the subclasses of the respective organ types. Only at this

level of the AT do we reach specific organ types, such as

lung, esophagus, heart, etc., which are the concepts

commonly encountered in anatomical and clinical dis-

course. Such are also the concepts that are subsumed by

derivative subclasses of Anatomical structure.

3.2.2.4. Derivative classes. We regard Organ part and

Cell part, referred to in the previous section, as de-

rivative subclasses of Anatomical structure be-

cause they are conceived of in relation to Organ and

Cell, the organizational units of the FMA. Although

each of the remaining derivative subclasses is explicitly

defined, we will not present these definitions here; rather


we comment on them and illustrate the kinds of structures each subsumes.

Body part and organ system. Perhaps most important

are the classes Body part2 and Organ system. Both

are constituted by organs. In a body part, such as the

Trunk or Upper limb, organs of different classes are

related to one another through genetically predeter-

mined patterns. The same holds true for Body part

subdivisions (e.g., Thorax, Hand). Organ systems (and their subdivisions) are constituted of organs pre-

dominantly of the same type, which are interconnected

by zones of continuity. For example, Musculoskel-

etal system is comprised of the classes Muscle

(organ), Bone (organ), Joint, and Ligament

(organ), which together form an interconnected ana-

tomical structure. Subdivisions of this system, the

Skeletal system and Articular system, for example, consist of sets of bones and joints, respectively; the joints interconnecting the bones and vice versa. So-called systems of the body are, as a rule, conceived of in

functional rather than structural terms; therefore many

of them do not qualify as anatomical structures (e.g.,

immune system, endocrine system) and are excluded

from the Organ system class. However, because these

concepts are so widely used in anatomical and clinical discourse, we represent them in the FMA as the class

Functional system, which is a child of Non-ana-

tomical anatomical entity.

Anatomical cluster, set, and junction. There are a

number of other anatomical concepts in current use that

are a composite of organs, organ parts, tissues or cells

that are hard to classify, yet we wanted to accommodate

them in the FMA. For this purpose we created and defined the classes for Anatomical cluster, Anatomical set, and Anatomical junction.

For example, the root of the lung and the renal

pedicle meet the definition of Anatomical struc-

ture, but do not fit any of its subclasses we described so

far. Both consist of a heterogeneous set of organ parts

grouped together in a predetermined manner, but do not

constitute the whole or a subdivision of either a body part or an organ system. We classify such structures as

Anatomical cluster. Such clusters can be com-

posed of cells (e.g., splenic cord, consisting of erythro-

cytes, reticular cells, lymphocytes, monocytes, and

plasma cells), organ parts (e.g., tendinous or rotator

cuff, consisting of the fused tendons of several muscles),

as well as of organs (e.g., lacrimal apparatus consists of

the lacrimal gland, lacrimal sac, and nasolacrimal duct, each of which qualifies as an organ).

2 'Body part' and 'Body region' are regarded as synonyms by most sources, including Terminologia Anatomica; the FMA adopts this convention.

Also problematic are such widely used concepts as viscera, or cranial nerves, which represent a collection of anatomical structures that are members of one class. We assign such collections to the class Anatomical set.

The FMA does not allow plural concepts and therefore

the singular concept Set of cranial nerves is en-

tered as a subclass of Anatomical set. At the cellular

level such a set is Myone, for example, which is a set of

skeletal muscle cells (muscle fibers) innervated by a

single alpha motor neuron. Anatomical sets have

members, rather than parts (e.g., Oculomotor nerve

is a member of Set of cranial nerves).

Members of an anatomical set, as defined in the

FMA, are distinct from elements of a mathematical set

in at least three respects: (1) indirect connections exist

between the members, since all anatomical structures of

an organism are interconnected directly or indirectly

(except for those that are surrounded by body substances; e.g., blood cells afloat in plasma); (2) as a rule, the members are ordered in accord with genetically determined patterns (e.g., the set of cranial nerves associated with the brain and the set of ribs associated with

the vertebral column are ordered and their members are

not interchangeable; whereas as far as we know, no such

ordered pattern exists for the disposition of members of

a myone within a muscle fasciculus); and (3) the members do not define an anatomical set (which is a class), whereas a mathematical set is defined by its members.
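The contrast with mathematical sets can be made concrete in a short sketch. The `AnatomicalSet` type below is hypothetical and only models the ordering of members; the member list is truncated to three cranial nerves for brevity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AnatomicalSet:
    """Hypothetical model: a named class whose members are ordered."""
    name: str
    members: tuple  # order is significant where it is genetically determined

    def has_member(self, entity: str) -> bool:
        # Membership, not parthood: members are not parts of the set.
        return entity in self.members


cranial = AnatomicalSet(
    "Set of cranial nerves",
    ("Olfactory nerve", "Optic nerve", "Oculomotor nerve"),
)
shuffled = AnatomicalSet(
    "Set of cranial nerves",
    ("Oculomotor nerve", "Olfactory nerve", "Optic nerve"),
)

# Equal as mathematical sets, but not interchangeable as ordered anatomical sets.
same_membership = set(cranial.members) == set(shuffled.members)
interchangeable = cranial == shuffled
```

Using a tuple rather than a Python `set` is a deliberate choice in this sketch, because it preserves the order that a mathematical set would discard.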

Finally, we introduced the class Anatomical

junction to subsume such anatomical structures as a

suture, the commissure of the mitral valve, gastro-

esophageal junction, anastomosis, and nerve plexus, as

well as synapse or desmosome. These heterogeneous

structures are arranged in appropriate subclasses of

Anatomical junction. We define this class as an anatomical structure in which two or more anatomical

structures establish physical continuity with one another

or intermingle their component parts.

Anticipating future enhancements of the FMA, we

have also introduced three additional classes. Two of these are Vestigial anatomical structure (e.g., epoophoron, gubernaculum testis) and Gestational structure, which includes subclasses for gestational membranes as well as embryonic and fetal structures. The third class,

Variant anatomical structure, is as yet sparsely

populated. Once we focus on anatomical variants,

members of this class will be reassigned as variant sub-

classes of the canonical anatomical structures.

3.2.3. Derivation of terms

Our intent with the FMA is to make anatomical information available in computable form that generalizes

to all application domains of anatomy. Therefore, rather

than attempting to standardize terminology, we are

committed to include in the FMA all terms that cur-

rently designate anatomical entities in order to facilitate

navigation of the FMA by any user. We relied on time-

honored English language scholarly textbooks of


anatomy [29–31] as our primary sources for anatomical terms, enhanced by copious reference to original journal

articles from the anatomy and clinical literature. We

have developed a tool for semi-automatically integrating

existing anatomical term lists into the FMA [32]. Such

integration has been accomplished for approximately

10,000 terms of Terminologia Anatomica [21], the officially sanctioned anatomical term list, and 6500 neuroanatomical terms of NeuroNames [33], a structured vocabulary of the brain.

Fig. 5. Major classes of Cell.

Fig. 6. Subclasses of Organ.

Fig. 7. Documentation associated with Tuba uterine, a non-English equivalent of the preferred name Uterine tube.

Fig. 8. Part of the taxonomy of structural relationships.

In the FMA each concept has a randomly assigned unique numerical identifier (UWDAID; University of Washington Digital Anatomist Identifier) and is associated with one or more terms. One of these terms is designated as the preferred name of the concept; other terms are synonyms or non-English equivalents (Fig. 2).

Each term is created as an instance of the class Con-

cept name. Instances of Concept name have associ-

ated with them various meta-data that describe the

attributes of the term, illustrated in Fig. 7.

A consistent naming convention is used throughout.

Unlike in many other terminologies (including Termin-

ologia Anatomica), all terms are in the singular form,

and conjunctions and homonyms are not allowed. Anatomical entities commonly referred to as groups or collections (e.g., intercostal arteries, spinal nerves) are

represented as anatomical sets and designated, for ex-

ample as Set of intercostal arteries and Set

of spinal nerves, since such concepts conform to

the definition of the class Anatomical set. Because

each term must be unique, commonly used homonyms

such as 'muscle' and 'bone' are rendered specific by extensions to discriminate between their different meanings; e.g., Muscle (tissue), a class that subsumes

Smooth muscle and Striated muscle and Muscle

(organ), which subsumes such organs as Biceps

brachii and Gluteus maximus.
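The uniqueness requirement on terms can be sketched as a small registry. The `TermRegistry` class and the integer identifiers below are hypothetical stand-ins chosen for illustration; they are not actual UWDAIDs, and the sketch does not reproduce the Concept name metadata shown in Fig. 7.

```python
class TermRegistry:
    """Toy registry: one identifier per concept, every term globally unique."""

    def __init__(self):
        self._term_to_id = {}
        self._preferred = {}

    def add_concept(self, concept_id, preferred_name, synonyms=()):
        terms = (preferred_name, *synonyms)
        # Reject the whole concept if any of its terms is already in use.
        for term in terms:
            if term in self._term_to_id:
                raise ValueError(f"homonym not allowed: {term!r}")
        for term in terms:
            self._term_to_id[term] = concept_id
        self._preferred[concept_id] = preferred_name

    def lookup(self, term):
        """Resolve a preferred name or a synonym to its concept identifier."""
        return self._term_to_id[term]

    def preferred_name(self, concept_id):
        return self._preferred[concept_id]


registry = TermRegistry()
registry.add_concept(1, "Muscle (tissue)")
registry.add_concept(2, "Muscle (organ)")
registry.add_concept(3, "Upper lobe of right lung", synonyms=("Right upper lobe",))

try:
    registry.add_concept(4, "Muscle (organ)")  # reuses an existing term
    duplicate_rejected = False
except ValueError:
    duplicate_rejected = True
```

Looking up the synonym "Right upper lobe" resolves to the same concept as its preferred name, which is how a synonym commonly used in radiology reports could still lead a user to the preferred term.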

Although the compendium of available anatomical

terms is large, for the comprehensive and logical mod-

eling of anatomical structure we had to include in the

FMA concepts that have not been named previously. These concepts include not only the high level classes of

the AT, but also macroscopic parts of the body that

have not previously been named [34]. For example, to

satisfy the FMA's requirement that all parts of a whole

be explicitly named, we assigned the term Upper

uterine segment to a previously unnamed part of

Body of the uterus to complement the other part,

which is generally known as the Lower uterine

segment.

Formulas govern the ordering of descriptors in the

complex name of an anatomical entity. For example, the

order of adjectives in the term 'Left fifth intercostal space' is based on the rationale that the noun in the term is 'space'; its primary descriptor is 'intercostal,' further specified by a sequence of numbers, a specificity enhanced by the laterality descriptor. In the term this order is reversed. Based on a similar rationale, the term 'right upper lobe' is not the preferred name of the concept, although the FMA includes it as a synonym of 'Upper lobe of right lung,' because of its

common usage in radiology reports.
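The ordering rule for descriptors can be expressed as a short function. This is a sketch of the rationale described above, not the FMA's actual formula; the function name `compose_name` and its argument layout are assumptions.

```python
def compose_name(noun, descriptors):
    """Build a term from a noun and its descriptors.

    `descriptors` are listed from the primary descriptor (closest in meaning
    to the noun) outward; the term presents them in the reverse order.
    """
    words = list(reversed(descriptors)) + [noun]
    name = " ".join(words)
    return name[0].upper() + name[1:]


# Primary descriptor first, then sequence number, then laterality.
term = compose_name("space", ["intercostal", "fifth", "left"])
```

Here `term` evaluates to "Left fifth intercostal space", matching the example in the text.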

3.3. Anatomical Structural Abstraction

Defined in Section 3.1.2, the ASA is an aggregate of

the structural relationships that exist between the enti-

ties represented in the AT. A full account of the ASA

will be the subject of a separate report. Our purpose here

is to summarily illustrate the richness and specificity of

structural relationships in the FMA. Fig. 8 shows a part

of the taxonomy of these relationships as subclasses of

Non-anatomical anatomical entity. Fig. 3

illustrates the implementation of some of these rela-

tionships in the frame of the esophagus. Reference is

made in earlier sections to the fact that the majority of

these relationships are attributed, which further

enhances the expressivity and specificity of the FMA

for describing the structure, not only the constituents, of

the human body. Particular attention is paid to attributed partonomic relationships in one of our recent

publications [35].

We have conceived of the ASA as sets of interacting

networks [36], which are schematically represented in

Fig. 9. The high level scheme for the ASA derives from

the FMA's overall conceptual scheme. The example we describe below illustrates the nature and interactions between just two of the ASA's interacting networks. These networks make reference to some of the rules that

constrain the concepts that can be linked to one another

by these relationships to certain classes of the Dimensional taxonomy (DT). The DT is a small ontology in the

FMA, which represents dimensional entities of zero to

three dimensions and shape classes of 3D entities. It also

distinguishes between real and virtual surfaces and lines.

The example for illustrating ASA networks concerns the heart. The surface of the heart forms the boundary

of the heart in the boundary network (Bn), rather than

being a part of the heart, because nodes of the parton-

omy network (Pn) must be of the same dimension in the

DT, whereas a boundary must have one lower dimen-

sion than the entity it bounds. Because they share the

same dimension, the diaphragmatic surface of the heart

is a part of the surface of the heart (Pn) and forms part of the boundary not only of the heart, but also of the

right ventricle (Bn), which is a part of the heart. The Bn

Fig. 9. A scheme for Anatomical Structural Abstraction (ASA).

of the heart comes about by representing not only the

surfaces that bound the heart's subvolumes, but also the

lines that bound these surfaces (which are the cardiac

margins), and the points, which in turn bound the

margins. The Pn of the heart comes about by repre-

senting transitively the subvolumes of the heart in one

network, the subsurfaces of the surface of the heart (e.g.,

Surface of heart -has part- Diaphragmatic

surface of heart, Sternocostal surface of

heart, Base of heart) and the subdivisions of each

subsurface (e.g., Diaphragmatic surface of the

heart -has part- Diaphragmatic surface of

right ventricle, Diaphragmatic surface of

left ventricle) in another network, and those of

the margins (lines) of the heart in yet another network.

Similar interactions of the Bn and Pn with the other networks, shown in Fig. 9, comprehensively describe the

structure and spatial relationships of any anatomical

structure or space. A number of authors refer to such a

scheme as a mereotopological model or representation,

though none have defined it or implemented it to the

same level as the FMA. The conception of such a

mereotopological model or ASA as a set of interacting

networks is a particular feature of the FMA.

The ASA has been instantiated quite extensively in the FMA for boundary and partonomic relationships, as well as for -branch of- and -tributary of- relationships,

including their inverses. Other relationships are more

sparsely implemented.
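The dimensional constraints on the partonomy and boundary networks described for the heart can be sketched as two checks. The `DIMENSION` table below is an assumed stand-in for the dimensional taxonomy, and the function names are hypothetical; the checks express only the necessary dimensional condition, not whether a relationship actually holds anatomically.

```python
# Assumed dimension assignments (3 = volume, 2 = surface).
DIMENSION = {
    "Heart": 3,
    "Right ventricle": 3,
    "Surface of heart": 2,
    "Diaphragmatic surface of heart": 2,
}


def can_be_part_of(part, whole):
    """Nodes of the partonomy network must share the same dimension."""
    return DIMENSION[part] == DIMENSION[whole]


def can_bound(boundary, entity):
    """A boundary must be one dimension lower than the entity it bounds."""
    return DIMENSION[boundary] == DIMENSION[entity] - 1


checks = {
    "ventricle part of heart": can_be_part_of("Right ventricle", "Heart"),
    "surface part of heart": can_be_part_of("Surface of heart", "Heart"),
    "surface bounds heart": can_bound("Surface of heart", "Heart"),
    "subsurface part of surface": can_be_part_of(
        "Diaphragmatic surface of heart", "Surface of heart"
    ),
}
```

In this sketch the surface of the heart is rejected as a part of the heart but accepted as its boundary, while the diaphragmatic surface is accepted as a part of the surface.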

More comprehensive implementation will be achieved

through semi-automated authoring tools that are under

current development, which can reuse the knowledge

already embedded in the FMA. Also, we anticipate that

investigators who have a need for comprehensive rep-

resentation of the anatomy of particular parts of the

body (e.g., the eye or the knee joint) will collaborate

with us in populating the knowledge base for the areas of their interest.

3.4. Anatomical Transformation Abstraction

Defined in Section 3.1.2, we envisage the initial im-

plementation of the ATA as a symbolic model of the

entities and relationships that link the fertilized egg or

zygote to the fully differentiated anatomical structures and spaces that are currently represented in the AT. As

we initially did for the ASA, we propose a high level

scheme for the prenatal component of the ATA as a

hypothesis, which, as in the case of the ASA, will be

tested and modified as the ATA becomes implemented

and instantiated. Currently, we are not proposing such

schemes for the morphological transformations associated with the processes of growth and aging. Our present purpose in giving a preliminary account of the

ATA scheme is to illustrate the challenges the symbolic

modeling of developmental biology and prenatal devel-

opment present, and to emphasize that knowledge of

embryonic development is as important a component of

Fig. 10. A scheme for Anatomical Transformation Abstraction (ATA). Shading is used to facilitate the visualization of relationships between cognates of a higher level component of the ATA.

anatomical and medical reasoning as spatial knowledge of the human body. The FMA will not attain its full

potential until it is able to support inference based on

both structural and developmental relationships.

The significance of the ATA scheme as we propose it

is that, together with the ASA, it formalizes and con-

strains all the kinds of information that need to be as-

sociated with an anatomical entity in order to

comprehensively conceptualize and symbolically represent its development starting from the fertilized egg. We propose a scheme for the ATA as an extension of the FMA's overall conceptual scheme and illustrate its

components in Fig. 10.

We envisage the Developmental Taxonomy (DevT)

as the sum of several developmental subtaxonomies

linked together through the AT. This virtual umbrella

taxonomy will consist of taxonomies of developmental structures (DStrO), developmental spaces (DSpO), and

developmental processes (DPO).

Developmental lineage (DL) and phenotypic trans-

formation (PTr) relate to the essence of embryonic de-

velopment. Both are complex concepts. Both can be

modeled through the inverse relationships -gives rise

to- and -derived from-, or their synonyms, between a 'precursor' and one or more 'successors.' Phenotypic transformation (PTr) is a developmental relationship,

which is established between developmental states of

one individual, or a class of individuals, on the basis of a

change in phenotype (gene expression) between precur-

sor and successor. For example, (using the symbol > to

mean -gives rise to-) Mesodermal primordium of

humerus > Cartilaginous primordium of hu-

merus > Ossifying humerus with primary os-

sification center > Ossifying humerus with

secondary ossification center > Fully

formed humerus. Each developmental stage of the

same structure is distinguished from the preceding one

by a set of newly acquired phenotypes, which, as a rule, results from differential gene expression. PTr pertains to all classes of Developmental structure and Developmental space, even if the phenotypic change is

limited to the addition or deletion of one of their com-

ponents, the structural rearrangement of their parts, or a

change in their shape. Therefore, the formalism for

phenotypic transformation should specify the immediate

precursor (Pc1), its immediate successor (S1) and the change in phenotype (ΔPt):

\[ \mathrm{PTr} = (\mathrm{Pc}_1,\ \mathrm{S}_1,\ \Delta\mathrm{Pt}) \tag{3} \]

Developmental lineage (DL) specifies a line of descent

or ancestry in which an ancestor replicates itself and

gives rise to two or more descendants, each of which is

phenotypically distinct from its immediate ancestor. The

formalism for lineage parallels that for PTr by specifying the immediate ancestor (A1), the immediate descendant (D1) and the change in phenotype (ΔPt):

\[ \mathrm{DL} = (\mathrm{A}_1,\ \mathrm{D}_1,\ \Delta\mathrm{Pt}) \tag{4} \]

Note that each ΔPt has to be expressed as an ASA attribute of Pc1, S1, A1, and D1. This is only one of the ways in which the ASA and ATA will be closely interrelated, an observation that leads to the conclusion that

an ontology of embryonic development should be de-

veloped as a logical extension and integral component of

the FMA.
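The phenotypic transformation triple and its inverse relationships can be sketched with the humerus example given earlier. The `Transformation` type and the two query functions are hypothetical names introduced for illustration; the phenotype change is left as a placeholder because the text does not specify it for each step.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Transformation:
    """One (precursor, successor, change in phenotype) triple."""
    precursor: str
    successor: str
    phenotype_change: str  # placeholder; would be an ASA attribute


stages = [
    "Mesodermal primordium of humerus",
    "Cartilaginous primordium of humerus",
    "Ossifying humerus with primary ossification center",
    "Ossifying humerus with secondary ossification center",
    "Fully formed humerus",
]

# One transformation per pair of consecutive stages.
chain = [
    Transformation(a, b, "unspecified")
    for a, b in zip(stages, stages[1:])
]


def gives_rise_to(transformations, entity):
    """Immediate successors of an entity."""
    return [t.successor for t in transformations if t.precursor == entity]


def derived_from(transformations, entity):
    """Immediate precursors of an entity (inverse of gives_rise_to)."""
    return [t.precursor for t in transformations if t.successor == entity]
```

Storing each step once and deriving both directions from it keeps the -gives rise to- and -derived from- relationships consistent by construction; a lineage triple could be handled the same way, with one ancestor mapped to two or more descendants.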

Timing of PTr and DL in the context of a develop-

mental clock must be represented through the develop-

mental time parameters of post-ovulatory time (POT)

and/or developmental stage (DSt).

A transforming agent (TAg), which is a gene product, is always required for effecting the expression of a

new phenotype. This agent may play a facilitatory or

inhibitory role in the expression of the new phenotype

by its target (Tg). The expression of this new phenotype

(ΔPt) depends on the activity of one or more specific genes (G), which may increase (i.e., is facilitated; Gf) or decrease (i.e., is repressed; Gr).

TAg has not only a target but also a source (Sc). It is, in

fact, itself a new phenotype resulting from facilitated or

suppressed gene activation within its source. In both

target and source, the macromolecule that corresponds to

the new phenotype is produced through a change in the

activity of a gene or genes, even when this change results

from the repression of another gene or genes. Finally, the

TAg must be propagated (Prop) from the source to the target, which may occur within cells, through cell junctions or through the intercellular environment.

Thus change in phenotype along a cell lineage or in the phenotypic transformation of multicellular, developing structures is the outcome of a number of interacting networks, which are controlled by the facilitation or repression of selected groups of genes. Therefore, we propose the first iteration of regulatory networks (Rn) that control the expression of new phenotypes as:

\[ \mathrm{Rn} = (\mathrm{TAg},\ \mathrm{Sc},\ \mathrm{Tg},\ \mathrm{G}_f,\ \mathrm{G}_r,\ \mathrm{Prop},\ \Delta\mathrm{Pt}) \tag{5} \]

The purpose of the Rn scheme is to establish a framework for the information that emerges from experiments and integrate this new information with existing knowledge. The components of this formalism decompose the complex developmental events into elements that can be entered in the framework of the FMA, even with currently available methods.

We concede that while the establishment of the FMA for static, fully formed anatomy is a Herculean task, this task pales in comparison with the challenges posed by the enhancement of the FMA with the dynamic processes that constitute embryonic development and cell differentiation. These challenges provide the motivation for collaboration, a coordinated, distributed effort, and for the development of knowledge-based authoring tools that facilitate the population of a large knowledge base, such as the FMA, and others that are currently emerging in bioinformatics.

Fig. 11. The distributed, Internet-based architecture of the Anatomy Information System (AIS). Various structural information resources (bottom row) are made available to outside processes by means of specialized servers (center row). Various client applications (top row) are graphical and query user interfaces developed for different users. Other remote agents and interfaces at diverse locations access servers of the AIS via well-defined Internet protocols.


4. Accessing the FMA

The FMA is one of the components of the Anatomy

Information System (AIS), shown in Fig. 11, which is a

three-tiered software architecture constituted by a set of

structural information resources (the chief one of which

is the FMA), sets of authoring and end-user programs,

and structural information servers, which communicate

with the information resources via the web through the mediation of the servers [37].

Currently, the FMA is accessed through six different

user interfaces in the AIS, which are shown at the top of

Fig. 11: (1) the Protégé-2000 graphical user interface, which supports authoring and also allows browsing through the Protégé class structure; (2) the Foundational Model Explorer (FME), a web-based GUI that provides intuitive browsing capabilities without the complexity of the full Protégé system [38]; (3) the GOQAFMA Graphical User Interface to the OQAFMA

Query Agent for the Foundational Model of Anatomy,

which provides a web interface for users to issue low-

level database queries to the OQAFMA server [39]; (4)

the intelligent EMILY GUI, which constrains the con-

struction of queries to concepts and relationships to

those in the FMA and relies on inference to retrieveresults not explicitly represented in the knowledge base

[40]; (5) GAPP, a natural language interface that allows

simple queries about the concepts and relationships

represented in the FMA [41]; and (6) the GUI of the

Dynamic Scene Generator that provides access to im-

ages and 3D models linked to the FMA in order to

support knowledge-based generation of interactive

scenes [42].In addition, the part of the FMA�s content incorpo-

rated in the UMLS as the Digital Anatomist vocabulary

is accessible through the UMLS knowledge server. The

Digital Anatomist vocabulary contains the Anatomy

Taxonomy, except for the concepts and relationships

pertaining to the brain and spinal cord, and relation-

ships of partonomy and branch and tributary relation-

ships.The evolution of the diverse interfaces for accessing

the FMA indicates that the FMA has reached a stage at

which there is sufficient content to support experiments

for interrogating the knowledge base, which is a key

requirement for developing knowledge-based applica-

tions such as the Dynamic Scene Generator [42], and

also for evaluating the FMA. The recent release of the

FMA on the Internet [43] should facilitate both theseactivities.

5. Evaluation and current usage

Evaluation of a large knowledge base, such as the FMA, poses considerable problems and must take place on several levels. At the most fundamental level, the model has to be evaluated for its internal consistency and comprehensiveness. There are no precedents we are aware of for evaluating the overall semantic structure of a computable knowledge source, which is perhaps one of the most critical features of the FMA. At the highest level, a knowledge base that claims to be reusable and "foundational" must be evaluated for its generalizability and usefulness to other projects in knowledge representation and application development. Given the fact that the FMA is still evolving and has not yet been released, its evaluations to date have been largely at the first level.

Internal consistency checks were performed by UMLS staff on segments of the FMA instantiated for different body parts as these segments were delivered for inclusion in the UMLS. Independent projects also assessed the internal consistency of different versions of the FMA as a prerequisite for meeting their own research objectives [44,45, Gu H. personal communication]. Feedback from these investigators revealed an aggregate of a few hundred errors, many of which related to spelling and only a few to cycles in the class subsumption and partonomy hierarchies. Given the size and complexity of the FMA, we found these results very gratifying.
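One of the consistency checks mentioned above, looking for cycles in a subsumption or partonomy hierarchy, can be illustrated with a small sketch. This is not the procedure used by UMLS staff or the cited investigators; it is a generic depth-first search over a child-to-parents mapping, and the function name `find_cycle` and the toy hierarchies are assumptions made for illustration.

```python
from typing import Dict, List, Optional

WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current path, finished


def find_cycle(edges: Dict[str, List[str]]) -> Optional[List[str]]:
    """Return one cycle (first node repeated at the end), or None if acyclic.

    `edges` maps each class to its direct parents (is-a) or wholes (part-of).
    """
    color: Dict[str, int] = {}
    path: List[str] = []

    def visit(node: str) -> Optional[List[str]]:
        color[node] = GREY
        path.append(node)
        for parent in edges.get(node, []):
            state = color.get(parent, WHITE)
            if state == GREY:
                # The parent is already on the current path: a cycle.
                return path[path.index(parent):] + [parent]
            if state == WHITE:
                found = visit(parent)
                if found:
                    return found
        path.pop()
        color[node] = BLACK
        return None

    for node in list(edges):
        if color.get(node, WHITE) == WHITE:
            cycle = visit(node)
            if cycle:
                return cycle
    return None


# Hypothetical toy hierarchies, not extracted from the FMA.
is_a = {
    "Heart": ["Organ"],
    "Organ": ["Anatomical structure"],
    "Anatomical structure": [],
}
part_of_bad = {
    "Left ventricle": ["Heart"],
    "Heart": ["Left ventricle"],  # erroneous back-link
}

print(find_cycle(is_a))
print(find_cycle(part_of_bad))
```

The recursive form is adequate for a toy example; a hierarchy with very deep chains would call for an explicit stack instead.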

It is problematic to evaluate the FMA for comprehensiveness of its content, since there is no available gold standard for comparison. There is no other source that includes over 100,000 anatomical terms, less than 10% of which correspond to the complete list of officially sanctioned anatomical terms [21]. Nevertheless, a correlation of the incidence of anatomical concepts in a large compendium of clinical reports with the FMA would be informative.

Comprehensiveness seems a relatively trivial problem compared to evaluating the FMA's overall semantic structure and the extensive modeling of relationships. However, the difficulties entailed in such an apparently simple task are illustrated by the mapping of large symbolic models to one another, taking into account their structure as well as their terms [45]. The FMA and GALEN's common reference model (CRM) [46] were selected for developing automated methods for such model matching. Although, after some necessary lexical adjustments, over 3000 matching terms can be demonstrated, there are surprisingly few homologies between the FMA and GALEN-CRM when -is a- and partonomy relationships are also taken into account. The reasons for the differences have not yet been explored, but at least some of them may be the different contexts of modeling. GALEN represents anatomy in the context of surgical procedures, whereas the FMA has a strictly structural orientation.
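The difference between matching terms and matching structure can be made concrete with a toy sketch. It is not the method of [45]; the normalization rule, the function names, and the two miniature models are assumptions chosen only to show how many lexical matches can coexist with few shared relationships.

```python
from typing import Dict, Iterable, List, Set, Tuple

Pair = Tuple[str, str]  # (part, whole) or (child, parent)


def normalize(term: str) -> str:
    """Crude lexical adjustment: lower-case, hyphens to spaces, squeeze blanks."""
    return " ".join(term.lower().replace("-", " ").split())


def lexical_matches(terms_a: Iterable[str], terms_b: Iterable[str]) -> Dict[str, str]:
    """Map each term of model A to the term of model B with the same normal form."""
    index_b = {normalize(t): t for t in terms_b}
    return {t: index_b[normalize(t)] for t in terms_a if normalize(t) in index_b}


def structural_matches(matches: Dict[str, str], rel_a: Set[Pair], rel_b: Set[Pair]) -> List[Pair]:
    """Relationships of A whose matched counterparts are also related in B."""
    shared = []
    for child, parent in sorted(rel_a):
        if child in matches and parent in matches:
            if (matches[child], matches[parent]) in rel_b:
                shared.append((child, parent))
    return shared


# Hypothetical miniature models, not taken from the FMA or GALEN.
model_a_terms = ["Heart", "Left ventricle", "Cardiac valve"]
model_b_terms = ["heart", "left-ventricle", "cardiac valve"]
part_of_a = {("Left ventricle", "Heart"), ("Cardiac valve", "Heart")}
part_of_b = {("left-ventricle", "heart")}

matches = lexical_matches(model_a_terms, model_b_terms)
shared = structural_matches(matches, part_of_a, part_of_b)
print(len(matches), shared)
```

In this toy case all three terms match lexically, yet only one part-of relationship is shared between the two models.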

The ultimate evaluation of the Foundational Model of Anatomy needs to take place through testing the hypothesis that motivates the establishment of the model: the FMA will provide the anatomical information called for by any knowledge-based application that requires computable anatomical knowledge. We include among such applications those developed for education, biomedical research, and clinical medicine. The prerequisites for such evaluations are currently being generated. The development of query interfaces to the FMA, described in the preceding section, is a requirement for making the FMA accessible for application development.

We have made evolving versions of the FMA available to selected investigators, but its use has been largely limited to associating the terms of the FMA with images and image volumes [47–50], and to integrating these terms in other terminologies [51]. Definitions of the FMA have been used as a basis for characterizing definitions of anatomical concepts in WordNet [52] and in other biomedical ontologies [11], as well as for the automatic semantic interpretation of anatomical spatial relationships [53], enriching the UMLS semantic network [54] and designing its metaschema [55]. As far as we are aware, only one application relies on knowledge embedded in the FMA for interacting with 3D scenes [42]. We hope that the development of knowledge-based applications calling for anatomical knowledge will be stimulated by access to the comprehensive FMA, providing opportunities for its higher level evaluation.

6. Scaling of FMA

The objective of the FMA to represent declarative knowledge about the structure of the body calls for scaling the model to the concept domains of those fields of anatomical science that are not yet included in the FMA. These fields include neuroanatomy, developmental biology and embryology, and also comparative anatomy. Moreover, we contend that since manifestations of health and disease may be conceptualized as attributes of anatomical structures, a logical and comprehensive representation of anatomy should serve as a foundation or template for the computable representation of physiological function, as well as pathology and the clinical manifestations of diseases. Unless the semantic structure of the FMA lends itself to such scaling, the model cannot be regarded as foundational. Moreover, if the FMA is to fulfill its potential as a reference ontology, then it should be feasible to readily align other existing and evolving biomedical ontologies with it.

The first phase of the FMA's development was focused on macroscopic anatomy. Then the scope was extended to include histology and the representation of cells, subcellular entities, and biological macromolecules. There is no other hard copy or computable source that encompasses a comparable spectrum of anatomical entities at a level above that of elementary textbooks of an introductory nature.

The next scaling up entailed the development of the neuroanatomical component of the FMA [28]. The FMA is unique among neuroscience resources in that it comprehensively represents anatomical concepts of both the central and peripheral nervous systems; moreover it does so in the same information space as other systems of the body. The instantiation of neuroanatomical relationships is in progress.

In Section 3.4 we propose to extend the FMA to knowledge elements that integrate the traditional field of classical embryology with contemporary developmental biology. The FMA's semantic structure accommodates the implemented and projected scale ups quite naturally. We regard this outcome as a validation of the FMA's conceptual framework and disciplined approach to knowledge modeling.

Recently we began to experiment with using the FMA as a template for the representation of the anatomy of non-human species, particularly those that serve as experimental models of human disease [14]. The classes of the AT readily accommodate the anatomy of mammals and even other vertebrates. The challenge is to formally represent interspecies similarities and differences at the various levels of structural organization. Solution of this problem will likely generalize to the representation of intraspecies anatomical variation, i.e., differences between individuals. This possibility has important applications not only in clinical medicine but also in anthropology. Plans have been made already for using the FMA to annotate anthropological osteology databases [Drs. Razdan and Clark, personal communication].

We are committed to constraining the FMA's content to biological structure or anatomy. However, we have begun to develop a representation of physiological function using the FMA as a template or reference ontology [56]. Such a Foundational Model of Physiology (FMP) will be distinct from the FMA but it will be intimately linked to it.

7. Discussion

The Digital Anatomist Foundational Model of Anatomy expresses a theory of anatomy that provides a view of the domain consonant with the requirements of formal knowledge representation and also accommodates traditional views of the domain. Coherent theories of anatomy have not been declared as such, although theoretical treatises on mereotopology (e.g., [57]), or on some aspect of it (e.g., [58]), cite, or are even based on, anatomical examples. These proposals, however, as a rule, do not proceed from the examples to implementing the theory for the entire corpus of the domain, which, of course, is not their purpose. The FMA's theory of anatomy is articulated by its high level scheme, the semantic structure of the AT, and the schemes of the model's ASA and ATA components. Initially proposed as hypotheses, these components of the FMA have now been largely validated by instantiating the symbolic model with tens of thousands of concepts and more than a million relationships.

In this article we focus primarily on the AT and defer detailed descriptions of the ASA and ATA to separate communications. We first summarize the salient features of the AT, before commenting on the relevance of the FMA to UMLS in general and to bioinformatics in particular.

7.1. Salient features of the AT

Our intent with the Anatomy Taxonomy is to incorporate in it all concepts that relate to the structure of the body, including those first identified in the contemporary literature and those that are newly discovered. The AT introduces a number of classes that are unlikely to be found in the literature or in anatomical discourse. The rationale and justification for creating these classes is to assure that general as well as more and more specific attributes that are shared by increasingly specialized anatomical structures are propagated from the root of the taxonomy to its leaves. The semantic structure of the AT also assures that all anatomical entities, ranging in size and complexity from macromolecules to major body parts and the whole organism, are encompassed by one attributed graph. This graph also accommodates classes of substances and non-material entities that are associated with and defined in terms of anatomical structures, which constitute the dominant class of the AT. In addition to these non-material physical anatomical entities of zero to three dimensions, the root of the AT also subsumes non-physical anatomical entities that have no spatial dimension at all.
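The propagation of attributes from the root of a taxonomy to its leaves can be sketched with a minimal frame structure under single inheritance. The `Frame` class and the slot name `has_spatial_dimension` are assumptions made for illustration; they are not the FMA's actual frames or slots.

```python
from typing import Any, Optional


class Frame:
    """A class frame with one parent and a dictionary of locally asserted slots."""

    def __init__(self, name: str, parent: Optional["Frame"] = None, **slots: Any) -> None:
        self.name = name
        self.parent = parent
        self.slots = slots

    def get(self, slot: str) -> Any:
        """Return the nearest value of `slot`, walking from this frame to the root."""
        frame: Optional[Frame] = self
        while frame is not None:
            if slot in frame.slots:
                return frame.slots[slot]
            frame = frame.parent
        raise KeyError(slot)


# Hypothetical fragment of a taxonomy; class names follow the text, slots are invented.
root = Frame("Anatomical entity")
physical = Frame("Physical anatomical entity", root, has_spatial_dimension=True)
non_physical = Frame("Non-physical anatomical entity", root, has_spatial_dimension=False)
structure = Frame("Anatomical structure", physical)
organ = Frame("Organ", structure)

print(organ.get("has_spatial_dimension"))
```

Because the value is asserted once on a high-level class, every descendant such as `organ` answers the same query without restating it.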

To safeguard against ambiguity, explicit Aristotelian definitions specify the classes of the AT in terms of predominantly structural attributes, which are formally represented in the frames of the AT's concepts. At the current state of the FMA, however, these definitions are less consistently implemented the further one moves away from the taxonomy's root.

The semantic structure of the AT, together with the Protégé-2000 authoring environment, allows the representation of multiple inheritance. However, Aristotelian definitions that specify the essence of the entities to which the concepts refer obviate the need for multiple inheritance, since non-definitional attributes of the concepts can be readily accommodated as slots of their frames. This representation affords searching the knowledge base along the path of any explicitly represented, transitive relationship, or along a virtual path concatenated from heterogeneous relationships [39].
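Searching along a single transitive relationship, or along a path that mixes relationship types, amounts to a graph traversal over labeled edges. The sketch below is a generic breadth-first search, not OQAFMA's query mechanism; the function name `reachable`, the relationship labels, and the toy graph are assumptions.

```python
from collections import deque
from typing import Dict, List, Sequence, Set, Tuple

# Edges are keyed by (source concept, relationship name).
Graph = Dict[Tuple[str, str], List[str]]


def reachable(start: str, relations: Sequence[str], graph: Graph) -> Set[str]:
    """Concepts reachable from `start` by following any of the named relations."""
    seen: Set[str] = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for rel in relations:
            for target in graph.get((node, rel), []):
                if target not in seen:
                    seen.add(target)
                    queue.append(target)
    return seen


# Hypothetical toy graph, not extracted from the FMA.
graph: Graph = {
    ("Wall of left ventricle", "part_of"): ["Left ventricle"],
    ("Left ventricle", "part_of"): ["Heart"],
    ("Heart", "is_a"): ["Organ"],
}

# One transitive relationship only.
print(reachable("Wall of left ventricle", ["part_of"], graph))
# A "virtual path" that concatenates part_of and is_a edges.
print(reachable("Wall of left ventricle", ["part_of", "is_a"], graph))
```

Restricting `relations` to one name follows a single transitive relationship; passing several names lets the traversal cross from one relationship type to another.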

The structure of the AT is a dynamic abstraction that is modified as a result of new insights we gain into the structure of anatomical knowledge. New terms are also added to the FMA as they come to our attention.

7.2. Relevance to UMLS

As noted in the introduction, in the initial phase of the FMA's development, we conceived of the classes of the AT as extensions and specifications of UMLS Semantic Types (ST). However, the disciplined approach to modeling we describe in this communication, coupled with the insights we gained into the structure of anatomical knowledge through the instantiation of the model, resulted in the redefinition of many of these classes. The specificity of these definitions has led to a divergence between the definitions of UMLS ST and FMA classes, several of which are designated by the same or similar terms. For example, there are substantial differences in the definitions of the semantic type 'Anatomical Structure' and the FMA class of the same name. Therefore, in submitting to UMLS evolving versions of the Digital Anatomist component of the FMA, we assigned Anatomical structure to the UMLS ST 'Body Part, Organ or Organ component' rather than 'Anatomical Structure.' More problematic is the assignment of Anatomical space (which subsumes such entities as Peritoneal cavity, Vertebral canal, and Ischio-anal fossa) to ST 'Body Space or Junction,' a descendant of 'Conceptual Entity.' The latter is defined as a broad grouping of abstract entities, whereas the FMA class is a descendant of Physical anatomical entity, since the entities to which the class refers have physical dimension.

Similar considerations led other investigators to suggest adding several new semantic types to better describe the anatomy portion of the Enriched Semantic Network they developed for UMLS, allowing multiple parents in the -is a- subsumption hierarchy [54]. An abstraction metaschema for this enriched network is given in [55]. Some of these enrichments make use of the FMA's definitions, which suggests perhaps that bidirectional interactions between the UMLS SN and its source vocabularies could benefit not only the vocabularies but also the SN. Thus, in addition to the potential of the FMA for reconciling inconsistencies in anatomical concepts represented in UMLS vocabularies [59] and in traditional, hard-copy sources [34], class definitions of the FMA may prove useful in a review of UMLS semantic types. Such a review is likely to become desirable as a consequence of the expanding scope of the UMLS Metathesaurus, which reflects the growing relevance of bioinformatics to clinical medicine by the inclusion of emerging ontologies in this field of biomedical informatics.

7.3. Relevance to bio- and biomedical informatics

The relevance of the FMA to domains of bioinformatics beyond that of traditional anatomy is illustrated by recent, emerging projects that reuse information from the FMA. Though initially conceived for classical, macroscopic anatomy, the FMA has been successfully scaled to microscopic and neuroanatomy as well as to biological macromolecules. The scheme for modeling embryology and developmental biology, described in this communication, is an integral part of the FMA's conceptual framework. The FMA has also provided a motivation for research related to the modeling of physiological functions [56], comparative anatomy [14], and anthropological osteology, and to querying and matching large ontologies and databases [39–41,45].

We contend that the Foundational Model of Anatomy is the most promising, currently available candidate for serving as a reference ontology in biomedical informatics. The reasons for this contention are inherent in the semantic structure and other distinguishing features of the FMA. By way of summary, we highlight the following features.

1. The FMA is a domain ontology that represents deep knowledge of the structure of the human body by placing an emphasis on the highest level of granularity of its concepts and the large number and specificity of the structural relationships that exist between the referents of these concepts. Modeling at the highest level of detail assures consistency in the representation across different levels of structural organization. A consequence of this approach is that, as far as we are aware, the FMA has developed into the most complex biomedical domain ontology. This conclusion is reached by applying the metric proposed by Gu et al. [13], in terms of which the FMA scores over 10 in comparison with a score of 2–3 for vocabularies included in and similar to those in UMLS. This level of complexity presents its own challenges, which include developing methods to filter the FMA's contents when information is required at coarser levels of granularity. The semantic structure of the FMA will facilitate the development of knowledge-based tools for such a purpose.

2. The concept domain of the FMA integrates in one continuous conceptual and implementation framework subdomains of anatomy that are conventionally handled by independent and largely incompatible sources. The objective is to comprehensively represent in the FMA anatomical entities down to the level of cell parts and provide a framework for linking to the FMA ontologies and other data repositories for biological macromolecules. Comprehensive instantiation of the FMA's ASA and ATA components can be accomplished through funding that targets the needs of research groups for computable, in-depth anatomical information related to selected parts of the body.

3. By modeling canonical anatomical knowledge and, in particular, by introducing high level, abstract classes of anatomical entities, the FMA also provides a framework for inter- and intraspecies anatomical variation and for the organization of anatomical data that pertain to instances of the human and other species. These data include the clinical record and biological experiments performed on non-human species.

4. The FMA is unusual among traditional and computable knowledge sources in that it strictly adheres in its modeling to one context. Because the majority of the other sources target particular user groups, of necessity, they intermingle different contexts or views of their primary domain of interest. By design, the FMA is intended to meet the needs of diverse user groups and applications that require anatomical information; therefore it is designed as a reusable reference ontology rather than an application ontology. Only the structural context generalizes to and complements all other views of biology and medicine. The structural context proved to be critical for the disciplined modeling of the FMA; we found it to be the only view that allowed the comprehensive and consistent representation of biological structure across all levels of its organization.

Such context-specific modeling results in a number of benefits: (1) it obviates duplication and redundancy in ontology development, since the FMA's contents can be reused; (2) it provides for consistency among independent ontologies that rely on the FMA's contents; and (3) it serves as a template for the development of other ontologies in which the concepts of the FMA assume the role of actors.

8. Conclusions

We attempted to illustrate that the FMA not only encompasses in the Anatomy Taxonomy the diverse entities that make up the human body, but is also capable of modeling through the interacting networks of its ASA and ATA components a great deal of knowledge about these entities. Anatomical knowledge represented in the FMA parallels in its complexity and depth the knowledge printed in textbooks and journal articles pertaining to the structure of the body. However, unlike the information in these hard copy sources, the FMA's contents are processable by computers and therefore provide for machine-based inference, which is a prerequisite for the development of knowledge-based applications. Most of the current and emerging ontologies in bioinformatics are primarily concerned with representing the entities of their domain and point to publications for the knowledge associated with the referents of the concepts they model. We hope that our report will encourage a trend in the development of bioinformatics ontologies toward incrementally linking the published information in a computable form to the concepts these ontologies compile in order to make this information machine-processable as well. Serving as a reference ontology for bioinformatics, the FMA may facilitate such a process.

Acknowledgments

The number of publications in References by members of the Structural Informatics Group attests to the numerous individuals who actively contributed to the development of the Foundational Model of Anatomy over a nearly 10 year period. We record our recognition of the collaboration we continue to enjoy with Dr. Mark Musen and other members of Stanford Medical Informatics. In addition to his many other contributions, we are grateful to Dr. James F. Brinkley for reviewing the manuscript. Our special thanks to Dr. Barry Smith for making a number of valuable suggestions for improving the clarity of the manuscript. The most substantial support for the work we report was received from the National Library of Medicine through contract LM 03528 and Grant LM 06822.

References

[1] Musen MA. Medical informatics: searching for underlying components. Methods Inf Med 2002;41:12–9.
[2] GeneOntology (GO). Available from: http://www.geneontology.org/.
[3] US Department of Health and Human Services, National Institutes of Health, National Library of Medicine. Unified Medical Language System (UMLS), 2002.
[4] McCray AT. Representing biomedical knowledge in the UMLS Semantic Network. In: Broering NC, editor. High performance medical libraries: advances in information management for the virtual era. Westport, CT: Mekler; 1993. p. 45–55.
[5] Spackman KE, Campbell KE, Cote RA. SNOMED RT: a reference terminology for health care. Proc AMIA Symp 1997:640–4.
[6] GALEN. Available from: http://www.opengalen.org/.
[7] Cimino JJ, Hricsak G, Johnson SB, Clayton PD. Designing an introspective multipurpose controlled medical vocabulary. Proc 13th Annu Symp Comput Appl Med Care 1989:513–7.
[8] Rosse C, Mejino JL, Modayur BR, Jakobovits R, Hinshaw KP, Brinkley JF. Motivation and organizational principles for anatomical knowledge representation: the Digital Anatomist Symbolic Knowledge Base. J Am Med Inform Assoc 1998;5:17–40.
[9] Rosse C, Shapiro LG, Brinkley JF. The Digital Anatomist Foundational Model: principles for defining and structuring its concept domain. Proc AMIA Symp 1998:820–4.
[10] Cimino JJ. Desiderata for controlled medical vocabularies in the twenty-first century. Methods Inf Med 1998;37(4–5):394–403.
[11] Burgun A, Bodenreider O. Ontologies in the biomedical domain. J Am Med Inform Assoc 2003 [in press].
[12] Perl Y, Geller J, Gu H. Identify a forest hierarchy in an OODB specialization hierarchy satisfying disciplined modeling. Proc First IFCIS Internat Conf on Cooperative Inform Syst CoopIS'96 1996:182–95.
[13] Gu H, Perl Y, Geller J, Halper M, Singh M. A methodology for partitioning a vocabulary hierarchy into trees. Artif Intell Med 1999;15(1):77–98.
[14] Travillian RS, Rosse C, Shapiro LG. An approach to the anatomical correlation of species through the Foundational Model of Anatomy. Proc AMIA Symp 2003:669–73.
[15] Aristotle. The categories. Cambridge, MA: Harvard University Press; 1973.
[16] Michael J, Mejino JLV, Rosse C. The role of definitions in biomedical concept representation. Proc AMIA Symp 2001:463–7.
[17] Noy NF, Mejino JLV, Musen MA, Rosse C. Pushing the envelope: challenges in frame-based representation of human anatomy. Data & Knowledge Eng [in press].
[18] Noy NF, Fergerson RW, Musen MA. The knowledge model of Protégé 2000: combining interoperability and flexibility. In: Proc 12 Internat Conf on Knowledge Eng Knowledge Manage (EKAW-2000). Juan-les-Pins France: Springer; 2000.
[19] Chaudhri VK, Farquhar A, Fikes R, Karp PD, Rice JP. OKBC: a programmatic foundation for knowledge base interoperability. In: Fifteenth National Conf on Artificial Intelligence (AAAI-98). Madison, Wisconsin: AAAI Press/The MIT Press; 1998.
[20] Gu H, Halper M, Geller J, Perl Y. Benefits of an object-oriented database representation for controlled medical terminologies. J Am Med Inform Assoc 1999;6:283–303.
[21] Federative Committee on Anatomical Terminology (FCAT). Terminologia Anatomica. Stuttgart: Thieme, 1998.
[22] Rosse C. Terminologia Anatomica; considered from the perspective of next-generation knowledge sources. Clin Anat 2001;14(2):120–33.
[23] Mejino JLV, Rosse C. Conceptualizations of anatomical spatial entities in the Digital Anatomist Foundational Model. Proc AMIA Symp 1999:112–6.
[24] Schleiden MJ. Beiträge zur Phytogenese. Müller's Archive. 1838. Translation in Sydenham Soc, vol. 12, London, 1847.
[25] Schwann T. Mikroskopische Untersuchungen über die Übereinstimmung in der Structur und dem Wachstum der Tiere und Pflanzen. Berlin, 1839. Translation in Sydenham Soc, vol. 12, London, 1847.
[26] Lovtrup S. Epigenetics; a treatise on theoretical biology. London: Wiley; 1974.
[27] Agoncillo AV, Mejino Jr JLV, Rickard KL, Detwiler LT, Rosse C. Proposed classification of cells in the Foundational Model of Anatomy. Proc AMIA Symp 2003:775.
[28] Martin RF, Mejino JLV, Bowden DM, Brinkley JF, Rosse C. Foundational model of neuroanatomy: its implications for the Human Brain Project. Proc AMIA Symp 2001:438–42.
[29] Hollinshead WH. Anatomy for surgeons, vols. 1–3. 3rd ed. Philadelphia: Harper and Row; 1982.
[30] Rosse C, Gaddum-Rosse P. In: Hollinshead's textbook of anatomy. 5th ed. Philadelphia: Lippincott-Raven; 1997. p. 902.
[31] Williams PL, Bannister LH, Berry MM, Collins P, Dyson M, Dussec JE, Ferguson MWJ. In: Gray's anatomy. 38th ed. New York: Churchill Livingstone; 1995. p. 2092.
[32] Rickard KL, Mejino Jr JLV, Martin RF, Agoncillo AV, Rosse C. Problems and solutions with integrating legacy terminologies into evolving knowledge bases [submitted].
[33] Martin RF, Bowden D. Primate brain maps. Oxford: Elsevier; 2000.
[34] Agoncillo A, Mejino JLV, Rosse C. Influence of the Digital Anatomist Foundational model on traditional representations of anatomical concepts. Proc AMIA Symp 1999:2–6.
[35] Mejino Jr JLV, Agoncillo AV, Rickard KL, Rosse C. Representing complexity in part-whole relationships within the Foundational Model of Anatomy. Proc AMIA Symp 2003:450–4.
[36] Neal PJ, Shapiro LG, Rosse C. The Digital Anatomist spatial abstraction: a scheme for the spatial description of anatomical entities. Proc AMIA Symp 1998:423–7.

[37] Brinkley JF, Wong BA, Hinshaw KP, Rosse C. Design of an anatomy information system. IEEE Comp Graphics Appl 1999;3:38–48.
[38] Detwiler LT, Mejino Jr JLV, Rosse C, Brinkley JF. Efficient web-based navigation of the Foundational Model of Anatomy. Proc AMIA Symp 2003:829.
[39] Mork P, Brinkley JF, Rosse C. OQAFMA querying agent for the Foundational Model of Anatomy: providing flexible and efficient access to a large semantic network. JBI 2003;36:501–17.
[40] Shapiro LG, Chung E, Detwiler LT, Mejino Jr JLV, Agoncillo AV, Brinkley JF, Rosse C. A generalizable intelligent query interface for the Digital Anatomist Foundational Model [submitted].
[41] Distelhorst G, Srivastava V, Rosse C, Brinkley JF. A prototype natural language interface to a large complex knowledge base, the Foundational Model of Anatomy. Proc AMIA Symp 2003:200–4.
[42] Wong BA, Rosse C, Brinkley JF. Semi-automatic scene generation using the Digital Anatomist Foundational Model. Proc AMIA Symp 1999:637–41.
[43] http://fma.biostr.washington.edu.
[44] Beck R. Logic-based remodeling of the Digital Anatomist Foundational Model. Proc AMIA Symp 2003:748–52.
[45] Zhang S, Bodenreider O. Aligning representation of anatomy using lexical and structural methods. Proc AMIA Symp 2003 [in press].
[46] Rector AL, Gangenni E, Galeazzi A, Rossi-Mori A. The GALEN core model schema for anatomy: towards a reusable application-independent model of medical concepts. In: Twelfth International Congress of European Federation for Medical Informatics. Lisbon, Portugal; 1994. p. 229–233.
[47] Lober W, Brinkley JF. A portable image annotation tool for web-based anatomy atlases. Proc AMIA Symp 1999:1108.
[48] Rindflesch TC, Bean CA, Sneiderman CA. Argument identification for arterial branching predications asserted in cardiac catheterization reports. Proc AMIA Symp 2000:704–8.
[49] Sneiderman CA, Rindflesch TC, Bean CA. Identification of anatomical terminology in medical text. Proc AMIA Symp 1998:428–32.
[50] Teng CC, Austin-Seymour MM, Barker J, Kalet IJ, Shapiro LG, Whipple M. Head and neck lymph node region delineation with 3-D CT image registration. Proc AMIA Symp 2002:767–71.
[51] Tringali M, Hole WT, Srinivasan S. Integration of a standard gastrointestinal endoscopy terminology in the UMLS Metathesaurus. Proc AMIA Symp 2002:801–5.
[52] Bodenreider O, Burgun A. Characterizing the definitions of anatomical concepts in WordNet and specialized sources. Proc First Global WordNet Conf 2002:223–30.
[53] Bean CA, Rindflesch TC, Sneiderman CA. Automatic semantic interpretation of anatomic spatial relationships in clinical text. Proc AMIA Symp 1998:897–901.
[54] Zhang L, Perl Y, Geller J, Halper M, Cimino JJ. Enriching the structure of the UMLS Semantic Network. Proc AMIA Ann Symp 2002:939–43.
[55] Zhang L, Perl Y, Halper M, Geller J. Designing Metaschemas for the UMLS Enriched Semantic Network. JBI 2003;36:433–49.
[56] Cook DL, Mejino Jr JLV, Rosse C. Evolution of a foundational model of physiology: symbolic representation for functional bioinformatics [submitted].
[57] Smith B. Mereotopology: a theory of parts and boundaries. Data & Knowledge Eng 1996;20:287–303.
[58] Schulz S, Hahn U. Mereotopological reasoning about parts (w)holes in bio-ontologies. In: Proceedings of FOIS'01. New York: ACM Press; 2001. p. 198–209.
[59] Mejino JL, Rosse C. The potential of the Digital Anatomist Foundational Model for assuring consistency in UMLS sources. Proc AMIA Symp 1998:825–9.


The Gene Ontology (GO) database and informatics resource

Gene Ontology Consortium*

GO-EBI, EMBL-EBI, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK

Received August 21, 2003; Revised and Accepted September 12, 2003

ABSTRACT

The Gene Ontology (GO) project (http://www.geneontology.org/) provides structured, controlled vocabularies and classifications that cover several domains of molecular and cellular biology and are freely available for community use in the annotation of genes, gene products and sequences. Many model organism databases and genome annotation groups use the GO and contribute their annotation sets to the GO resource. The GO database integrates the vocabularies and contributed annotations and provides full access to this information in several formats. Members of the GO Consortium continually work collectively, involving outside experts as needed, to expand and update the GO vocabularies. The GO Web resource also provides access to extensive documentation about the GO project and links to applications that use GO data for functional analyses.

INTRODUCTION

The era of genome-scale biology has seen the accumulation of vast amounts of biological data, accompanied by the widespread proliferation of biology-oriented databases. To make the best use of biological databases and the knowledge they contain, different kinds of information from different sources must be integrated in ways that make sense to biologists.

A major component of the integration effort is the development and use of annotation standards such as ontologies (1–4). Ontologies provide conceptualizations of domains of knowledge and facilitate both communication between researchers and the use of domain knowledge by computers for multiple purposes.

The Gene Ontology (GO) project is a collaborative effort to address two aspects of information integration: providing consistent descriptors for gene products, in different databases; and standardizing classifications for sequences and sequence features. The project began in 1998 as a collaboration between three model organism databases: FlyBase (Drosophila), the Saccharomyces Genome Database (SGD) and the Mouse Genome Informatics (MGI) project. Since then, the GO Consortium has grown to include many databases, including several of the world's major repositories for plant, animal and microbial genomes (a current list of member organizations is included as Supplementary Material).

THE GO PROJECT

The GO project has three major goals: (i) to develop a set of controlled, structured vocabularies, known as ontologies, to describe key domains of molecular biology, including gene product attributes and biological sequences; (ii) to apply GO terms in the annotation of sequences, genes or gene products in biological databases; and (iii) to provide a centralized public resource allowing universal access to the ontologies, annotation data sets and software tools developed for use with GO data.

Ontologies

The GO project provides ontologies to describe attributes of gene products in three non-overlapping domains of molecular biology. Within each ontology, terms have free text definitions and stable unique identifiers. The vocabularies are structured in a classification that supports 'is-a' and 'part-of' relationships. The scope and structure of the GO vocabularies are described in more detail in references (5–7). In the current research environment, where new genome sequences are being rapidly generated, and where comparative genome analysis requires the integration of data from multiple sources, it is especially germane to provide rigorous ontologies that can be shared by the community.

*Correspondence should be addressed to GO-EBI, EMBL-EBI, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK. Tel. +44 1223 494667; Fax: +44 1223 494468; Email: [email protected]

*Current members of the GO Consortium are: M. A. Harris, J. Clark, A. Ireland, J. Lomax (GO-EBI, Hinxton, UK); M. Ashburner, R. Foulger (FlyBase, Department of Genetics, University of Cambridge, Cambridge, UK); K. Eilbeck, S. Lewis, B. Marshall, C. Mungall, J. Richter, G. M. Rubin (BDGP, UC-Berkeley, Berkeley, CA, USA); J. A. Blake, C. Bult, M. Dolan, H. Drabkin, J. T. Eppig, D. P. Hill, L. Ni, M. Ringwald (MGI, Jackson Laboratory, Bar Harbor, ME, USA); R. Balakrishnan, J. M. Cherry, K. R. Christie, M. C. Costanzo, S. S. Dwight, S. Engel, D. G. Fisk, J. E. Hirschman, E. L. Hong, R. S. Nash, A. Sethuraman, C. L. Theesfeld (SGD, Department of Genetics, Stanford University, Stanford, CA, USA); D. Botstein, K. Dolinski, B. Feierbach (Genomics Institute, Princeton University, Princeton, NJ, USA); T. Berardini, S. Mundodi, S. Y. Rhee (TAIR, Carnegie Institution, Department of Plant Biology, Stanford, CA, USA); R. Apweiler, D. Barrell, E. Camon, E. Dimmer, V. Lee (GOA database, UniProt, EBI, Hinxton, UK); R. Chisholm, P. Gaudet, W. Kibbe (DictyBase, Northwestern University, Chicago, IL, USA); R. Kishore, E. M. Schwarz, P. Sternberg (WormBase, California Institute of Technology, Pasadena, CA, USA); M. Gwinn, L. Hannick, J. Wortman (Institute for Genome Research, Rockville, MD, USA); M. Berriman, V. Wood (Wellcome Trust Sanger Institute, Hinxton, UK); N. de la Cruz, P. Tonellato (RGD, Medical College of Wisconsin, Milwaukee, WI, USA); P. Jaiswal (Gramene, Department of Plant Breeding, Cornell University, Ithaca, NY, USA); T. Seigfried (Maize DB, Iowa State University, Ames, IA, USA); R. White (Incyte Genomics, Palo Alto, CA, USA).

Nucleic Acids Research, 2004, Vol. 32, Database issue, D258–D261. DOI: 10.1093/nar/gkh036

Nucleic Acids Research, Vol. 32, Database issue © Oxford University Press 2004; all rights reserved

Molecular Function (MF) describes activities, such as catalytic or binding activities, at the molecular level. GO molecular function terms represent activities rather than the entities (molecules or complexes) that perform the actions, and do not specify where, when or in what context the action takes place. Examples of individual molecular function terms are the broad concept 'kinase activity' and the more specific '6-phosphofructokinase activity', which represents a subtype of kinase activity.

Biological Process (BP) describes biological goals accomplished by one or more ordered assemblies of molecular functions. High-level processes such as 'cell death' can have both subtypes, such as 'apoptosis', and subprocesses, such as 'apoptotic chromosome condensation'.

Cellular Component (CC) describes locations, at the levels of subcellular structures and macromolecular complexes. Examples of cellular components include 'nuclear inner membrane', with the synonym 'inner envelope', and the 'ubiquitin ligase complex', with several subtypes of these complexes represented.

The recent development of the Sequence Ontology (SO) permits the classification and standard representation of sequence features. Defined sequence features include terms such as 'exon', whose meaning is widely accepted, and the more problematic term 'pseudogene', for which several different usages have yet to be resolved. Although the SO is a relatively new vocabulary, and is still undergoing refinement, it is already being used for genome annotation projects in Drosophila and Caenorhabditis elegans.

Annotations

Collaborating databases provide data sets comprising links between database objects and GO terms, with supporting documentation. Every annotation must be attributed to a source, which may be a literature reference, another database or a computational analysis; furthermore, the annotation must indicate the type of evidence the cited source provides to support the association between the gene product and the GO term. A standard set of evidence codes qualifies annotations with respect to different types of experimental determinations. For example, a direct assay to determine the function of the exact gene product being annotated is more reliable than a sequence architecture comparison.

High-quality GO annotations, normally based on curatorial review of published literature and supported by experimental evidence, are now available for gene products in many model organisms. In addition, large sets of annotations made using automated methods cover both model organisms and less experimentally tractable organisms, including human. A number of different automatic methods have been applied (e.g. 8–12), all of which are represented by the evidence code IEA ('inferred from electronic annotation'). Table 1 provides a snapshot of current annotations in the GO database; a more detailed table is maintained on the web at http://www.geneontology.org/doc/GO.current.annotations.shtml. Additional information on GO annotations can be found in references (5–8) and (13).
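The annotation requirements described above (a source, an evidence code, and a link between a gene product and a GO term) can be sketched as a small record type. This is only an illustration under stated assumptions: the class and field names are hypothetical and do not reflect the GO annotation file format, the gene product names and source strings are placeholders, and the codes IDA and ISS come from the GO evidence code documentation rather than from this paper, which names only IEA.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    """One gene product to GO term association (hypothetical field names)."""
    gene_product: str
    go_term: str
    evidence_code: str  # e.g. "IEA"; "IDA" and "ISS" are assumed from GO docs
    source: str         # literature reference, database or analysis (placeholder text)

    def __post_init__(self):
        # The text requires every annotation to carry a source and evidence type.
        if not self.source:
            raise ValueError("annotation must be attributed to a source")
        if not self.evidence_code:
            raise ValueError("annotation must indicate an evidence type")


def without_electronic(annotations):
    """Drop annotations whose evidence code is IEA (electronic annotation)."""
    return [a for a in annotations if a.evidence_code != "IEA"]


EXAMPLE_ANNOTATIONS = [
    Annotation("geneA", "kinase activity", "IDA", "placeholder literature reference"),
    Annotation("geneB", "kinase activity", "IEA", "placeholder computational analysis"),
    Annotation("geneC", "apoptosis", "ISS", "placeholder database record"),
]
```

Calling `without_electronic(EXAMPLE_ANNOTATIONS)` keeps the two non-IEA records, which is the kind of filter a user might apply when only curated annotations are wanted.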

The SO is being used by the collaborating databases for genomic feature annotation. Like GO annotations, SO annotations are curated using both manual work by experts and purely computational methodologies.

GO slims

For many purposes, in particular reporting the results of GO annotation of a genome or cDNA collection, it is very useful to have a high-level view of each of the three ontologies. These subsets of the GO have become known as 'GO slims', the first of which was constructed for the annotation of the Drosophila genome (13). An example of a GO slim analysis is shown in Figure 1.

The shared use of GO slims makes comparisons of summary GO term distributions very easy. Different applications, however, may require different GO slim sets tailored to the specific needs of an analysis. To address this, the GO Consortium makes both generic and specific GO slim files available. The generic GO slim file is kept up to date with respect to the full ontologies, and specific GO slim files that have been used in particular publications or analyses are archived.
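One way a GO slim summary could be computed is to map each annotated term to every slim term among its ancestors and count distinct gene products per slim term. The sketch below assumes a toy fragment of the cellular component ontology; the parent links, the slim set and the gene product names are illustrative assumptions, not data from the GO files, and the function names are hypothetical.

```python
from collections import defaultdict

# Toy child -> parents map (assumed structure, for illustration only).
PARENTS = {
    "nuclear inner membrane": ["nuclear membrane"],
    "nuclear membrane": ["nucleus", "membrane"],
    "nucleus": ["cellular component"],
    "membrane": ["cellular component"],
    "cellular component": [],
}


def ancestors_and_self(term, parents):
    """Return the term plus every term reachable by following parent links."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen


def slim_counts(annotations, parents, slim):
    """Count distinct gene products under each slim term.

    `annotations` is an iterable of (gene_product, term) pairs.
    """
    products = defaultdict(set)
    for gene_product, term in annotations:
        for slim_term in ancestors_and_self(term, parents) & slim:
            products[slim_term].add(gene_product)
    return {slim_term: len(products[slim_term]) for slim_term in sorted(slim)}
```

With the pairs `("gpA", "nuclear inner membrane")` and `("gpB", "nucleus")` and the slim set `{"nucleus", "membrane"}`, gene product gpA is counted under both slim terms because the toy graph gives 'nuclear membrane' two parents.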

THE GO DATABASE

The GO database consists of a MySQL database that captures GO content and a Perl object model and Application Programmer Interface (API) to simplify database access and help programmers write tools that use the GO data. The GO relational database is released monthly in several versions: termdb includes the ontologies, definitions and cross-references to other databases; assocdb includes all data in termdb plus associations to gene products; and seqdb adds protein sequences for annotated gene products (where available). A fourth version, seqdblite, is equivalent to seqdb without the IEA-based associations; this version is used by the AmiGO browser (see below).

The GO database schema models generic graphs, including the GO structure (a directed acyclic graph, or DAG) relationally. At the core of the schema are two relational tables for capturing all terms (also called nodes) and term–term relationships (arcs). The two relationship types, 'is-a' and 'part-of', are represented as a 'relationship type' attribute in the relationship table.
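The two-table core described above can be sketched with Python's built-in sqlite3 module standing in for MySQL. The table and column names (term, term2term, relationship_type), the EX: accessions and the arcs themselves are assumptions made for illustration; they are simplified from the examples in the text and are not the actual GO schema or ontology content.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a node table and an arc table."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE term (
            id   INTEGER PRIMARY KEY,
            acc  TEXT UNIQUE NOT NULL,   -- placeholder accession
            name TEXT NOT NULL
        );
        CREATE TABLE term2term (
            child_id  INTEGER NOT NULL REFERENCES term(id),
            parent_id INTEGER NOT NULL REFERENCES term(id),
            relationship_type TEXT NOT NULL
                CHECK (relationship_type IN ('is_a', 'part_of'))
        );
        """
    )
    conn.executemany(
        "INSERT INTO term VALUES (?, ?, ?)",
        [
            (1, "EX:1", "biological process"),
            (2, "EX:2", "cell death"),
            (3, "EX:3", "apoptosis"),
            (4, "EX:4", "apoptotic chromosome condensation"),
        ],
    )
    conn.executemany(
        "INSERT INTO term2term VALUES (?, ?, ?)",
        [(2, 1, "is_a"), (3, 2, "is_a"), (4, 2, "part_of")],
    )
    return conn


def ancestors(conn, name):
    """Return names of all ancestors of a term, following both arc types."""
    rows = conn.execute(
        """
        WITH RECURSIVE up(id) AS (
            SELECT tt.parent_id
            FROM term2term AS tt JOIN term AS c ON c.id = tt.child_id
            WHERE c.name = ?
            UNION
            SELECT tt.parent_id
            FROM term2term AS tt JOIN up ON tt.child_id = up.id
        )
        SELECT name FROM term WHERE id IN (SELECT id FROM up) ORDER BY name
        """,
        (name,),
    ).fetchall()
    return [row[0] for row in rows]
```

Storing the relationship type as an attribute of each arc, as the text describes, means one recursive query can walk 'is-a' and 'part-of' links together, or be restricted to one type with an extra WHERE clause.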

Table 1. Status of the GO vocabularies

Totals                            July 1, 2000    July 1, 2003

All valid terms (a)               4493            13412
Terms with definitions            250             11105
Terms with synonyms               301             2813
Terms with db cross-references    1042            12317
Associations (b)                  30654           7781954
Gene products                     13016           1549236
Sequences                         0               21916
Paths (c)                         30941           314886

(a) Excludes obsolete terms.
(b) Individual associations between any gene product and any GO term.
(c) Parent–child relationships traced from any GO term to the root (molecular function, biological process or cellular component).
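The 'Paths' row counts routes traced from terms to the root. As one possible reading of that footnote, the sketch below counts the distinct parent-following routes from a single term to the root of a toy DAG. The graph and the function name are illustrative assumptions, and the exact counting rule used to produce Table 1 is not specified in this paper.

```python
from functools import lru_cache

# Toy child -> parents map (assumed structure, for illustration only).
PARENTS = {
    "nuclear inner membrane": ("nuclear membrane",),
    "nuclear membrane": ("nucleus", "membrane"),
    "nucleus": ("cellular component",),
    "membrane": ("cellular component",),
    "cellular component": (),
}


def count_paths_to_root(term, parents):
    """Count distinct routes from `term` up to a root in a DAG."""

    @lru_cache(maxsize=None)
    def count(node):
        node_parents = parents.get(node, ())
        if not node_parents:
            return 1  # a root contributes exactly one (empty) route
        return sum(count(parent) for parent in node_parents)

    return count(term)
```

Because a DAG allows a term to have more than one parent, a single term can contribute several paths, which is consistent with the path count in Table 1 exceeding the term count.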


GO RESOURCES

Access to ontologies and annotations in all formats

The output of the GO project (vocabularies, annotations, database and accompanying tools) is in the public domain and is readily accessible via the GO web pages at http://www.geneontology.org/. The GO Consortium gives permission for any of its products to be used without license, in accordance with its redistribution and citation policy. Highlights of that policy are:

(i) that the Gene Ontology Consortium is clearly acknowledged as the source of the product;

(ii) that any GO Consortium file(s) displayed publicly include the revision number(s) and/or date(s) of the relevant GO file(s);

(iii) that neither the content of a GO file(s) nor the logical relationships embedded within the GO file(s) be altered in any way.

The full GO Redistribution and Citation Policy document is available online at http://www.geneontology.org/doc/GO.cite.html. A list of useful URLs and addresses is included in the Supplementary Material.

The MySQL database described above can be downloaded locally, and Perl APIs are provided. The GO Consortium's ontologies and annotations are also available as flat files (the most frequently updated format at the time of writing) and as RDF XML; the latter is available with or without annotation data included. The MySQL and XML formats are released monthly. The flat files are updated continually, and monthly snapshots are archived. Current and archival releases of all three formats can be downloaded from the GO web site.

Documentation

The GO web resource includes an extensive set of documentation pages (see http://www.geneontology.org/doc/GO.contents.doc.html). Topics include an overview of the GO project and the ontologies, guides to editorial style, file formats and annotation practices, and frequently asked questions (FAQ).

Software/tools

A variety of browsers that provide visualization and query capabilities for the GO are available. For example, the AmiGO browser (developed by the GO software group at Berkeley; see http://www.godatabase.org/cgi-bin/go.cgi) provides a web interface for searching and displaying the ontologies, term definitions and associated annotated gene products for the entire spectrum of contributing organism databases represented in the GO database. AmiGO easily allows users to browse a tree-like view of the GO structure and to search for terms using a variety of different keys such as a name, synonym, definition, numerical identifier or cross-referenced entry in an external database. The summary view presents the list of gene products associated with each term. The results may be constrained by the evidence code used in the association or by the organization that submitted the association. Representative amino acid sequences are available for most genes, and these can be selected and downloaded as FASTA files. Using GOst, the GO BLAST server, users may submit a query sequence and retrieve the sequences and GO annotations of all similar gene products in the GO database.

Figure 1. Application of a GO slim set in genome annotation. The number of gene products annotated to each term in each of four model organism genomes is shown for a GO slim set taken from the cellular component ontology (data as of August 1, 2003).

The GO software group has also developed DAG-Edit, a tool that provides a graphical interface to browse, query and edit GO or any other vocabulary that has a DAG data structure. GO curators use DAG-Edit to manage the GO vocabularies. The tool has also been used by other groups to build ontologies for a wide range of biological subjects, such as anatomies and developmental timelines for several model organisms, human diseases and plant growth environment. DAG-Edit is an open source Java application that is installed locally. A user guide is available within the application and on the web (http://www.geneontology.org/doc/dagedit_userguide/dagedit.html).

DAG-Edit is updated regularly to add features and improve performance; the current version can be downloaded from http://sourceforge.net/project/showfiles.php?group_id=36855.

The GO Software web page (http://www.geneontology.org/doc/GO.tools.html) provides a catalogue of GO-related tools developed by members of the GO Consortium or by GO users. In addition to AmiGO, there are several more applications for browsing and searching the GO vocabularies and annotations. Other available software includes applications for correlating data from the GO project and other sources (including, but not limited to, microarray data), as well as tools that are not specific to, but can be used in conjunction with, GO data.

Other resources

Literature collection. The GO project maintains a bibliography of peer-reviewed publications (124 as of August 2003) relevant to the development and use of the GO vocabularies and annotation sets at http://www.geneontology.org/doc/GO.biblio.html. Many of the publications document the curation and display of GO annotations within a wide variety of databases, whereas others make use of GO terms and gene product annotations in the interpretation of large-scale experimental results. Still other papers describe novel uses of GO terms (e.g. in text mining), software that uses GO data and integration of the GO with other ontological resources.

Community input. The GO effort is greatly enriched by input from its user community. Several routes are available for users to comment on various aspects of the GO. Comments and suggestions for changes and updates to the ontologies can be submitted via a GO project page at the SourceForge site (http://sourceforge.net/projects/geneontology), whereupon each suggestion is evaluated by GO Consortium members. Different 'trackers' available from the SourceForge site allow GO users to report problems or request features for the AmiGO browser, and to submit suggestions for additions and changes to the ontologies; items can be assigned to individuals or groups within the GO Consortium who have relevant expertise. This system allows the submitter to track the status of a suggestion, both online and by email, allows other users to see what changes are currently under consideration, and archives all entries and associated communications.

Mailing lists. GO also has several mailing lists, covering general questions and comments, the GO database and software, and summaries of changes to the ontologies. The lists are described at http://www.geneontology.org/GO_contacts.html. Any questions about contributing to the GO project should be directed to the main GO mailing list [email protected].

SUMMARY

The GO project provides an ongoing example of community development of bioinformatics standards. Combining the expertise of biologists from multiple sub-disciplines, the computational expertise of artificial intelligence researchers, and input from multiple users of the system, the GO Consortium continues to develop and expand these classification systems for molecular biology.

SUPPLEMENTARY MATERIAL

Supplementary Material is available at NAR Online.

ACKNOWLEDGEMENTS

The Gene Ontology Consortium is supported by NIH/NHGRI grant HG02273, and by grants from the European Union RTD Programme 'Quality of Life and Management of Living Resources' (QLRI-CT-2001-00981 and QLRI-CT-2001-00015).

REFERENCES

1. Gruber,T.R. (1993) A translational approach to portable ontologies. Knowl. Acq., 5, 199–220.

2. Jones,D.M. and Paton,R.C. (1999) Toward principles for the representation of hierarchical knowledge in formal ontologies. Data Knowl. Eng., 31, 102–105.

3. Schulze-Kremer,S. (1998) Ontologies for molecular biology. Pac. Symp. Biocomput., 3, 695–706.

4. Stevens,R., Goble,C.A. and Bechhofer,S. (2000) Ontology-based knowledge representation for bioinformatics. Brief. Bioinform., 1, 398–414.

5. Blake,J.A. and Harris,M. (2003) The Gene Ontology Project: Structured vocabularies for molecular biology and their application to genome and expression analysis. In Baxevanis,A.D., Davison,D.B., Page,R., Stormo,G. and Stein,L. (eds), Current Protocols in Bioinformatics. Wiley and Sons, Inc., New York.

6. The Gene Ontology Consortium (2001) Creating the gene ontology resource: design and implementation. Genome Res., 11, 1425–1433.

7. The Gene Ontology Consortium (2000) Gene Ontology: tool for the unification of biology. Nature Genet., 25, 25–29.

8. Camon,E., Magrane,M., Barrell,D., Binns,D., Fleischmann,W., Kersey,P., Mulder,N., Oinn,T., Maslen,J., Cox,A. et al. (2003) The Gene Ontology Annotation (GOA) Project: Implementation of GO in SWISS-PROT, TrEMBL, and InterPro. Genome Res., 13, 662–672.

9. Mi,H., Vandergriff,J., Campbell,M., Narechania,A., Majoros,W., Lewis,S., Thomas,P.D. and Ashburner,M. (2003) Assessment of genome-wide protein function classification for Drosophila melanogaster. Genome Res., 13, 2118–2128.

10. Pouliot,Y., Gao,J., Su,Q.J., Liu,G.G. and Ling,X.B. (2001) DIAN: a novel algorithm for genome ontological classification. Genome Res., 11, 1766–1779.

11. Okazaki,Y., Furuno,M., Kasukawa,T., Adachi,J., Bono,H., Kondo,S., Nikaido,I., Osato,N., Saito,R. and Suzuki,H. et al. (2002) Analysis of the mouse transcriptome based on functional annotation of 60,770 full-length cDNAs. Nature, 420, 563–573.

12. Xie,H., Wasserman,A., Levine,Z., Novik,A., Grebinskiy,V., Shoshan,A. and Mintz,L. (2002) Large scale protein annotation through Gene Ontology. Genome Res., 12, 785–794.

13. Adams,M.D., Celniker,S.E., Holt,R.A., Evans,C.A., Gocayne,J.D., Amanatides,P.G., Scherer,S.E., Li,P.W., Hoskins,R.A., Galle,R.F. et al. (2000) The genome sequence of Drosophila melanogaster. Science, 287, 2185–2195.
