feature conference report: standards and ontologies for...

6
Comparative and Functional Genomics Comp Funct Genom 2003; 4: 116–120. Published online in Wiley InterScience (www.interscience.wiley.com). DOI: 10.1002/cfg.249 Feature Conference Report: Standards and ontologies for functional genomics: towards unified ontologies for biology and biomedicine Wellcome Trust Genome Campus, Hinxton, Cambridge, UK, 17–20 November 2002 Midori A. Harris and Helen Parkinson* European Bioinformatics Institute, EMBL Outstation, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK *Correspondence to: Helen Parkinson, European Bioinformatics Institute, EMBL Outstation, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK. E-mail: [email protected] Received: 28 November 2002 Accepted: 2 December 2002 Introduction In recent years, whole genome analysis has become routine and systems biology and modelling of whole cells are becoming more common. Advances in experimental technology now permit expression analysis for tens of thousands of genes at a time, generating vast amounts of biological data, and the application of high-throughput technologies to proteomics makes the burden even heavier. Databases and tools have become available to help biologists manage and interpret these large volumes of data, but databases alone are not sufficient to allow researchers to integrate large amounts, and different kinds, of information. The problem is that any given biological phenomenon can be described in many different ways; an exam- ple is the use of free text annotation in databases such as GenBank, which has resulted in inconsis- tent data representation [2]. One promising solution is to provide consistent annotation within databases by means of standard formats and ontologies. Stan- dards development depends on the availability of suitable ontologies; both the MIAME standard and emerging proteomics standards recommend the use of ontologies wherever possible. Where such ontologies do not exist, communities are starting to build their own to support standards [5]. Although the exact meaning of the word ‘ontol- ogy’ is contentious, T.Gruber’s definition of an ontology as a ‘specification of a conceptualiza- tion’ (http://www-ksl.stanford.edu/kst/what-is- an-ontology.html) is widely used. Ontologies are, at a minimum, sets of terms used in a spe- cific domain, definitions for those terms and defined relationships between the terms; they can range from simple controlled vocabularies to struc- turally complex representations employing descrip- tion logics [4]. In biology, the use of ontologies allows database annotation to be standardized, and makes sophisticated queries possible for humans and computers. The SOFG conference brought together about 120 biological domain experts, com- puter scientists and those with interests in related fields, such as natural language processing (NLP), to address the issues of developing, implementing and ultimately unifying ontologies. Presentations covered specific ontologies being developed for a Copyright 2003 John Wiley & Sons, Ltd.

Upload: nguyennhu

Post on 17-Mar-2018

220 views

Category:

Documents


2 download

TRANSCRIPT

Page 1: Feature Conference Report: Standards and ontologies for ...downloads.hindawi.com/journals/ijg/2003/465981.pdf · Feature Conference Report: Standards and ontologies for functional

Comparative and Functional GenomicsComp Funct Genom 2003; 4: 116–120.Published online in Wiley InterScience (www.interscience.wiley.com). DOI: 10.1002/cfg.249

Feature

Conference Report: Standards and ontologiesfor functional genomics: towards unifiedontologies for biology and biomedicineWellcome Trust Genome Campus, Hinxton, Cambridge, UK, 17–20November 2002

Midori A. Harris and Helen Parkinson*European Bioinformatics Institute, EMBL Outstation, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK

*Correspondence to:Helen Parkinson, EuropeanBioinformatics Institute, EMBLOutstation, Wellcome TrustGenome Campus, Hinxton,Cambridge CB10 1SD, UK.E-mail: [email protected]

Received: 28 November 2002Accepted: 2 December 2002

Introduction

In recent years, whole genome analysis has becomeroutine and systems biology and modelling ofwhole cells are becoming more common. Advancesin experimental technology now permit expressionanalysis for tens of thousands of genes at a time,generating vast amounts of biological data, andthe application of high-throughput technologies toproteomics makes the burden even heavier.

Databases and tools have become available tohelp biologists manage and interpret these largevolumes of data, but databases alone are notsufficient to allow researchers to integrate largeamounts, and different kinds, of information. Theproblem is that any given biological phenomenoncan be described in many different ways; an exam-ple is the use of free text annotation in databasessuch as GenBank, which has resulted in inconsis-tent data representation [2]. One promising solutionis to provide consistent annotation within databasesby means of standard formats and ontologies. Stan-dards development depends on the availability ofsuitable ontologies; both the MIAME standard

and emerging proteomics standards recommend theuse of ontologies wherever possible. Where suchontologies do not exist, communities are startingto build their own to support standards [5].

Although the exact meaning of the word ‘ontol-ogy’ is contentious, T.Gruber’s definition of anontology as a ‘specification of a conceptualiza-tion’ (http://www-ksl.stanford.edu/kst/what-is-an-ontology.html) is widely used. Ontologies are,at a minimum, sets of terms used in a spe-cific domain, definitions for those terms anddefined relationships between the terms; they canrange from simple controlled vocabularies to struc-turally complex representations employing descrip-tion logics [4]. In biology, the use of ontologiesallows database annotation to be standardized, andmakes sophisticated queries possible for humansand computers. The SOFG conference broughttogether about 120 biological domain experts, com-puter scientists and those with interests in relatedfields, such as natural language processing (NLP),to address the issues of developing, implementingand ultimately unifying ontologies. Presentationscovered specific ontologies being developed for a

Copyright 2003 John Wiley & Sons, Ltd.

Page 2: Feature Conference Report: Standards and ontologies for ...downloads.hindawi.com/journals/ijg/2003/465981.pdf · Feature Conference Report: Standards and ontologies for functional

Conference Report 117

wide range of biological domains as well as toolsfor constructing ontologies and ways that ontolo-gies are being put to use in various databases.It should be noted that ontologies are frequentlydeveloped for use in a database context, and donot exist in a vacuum.

In the context of this review, we evaluate ontolo-gies on their content and potential use to biologists,rather than on the basis of construction methods.One insight that quickly emerged is that a researchcommunity must be actively involved in develop-ing and deploying any ontology or standard forits field, to ensure that the ontology is relevantand usable. The emergence of the Gene Ontol-ogy (GO) vocabularies [6,7] as a de facto standardfor gene product annotation illustrates the value ofinvolving biologists in all stages of the develop-ment and application of an ontology. Numerousmodel organism databases (MGI, FlyBase, SGD,TAIR, WormBase and others), as well as SWISS-PROT and LocusLink, provide GO annotations fortheir database objects. These groups adopted GOvoluntarily and now make active contributions toits development. The benefits of widespread com-munity use of the GO extend beyond the modelorganism databases that generate it. Use of GOfacilitates sequence annotation and speeds up theanalysis of large-scale functional genomics exper-iments, and an increasing number of applicationsmake use of GO terms and gene product annota-tions, e.g. several applications combine GO annota-tions with microarray data to address queries suchas, ‘Which GO terms are over-represented in agiven cluster?’

We can go further in the use of ontologiesthan simply annotating gene products to the GO.Consider the following example: ‘a species X ,of developmental stage Y , has been treated withcompound Z ’ and the resulting data matrix storedin a database. If the data alone are stored withjust a free text description, query capability islimited. In contrast, if ontologies are used todescribe the species, compound and developmentalstage, structured queries, such as ‘what experimentsuse compound Y ?’ are possible, as compoundY is described only once within the database,and the source ontology provides an unambiguousdefinition for Y . This is precisely the type of querythat the standard for microarray data, minimuminformation about a microarray (MIAME), wasdeveloped to answer [1].

Keynote presentations

Lincoln Stein (Cold Spring Harbor Laboratory,USA) speculated on a seemingly universal humanurge to collect and classify objects, a tendencythat finds expression in classification systems forbiology, and which mirrors the urge of childrento collect, classify and describe Pokemon charac-ters. He provided some insights into the sociologyof naming things, a right that all scientists believethat they have. Using pathology as an example, heillustrated how classification systems require revi-sion because they are often technology-based, andas technology changes, so the resolving possibili-ties of the system change.

Winston Hide’s (SANBI, South Africa) key-note presentation used a population genetics meta-phor to describe the spread of bioinformatics tech-nology through the developing world, beginningin South Africa and extending to other parts ofAfrica and South America. Hide emphasized thepoint that open source software was, and is, essen-tial to the development of infrastructure for openhealth care in much of the world, before going onto describe his group’s efforts to provide an opencontrolled vocabulary, known as eVOC, for con-sistently describing clone, EST and cDNA librariesused in high-throughput experiments.

Ken Buetow (National Cancer Institute, USA)provided an overview of the National Cancer Insti-tute’s efforts to unify data related to cancer com-ing from sources as diverse as epidemiology, clin-ical trials, cancer animal models and microar-ray data into a unified system, with ontologiesmapped across these domains into a common onto-logical representation environment called caCORE(http://ncicb.nci.nih.gov/NCICB/core). The goalof these efforts is to support all those involved incancer research and facilitate in silico research. TheNCI has not limited itself to a single technology,providing access to its data via Java, SOAP, XMLand other formats and protocols, thereby placingitself at the cutting edge of data provision for can-cer research.

Peter Karp (SRI International, USA) describedvarious aspects of the BioCyc family of databases(http://www.biocyc.org/). Each database providesa knowledge base about a particular organism andcovers several biological datatypes, such as genes,enzymes and metabolic pathways. Karp also sum-marized the features of the Pathway Tools software

Copyright 2003 John Wiley & Sons, Ltd. Comp Funct Genom 2003; 4: 116–120.

Page 3: Feature Conference Report: Standards and ontologies for ...downloads.hindawi.com/journals/ijg/2003/465981.pdf · Feature Conference Report: Standards and ontologies for functional

118 Conference Report

used to build and maintain the BioCyc databasesand the underlying ontologies used.

Michael Ashburner [European BioinformaticsInstitute (EBI), UK] discussed the Global OpenBiology Ontologies (GOBO) effort, which aims toapply the community-based approach used by GOto other areas of biology (http://www.geneonto-logy.org/doc/gobo.html). Key requirements forinclusion in GOBO are that the ontology befreely available, and available in a standard for-mat. One subject area in which the GO commu-nity is directly involved is that of nucleic acidsequence types, as described in Suzanna Lewis’s(University of California at Berkeley, USA) talkon the Sequence Ontology (SO) project. The SOcategorizes sequence features along three orthog-onal axes: an ‘is-a’ hierarchy of classes and sub-classes (e.g. transcript is a subclass of RNA, whichin turn is a subclass of nucleic acid); a classi-fication based on location (‘is-on’; e.g. an exonis located on a transcript); and a classificationbased on ‘defines’ (e.g. a transcript region definesa primary transcript). Representatives from modelorganism databases, and the EnsEMBL and DAS(distributed annotation system) projects contributeto the SO effort.

Alexa McCray (National Library of Medicine,USA) described the National Library of Medicine’sUnified Medical Language System (UMLS), whichintegrates the contents of about 60 different sourcevocabularies into common concepts. The integra-tion algorithms are available from UMLS website(http://www.nlm.nih.gov/research/umls/), and itis clear that the UMLS offers a unique view onontologies for biology gained from the mappingprocess. In a related talk, Jane Lomax (EBI, UK)talked about issues encountered when integratingthe GO vocabularies into UMLS, e.g. GO termshave synonyms, which fall into two distinct classes,true synonyms and related terms. This distinctionmeans that merging with UMLS, a system that hasmany precisely defined relationships, will mean are-evaluation of which synonyms are defined in theGO. Thus, inclusion of an ontology in UMLS candrive the development of that ontology.

The session on Ontologies for Model Organ-isms covered a wide taxonomic range, begin-ning with two talks on the laboratory mouse. Inthe first presentation, Martin Ringwald (Jack-son Laboratory, USA) described ontologies inuse in the databases provided by the Mouse

Genome Informatics collaboration, emphasizingthe combination of anatomy terms with qual-ifiers to generate phenotype descriptors. Dun-can Davidson (MRC Human Genetics Unit,UK) then demonstrated the Mouse Atlas, inwhich links are made between text entries inan anatomy ontology and three-dimensional mod-els (http://genex.hgu.mrc.ac.uk/). This approachfacilitates the modelling of gene expression inthree-dimensional space and over time duringdevelopment. These models provide a unique andbeautiful view on development of mouse tissues.

Leszek Vincent (University of Missouri, Colu-mbia, USA) presented the work of the PlantOntology Consortium (POC), echoing the themeof community-based ontology development. ThePOC aims to provide ontologies with grounding inphylogenetics, using structures (anatomy plus mor-phology) in Zea mays as an initial test case. SueRhee (Carnegie Institution of Washington, USA)described several ontologies developed by the Ara-bidopsis Information Resource (TAIR), coveringanatomy, developmental stages and phenotypes,and used in literature-based database curation(http://www.arabidopsis.org/info/ontologies/).The model organism session also encompassed thenematode C. elegans [3], the fruit fly Drosophila(Huaiyu Mi), and fungi (Gregory Butler).

The Implementation and Use of Ontologiessession focused on how ontologies can be usedwithin microarray databases [5] and in indus-trial settings. Bo Servenius (AstraZeneca, Swe-den) described a joint project with Spotfire (ven-dors of microarray analysis software), in whichAstraZeneca contributed annotation and definitionsto the GO project and then incorporated these intothe tools they use locally for microarray analysis.In another industry perspective, Robin McEntire[GlaxoSmithKline (GSK), USA] outlined GSK’sapproach to incorporating the efforts in ontologyand standards development into their drug discov-ery pipelines. Robin emphasized the need for GSKto use public efforts and their willingness to con-tribute to them by unveiling a tissue type ontology,constructed in-house, which they intend to makepublic. In the same session, Christian Blaschke(Universidad Autonoma Madrid, Spain) gave anNLP perspective on the ontology world in a talkwhich described a method for mining terms fromthe literature, and applying these to classify poorlydescribed gene products. This use of the data in

Copyright 2003 John Wiley & Sons, Ltd. Comp Funct Genom 2003; 4: 116–120.

Page 4: Feature Conference Report: Standards and ontologies for ...downloads.hindawi.com/journals/ijg/2003/465981.pdf · Feature Conference Report: Standards and ontologies for functional

Conference Report 119

the literature to generate structured knowledge wasconfirmed by use of the GO, which largely verifiedthe results of the text mining.

Diane Oliver (Stanford University, USA) pre-sented a knowledge base constructed for repre-senting SNP data in the context of clinical out-comes after drug treatment. She provided exam-ples of known SNPs that can affect patient sur-vival in combination with drug treatment, andshowed how three ontologies have been importedinto a common environment created using theProtege tool. This knowledge base is now beingused to generate forms through which clinicianscan submit clinical data, and represents a use ofontologies and knowledge representation that has adirect health care benefit. Protege was describedin detail by Mark Musen (Stanford Univer-sity, USA) in the session Tools for BuildingOntologies. Protege — in a familiar theme — isa tool that is under development by the commu-nity that use it, and Musen described a number ofplug-ins that are available from the Protege web-site (http://protege.stanford.edu/). Sean Bech-hofer (University of Manchester, UK) comple-mented the description of Protege with an expla-nation of how the standard format DAML +OIL evolved into OWL, a format also used byProtege, and explained how OilEd, an ontologyeditor, works.

In the Ontologies for Chemistry, Toxicologyand Other Domains session, both Andrew Jones(University of Glasgow, UK) and Steve Oliver(University of Manchester, UK) gave presenta-tions on the need for proteomics standards. SteveOliver introduced PEDRo, an initiative that pro-vides a database model for proteomics data basedon the maxD system of the University of Manch-ester, while Andrew Jones presented a modifi-cation of the microarray community data modelfor proteomics experiments. Interestingly, theseapproaches are complementary, and the meetingprovided a forum for those interested in developingstandards specifically for proteomics.

Chris Catton (Oxford University, UK) pre-sented the BioImage project (http://www.bio-image.org/), highlighting the need for imagearchiving and illustrating the challenges involved indescribing images. While image and video formatsare relatively controlled, the subjects of the imagescan be interpreted in many ways. He provided an

interesting example of how interpretation and meta-data affect image interpretation: at first glance, animage of Jackie Kennedy wearing a leopard-skincoat looks like a simple news story, but it starteda fashion which had a dramatic impact on the wildpopulation of the animal in question.

This session was started with an introductionto the field of chemical ontologies. Tony Davies(IUPAC, Germany) provided a review of thestructure of IUPAC and outlined the formal way inwhich they work to provide unique chemical iden-tifiers. He was followed by Kirill Degtyarenko(EBI, UK) on the needs that biochemists have forspecific types of chemical ontologies. The key areasthat he identified were related to structure, enzy-matic reactions and bioinorganic proteins. He alsonoted flaws in the EC system, which limit its usein describing certain enzyme activities.

Mike Waters (National Institute for Envi-ronmental Health Sciences, USA) outlined someof the ways that metadata is gathered in high-throughput toxicogenomics experiments. To aidin standardizing this data, standardized pathologytables are used, and will be included in a devel-oping knowledge base called CEBS. He also out-lined the problems in working with the rat, whichhas not historically been a genetic model organ-ism, and described a mapping project in whichemerging rat genomic and EST data are reanno-tated based on information in the literature. Thisresource provides detailed annotation with a toxi-cogenomic slant, which is currently not availablein public sequence databases.

Conclusions

The importance of involving the research commu-nity in developing standards and ontologies washighlighted by many of the speakers; togetherwith the development and use of open standardsand open source tools, community involvementemerged as a strong theme throughout the con-ference. The issue of competing or overlappingontologies was also raised in a workshop chairedby Winston Hide, in which those interested in geneexpression and description of anatomy agreed towork together to attempt to represent this domain ina unified way. Readers who are interested in ontol-ogy developments may wish to look at the GOBO(http://www.geneontology.org/doc/gobo.html)

Copyright 2003 John Wiley & Sons, Ltd. Comp Funct Genom 2003; 4: 116–120.

Page 5: Feature Conference Report: Standards and ontologies for ...downloads.hindawi.com/journals/ijg/2003/465981.pdf · Feature Conference Report: Standards and ontologies for functional

120 Conference Report

and MGED ontology (http://mged.sourceforge.net/ontologies/) sites, which list existing ontolo-gies and tools for ontology development. There arealso discussion lists that run from these sites.

Three of the SOFG speakers have contributedreviews to this issue of CFG. These were selectedto represent the diversity of ontology developmentand to illustrate the use of different ontology-building tools. Raymond Lee and Paul Sternbergdescribe the construction of an ontology to describethe anatomy and cell lineages of C. elegans [3].Chris Stoeckert and Helen Parkinson describe theMGED ontology, which has been developed fordescribing microarray experiments [5], and RobertStevens et al. provide an introduction to descriptionlogics for beginners, and apply the DL principlesto the Gene Ontology [4].

The abstracts and the presentations from themeeting are available from http://www.ebi.ac.uk/SOFG.

References

1. Brazma A, Hingamp P, Quackenbush J, et al. 2001. Minimuminformation about a microarray experiment (MIAME) — to-wards standards for microarray data. Nature Genet 29:365–371.

2. Karp P. 2001. Many Genbank entries for complete microbialgenomes violate the Genbank standard. Comp Funct Genom 1:25–27.

3. Lee RYN, Sternberg PW. 2003. Building a cell and anatomyontology of C. elegans . Comp Funct Genom 4: (this issue).

4. Stevens R, Wroe C, Bechhofer S, et al. 2003. Buildingontologies in DAML + OIL. Comp Funct Genom 4: (this issue).

5. Stoeckert CJ, Parkinson H. 2003. The MGED ontology: aframework for describing functional genomics experiments.Comp Funct Genom 4: (this issue).

6. The Gene Ontology Consortium. 2000. Gene ontology: tool forthe unification of biology. Nature Genet 25: 25–29.

7. The Gene Ontology Consortium. 2001. Creating the geneontology resource: design and implementation. Genome Res 11:1425–1433.

Copyright 2003 John Wiley & Sons, Ltd. Comp Funct Genom 2003; 4: 116–120.

Page 6: Feature Conference Report: Standards and ontologies for ...downloads.hindawi.com/journals/ijg/2003/465981.pdf · Feature Conference Report: Standards and ontologies for functional

Submit your manuscripts athttp://www.hindawi.com

Hindawi Publishing Corporationhttp://www.hindawi.com Volume 2014

Anatomy Research International

PeptidesInternational Journal of

Hindawi Publishing Corporationhttp://www.hindawi.com Volume 2014

Hindawi Publishing Corporation http://www.hindawi.com

International Journal of

Volume 2014

Zoology

Hindawi Publishing Corporationhttp://www.hindawi.com Volume 2014

Molecular Biology International

GenomicsInternational Journal of

Hindawi Publishing Corporationhttp://www.hindawi.com Volume 2014

The Scientific World JournalHindawi Publishing Corporation http://www.hindawi.com Volume 2014

Hindawi Publishing Corporationhttp://www.hindawi.com Volume 2014

BioinformaticsAdvances in

Marine BiologyJournal of

Hindawi Publishing Corporationhttp://www.hindawi.com Volume 2014

Hindawi Publishing Corporationhttp://www.hindawi.com Volume 2014

Signal TransductionJournal of

Hindawi Publishing Corporationhttp://www.hindawi.com Volume 2014

BioMed Research International

Evolutionary BiologyInternational Journal of

Hindawi Publishing Corporationhttp://www.hindawi.com Volume 2014

Hindawi Publishing Corporationhttp://www.hindawi.com Volume 2014

Biochemistry Research International

ArchaeaHindawi Publishing Corporationhttp://www.hindawi.com Volume 2014

Hindawi Publishing Corporationhttp://www.hindawi.com Volume 2014

Genetics Research International

Hindawi Publishing Corporationhttp://www.hindawi.com Volume 2014

Advances in

Virolog y

Hindawi Publishing Corporationhttp://www.hindawi.com

Nucleic AcidsJournal of

Volume 2014

Stem CellsInternational

Hindawi Publishing Corporationhttp://www.hindawi.com Volume 2014

Hindawi Publishing Corporationhttp://www.hindawi.com Volume 2014

Enzyme Research

Hindawi Publishing Corporationhttp://www.hindawi.com Volume 2014

International Journal of

Microbiology