
Comparative and Functional Genomics
Comp Funct Genom 2004; 5: 354–361.
Published online in Wiley InterScience (www.interscience.wiley.com). DOI: 10.1002/cfg.407

Research Paper

Mapping the Gene Ontology into the Unified Medical Language System

Jane Lomax¹* and Alexa T. McCray²

¹ European Bioinformatics Institute, Hinxton, Cambridge CB10 1SD, UK
² National Library of Medicine, Bethesda, MD, USA

*Correspondence to: Jane Lomax, European Bioinformatics Institute, Hinxton, Cambridge CB10 1SD, UK. E-mail: [email protected]

Received: 5 August 2003
Revised: 9 March 2004
Accepted: 16 March 2004

Abstract

We have recently mapped the Gene Ontology (GO), developed by the Gene Ontology Consortium, into the National Library of Medicine’s Unified Medical Language System (UMLS). GO has been developed for the purpose of annotating gene products in genome databases, and the UMLS has been developed as a framework for integrating large numbers of disparate terminologies, primarily for the purpose of providing better access to biomedical information sources. The mapping of GO to UMLS highlighted issues in both terminology systems. After some initial explorations and discussions between the UMLS and GO teams, the GO was integrated with the UMLS. Overall, a total of 23% of the GO terms either matched directly (3%) or linked (20%) to existing UMLS concepts. All GO terms now have a corresponding, official UMLS concept, and the entire vocabulary is available through the web-based UMLS Knowledge Source Server. The mapping of the Gene Ontology, with its focus on structures, processes and functions at the molecular level, to the existing broad coverage UMLS should contribute to linking the language and practices of clinical medicine to the language and practices of genomics. Copyright © 2004 John Wiley & Sons, Ltd.

Keywords: ontologies; semantic networks; controlled vocabularies; Gene Ontology; Unified Medical Language System

Introduction

A significant number of databases are now collecting and cataloguing genomic data. An annual review of molecular biology databases, for example, lists several hundred databases relevant to the genomic domain (Baxevanis, 2003). The databases that will most likely have the greatest impact are those that are able to link transparently to other closely related resources. One way to ensure this transparency is to interrelate the terminologies that are used by these various resources, be they databases of DNA sequences, literature databases or medical record systems (Harris and Parkinson, 2003). To this end we have recently mapped the Gene Ontology into the Unified Medical Language System. Ongoing and rapid advances in our understanding of genomic phenomena are becoming increasingly important to clinical research and clinical medicine (Ansell et al., 2003; Cooper and Psaty, 2003; Collins, 1999). Mapping the large Gene Ontology, with its focus on structures, processes and functions at the molecular level, into the existing broad coverage UMLS should contribute to linking the language and practices of clinical medicine to the language and practices of genomics.

The Gene Ontology

The Gene Ontology (GO) is a shared resource developed by the Gene Ontology Consortium, a group of researchers working on various model organism gene and protein databases. The primary use of GO is to produce functional annotations of genes and gene products in these databases, but it has found many other applications in areas such as microarray analysis, natural language processing and prediction of gene function. GO is a set of hierarchical, controlled vocabularies that describe certain biological phenomena, structured as directed acyclic graphs (DAGs), meaning that terms can have multiple parentage. There are three orthogonal vocabularies: biological process, molecular function and cellular component. Molecular functions are activities performed at the molecular level, while biological processes represent ordered assemblies of molecular functions. Cellular components are cellular locations, which include macromolecular complexes. GO is non-species-specific, thereby allowing cross-species comparison of entities in different databases.

GO is a dynamic vocabulary, which is constantly evolving and being updated in response to requests from the research community. These requests are managed via a publicly accessible website tracker (SourceForge GO Curator Requests Tracker, https://sourceforge.net/tracker/?func=browse&group_id=36855&atid=440764). Members of the consortium also organize and develop specific areas of the ontologies according to their specialist knowledge. There are several full-time editors who manage updates and who ensure the integrity of the overall resource. GO evolves by consensus, with any major changes in content or philosophy being discussed at regular meetings of the GO Consortium and on various e-mail lists. GO is available in different formats: flat files, which are the primary format, XML, and also as a MySQL database. The gene ontology and its annotations (gene associations) are freely available, with no licensing requirements.

Underlying GO are various philosophies and guidelines that are more fully documented elsewhere (Gene Ontology Consortium, 2000, 2001), but there are some assumptions that are important to discuss here. The first is the ‘true path rule’. This rule states that the path from each term, back up through the hierarchy to its highest level parent, must be biologically accurate. For example, ‘chitin biosynthesis’ cannot be a descendant of ‘cell wall biosynthesis’ because chitin is also synthesized in the production of other structures, e.g. a cuticle (Hill et al., 2002). Second, the scope of vocabularies is restricted, such that they do not include instances of individual gene products. Molecular function terms all represent activities or actions, rather than entities. Cellular component terms only represent locations, which include multigene–product complexes such as ‘origin recognition complex’. Third, there are two different relationships a GO term can have to its parent, ‘is a’ and ‘part of’. In the biological process ontology, ‘part of’ refers to a sub-process, while in the cellular component ontology it refers to a sub-component.
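The DAG structure and the ‘true path rule’ can be illustrated with a minimal Python sketch. The term names and edges below are illustrative assumptions, not a copy of the real ontology; the point is only that a term may have several parents of different relationship types, and that every ancestor reached by walking upward must hold true for the term.

```python
# Toy fragment of a GO-like DAG. Term names and edges are illustrative
# assumptions, not taken from the actual Gene Ontology files.
PARENTS = {
    "biosynthesis": [],
    "chitin biosynthesis": [("is_a", "biosynthesis")],
    "cell wall biosynthesis": [("is_a", "biosynthesis")],
    "cuticle biosynthesis": [("is_a", "biosynthesis")],
    # Multiple parentage: one 'is_a' parent and one 'part_of' parent.
    "cell wall chitin biosynthesis": [
        ("is_a", "chitin biosynthesis"),
        ("part_of", "cell wall biosynthesis"),
    ],
}


def ancestors(term, parents=PARENTS):
    """Return every term reachable by following parent links upward."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for _relation, parent in parents[current]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen
```

In this toy graph the general term ‘chitin biosynthesis’ has no path to ‘cell wall biosynthesis’, so annotating a cuticle-related gene product to it implies nothing about cell walls; only the more specific, hypothetical child term carries that implication.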

The Unified Medical Language System

The UMLS is a set of knowledge sources developed at the U.S. National Library of Medicine (Lindberg et al., 1993; Unified Medical Language System; http://umlsinfo.nlm.nih.gov/). The UMLS aims to provide integrated access to a large number of biomedical information resources by unifying the vocabularies that are used to access those resources. The UMLS is currently used in a broad range of biomedical applications, including systems focused on patient data, digital libraries, Web and bibliographic retrieval, and medical decision support. Research groups use the UMLS to investigate a variety of natural language processing, knowledge representation and information retrieval questions. The UMLS consists of three knowledge sources: the Metathesaurus, the UMLS semantic network, and the Specialist lexicon with its associated lexical tools. The Metathesaurus contains information about biomedical concepts and consists of a variety of terminology systems, including thesauri, classification schemes, coding systems, and lists of controlled terms that have been independently developed by a broad range of different groups and organizations. These include medical specialty associations, hospital system developers, national and international standards bodies, informatics researchers, government agencies, and other health organizations. The Metathesaurus currently interrelates more than 60 families of vocabularies and consists of over 900 000 concepts.

Each Metathesaurus concept is assigned one or more semantic types from the UMLS semantic network. The network consists of 135 semantic types and 54 relationships, including ‘is a’ and ‘part of’, but also many other relationships important for the biomedical domain. These include five categories of relationships: physical, functional, spatial, temporal, and conceptual. The semantic network provides a coherent framework for the sometimes quite disparate vocabularies that comprise the Metathesaurus (McCray, 2003). The ‘true path rule’ that has been described for the Gene Ontology applies to the UMLS semantic network, but it does not apply to the Metathesaurus itself. The terminologies are interrelated at the individual concept level, with many concepts consisting of synonyms drawn from a variety of terminologies, but there is no attempt to fully merge the hierarchical structures of the constituent terminologies (for discussion of the issues involved in mapping, merging, and aligning existing terminologies and ontologies, see Tuttle et al., 1995; Zeng and Cimino, 1996; Reed and Lenat, 2002; Pinto et al., 1999; Oliver et al., 1999). The Specialist lexicon and related lexical tools are resources that are used to manage lexical variation in a range of natural language processing applications (McCray et al., 1994).

The UMLS has been updated regularly since its first release in 1990. Until recently these were annual releases, but since 2002 they have been quarterly releases in response to requests from the active UMLS user community. Because some of the constituent vocabularies in the UMLS are governed by varying levels of copyright restriction, users are asked to sign a license agreement before accessing the UMLS data. There is no charge involved. With each release the knowledge sources are enhanced with both additional content and additional tools. Additions and enhancements to the UMLS are accomplished through a combination of automated and semi-automated processes, and a variety of tools have been developed to assist in the extensive human review process that follows the incorporation of any new UMLS content.

The UMLS knowledge sources are available as a set of ASCII relational tables and are fully accessible through the UMLS Knowledge Source Server, either through its Web interface or through its application programming interface, which gives the results as XML objects (Bangalore et al., 2003). The Knowledge Source Server retrieves information about particular concepts, including attributes such as the concept’s definition, its semantic types, and the concepts that are related to it. It is also possible to limit a query to the perspective of a particular constituent vocabulary. For example, the user may be interested in seeing the ancestors or descendants for a term in just one particular vocabulary, or may like to know which synonyms originated in that specific vocabulary.

Methods

Mapping the Gene Ontology into the UMLS involved several initial explorations before the final integration stage. First, we used automated procedures to determine whether there was any overlap between the coverage of GO and the coverage of the UMLS (McCray et al., 2002). Just over one-quarter of the GO terms were found in the UMLS. There was little overlap in biological processes (2%) and cellular components (3%), with the largest overlap in molecular functions (21%). However, later investigations showed that the apparent large overlap in the latter category resulted from certain assumptions about the nature of the molecular function terms in GO that needed to be reassessed.

Our second step involved a one month visit by the first author, who is a GO curator, to the National Library of Medicine to work with the UMLS team. The purpose of the visit was to ensure that each group would understand the goals, assumptions, and framework of the other group. We attempted to fully map GO into the UMLS during this time period, but we discovered a number of problems that needed to be addressed first. Thus, during this phase and in the subsequent months before the final mapping, many issues were addressed, debated and resolved, and these are discussed here.

The final mapping was done using the December 2002 version of GO. The GO at that time consisted of 6692 biological process terms, 5152 molecular functions and 1075 cellular components, for a total of 12 919 terms. We followed the same steps normally used to map a new vocabulary to the UMLS. First, the UMLS team studied the entire GO vocabulary and its documentation to assess its purpose, structure and explicit or implicit assumptions. In this case, the UMLS team was already well informed by extensive discussions with the GO team. Second, GO was examined for potential algorithmic assignments of UMLS semantic types. For example, the UMLS semantic type ‘Cell Component’ was a candidate for algorithmic assignment to all cellular component terms. Subsequent human review in some cases altered these algorithmic assignments, but they were, nonetheless, a useful starting point for the reviewers. Next, the Specialist lexical programmes, together with other heuristics, were used to automatically map the terms in GO to existing concepts in the UMLS.

Finally, and importantly, each provisionally mapped concept was individually reviewed and sometimes modified by a UMLS editor. In some cases, an algorithmic mapping had proved incorrect, a particular mapping had been missed, or an incorrect semantic type had been assigned. UMLS editors review all aspects of a concept record, including not only semantic type assignments, but also definitions, related concepts (broader, narrower, and other related concepts), and a variety of other concept attributes.
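The automatic mapping step can be sketched roughly as follows. This is not the Specialist lexical software; the `normalize` function is a crude stand-in (lowercasing, punctuation stripping and word sorting), and the concept identifiers and strings are hypothetical. The sketch only shows the general shape of the procedure: index existing concept strings by a normalized form, then propose candidates for each GO term, leaving the decision to a human reviewer.

```python
import re
from collections import defaultdict


def normalize(term):
    """Lowercase, replace punctuation with spaces, and sort the words."""
    words = re.sub(r"[^a-z0-9]+", " ", term.lower()).split()
    return " ".join(sorted(words))


def build_index(umls_strings):
    """Map each normalized string to the set of concept ids that carry it."""
    index = defaultdict(set)
    for concept_id, strings in umls_strings.items():
        for s in strings:
            index[normalize(s)].add(concept_id)
    return index


def propose_mappings(go_terms, index):
    """Return {GO term: sorted candidate concept ids} for human review."""
    return {t: sorted(index.get(normalize(t), ())) for t in go_terms}


# Hypothetical concept ids and strings, for illustration only.
UMLS = {
    "X0001": ["DNA Replication", "Replication, DNA"],
    "X0002": ["Extracellular Matrix"],
}
GO = ["DNA replication", "extracellular matrix", "bud scar"]
candidates = propose_mappings(GO, build_index(UMLS))
```

A term with an empty candidate list would receive a new concept, while a term with one or more candidates would still go to an editor, since identical strings do not guarantee identical meanings.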

Results

The 2003AB version of the UMLS, available in July 2003, contains GO in its entirety. All GO terms now have a corresponding, official UMLS concept, and have therefore been assigned UMLS unique identifiers and semantic types from the UMLS semantic network. GO definitions and other GO term attributes have been incorporated and are available together with other UMLS concept attributes. GO terms are readily searchable through the UMLS Knowledge Source Server.

During the final mapping, only a small percentage of GO terms exactly matched existing UMLS concepts. This is perhaps not surprising, since GO is more specialized in the genomic domain than are any of the other vocabularies currently represented in the UMLS. Table 1 shows the results of mapping GO to the UMLS.

Overall, a total of 23% of the GO terms either matched directly (3%) or linked (20%) to existing UMLS concepts. Some examples of exact matches are ‘DNA replication’ (a GO biological process), ‘extracellular matrix’ (a GO cellular component), and ‘protein binding’ (a GO molecular function). Additionally, a number of relationships link GO terms to one or more existing UMLS concepts. Thus, a GO term might be narrower or broader than a UMLS concept, or some other relationship might obtain. For example, the GO term ‘lipid metabolism’ is narrower in meaning than the UMLS concept ‘metabolism’, and the GO term ‘feeding behaviour’ is broader in meaning than the UMLS concept ‘animal feeding behaviour’. GO terms may be related in other ways to UMLS concepts. There are many instances, for example, where the molecular activity of a particular enzyme, receptor, etc. is linked to the enzyme or receptor itself. Some examples are: ‘T-cell receptor’ (a UMLS concept) exhibits ‘T-cell receptor activity’ (a GO molecular function); and ‘peroxisome assembly factor-2’ (UMLS) exhibits ‘peroxisome-assembly ATPase activity’ (GO).

Table 1. The number of GO terms, by GO category, out of a total of 12 919 terms that either matched or linked to an existing UMLS concept

GO category       Molecular function   Biological process   Cellular component   Total
Match to UMLS     7                    283                  141                  431 (3%)
Linked to UMLS    2239                 240                  99                   2578 (20%)
Total             2246                 523                  240                  3009 (23%)
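The totals and percentages reported for Table 1 can be reproduced with a few lines of Python. The counts are those given in the table; the variable and function names are our own.

```python
TOTAL_GO_TERMS = 12919

# Counts from Table 1, keyed by GO category.
MATCHED = {"molecular function": 7, "biological process": 283, "cellular component": 141}
LINKED = {"molecular function": 2239, "biological process": 240, "cellular component": 99}


def share(count, total=TOTAL_GO_TERMS):
    """Percentage of all GO terms, rounded to the nearest whole percent."""
    return round(100 * count / total)


matched_total = sum(MATCHED.values())
linked_total = sum(LINKED.values())
mapped_total = matched_total + linked_total
```

Here `share(matched_total)`, `share(linked_total)` and `share(mapped_total)` give the 3%, 20% and 23% quoted in the text.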

Some 45 of the 135 semantic types available in the UMLS semantic network were assigned to the GO terms. The vast majority were assigned to semantic types from the left hand side of the ‘Biologic Function’ sub-tree, as shown in Figure 1.

In addition, many GO terms were assigned to the semantic type ‘Cell Component’ in the ‘Anatomical Structure’ sub-tree. Some terms were given the UMLS semantic types ‘Individual Behaviour’ (e.g. the GO term ‘grooming behaviour’) and ‘Social Behaviour’ (e.g. the GO term ‘post-mating behaviour’). Some other semantic types that were assigned infrequently include ‘Cell’, ‘Gene or Genome’, and ‘Body Space or Junction’. A total of 12 946 semantic types were assigned to the 12 919 GO terms. This means that in a few cases multiple semantic types were assigned to certain terms. An example is the GO term ‘feeding behaviour’, mentioned earlier, which was assigned to both ‘Organism Function’ and ‘Individual Behaviour’.

Table 2 shows the most frequently assigned UMLS semantic types by GO category, and indicates that 99% (12 830/12 946) of all semantic types assigned fell into a total of nine UMLS semantic types.


Figure 1. The ‘Biologic Function’ sub-tree in the UMLS Semantic Network

Table 2. The most frequent UMLS semantic types assigned to GO terms, by GO category. Counts are listed in the order molecular function, biological process, cellular component; blank cells are not shown

UMLS semantic type          Counts by GO category   Total
Biological function         12                      12
Physiological function      1; 6                    7
Organism function           416                     416
Mental process              18                      18
Organ or tissue function    352                     352
Cell function               2311                    2311
Molecular function          4772; 3211              7983
Genetic function            375; 302                677
Cell component              5; 1049                 1054
Total                                               12 830
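The 99% figure quoted for Table 2 follows directly from the row totals. The short check below uses only numbers stated in the text and the table; the names are our own.

```python
# Row totals from Table 2 (the nine most frequent UMLS semantic types).
TABLE_2_TOTALS = {
    "Biological function": 12,
    "Physiological function": 7,
    "Organism function": 416,
    "Mental process": 18,
    "Organ or tissue function": 352,
    "Cell function": 2311,
    "Molecular function": 7983,
    "Genetic function": 677,
    "Cell component": 1054,
}

ASSIGNED_TYPES = 12946  # semantic type assignments reported in the text
GO_TERMS = 12919        # GO terms in the December 2002 version

top_nine = sum(TABLE_2_TOTALS.values())
coverage = top_nine / ASSIGNED_TYPES
# Assignments beyond one per term, i.e. those due to multiple typing.
extra_assignments = ASSIGNED_TYPES - GO_TERMS
```

The difference between assignments and terms is small, which is consistent with the statement that only a few terms received more than one semantic type.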

Discussion

The mapping of GO to the UMLS highlighted issues in both terminology systems. The existing UMLS semantic types and their definitions caused some problems, and some naming issues in GO needed to be resolved. It has often proved to be the case that when new vocabularies are added to the UMLS, the developers of that vocabulary have seen opportunities to improve their terminology as a result of the mapping process; likewise, the UMLS developers have seen areas for improvement and enhancement.

In GO, a molecular function may be distinguished from a biological process by virtue of the fact that it is a direct activity, while a biological process is an ordered assembly of more than one activity. For example, the terms ‘DNA binding’ and ‘DNA ligase’ are molecular functions, while ‘DNA repair’ is a biological process. No similar distinction, however, is made within the UMLS semantic network. This means that a large proportion of both molecular function and biological process terms were assigned the same semantic type, ‘Molecular Function’ (or its child, ‘Genetic Function’), thereby losing much of the resolution present in GO.

For some GO concepts, a precise semantic type was not available in the semantic network. For example, the UMLS semantic types ‘Cell Function’ and ‘Molecular Function’ are defined respectively as: ‘A physiologic function inherent to cells or cell components’; and ‘A physiologic function occurring at the molecular level’. The GO term ‘cell cycle’, for example, was assigned to ‘Cell Function’, but it was not entirely obvious which of the two semantic types to assign because this process occurs at both the molecular and whole cell level.


There are a relatively small number of semantic types at the level of molecular phenomena in the UMLS semantic network (see, however, Yu et al., 1999, for some discussion of this point). For example, a comparison of the ‘Biologically Active Substance’ sub-tree of the UMLS semantic network with the ‘Natural Phenomenon or Process’ tree indicates that the former includes the semantic type ‘Immunological Factor’, for which there is, however, no corresponding ‘Immunological Function’ or ‘Immunological Process’. This lack of resolution was also apparent in the semantic types assigned to GO cellular component terms, the great majority of which were given the semantic type ‘Cell Component’. Additional semantic types that could be added as children of ‘Cell Component’ in the UMLS semantic network might include, for example, ‘DNA component’ and ‘Membrane Component’. Some other GO categories that are not currently available as separate semantic types also became apparent, e.g. developmental processes. Thus, mapping GO to the UMLS raised a number of issues for the future development of the UMLS semantic network. In particular, the granularity of the semantic types available in the semantic network does not always allow for some of the finer distinctions that are made in GO. These possible areas for improvement are currently under discussion by the UMLS team.

Similarly, throughout the mapping process, several issues and areas for improvement within GO became apparent. For example, the GO molecular function tree has the major sub-tree ‘enzyme’, which describes catalytic activities (note that this sub-tree has subsequently been renamed ‘catalytic activity’). The nomenclature of this tree often uses the names of the enzymes themselves so, for example, the GO term describing the catalytic activity of carbamate kinase is named simply ‘carbamate kinase’. In biology, ‘carbamate kinase’ would commonly be used to refer to both the protein entity (the physical enzyme itself) and the activity of that enzyme. As a consequence, the GO term names were ambiguous, because it was not clear that they only represented the enzyme activity. This was not a problem within GO itself because this information is implicit; all enzyme terms are children of ‘molecular function’. However, following the preliminary algorithmic mapping of GO to the UMLS (McCray et al., 2002), we found that GO enzyme activity terms had been mapped into concepts representing protein entities because they shared identical text strings. In fact, we found that almost all GO molecular function term names were ambiguous, in that they shared a name with a protein entity, e.g. ‘receptor’, ‘enzyme’ and ‘signal transducer’. To avoid multiple concepts with the same name and different meanings in the Metathesaurus, the word ‘activity’ was appended to all GO molecular function terms for the purposes of the mapping, with a few exceptions, including the term ‘binding’ and most of its children. Because of this change in GO naming, however, the exact number of matches of GO terms with existing UMLS concepts was significantly reduced from our earlier algorithmic results. The naming change does, however, better align the linguistic forms of the molecular function terms with their intended meanings. The naming change was agreed to at the January 2003 meeting of the entire GO Consortium and subsequently incorporated into GO itself.
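The renaming rule described above is simple enough to sketch in Python. The function name and the contents of the exception set are assumptions for illustration; the text states only that ‘binding’ and most of its children were exempt, not which children.

```python
SUFFIX = " activity"


def rename_function_term(name, exceptions=frozenset()):
    """Append 'activity' unless the term is exempt or already carries it."""
    if name in exceptions or name.endswith(SUFFIX):
        return name
    return name + SUFFIX


# Hypothetical stand-in for 'binding' and those of its children that
# kept their original names; the real list is not given in the text.
EXEMPT = frozenset({"binding", "DNA binding", "protein binding"})

renamed = [rename_function_term(n, EXEMPT)
           for n in ["carbamate kinase", "receptor", "DNA binding"]]
```

After renaming, ‘carbamate kinase activity’ no longer shares a string with the protein entity ‘carbamate kinase’, which is why the exact match count fell while the mapping became more accurate.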

In the molecular function tree, adding the word ‘activity’ to all terms made it clear that most represented genuine functions, e.g. ‘receptor activity’ and ‘enzyme activity’. There were, however, still a few anomalies, e.g. ‘structural molecule’ terms actually result from a difficulty in describing the function of a gene product whose only ‘activity’ is to add to the architectural integrity of a structure, such as the mannoproteins that make up bacterial cell walls. These terms could be described as ‘passive activities’ and their existence in the GO molecular function tree, although at the moment necessary, is often debated. The exercise of adding ‘activity’ to molecular function term names in GO also helped highlight terms in GO that clearly didn’t represent molecular activities, and many of these terms have subsequently been made obsolete.

We also encountered a problem with ambiguity in the GO cellular component tree. The locations in cellular components can be as granular as multi-subunit complexes, so in GO, the word ‘complex’ was added to the term name of all such structures to avoid cases of the same text string with different meanings. However, this addition still allows for ambiguity because a particular ‘complex’ can be used to describe a protein entity as well as a location, e.g. the ‘origin recognition complex’. Again, to avoid creating concepts with the same name and different meanings, or strings with different meanings within the same concept, the word ‘location’ was appended to all GO cellular component ‘complex’ terms within the UMLS. The original GO term is still preserved as a ‘suppressible’ form in the UMLS because, for some purposes, users may be interested in having access to the original GO strings. This appendage is only for the purposes of inclusion in the UMLS, so the original cellular component strings, without ‘location’, are still used in GO itself.

The preceding cases are examples of a common problem in vocabularies: that of the same text string with different meanings. In most cases these are ambiguities that have evolved in the language over time and that are only resolvable in their context of use. GO faces this problem in those cases where the same string may apply to different species, but with a different meaning in each. This arises because the terms in GO represent all of biology, covering species from viruses to human. An example of this is ‘cell wall’. In plants, the ‘cell wall’ means the rigid membrane enclosing the protoplast of a cell, usually composed largely of cellulose, while in fungi, the same phrase is used to describe the structure surrounding the plasma membrane, usually composed of glycoproteins and peptidoglycans. There are undoubtedly similarities between these two structures, but they are by no means the same. To address this problem, GO uses the qualifier ‘sensu’, which is a taxonomic term meaning ‘in the sense of’. In the case above, ‘cell wall’ appears in the GO cellular component ontology and has several children and grandchildren, including ‘cell wall (sensu Fungi)’ and ‘cell wall (sensu Magnoliophyta)’. One child term has the meaning of a cell wall in the sense of fungi, and the other in the sense of flowering plants. The guide to whether new ‘sensu’ terms need to be created is whether or not they themselves will have different children. In this case, ‘cell wall (sensu Fungi)’ has the ‘part of’ child ‘bud scar’ while the plant term has a child term, ‘cellulose microfibril’. This phenomenon also adds to the low exact match rate of GO terms to UMLS concepts, because concepts in the UMLS do not have species qualifiers.
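To see why ‘sensu’ terms seldom match UMLS strings exactly, it helps to separate the qualifier from the base term. The following Python sketch assumes the qualifier always takes the form ‘(sensu Taxon)’ at the end of the term name, as in the examples above; the function name is our own.

```python
import re

# Matches a trailing '(sensu Taxon)' qualifier and captures both parts.
SENSU = re.compile(r"^(?P<base>.+?)\s*\(sensu\s+(?P<taxon>[^)]+)\)$")


def split_sensu(term):
    """Return (base term, taxon), or (term, None) when there is no qualifier."""
    cleaned = term.strip()
    match = SENSU.match(cleaned)
    if match is None:
        return cleaned, None
    return match.group("base"), match.group("taxon").strip()
```

A matcher that compared only the base term would link both ‘cell wall (sensu Fungi)’ and ‘cell wall (sensu Magnoliophyta)’ to the same unqualified concept, which is a broader relationship rather than an exact match.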

Another issue that arose involved ‘synonyms’ in GO. Although called ‘GO synonyms’ for historical reasons, these text strings associated with GO terms frequently do not have an identical meaning to the main term, and are used mainly for the purposes of searching GO. Treating these as exact synonyms for the purposes of matching GO into the UMLS would have led to incorrect mappings, since each concept in the UMLS consists only of true synonyms. As a result, GO has been modified such that synonyms have now been labelled as to whether they are exact (true) synonyms, or whether they are broader, narrower, or otherwise related to the original GO term. These term–synonym relationships are currently stored as a separate file, but will soon be incorporated into GO itself.
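One way to picture the labelled synonyms is as a small data structure in which each synonym carries its relationship to the main term. The class names, the label spellings and the example synonyms below are assumptions for illustration; the text does not specify the file format or give example entries.

```python
from dataclasses import dataclass
from enum import Enum


class Scope(Enum):
    EXACT = "exact"
    BROADER = "broader"
    NARROWER = "narrower"
    RELATED = "related"


@dataclass(frozen=True)
class Synonym:
    text: str
    scope: Scope


def strings_for_matching(term_name, synonyms):
    """Only the term name and its exact synonyms are safe to match on."""
    return [term_name] + [s.text for s in synonyms if s.scope is Scope.EXACT]


# Invented synonyms, not taken from GO.
SYNONYMS = [
    Synonym("lipid metabolic process", Scope.EXACT),
    Synonym("fat breakdown", Scope.NARROWER),
]
matchable = strings_for_matching("lipid metabolism", SYNONYMS)
```

Filtering on the exact label keeps search-oriented strings available to GO users while preventing them from being merged into a UMLS concept as if they were true synonyms.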

Conclusions

The work reported here highlights some of the issues that arise in mapping one terminology system into another. All terminology systems are developed with a particular purpose in mind, and this can have a significant impact on their design and implementation. The Gene Ontology was developed for the purpose of annotating gene products in genome databases, whereas the Unified Medical Language System was developed as a framework for integrating large numbers of disparate terminologies, primarily for the purpose of providing better access to biomedical information sources. The investigation revealed a variety of issues, some of them systematic, e.g. the nature of the GO molecular function terms, and some of them more idiosyncratic. The exact match rate of GO terms to existing UMLS concepts was low, which is most likely a reflection of the small number of existing UMLS vocabularies in the genomic domain. The UMLS semantic network, too, has a relatively small number of semantic types for representing this domain.

The mapping of GO into the UMLS should help improve interoperability among clinical, scientific literature and bioinformatics resources. Both GO and UMLS are dynamic systems that are used in a wide range of applications for a variety of research purposes. We will continue to collaborate on maintaining current versions of GO within UMLS, and we will investigate further methods for linking individual GO terms to existing UMLS concepts in order to effect the greatest integration possible. In addition, we will further investigate the UMLS semantic types and relationships for their usefulness in characterizing the genomic domain.


Acknowledgments

The authors wish to thank Bisharah Libbus, Stephanie Lipow, Stuart Nelson, Tammy Powell and Laura Roth of the UMLS team, and the GO Consortium, in particular Judith Blake, Michael Ashburner and Midori Harris, for their contributions to this effort.

References

Ansell SM, Ackerman MJ, Black JL, Roberts LR, Tefferi A. 2003. Primer on medical genomics. Part VI: genomics and molecular genetics in clinical practice. Mayo Clin Proc 78(3): 307–317.

Bangalore A, Thorn KE, Tilley C, Peters L. 2003. The UMLS Knowledge Source Server: An object model for delivering UMLS data. Proc AMIA Symp (in press).

Baxevanis AD. 2003. The Molecular Biology Database Collection: 2003 update. Nucleic Acids Res 31(1): 1–12.

Collins FS. 1999. Genetics: an explosion of knowledge is transforming clinical practice. Geriatrics 54(1): 41–47.

Cooper RS, Psaty BM. 2003. Genomics and medicine: distraction, incremental progress, or the dawn of a new age? Ann Intern Med 138(7): 576–580.

Gene Ontology Consortium. 2000. Gene ontology: tool for the unification of biology. Nature Genet 25(1): 25–29.

Harris MA, Parkinson H. 2003. Standards and ontologies for functional genomics: towards unified ontologies for biology and biomedicine. Comp Funct Genom 4: 116–120.

Hill DP, Blake JA, Richardson JE, Ringwald M. 2002. Extension and integration of the gene ontology (GO): combining GO vocabularies with external vocabularies. Genome Res 12(12): 1982–1991.

Lindberg DA, Humphreys BL, McCray AT. 1993. The Unified Medical Language System. Methods Inf Med 32(4): 281–291.

McCray AT, Browne AC, Bodenreider O. 2002. The lexical properties of the gene ontology. Proc American Medical Informatics Association (AMIA) Symp, 504–508 (ISBN: 1-56053-600-4).

McCray AT, Srinivasan S, Browne AC. 1994. Lexical methods for managing variation in biomedical terminologies. Proc Annu Symp Comput Appl Med Care. Hanley & Belfus: Washington DC; 235–239.

McCray AT. 2003. An upper level ontology for the biomedical domain. Comp Funct Genom 4: 80–84.

Oliver DE, Shahar Y, Shortliffe EH, Musen MA. 1999. Representation of change in controlled medical terminologies. Artif Intell Med 15(1): 53–76.

Pinto HS, Gomez-Perez A, Martins JP. 1999. Some issues on ontology integration. Proc IJCAI-99. Workshop on ontologies and problem-solving methods: http://citeseer.nj.nec.com/pinto99some.html.

Reed SL, Lenat D. 2002. Mapping ontologies into Cyc. Proc AAAI 2002, AAAI 2002 Conference Workshop on Ontologies for the semantic web, Edmonton, Canada. http://citeseer.nj.nec.com/509238.html.

SourceForge GO Curator Requests Tracker. https://sourceforge.net/tracker/?func=browse&group_id=36855&atid=440764.

The Gene Ontology Consortium. 2001. Creating the gene ontology resource: design and implementation. Genome Res 11(8): 1425–1433.

Tuttle MS, Suarez-Munist ON, Olson NE, et al. 1995. Merging terminologies. Medinfo 8(1): 162–166.

Unified Medical Language System: http://umlsinfo.nlm.nih.gov/.

Yu H, Friedman C, Rzhetsky A, Kra P. 1999. Representing genomic knowledge in the UMLS semantic network. Proceedings of the AMIA Symposium, 181–185 (ISBN: 1-56053-371-4).

Zeng Q, Cimino JJ. 1996. Mapping medical vocabularies to the Unified Medical Language System. Proceedings of the AMIA Annual Fall Symposium, 105–109.
