building a biowordnet using wordnet data formats


  • 7/28/2019 Building a BioWordNet Using WordNet Data Formats

    1/15

    Building BioWordNet using WordNet Data Formats

    Software Engineering, Testing, and Quality Assurance for Natural Language Processing, pages 31-39, Columbus, Ohio, USA, June 2008. (c) 2008 Association for Computational Linguistics

    Building a BIOWORDNET by Using WORDNET's Data Formats and WORDNET's Software Infrastructure: A Failure Story

    Michael Poprat, Elena Beisswanger, Udo Hahn
    Jena University Language & Information Engineering (JULIE) Lab
    Friedrich-Schiller-Universität Jena
    D-07743 Jena, Germany
    {poprat,beisswanger,hahn}@coling-uni-jena.de

    Abstract

    In this paper, we describe our efforts to build on WORDNET resources, using WORDNET lexical data, the data format that it comes with and WORDNET's software infrastructure in order to generate a biomedical extension of WORDNET, the BIOWORDNET. We began our efforts on the assumption that the software resources were stable and reliable. In the course of our work, it turned out that this belief was far too optimistic. We discuss the stumbling blocks that we encountered, point out an error in the WORDNET software with implications for research based on it, and conclude that building on the legacy of WORDNET data structures and its associated software might preclude sustainable extensions that go beyond the domain of general English.

    1 Introduction

    WORDNET (Fellbaum, 1998) is one of the most authoritative lexical resources for the general English language. Due to its coverage (currently more than 150,000 lexical items) and its lexicological richness in terms of definitions (glosses) and semantic relations, synonymy via synsets in particular, it has become a de facto standard for all sorts of research that relies on lexical content for the English language.

    Besides this perspective on rich lexicological data, over the years a software infrastructure has emerged around WORDNET that was equally approved by the NLP community. This included, e.g., a lexicographic file generator, various editors and visualization tools, but also meta tools relying on properly formatted WORDNET data such as a library of similarity measures (Pedersen et al., 2004). In numerous articles the usefulness of this data and software ensemble has been demonstrated (e.g., for word sense disambiguation (Patwardhan et al., 2003), the analysis of noun phrase conjuncts (Hogan, 2007), or the resolution of coreferences (Harabagiu et al., 2001)).

    In our research on information extraction and text


    mining within the field of biomedical NLP, we similarly recognized an urgent need for a lexical resource comparable to WORDNET, both in scope and size. However, the direct usability of the original WORDNET for biomedical NLP is severely hampered by a (not so surprising) lack of coverage of the life sciences domain in the general-language English WORDNET, as was clearly demonstrated by Burgun and Bodenreider (2001).

    Rather than building a BIOWORDNET by hand, as was done for the general-language English WORDNET, our idea to set up a WORDNET-style lexical resource for the life sciences was different. We wanted to link the original WORDNET with various biomedical terminological resources that are abundantly available in the life sciences domain. As an obvious candidate for this merger, we chose one of the major high-coverage umbrella systems for biomedical ontologies, the OPEN BIOMEDICAL ONTOLOGIES (OBO).[1] These (currently) over 60 OBO ontologies provide domain-specific knowledge in terms of hierarchies of classes that often come with synonyms and textual definitions for lots of biomedical subdomains (such as genes, proteins, cells, sequences, etc.).[2] Given these resources and their software infrastructure, our plan was to create a biomedically focused lexicological resource, the BIOWORDNET, whose coverage would exceed that of any of its component resources in a so far unprecedented manner. Only then, given such a huge combined resource, do advanced NLP tasks such as anaphora resolution seem likely to be tackled in a feasible way (Hahn et al., 1999; Castano et al., 2002; Poprat and Hahn, 2007). In particular, we wanted to make direct use of available software infrastructure such as the library of similarity metrics without the need for re-programming and hence foster the reuse of existing software as is.

    We began our efforts on the assumption that the WORDNET software resources were stable and reliable. In the course of our work, it turned out that this belief was far too optimistic. We discuss the stumbling blocks that we encountered, point out an error in the WORDNET software with implications for research based on it, and conclude that building on the legacy of WORDNET data structures and its associated software might preclude sustainable extensions that go beyond the domain of general English. Hence, our report contains one of the rare failure stories (not only) in our field.

    [1] http://www.bioontology.org/repositories.html#obo

    2 Software Around WORDNET Data

    While the stock of lexical data assembled in the WORDNET lexicon was continuously growing over


    time,[3] its data format and storage structures, the so-called lexicographic file, by and large, remained unaltered (see Section 2.1). In Section 2.2, we will deal with two important software components with which the lexicographic file can be created and browsed. Over the years, together with the continuous extension of the WORDNET lexicon, a lot of software tools have been developed in various programming languages allowing browsing and accessing WORDNET as well as calculating semantic similarities on it. We will discuss the most relevant of these tools in Section 2.3.

    [2] Bodenreider and Burgun (2002) point out that the structure of definitions in WORDNET differs to some degree from more domain-specialized sources such as medical dictionaries.
    [3] The latest version 3.0 was released in December 2006.

    2.1 Lexicon Organization of WORDNET and Storage in Lexicographic Files

    At the top level, WORDNET is organized according to four parts of speech, viz. noun, verb, adjective and adverb. The most recent version 3.0 covers more than 117,000 nouns, 11,500 verbs, 21,400 adjectives and 4,400 adverbs, interlinked by lexical relations, mostly derivations. The basic semantic units for all parts of speech are sets of synonymous words, so-called synsets. These are connected by different semantic relations, imposing a thesaurus-like structure on WORDNET. In this paper, we discuss the organization of noun synsets in WORDNET only, because this is the relevant part of WORDNET for our work. There are two important semantic relation types linking noun synsets. The hypernym / hyponym relation, on which the whole WORDNET noun sense hierarchy is built, links more specific to more general synsets, while the meronym / holonym relation describes partonomic relations between synsets, such as part of the whole, member of the whole or substance of the whole.

    From its very beginning, WORDNET was built and curated manually. Lexicon developing experts introduced new lexical entries into WORDNET, grouped them into synsets and defined appropriate semantic and lexical relations. Since WORDNET was intended to be an electronic lexicon, a data representation format had to be defined as well. When the WORDNET project started more than two decades ago, markup languages such as SGML or XML were unknown. For this reason, a rather idiosyncratic, fully text-based data structure for these lexicographic files was defined in a way to be readable and editable by humans, and it has survived until today. This can really be considered an outdated legacy, given that the WORDNET community has been so active in recent years in terms of data collection but has refrained from adapting its data formats in a comparable way to


    today's specification standards. Very basically,[4] each line in the lexicographic file holds one synset that is enclosed by curly brackets. Take as an example the synset for "monkey":

        { monkey, primate,@ (any of various long-tailed primates (excluding the prosimians)) }

    Within the brackets, synonyms are listed at the first position, separated by commas. In the example, there is only one synonym, namely "monkey". The synonyms are followed by semantic relations to other synsets, if available. In the example, there is only one hypernym relation (denoted by "@") pointing to the synset "primate". The final position is reserved for the gloss of the synset, enclosed in round brackets. It is important to notice that there are no identifiers for synsets in the lexicographic file. Rather, the string expressions themselves serve as identifiers. Given the fundamental idea of synsets (all words within a synset mean exactly the same in a certain context), it is sufficient to refer to one word in the synset in order to refer to the whole synset. Still, there must be a way to deal with homonyms, i.e., lexical items which share the same string but have different meanings. WORDNET's approach to distinguishing different senses of a word is to add numbers from 0 to 15, called lexical identifiers. Hence, in WORDNET, a word cannot be more than 16-fold ambiguous. This must be kept in mind when one wants to build a WORDNET for highly ambiguous sublanguages such as the biomedical one.

    [4] A detailed description can be found in the WORDNET manual wninput(5WN), available from http://wordnet.princeton.edu/man/wninput.5WN

    2.2 Software Provided with WORDNET

    To guarantee fast access to the entries and their relations, an optimized index file must be created. This is achieved through the easy-to-use GRIND software which comes with WORDNET. It simply consumes the lexicographic file(s) as input and creates two plain-text index files,[5] namely data and index. Furthermore, there is a command line tool, WN, and a graphical browser, WNB, for data visualization that require the specific index created by GRIND (as do all the other tools that query the WORDNET data). These tools are the most important (and only) means of software support for WORDNET creation: they check the syntax and allow the (manual) inspection of the newly created index.

    [5] Its syntax is described in http://wordnet.princeton.edu/man/wndb.5WN

    2.3 Third-Party WORDNET Tools

    Due to the tremendous value of WORDNET for the NLP and IR community and its usefulness as a resource for coping with problems requiring massive


    amounts of lexico-semantic knowledge, the software-developing community was and continues to be quite active. Hence, in support of WORDNET several APIs and software tools were released that allow accessing, browsing and visualizing WORDNET data and measuring semantic similarity on the basis of WORDNET's lexical data structures.[6] The majority of these APIs are well maintained and kept up to date, such as JAWS[7] and JWNL,[8] and enable connecting to the most recent version of WORDNET. For the calculation of various similarity measures, the PERL library WORDNET::SIMILARITY, initiated and maintained by Ted Pedersen,[9] can be considered a de facto standard and has been used in various experimental settings and applications. This availability of well-documented and well-maintained software is definitely a strong argument for relying on WORDNET as a powerful lexico-semantic knowledge resource.
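
    The lexicographic file syntax summarized in Section 2.1 can be made concrete with a small Python sketch. It is an illustration under simplifying assumptions, not part of the WORDNET distribution: the function names format_synset and parse_synset are hypothetical, only noun synsets with hypernym pointers ("@") are handled, words are assumed to contain no whitespace, and lexical identifiers and reserved characters are ignored.

```python
def format_synset(words, hypernyms, gloss):
    """Build one lexicographic-file line: { word, ... target,@ ... (gloss) }."""
    parts = [f"{word}," for word in words]              # synonyms, comma-terminated
    parts += [f"{target},@" for target in hypernyms]    # "@" marks a hypernym pointer
    parts.append(f"({gloss})")                          # gloss in round brackets
    return "{ " + " ".join(parts) + " }"


def parse_synset(line):
    """Split a line produced by format_synset back into its three parts."""
    line = line.strip()
    if not (line.startswith("{") and line.endswith("}")):
        raise ValueError("a synset must be enclosed in curly brackets")
    body = line[1:-1].strip()
    head, _, rest = body.partition("(")     # the gloss starts at the first "("
    gloss = rest.rsplit(")", 1)[0]          # and ends at the last ")"
    words, hypernyms = [], []
    for token in head.split():
        if token.endswith(",@"):
            hypernyms.append(token[:-2])
        elif token.endswith(","):
            words.append(token[:-1])
    return words, hypernyms, gloss


line = format_synset(
    ["monkey"],
    ["primate"],
    "any of various long-tailed primates (excluding the prosimians)",
)
print(line)
```

    With these inputs the sketch reproduces the "monkey" line quoted in Section 2.1. Note that the words themselves act as pointer targets; no separate synset identifier appears anywhere in the line.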

    [6] For a comprehensive overview of available WORDNET tools we refer to WORDNET's related project website (http://wordnet.princeton.edu/links).
    [7] http://engr.smu.edu/tspell/
    [8] http://jwordnet.sourceforge.net/
    [9] http://wn-similarity.sourceforge.net/

    3 The BIOWORDNET Initiative

    In this section, we describe our approach to extend WORDNET towards the biomedical domain by incorporating terminological resources from the OBO collection. The most obvious problems we faced were to define a common data format and to map non-compliant data formats to the chosen one.

    3.1 OBO Ontologies

    OBO is a collection of publicly accessible biomedical ontologies.[10] They cover terms from many biomedical subdomains and offer structured, domain-specific knowledge in terms of classes (which often come with synonyms and textual definitions) and class hierarchies. Besides the hierarchy-defining relation is-a, some OBO ontologies provide additional semantic relation types such as sequence-of or develops-from to express even more complex and finer-grained domain-specific knowledge. The ontologies vary significantly in size (up to 60,000 classes with more than 150,000 synonyms), the number of synonyms per term and the nature of terms.

    [10] http://www.bioontology.org/

    [Figure 1: From OBO ontologies to BIOWORDNET: towards a domain-specific WORDNET for biomedicine. Step 1: data extraction from the OBO ontology in OWL format. Step 2: conversion of the extracted data to the WordNet lexicographic file format, yielding the BioWordNet lexicographic file (sample entries such as "{ histoblast, simple_col...", "{ structural_cell, cell_by..."). Step 3: building the WordNet index using grind, yielding the BioWordNet index file. Step 4: the BioWordNet index can be used by various software components and APIs (WordNet Browser, WordNet API, Similarity Measuring). Step 5: it can further be processed in NLP components of IR and NLP applications (Information Retrieval, Anaphora Resolution, Document Clustering).]

    The OBO ontologies are available in various formats including the OBO flat file format, XML and OWL. We chose to work with the OWL version for

    our purpose,[11] since appropriate tools are available for the OWL language that facilitate the extraction of particular information from the ontologies, such as taxonomic links, labels, synonyms and textual definitions of classes.

    3.2 From OBO to BIOWORDNET

    Our plan was to construct a BIOWORDNET by converting, in the first step, the OBO ontologies into a WORDNET hierarchy of synsets, while keeping to the WORDNET lexicographic file format, and building a WORDNET index. As a preparatory step, we defined a mapping from the ontology to WORDNET items as shown in Table 1.

    | OBO ontology          | BIOWORDNET         |
    |-----------------------|--------------------|
    | ontology class        | synset             |
    | class definition      | synset gloss       |
    | class name            | word in synset     |
    | synonym of class name | word in synset     |
    | Ci is-a Cj            | Si hyponym of Sj   |
    | Cj has-subclass Ci    | Sj hypernym of Si  |

    Table 1: Mapping between items from OBO and from BIOWORDNET (Ci and Cj denote ontology classes, Si and Sj the corresponding BIOWORDNET synsets)

    The three-stage conversion approach is depicted in Figure 1. First, domain-specific terms and taxonomic links between terms were extracted separately from each of the OBO ontologies. Then the extracted data was converted according to the syntax specifications of WORDNET's lexicographic file. Finally, for each of the converted ontologies the WORDNET-specific index was built using GRIND.

    Following this approach we ran into several problems, both regarding the WORDNET data structure and the WORDNET-related software that we used for the construction of the BIOWORDNET. Converting the OBO ontologies turned out to be cumbersome, especially the conversion of the CHEBI ontology[12] (long class names holding many special characters) and the NCI thesaurus[13] (large number of classes and some classes that also have a large number of subclasses). These and additional problems will be addressed in more detail in Section 4.

    [11] http://www.w3.org/TR/owl-semantics/
    [12] http://www.ebi.ac.uk/chebi/
    [13] http://nciterms.nci.nih.gov/

    4 Problems with WORDNET's Data Format and Software Infrastructure

    We here discuss two types of problems we found for the data format underlying the WORDNET lexicon and the software that helps building a WORDNET file and creating an index for this file. First, WORDNET's data structure puts several restrictions on what can be expressed in a WORDNET lexicon. For example, it constrains lexical information to a fixed number of homonyms and a fixed set of relations.

    Second, the data structure imposes a number of restrictions on the string format level. If these restrictions are violated, the WORDNET processing software throws error messages which differ considerably in terms of informativeness for error tracing and detection, or which do not even surface at all at the lexicon builder's administration level.

    4.1 Limitations of Expressiveness

    The syntax on which the current WORDNET lexicographic file is based imposes severe limitations on what can be expressed in WORDNET. Although these limitations might be irrelevant for representing general-language terms, they do affect the construction of a WORDNET-like resource for biomedicine. To give some examples, the WORDNET format allows a 16-fold lexical ambiguity only (lexical IDs that are assigned to ambiguous words are restricted to the numbers 0-15, see Section 2). This forced us to neglect some of the OBO ontology class names and synonyms that were highly ambiguous.[14]

    Furthermore, the OBO ontologies offer a richer set of semantic relations than WORDNET does. Thus, a general problem with the conversion of the OBO ontologies into WORDNET format was that, except for the taxonomic is-a relation (which corresponds to the WORDNET hyponym relation) and the part-of relation (which corresponds to the WORDNET meronym relation), all remaining OBO-specific relations (such as develops-from, sequence-of, variant-of and position-of) could not be represented in the BIOWORDNET. The structure of WORDNET neither contains such relations nor is it flexible enough to include them, so that we face a systematic loss of information in BIOWORDNET compared to the original OBO ontologies. Although these restrictions are well-known, their removal would require extending the current WORDNET data structure fundamentally. This, in turn, would probably necessitate a full re-programming of all WORDNET-related software.

    [14] This is a well-known limitation that is already mentioned in the WORDNET documentation.

    4.2 Limitations of Data Format and Software

    When we tried to convert data extracted from the OBO ontologies into WORDNET's lexicographic file format (preserving its syntactic idiosyncrasies for the sake of quick and straightforward reusability of software add-ons), we encountered several intricacies that took a lot of time prior to building a valid lexicographic file.

    First, we had to replace 31 different characters with unique strings, such as "(" with "-LRB-" and "+" with "-PLU-", before GRIND was able to process the lexicographic file. The reason is that many of such special characters occurring in domain-specific terms, especially in designators of chemical compounds such as "methyl ester 2,10-dichloro-12H-dibenzo(d,g)(1,3)dioxocin-6-carboxylic acid" (also known as "treloxinate", with the CAS registry number 30910-27-1), are reserved symbols in the WORDNET data formatting syntax. If these characters are not properly replaced, GRIND throws an exact and useful error message (see Table 2, first row).

    Second, we found that we had to replace all empty glosses by at least one whitespace character. Otherwise, GRIND informs the user by means of a rather cryptic error message that mentions the position of the error though not its reason (see Table 2, second row).

    Third, numbers at the end of a lexical item need to be escaped. In WORDNET, the string representation


    of an item is used as its unique identifier. To distinguish homonyms (words with the same spelling but different meaning, such as "cell" as the functional unit of all organisms, on the one hand, and as small compartment, on the other hand), according to the WORDNET format different numbers from 0 to 15 (so-called lexical IDs) have to be appended to the end of each homonym. If in a lexicographic file two identical strings occur that have not been assigned different lexical identifiers (it does not matter whether this happens within or across synsets), GRIND emits an error message that mentions both the position and the lexical entry which caused this error (cf. Table 2, third row).

    Numbers that appear at the end of a lexical item as an integral part of it (such as "2" in "IL2", a special type of cytokine (protein)) have to be escaped in order to avoid their misinterpretation as lexical identifiers. This, again, is a well-documented shortcoming of WORDNET's data specification rules. In case such numbers are not escaped prior to presenting the lexicographic file to GRIND, the word-closing numbers are always interpreted as lexical identifiers. Closing numbers that exceed the number 15 cause GRIND to throw an informative error message (see Table 2, fourth row).

    | Problem Description | Sample Error Message | Usefulness of Error Message | Problem Solution |
    |---|---|---|---|
    | illegal use of key characters | noun.cell, line 7: Illegal character % | high | replace illegal characters |
    | empty gloss | sanity error - actual pos 2145 != assigned pos 2143! | moderate | add gloss consisting of at least one whitespace character |
    | homonyms (different words with identical strings) | noun.rex, line 5: Synonym "electrochemical reaction" is not unique in file | high | distinguish word senses by adding lexical identifiers (use the numbers 1-15) |
    | lexical ID larger than 15 | noun.rex, line 4: ID must be less than 16: cd25 | high | quote trailing numbers of words, only assign lexical identifiers between 1-15, omit additional word senses |
    | word with more than 425 characters | Segmentation fault (core dumped) | low | omit words that exceed the maximal length of 425 characters |
    | synset with more than 998 direct hyponymous synsets | Segmentation fault (core dumped) | low | omit some hyponymous synsets or introduce intermediate synsets with a limited number of hyponymous synsets |
    | no query result though the synset is in the index, access software crashes | none | - | not known |

    Table 2: Overview of the different kinds of problems that we encountered when creating a BIOWORDNET keeping to the WORDNET data structure and the corresponding software. Each problem description is followed by a sample error message that GRIND had thrown, a statement about how useful the error message was to detect the source of the error and a possible solution for the problems, if available. The last row documents a special experience with data viewers for data from the NCI thesaurus.

    4.3 Undocumented Restrictions and Insufficient Error Messages

    In addition to the more or less documented restrictions of the WORDNET data format mentioned above, we found additional restrictions that, to the best of our knowledge, lack documentation up until now.

    First, it seems that the length of a word is restricted to 425 characters. If a word in the lexicographic file exceeds this length, GRIND is not able to create an index and throws an uninformative error message, namely the memory error "segmentation fault" (cf. Table 2, fifth row). As a consequence of this restriction, some very long CHEBI class names could not be included in the BIOWORDNET.

    Second, it seems that synsets are only allowed to group up to 998 direct hyponymous synsets. Again, GRIND is not able to create an index if this restriction is not obeyed, and throws the same uninformative memory error message "segmentation fault" (cf. Table 2, sixth row). An NCI thesaurus class that had more than 998 direct subclasses thus could not be included in the BIOWORDNET.

    Due to insufficient documentation and utterly general error messages, the only way to locate the problem causing the segmentation fault errors was to examine the lexicographic files manually. We had to reduce the number of synset entries in the lexicographic file, step by step, in a kind of trial-and-error approach until we could resolve the problem. This is, no doubt, a highly inefficient and time-consuming procedure. More informative error messages of GRIND would have helped us a lot.

    4.4 Deceptive Results from WORDNET Software and Third-Party Components

    After getting rid of all previously mentioned errors, valid index files were compiled. It was possible to access these index files using the WORDNET querying tools WN and WNB, indicating the index files were valid. However, when we tried to query the index file that was generated by GRIND for the NCI thesaurus we got strange results. While WN did not return any query results, the browser WNB crashed without any error message (cf. Table 2, seventh row). The same holds for the Java APIs JAWS and JWNL.

    Since a manual examination of the index file revealed that the entries that we were searching for were, in fact, included in the file, some other error, unknown up to this step, must have prevented the software tools from finding the targeted entries. Hence, we want to point out that although we have examined this error for the NCI thesaurus only, the risk is high that this "no show" error will bias any other application as well which makes use of the same software that we grounded our experiments on. Even worse, since the NCI thesaurus is a very large resource, further manual error search is nearly impossible. At this point, we stopped our attempt at building a WORDNET resource for biomedicine based on the WORDNET formatting and software framework.

    5 Related Work

    In the literature dealing with WORDNET and its structures from a resource perspective (rather than dealing with its applications), two directions can be distinguished. On the one hand, besides the original English WORDNET and the various variant WORDNETs for other languages (Vossen, 1998), extensions to particular domains have already been proposed (for the medical domain by Buitelaar and Sacaleanu (2002) and Fellbaum et al. (2006); for the architectural domain by Bentivogli et al. (2004); and for the technical report domain by Vossen (2001)). However, none of these authors mentions implementation details of the WORDNETs or the performance pitfalls we have encountered, nor is supplementary software pointed out that might be useful for our work.

    On the other hand, there are suggestions concerning novel representation formats of next-generation WORDNETs. For instance, in the BALKANET project (Tufis et al., 2004), an XML schema plus a DTD was proposed (Smrz, 2004) and an editor called VISDIC with basic maintenance functionalities and consistency check was released (Horak and Smrz, 2004). The availability of APIs or software to measure similarity, though, remains an open issue.


    So, our approach to reuse the structure and the software for building a BIOWORDNET was motivated by the fact that we could not find any alternatives coming with a software ensemble as described in Section 2. Against all expectations, we did not manage to reuse the WORDNET data structure. However, there are no publications that report on such difficulties and pitfalls as we were confronted with.

    6 Discussion and Conclusion

    We learnt from our conversion attempt that the current representation format of WORDNET suffers from several limitations and idiosyncrasies that cannot be by-passed by a simple, yet ad hoc work-around. Many of the limitations and pitfalls we found (limiting in the sense of what can be expressed in WORDNET) are due to the fact that its data format is out-of-date and not really suitable for the biomedical sublanguage. In addition, though we do not doubt that the WORDNET software works fine for the official WORDNET release, our experiences taught us that it fails or gives limited support in case of building and debugging a new WORDNET resource. Even worse, we have evidence from one large terminological resource (NCI) that WORDNET's software infrastructure (GRIND) renders deceptive results.
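
    The step-by-step reduction of the lexicographic file described in Section 4.3 can, in principle, be automated as a binary search. The Python sketch below is only an illustration and not the procedure we used: the function name shortest_failing_prefix and the callable fails are hypothetical. fails is assumed to write the given synset lines to a file, run the index builder on it, and return True only for the crash of interest; how that is done is deliberately left open. The search also assumes that once a prefix of the file crashes, every longer prefix crashes as well, which holds for both the overlong word and the excessive number of hyponyms.

```python
def shortest_failing_prefix(lines, fails):
    """Return the smallest n with fails(lines[:n]) true, or None if nothing fails.

    The offending (or limit-crossing) synset line is then lines[n - 1].
    """
    if not fails(lines):
        return None
    low, high = 0, len(lines)    # invariant: lines[:low] passes, lines[:high] fails
    while high - low > 1:
        middle = (low + high) // 2
        if fails(lines[:middle]):
            high = middle
        else:
            low = middle
    return high


# Stand-in predicate for demonstration only: "crash" on any line longer than
# 425 characters. A real predicate would invoke the index builder instead.
def demo_fails(subset):
    return any(len(entry) > 425 for entry in subset)


demo_lines = ["{ ok, (gloss) }"] * 5 + ["{ " + "x" * 500 + ", (gloss) }"] + ["{ ok, (gloss) }"] * 3
print(shortest_failing_prefix(demo_lines, demo_fails))
```

    Such a search needs about log2(n) index builds for n synset lines instead of a manual walk through the file. One caveat: truncating a file can leave pointers to synsets that are no longer present, so the predicate has to distinguish the crash from errors caused by the truncation itself.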

    Although WORDNET might no longer be the one and only lexical resource for NLP, each year a continuously strong stream of publications on the use of WORDNET illustrates its importance for the community. On this account we find it remarkable that although improvements in content and structure of WORDNET have been proposed (e.g., Boyd-Graber et al. (2006) propose to add (weighted) connections between synsets, Oltramari et al. (2002) suggest to restructure WORDNET's taxonomical structure, and Mihalcea and Moldovan (2001) recommend to merge synsets that are too fine-grained), to the best of our knowledge, no explicit proposals have been made to improve the representation format of WORDNET in combination with the adaptation of the WORDNET-related software.

    According to our experiences, the existing WORDNET software is hardly (re)usable due to insufficient error messages that the software throws and limited documentation. From our point of view it would be highly preferable if the software were improved and made more user-supportive (more meaningful error messages would already improve the usefulness of the software). In terms of the actual representation format of WORDNET, we found that using the current format is not only cumbersome and error-prone, but also limits what can be expressed in a WORDNET resource.
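
    To illustrate what more meaningful diagnostics could look like, the restrictions collected in Table 2 can be checked on the extracted data before any index is built. The following Python sketch rests on assumptions: the Synset class and check_synsets function are hypothetical, the reserved-character set contains only the characters named in this paper (the full list of 31 is not reproduced here), and direct hyponyms are counted by the word used as hypernym target.

```python
from collections import Counter
from dataclasses import dataclass, field

MAX_WORD_LENGTH = 425        # undocumented limit reported in Section 4.3
MAX_DIRECT_HYPONYMS = 998    # undocumented limit reported in Section 4.3
MAX_SENSES = 16              # lexical IDs are restricted to 0-15
# Assumption: only the characters named in the text, not the full list of 31.
RESERVED_EXAMPLES = frozenset("(+%")


@dataclass
class Synset:
    words: list
    hypernyms: list = field(default_factory=list)
    gloss: str = ""


def check_synsets(synsets, reserved=RESERVED_EXAMPLES):
    """Return readable problem reports for a list of Synset objects."""
    problems = []
    for number, synset in enumerate(synsets, start=1):
        for word in synset.words:
            found = sorted(set(word) & set(reserved))
            if found:
                problems.append(f"synset {number}: reserved character(s) {found} in {word!r}")
            if len(word) > MAX_WORD_LENGTH:
                problems.append(f"synset {number}: word longer than {MAX_WORD_LENGTH} characters")
            if word[-1:].isdigit():
                problems.append(f"synset {number}: trailing number in {word!r} must be escaped")
        if not synset.gloss:
            problems.append(f"synset {number}: empty gloss")
    senses = Counter(word for synset in synsets for word in synset.words)
    for word, count in senses.items():
        if count > MAX_SENSES:
            problems.append(f"{word!r} occurs {count} times, more than {MAX_SENSES} senses")
    hyponyms = Counter(target for synset in synsets for target in synset.hypernyms)
    for target, count in hyponyms.items():
        if count > MAX_DIRECT_HYPONYMS:
            problems.append(f"{target!r} has {count} direct hyponyms, limit is {MAX_DIRECT_HYPONYMS}")
    return problems


for problem in check_synsets([Synset(["IL2"], ["cytokine"], "")]):
    print(problem)
```

    Every report names the synset and the reason, which is the information missing from the segmentation faults listed in Table 2. The sketch cannot detect the unexplained query failure in the last row of that table.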


    From our perspective this indicates the need for a major redesign of WORDNET's data structure foundations to keep up with the standards of today's meta data specification languages (e.g., based on RDF (Graves and Gutierrez, 2006), XML or OWL (Lüngen et al., 2007)). We encourage the reimplementation of WORDNET resources based on such a state-of-the-art markup language (for OWL in particular a representation of WORDNET is already available, cf. van Assem et al. (2006)). Of course, if a new representation format is used for a WORDNET resource, the software accessing the resource also has to be adapted to the new format. This may require substantial implementation efforts that we think are worth spending, if the new format overcomes the major problems that are due to the original WORDNET format.
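
    As a rough illustration of why a markup-based format would remove several of the problems above, the Python sketch below serializes a synset with an explicit identifier using the standard xml.etree.ElementTree module. The element and attribute names (synset, word, gloss, relation, id, type, target) and all identifiers are invented for this example; they do not follow the schema of any of the proposals cited in this paper.

```python
import xml.etree.ElementTree as ET


def synset_element(synset_id, words, gloss, relations=()):
    """Build a <synset> element; relations is a sequence of (type, target_id) pairs."""
    element = ET.Element("synset", id=synset_id)
    for word in words:
        # Words are plain text: no escaping of "(", "+" or trailing digits is needed.
        ET.SubElement(element, "word").text = word
    ET.SubElement(element, "gloss").text = gloss
    for relation_type, target_id in relations:
        # Relation types are free strings, so OBO-specific relations fit as well.
        ET.SubElement(element, "relation", type=relation_type, target=target_id)
    return element


# All identifiers and relation targets below are made up for illustration.
element = synset_element(
    "bwn-0001",
    ["IL2", "interleukin 2"],
    "a special type of cytokine",
    relations=[("hypernym", "bwn-0002"), ("develops-from", "bwn-0003")],
)
xml_text = ET.tostring(element, encoding="unicode")
print(xml_text)
```

    Because synsets are referenced by identifier rather than by word string, homonyms need no lexical IDs and the limit of 16 senses per word disappears. The price, as noted above, is that all software reading the resource has to be adapted to the new format.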

    Acknowledgments

    This work was funded by the German Ministry of Education and Research within the STEMNET project (01DS001A-C) and by the EC within the BOOTSTREP project (FP6-028099).

    References

    Luisa Bentivogli, Andrea Bocco, and Emanuele Pianta. 2004. ARCHIWORDNET: Integrating WORDNET with domain-specific knowledge. In Petr Sojka, Karel Pala, Christiane Fellbaum, and Piek Vossen, editors, GWC 2004: Proceedings of the 2nd International Conference of the Global WordNet Association, pages 39-46. Brno, Czech Republic, January 20-23, 2004.

    Olivier Bodenreider and Anita Burgun. 2002. Characterizing the definitions of anatomical concepts in WORDNET and specialized sources. In Proceedings of the 1st International Conference of the Global WordNet Association, pages 223-230. Mysore, India, January 21-25, 2002.

    Jordan Boyd-Graber, Christiane Fellbaum, Daniel Osherson, and Robert Schapire. 2006. Adding dense, weighted connections to WORDNET. In Petr Sojka, Key-Sun Choi, Christiane Fellbaum, and Piek Vossen, editors, GWC 2006: Proceedings of the 3rd International WORDNET Conference, pages 29-35. South Jeju Island, Korea, January 22-26, 2006.

    Paul Buitelaar and Bogdan Sacaleanu. 2002. Extending synsets with medical terms WORDNET and specialized sources. In Proceedings of the 1st International Conference of the Global WordNet Association. Mysore, India, January 21-25, 2002.

    Anita Burgun and Olivier Bodenreider. 2001. Comparing terms, concepts and semantic classes in WORDNET and the UNIFIED MEDICAL LANGUAGE SYSTEM. In Proceedings of the NAACL 2001 Workshop "WORDNET and Other Lexical Resources: Applications, Extensions and Customizations", pages 77-82. Pittsburgh, PA, June 3-4, 2001. New Brunswick, NJ: Association for Computational Linguistics.


    Jose Castano, Jason Zhang, and James Pustejovsky. 2002. Anaphora resolution in biomedical literature. In Proceedings of The International Symposium on Reference Resolution for Natural Language Processing. Alicante, Spain, June 3-4, 2002.

    Christiane Fellbaum, Udo Hahn, and Barry Smith. 2006. Towards new information resources for public health: From WORDNET to MEDICAL WORDNET. Journal of Biomedical Informatics, 39(3):321-332.

    Christiane Fellbaum, editor. 1998. WORDNET: An Electronic Lexical Database. Cambridge, MA: MIT Press.

    Alvaro Graves and Claudio Gutierrez. 2006. Data representations for WORDNET: A case for RDF. In Petr Sojka, Key-Sun Choi, Christiane Fellbaum, and Piek Vossen, editors, GWC 2006: Proceedings of the 3rd International WORDNET Conference, pages 165-169. South Jeju Island, Korea, January 22-26, 2006.

    Udo Hahn, Martin Romacker, and Stefan Schulz. 1999. Discourse structures in medical reports: watch out! The generation of referentially coherent and valid text knowledge bases in the MEDSYNDIKATE system. International Journal of Medical Informatics, 53(1):1-28.

    Sanda M. Harabagiu, Razvan C. Bunescu, and Steven J. Maiorano. 2001. Text and knowledge mining for coreference resolution. In NAACL'01, Language Technologies 2001: Proceedings of the 2nd Meeting of the North American Chapter of the Association for Computational Linguistics, pages 1-8. Pittsburgh, PA, USA, June 2-7, 2001. San Francisco, CA: Morgan Kaufmann.

    Deirdre Hogan. 2007. Coordinate noun phrase disambiguation in a generative parsing model. In ACL'07: Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, pages 680-687. Prague, Czech Republic, June 28-29, 2007. Stroudsburg, PA: Association for Computational Linguistics.

    Ales Horak and Pavel Smrz. 2004. New features of wordnet editor VisDic. Romanian Journal of Information Science and Technology (Special Issue), 7(1-2):201-213.

    Harald Lüngen, Claudia Kunze, Lothar Lemnitzer, and Angelika Storrer. 2007. Towards an integrated OWL model for domain-specific and general language WordNets. In Attila Tanacs, Dora Csendes, Veronika Vincze, Christiane Fellbaum, and Piek Vossen, editors, GWC 2008: Proceedings of the 4th Global WORDNET Conference, pages 281-296. Szeged, Hungary, January 22-25, 2008.

    Rada Mihalcea and Dan Moldovan. 2001. EZ.WORDNET: Principles for automatic generation of a coarse grained WORDNET. In Proceedings of the 14th International Florida Artificial Intelligence Research Society (FLAIRS) Conference, pages 454-458.
