Enhanced Ontological Searching of Medical Scientific Information

Upload: christos-karaiskos

Post on 03-Jun-2018


  • 8/12/2019 Enhanced Ontological Searching of Medical Scientific Information


    Contents

    Abstract
    Declaration
    Intellectual Property Statement
    Acknowledgements
    List of Abbreviations
    List of Tables
    List of Figures

    1 Introduction
      1.1 Problem Context
      1.2 Motivation
      1.3 Contribution
      1.4 Thesis Organization

    2 Ontologies
      2.1 Modern Ontology Definition
      2.2 Ontology vs. Terminology
      2.3 Notable Biomedical Ontologies and Terminologies
        2.3.1 SNOMED CT


        2.3.2 NDF-RT
        2.3.3 ICD-10
        2.3.4 MedDRA
        2.3.5 NCI Thesaurus

    3 Similarity Metrics
      3.1 Similarity Metric vs. Distance Metric
      3.2 Lexical Similarity
        3.2.1 Character-based Similarity Measures
          Longest Common Substring
          Hamming Similarity
          Levenshtein Similarity
          Jaro Similarity
          Jaro-Winkler Similarity
          N-gram Similarity
        3.2.2 Word-based Similarity Measures
          Dice Similarity
          Jaccard Similarity
          Cosine Similarity
          Manhattan Similarity
          Euclidean Similarity
      3.3 Ontological Semantic Similarity
        3.3.1 Intra-ontology Semantic Similarity
          Distance-based Metrics
          Information-Based Metrics
          Feature-Based Measures
        3.3.2 Inter-ontology Semantic Similarity

    4 Search Interfaces
      4.1 Information Seeking Models
      4.2 Query Specification


      4.3 Presentation of Search Results
      4.4 Query Reformulation

    5 Requirements
      5.1 Feature Specification

    6 Design
      6.1 Stage I: Access to Medical Ontologies
        6.1.1 Database and Table Creation
        6.1.2 Populating the Database Tables
      6.2 Stage II: Computation of Semantic Similarity
        6.2.1 Term Neighborhoods
        6.2.2 Semantic Similarity Calculation
      6.3 Stage III: Interface Design Data Presentation
      6.4 Summary of Technology Choices

    7 Implementation
      7.1 Structure
      7.2 Search Entry Form
      7.3 Handling the Input Query
        7.3.1 Typing Speed
        7.3.2 Querying the Database
        7.3.3 Ranking and Grouping of Search Results
        7.3.4 Return-key or Mouse-click Search
        7.3.5 Auto-completion Search
      7.4 Error Correction
      7.5 Term Information Presentation
      7.6 Navigation

    8 Evaluation
      8.1 Testing the Failed Queries
      8.2 Comparison to BioPortal Search Services


        8.2.1 Auto-completion
        8.2.2 Results Ranking
        8.2.3 Error Correction
        8.2.4 Visualization
      8.3 Comments from an AstraZeneca Search Specialist

    9 Conclusions and Future Work
      9.1 Conclusions
      9.2 Future Work

    Bibliography

    Number of Words in the Document: 25648


    University of Manchester

    School of Computer Science

    Degree Programme of Advanced Computer Science

    ABSTRACT OF

    MASTER'S THESIS

    Author: Christos Karaiskos

    Title: Enhanced Ontological Searching of Medical Scientific Information

    Supervisors: Prof. Andrew Brass (University of Manchester)

    Dr. Jennifer Bradford (AstraZeneca)

    Abstract: An enormous amount of biomedical knowledge is encoded in narrative textual
    format. In an attempt to discover new or hidden knowledge, extensive research is being
    conducted to extract and exploit term relationships from plain text, with the aid of
    technology. A common approach for the identification of biomedical entities in plain
    text involves the use of ontologies, i.e., knowledge bases which provide formal
    machine-understandable representations of domains of variable specificity. In addition
    to term extraction, ontologies may be used as controlled vocabularies or as a means for
    automatic knowledge acquisition through their inherent inference capabilities.
    Visualization of the content of ontologies is, thus, very important for researchers in
    the biomedical domain. Unfortunately, many of these researchers find it difficult to
    deal with formal logic and would prefer that ontology search interfaces completely hide
    any structural or functional references to ontologies. This thesis proposes a strategy
    for building a web-based ontology search application that exploits ontologies behind
    the scenes, transparently to the end user, and presents relevant concept information in
    such a way that searchers can successfully and quickly find what they are looking for.
    The proposed search interface features various search tools for enhanced ontological
    searching, including term auto-completion, error correction, clever results ranking,
    and similar term visualizations based on semantic similarity metrics. Evaluation of the
    developed application shows that its features can improve enterprise-strength ontology
    search applications, such as BioPortal.

    Keywords: search interface design, ontology hiding, biomedical ontology, semantic
    similarity, usability, data integration


    Declaration

    No portion of the work referred to in the dissertation has been submitted in support of
    an application for another degree or qualification of this or any other university or
    other institute of learning.


    Intellectual Property Statement

    i. The author of this dissertation (including any appendices and/or schedules to this
    dissertation) owns certain copyright or related rights in it (the "Copyright") and he
    has given The University of Manchester certain rights to use such Copyright, including
    for administrative purposes.

    ii. Copies of this dissertation, either in full or in extracts and whether in hard or
    electronic copy, may be made only in accordance with the Copyright, Designs and Patents
    Act 1988 (as amended) and regulations issued under it or, where appropriate, in
    accordance with licensing agreements which the University has entered into. This page
    must form part of any such copies made.

    iii. The ownership of certain Copyright, patents, designs, trade marks and other
    intellectual property (the "Intellectual Property") and any reproductions of copyright
    works in the dissertation, for example graphs and tables ("Reproductions"), which may
    be described in this dissertation, may not be owned by the author and may be owned by
    third parties. Such Intellectual Property and Reproductions cannot and must not be made
    available for use without the prior written permission of the owner(s) of the relevant
    Intellectual Property and/or Reproductions.

    iv. Further information on the conditions under which disclosure, publication and
    commercialisation of this dissertation, the Copyright and any Intellectual Property
    and/or Reproductions described in it may take place is available in the University IP
    Policy (see http://documents.manchester.ac.uk/display.aspx?DocID=487), in any relevant
    Dissertation restriction declarations deposited in the University Library, The
    University Library's regulations (see
    http://www.manchester.ac.uk/library/aboutus/regulations) and in The University's
    Guidance for the Presentation of Dissertations.


    Acknowledgements

    I am deeply grateful to my supervisors, Prof. Andrew Brass (University of Manchester)
    and Dr. Jennifer Bradford (AstraZeneca), for their invaluable guidance and support
    throughout the duration of this project. I have greatly benefited from experiencing the
    different perspectives of academia and industry, which have both contributed to shaping
    the final outcome of this project.

    I would like to thank Sebastian Philipp Brandt (University of Manchester), for his
    suggestions on making the search application even better. Also, I would like to express
    my gratitude to Julie Mitchell (AstraZeneca), for taking the time to evaluate the
    application, and Paul Metcalfe (AstraZeneca), for his advice on improving the
    performance and security of the application.

    Finally, I would like to thank Matina for her patience and love, and my parents,
    Ioannis and Stavroula, for always being there.


    List of Abbreviations

    AI Artificial Intelligence

    AJAX Asynchronous JavaScript and XML

    API Application Programming Interface

    CSS Cascading Style Sheets

    DAG Directed Acyclic Graph

    HLGT High Level Group Term

    HLT High Level Term

    HTTP Hypertext Transfer Protocol

    IC Information Content

    ICD International Classification of Diseases

    JDBC Java Database Connectivity

    JSON JavaScript Object Notation

    LCS Least Common Subsumer

    MedDRA Medical Dictionary for Regulatory Activities

    NCIT National Cancer Institute Thesaurus

    NDF-RT National Drug File Reference Terminology


    NHS UK National Health Service

    NLP Natural Language Processing

    OBO Open Biomedical Ontologies

    OWL Web Ontology Language

    PHP PHP Hypertext Preprocessor

    PT Preferred Term

    RDF Resource Description Framework

    RDF-S Resource Description Framework Schema

    REST Representational State Transfer

    RF2 Release Format 2

    SNOMED CT Systematized Nomenclature of Medicine Clinical Terms

    SNOMED RT Systematized Nomenclature of Medicine Reference Terminology

    SOC System Organ Class

    UMLS Unified Medical Language System

    URI Uniform Resource Identifier

    URL Uniform Resource Locator

    UX User Experience

    VA U.S. Department of Veterans Affairs

    WHO World Health Organization

    XHTML Extensible HyperText Markup Language

    XML Extensible Markup Language


    List of Figures

    2.1 The structure of the MedDRA terminology comprises a fixed-depth hierarchy.

    4.1 The Google search engine entry form.

    4.2 Facebook uses grayed-out descriptive text to help in the formulation of user
    queries.

    4.3 Bing's search interface features a powerful dynamic search suggestion, where
    prefixes are highlighted with grayed-out font and the remaining text is in bold.

    4.4 The Safari browser's embedded search interface explicitly states which queries are
    suggestions and which belong to the user's recent search history.

    4.5 The Firefox browser's embedded search interface contains recent queries on top, and
    separates them from suggestions using a solid line.

    4.6 Google's search results page is a typical scrollable vertical list of captions.
    Metadata facets, which restrict results to a particular type of information, are also
    present in the interface (e.g. the "Images" tab).

    4.7 Amazon's search interface provides facets as a left panel to the results page,
    helping the user dynamically refine the initial search.


    4.8 PubMed's results page includes term expansion in two ways. On the right of the
    screen, there is a "Related searches" panel that preserves the initial query and adds a
    new related term to it. Also, right below the entry form there is a "See also" feature
    which suggests complete or partial modifications to the initial query.

    6.1 A part of the XML response for the get all terms query of Table 6.2.

    6.2 The provided methods of the ontoCAT API (Adamusiak et al., 2011).

    6.3 Populating the Ontologies database is performed with the help of the ontoCAT API.

    7.1 The organization of the files that comprise the web application. These files are
    responsible for the presentation, styling and interactive behavior of the web
    application.

    7.2 The main window of the search application. The search box is placed at the top of
    the screen, with central horizontal alignment. A submit button labeled "Search" is also
    provided, to assist users who prefer mouse-clicking.

    7.3 Once the user clicks inside the search box, the grey help message disappears and a
    blinking cursor takes its place.

    7.4 Terms that would otherwise appear on their own table row are grouped under a term
    that lexically matches the query more closely, when their semantic similarity to that
    term is higher than a threshold.

    7.5 Pressing the Return key or clicking the "Search" button submits the query to
    index.php and a table of search results is added to the interface.

    7.6 Part of the JSON response from performQuery.php, for the input query "rash". Each
    JSON object represents a term matching the query, and contains information that can be
    used for its presentation.


    7.7 Pressing any key other than Return submits the query through AJAX to
    performQuery.php and an auto-completion pop-up menu is created from the JSON response.

    7.8 Error correction when the input query is "lyng". The closest term is suggested, as
    a clickable link.

    7.9 When the user places the mouse cursor on a circle, a tooltip immediately appears,
    containing the full term name and the semantic similarity score with the viewed term.

    7.10 Presentation page for the NCIT term "Recurrent NSCLC". On the left side, the basic
    term information is shown, along with an XML representation of highly similar terms. On
    the right side, a visualization of highly similar terms is provided, using the D3
    JavaScript library.

    7.11 Presentation page for the MedDRA term "Rash". The term has very close relations
    with terms that are not in the hierarchy. This is illustrated using blue color.

    7.12 The XML representation of a term. It includes basic term information and highly
    similar terms.

    7.13 Help is provided through tooltips that activate on mouse-over.

    8.1 The term "DIHS" is not found, but this is normal, since it is not part of any of
    the supported ontologies. Instead, the term "DIOS" is proposed, in case the user had
    misspelt the query.

    8.2 The term "NMDA Antagonist" is not found, but this is normal, since it is not part
    of any of the supported ontologies. No soundex match is found, so no error corrections
    are suggested.

    8.3 The term "Hepatotoxicity" is shown in the auto-completion dialogue.

    8.4 The term "NSCLC" is shown in the auto-completion dialogue.

    8.5 The term "DRESS syndrome" is shown in the auto-completion dialogue.


    8.6 The query "LHRH" produces two different 100%-matching results. Unlike in the
    previous search application, the user can now see that "Gonadotropin Releasing Hormone"
    is a preferred term for "LHRH".

    8.7 The results for the query "VEGFR" illustrate a semantic grouping of 4 similar
    terms, namely "VEGFR", "Vascular Endothelial Growth Factor Receptor 1", "Vascular
    Endothelial Growth Factor Receptor 2" and "Vascular Endothelial Growth Factor Receptor
    3". The latter three are grouped under the parent term.

    8.8 The BioPortal interface is a simple text box, similar to this project's main page.

    8.9 BioPortal also offers advanced options to improve the search results.

    8.10 Only NCIT, MedDRA and ICD9CM are chosen for searching, out of the 353 ontologies
    offered by BioPortal, so that comparisons to this project's work are achievable.

    8.11 Auto-completion pop-up menu of the BioPortal NCIT widget when the user has typed
    "nsc". Only preferred terms are shown. The user might be confused when seeing the term
    "Becatecarin" in the results, since it does not contain "nsc".

    8.12 Auto-completion pop-up menu of this project's search application when the user has
    typed "nsc".

    8.13 Searching for "Denatonium Benzoate" through its preferred term name.

    8.14 Searching for "Denatonium Benzoate" through its synonym "THS-839".

    8.15 Searching for "Denatonium Benzoate" through its synonym "WIN 16568".

    8.16 BioPortal search results rankings for "nsclc". All terms are grouped according to
    the ontology they belong to, under the preferred name of the term that is lexically
    most relevant to the query.


    8.17 This project's search results rankings for "nsclc". Terms in the results are
    rearranged into groups that show high semantic similarity.

    8.18 BioPortal returns no search results for the erroneously spelt term "nsclca".

    8.19 BioPortal returns no search results for the erroneously spelt term "caancer".

    8.20 This project's search application returns a search suggestion of "nsclc" for the
    erroneously spelt term "nsclca".

    8.21 This project's search application returns a search suggestion of "cancer" for the
    erroneously spelt term "caancer".

    8.22 BioPortal uses a graph to visualize hierarchical relations. Edges are annotated
    with a description of the relationship between the connected nodes (e.g. subclassOf).

    8.23 This project's application focuses on inexperienced users and attempts to
    completely hide any formal-logic relationships that might confuse the user.

    8.24 Search results depicting causal associations between smoking and cancer, as
    presented by the I2E text mining application.

    8.25 Search results for the term "MEK inhibitor" in NCIT, when the I2E application is
    used.


    Chapter 1

    Introduction

    Ontologies are knowledge bases which provide formal machine-understandable
    representations of domains of variable specificity. Given a domain of discourse,
    concepts that belong to the domain are well documented in formal logic, along with
    their inter-relations. Ontologies, as representations, cannot perfectly capture the
    part of the world that they attempt to describe (Davis et al., 1993). They are based on
    the open world assumption, which states that if something is not represented in a
    knowledge base, it does not mean that it does not exist in the real world (Hustadt et
    al., 1994). As our knowledge about a domain increases, ontologies are updated and
    become more complex. This has become evident in the biomedical domain, where ontologies
    have already attained a high degree of specificity, and has led to their quick adoption
    for data integration and knowledge discovery purposes.

    1.1 Problem Context

    Within biomedicine, ontologies can help researchers communicate, by promoting
    consistent use of biomedical terms and concepts. The construction of an ontology itself
    involves mediating across multiple views and requires that a number of domain experts
    reach a consensus that reflects the diverse viewpoints of the community. Ontologies are
    viewed as tools that provide opportunities for new knowledge acquisition, due to the
    complex semantic relations that they model. Inferences in a huge ontology may reveal
    connections that the human eye would miss. This is especially important in the
    pharmaceutical sector, where drug discovery has slowed down significantly as a process,
    and in the biological sector, where attempts to demystify genome patterns associated
    with disease are still at an initial stage. Another common use for ontologies in the
    biomedical domain is as controlled vocabularies that feed filtered terms into computer
    applications. Finally, ontologies may be used to connect terms found in plain text to
    their semantic representations. Term extraction with the help of ontologies is a hot
    topic in biomedicine, due to the vast amounts of medical information stored in plain
    text. Given the importance of ontologies, researchers in the biomedical field commonly
    require access to their content.

    1.2 Motivation

    In the past, AstraZeneca employees were provided with a web-based search form that
    enabled them to look for concepts in one or more biomedical ontologies and select the
    most suitable from a list of search results. The chosen concepts were, in turn,
    conveyed to a text mining application. Understanding the results required the user to
    be familiar with the content and structure of the ontology from which the terms were
    retrieved. Unfortunately, most users did not feel comfortable with the idea of
    ontologies and struggled, or even refused, to use the provided interfaces, even though
    no logic-based content was there to confuse them.

    In many cases, though, this was not solely the fault of the users. The interface gave
    the users freedom to select the ontologies to be searched for the specified query.
    Inexperienced users usually did not know or care about which ontology contains the
    desired query term. For example, a user wished to search for "Non-small cell lung
    carcinoma" by its abbreviation, "NSCLC". Querying "NSCLC" in the MedDRA terminology[1]
    returned no results, since the concept is not present in the terminology. Although this
    behavior is correct, it seems wrong to the inexperienced user and may lead to loss of
    trust in the system.

    But even if the term is present in the ontology, the user should not be forced to know
    its exact spelling. For example, querying for "NSCLC" in NCIT also returned no results,
    despite the fact that the actual concept exists in the ontology. The searcher needed to
    know that the preferred term for the NSCLC concept is "Non-small cell lung carcinoma".
    Abbreviations and dissimilar synonyms are common in the biomedical field, so expecting
    the user to know the preferred term for each concept is considered problematic.
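The lookup problem described in the paragraph above can be illustrated with a small sketch. It is not the thesis implementation: the concept table, the function names `build_index` and `lookup`, and the case-insensitive exact matching are all assumptions made for illustration. The idea is simply that every label of a concept, whether preferred term or synonym, is indexed, so that a query such as "NSCLC" resolves to its preferred term.

```python
from collections import defaultdict

# Hypothetical miniature extract of an ontology: preferred term -> synonyms.
# The entries are illustrative only and are not taken from a real release.
CONCEPTS = {
    "Non-small cell lung carcinoma": ["NSCLC"],
    "Gonadotropin Releasing Hormone": ["LHRH"],
}


def build_index(concepts):
    """Map every lower-cased label (preferred or synonym) to its preferred terms."""
    index = defaultdict(set)
    for preferred, synonyms in concepts.items():
        for label in [preferred, *synonyms]:
            index[label.lower()].add(preferred)
    return index


def lookup(query, index):
    """Return the preferred terms having a label equal to the query, ignoring case."""
    return sorted(index.get(query.strip().lower(), ()))


INDEX = build_index(CONCEPTS)
print(lookup("nsclc", INDEX))
```

With such an index, the searcher no longer needs to know which label is the preferred one; a query that matches no label simply yields an empty list, which an interface could then hand over to error correction.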

    In addition to the above, presentation of results was not always straightforward. Terms
    that demonstrate a strong semantic relation to each other were presented as stand-alone
    terms in the search results, misleading users into concluding that the terms were
    independent. It was up to the user to judge the relevance of results to the query. For
    example, the results for "Non-small cell lung carcinoma" in NCIT included, among
    others, the terms "Non-small cell lung carcinoma" and "Stage I non-small cell lung
    carcinoma" equally spaced, in a way that users could not infer the connection between
    them. In fact, the latter term is a specialization of the former. In practice, users
    chose all terms, even though they were looking for the broad term, because they became
    confused and did not want to take the risk of selecting only one.

    This breakdown at the human-computer interface has motivated AstraZeneca to try to
    build tools that take advantage of the ontology structure and, at the same time,
    completely hide it from the user in order to facilitate the search procedure.

    [1] The difference between terminology and ontology is described in Section 2.2.

    1.3 Contribution

    The outcome of this thesis is the development of a user-friendly search application
    that allows users to find information about concepts present in a medical ontology,
    without requiring them to understand the underlying structure of the ontology.
    Information about a concept includes its accession code within the given ontology, the
    term for its preferred name, its definition and all available synonym terms. In order
    to facilitate the search procedure and enhance User Experience (UX), the search
    application includes features such as dynamic term suggestion, spelling correction and
    similar term visualization tools.

    The main challenge lies in the presentation of results; as stated in Section 1.2, users
    are usually not sure about which term(s) to choose, when multiple similarly-spelt terms
    appear. Ranking of terms is performed with the aid of both lexical and semantic
    similarity. The former screens those terms that best match the user query and ranks
    them according to a string relevance metric. These results are processed by the latter,
    so that terms showing a strong semantic connection are grouped together.

    Ideally, the search application should bridge across terms from multiple ontolo-

    gies. Due to the diversity in the format and annotation of different ontologies, this

    is not a straightforward generalization. Most importantly, within the biomedical
    community, the term "ontology" is often used erroneously to describe plain
    terminologies that, in fact, violate basic ontological principles.2 Therefore,
    ontology-specific difficulties are expected to arise if semantic similarity measures
    are to be deployed.

    In summary, the goals of this thesis are the following:

    1. To develop user-friendly search tools that allow users to build search queries
    based on the terms present in a medical ontology, without the need for the users
    to understand the actual structure of the ontology.

    2. To exploit the semantic annotations of the underlying ontology in order to

    enhance the quality and presentation of results.

    3. To intermix results originating from different ontologies.

    2 In MedDRA, the synonym of a term may be a child node of the term itself.


    1.4 Thesis Organization

    The thesis is organized in a total of 9 chapters. Chapter 2 includes an introduction
    to ontologies and a brief description of some notable biomedical ontologies.
    Chapter 3 presents the background needed for understanding the different measures

    of lexical and semantic similarity. Chapter 4 discusses interface design principles

    for user-centered search applications. In chapter 5, the requirements and feature

    specifications for the final search application are addressed. Chapter 6 describes

    the design considerations that were taken into account for the ontological search

    application, while chapter 7 presents the final implementation. Chapter 8
    includes the evaluation of the search application. Finally, conclusions are drawn in

    chapter 9, along with possible future directions.


    Chapter 2

    Ontologies

    The term ontology is an uncountable noun coined in the philosophical field, by

    ancient Greek philosophers Guarino (1998). It involves the study of the nature

    of existence, at a fairly abstract level. In the world of computer science, the word

    ontology refers to the encoding of human knowledge in a format that allows

    for computational use. This chapter includes an introduction to the modern

    definition of ontology, along with a brief description of some of the most notable
    biomedical ontologies.

    2.1 Modern Ontology Definition

    In Artificial Intelligence (AI), an ontology is commonly defined as a "specification
    of a (shared) conceptualization" Gruber et al. (1995). A conceptualization refers
    to an individual's knowledge about a specific domain, acquired through
    experience, observation or introspection Huang et al. (2010). Ontologies are shared

    conceptualizations, meaning that multiple participants, usually domain experts,

    contribute to their construction, maintenance and expansion. Conflicts are cer-

    tain to arise among the different participants, so an important aspect of ontology

    design is to bridge across multiple views of the desired domain into a single con-

    crete representation. On the other hand, a specification is a transformation of


    this shared conceptualization into a formal representation language.

    The outcome of a formal representation of a domain is a collection of entities,

    expressions and axioms. Entities include:

    • concepts or classes, which are sets of individuals (e.g., Country, which contains all countries),

    • individuals, which are specific instances of classes (e.g., Greece as an instance of Country),

    • data types (e.g. string, integer),

    • literals, which are specific values of a given data type (e.g. 1, 2, 3, or string values),

    • properties (e.g. hasDisease, hasAge).

    Expressions refer to descriptions of entities in a formal representation language.

    The standardized family of languages for formal ontology representation is the

    Web Ontology Language (OWL), which builds on the Extensible Markup Lan-

    guage (XML), Resource Description Framework (RDF) and RDF-Schema (RDF-

    S) standards to provide a highly expressive means for representing knowledge

    McGuinness et al. (2004). The underlying format of the resulting OWL docu-

    ment can vary among several types, with the most common being RDF/XML.

    Finally, axioms relate entities/expressions. This connection can be made
    class-to-class (e.g. SubClassOf), individual-to-class (e.g. ClassAssertion) or
    property-to-property (e.g. SubPropertyOf), among others. These relations can be asserted
    explicitly or inferred by a reasoner. Inferences are made based on the logical
    relations between concepts. As an example of a simple inference, a concept's ancestors can
    be inferred automatically, once the parent concept is specified.
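    The ancestor inference mentioned above can be illustrated with a minimal sketch. It is not how an OWL reasoner works internally; it only follows asserted parent links transitively. The dictionary `asserted` and the function name `infer_ancestors` are hypothetical, and the toy hierarchy (in particular the link from "Lung carcinoma" to "Carcinoma") is an assumption made for illustration.

```python
def infer_ancestors(parents, concept):
    """Return all ancestors of `concept` by following asserted parent links."""
    ancestors = set()
    stack = list(parents.get(concept, ()))
    while stack:
        current = stack.pop()
        if current not in ancestors:
            ancestors.add(current)
            # Parents of a parent are ancestors as well (transitivity).
            stack.extend(parents.get(current, ()))
    return ancestors


# Hypothetical asserted SubClassOf axioms: child -> direct parents.
asserted = {
    "Stage I non-small cell lung carcinoma": ["Non-small cell lung carcinoma"],
    "Non-small cell lung carcinoma": ["Lung carcinoma"],
    "Lung carcinoma": ["Carcinoma"],
}

print(sorted(infer_ancestors(asserted, "Stage I non-small cell lung carcinoma")))
```

    Only the direct parents are asserted; the remaining ancestors follow from the transitivity of the subsumption relation.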

    An ontology may be visualized as a graph, in which concepts are nodes and

    relations are edges between nodes. Furthermore, if transitive hierarchical rela-

    tions are isolated (e.g. subsumption, also known as is-a relation or hyponymy),


    the ontology can be viewed as a taxonomy. The geometrical visualization of an

    ontology will be presented in more detail in chapter 3.

    2.2 Ontology vs. Terminology

    A terminology is a collection of term names that are associated with a given

    domain. A term is a mapping of a concrete concept to natural language. This

    term-to-concept mapping is usually not one-to-one, especially in the biomedical

    domain where term variation and term ambiguities arise Ananiadou and
    McNaught (2006). Term variation is a result of the richness of natural language and
    refers to the existence of multiple terms for the description of the same concept.
    For example, the terms "Transmembrane 4 Superfamily Member 1", "TM4SF1" and
    "L6 Antigen" all point to the same protein. Term ambiguity occurs when a term is
    mapped to more than one distinct concept. This is common when new
    abbreviations are introduced Liu et al. (2002). As an example, some of the concepts that
    the acronym "CTX" may map to are "Cardiac Transplantation", "Clinical Trial
    exemption" and "Conotoxin". Their disambiguation is a matter of context.

    A terminology is not constrained to being a simple list of terms. In fact,

    most terminologies feature some kind of structure, where terms that map to the

    same concept are grouped together and semantic relationships between concepts

    are explicitly or implicitly stated. Semantic relationships between terms include

    synonymy and antonymy, while semantic relationships between concepts include

    hyponymy, hypernymy, meronymy and holonymy Jurafsky and Martin (2000).

    Synonymy exists when two terms are interchangeable, while antonymy denotes

    that two terms have opposite meaning. Hyponymy introduces a parent-child, or

    is-a relation between concepts. A concept is a hyponym of another concept,

    if the former derives from the latter and it represents a more granular concept.

    Hyponymy is transitive; if concept a is a child of concept b, and concept b is a

    child of concept c, then a is also a child of c. Hypernymy is the reverse relation

    of hyponymy. Meronymy exists when a concept represents a part of another


    concept. Holonymy is the opposite relation, where a concept has some other
    concept(s) as its part(s).

    The difference between a terminology and an ontology is not always clear, as

    terminologies continue to improve their state of organization in a way that resem-

    bles ontologies. The initial scope and aim of the two, though, is clearly different;

    the purpose of a terminology was initially, as the name implies, an effort to collect

    all terms associated with a specified domain. On the other hand, the target of

    an ontology has, from the start, been to provide a machine-readable specification

    of a shared conceptualization. Despite their many common characteristics, ter-

    minologies are not necessarily ontologies. If treated as ontologies, they may lead

    to inconsistencies or wrong inferencing mechanisms Ananiadou and McNaught
    (2006). An illustrative example is the case of MedDRA, which will be discussed
    in Section 2.3.4.

    2.3 Notable Biomedical Ontologies and Terminologies

    Hundreds of biomedical ontologies and terminologies have been published on-

    line. According to BioPortal1 statistics, the top five most viewed ontologies or

    terminologies are SNOMED Clinical terms, National Drug File, International

    Classification of Diseases, MedDRA and NCI Thesaurus. This section gives a brief
    introduction to these ontologies/terminologies.

    2.3.1 SNOMED CT

    The Systematized Nomenclature of Medicine Clinical Terms (SNOMED CT) is a

    biomedical terminology which covers most areas within medicine such as drugs,

    diseases, operations, medical devices and symptoms. It may be used for the
    coding, retrieval and processing of clinical data. SNOMED CT is written purely in
    a formal logic-based syntax, is available in the so-called Release Format 2 (RF2),
    and is organized into multiple independent hierarchies. It is the result of the
    merger between the UK National Health Service's (NHS) Read codes and the SNOMED
    Reference Terminology (SNOMED-RT), developed by the College of American
    Pathologists. The basic hierarchies, or axes, are Clinical Finding and Procedure.
    The latest version contains more than 400,000 concepts and over 1,000,000
    relationships, rendering SNOMED CT the most complete terminology in the
    medical domain. Only a few definitions are present in the terminology. Each
    concept contains a unique identifier and numerous synonymous terms that account
    for term variation. Also, each concept is part of at least one hierarchy and may
    have multiple is-a relationships with higher-level nodes. SNOMED CT is part
    of the Unified Medical Language System (UMLS), a biomedical ontology and
    terminology integration attempt which comprises hundreds of resources.

    1 BioPortal is a biomedical ontology/terminology repository which provides online ontology
    presentation and manipulation tools (http://bioportal.bioontology.org/).

    2.3.2 NDF-RT

    The National Drug File Reference Terminology (NDF-RT) was introduced by the

    U.S. Department of Veterans Affairs (VA) as a formalized representation for a

    medication terminology, written in description logic syntax VHA (2012). The

    terminology is organized into concept hierarchies, where each concept is a node

    comprising a list of term synonyms and a unique identifier. As expected, top-level

    concepts are more general than lower-level ones. The central hierarchy is named

    DRUG KIND and indicates the types of medications, the preparations used in

    them and clinical VA drug products. Other hierarchies include

    DISEASE KIND,

    INGREDIENT KIND,

    MECHANISM OF ACTION KIND,

    PHARMACOKINETICS KIND,


    PHYSIOLOGIC EFFECT KIND,

    THERAPEUTIC CATEGORY KIND,

    DOSE FORM and

    DRUG INTERACTION KIND.

    Roles exist between different concepts and are specified only with existential
    restrictions (i.e. the OWL equivalent of someValuesFrom). Mappings to other
    terminologies are also available. Currently, NDF-RT contains more than 45,000 concepts in
    hierarchies of maximum depth 12.

    2.3.3 ICD-10

    The International Statistical Classification of Diseases and Related Health Prob-

    lems (ICD) is a terminology which attempts to classify signs, symptoms and

    causes of disease and morbidity WHO (1992). It appeared in the mid-19th
    century and is now maintained by the World Health Organization (WHO). Currently
    it is available in its 10th revision, although the 11th revision is claimed to be at
    the final stage before release. As a taxonomy, it has a relatively small maximum
    depth, equal to 6. Codes assigned to each concept tie it to a specific place in the
    taxonomy, with each code having only a single parent. It is thus not a proper
    application of ontological principles,2 since, in reality, it is not unusual for concepts
    to have more than one subsumer, and this is not modeled. In addition to

    that, there exist categories such as Not otherwise specified or Other, which are

    not needed in an ontology; the open world assumption already covers the fact

    that every ontology is incomplete, so stating it explicitly is redundant and may

    interfere with the evolution of the ontology, as new terms are not classified under

    their closest match.

    2 Nor was it meant to be; its intent is classification.


    Figure 2.1: The structure of the MedDRA terminology comprises a fixed-depth hierarchy.

    2.3.4 MedDRA

    The Medical Dictionary for Regulatory Activities (MedDRA) is a terminology

    that is concerned with biopharmaceutical regulatory processes. It contains terms

    associated with all phases of the drug development cycle. MedDRA is organized

    in a hierarchical structure of fixed depth, as seen in Fig. 2.1. System Organ

    Classes (SOCs) represent the 26 predefined overlapping hierarchies to which terms
    belong. High Level Group Terms (HLGTs) and High Level Terms (HLTs) are

    general term groupings, denoting disorders or complications. Preferred Terms

    (PTs) denote the preferred name for a concept, while Lowest Level Terms (LLTs)

    include terms of maximum specificity. LLTs may be connected with hyponymy,

    meronymy or synonymy relationships to their PTs. This is the main problem in

    trying to view MedDRA as an ontology. In a formal ontology, a concept cannot

    be a child of itself. In MedDRA, this clearly happens, when a PT and its LLTs

    share a synonymy relation.


    2.3.5 NCI Thesaurus

    The National Cancer Institute Thesaurus (NCIT) is a controlled terminology
    for cancer research. The thesaurus has been converted to formal OWL syntax
    and is updated at fixed intervals. The conversion was not an easy one; many
    inconsistencies and modeling dead-ends that were encountered in the conversion
    procedure have been documented Ceusters et al. (2005), along with some clear
    violations of ontological principles Schulz et al. (2010). The NCIT provides almost
    100,000 concepts, with approximately 65% containing a definition.


    4. $d(a, b) + d(b, c) \geq d(a, c)$ (triangle inequality).

    On the other hand, the requirements for a similarity metric were formally intro-

    duced not long ago Chen et al. (2009). The definition states that a similarity

    metric s(a, b) must satisfy the following properties:

    1. $s(a, a) \geq 0$,

    2. $s(a, b) = s(b, a)$,

    3. $s(a, a) \geq s(a, b)$,

    4. $s(a, b) + s(b, c) \leq s(a, c) + s(b, b)$,

    5. $s(a, a) = s(b, b) = s(a, b)$ if and only if $a = b$.

    The counter-intuitive 4th property can be proven using set theory. More
    specifically, if $|a \cap b|$ denotes the cardinality of common characteristics between $a$ and
    $b$, and $\bar{c}$ denotes the complement of $c$, the following equality holds:

    \[ |a \cap b| = |a \cap b \cap c| + |a \cap b \cap \bar{c}|. \tag{3.1} \]

    Then,

    \[ |a \cap b| + |b \cap c| = |a \cap b \cap c| + |a \cap b \cap \bar{c}| + |a \cap b \cap c| + |\bar{a} \cap b \cap c| \leq |a \cap c| + |b|, \tag{3.2} \]

    since $|a \cap b \cap c| \leq |a \cap c|$ and $|a \cap b \cap \bar{c}| + |a \cap b \cap c| + |\bar{a} \cap b \cap c| \leq |b|$.

    Deduction of similarity from distance is a common procedure that requires simple operations.
    Similarity is, intuitively, a decreasing function of distance. Conversion between
    the two can take many forms Chen et al. (2009). In this thesis, all formulas will
    be presented as similarity measures.
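    The 4th property can also be checked numerically. The sketch below assumes, as in the set-theoretic argument, that the similarity of two objects is the number of characteristics they share; the function names and the randomly generated sets are hypothetical and serve only as an illustration.

```python
from itertools import product
import random


def overlap_similarity(a, b):
    """s(a, b) = |a ∩ b|, the number of shared characteristics."""
    return len(a & b)


def satisfies_property_4(a, b, c, s=overlap_similarity):
    """Check s(a, b) + s(b, c) <= s(a, c) + s(b, b)."""
    return s(a, b) + s(b, c) <= s(a, c) + s(b, b)


# Random subsets of a small, hypothetical universe of characteristics.
random.seed(0)
universe = range(8)
samples = [frozenset(x for x in universe if random.random() < 0.5)
           for _ in range(20)]

# Collect every triple that breaks the inequality (none are expected).
violations = [(a, b, c) for a, b, c in product(samples, repeat=3)
              if not satisfies_property_4(a, b, c)]
print(len(violations))
```

    Since the inequality holds for every triple of sets, the list of violations stays empty regardless of the random seed.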

    3.2 Lexical Similarity

    String-based methods that calculate lexical similarity can be divided into character-

    based and word-based. In this section, some of the most popular metrics are

    presented. For a more complete survey of lexical similarity measures, see Navarro
    (2001) and Gomaa and Fahmy (2013).


    3.2.1 Character-based Similarity Measures

    In character-based similarity, strings are viewed as character sequences and
    attempts are made to discover character relevance.

    Longest Common Substring

    The Longest Common Substring algorithm Gusfield (1997) tries to find the
    maximum number of consecutive characters that two strings share. It may be
    implemented using a suffix tree or dynamic programming.
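    A minimal sketch of the dynamic programming variant is given below. The function name is hypothetical; the table entry `curr[j]` holds the length of the longest common suffix of the prefixes compared so far, and only two rows of the table are kept in memory.

```python
def longest_common_substring(a, b):
    """Return the longest run of consecutive characters shared by a and b."""
    best_len, best_end = 0, 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                # Extend the common run that ended at the previous characters.
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len, best_end = curr[j], i
        prev = curr
    return a[best_end - best_len:best_end]


print(longest_common_substring("protein", "protease"))
```

    The running time is proportional to the product of the two string lengths, which is acceptable for short terms; the suffix tree approach is preferable for long inputs.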

    Hamming Similarity

    Hamming similarity is a metric that can be applied to strings of equal length. It

    is a simple metric that measures the number of common characters between two

    strings. Given strings $a$ and $b$, the formula for string similarity can be constructed
    as follows:

    \[ \mathrm{sim}_{ham}(a, b) = \frac{\sum_i \mathbb{1}(a_i = b_i)}{|a|}, \tag{3.3} \]

    where $\mathbb{1}(\cdot)$ is the indicator function and $|\cdot|$ denotes string length, measured in
    characters.
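    Eq. (3.3) translates almost directly into code. In the sketch below the function name is hypothetical, and returning 1.0 for two empty strings is an assumption, since the formula is undefined in that case.

```python
def hamming_similarity(a, b):
    """Fraction of positions at which two equal-length strings agree."""
    if len(a) != len(b):
        raise ValueError("Hamming similarity requires strings of equal length")
    if not a:
        return 1.0  # assumption: two empty strings are treated as identical
    matches = sum(x == y for x, y in zip(a, b))
    return matches / len(a)


print(hamming_similarity("karolin", "kathrin"))
```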

    Levenshtein Similarity

    Levenshtein distance counts the number of character alterations (insertions,
    deletions or substitutions) that need to be made in order to transform one string
    to another Levenshtein (1966). This number is bounded by the length of the longer
    string, which is commonly used as a normalizing measure that restricts the value
    of distance to $[0, 1]$. Mathematically, the normalized Levenshtein distance of
    terms $a$ and $b$ is computed using the following formula:

    \[ d_{lev}(a, b) = \frac{\mathrm{lev}_{a,b}(|a|, |b|)}{\max\{|a|, |b|\}}, \tag{3.4} \]

    where $|\cdot|$ denotes string length in number of characters,

    \[ \mathrm{lev}_{a,b}(i, j) = \begin{cases} \max\{i, j\}, & \text{if } \min\{i, j\} = 0 \\ \min \begin{cases} \mathrm{lev}_{a,b}(i-1, j) + 1 \\ \mathrm{lev}_{a,b}(i, j-1) + 1 \\ \mathrm{lev}_{a,b}(i-1, j-1) + [a_i \neq b_j] \end{cases}, & \text{else} \end{cases} \tag{3.5} \]

    and $\max\{\cdot\}$, $\min\{\cdot\}$ denote the maximum and minimum functions, respectively.
    The bracket $[a_i \neq b_j]$ equals 1 when the two characters differ and 0 otherwise.
    Converting normalized distance to similarity can be done as follows:

    \[ \mathrm{sim}_{lev}(a, b) = 1 - d_{lev}(a, b). \tag{3.6} \]
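    The recurrence for the Levenshtein distance and its conversion to a similarity can be sketched as follows. The function names are hypothetical, the table is filled row by row instead of recursively, and treating two empty strings as fully similar is an assumption.

```python
def levenshtein(a, b):
    """Edit distance with unit cost for insertion, deletion and substitution."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution or match
        prev = curr
    return prev[-1]


def levenshtein_similarity(a, b):
    """One minus the distance normalized by the length of the longer string."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # assumption: two empty strings are identical
    return 1.0 - levenshtein(a, b) / longest


print(levenshtein("kitten", "sitting"), levenshtein_similarity("kitten", "sitting"))
```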

    Jaro Similarity

    Jaro similarity Jaro (1989, 1995) takes into account both the number and sequence
    of common characters present in the two strings. Let us consider strings
    $a = a_1 \ldots a_K$ and $b = b_1 \ldots b_L$. A character $a_i$ is said to be common with $b$ if the
    character exists in $b$ within a window of $\frac{\min\{|a|, |b|\}}{2}$ from $b_i$. Let
    $a' = a'_1 \ldots a'_{K'}$ be those characters in $a$ that are common with $b$, and
    $b' = b'_1 \ldots b'_{L'}$ those characters in $b$ that are common with $a$. A transposition
    for $a'$, $b'$ is a position $i$ in the strings $a'$, $b'$ in which $a'_i \neq b'_i$. The number of
    transpositions for $a'$, $b'$ divided by two is denoted as $T_{a',b'}$. Then, Jaro's formula
    for similarity is given by:

    \[ \mathrm{sim}_{jaro}(a, b) = \frac{1}{3} \left( \frac{|a'|}{|a|} + \frac{|b'|}{|b|} + \frac{|a'| - T_{a',b'}}{|a'|} \right). \tag{3.7} \]

    It should be noted that Jaro similarity violates the symmetry property (property 2 of
    the similarity metric definition in Section 3.1); therefore, it is not a true similarity
    metric according to that definition.
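    A minimal sketch of Jaro similarity is given below. It is an assumption-laden illustration: the function name is hypothetical, characters are matched greedily from left to right, and the matching window is the widely used $\lfloor \max\{|a|, |b|\}/2 \rfloor - 1$ rather than the window stated in the text above.

```python
def jaro_similarity(a, b):
    """Jaro similarity with greedy left-to-right matching."""
    if a == b:
        return 1.0
    if not a or not b:
        return 0.0
    # Assumption: conventional window, not the min-based one from the text.
    window = max(max(len(a), len(b)) // 2 - 1, 0)
    a_used = [False] * len(a)
    b_used = [False] * len(b)
    matches = 0
    for i, ch in enumerate(a):
        lo, hi = max(0, i - window), min(len(b), i + window + 1)
        for j in range(lo, hi):
            if not b_used[j] and b[j] == ch:
                a_used[i] = b_used[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Common characters of each string, in their original order.
    a_common = [ch for ch, used in zip(a, a_used) if used]
    b_common = [ch for ch, used in zip(b, b_used) if used]
    # Half the number of positions at which the common characters disagree.
    t = sum(x != y for x, y in zip(a_common, b_common)) / 2
    return (matches / len(a) + matches / len(b) + (matches - t) / matches) / 3


print(round(jaro_similarity("MARTHA", "MARHTA"), 4))
```

    Because the matching is greedy, swapping the two arguments can in rare cases change which characters are paired, which is one way in which symmetry may be lost.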

    Jaro-Winkler Similarity

    Jaro-Winkler similarity Winkler (1999) is a variation of Jaro similarity which
    promotes strings with long common prefixes. The length of the longest prefix
    common to both strings $a$ and $b$ is denoted as $P$. Then, if $P' = \min(P, 4)$,
    Jaro-Winkler similarity is given by:

    \[ \mathrm{sim}_{j\&w}(a, b) = \mathrm{sim}_{jaro}(a, b) + \frac{P'}{10} \left(1 - \mathrm{sim}_{jaro}(a, b)\right). \tag{3.8} \]
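    The prefix adjustment can be sketched as a small wrapper around an already computed Jaro score. The function names are hypothetical, and the Jaro similarity is passed in as an argument so that the sketch does not depend on a particular Jaro implementation. The common prefix length is capped at four characters, as in Winkler's formulation, which keeps the result within $[0, 1]$.

```python
def common_prefix_length(a, b):
    """Number of leading characters shared by a and b."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def jaro_winkler_similarity(a, b, sim_jaro):
    """Boost a Jaro score according to the (capped) common prefix length."""
    p = min(common_prefix_length(a, b), 4)  # prefix bonus capped at 4
    return sim_jaro + (p / 10) * (1 - sim_jaro)


# The Jaro similarity of "MARTHA" and "MARHTA" is 17/18; the common prefix is "MAR".
print(jaro_winkler_similarity("MARTHA", "MARHTA", 17 / 18))
```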

    N-gram Similarity

    A string can be split into n-grams, i.e. all possible consecutive character sequences
    of length $n$ in the string. As an example, the word "protein" can be split into the
    3-grams "pro", "rot", "ote", "tei" and "ein". When comparing two strings, the number
    of common n-grams is computed and normalized by the maximum number of
    n-grams. More specifically, given strings $a$ and $b$, similarity is given by:

    \[ \mathrm{sim}_{ngram}(a, b) = \frac{N_{com}}{N_{max}}, \tag{3.9} \]

    where $N_{com}$ denotes the number of common n-grams and $N_{max}$ denotes the
    maximum number of n-grams in either of the two strings.
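    A minimal sketch of n-gram similarity follows. The function names are hypothetical, and counting repeated n-grams as a multiset (so that an n-gram occurring twice in both strings counts twice) is an assumption, since the text does not specify how duplicates are handled.

```python
from collections import Counter


def ngrams(s, n):
    """All consecutive character sequences of length n in s."""
    return [s[i:i + n] for i in range(len(s) - n + 1)]


def ngram_similarity(a, b, n=3):
    """Common n-grams divided by the larger n-gram count of the two strings."""
    grams_a, grams_b = Counter(ngrams(a, n)), Counter(ngrams(b, n))
    n_max = max(sum(grams_a.values()), sum(grams_b.values()))
    if n_max == 0:
        return 0.0  # neither string is long enough to contain an n-gram
    n_com = sum((grams_a & grams_b).values())  # multiset intersection
    return n_com / n_max


print(ngrams("protein", 3))
print(ngram_similarity("protein", "proteins"))
```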

    3.2.2 Word-based Similarity Measures

    As the name implies, word-based measures view the string as a collection of words.

    Similarity measures dictate how similar two terms are word-wise, and no weight

    is given on character similarity.

    Dice Similarity

    Dice similarity considers input strings $a$ and $b$ as sets of words $A$ and $B$
    respectively, and calculates similarity as follows:

    \[ \mathrm{sim}_{dice}(a, b) = \frac{2 |A \cap B|}{|A| + |B|}, \tag{3.10} \]

    where $|\cdot|$ denotes set cardinality in number of words.
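    As an illustration, Dice similarity can be sketched as follows. The function names are hypothetical; splitting on whitespace and lowercasing are tokenization assumptions that the text does not prescribe.

```python
def word_set(term):
    """Lowercased, whitespace-separated words of a term."""
    return set(term.lower().split())


def dice_similarity(a, b):
    """Twice the number of shared words over the total number of words."""
    words_a, words_b = word_set(a), word_set(b)
    if not words_a and not words_b:
        return 1.0  # assumption: two empty terms are identical
    return 2 * len(words_a & words_b) / (len(words_a) + len(words_b))


print(dice_similarity("Non-small cell lung carcinoma", "Small cell lung carcinoma"))
```

    With this tokenization, "Non-small" and "Small" are different words, so three of the four words of each term are shared.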


    Jaccard Similarity

    Jaccard similarity counts the number of common words of the compared strings
    and divides it by the number of distinct words in both strings, i.e.

    \[ \mathrm{sim}_{jacc}(a, b) = \frac{|A \cap B|}{|A \cup B|}. \tag{3.11} \]
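    Jaccard similarity can be sketched in the same way. The function name is hypothetical, and the whitespace-based, lowercased tokenization is again an assumption.

```python
def jaccard_similarity(a, b):
    """Shared words divided by the distinct words of both terms."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    union = words_a | words_b
    if not union:
        return 1.0  # assumption: two empty terms are identical
    return len(words_a & words_b) / len(union)


print(jaccard_similarity("Non-small cell lung carcinoma", "Small cell lung carcinoma"))
```

    For the same pair of terms, the Jaccard score is lower than the Dice score, because the shared words are counted once in the denominator instead of twice.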

    Cosine Similarity

    In order to compute cosine similarity, the compared strings should be converted to

    vectors. The dimension of the resulting vectors will be equal to the total number

    of distinct words present in both. Therefore, each element in the vector represents

    one word. The vector values for each string are computed as follows: A vector

    contains unitary values in positions that correspond to words that are contained

    in the respective string. Similarly, a vector contains zero values in all positions

    that correspond to words that are not present in the respective string. Given
    strings $a$ and $b$, the respective vectors $\mathbf{a}$ and $\mathbf{b}$ are computed. Cosine similarity
    is then given by:

    \[ \mathrm{sim}_{cos}(a, b) = \frac{\mathbf{a} \cdot \mathbf{b}}{\|\mathbf{a}\| \, \|\mathbf{b}\|}, \tag{3.12} \]

    where $\|\cdot\|$ denotes the Euclidean norm function.
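    The vector construction and the cosine formula can be sketched as follows. The function names are hypothetical; the vocabulary is sorted only to make the vector positions deterministic, and whitespace tokenization is an assumption.

```python
import math


def binary_vectors(a, b):
    """Binary word-presence vectors over the joint vocabulary of a and b."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    vocabulary = sorted(words_a | words_b)
    vec_a = [1 if w in words_a else 0 for w in vocabulary]
    vec_b = [1 if w in words_b else 0 for w in vocabulary]
    return vec_a, vec_b


def cosine_similarity(a, b):
    """Dot product of the two vectors divided by the product of their norms."""
    vec_a, vec_b = binary_vectors(a, b)
    dot = sum(x * y for x, y in zip(vec_a, vec_b))
    norms = math.sqrt(sum(x * x for x in vec_a)) * math.sqrt(sum(y * y for y in vec_b))
    return dot / norms if norms else 0.0


print(cosine_similarity("Non-small cell lung carcinoma", "Small cell lung carcinoma"))
```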

    Manhattan Similarity

    Taxicab geometry considers that distance between two points in a grid is given

    by the sum of the absolute differences of their respective coordinates. The grid

    resembles a uniform city road map, where diagonal movements are not permitted.

    This is the reason why the distance metric in this space is often called Manhattan

    distance or city block distance. Considering $N$-dimensional string vectors $\mathbf{a}$ and $\mathbf{b}$,
    Manhattan similarity can be computed as:

    \[ \mathrm{sim}_{manh}(a, b) = 1 - \frac{\sum_{i=1}^{N} |a_i - b_i|}{N}, \tag{3.13} \]

    where $N$ is a normalizing constant that represents the dimension of $\mathbf{a}$ and $\mathbf{b}$.
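    A short sketch of Manhattan similarity on binary word vectors is given below. The function names are hypothetical, and the vectors are built over the joint vocabulary of the two terms, as described for cosine similarity; whitespace tokenization is an assumption.

```python
def binary_vectors(a, b):
    """Binary word-presence vectors over the joint vocabulary of a and b."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    vocabulary = sorted(words_a | words_b)
    return ([1 if w in words_a else 0 for w in vocabulary],
            [1 if w in words_b else 0 for w in vocabulary])


def manhattan_similarity(a, b):
    """One minus the city block distance normalized by the vector dimension."""
    vec_a, vec_b = binary_vectors(a, b)
    n = len(vec_a)
    if n == 0:
        return 1.0  # assumption: two empty terms are identical
    return 1 - sum(abs(x - y) for x, y in zip(vec_a, vec_b)) / n


print(manhattan_similarity("Non-small cell lung carcinoma", "Small cell lung carcinoma"))
```

    For binary vectors the sum of absolute differences counts the words that appear in only one of the two terms, so this score coincides with the Jaccard similarity of the two word sets.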


    Euclidean Similarity

    Euclidean similarity also considers strings as vectors, and computes similarity as:

    \[ \mathrm{sim}_{eucl}(a, b) = 1 - \sqrt{\frac{\sum_{i=1}^{N} |a_i - b_i|^2}{N}}. \tag{3.14} \]

    3.3 Ontological Semantic Similarity

    An ontology is a collection of concepts and their inter-relationships. It may be

    visualized as a graph, in which nodes represent concepts and edges represent the

    relations between them. Usually, ontologies are viewed as taxonomies, where is-

    a and part-of relations play the most important role. Viewing the ontology as a

    taxonomy, one can apply semantic similarity metrics that exploit the hierarchical

    structure. Probably the most famous object of semantic similarity tests is the

    computational lexicon WordNet Miller (1995). In WordNet, closely related terms

    are grouped together to form synsets. These synsets, in turn, form semantic rela-

    tions with other synsets. WordNet is commonly referred to as a lexical ontology,

    due to an obvious mapping of lexical hyponymy to ontological subsumption.

    3.3.1 Intra-ontology Semantic Similarity

    Intra-ontology semantic similarity metrics are meant to measure similarity be-

    tween concepts that reside within the same ontology. These metrics can be

    roughly divided into distance-based, information-based and feature-based.

    Distance-based Metrics

    Distance-based metrics take advantage of the ontological topology to compute

    the similarity between concepts. This method requires viewing the ontology as

    a rooted Directed Acyclic Graph (DAG), in which nodes are concepts and edges

    among them are restricted to hierarchical relationships, with the most usual type


    being is-a relationships. At the top, there is a single concept, the root. The graph

    is directed, starting from a low-level concept and directed towards its ancestors

    through transitive relationships. The graph is also acyclic, since a finite path

    from a source node to a destination node cannot return to the source node. In

    other words, a node can never be a child of one of its children.

    A simple look at an ontology from a geometric perspective may reveal im-

    portant information about the similarity of concepts. As depth in the DAG

    increases, concepts become increasingly specific, thus similarity is expected to

    increase. Another important characteristic of the ontology DAG is that the path

    between concepts is not always unique, therefore distance-based similarity will

    depend on which path is chosen. Finally, the density of nodes is a good indicator

    of similarity; as density increases, concepts approach each other and similarity

    increases.

    The accuracy of distance-based methods depends on the level of detail that

    the ontology captures. A poorly structured ontology with many omissions might

    yield misleading similarity results. Fortunately, a lot of effort has been made to

    make biomedical ontologies as complete as possible, therefore network density in

    biomedical ontologies is usually high.

    The most straightforward way to measure the similarity of concept nodes is
    given in Rada et al. (1989). In that work by Rada et al., all edges are assigned
    a unitary weight and the distance between two concepts is equal to the number
    of edges that are present in their shortest path. Let us consider two distinct
    concepts $c_1$ and $c_2$ in the hierarchy. Each path $i$ that connects these two concept
    nodes may be represented as a set which includes all edges $e_k$ present in the path,
    i.e.

    \[ path_i(c_1, c_2) = \{e_1, e_2, \ldots, e_K\}, \tag{3.15} \]

    with cardinality $|path_i(c_1, c_2)| = K$. The distance between concepts $c_1$ and $c_2$ is,
    then, equal to the length of the shortest path that connects them, i.e.,

    \[ d_{rada}(c_1, c_2) = \min_i |path_i(c_1, c_2)|. \tag{3.16} \]
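    The edge-counting distance can be sketched with a breadth-first search. This is an illustration under stated assumptions: the function name and the toy taxonomy are hypothetical, and is-a edges are traversed in both directions, which matches the shortest path through a common ancestor when the taxonomy is a tree.

```python
from collections import deque


def rada_distance(parents, c1, c2):
    """Number of edges on the shortest path between two concepts."""
    # Build an undirected adjacency map from the child -> parents map.
    adjacency = {}
    for child, direct_parents in parents.items():
        for parent in direct_parents:
            adjacency.setdefault(child, set()).add(parent)
            adjacency.setdefault(parent, set()).add(child)
    queue = deque([(c1, 0)])
    seen = {c1}
    while queue:
        node, dist = queue.popleft()
        if node == c2:
            return dist
        for neighbour in adjacency.get(node, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, dist + 1))
    return None  # the two concepts are not connected


# Hypothetical toy taxonomy: child -> direct parents.
taxonomy = {
    "Neoplasm": ["Disease"],
    "Carcinoma": ["Neoplasm"],
    "Sarcoma": ["Neoplasm"],
    "Lung carcinoma": ["Carcinoma"],
}

print(rada_distance(taxonomy, "Lung carcinoma", "Sarcoma"))
```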


    Note that in the literature, there are cases (e.g. Al-Mubaid and Nguyen (2006)) where
    Rada's measure is used with node counting, instead of edge counting. In those
    cases, each path is represented as a set of the nodes that compose it, including
    the end nodes. The minimum distance can be converted into a similarity metric,
    as in Resnik (1995):

    \[ \mathrm{sim}_{rada}(c_1, c_2) = 2D - d_{rada}(c_1, c_2), \tag{3.17} \]

    where $D$ is the maximum depth of the taxonomy. This method fails to capture
    the intuition that concept nodes which reside at the lower part of the hierarchy
    and are separated by distance $d$ are more similar than higher-level nodes with the
    same distance separation $d$. Also, its success highly depends on the uniformity of

    edge distribution within the ontology. For these reasons, other approaches have

    been proposed in order to achieve a more representative score of similarity.

    In Wu and Palmer (1994), the relative depth of the compared concepts in the
    hierarchy is considered. In that work, Wu and Palmer introduce the Least
    Common Subsumer (LCS) of the compared concepts. The LCS is the hierarchically
    deepest common ancestor of the compared concepts. Similarity for concepts $c_1$
    and $c_2$ is then given as:

    \[ \mathrm{sim}_{w\&p}(c_1, c_2) = \frac{2h}{N_1 + N_2 + 2h}, \tag{3.18} \]

    where $N_1$ is the number of nodes in the path between concept $c_1$ and the LCS,
    $N_2$ is the number of nodes between concept $c_2$ and the LCS, and $h$ is the depth
    of the LCS, measured again in number of nodes.
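    Eq. (3.18) can be evaluated with a few lines of code. In the sketch below the function name is hypothetical, and the three counts are supplied by the caller, so that the counting convention for $N_1$, $N_2$ and $h$ remains the one chosen in the text.

```python
def wu_palmer_similarity(n1, n2, h):
    """Wu and Palmer similarity from the path counts n1, n2 and the LCS depth h."""
    if h <= 0:
        raise ValueError("the depth of the LCS must be positive")
    return 2 * h / (n1 + n2 + 2 * h)


# Same separation from the LCS, but at different depths in the hierarchy.
print(wu_palmer_similarity(1, 2, 2))
print(wu_palmer_similarity(1, 2, 6))
```

    For a fixed separation from the LCS, the score grows with the depth of the LCS, which reflects the intuition that deeper concepts are more specific and therefore more similar.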

    In Li et al. (2003), the authors followed various strategies in their attempt
    to calculate similarity as a function of the shortest path between the compared
    concepts, the depth of their LCS and the local density of the ontology. They
    observed that the best performance was obtained when they used the following
    non-linear function:

    \[ \mathrm{sim}_{li}(c_1, c_2) = e^{-\alpha \, d_{rada}(c_1, c_2)} \cdot \frac{e^{\beta h} - e^{-\beta h}}{e^{\beta h} + e^{-\beta h}}, \tag{3.19} \]

    where $\alpha$, $\beta$ are non-negative parameters and $h = d_{rada}(LCS(c_1, c_2), root)$ denotes
    the minimum depth of the LCS. Distances are measured in number of edges.
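    The non-linear function of Li et al. can be sketched as follows. The function name is hypothetical, and the parameter values in the usage example are purely illustrative; they are not taken from the text. The depth-dependent fraction is the hyperbolic tangent of $\beta h$, which is what the sketch uses.

```python
import math


def li_similarity(path_length, lcs_depth, alpha, beta):
    """Similarity decreasing with path length and increasing with LCS depth."""
    if alpha < 0 or beta < 0:
        raise ValueError("alpha and beta must be non-negative")
    # (e^{bh} - e^{-bh}) / (e^{bh} + e^{-bh}) is tanh(b * h).
    return math.exp(-alpha * path_length) * math.tanh(beta * lcs_depth)


# Illustrative parameter values (assumptions, not taken from the text).
print(li_similarity(path_length=2, lcs_depth=3, alpha=0.2, beta=0.6))
```

    The exponential term penalizes long paths, while the hyperbolic tangent saturates towards 1 as the LCS moves deeper into the hierarchy.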


    Al-Mubaid and Nguyen attempt to combine path length and node depth in one
    measure. In Al-Mubaid and Nguyen (2006), they view the DAG as a composition
    of clusters, with each cluster having as root a child of the ontology root. The
    usage of clusters aims to exploit local characteristics of different branches. Given
    concepts $c_1$ and $c_2$, they first compute their so-called common specificity:

    \[ CSpec(c_1, c_2) = D_c - h, \tag{3.20} \]

    where $D_c$ denotes the depth of the specific cluster and $h$ refers to the depth of the
    LCS in the ontology, with both quantities measured in number of nodes. Then
    similarity is computed as:

    \[ \mathrm{sim}_{a\&n}(c_1, c_2) = \log\left((Path - 1)^{\alpha} \times (CSpec)^{\beta} + k\right), \tag{3.21} \]

    where $Path$ is a modified version of Rada's distance measure which is adapted
    according to the largest cluster, and $\alpha$, $\beta$, $k$ are constants, whose default values
    are unitary.

    Information-Based Metrics

    One of the first attempts to focus on nodes in the similarity formula is that

    of Leacock and Chodorow (1998). This method uses

    negative log likelihood in a way that resembles the formula of self-information

    Cover and Thomas (2012), but does not involve a valid probability. Instead,

    a normalized form of the path length between the concepts is used:

    $$\mathrm{sim}_{l\&c}(c_1, c_2) = -\log\left(\frac{N_p}{2D}\right), \tag{3.22}$$

    where Np is the number of nodes in the shortest path between concepts c1 and

    c2, including the end nodes, and D is the maximum depth of the ontology.

    Resnik, in Resnik (1995), continues down this path by replacing the normalized

    path length with a probability measure P(·) to calculate the information

    content (IC) of a concept. He considers all common subsumers CSi of concepts


    c1 and c2 and calculates similarity as:

    $$\mathrm{sim}_{resn}(c_1, c_2) = \max_i \left[-\log P(CS_i)\right], \tag{3.23}$$

    or, equivalently,

    $$\mathrm{sim}_{resn}(c_1, c_2) = -\log P(LCS). \tag{3.24}$$

    Considering that the IC of a concept c is defined as the negative logarithm of its

    probability, i.e. IC(c) = −log(P(c)), equation (3.24) can also be written as:

    $$\mathrm{sim}_{resn}(c_1, c_2) = IC(LCS(c_1, c_2)). \tag{3.25}$$

    Probabilities are estimated with the help of a text corpus, i.e. a collection of

    natural language excerpts, specifically chosen to provide a good representation of

    actual term usage. When dealing with biomedical ontology concepts, collections

    of PubMed¹ abstracts are commonly used as corpora to determine the probability

    of each concept.

    Given a corpus, the occurrence of a term which corresponds to concept c

    essentially implies the occurrence of each and every concept that subsumes c

    within the ontological structure. Conversely, the number of occurrences of a

    concept c depends not only on the number of appearances of c itself in the corpus,

    but also on every occurrence of its descendants in the hierarchy. Thus, the number

    of occurrences of concept c is given by:

    $$\mathrm{occ}(c) = \sum_{n \in \mathrm{subsumed}(c)} \mathrm{count}(n), \tag{3.26}$$

    where subsumed(c) represents c together with all of its descendant concept nodes, and count(·)

    denotes the number of occurrences of the specific concept within the given corpus.

    Converting occurrences to probability can be done using:

    $$P(c) = \frac{\mathrm{occ}(c)}{N}, \tag{3.27}$$

    where N is the total number of occurrences of ontology terms in the corpus.

    This method results in higher probabilities for concepts residing at the top part

    of the hierarchy, with the root having unitary probability. Therefore, concepts

    whose LCS lies lower in the hierarchy are more similar, since their LCS has low

    probability (i.e., high IC).

    ¹ http://www.ncbi.nlm.nih.gov/pubmed
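The corpus-based computation of equations (3.25) to (3.27) can be sketched as follows. The toy taxonomy, the corpus counts in `COUNT` and all function names are hypothetical and serve only as an illustration. The sketch assumes a tree (one parent per concept); in a general DAG the maximum over all common subsumers would be needed, and concepts with zero occurrences would require smoothing before taking the logarithm.

```python
import math

# Hypothetical toy taxonomy: each concept maps to its single parent (root -> None).
PARENT = {
    "disease": None,
    "infection": "disease",
    "cancer": "disease",
    "viral infection": "infection",
    "bacterial infection": "infection",
    "influenza": "viral infection",
}

# Hypothetical number of times each concept's own term appears in a corpus.
COUNT = {
    "disease": 2,
    "infection": 3,
    "cancer": 5,
    "viral infection": 1,
    "bacterial infection": 4,
    "influenza": 5,
}


def path_to_root(concept, parent):
    """Return [concept, its parent, ..., root], i.e. the concept and its subsumers."""
    path = []
    while concept is not None:
        path.append(concept)
        concept = parent[concept]
    return path


def occurrences(count, parent):
    """Equation (3.26): every mention of a concept also counts for its subsumers."""
    occ = {concept: 0 for concept in parent}
    for concept, n in count.items():
        for subsumer in path_to_root(concept, parent):
            occ[subsumer] += n
    return occ


def information_content(concept, occ, total):
    """IC(c) = -log P(c), with P(c) = occ(c) / N as in equation (3.27)."""
    return -math.log(occ[concept] / total)


def resnik(c1, c2, occ, total, parent):
    """Equation (3.25): IC of the lowest common subsumer (tree case)."""
    on_path2 = set(path_to_root(c2, parent))
    lcs = next(node for node in path_to_root(c1, parent) if node in on_path2)
    return information_content(lcs, occ, total)


OCC = occurrences(COUNT, PARENT)
TOTAL = sum(COUNT.values())  # N: all occurrences of ontology terms in the corpus

print(resnik("influenza", "bacterial infection", OCC, TOTAL, PARENT))
print(resnik("influenza", "cancer", OCC, TOTAL, PARENT))
```

In this toy setting the root accumulates every occurrence, so its probability is 1 and its IC is 0; any pair of concepts that only meets at the root therefore receives zero similarity.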

    A possible drawback of this method is that probabilities are tied to the choice

    of corpus. So far, in the biomedical domain, there is no widely accepted corpus

    that covers the domain needs Al-Mubaid and Nguyen (2006). This is due to the

    fact that thousands of new terms and abbreviations appear in the literature every

    year, thus a stable corpus might not function well. Since extensions of the corpus

    would need to be considered at fixed intervals, it might not serve as a useful

    benchmark.

    Alternatively, computation of IC can be performed without the use of a corpus,

    by solely relying on the structure of the ontology DAG. Intrinsic computation of

    IC involves approximating the occurrence probability of a concept as a function

    of multiple variables, such as number of descendant nodes, number of subsumers

    or number of descendant nodes which are leaves in the ontology. In Seco et al.

    (2004), the IC of a concept c is given by:

    $$IC_{seco}(c) = 1 - \frac{\log(\mathrm{descendants}(c) + 1)}{\log(\mathrm{allConcepts})}, \tag{3.28}$$

    where descendants(c) returns the number of nodes that concept c subsumes, and

    allConcepts denotes the number of all the available concepts in the ontology.

    The IC function introduced by Seco et al. has the drawback that it assigns IC

    equal to one for every leaf node in the ontology, and also that concepts containing

    the same number of descendant nodes are again given the same IC. An attempt to

    distinguish the IC between leaf concepts was made in Zhou et al. (2008), by also

    including the depth of the node in the calculation, normalized by the maximum

    depth of the ontology. The proposed IC formula is given by:

    $$IC_{zhou}(c) = k \, IC_{seco}(c) + (1 - k) \frac{\log(\mathrm{depth}(c) + 1)}{\log(\mathrm{maxDepth})}, \tag{3.29}$$

    where depth(c) represents the depth of the concept c in the hierarchy, maxDepth

    is the maximum depth of the ontology, measured in number of nodes, and k is a

    weighting constant.
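The intrinsic IC functions of equations (3.28) and (3.29) can be sketched on a small, hypothetical taxonomy as shown below. The taxonomy, the function names and the value k = 0.5 are assumptions made for illustration. The sketch also assumes that depth(c) is counted in edges from the root, so that depth(c) + 1 is the depth in nodes and the depth term ranges from 0 at the root to 1 at the deepest concept.

```python
import math

# Hypothetical toy taxonomy: each concept maps to its single parent (root -> None).
PARENT = {
    "disease": None,
    "infection": "disease",
    "cancer": "disease",
    "viral infection": "infection",
    "bacterial infection": "infection",
    "influenza": "viral infection",
}


def depth_in_edges(concept, parent):
    """Number of edges between the concept and the root."""
    depth = 0
    while parent[concept] is not None:
        concept = parent[concept]
        depth += 1
    return depth


def descendant_count(concept, parent):
    """Number of concepts that have `concept` as a proper ancestor."""
    def is_below(node):
        node = parent[node]
        while node is not None:
            if node == concept:
                return True
            node = parent[node]
        return False

    return sum(is_below(node) for node in parent)


def ic_seco(concept, parent):
    """Equation (3.28): IC from the number of subsumed nodes only."""
    return 1 - math.log(descendant_count(concept, parent) + 1) / math.log(len(parent))


def ic_zhou(concept, parent, k=0.5):
    """Equation (3.29): Seco's IC blended with a normalized depth term."""
    max_depth_nodes = max(depth_in_edges(c, parent) for c in parent) + 1
    depth_term = math.log(depth_in_edges(concept, parent) + 1) / math.log(max_depth_nodes)
    return k * ic_seco(concept, parent) + (1 - k) * depth_term


for leaf in ("cancer", "influenza"):
    print(leaf, ic_seco(leaf, PARENT), ic_zhou(leaf, PARENT))
```

Both leaves receive the same value from `ic_seco`, whereas `ic_zhou` assigns a larger IC to the deeper leaf, which is the distinction the depth term is meant to introduce.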


    The authors in Sánchez et al. (2011) further improve the modeling of the IC

    function. In that work, the IC function can also distinguish concepts that contain

    the same number of descendants, due to the fact that the number of subsumers

    of a concept is also used. The IC is given as:

    $$IC_{san}(c) = -\log\left(\frac{\frac{\mathrm{leaves}(c)}{\mathrm{ancestors}(c)} + 1}{\mathrm{allLeaves}}\right), \tag{3.30}$$

    where leaves(c) is the number of nodes that are descendants of c and have no

    children, ancestors(c) refers to the number of concepts which subsume c and

    allLeaves denotes the total number of leaf nodes in the ontology. The IC functions

    of equations (3.28), (3.29) and (3.30) can be used in equation (3.25) to

    compute the similarity between two concepts without using a corpus.

    Lin et al. use IC in an alteration of the similarity metric of Wu and Palmer

    (1994). More specifically,

    $$\mathrm{sim}_{l\&p}(c_1, c_2) = \frac{2\, \mathrm{sim}_{resn}(c_1, c_2)}{IC(c_1) + IC(c_2)}. \tag{3.31}$$

    This approach aims to include the individual characteristics of the compared

    nodes that Resnik's approach neglected. Indeed, in Resnik's measure, any two

    pairs of nodes that have the same LCS produce the same similarity.

    Jiang and Conrath follow an approach similar to that of Wu and Palmer (1994),

    but avoid the scaling of similarity Jiang and Conrath (1997). Instead, they use a

    distance metric as follows:

    $$d_{j\&c}(c_1, c_2) = IC(c_1) + IC(c_2) - 2\, \mathrm{sim}_{resn}(c_1, c_2). \tag{3.32}$$

    Various transformations have been applied to convert this distance to similarity.

    Among these, the authors in Seco et al. (2004) consider a linear transformation

    and present the following formula of similarity normalized in the interval [0,1]:

    $$\mathrm{sim}_{j\&c}(c_1, c_2) = 1 - \frac{d_{j\&c}(c_1, c_2)}{2}. \tag{3.33}$$

    Another example can be found in Zhu et al. (2009), in which an exponential

    function is used for the similarity formula, along with a constant that accounts


    for curve steepness:

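    $$\mathrm{sim}_{j\&c}(c_1, c_2) = e^{-d_{j\&c}(c_1, c_2) / \lambda}, \tag{3.34}$$

    where λ is the constant that accounts for curve steepness.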
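The IC-based measures of equations (3.31) to (3.34) differ only in how they combine three quantities: the IC of each concept and the IC of their LCS. The sketch below, with hypothetical function names, takes these three values as inputs. Two assumptions are made: the linear form presumes IC values normalized to [0, 1] (as with an intrinsic IC such as that of equation (3.28)), and the exponential form presumes that the steepness constant divides the distance in the exponent.

```python
import math


def lin_similarity(ic1, ic2, ic_lcs):
    """Equation (3.31). Undefined when both ICs are zero (e.g. root vs. root)."""
    return 2 * ic_lcs / (ic1 + ic2)


def jc_distance(ic1, ic2, ic_lcs):
    """Equation (3.32): Jiang and Conrath distance."""
    return ic1 + ic2 - 2 * ic_lcs


def jc_similarity_linear(ic1, ic2, ic_lcs):
    """Equation (3.33): linear transformation, assuming IC values in [0, 1]."""
    return 1 - jc_distance(ic1, ic2, ic_lcs) / 2


def jc_similarity_exponential(ic1, ic2, ic_lcs, steepness=1.0):
    """Exponential transformation; dividing by `steepness` is an assumption."""
    return math.exp(-jc_distance(ic1, ic2, ic_lcs) / steepness)


# Two leaf concepts (IC = 1 each) whose LCS has IC = 0.5.
print(lin_similarity(1.0, 1.0, 0.5))
print(jc_distance(1.0, 1.0, 0.5))
print(jc_similarity_linear(1.0, 1.0, 0.5))
print(jc_similarity_exponential(1.0, 1.0, 0.5))
```

Unlike Resnik's measure, all four functions change when the compared concepts change, even if the LCS stays the same, because the individual IC values enter the formula.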

    Feature-Based Measures

    Feature-based measures do not necessarily conform to the similarity metric rules

    of Chen et al. (2009), as they allow for similarity asymmetry. In feature-based

    techniques, the two compared concepts are viewed as sets of features, in contrast

    to the geometric view presented in previous sections. To calculate similarity, not

    only the common features of the concepts are taken into account, but also the

    differences between them. That way, common features improve similarity, while

    different features penalize its value Tversky et al. (1977). Given concepts c1 and

    c2, let C1 and C2 denote the sets that contain their features. Then, similarity

    between the two can be given as:

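    $$\mathrm{sim}_{tve}(c_1, c_2) = \frac{|C_1 \cap C_2|}{|C_1 \cap C_2| + \alpha |C_1 \setminus C_2| + (1 - \alpha) |C_2 \setminus C_1|}, \tag{3.35}$$

    where α is a weight which takes values in [0,1]. In Rodríguez et al. (1999), the

    parameter α is computed as follows:

    $$\alpha = \begin{cases} \dfrac{d(c_1, LCS)}{d(c_1, c_2)}, & d(c_1, LCS) \le d(c_2, LCS) \\ 1 - \dfrac{d(c_1, LCS)}{d(c_1, c_2)}, & \text{else} \end{cases} \tag{3.36}$$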

    This asymmetric function stems from Tversky's observation that similarity might

    not be symmetric. In one of Tversky's examples, North Korea was said to be more

    similar to Red China than the reverse.
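The asymmetry of the feature-based measure in equation (3.35) is easy to see with Python sets. The sketch below uses hypothetical feature sets and a hypothetical function name; returning 0.0 when the denominator vanishes is a convention chosen here, not part of the original formula.

```python
def tversky_similarity(features1, features2, weight):
    """Equation (3.35) for two feature sets and a weight in [0, 1]."""
    common = len(features1 & features2)
    only_first = len(features1 - features2)
    only_second = len(features2 - features1)
    denominator = common + weight * only_first + (1 - weight) * only_second
    # Convention for this sketch: no weighted features to compare means similarity 0.
    return common / denominator if denominator else 0.0


# Hypothetical feature sets of two concepts.
A = {"fever", "cough", "fatigue", "headache"}
B = {"fever", "cough", "rash"}

print(tversky_similarity(A, B, weight=0.8))  # penalizes features unique to A more
print(tversky_similarity(B, A, weight=0.8))  # penalizes features unique to B more
print(tversky_similarity(A, B, weight=0.5))  # equal weights give a symmetric score
```

With a weight other than 0.5 the two directions of comparison generally disagree, while a weight of 0.5 treats the distinctive features of both concepts equally and the measure becomes symmetric.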

    3.3.2 Inter-ontology Semantic Similarity

    Inter-ontology semantic similarity measures try to quantify the similarity between

    concepts that belong to different ontologies. Relatively little research has been documented

    in this area, due to the inherent difficulty of comparing heterogeneous

    structures. A common approach is to combine the different ontologies into a


    single ontology through detailed concept mappings Gangemi et al. (1998). It is

    clear that this is very challenging and requires the help of a domain expert, as

    well as plenty of time and effort. Furthermore, not all biomedical terminologies

    are consistent and their lack of homogeneity is a major problem. Simpler approaches

    have been proposed in the literature. A usual first step is to merge the

    different ontologies under a dummy root. This approach is found in Rodríguez

    and Egenhofer (2003), where the authors use a weighted version of Tversky's

    similarity which also takes into account geometrical features of the ontologies.

    A similar route is followed by Petrakis et al. (2006), where the authors substitute

    Tversky's similarity with a form of Jaccard similarity. The drawback of

    these cross-similarity metrics is that they do not consider term overlap in both

    ontologies. Other methods rely on extensions of single ontology similarity metrics.

    Examples of such work can be found in Al-Mubaid and Nguyen (2006) and

    Sánchez et al. (2012).


    Chapter 4

    Search Interfaces

    Search has risen to be one of the most commonly used tools for computer users.

    It can be found everywhere, from stand-alone web-based search engines to embedded

    search forms that appear in desktop applications and websites. To a large

    extent, success of the search procedure depends on the users' ability to formulate

    their information needs, transforming them into queries that are highly likely to

    produce desired results. For this reason, a lot of effort has been spent on improving

    search interfaces and providing tools that will enhance user experience.

    In this chapter, the basic characteristics of successful search interface design are

    presented, with main focus on web-search interfaces.

    4.1 Information Seeking Models

    Information seeking models attempt to recognize and describe the strategies followed

    by humans from the moment they sense a search need until the moment

    they acquire desired results. The search procedure may be viewed as a repetition

    of actions. In Sutcliffe and Ennis (1998), the authors identify the following four

    actions in what is considered the standard model of information seeking:

    1. Problem Identification

    2. Articulation of Need


    3. Query Formulation

    4. Evaluation of Results

    The first step refers to conceptualization of the search need, while the second step

    involves expressing this need in words. The third step requires the user to transform

    the articulated need into a format that will be accepted by the underlying

    search system. Finally, the fourth step refers to the procedure of judging the

    results critically, exploiting any relevant domain knowledge and deciding whether

    the need is satisfied. A search may be characterized as "ok", "failed" or "unsatisfactory".

    An "ok" search ends the cycle successfully. An "unsatisfactory" search

    may lead to reformulation of the query or re-articulation of the need, while a

    completely "failed" search might require re-identification of the problem.

    Sutcliffe and Ennis's model assumes that the need does not change, unless

    results are disappointing. It does not capture the fact that users learn as they

    search. This dynamic aspect of information seeking was captured in an earlier

    work by Bates (1989). In that study, the user's needs are assumed to change

    as the process advances. Furthermore, Bates claims that the success of the search

    procedure depends not only on the final list of results, but also on the selections

    made along the way. This model is referred to as the berry-picking model, to

    denote that it does not result in a single set of results. A simple example of the

    berry-picking model is a user who attempts a broad query such

    as "String similarity algorithms" and refines the query to "Jaro similarity" after

    viewing this result in the initial result list.

    4.2 Query Specification

    Queries are usually specified through rectangular entry forms, as in Fig. 4.1. The

    width of these forms varies, with studies showing that wider forms promote

    formulation of longer queries Franzen and Karlgren (2000); Belkin et al. (2003).

    It has been observed that around 88% of search queries are composed of 1 to 4

    words, with mean length equal to 2.8 words per query Jansen et al. (2007). The

    actual search is executed by pressing the return key or mouse-clicking a specified

    button (e.g. magnifying glass in Bing). In some cases, entry forms decorate their

    background with descriptive text that provides guidance for the user. An example

    is Facebook's search form, as seen in Fig. 4.2. The text disappears once the user

    clicks inside the form. This usually helps to narrow down the search domain.

    Figure 4.1: The Google search engine entry form.

    Figure 4.2: Facebook uses grayed-out descriptive text to help in the formulation of user queries.

    After query submission, processing of the query takes place before any attempt

    to retrieve results. This process may include removal of stopwords (i.e. words

    with high appearance probability such as "the", "a"), normalization of words (e.g.

    plural to singular) and permutation of word order. Boolean logic may also be used

    in the case of multiple words per query. Returning results that contain all query

    words (i.e. Boolean AND operator) seems more intuitive, although this might

    sometimes lead to overly specific queries that return no results. The actual types

    of processing are often hidden from the users, in an attempt to avoid confusion

    and promote transparency Muramatsu and Pratt (2001).
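The processing steps described above (stopword removal, word normalization and Boolean AND matching) can be illustrated with a deliberately naive Python sketch. The stopword list, the plural-stripping rule, the sample documents and all function names are hypothetical; real search engines rely on far more elaborate linguistic processing, and punctuation is not handled here.

```python
# Tiny illustrative stopword list (an assumption, not a standard list).
STOPWORDS = {"the", "a", "an", "of", "in", "and"}


def normalize(word):
    """Lowercase a word and apply a naive plural-to-singular rule."""
    word = word.lower()
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def process_query(text):
    """Drop stopwords and normalize the remaining words."""
    return [normalize(word) for word in text.split() if word.lower() not in STOPWORDS]


def search_and(query, documents):
    """Return ids of documents that contain every processed query term (Boolean AND)."""
    terms = set(process_query(query))
    return [
        doc_id
        for doc_id, text in documents.items()
        if terms <= set(process_query(text))
    ]


# Hypothetical document collection.
DOCUMENTS = {
    1: "Symptoms of viral infections in adults",
    2: "Bacterial infection treatment",
    3: "viral infection",
}

print(process_query("The symptoms of a viral infection"))
print(search_and("viral infections", DOCUMENTS))
print(search_and("viral bacterial infection", DOCUMENTS))
```

The last query shows the drawback mentioned in the text: requiring all terms at once can make a query so specific that no document in the collection satisfies it.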

    Most modern search interfaces are equipped with dynamic search suggestion,

    also known as auto-completion (see Fig. 4.3). As the user starts typing, a list of

    term suggestions appears under the entry form. The suggestions contained in the

    list are usually queries whose prefix matches what has been typed so far, although

    there are cases where interior matches are also included. The user can then mouse-click

    the most relevant query or navigate through the list, using keyboard arrows.

    Studies have shown that approximately one third of all search attempts in the

    Yahoo Search Assist were performed through a dynamically suggested query Anick

    and Kantamneni (2008). The dynamic search suggestion technique attempts

    to minimize unneeded typing from the user side and can alleviate spelling errors

    early. Most importantly, though, it reassures the user that results are available,

    so there is no frustration from empty result pages.

    Figure 4.3: Bing's search interface features a powerful dynamic search suggestion, where prefixes are highlighted with grayed-out font and the remaining text is in bold.
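Prefix-based suggestion, as described above, can be sketched with a sorted list of known queries and a binary search for the first candidate. The query list and the function name are hypothetical, and the sketch returns suggestions in alphabetical order; production systems typically rank suggestions by popularity and may also include interior matches, neither of which is modeled here.

```python
import bisect


def suggest(prefix, sorted_queries, limit=5):
    """Return up to `limit` queries that start with `prefix`.

    `sorted_queries` must be a sorted list of lowercase strings, so that all
    queries sharing a prefix form one contiguous run in the list.
    """
    prefix = prefix.lower()
    start = bisect.bisect_left(sorted_queries, prefix)
    suggestions = []
    for query in sorted_queries[start:]:
        if not query.startswith(prefix) or len(suggestions) == limit:
            break
        suggestions.append(query)
    return suggestions


# Hypothetical log of previously seen queries.
QUERIES = sorted([
    "jaro similarity",
    "jaccard index",
    "jaro-winkler distance",
    "levenshtein distance",
    "string similarity algorithms",
])

print(suggest("jar", QUERIES))
print(suggest("x", QUERIES))
```

Because every suggestion comes from a list of known queries, a suggestion that appears is backed by existing content, which mirrors the reassurance effect described in the text.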

    An important point to consider is that searchers often return to their previously

    accessed information. In the empirical study undertaken by Tauscher

    and Greenberg (1997), it was found that there is a 58%

    chance that the next web page to be visited had been visited before. A more

    recent study Zhang and Zhao (2011) about tabbed browsing, conducted in 2010,

    also finds page revisitation to be around the same levels, at 59.3%. Various tools

    exist to help users find their intended pages, including Uniform Resource Locator

    (URL) history, bookmarking of pages, basic navigation buttons (e.g. "Back" button

    for short term page revisit) and change of URL font color if the page has already

    been visited. Among other methods documented, users may save whole webpages

    to their local disk or keep URLs in text documents, after enriching them with

    comments Jones et al. (2002). Interestingly, a common approach to revisiting

    documents is actually re-searching for them Obendorf et al. (2007). Users who

    adopt this strategy attempt to re-create the conditions of their previous search, by

    trying to formulate the exact same query. Another strategy requires past search

    queries to appear as the user types, along with regular dynamic term suggestion.

    Separation between suggested queries and previously generated ones varies

    among interfaces, as can be seen in Figures 4.4 and 4.5.

    Figure 4.4: The Safari browser's embedded search interface explicitly states which queries are suggestions and which belong to the user's recent search history.

    Figure 4.5: The Firefox browser's embedded search interface contains recent queries on top, and separates them from suggestions using a solid line.

    Figure 4.6: Google's search results page is a typical scrollable vertical list of captions. Metadata facets, that restrain results to a particular type of information, are also present in the interface (e.g. "Images" tab).

    4.3 Presentation of Search Results

    Search applications usually present results as a vertical list of captions, distributed

    along multiple pages (see Fig. 4.6). Each caption is a clickable entity which, as a

    minimum requirement, comprises a title and an excerpt of the target document

    Clarke et al. (2007). Usually, the excerpt includes some or all of the query terms,

    as highlighted text. In most cases, highlighting is performed using bold font or

    colored term background. Many search applications tend to group similar results,

    that originate from the same source, into the same caption. That way, result


    pollution from few sources is avoided and diversity is promoted. The relevance

    of search results is reflected in their order of appearance. Although relevance

    scores were formerly used to grade the fit of the result to the query, they are

    usually not present anymore in modern search applications. The reasons behind

    their omission might be to avoid reverse-engineering of the ranking algorithms and

    to reduce redundancy, since the ranking itself already reflects the importance of

    results Hearst (2009).

    It has been observed that users tend to click on the uppermost captions

    Joachims et al. (2005). In the same study, it was found that the first caption

    received more attention than its successors, even if its relevance was actually

    lower. Furthermore, the majority of users often remain on the first page of

    results. The authors in Jansen et al. (2007) observed that only 30% continued to

    look for relevant results in the second page of the results, and only 15% looked

    even further. Usually, the patience of a user is a function of his/her experience

    in using the system. More experienced users tend to be more patient than users

    who are not accustomed to the search procedure. Inexperienced users, on the

    other hand, often prefer to refine their query or simply accept that what they

    search for cannot be found by the search application Hearst (2009).

    Apart from plain lists of results, further organization of captions may be

    performed, using some form of faceted browsing. Facets attempt to refine search

    results, a