Enhanced Ontological Searching of Medical Scientific Information
Post on 03-Jun-2018
8/12/2019 Enhanced Ontological Searching of Medical Scientific Information
Contents
Abstract
Declaration
Intellectual Property Statement
Acknowledgements
List of Abbreviations
List of Tables
List of Figures
1 Introduction
1.1 Problem Context
1.2 Motivation
1.3 Contribution
1.4 Thesis Organization
2 Ontologies
2.1 Modern Ontology Definition
2.2 Ontology vs. Terminology
2.3 Notable Biomedical Ontologies and Terminologies
2.3.1 SNOMED CT
2.3.2 NDF-RT
2.3.3 ICD-10
2.3.4 MedDRA
2.3.5 NCI Thesaurus
3 Similarity Metrics
3.1 Similarity Metric vs. Distance Metric
3.2 Lexical Similarity
3.2.1 Character-based Similarity Measures
Longest Common Substring
Hamming Similarity
Levenshtein Similarity
Jaro Similarity
Jaro-Winkler Similarity
N-gram Similarity
3.2.2 Word-based Similarity Measures
Dice Similarity
Jaccard Similarity
Cosine Similarity
Manhattan Similarity
Euclidean Similarity
3.3 Ontological Semantic Similarity
3.3.1 Intra-ontology Semantic Similarity
Distance-based Metrics
Information-Based Metrics
Feature-Based Measures
3.3.2 Inter-ontology Semantic Similarity
4 Search Interfaces
4.1 Information Seeking Models
4.2 Query Specification
4.3 Presentation of Search Results
4.4 Query Reformulation
5 Requirements
5.1 Feature Specification
6 Design
6.1 Stage I: Access to Medical Ontologies
6.1.1 Database and Table Creation
6.1.2 Populating the Database Tables
6.2 Stage II: Computation of Semantic Similarity
6.2.1 Term Neighborhoods
6.2.2 Semantic Similarity Calculation
6.3 Stage III: Interface Design Data Presentation
6.4 Summary of Technology Choices
7 Implementation
7.1 Structure
7.2 Search Entry Form
7.3 Handling the Input Query
7.3.1 Typing Speed
7.3.2 Querying the Database
7.3.3 Ranking and Grouping of Search Results
7.3.4 Return-key or Mouse-click Search
7.3.5 Auto-completion Search
7.4 Error Correction
7.5 Term Information Presentation
7.6 Navigation
8 Evaluation
8.1 Testing the Failed Queries
8.2 Comparison to BioPortal Search Services
8.2.1 Auto-completion
8.2.2 Results Ranking
8.2.3 Error Correction
8.2.4 Visualization
8.3 Comments from an AstraZeneca Search Specialist
9 Conclusions and Future Work
9.1 Conclusions
9.2 Future Work
Bibliography
Number of Words in the Document: 25648
University of Manchester
School of Computer Science
Degree Programme of Advanced Computer Science
ABSTRACT OF
MASTER'S THESIS
Author: Christos Karaiskos
Title: Enhanced Ontological Searching of Medical Scientific Information
Supervisors: Prof. Andrew Brass (University of Manchester)
Dr. Jennifer Bradford (AstraZeneca)
Abstract: An enormous amount of biomedical knowledge is encoded in narrative
textual format. In an attempt to discover new or hidden knowledge, extensive
research is being conducted to extract and exploit term relationships from
plain text, with the aid of technology. A common approach for the identification
of biomedical entities in plain text involves usage of ontologies, i.e., knowledge
bases which provide formal machine-understandable representations of domains
of variable specificity. In addition to term extraction, ontologies may be used
as controlled vocabularies or as a means for automatic knowledge acquisition
through their inherent inference capabilities. Visualization of the content of
ontologies is, thus, very important for researchers in the biomedical domain.
Unfortunately, many of these researchers find it difficult to deal with formal logic
and would prefer that ontology search interfaces completely hide any structural
or functional references to ontologies. This thesis proposes a strategy for building
a web-based ontology search application that exploits ontologies behind the
scenes, transparently to the end user, and presents relevant concept information
in such a way that searchers can successfully and quickly find what they
are looking for. The proposed search interface features various search tools for
enhanced ontological searching, including term auto-completion, error correction,
clever results ranking, and similar term visualizations based on semantic similarity
metrics. Evaluation of the developed application shows that its features can
improve enterprise-strength ontology search applications, such as BioPortal.
Keywords: search interface design, ontology hiding, biomedical ontology,
semantic similarity, usability, data integration
Declaration
No portion of the work referred to in the dissertation has been submitted in
support of an application for another degree or qualification of this or any other
university or other institute of learning.
Intellectual Property Statement
i. The author of this dissertation (including any appendices and/or schedules
to this dissertation) owns certain copyright or related rights in it (the "Copyright")
and he has given The University of Manchester certain rights to use
such Copyright, including for administrative purposes.
ii. Copies of this dissertation, either in full or in extracts and whether in hard
or electronic copy, may be made only in accordance with the Copyright,
Designs and Patents Act 1988 (as amended) and regulations issued under
it or, where appropriate, in accordance with licensing agreements which the
University has entered into. This page must form part of any such copies
made.
iii. The ownership of certain Copyright, patents, designs, trade marks and other
intellectual property (the "Intellectual Property") and any reproductions of
copyright works in the dissertation, for example graphs and tables ("Reproductions"),
which may be described in this dissertation, may not be owned by
the author and may be owned by third parties. Such Intellectual Property
and Reproductions cannot and must not be made available for use without
the prior written permission of the owner(s) of the relevant Intellectual
Property and/or Reproductions.
iv. Further information on the conditions under which disclosure, publication
and commercialisation of this dissertation, the Copyright and any Intellectual
Property and/or Reproductions described in it may take place is
available in the University IP Policy (see
http://documents.manchester.ac.uk/display.aspx?DocID=487), in any relevant
Dissertation restriction declarations deposited in the University Library, The
University Library's regulations (see
http://www.manchester.ac.uk/library/aboutus/regulations) and in The
University's Guidance for the Presentation of Dissertations.
Acknowledgements
I am deeply grateful to my supervisors, Prof. Andrew Brass (University of
Manchester) and Dr. Jennifer Bradford (AstraZeneca), for their invaluable guidance and
support throughout the duration of this project. I have greatly benefited from
experiencing the different perspectives of academia and industry, which have both
contributed to shaping the final outcome of this project.
I would like to thank Sebastian Philipp Brandt (University of Manchester),
for his suggestions on making the search application even better. Also, I would
like to express my gratitude to Julie Mitchell (AstraZeneca), for taking the time
to evaluate the application, and Paul Metcalfe (AstraZeneca), for his advice on
improving the performance and security of the application.
Finally, I would like to thank Matina for her patience and love, and my parents,
Ioannis and Stavroula, for always being there.
List of Abbreviations
AI Artificial Intelligence
AJAX Asynchronous JavaScript and XML
API Application Programming Interface
CSS Cascading Style Sheets
DAG Directed Acyclic Graph
HLGT High Level Group Term
HLT High Level Term
HTTP Hypertext Transfer Protocol
IC Information Content
ICD International Classification of Diseases
JDBC Java Database Connectivity
JSON JavaScript Object Notation
LCS Least Common Subsumer
MedDRA Medical Dictionary for Regulatory Activities
NCIT National Cancer Institute Thesaurus
NDF-RT National Drug File Reference Terminology
NHS UK National Health Service
NLP Natural Language Processing
OBO Open Biomedical Ontologies
OWL Web Ontology Language
PHP PHP Hypertext Preprocessor
PT Preferred Term
RDF Resource Description Framework
RDF-S Resource Description Framework Schema
REST Representational State Transfer
RF2 Release Format 2
SNOMED CT Systematized Nomenclature of Medicine Clinical Terms
SNOMED RT Systematized Nomenclature of Medicine Reference Terminology
SOC System Organ Class
UMLS Unified Medical Language System
URI Uniform Resource Identifier
URL Uniform Resource Locator
UX User Experience
VA U.S. Department of Veterans Affairs
WHO World Health Organization
XHTML Extensible HyperText Markup Language
XML Extensible Markup Language
List of Figures
2.1 The structure of the MedDRA terminology comprises a fixed-depth
hierarchy.
4.1 The Google search engine entry form.
4.2 Facebook uses grayed-out descriptive text to help in the formulation
of user queries.
4.3 Bing's search interface features a powerful dynamic search suggestion,
where prefixes are highlighted with grayed-out font and the
remaining text is in bold.
4.4 The Safari browser's embedded search interface explicitly states
which queries are suggestions and which belong to the user's recent
search history.
4.5 The Firefox browser's embedded search interface contains recent
queries on top, and separates them from suggestions using a solid
line.
4.6 Google's search results page is a typical scrollable vertical list of
captions. Metadata facets, which restrict results to a particular
type of information, are also present in the interface (e.g. the "Images"
tab).
4.7 Amazon's search interface provides facets as a left panel to the
results page, helping the user dynamically refine the initial search.
4.8 PubMed's results page includes term expansion in two ways. On
the right of the screen, there is a "Related searches" panel that preserves
the initial query and adds a new related term to it. Also,
right below the entry form there is a "See also" feature which suggests
complete or partial modifications in the initial query.
6.1 A part of the XML response for the get all terms query of Table
6.2.
6.2 The provided methods of the ontoCAT API (Adamusiak et al., 2011).
6.3 Populating the Ontologies database is performed with the help of
the ontoCAT API.
7.1 The organization of the files that comprise the web application.
These files are responsible for the presentation, styling and interactive
behavior of the web application.
7.2 The main window of the search application. The search box is
placed at the top of the screen, with central horizontal alignment.
A submit button labeled "Search" is also provided, to assist users
that prefer mouse-clicking.
7.3 Once the user clicks inside the search box, the grey help message
disappears and a blinking cursor takes its place.
7.4 Terms that would otherwise appear on their own table row are grouped
under a term that lexically matches the query more closely, when their
semantic similarity to that term is higher than a threshold.
7.5 Pressing the Return key or clicking the "Search" button submits
the query to index.php and a table of search results is added to the
interface.
7.6 Part of the JSON response from performQuery.php, for the input
query "rash". Each JSON object represents a term matching the
query, and contains information that can be used for its presentation.
7.7 Pressing any key other than Return submits the query through
AJAX to performQuery.php and an auto-completion pop-up menu
is created from the JSON response.
7.8 Error correction when the input query is "lyng". The closest term is
suggested, as a clickable link.
7.9 When the user places the mouse cursor on a circle, a tooltip immediately
appears, containing the full term name and the semantic
similarity score with the viewed term.
7.10 Presentation page for the NCIT term "Recurrent NSCLC". On the
left side, the basic term information is shown, along with an XML
representation of highly similar terms. On the right side, a visualization
of highly similar terms is provided, using the D3 JavaScript
library.
7.11 Presentation page for the MedDRA term "Rash". The term has
very close relations with terms that are not in the hierarchy. This
is illustrated using blue color.
7.12 The XML representation of a term. It includes basic term information
and highly similar terms.
7.13 Help is provided through tooltips that activate on mouse-over.
8.1 The term "DIHS" is not found, but this is normal, since it is not
part of any of the supported ontologies. Instead, the term "DIOS"
is proposed, in case the user had misspelt the query.
8.2 The term "NMDA Antagonist" is not found, but this is normal,
since it is not part of any of the supported ontologies. No Soundex
match is found, so no error corrections are suggested.
8.3 The term "Hepatotoxicity" is shown in the auto-completion dialogue.
8.4 The term "NSCLC" is shown in the auto-completion dialogue.
8.5 The term "DRESS syndrome" is shown in the auto-completion dialogue.
8.6 The query "LHRH" produces two different 100%-matching results.
Unlike in the previous search application, the user can now see that
"Gonadotropin Releasing Hormone" is a preferred term for "LHRH".
8.7 The results for the query "VEGFR" illustrate a semantic grouping
of 4 similar terms, namely "VEGFR", "Vascular Endothelial Growth
Factor Receptor 1", "Vascular Endothelial Growth Factor Receptor
2" and "Vascular Endothelial Growth Factor Receptor 3". The latter
three are grouped under the parent term.
8.8 The BioPortal interface is a simple text box, similar to this project's
main page.
8.9 BioPortal also offers advanced options to improve the search results.
8.10 Only NCIT, MedDRA and ICD9CM are chosen for searching, out
of the 353 ontologies offered by BioPortal, so that comparisons to
this project's work are achievable.
8.11 Auto-completion pop-up menu of the BioPortal NCIT widget when
the user has typed "nsc". Only preferred terms are shown. The
user might be confused when seeing the term "Becatecarin" in the
results, since it does not contain "nsc".
8.12 Auto-completion pop-up menu of this project's search application
when the user has typed "nsc".
8.13 Searching for "Denatonium Benzoate" through its preferred term
name.
8.14 Searching for "Denatonium Benzoate" through its synonym "THS-839".
8.15 Searching for "Denatonium Benzoate" through its synonym "WIN
16568".
8.16 BioPortal search results rankings for "nsclc". All terms are grouped
according to the ontology they belong to, under the preferred name
of the most lexically-relevant term to the query.
8.17 This project's search results rankings for "nsclc". Terms in the results
are rearranged into groups that show high semantic similarity.
8.18 BioPortal returns no search results for the erroneously spelt term
"nsclca".
8.19 BioPortal returns no search results for the erroneously spelt term
"caancer".
8.20 This project's search application returns a search suggestion of
"nsclc" for the erroneously spelt term "nsclca".
8.21 This project's search application returns a search suggestion of
"cancer" for the erroneously spelt term "caancer".
8.22 BioPortal uses a graph to visualize hierarchical relations. Edges
are annotated with a description of the relationship between the
connected nodes (e.g. subclassOf).
8.23 This project's application focuses on inexperienced users and attempts
to completely hide any formal-logic relationships that might
confuse the user.
8.24 Search results depicting causal associations between smoking and
cancer, as presented by the I2E text mining application.
8.25 Search results for the term "MEK inhibitor" in NCIT, when the
I2E application is used.
Chapter 1
Introduction
Ontologies are knowledge bases which provide formal machine-understandable
representations of domains of variable specificity. Given a domain of discourse,
concepts that belong to the domain are well documented in formal logic, along
with their inter-relations. Ontologies, as representations, cannot perfectly capture
the part of the world that they attempt to describe (Davis et al., 1993). They
are based on the open world assumption, which states that if something is not
represented in a knowledge base, it does not mean that it does not exist in the
real world (Hustadt et al., 1994). As our knowledge about a domain increases,
ontologies are updated and they become more complex. This has become evident
in the biomedical domain, where ontologies have already attained a high degree of
specificity, and has led to their quick adoption for data integration and knowledge
discovery purposes.
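To make the open world assumption concrete, the following minimal Python sketch contrasts a closed-world and an open-world answer to the same lookup. The toy fact set and the function names are hypothetical and serve only as an illustration; a real reasoner could also conclude that a statement is false when its negation is entailed, which this sketch does not model.

```python
from typing import Optional, Set, Tuple

Fact = Tuple[str, str, str]  # (subject, relation, object)

# Hypothetical toy knowledge base; not taken from any real ontology.
KNOWN_FACTS: Set[Fact] = {
    ("Rash", "is_a", "Skin disorder"),
}

def holds_closed_world(fact: Fact) -> bool:
    """Closed world: anything that is not stated is treated as false."""
    return fact in KNOWN_FACTS

def holds_open_world(fact: Fact) -> Optional[bool]:
    """Open world: a missing fact is unknown (None), not false."""
    return True if fact in KNOWN_FACTS else None

missing: Fact = ("Rash", "is_a", "Drug reaction")
closed_answer = holds_closed_world(missing)  # the fact is rejected
open_answer = holds_open_world(missing)      # the fact is left undecided
```

Under the closed-world reading the missing statement is rejected, whereas under the open-world reading it simply remains undecided until the knowledge base is extended.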
1.1 Problem Context
Within biomedicine, ontologies can help researchers communicate, by promoting
consistent use of biomedical terms and concepts. The construction of an ontology
itself involves mediating across multiple views and requires that a number
of domain experts reach a consensus that reflects the diverse viewpoints of the
community. Ontologies are viewed as tools that provide opportunities for new
knowledge acquisition, due to the complex semantic relations that they model.
Inferences in a huge ontology may reveal connections that the human eye would
miss. This is especially important in the pharmaceutical sector, where drug
discovery has slowed down significantly, and in the biological sector,
where attempts to demystify genome patterns associated with disease are still
at an initial stage. Another common use for ontologies in the biomedical domain
is as controlled vocabularies that feed filtered terms into computer applications.
Finally, ontologies may be used to connect terms found in plain text to their
semantic representations. Term extraction with the help of ontologies is a hot
topic in biomedicine, due to the vast amounts of medical information stored in
plain text. Given the importance of ontologies, researchers in the
biomedical field commonly require access to their content.
1.2 Motivation
In the past, AstraZeneca employees were provided with a web-based search form
that enabled them to look for concepts in one or more biomedical ontologies and
select the most suitable from a list of search results. The chosen concepts were, in
turn, conveyed to a text mining application. Understanding the results required
the user to be familiar with the content and structure of the ontology from which
the terms were retrieved. Unfortunately, most users did not feel comfortable
with the idea of ontologies and struggled, or even refused, to use the provided
interfaces, even though no logic-based content was there to confuse them.
In many cases, though, this was not solely the fault of the users. The interface
gave the users freedom to select the ontologies to be searched for the specified
query. Inexperienced users usually did not know or care about which ontology
contains the desired query term. For example, a user wished to search for "Non-small
cell lung carcinoma" by its abbreviation, "NSCLC". Querying "NSCLC" in
the MedDRA terminology¹ returned no results, since the concept is not present
in the terminology. Although this behavior is correct, it seems wrong to the
inexperienced user and may lead to a loss of trust in the system.
But even if the term is present in the ontology, the user should not be forced
to know its exact spelling. For example, querying for "NSCLC" in NCIT
also returned no results, despite the fact that the actual concept exists
in the ontology. The searcher needed to know that the preferred term for the
NSCLC concept is "Non-small cell lung carcinoma". Abbreviations and dissimilar
synonyms are common in the biomedical field, so expecting the user to know the
preferred term for each concept is problematic.
In addition to the above, presentation of results was not always straightforward.
Terms that demonstrate a strong semantic relation to each other were
presented as stand-alone terms in the search results, subconsciously misleading
users into deducing that the terms were independent. It was up to the user to judge
the relevance of results to the query. For example, the results for "Non-small cell
lung carcinoma" in NCIT included, among others, the terms "Non-small cell lung
carcinoma" and "Stage I non-small cell lung carcinoma" equally spaced, in a way
that users could not infer the connections between them. In fact, the latter term
is a specialization of the former. In practice, users chose all terms,
even though they were looking for the broad term, because they became confused
and did not want to take the risk of selecting only one.
This collapse at the human-computer interface has motivated AstraZeneca to
try to build tools that take advantage of the ontology structure and, at the same
time, completely hide it from the user in order to facilitate the search procedure.
1.3 Contribution
The outcome of this thesis is the development of a user-friendly search application
that allows users to find information about concepts present in a medical
ontology, without requiring them to understand the underlying structure of
the ontology. Information about a concept includes its accession code within the
given ontology, the term for its preferred name, its definition and all available
synonym terms. In order to facilitate the search procedure and enhance User
Experience (UX), the search application includes features such as dynamic term
suggestion, spelling correction and similar term visualization tools.
¹ The difference between terminology and ontology is described in Section 2.2.
The main challenge lies in the presentation of results; as stated in Section 1.2,
users are usually not sure about which term(s) to choose when multiple similarly-spelt
terms appear. Ranking of terms is performed with the aid of both lexical
and semantic similarity. The former screens those terms that best match the user
query and ranks them according to a string relevance metric. These results are
processed by the latter, so that terms showing a strong semantic connection are
grouped together.
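The two-step ranking described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the thesis: the lexical score uses Python's difflib.SequenceMatcher as a stand-in for the string relevance metric, and the function names, the threshold value, the toy semantic similarity function and the sample terms are all hypothetical.

```python
from difflib import SequenceMatcher
from typing import Callable, Dict, List

def lexical_score(query: str, term: str) -> float:
    """Stand-in string relevance metric in [0, 1] (an assumption)."""
    return SequenceMatcher(None, query.lower(), term.lower()).ratio()

def rank_and_group(
    query: str,
    terms: List[str],
    semantic_sim: Callable[[str, str], float],
    threshold: float = 0.8,
) -> Dict[str, List[str]]:
    """Rank terms lexically, then attach each term to the first
    higher-ranked group head it is semantically close to."""
    ranked = sorted(terms, key=lambda t: lexical_score(query, t), reverse=True)
    groups: Dict[str, List[str]] = {}
    for term in ranked:
        for head in groups:
            if semantic_sim(head, term) >= threshold:
                groups[head].append(term)
                break
        else:
            groups[term] = []  # no close head found: start a new group
    return groups

# Hypothetical similarity: terms naming the same disease count as close.
def toy_sim(a: str, b: str) -> float:
    same = "lung carcinoma" in a.lower() and "lung carcinoma" in b.lower()
    return 1.0 if same else 0.0

result = rank_and_group(
    "non-small cell lung carcinoma",
    ["Stage I non-small cell lung carcinoma", "Non-small cell lung carcinoma", "Rash"],
    toy_sim,
)
```

In this toy run the exact lexical match becomes the head of a group and the stage-specific term is listed under it, while the unrelated term stays on its own row. The choice of which term heads a group follows purely from the lexical ranking, which is one possible reading of the description above.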
Ideally, the search application should bridge across terms from multiple ontologies.
Due to the diversity in the format and annotation of different ontologies, this
is not a straightforward generalization. Most importantly, within the biomedical
community, the term "ontology" is often used erroneously to describe plain terminologies
that, in fact, violate basic ontological principles.² Therefore, ontology-specific
difficulties are expected to arise, if semantic similarity measures are to be
deployed.
In summary, the goals of this thesis are:
1. To develop user-friendly search tools that allow users to build search queries
based on the terms present in a medical ontology, without need for the users
to understand the actual structure of the ontology.
2. To exploit the semantic annotations of the underlying ontology in order to
enhance the quality and presentation of results.
3. To intermix results originating from different ontologies.
² In MedDRA, the synonym of a term may be a child node of the term itself.
1.4 Thesis Organization
The thesis is organized in a total of 9 chapters. Chapter 2 includes an introduction
to ontologies and a brief description of some notable biomedical ontologies.
Chapter 3 presents the background needed for understanding the different measures
of lexical and semantic similarity. Chapter 4 discusses interface design principles
for user-centered search applications. In Chapter 5, the requirements and feature
specifications for the final search application are addressed. Chapter 6 describes
the design considerations that were taken into account for the ontological search
application, while Chapter 7 presents the final implementation. Chapter 8 includes
the evaluation of the search application. Finally, conclusions are drawn in
Chapter 9, along with possible future directions.
Chapter 2
Ontologies
The term "ontology" is an uncountable noun coined in the philosophical field, by
ancient Greek philosophers (Guarino, 1998). It involves the study of the nature
of existence, at a fairly abstract level. In the world of computer science, the word
"ontology" refers to the encoding of human knowledge in a format that allows
for computational use. This chapter includes an introduction to the modern
definition of ontology, along with a brief description of some of the most notable
biomedical ontologies.
2.1 Modern Ontology Definition
In Artificial Intelligence (AI), an ontology is commonly defined as a specification of a (shared) conceptualization (Gruber et al., 1995). A conceptualization refers to an individual's knowledge about a specific domain, acquired through experience, observation or introspection (Huang et al., 2010). Ontologies are shared conceptualizations, meaning that multiple participants, usually domain experts, contribute to their construction, maintenance and expansion. Conflicts are certain to arise among the different participants, so an important aspect of ontology design is to reconcile multiple views of the desired domain into a single concrete representation. A specification, on the other hand, is a transformation of
this shared conceptualization into a formal representation language.
The outcome of a formal representation of a domain is a collection of entities, expressions and axioms. Entities include:
• concepts or classes, which are sets of individuals (e.g., Country, which contains all countries),
• individuals, which are specific instances of classes (e.g., Greece as an instance of Country),
• data types (e.g., string, integer),
• literals, which are specific values of a given data type (e.g., 1, 2, 3, or string values),
• properties (e.g., hasDisease, hasAge).
Expressions refer to descriptions of entities in a formal representation language. The standardized family of languages for formal ontology representation is the Web Ontology Language (OWL), which builds on the Extensible Markup Language (XML), Resource Description Framework (RDF) and RDF Schema (RDF-S) standards to provide a highly expressive means for representing knowledge (McGuinness et al., 2004). The underlying serialization format of the resulting OWL document can vary, with the most common being RDF/XML.
Finally, axioms relate entities and expressions to one another. This connection can be made class-to-class (e.g., SubClassOf), individual-to-class (e.g., ClassAssertion) or property-to-property (e.g., SubPropertyOf), among others. These relations can be asserted explicitly or inferred by a reasoner. Inferences are made based on the logical relations between concepts. As an example of a simple inference, a concept's ancestors can be inferred automatically once its parent concept is specified.
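The ancestor inference mentioned above can be illustrated with a minimal sketch. It is not a description of how an OWL reasoner works internally; it only follows asserted parent links transitively. The function name `ancestors` and the toy axioms are hypothetical and chosen for illustration.

```python
def ancestors(concept, parents):
    """Return every concept reachable from `concept` by following parent links."""
    found = set()
    stack = list(parents.get(concept, ()))
    while stack:
        node = stack.pop()
        if node not in found:
            found.add(node)
            # Parents of a parent are ancestors too (transitivity of is-a).
            stack.extend(parents.get(node, ()))
    return found


# Hypothetical asserted SubClassOf axioms: child -> direct parents.
asserted = {
    "ViralPneumonia": ["Pneumonia"],
    "Pneumonia": ["LungDisease"],
    "LungDisease": ["Disease"],
}

print(sorted(ancestors("ViralPneumonia", asserted)))
```

Only the direct parent of each concept is asserted; the remaining ancestors are derived, which is the kind of inference the paragraph describes.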
An ontology may be visualized as a graph, in which concepts are nodes and
relations are edges between nodes. Furthermore, if transitive hierarchical rela-
tions are isolated (e.g. subsumption, also known as is-a relation or hyponymy),
the ontology can be viewed as a taxonomy. The geometrical visualization of an
ontology will be presented in more detail in chapter 3.
2.2 Ontology vs. Terminology
A terminology is a collection of term names that are associated with a given domain. A term is a mapping of a concrete concept to natural language. This term-to-concept mapping is usually not one-to-one, especially in the biomedical domain, where term variation and term ambiguity arise (Ananiadou and McNaught, 2006). Term variation is a result of the richness of natural language and refers to the existence of multiple terms for the description of the same concept. For example, the terms "Transmembrane 4 Superfamily Member 1", "TM4SF1" and "L6 Antigen" all point to the same protein. Term ambiguity occurs when a term is mapped to more than one distinct concept. This is common when new abbreviations are introduced (Liu et al., 2002). As an example, some of the concepts that the acronym CTX may map to are "Cardiac Transplantation", "Clinical Trial Exemption" and "Conotoxin". Their disambiguation is a matter of context.
A terminology is not constrained to being a simple list of terms. In fact, most terminologies feature some kind of structure, where terms that map to the same concept are grouped together and semantic relationships between concepts are explicitly or implicitly stated. Semantic relationships between terms include synonymy and antonymy, while semantic relationships between concepts include hyponymy, hypernymy, meronymy and holonymy (Jurafsky and Martin, 2000). Synonymy exists when two terms are interchangeable, while antonymy denotes that two terms have opposite meanings. Hyponymy introduces a parent-child, or is-a, relation between concepts. A concept is a hyponym of another concept if the former derives from the latter and represents a more granular concept. Hyponymy is transitive; if concept a is a hyponym of concept b, and concept b is a hyponym of concept c, then a is also a hyponym of c. Hypernymy is the reverse relation of hyponymy. Meronymy exists when a concept represents a part of another
concept. Holonymy is the opposite relation, where a concept has some other concept(s) as its parts.
The difference between a terminology and an ontology is not always clear, as terminologies continue to improve their state of organization in a way that resembles ontologies. The initial scope and aim of the two, though, are clearly different. The purpose of a terminology was initially, as the name implies, to collect all terms associated with a specified domain. The target of an ontology, on the other hand, has from the start been to provide a machine-readable specification of a shared conceptualization. Despite their many common characteristics, terminologies are not necessarily ontologies. If treated as ontologies, they may lead to inconsistencies or incorrect inferences (Ananiadou and McNaught, 2006). An illustrative example is the case of MedDRA, which will be discussed in Section 2.3.4.
2.3 Notable Biomedical Ontologies and Terminologies
Hundreds of biomedical ontologies and terminologies have been published online. According to BioPortal¹ statistics, the five most viewed ontologies or terminologies are SNOMED Clinical Terms, National Drug File, International Classification of Diseases, MedDRA and NCI Thesaurus. This section gives a brief introduction to each of them.
¹ BioPortal is a biomedical ontology/terminology repository which provides online ontology presentation and manipulation tools (http://bioportal.bioontology.org/).
2.3.1 SNOMED CT
The Systematized Nomenclature of Medicine Clinical Terms (SNOMED CT) is a biomedical terminology which covers most areas of medicine, such as drugs, diseases, operations, medical devices and symptoms. It may be used for the coding, retrieval and processing of clinical data. SNOMED CT is written in a formal, logic-based syntax (the so-called Release Format 2, or RF2) and is organized into multiple independent hierarchies. It is the result of merging the UK National Health Service's (NHS) Read codes with the SNOMED Reference Terminology (SNOMED-RT), developed by the College of American Pathologists. The basic hierarchies, or axes, are Clinical Finding and Procedure. The latest version contains more than 400,000 concepts and over 1,000,000 relationships, rendering SNOMED CT the most complete terminology in the medical domain. Only a few definitions are present in the terminology. Each concept has a unique identifier and numerous synonymous terms that account for term variation. Also, each concept is part of at least one hierarchy and may have multiple is-a relationships with higher-level nodes. SNOMED CT is part of the Unified Medical Language System (UMLS), a biomedical ontology and terminology integration effort which comprises hundreds of resources.
2.3.2 NDF-RT
The National Drug File Reference Terminology (NDF-RT) was introduced by the
U.S. Department of Veterans Affairs (VA) as a formalized representation for a
medication terminology, written in description logic syntax VHA (2012). The
terminology is organized into concept hierarchies, where each concept is a node
comprising a list of term synonyms and a unique identifier. As expected, top-level
concepts are more general than lower-level ones. The central hierarchy is named
DRUG KIND and indicates the types of medications, the preparations used in
them and clinical VA drug products. Other hierarchies include
• DISEASE KIND,
• INGREDIENT KIND,
• MECHANISM OF ACTION KIND,
• PHARMACOKINETICS KIND,
• PHYSIOLOGIC EFFECT KIND,
• THERAPEUTIC CATEGORY KIND,
• DOSE FORM and
• DRUG INTERACTION KIND.
Roles exist between different concepts and are specified only with existential restrictions (i.e., the equivalent of someValuesFrom in OWL). Mappings to other terminologies are also available. Currently, NDF-RT contains more than 45,000 concepts in hierarchies of maximum depth 12.
2.3.3 ICD-10
The International Statistical Classification of Diseases and Related Health Problems (ICD) is a terminology which attempts to classify signs, symptoms and causes of disease and morbidity (WHO, 1992). It appeared in the mid-19th century and is now maintained by the World Health Organization (WHO). It is currently available in its 10th revision, although the 11th revision is reported to be at the final stage before release. As a taxonomy, it has a relatively small maximum depth, equal to 6. The code assigned to each concept ties it to a specific place in the taxonomy, with each code having only a single parent. It is thus not a proper application of ontological principles², since, in reality, it is not unusual for a concept to have more than one subsumer, and this is not modeled. In addition, there exist categories such as "Not otherwise specified" or "Other", which are not needed in an ontology; the open world assumption already covers the fact that every ontology is incomplete, so stating it explicitly is redundant and may interfere with the evolution of the ontology, as new terms are not classified under their closest match.
² Nor was it meant to be; its intent is classification.
Figure 2.1: The structure of the MedDRA terminology comprises a fixed-depth hierarchy.
2.3.4 MedDRA
The Medical Dictionary for Regulatory Activities (MedDRA) is a terminology
that is concerned with biopharmaceutical regulatory processes. It contains terms
associated with all phases of the drug development cycle. MedDRA is organized
in a hierarchical structure of fixed depth, as seen in Fig. 2.1. System Organ
Classes (SOCs) represent the 26 predefined, overlapping hierarchies to which terms
belong. High Level Group Terms (HLGTs) and High Level Terms (HLTs) are
general term groupings, denoting disorders or complications. Preferred Terms
(PTs) denote the preferred name for a concept, while Lowest Level Terms (LLTs)
include terms of maximum specificity. LLTs may be connected with hyponymy,
meronymy or synonymy relationships to their PTs. This is the main problem in
trying to view MedDRA as an ontology. In a formal ontology, a concept cannot
be a child of itself. In MedDRA, this clearly happens, when a PT and its LLTs
share a synonymy relation.
2.3.5 NCI Thesaurus
The National Cancer Institute Thesaurus (NCIT) is a controlled terminology for cancer research. The thesaurus has been converted to formal OWL syntax and is updated at fixed intervals. The conversion was not an easy one; many inconsistencies and modeling dead-ends that were encountered in the conversion procedure have been documented (Ceusters et al., 2005), along with some clear violations of ontological principles (Schulz et al., 2010). The NCIT provides almost 100,000 concepts, approximately 65% of which contain a definition.
CHAPTER 3. SIMILARITY METRICS
4. $d(a, b) + d(b, c) \geq d(a, c)$ (triangle inequality).
On the other hand, the requirements for a similarity metric were formally introduced only recently (Chen et al., 2009). The definition states that a similarity metric $s(a, b)$ must satisfy the following properties:
1. $s(a, a) \geq 0$,
2. $s(a, b) = s(b, a)$,
3. $s(a, a) \geq s(a, b)$,
4. $s(a, b) + s(b, c) \leq s(a, c) + s(b, b)$,
5. $s(a, a) = s(b, b) = s(a, b)$ if and only if $a = b$.
The counter-intuitive 4th property can be proven using set theory. More specifically, if $|a \cap b|$ denotes the cardinality of common characteristics between $a$ and $b$, and $\bar{c}$ denotes the complement of $c$, the following equality holds:
\[ |a \cap b| = |a \cap b \cap c| + |a \cap b \cap \bar{c}|. \tag{3.1} \]
Then,
\[ |a \cap b| + |b \cap c| = |a \cap b \cap c| + |a \cap b \cap \bar{c}| + |a \cap b \cap c| + |\bar{a} \cap b \cap c| \leq |a \cap c| + |b|, \tag{3.2} \]
since $|a \cap b \cap c| \leq |a \cap c|$ and $|a \cap b \cap \bar{c}| + |a \cap b \cap c| + |\bar{a} \cap b \cap c| \leq |b|$.
Deduction of similarity from distance is a common procedure that requires simple operations. Similarity is, intuitively, a decreasing function of distance. Conversion between the two can take many forms (Chen et al., 2009). In this thesis, all formulas will be presented as similarity measures.
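As a complement to the set-theoretic argument for the 4th property, the inequality can be checked exhaustively on a small universe. The sketch below assumes the overlap similarity s(a, b) = |a ∩ b| used in the proof; the names `overlap` and `satisfies_property_4` are hypothetical, and a brute-force check over one universe illustrates the property rather than proving it.

```python
from itertools import combinations, product


def overlap(a, b):
    """Similarity as the number of shared characteristics, |a ∩ b|."""
    return len(a & b)


def satisfies_property_4(sets):
    """Check s(a,b) + s(b,c) <= s(a,c) + s(b,b) for every triple of sets."""
    return all(
        overlap(a, b) + overlap(b, c) <= overlap(a, c) + overlap(b, b)
        for a, b, c in product(sets, repeat=3)
    )


# All 16 subsets of a four-element universe of characteristics.
universe = ["w", "x", "y", "z"]
all_subsets = [
    frozenset(combo)
    for size in range(len(universe) + 1)
    for combo in combinations(universe, size)
]

print(satisfies_property_4(all_subsets))
```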
3.2 Lexical Similarity
String-based methods that calculate lexical similarity can be divided into character-based and word-based. In this section, some of the most popular metrics are presented. For a more complete survey of lexical similarity measures, see Navarro (2001) and Gomaa and Fahmy (2013).
3.2.1 Character-based Similarity Measures
In character-based similarity, strings are viewed as character sequences, and similarity is derived from how closely their characters match.
Longest Common Substring
The Longest Common Substring algorithm (Gusfield, 1997) tries to find the maximum number of consecutive characters that two strings share. It may be implemented using a suffix tree or dynamic programming.
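A minimal sketch of the dynamic programming variant is given below. It keeps, for every pair of positions, the length of the common run ending there, and remembers the longest run seen. The function name is hypothetical, and the suffix tree variant is not shown.

```python
def longest_common_substring(a, b):
    """Return the longest run of consecutive characters shared by a and b."""
    best_len, best_end = 0, 0
    prev = [0] * (len(b) + 1)  # run lengths for the previous character of a
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                # Extend the common run that ended at the previous positions.
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len, best_end = curr[j], i
        prev = curr
    return a[best_end - best_len:best_end]


print(longest_common_substring("hypertension", "hypotension"))
```

The run time is proportional to the product of the two string lengths, which is acceptable for short ontology terms.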
Hamming Similarity
Hamming similarity is a metric that can be applied to strings of equal length. It is a simple metric that counts the positions at which the two strings hold the same character, normalized by the string length. Given strings $a$ and $b$, the formula for string similarity can be constructed as follows:
\[ \mathrm{sim}_{ham}(a, b) = \frac{\sum_i \mathbb{1}(a_i = b_i)}{|a|}, \tag{3.3} \]
where $\mathbb{1}(\cdot)$ is the indicator function and $|\cdot|$ denotes string length, measured in characters.
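Eq. 3.3 translates almost directly into code. The sketch below is illustrative; the function name and the choice to return 1.0 for two empty strings are assumptions.

```python
def hamming_similarity(a, b):
    """Fraction of positions at which two equal-length strings agree (Eq. 3.3)."""
    if len(a) != len(b):
        raise ValueError("Hamming similarity requires strings of equal length")
    if not a:
        return 1.0  # assumption: two empty strings are treated as identical
    matches = sum(x == y for x, y in zip(a, b))
    return matches / len(a)


print(hamming_similarity("karolin", "kathrin"))
```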
Levenshtein Similarity
Levenshtein distance counts the number of character alterations (insertions, deletions or substitutions) that need to be made in order to transform one string into another (Levenshtein, 1966). This number is bounded by the length of the longer string, which is commonly used as a normalizing measure that restrains the value of distance to $[0, 1]$. Mathematically, the normalized Levenshtein distance of terms $a$ and $b$ is computed using the following formula:
\[ d_{lev}(a, b) = \frac{\mathrm{lev}_{a,b}(|a|, |b|)}{\max\{|a|, |b|\}}, \tag{3.4} \]
where $|\cdot|$ denotes string length in number of characters,
\[ \mathrm{lev}_{a,b}(i, j) = \begin{cases} \max\{i, j\}, & \text{if } \min\{i, j\} = 0 \\ \min \begin{cases} \mathrm{lev}_{a,b}(i - 1, j) + 1 \\ \mathrm{lev}_{a,b}(i, j - 1) + 1 \\ \mathrm{lev}_{a,b}(i - 1, j - 1) + [a_i \neq b_j] \end{cases}, & \text{else} \end{cases} \tag{3.5} \]
and $\max\{\cdot\}$, $\min\{\cdot\}$ denote the maximum and minimum functions, respectively. Converting normalized distance to similarity can be done as follows:
\[ \mathrm{sim}_{lev}(a, b) = 1 - d_{lev}(a, b). \tag{3.6} \]
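The recurrence of Eq. 3.5 is usually evaluated bottom-up rather than recursively. The following sketch keeps only two rows of the table; the function names and the convention that two empty strings have similarity 1.0 are assumptions made for illustration.

```python
def levenshtein(a, b):
    """Edit distance lev_{a,b}(|a|, |b|) of Eq. 3.5, computed row by row."""
    prev = list(range(len(b) + 1))  # lev(0, j) = j
    for i, char_a in enumerate(a, start=1):
        curr = [i]  # lev(i, 0) = i
        for j, char_b in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,                       # deletion
                curr[j - 1] + 1,                   # insertion
                prev[j - 1] + (char_a != char_b),  # substitution or match
            ))
        prev = curr
    return prev[-1]


def levenshtein_similarity(a, b):
    """Similarity of Eq. 3.6: one minus the normalized distance of Eq. 3.4."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # assumption: two empty strings are treated as identical
    return 1 - levenshtein(a, b) / longest


print(levenshtein("kitten", "sitting"), levenshtein_similarity("kitten", "sitting"))
```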
Jaro Similarity
Jaro similarity (Jaro, 1989, 1995) takes into account both the number and the sequence of common characters present in the two strings. Let us consider strings $a = a_1 \ldots a_K$ and $b = b_1 \ldots b_L$. A character $a_i$ is said to be common with $b$ if the character exists in $b$ within a window of $\frac{\min\{|a|, |b|\}}{2}$ from $b_i$. Let $a' = a'_1 \ldots a'_{K'}$ be those characters in $a$ that are common with $b$, and $b' = b'_1 \ldots b'_{L'}$ those characters in $b$ that are common with $a$. A transposition for $a', b'$ is a position $i$ in the strings $a', b'$ in which $a'_i \neq b'_i$. The number of transpositions for $a', b'$ divided by two is denoted as $T_{a',b'}$. Then, Jaro's formula for similarity is given by:
\[ \mathrm{sim}_{jaro}(a, b) = \frac{1}{3} \left( \frac{|a'|}{|a|} + \frac{|b'|}{|b|} + \frac{|a'| - T_{a',b'}}{|a'|} \right). \tag{3.7} \]
It should be noted that Jaro similarity violates the symmetry property of Eq. 3.1; therefore, it is not a true similarity metric according to that definition.
Jaro-Winkler Similarity
Jaro-Winkler similarity (Winkler, 1999) is a variation of Jaro similarity which promotes strings with long common prefixes. The length of the longest prefix common to both strings $a$ and $b$ is denoted as $P$. Then, if $P' = \min(P, 4)$, Jaro-Winkler similarity is given by:
\[ \mathrm{sim}_{j\&w}(a, b) = \mathrm{sim}_{jaro}(a, b) + \frac{P'}{10} \left( 1 - \mathrm{sim}_{jaro}(a, b) \right). \tag{3.8} \]
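The two measures can be sketched together, since Jaro-Winkler only adds a prefix bonus to the Jaro score. The code below is one possible reading of the definitions, under stated assumptions: characters are matched greedily from left to right, each character of b is used at most once, the window is the integer part of min(|a|, |b|)/2, and the common prefix is capped at four characters. Published implementations differ in some of these details, so the function names and choices here are illustrative.

```python
def jaro_similarity(a, b):
    """Jaro similarity following Eq. 3.7, with greedy left-to-right matching."""
    if not a or not b:
        return 1.0 if a == b else 0.0
    window = min(len(a), len(b)) // 2
    used = [False] * len(b)  # marks characters of b already matched
    a_common = []
    for i, char in enumerate(a):
        start, stop = max(0, i - window), min(len(b), i + window + 1)
        for j in range(start, stop):
            if not used[j] and b[j] == char:
                used[j] = True
                a_common.append(char)
                break
    if not a_common:
        return 0.0
    b_common = [b[j] for j in range(len(b)) if used[j]]
    # Positions where the two common-character sequences disagree, halved.
    transpositions = sum(x != y for x, y in zip(a_common, b_common)) / 2
    m = len(a_common)
    return (m / len(a) + m / len(b) + (m - transpositions) / m) / 3


def jaro_winkler_similarity(a, b):
    """Jaro score plus a bonus for a common prefix of up to four characters."""
    jaro = jaro_similarity(a, b)
    prefix = 0
    for x, y in zip(a, b):
        if x != y:
            break
        prefix += 1
    prefix = min(prefix, 4)
    return jaro + prefix / 10 * (1 - jaro)


print(jaro_similarity("MARTHA", "MARHTA"), jaro_winkler_similarity("MARTHA", "MARHTA"))
```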
N-gram Similarity
A string can be split into n-grams, i.e., all possible consecutive character sequences of length $n$ in the string. As an example, the word "protein" can be split into the 3-grams "pro", "rot", "ote", "tei" and "ein". When comparing two strings, the number of common n-grams is computed and normalized by the maximum number of n-grams. More specifically, given strings $a$ and $b$, similarity is given by:
\[ \mathrm{sim}_{ngram}(a, b) = \frac{N_{com}}{N_{max}}, \tag{3.9} \]
where $N_{com}$ denotes the number of common n-grams and $N_{max}$ denotes the maximum number of n-grams in either of the two strings.
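Eq. 3.9 can be sketched as follows. The text does not say whether repeated n-grams are counted once or with multiplicity; the sketch assumes multiplicity (a multiset intersection), and the function names are hypothetical.

```python
from collections import Counter


def ngrams(text, n):
    """All consecutive character sequences of length n, in order."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def ngram_similarity(a, b, n=3):
    """Common n-grams divided by the larger n-gram count (Eq. 3.9)."""
    grams_a, grams_b = Counter(ngrams(a, n)), Counter(ngrams(b, n))
    n_max = max(sum(grams_a.values()), sum(grams_b.values()))
    if n_max == 0:
        return 0.0  # assumption: strings shorter than n share no n-grams
    n_com = sum((grams_a & grams_b).values())  # multiset intersection
    return n_com / n_max


print(ngrams("protein", 3))
print(ngram_similarity("protein", "proteins"))
```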
3.2.2 Word-based Similarity Measures
As the name implies, word-based measures view the string as a collection of words. These measures indicate how similar two terms are word-wise, and no weight is given to character similarity.
Dice Similarity
Dice similarity considers input strings $a$ and $b$ as sets of words $A$ and $B$ respectively, and calculates similarity as follows:
\[ \mathrm{sim}_{dice}(a, b) = \frac{2 |A \cap B|}{|A| + |B|}, \tag{3.10} \]
where $|\cdot|$ denotes set cardinality in number of words.
Jaccard Similarity
Jaccard similarity counts the number of common words of the compared strings and divides it by the number of distinct words in both strings, i.e.
\[ \mathrm{sim}_{jacc}(a, b) = \frac{|A \cap B|}{|A \cup B|}. \tag{3.11} \]
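Both set-based measures (Eqs. 3.10 and 3.11) fit in a few lines. The sketch assumes that terms are lower-cased and split on whitespace, and that two empty terms have similarity 1.0; neither choice is prescribed by the text.

```python
def word_set(term):
    """Lower-cased set of whitespace-separated words (tokenization is an assumption)."""
    return set(term.lower().split())


def dice_similarity(a, b):
    """Eq. 3.10: twice the shared words over the sum of the set sizes."""
    words_a, words_b = word_set(a), word_set(b)
    if not words_a and not words_b:
        return 1.0
    return 2 * len(words_a & words_b) / (len(words_a) + len(words_b))


def jaccard_similarity(a, b):
    """Eq. 3.11: shared words over distinct words in both strings."""
    words_a, words_b = word_set(a), word_set(b)
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 1.0


print(dice_similarity("acute renal failure", "chronic renal failure"))
print(jaccard_similarity("acute renal failure", "chronic renal failure"))
```

For the same pair of terms, Dice is never smaller than Jaccard, because the shared words are counted twice in its numerator.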
Cosine Similarity
In order to compute cosine similarity, the compared strings should be converted to vectors. The dimension of the resulting vectors will be equal to the total number of distinct words present in both. Therefore, each element in the vector represents one word. The vector values for each string are computed as follows: a vector contains unitary values in positions that correspond to words that are contained in the respective string, and zero values in all positions that correspond to words that are not present in the respective string. Given strings $a$ and $b$, the respective vectors $\mathbf{a}$ and $\mathbf{b}$ are computed. Cosine similarity is then given by:
\[ \mathrm{sim}_{cos}(a, b) = \frac{\mathbf{a} \cdot \mathbf{b}}{\|\mathbf{a}\| \, \|\mathbf{b}\|}, \tag{3.12} \]
where $\|\cdot\|$ denotes the Euclidean norm function.
Manhattan Similarity
Taxicab geometry considers that the distance between two points in a grid is given by the sum of the absolute differences of their respective coordinates. The grid resembles a uniform city road map, where diagonal movements are not permitted. This is the reason why the distance metric in this space is often called Manhattan distance or city block distance. Considering $N$-dimensional string vectors $\mathbf{a}$ and $\mathbf{b}$, Manhattan similarity can be computed as:
\[ \mathrm{sim}_{manh}(a, b) = 1 - \frac{\sum_{i=1}^{N} |a_i - b_i|}{N}, \tag{3.13} \]
where $N$ is a normalizing constant that represents the dimension of $\mathbf{a}$ and $\mathbf{b}$.
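The binary word vectors shared by the cosine and Manhattan measures can be built explicitly, as in the sketch below. Tokenization by lower-casing and whitespace splitting is an assumption, as are the function names and the values returned for empty input.

```python
import math


def binary_vectors(a, b):
    """One 0/1 entry per distinct word found in either string."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    vocabulary = sorted(words_a | words_b)
    vec_a = [int(word in words_a) for word in vocabulary]
    vec_b = [int(word in words_b) for word in vocabulary]
    return vec_a, vec_b


def cosine_similarity(a, b):
    """Eq. 3.12: dot product divided by the product of Euclidean norms."""
    vec_a, vec_b = binary_vectors(a, b)
    dot = sum(x * y for x, y in zip(vec_a, vec_b))
    norm = math.sqrt(sum(x * x for x in vec_a)) * math.sqrt(sum(y * y for y in vec_b))
    return dot / norm if norm else 0.0


def manhattan_similarity(a, b):
    """Eq. 3.13: one minus the mean absolute coordinate difference."""
    vec_a, vec_b = binary_vectors(a, b)
    if not vec_a:
        return 1.0  # assumption: two empty strings are treated as identical
    return 1 - sum(abs(x - y) for x, y in zip(vec_a, vec_b)) / len(vec_a)


print(cosine_similarity("acute renal failure", "chronic renal failure"))
print(manhattan_similarity("acute renal failure", "chronic renal failure"))
```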
Euclidean Similarity
Euclidean similarity also considers strings as vectors, and computes similarity as:
\[ \mathrm{sim}_{eucl}(a, b) = 1 - \sqrt{\frac{\sum_{i=1}^{N} |a_i - b_i|^2}{N}}. \tag{3.14} \]
3.3 Ontological Semantic Similarity
An ontology is a collection of concepts and their inter-relationships. It may be visualized as a graph, in which nodes represent concepts and edges represent the relations between them. Usually, ontologies are viewed as taxonomies, where is-a and part-of relations play the most important role. Viewing the ontology as a taxonomy, one can apply semantic similarity metrics that exploit the hierarchical structure. Probably the most famous object of semantic similarity tests is the computational lexicon WordNet (Miller, 1995). In WordNet, closely related terms are grouped together to form synsets. These synsets, in turn, form semantic relations with other synsets. WordNet is commonly referred to as a lexical ontology, due to an obvious mapping of lexical hyponymy to ontological subsumption.
3.3.1 Intra-ontology Semantic Similarity
Intra-ontology semantic similarity metrics are meant to measure similarity be-
tween concepts that reside within the same ontology. These metrics can be
roughly divided into distance-based, information-based and feature-based.
Distance-based Metrics
Distance-based metrics take advantage of the ontological topology to compute
the similarity between concepts. This method requires viewing the ontology as
a rooted Directed Acyclic Graph (DAG), in which nodes are concepts and edges
among them are restricted to hierarchical relationships, with the most usual type
being is-a relationships. At the top, there is a single concept, the root. The graph
is directed, starting from a low-level concept and directed towards its ancestors
through transitive relationships. The graph is also acyclic, since a finite path
from a source node to a destination node cannot return to the source node. In
other words, a node can never be a child of one of its children.
A simple look at an ontology from a geometric perspective may reveal im-
portant information about the similarity of concepts. As depth in the DAG
increases, concepts become increasingly specific, thus similarity is expected to
increase. Another important characteristic of the ontology DAG is that the path
between concepts is not always unique, therefore distance-based similarity will
depend on which path is chosen. Finally, the density of nodes is a good indicator
of similarity; as density increases, concepts approach each other and similarity
increases.
The accuracy of distance-based methods depends on the level of detail that
the ontology captures. A poorly structured ontology with many omissions might
yield misleading similarity results. Fortunately, a lot of effort has been made to
make biomedical ontologies as complete as possible, therefore network density in
biomedical ontologies is usually high.
The most straightforward way to measure the similarity of concept nodes is given in Rada et al. (1989). In that work, all edges are assigned a unitary weight and the distance between two concepts is equal to the number of edges that are present in their shortest path. Let us consider two distinct concepts $c_1$ and $c_2$ in the hierarchy. Each path $i$ that connects these two concept nodes may be represented as a set which includes all edges $e_k$ present in the path, i.e.
\[ \mathrm{path}_i(c_1, c_2) = \{e_1, e_2, \ldots, e_K\}, \tag{3.15} \]
with cardinality $|\mathrm{path}_i(c_1, c_2)| = K$. The distance between concepts $c_1$ and $c_2$ is, then, equal to the length of the shortest path that connects them, i.e.,
\[ d_{rada}(c_1, c_2) = \min_i |\mathrm{path}_i(c_1, c_2)|. \tag{3.16} \]
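With unit edge weights, the shortest path of Eq. 3.16 can be found with a breadth-first search. The sketch below treats is-a edges as traversable in both directions; the function name `rada_distance` and the toy taxonomy are hypothetical.

```python
from collections import deque


def rada_distance(c1, c2, parents):
    """Number of edges on the shortest path between two concepts (Eq. 3.16)."""
    # Build an undirected adjacency map from the child -> parents relation.
    neighbours = {}
    for child, direct_parents in parents.items():
        for parent in direct_parents:
            neighbours.setdefault(child, set()).add(parent)
            neighbours.setdefault(parent, set()).add(child)
    queue = deque([(c1, 0)])
    seen = {c1}
    while queue:
        node, distance = queue.popleft()
        if node == c2:
            return distance
        for nxt in neighbours.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, distance + 1))
    return None  # the two concepts are not connected


# Hypothetical toy taxonomy: child -> direct parents.
taxonomy = {
    "LungDisease": ["Disease"],
    "HeartDisease": ["Disease"],
    "Pneumonia": ["LungDisease"],
    "ViralPneumonia": ["Pneumonia"],
    "BacterialPneumonia": ["Pneumonia"],
    "Myocarditis": ["HeartDisease"],
}

print(rada_distance("ViralPneumonia", "BacterialPneumonia", taxonomy))
print(rada_distance("ViralPneumonia", "Myocarditis", taxonomy))
```

Because the search explores all neighbours level by level, it also handles concepts with several parents, where more than one path may exist.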
Note that in the literature there are cases (e.g., Al-Mubaid and Nguyen (2006)) where Rada's measure is used with node counting instead of edge counting. In those cases, each path is represented as a set of the nodes that compose it, including the end nodes. The minimum distance can be converted into a similarity metric, as in Resnik (1995):
\[ \mathrm{sim}_{rada}(c_1, c_2) = 2D - d_{rada}(c_1, c_2), \tag{3.17} \]
where $D$ is the maximum depth of the taxonomy. This method fails to capture the intuition that concept nodes which reside in the lower part of the hierarchy and are separated by distance $d$ are more similar than higher-level nodes with the same distance separation $d$. Also, its success highly depends on the uniformity of edge distribution within the ontology. For these reasons, other approaches have been proposed in order to achieve a more representative score of similarity.
In Wu and Palmer (1994), the relative depth of the compared concepts in the hierarchy is considered. In that work, Wu and Palmer introduce the Least Common Subsumer (LCS) of the compared concepts. The LCS is the hierarchically deepest common ancestor of the compared concepts. Similarity for concepts $c_1$ and $c_2$ is then given as:
\[ \mathrm{sim}_{w\&p}(c_1, c_2) = \frac{2h}{N_1 + N_2 + 2h}, \tag{3.18} \]
where $N_1$ is the number of nodes in the path between concept $c_1$ and the LCS, $N_2$ is the number of nodes between concept $c_2$ and the LCS, and $h$ is the depth of the LCS, measured again in number of nodes.
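Eq. 3.18 can be sketched for a taxonomy in which every concept has a single parent. The counting convention is an assumption: N1 and N2 are taken as the number of steps from each concept up to the LCS (the LCS itself is not counted), and h counts the nodes from the root down to the LCS, inclusive. The function names and the toy taxonomy are hypothetical.

```python
def path_to_root(concept, parent):
    """Concepts from `concept` up to the root, following single parent links."""
    path = [concept]
    while path[-1] in parent:
        path.append(parent[path[-1]])
    return path


def wu_palmer_similarity(c1, c2, parent):
    """Eq. 3.18 under the counting convention stated in the lead-in."""
    path1, path2 = path_to_root(c1, parent), path_to_root(c2, parent)
    on_path2 = set(path2)
    # The first shared concept met while climbing from c1 is the deepest one.
    lcs = next(node for node in path1 if node in on_path2)
    n1, n2 = path1.index(lcs), path2.index(lcs)
    h = len(path_to_root(lcs, parent))
    return 2 * h / (n1 + n2 + 2 * h)


# Hypothetical toy taxonomy: child -> single parent.
single_parent = {
    "LungDisease": "Disease",
    "HeartDisease": "Disease",
    "Pneumonia": "LungDisease",
    "ViralPneumonia": "Pneumonia",
    "BacterialPneumonia": "Pneumonia",
    "Myocarditis": "HeartDisease",
}

print(wu_palmer_similarity("ViralPneumonia", "BacterialPneumonia", single_parent))
print(wu_palmer_similarity("ViralPneumonia", "Myocarditis", single_parent))
```

The two pneumonia subtypes score higher than the cross-branch pair because their LCS lies deeper in the hierarchy.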
In Li et al. (2003), the authors followed various strategies in their attempt to calculate similarity as a function of the shortest path between the compared concepts, the depth of their LCS and the local density of the ontology. They found that the best performance was obtained when they used the following non-linear function:
\[ \mathrm{sim}_{li}(c_1, c_2) = e^{-\alpha \, d_{rada}(c_1, c_2)} \cdot \frac{e^{\beta h} - e^{-\beta h}}{e^{\beta h} + e^{-\beta h}}, \tag{3.19} \]
where $\alpha, \beta$ are non-negative parameters and $h = d_{rada}(\mathrm{LCS}(c_1, c_2), \mathrm{root})$ denotes the minimum depth of the LCS. Distances are measured in number of edges.
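The depth-dependent factor in Eq. 3.19 is the hyperbolic tangent of βh, so the measure can be sketched in one line once the shortest-path distance and the LCS depth are known. The function below takes both as inputs; the default parameter values are placeholders chosen for illustration, not values taken from this text.

```python
import math


def li_similarity(distance, lcs_depth, alpha=0.2, beta=0.6):
    """Eq. 3.19: exponential decay with path length, scaled by tanh of LCS depth.

    `distance` and `lcs_depth` are measured in edges; `alpha` and `beta` are
    non-negative tuning parameters (the defaults here are illustrative only).
    """
    return math.exp(-alpha * distance) * math.tanh(beta * lcs_depth)


# Same path length, deeper LCS: the deeper pair is rated as more similar.
print(li_similarity(distance=2, lcs_depth=1))
print(li_similarity(distance=2, lcs_depth=4))
```

An LCS at the root (depth 0) yields zero similarity regardless of path length, which reflects the intuition that concepts sharing only the root have little in common.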
Al-Mubaid and Nguyen attempt to combine path length and node depth in one measure. In Al-Mubaid and Nguyen (2006), they view the DAG as a composition of clusters, with each cluster having as root a child of the ontology root. The usage of clusters aims to exploit local characteristics of different branches. Given concepts $c_1$ and $c_2$, they first compute their so-called common specificity:
\[ \mathrm{CSpec}(c_1, c_2) = D_c - h, \tag{3.20} \]
where $D_c$ denotes the depth of the specific cluster and $h$ refers to the depth of the LCS in the ontology, with both quantities measured in number of nodes. Then similarity is computed as:
\[ \mathrm{sim}_{a\&n}(c_1, c_2) = \log\left( (\mathrm{Path} - 1)^{\alpha} \times (\mathrm{CSpec})^{\beta} + k \right), \tag{3.21} \]
where Path is a modified version of Rada's distance measure which is adapted according to the largest cluster, and $\alpha, \beta, k$ are constants whose default values are unitary.
Information-Based Metrics
One of the first attempts to focus on nodes in the similarity formula is that of Leacock and Chodorow (1998). This method uses negative log likelihood in a way that resembles the formula of self-information (Cover and Thomas, 2012), but does not really involve a valid probability. Instead, a normalized form of the path length between the concepts is used:
\[ \mathrm{sim}_{l\&c}(c_1, c_2) = -\log(N_p / 2D), \tag{3.22} \]
where $N_p$ is the number of nodes in the shortest path between concepts $c_1$ and $c_2$. This variable also includes the end nodes.
Resnik, in Resnik (1995), continues down this path by replacing the normalized path length with a probability measure $P(\cdot)$ to calculate the information content (IC) of a concept. He considers all common subsumers $CS_i$ of concepts $c_1$ and $c_2$ and calculates similarity as:
\[ \mathrm{sim}_{resn}(c_1, c_2) = \max_i \left[ -\log(P(CS_i)) \right], \tag{3.23} \]
or, equivalently,
\[ \mathrm{sim}_{resn}(c_1, c_2) = -\log(P(\mathrm{LCS})). \tag{3.24} \]
Considering that the IC of a concept $c$ is defined as the negative logarithm of its probability, i.e. $\mathrm{IC}(c) = -\log(P(c))$, equation (3.24) can also be written as:
\[ \mathrm{sim}_{resn}(c_1, c_2) = \mathrm{IC}(\mathrm{LCS}(c_1, c_2)). \tag{3.25} \]
Probabilities are estimated with the help of a text corpus, i.e. a collection of natural language excerpts, specifically chosen to provide a good representation of actual term usage. When dealing with biomedical ontology concepts, collections of PubMed¹ abstracts are commonly used as corpora to determine the probability of each concept.
Given a corpus, the occurrence of a term which corresponds to concept $c$ essentially implies the occurrence of each and every concept that subsumes $c$ within the ontological structure. Conversely, the number of occurrences of a concept $c$ depends not only on the number of appearances of $c$ itself in the corpus, but also on every occurrence of its descendants in the hierarchy. Thus, the number of occurrences of concept $c$ is given by:
\[ \mathrm{occ}(c) = \sum_{n \in \mathrm{subsumed}(c)} \mathrm{count}(n), \tag{3.26} \]
where $\mathrm{subsumed}(c)$ represents $c$ and all of its descendant concept nodes, and $\mathrm{count}(\cdot)$ denotes the number of occurrences of the specific concept within the given corpus. Converting occurrences to probability can be done using:
\[ P(c) = \frac{\mathrm{occ}(c)}{N}, \tag{3.27} \]
where $N$ is the total number of occurrences of ontology terms in the corpus. This method results in higher probabilities for concepts residing in the top part of the hierarchy, with the root having unitary probability. Therefore, concepts whose LCS lies lower in the hierarchy are more similar, since their LCS has low probability (i.e., high IC).
¹ http://www.ncbi.nlm.nih.gov/pubmed
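Corpus-based information content (Eqs. 3.26 and 3.27) can be sketched on a toy taxonomy. The sketch assumes a tree, so that no descendant is reached twice, and takes N as the occurrence total at the root. The function names and the corpus counts are invented for illustration and do not come from any real corpus.

```python
import math


def occurrences(concept, children, count):
    """Eq. 3.26: own corpus count plus the counts of all descendants."""
    return count.get(concept, 0) + sum(
        occurrences(child, children, count) for child in children.get(concept, ())
    )


def information_content(concept, root, children, count):
    """IC(c) = -log(P(c)), with P(c) = occ(c) / occ(root) as in Eq. 3.27."""
    probability = occurrences(concept, children, count) / occurrences(root, children, count)
    return -math.log(probability)


# Hypothetical taxonomy (parent -> children) and invented corpus counts.
children = {
    "Disease": ["LungDisease"],
    "LungDisease": ["Pneumonia"],
    "Pneumonia": ["ViralPneumonia", "BacterialPneumonia"],
}
count = {
    "Disease": 10,
    "LungDisease": 20,
    "Pneumonia": 30,
    "ViralPneumonia": 25,
    "BacterialPneumonia": 15,
}

# Resnik similarity of the two pneumonia subtypes is the IC of their LCS.
print(information_content("Pneumonia", "Disease", children, count))
```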
A possible drawback of this method is that probabilities are tied to the choice of corpus. So far, in the biomedical domain, there is no widely accepted corpus that covers the domain needs (Al-Mubaid and Nguyen, 2006). This is due to the fact that thousands of new terms and abbreviations appear in the literature every year, thus a stable corpus might not function well. Since extensions of the corpus would need to be considered at fixed intervals, it might not serve as a useful benchmark.
Alternatively, computation of IC can be performed without the use of a corpus, by solely relying on the structure of the ontology DAG. Intrinsic computation of IC involves approximating the occurrence probability of a concept as a function of multiple variables, such as the number of descendant nodes, the number of subsumers or the number of descendant nodes which are leaves in the ontology. In Seco et al. (2004), the IC of a concept $c$ is given by:
\[ \mathrm{IC}_{seco}(c) = 1 - \frac{\log(\mathrm{descendants}(c) + 1)}{\log(\mathrm{allConcepts})}, \tag{3.28} \]
where $\mathrm{descendants}(c)$ returns the number of nodes that concept $c$ subsumes, and $\mathrm{allConcepts}$ denotes the number of all the available concepts in the ontology.
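The intrinsic IC of Eq. 3.28 needs only the structure of the hierarchy. The following sketch counts descendants with an iterative traversal; the function names and the seven-concept toy taxonomy are hypothetical.

```python
import math


def descendants(concept, children):
    """Set of all concepts subsumed by `concept` (excluding the concept itself)."""
    found = set()
    stack = list(children.get(concept, ()))
    while stack:
        node = stack.pop()
        if node not in found:
            found.add(node)
            stack.extend(children.get(node, ()))
    return found


def ic_seco(concept, children, all_concepts):
    """Eq. 3.28: leaves score 1, the root of the ontology scores 0."""
    return 1 - math.log(len(descendants(concept, children)) + 1) / math.log(all_concepts)


# Hypothetical taxonomy with seven concepts: parent -> children.
children = {
    "Disease": ["LungDisease", "HeartDisease"],
    "LungDisease": ["Pneumonia"],
    "HeartDisease": ["Myocarditis"],
    "Pneumonia": ["ViralPneumonia", "BacterialPneumonia"],
}
ALL_CONCEPTS = 7

for name in ("Disease", "Pneumonia", "ViralPneumonia"):
    print(name, ic_seco(name, children, ALL_CONCEPTS))
```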
The IC function introduced by Seco et al. has the drawback that it assigns an IC equal to one to every leaf node in the ontology, and also that concepts containing the same number of descendant nodes are given the same IC. An attempt to distinguish the IC between leaf concepts was made in Zhou et al. (2008), by also including the depth of the node in the calculation, normalized by the maximum depth of the ontology. The proposed IC formula is given by:
\[ \mathrm{IC}_{zhou}(c) = k \, \mathrm{IC}_{seco}(c) + (1 - k) \frac{\log(\mathrm{depth}(c) + 1)}{\log(\mathrm{maxDepth})}, \tag{3.29} \]
where $\mathrm{depth}(c)$ represents the depth of the concept $c$ in the hierarchy, $\mathrm{maxDepth}$ is the maximum depth of the ontology, measured in number of nodes, and $k$ is a weighting constant.
The authors inSanchez et al.(2011) further improve the modeling of the IC
function. In that work, the IC function can also distinguish concepts that contain
the same number of descendants, due to the fact that the number of subsumers
of a concept is also used. The IC is given as:
\[ IC_{san}(c) = -\log\left( \frac{\frac{\mathrm{leaves}(c)}{\mathrm{ancestors}(c)} + 1}{\mathrm{allLeaves}} \right), \tag{3.30} \]
where leaves(c) is the number of nodes that are descendants of c and have no
children, ancestors(c) refers to the number of concepts which subsume c and
allLeaves denotes the total number of leaf nodes in the ontology. The IC functions
of equations (3.28), (3.29) and (3.30) can be used in equation (3.25) to
compute the similarity between two concepts without using a corpus.
Lin et al. use IC in an alteration of the similarity metric of Wu and Palmer
(1994). More specifically,
\[ sim_{l\&p}(c_1, c_2) = \frac{2 \cdot sim_{resn}(c_1, c_2)}{IC(c_1) + IC(c_2)}. \tag{3.31} \]
This approach aims to include the individual characteristics of the compared
nodes that Resnik's approach neglected. Indeed, in Resnik's measure, any two
pairs of nodes that have the same LCS produce the same similarity.
Jiang and Conrath follow an approach similar to that of Wu and Palmer (1994),
but avoid scaling the similarity Jiang and Conrath (1997). Instead, they use a
distance metric as follows:
\[ d_{j\&c}(c_1, c_2) = IC(c_1) + IC(c_2) - 2 \cdot sim_{resn}(c_1, c_2). \tag{3.32} \]
Various transformations have been applied to convert this distance to similarity.
Among these, the authors in Seco et al. (2004) consider a linear transformation
and present the following formula of similarity normalized in the interval [0,1]:
\[ sim_{j\&c}(c_1, c_2) = 1 - \frac{d_{j\&c}(c_1, c_2)}{2}. \tag{3.33} \]
Another example can be found in Zhu et al. (2009), in which an exponential
function is used for the similarity formula, along with a constant that accounts
for curve steepness:
simj&c(c1, c2) =edj&c(c1,c2)
. (3.34)
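The IC-based measures of equations (3.31) to (3.33) can be sketched as follows. The function names are hypothetical, and the sketch assumes that the Resnik similarity of two concepts is the IC of their LCS and that all IC values are already normalized in [0,1], for example by one of the intrinsic IC functions above. The exponential variant of equation (3.34) is not covered.

```python
def sim_resnik(ic_lcs):
    """Resnik similarity, taken here as the IC of the least common subsumer."""
    return ic_lcs


def sim_lin(ic_c1, ic_c2, ic_lcs):
    """Equation (3.31): Resnik similarity scaled by the ICs of both concepts."""
    total = ic_c1 + ic_c2
    if total == 0.0:
        # Assumption: two concepts with zero IC (e.g. the root) count as identical.
        return 1.0
    return 2.0 * sim_resnik(ic_lcs) / total


def dist_jc(ic_c1, ic_c2, ic_lcs):
    """Equation (3.32): Jiang and Conrath distance."""
    return ic_c1 + ic_c2 - 2.0 * sim_resnik(ic_lcs)


def sim_jc_linear(ic_c1, ic_c2, ic_lcs):
    """Equation (3.33): linear transformation of the distance into [0, 1]."""
    return 1.0 - dist_jc(ic_c1, ic_c2, ic_lcs) / 2.0


if __name__ == "__main__":
    # Two hypothetical concept pairs that share the same LCS (IC = 0.5).
    for ic_a, ic_b in [(1.0, 0.75), (0.6, 0.6)]:
        print(sim_resnik(0.5), sim_lin(ic_a, ic_b, 0.5), sim_jc_linear(ic_a, ic_b, 0.5))
```

Both pairs receive the same Resnik similarity because they share the LCS, whereas the Lin and Jiang-Conrath values differ with the ICs of the compared concepts.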
Feature-Based Measures
Feature-based measures do not necessarily conform to the similarity metric rules
of Chen et al. (2009), as they allow for similarity asymmetry. In feature-based
techniques, the two compared concepts are viewed as sets of features, in contrast
to the geometric view presented in previous sections. To calculate similarity, not
only the common features of the concepts are taken into account, but also the
differences between them. That way, common features improve similarity, while
different features penalize its value Tversky et al. (1977). Given concepts c1 and
c2, let C1 and C2 denote the sets that contain their features. Then, similarity
between the two can be given as:
\[ sim_{tve}(c_1, c_2) = \frac{|C_1 \cap C_2|}{|C_1 \cap C_2| + \alpha |C_1 \setminus C_2| + (1 - \alpha) |C_2 \setminus C_1|}, \tag{3.35} \]
where α is a weight which takes values in [0,1]. In Rodríguez et al. (1999), the
parameter α is computed as follows:
\[ \alpha = \begin{cases} \frac{d(c_1, LCS)}{d(c_1, c_2)}, & d(c_1, LCS) \leq d(c_2, LCS) \\ 1 - \frac{d(c_1, LCS)}{d(c_1, c_2)}, & \text{else} \end{cases} \tag{3.36} \]
This asymmetric function stems from Tversky's observation that similarity might
not be symmetric. In one of Tversky's examples, North Korea was said to be more
similar to Red China than the reverse.
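A minimal sketch of the feature-based measure of equations (3.35) and (3.36) is given below. The function names and feature sets are hypothetical. The sketch assumes that the condition in equation (3.36) is a "less than or equal" comparison, that a zero distance between the concepts maps to a weight of 0.5, and that two empty feature sets yield zero similarity.

```python
def tversky_alpha(d_c1_lcs, d_c2_lcs, d_c1_c2):
    """Equation (3.36): weight derived from path distances to the LCS."""
    if d_c1_c2 == 0:
        # Assumption: identical concepts get the symmetric weight.
        return 0.5
    ratio = d_c1_lcs / d_c1_c2
    return ratio if d_c1_lcs <= d_c2_lcs else 1.0 - ratio


def sim_tversky(features_1, features_2, alpha):
    """Equation (3.35): common features reward, distinct features penalize."""
    common = len(features_1 & features_2)
    only_1 = len(features_1 - features_2)
    only_2 = len(features_2 - features_1)
    denominator = common + alpha * only_1 + (1.0 - alpha) * only_2
    # Assumption: similarity is zero when there is nothing to compare.
    return common / denominator if denominator else 0.0


if __name__ == "__main__":
    c1 = {"a", "b", "c"}
    c2 = {"b", "c", "d", "e"}
    print(sim_tversky(c1, c2, 0.25), sim_tversky(c2, c1, 0.25))
    print(sim_tversky(c1, c2, 0.5), sim_tversky(c2, c1, 0.5))
```

With a weight other than 0.5 the two directions of comparison give different values, which is the asymmetry described above. A weight of 0.5 makes the measure symmetric.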
3.3.2 Inter-ontology Semantic Similarity
Inter-ontology semantic similarity measures try to quantify the similarity between
concepts that belong to different ontologies. Fairly little research has been doc-
umented in this area, due to the inherent difficulty of comparing heterogeneous
structures. A common approach is to combine the different ontologies into a
single ontology through detailed concept mappings Gangemi et al. (1998). It is
clear that this is very challenging and requires the help of a domain expert, as
well as plenty of time and effort. Furthermore, not all biomedical terminologies
are consistent and their lack of homogeneity is a major problem. Simpler ap-
proaches have been proposed in the literature. A usual first step is to merge the
different ontologies under a dummy root. This approach is found in Rodríguez
and Egenhofer (2003), where the authors use a weighted version of Tversky's
similarity which also takes into account geometrical features of the ontologies.
A similar route is followed by Petrakis et al. (2006), where the authors
substitute Tversky's similarity with a form of Jaccard similarity. The drawback of
these cross-similarity metrics is that they do not consider term overlap in both
ontologies. Other methods rely on extensions of single-ontology similarity
metrics. Examples of such work can be found in Al-Mubaid and Nguyen (2006) and
Sanchez et al. (2012).
Chapter 4
Search Interfaces
Search has risen to be one of the most commonly used tools for computer users.
It can be found everywhere, from stand-alone web-based search engines to em-
bedded search forms that appear in desktop applications and websites. To a large
extent, success of the search procedure depends on the user's ability to formulate
their information needs, transforming them into queries that are highly likely to
produce desired results. For this reason, a lot of effort has been spent on improv-
ing the search interfaces and providing tools that will enhance user experience.
In this chapter, the basic characteristics of successful search interface design are
presented, with main focus on web-search interfaces.
4.1 Information Seeking Models
Information seeking models attempt to recognize and describe the strategies fol-
lowed by humans from the moment they sense a search need until the moment
they acquire desired results. The search procedure may be viewed as a repetition
of actions. In Sutcliffe and Ennis (1998), the authors identify the following four
actions in what is considered the standard model of information seeking:
1. Problem Identification
2. Articulation of Need
3. Query Formulation
4. Evaluation of Results
The first step refers to conceptualization of the search need, while the second step
involves expressing this need in words. The third step requires the user to trans-
form the articulated need into a format that will be accepted by the underlying
search system. Finally, the fourth step refers to the procedure of judging the
results critically, exploiting any relevant domain knowledge and deciding whether
the need is satisfied. A search may be characterized as "ok", "failed" or
"unsatisfactory". An "ok" search ends the cycle successfully. An "unsatisfactory"
search may lead to reformulation of the query or re-articulation of the need,
while a completely "failed" search might require re-identification of the problem.
Sutcliffe and Ennis's model assumes that the need does not change, unless
results are disappointing. It does not capture the fact that users learn as they
search. This dynamic aspect of information seeking was captured in an earlier
work by Bates (1989). In that study, the user's needs are assumed to change
as the process advances. Furthermore, Bates claims that the success of the search
procedure depends not only on the final list of results, but also on the selections
made along the way. This model is referred to as the "berry-picking" model, to
denote that it does not result in a single set of results. A simple example of the
berry-picking model is a user who attempts a broad query such as
"String similarity algorithms" and refines the query to "Jaro similarity" after
viewing this result in the initial result list.
4.2 Query Specification
Queries are usually specified through rectangular entry forms, as in Fig. 4.1. The
width of these forms varies, with studies showing that wider forms promote
formulation of longer queries Franzen and Karlgren (2000); Belkin et al. (2003).
It has been observed that around 88% of search queries are composed of 1 to 4
words, with mean length equal to 2.8 words per query Jansen et al. (2007). The
actual search is executed by pressing the return key or mouse-clicking a specified
button (e.g. magnifying glass in Bing). In some cases, entry forms decorate their
background with descriptive text that provides guidance for the user. An example
is Facebook's search form, as seen in Fig. 4.2. The text disappears once the user
clicks inside the form. This usually helps to narrow down the search domain.
Figure 4.1: The Google search engine entry form.
Figure 4.2: Facebook uses grayed-out descriptive text to help in the formulation of user
queries.
After query submission, processing of the query takes place before any attempt
to retrieve results. This process may include removal of stopwords (i.e. words
with high appearance probability such as "the", "a"), normalization of words (e.g.
plural to singular) and permutation of word order. Boolean logic may also be used
in the case of multiple words per query. Returning results that contain all query
words (i.e. Boolean AND operator) seems more intuitive, although this might
sometimes lead to overly specific queries that return no results. The actual types
of processing are often hidden from the users, in an attempt to avoid confusion
and promote transparency Muramatsu and Pratt (2001).
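The processing steps described above can be sketched as follows. The stopword list, the naive plural rule and all function names are hypothetical and serve only as an illustration. A real system would rely on a proper stopword list and a stemmer or lemmatizer.

```python
# Hypothetical, deliberately tiny stopword list.
STOPWORDS = {"the", "a", "an", "of", "in"}


def normalize(word):
    """Lowercase a word and apply a very naive plural-to-singular rule."""
    word = word.lower()
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def preprocess(text):
    """Split on whitespace, normalize each word and drop stopwords."""
    terms = [normalize(word) for word in text.split()]
    return [term for term in terms if term not in STOPWORDS]


def boolean_and(query, documents):
    """Return the documents that contain every processed query term."""
    terms = set(preprocess(query))
    return [doc for doc in documents if terms <= set(preprocess(doc))]


if __name__ == "__main__":
    docs = [
        "A metric of string similarity",
        "Distance metrics in ontologies",
    ]
    print(boolean_and("the similarity metrics", docs))
    # An overly specific query can come back empty under Boolean AND.
    print(boolean_and("jaro similarity metrics", docs))
```

The second query shows the drawback of the Boolean AND operator mentioned above: one additional term that no document contains leaves the result list empty.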
Most modern search interfaces are equipped with dynamic search suggestion,
also known as auto-completion (see Fig. 4.3). As the user starts typing, a list of
term suggestions appears under the entry form. The suggestions contained in the
list are usually queries whose prefix matches what has been typed so far, although
there are cases where interior matches are also included. The user can then
mouse-click the most relevant query or navigate through the list, using keyboard
arrows. Studies have shown that approximately one third of all search attempts in
the Yahoo Search Assist were performed through a dynamically suggested query
Anick and Kantamneni (2008). The dynamic search suggestion technique attempts
to minimize unneeded typing from the user side and can alleviate spelling errors
early. Most importantly, though, it reassures the user that results are available,
so there is no frustration from empty result pages.
Figure 4.3: Bing's search interface features a powerful dynamic search suggestion, where
prefixes are highlighted with grayed-out font and the remaining text is in bold.
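Prefix-based suggestion can be sketched with a binary search over a sorted list of known queries. This is only one possible approach, not the method of any particular search engine. The query list, the function name and the limit of five suggestions are hypothetical, and interior matches are not covered.

```python
import bisect


def suggest(prefix, sorted_queries, limit=5):
    """Return up to `limit` queries that start with `prefix`.

    `sorted_queries` is assumed to be lowercase and sorted in ascending order.
    """
    prefix = prefix.lower()
    # All queries starting with the prefix form one contiguous run in the
    # sorted list, beginning at the insertion point of the prefix itself.
    start = bisect.bisect_left(sorted_queries, prefix)
    matches = []
    for query in sorted_queries[start:]:
        if not query.startswith(prefix) or len(matches) == limit:
            break
        matches.append(query)
    return matches


if __name__ == "__main__":
    known_queries = sorted([
        "jaro similarity",
        "jaccard similarity",
        "java tutorial",
        "levenshtein distance",
        "jaro winkler",
    ])
    print(suggest("jar", known_queries))
```

Because only queries that already exist in the list are suggested, every suggestion is known to lead to a non-empty result page, which matches the reassurance effect described above.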
An important point to consider is that searchers often return to their
previously accessed information. In the empirical study undertaken by Tauscher
and Greenberg (1997), it was found that there is a 58%
chance that the next web page to be visited had been visited before. A more
recent study Zhang and Zhao (2011) about tabbed browsing, conducted in 2010,
also finds page revisitation to be around the same levels, at 59.3%. Various tools
exist to help users find their intended pages, including Uniform Resource Locator
(URL) history, bookmarking of pages, basic navigation buttons (e.g. "Back" button
for short term page revisit) and change of URL font color if the page has already
been visited. Among other methods documented, users may save whole webpages
to their local disk or keep URLs in text documents, after enriching them with
comments Jones et al. (2002). Interestingly, a common approach to revisiting
documents is actually re-searching for them Obendorf et al. (2007). Users who
adopt this strategy attempt to re-create the conditions of their previous search, by
trying to formulate the exact same query. Another strategy requires past search
queries to appear as the user types, along with regular dynamic term suggestion.
Separation between suggested queries and previously generated ones varies
among interfaces, as can be seen in Figures 4.4 and 4.5.
Figure 4.4: The Safari browser's embedded search interface explicitly states which queries are
suggestions and which belong to the user's recent search history.
Figure 4.5: The Firefox browser's embedded search interface contains recent queries on top,
and separates them from suggestions using a solid line.
Figure 4.6: Google's search results page is a typical scrollable vertical list of captions. Metadata
facets, that restrain results to a particular type of information, are also present in the
interface (e.g. "Images" tab).
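Combining past queries with regular suggestions can be sketched as follows. The function name, the labels and the ordering (matching history entries first, most recent on top) are assumptions made for illustration, loosely following the layout in which recent queries appear above suggestions.

```python
def merged_suggestions(prefix, history, suggestions, limit=5):
    """Return (query, source) pairs for queries that start with `prefix`.

    `history` is assumed to be ordered from oldest to newest. Matching history
    entries come first, most recent on top, followed by other suggestions.
    A query already taken from the history is not repeated.
    """
    prefix = prefix.lower()
    merged = []
    seen = set()
    sources = (("history", reversed(history)), ("suggestion", suggestions))
    for label, candidates in sources:
        for query in candidates:
            if query.lower().startswith(prefix) and query not in seen:
                seen.add(query)
                merged.append((query, label))
    return merged[:limit]


if __name__ == "__main__":
    past = ["jaro similarity", "icd-10 codes", "jaccard index"]
    suggested = ["jaccard index", "jaro winkler", "java"]
    for query, label in merged_suggestions("ja", past, suggested):
        print(label, query)
```

Keeping the source label with each entry lets the interface decide how to separate the two groups, for instance with explicit headings or a dividing line.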
4.3 Presentation of Search Results
Search applications usually present results as a vertical list of captions, distributed
along multiple pages (see Fig. 4.6). Each caption is a clickable entity which, as a
minimum requirement, comprises a title and an excerpt of the target document
Clarke et al. (2007). Usually, the excerpt includes some or all of the query terms,
as highlighted text. In most cases, highlighting is performed using bold font or
colored term background. Many search applications tend to group similar results,
that originate from the same source, into the same caption. That way, result
pollution from few sources is avoided and diversity is promoted. The relevance
of search results is reflected in their order of appearance. Although relevance
scores were formerly used to grade the fit of the result to the query, they are
usually not present anymore in modern search applications. The reasons behind
their omission might be to avoid reverse-engineering of the ranking algorithms and
to reduce redundancy, since the ranking itself already reflects the importance of
results Hearst (2009).
It has been observed that users tend to click on the uppermost captions
Joachims et al. (2005). In the same study, it was found that the first caption
received more attention than its successors, even if its relevance was actually
lower. Furthermore, the majority of users often remain on the first page of
results. The authors in Jansen et al. (2007) observed that only 30% continued to
look for relevant results in the second page of the results, and only 15% looked
even further. Usually, the patience of users is a function of their experience
in using the system. More experienced users tend to be more patient than users
who are not accustomed to the search procedure. Inexperienced users, on the
other hand, often prefer to refine their query or simply accept that what they
search for cannot be found by the search application Hearst (2009).
Apart from plain lists of results, further organization of captions may be per-
formed, using some form of faceted browsing. Facets attempt to refine search
results, a