University of Manchester
School of Computer Science
Degree Programme of Advanced Computer Science
Christos Karaiskos
Enhanced Ontological Searching of Medical Scientific Information
Progress Report
Manchester, May 10, 2013
Supervisors: Prof. Andrew Brass
Dr. Jennifer Bradford (AstraZeneca)
ABSTRACT OF
MASTER’S THESIS
Author: Christos Karaiskos
Title: Enhanced Ontological Searching of Medical Scientific Information
Date: May 10, 2013 Pages: 7+49+8
Pathway: Data and Knowledge Management
Supervisors: Prof. Andrew Brass
Dr. Jennifer Bradford (AstraZeneca)
An enormous amount of biomedical knowledge is encoded in narrative textual
format. In an attempt to discover new or hidden knowledge, extensive research
is being conducted to extract and exploit term relationships from plain text
with the aid of technology. A common approach for the identification of
biomedical entities in plain text involves the usage of ontologies, i.e.,
knowledge bases which provide formal machine-understandable representations
of domains of variable specificity. In addition to term extraction,
ontologies may also be used as controlled vocabularies or as a means for
automatic knowledge acquisition through their inherent inference
capabilities. Visualization of the content of ontologies is thus very
important for researchers in the biomedical domain. Unfortunately, many of
these researchers find it difficult to deal with formal logic and would
prefer that ontology search interfaces completely hide any structural or
functional references to ontologies, even ones as simple as parent/child
relationships. This thesis proposes a strategy for building an ontology
search application that exploits ontologies behind the scenes, transparently
to the end user, and presents relevant concept information in such a way
that searchers can successfully and quickly find what they are looking for.
The proposed search interface features various search tools for enhanced
ontological searching, such as term auto-completion, error correction,
clever results ranking and similar-concept suggestions based on semantic
similarity metrics.
Keywords: search interface design, ontology hiding, biomedical ontology,
semantic similarity, usability, data integration
#Words: 12754 (Abstract + Chapters 1-6 Isolated)
List of Abbreviations
AI Artificial Intelligence
API Application Programming Interface
DAG Directed Acyclic Graph
GUI Graphical User Interface
HLGT High Level Group Term
HLT High Level Term
IC Information Content
ICD International Classification of Diseases
LCS Least Common Subsumer
MedDRA Medical Dictionary for Regulatory Activities
NCIT National Cancer Institute Thesaurus
NDF-RT National Drug File Reference Terminology
NHS UK National Health Service
OWL Web Ontology Language
PT Preferred Term
RDF Resource Description Framework
RDF-S Resource Description Framework Schema
RF2 Release Format 2
SNOMED CT Systematized Nomenclature of Medicine Clinical Terms
SNOMED RT Systematized Nomenclature of Medicine Reference Terminology
SOC System Organ Class
UMLS Unified Medical Language System
UX User Experience
VA U.S. Department of Veterans Affairs
WHO World Health Organization
XML Extensible Markup Language
Contents
1 Introduction 1
1.1 Problem Context . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3 Contribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.4 Thesis Organization . . . . . . . . . . . . . . . . . . . . . . . . 4
2 Ontologies 6
2.1 Modern Ontology Definition . . . . . . . . . . . . . . . . . . . 6
2.2 Ontology vs. Terminology . . . . . . . . . . . . . . . . . . . . 8
2.3 Notable Biomedical Ontologies and Terminologies . . . . . . . 9
2.3.1 SNOMED CT . . . . . . . . . . . . . . . . . . . . . . . 9
2.3.2 NDF-RT . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.3.3 ICD-10 . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.3.4 MedDRA . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.3.5 NCI Thesaurus . . . . . . . . . . . . . . . . . . . . . . 12
3 Similarity Metrics 13
3.1 Similarity Metric vs. Distance Metric . . . . . . . . . . . . . . 13
3.2 Lexical Similarity . . . . . . . . . . . . . . . . . . . . . . . . . 14
3.2.1 Character-based Similarity Measures . . . . . . . . . . 15
Longest Common Substring . . . . . . . . . . . . . . . 15
Hamming Similarity . . . . . . . . . . . . . . . . . . . 15
Levenshtein Similarity . . . . . . . . . . . . . . . . . . 15
Jaro Similarity . . . . . . . . . . . . . . . . . . . . . . 16
Jaro-Winkler Similarity . . . . . . . . . . . . . . . . . 16
N-gram Similarity . . . . . . . . . . . . . . . . . . . . . 17
3.2.2 Word-based Similarity Measures . . . . . . . . . . . . . 17
Dice Similarity . . . . . . . . . . . . . . . . . . . . . . 17
Jaccard Similarity . . . . . . . . . . . . . . . . . . . . . 17
Cosine Similarity . . . . . . . . . . . . . . . . . . . . . 18
Manhattan Similarity . . . . . . . . . . . . . . . . . . . 18
Euclidean Similarity . . . . . . . . . . . . . . . . . . . 18
3.3 Ontological Semantic Similarity . . . . . . . . . . . . . . . . . 19
3.3.1 Intra-ontology Semantic Similarity . . . . . . . . . . . 19
Distance-based Metrics . . . . . . . . . . . . . . . . . . 19
Information-Based Metrics . . . . . . . . . . . . . . . . 22
Feature-Based Measures . . . . . . . . . . . . . . . . . 25
3.3.2 Inter-ontology Semantic Similarity . . . . . . . . . . . 26
4 Search Interfaces 27
4.1 Information Seeking Models . . . . . . . . . . . . . . . . . . . 27
4.2 Query Specification . . . . . . . . . . . . . . . . . . . . . . . . 28
4.3 Presentation of Search Results . . . . . . . . . . . . . . . . . . 32
4.4 Query Reformulation . . . . . . . . . . . . . . . . . . . . . . . 33
5 Design 36
5.1 AstraZeneca’s Search Application . . . . . . . . . . . . . . . . 36
5.2 Design Considerations . . . . . . . . . . . . . . . . . . . . . . 38
5.2.1 Ontology Access . . . . . . . . . . . . . . . . . . . . . 38
5.2.2 Ontology Manipulation . . . . . . . . . . . . . . . . . . 39
5.2.3 Search Entry Form . . . . . . . . . . . . . . . . . . . . 40
5.2.4 Result Calculation . . . . . . . . . . . . . . . . . . . . 40
5.2.5 Error Correction . . . . . . . . . . . . . . . . . . . . . 41
5.2.6 Results Presentation . . . . . . . . . . . . . . . . . . . 41
5.2.7 Concept Presentation . . . . . . . . . . . . . . . . . . . 42
5.2.8 Navigation . . . . . . . . . . . . . . . . . . . . . . . . . 42
5.2.9 History . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
5.2.10 Feedback . . . . . . . . . . . . . . . . . . . . . . . . . . 43
5.3 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
5.4 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
5.5 Current Implementation State . . . . . . . . . . . . . . . . . . 44
6 Conclusions and Future Work 49
Chapter 1
Introduction
Ontologies are knowledge bases which provide formal machine-understandable
representations of domains of variable specificity. Given a domain of
discourse, concepts that belong to the domain are documented in formal
logic, along with their inter-relations. Ontologies, as representations,
cannot perfectly capture the part of the world that they attempt to describe
[Davis et al., 1993]. They are based on the open world assumption, which
states that if something is not represented in a knowledge base, it does not
mean that it does not exist in the real world [Hustadt et al., 1994]. As our
knowledge about a domain increases, ontologies are updated and become more
complex. This has become evident in the biomedical domain, where ontologies
have already attained a high degree of specificity, and has led to their
quick adoption for data integration and knowledge discovery purposes.
1.1 Problem Context
Within biomedicine, ontologies can help researchers communicate by promoting
consistent use of biomedical terms and concepts. The construction of an
ontology itself involves mediating across multiple views and requires that a
number of domain experts reach a consensus that reflects the diverse
viewpoints of the community. Ontologies are viewed as tools that provide
opportunities for new knowledge acquisition, due to the complex semantic
relations that they model. Inferences in a huge ontology may reveal
connections that the human eye would miss. This is especially important in
the pharmaceutical sector, where drug discovery has slowed down
significantly as a process, and in the biological sector, where attempts to
demystify genome patterns associated with disease are still at an initial
stage. Another common use for ontologies in the biomedical domain is as
controlled vocabularies that feed filtered terms into computer applications.
Finally, ontologies may be used to connect terms found in plain text to
their semantic representations. Term extraction with the help of ontologies
is a hot topic in biomedicine, due to the vast amounts of medical
information stored in plain text. Given the importance of ontologies,
researchers in the biomedical field commonly require access to their
content.
1.2 Motivation
In the past, AstraZeneca employees were provided with a web-based search
form that enabled them to look for concepts in one or more biomedical
ontologies and select the most suitable from a list of search results. The
chosen concepts were, in turn, conveyed to a text mining application.
Understanding the results required the user to be familiar with the content
and structure of the ontology from which the terms were retrieved.
Unfortunately, most users did not feel comfortable with the idea of
ontologies and struggled, or even refused, to use the provided interfaces,
even though no logic-based content was there to confuse them.
In many cases, though, this was not solely the fault of the users. The
interface gave users the freedom to select the ontologies to be searched for
the specified query. Inexperienced users usually did not know, or care,
which ontology contains the desired query term. For example, a user wished
to search for 'Non-small cell lung carcinoma' by its abbreviation, 'NSCLC'.
Querying 'NSCLC' in the MedDRA terminology1 returned no results, since the
concept is not present in the terminology. Although this behavior is
correct, it seems wrong to the inexperienced user and may lead to loss of
trust in the system.
1The difference between terminology and ontology is described in Section 2.2.
But even if the term is present in the ontology, the user should not be
forced to know its exact spelling. For example, querying for 'NSCLC' in the
NCI Thesaurus also returned no results, despite the fact that the actual
concept exists in the ontology. The searcher needed to know that the
preferred term for the 'NSCLC' concept is 'Non-small cell lung carcinoma'.
Abbreviations and dissimilar synonyms are common in the biomedical field, so
expecting the user to know the preferred term for each concept is
problematic.
In addition to the above, the presentation of results was not always
straightforward. Terms with a strong semantic relation to each other were
presented as stand-alone terms in the search results, subtly misleading
users into deducing that the terms were independent. It was up to the user
to judge the relevance of results to the query. For example, the results for
'Non-small cell lung carcinoma' in NCIT included, among others, the terms
'Non-small cell lung carcinoma' and 'Stage I non-small cell lung carcinoma'
equally spaced, in a way that users could not infer the connections between
them. In fact, the latter term is a specialization of the former. In
practice, users chose all the terms, even though they were looking for the
broad term, because they became confused and did not want to take the risk
of selecting only one.
This collapse at the human-computer interface has motivated AstraZeneca
to try to build tools that take advantage of the ontology structure and, at the
same time, completely hide it from the user in order to facilitate the search
procedure.
1.3 Contribution
The outcome of this thesis will be the development of a user-friendly search
application that allows users to find information about concepts present in
a medical ontology, without requiring them to understand the underlying
structure of the ontology. Information about a concept may include its
accession code within the given ontology, the term for its preferred name,
its definition and all available synonym terms. In order to facilitate the
search procedure and enhance the User Experience (UX), the search application
will include features such as dynamic term suggestion, spelling correction,
recent query history and basic navigational functionality (e.g. back, forward
buttons).
The main challenge lies in the presentation of results; as stated in
Section 1.2, users are usually unsure which term(s) to choose when multiple
similarly-spelt terms appear. Ranking of terms will be performed with the
aid of both lexical and semantic similarity. The former will screen the
terms that best match the user query and rank them according to a string
relevance metric. These results will then be processed by the latter, so
that terms showing a strong semantic connection are grouped together.
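This two-stage idea can be illustrated with a minimal, hypothetical Python
sketch: the lexical screen below uses a simple character-overlap ratio as a
stand-in string relevance metric, and the grouping step folds each retained
term under its broadest retained ancestor. The function names, the
placeholder metric and the threshold are illustrative assumptions, not the
thesis's actual implementation.

```python
# Hypothetical sketch of the proposed two-stage ranking: a lexical screen
# followed by grouping of semantically connected terms.
from difflib import SequenceMatcher

def lexical_score(query, term):
    """Placeholder lexical relevance metric (ratio of matching characters)."""
    return SequenceMatcher(None, query.lower(), term.lower()).ratio()

def rank_and_group(query, terms, parent_of, threshold=0.5):
    """Screen terms lexically, then group each retained term under its
    broadest retained ancestor so that related results appear together."""
    kept = sorted((t for t in terms if lexical_score(query, t) >= threshold),
                  key=lambda t: -lexical_score(query, t))
    groups = {}
    for term in kept:
        head = term
        # walk up the hierarchy while the ancestor also survived the screen
        while parent_of.get(head) in kept:
            head = parent_of[head]
        groups.setdefault(head, []).append(term)
    return groups
```

Grouping by the broadest surviving ancestor directly addresses the 'Stage I
non-small cell lung carcinoma' confusion described in Section 1.2: the
specialization is listed under its broad term rather than beside it.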
Ideally, the search application should bridge across terms from multiple
ontologies. Due to the diversity in the format and annotation of different
ontologies, this is not a straightforward generalization. Most importantly,
within the biomedical community, the term 'ontology' is often used
erroneously to describe plain terminologies that, in fact, violate basic
ontological principles.2 Therefore, ontology-specific difficulties are
expected to arise if semantic similarity measures are to be deployed.
In summary, the goals of this thesis are the following:
1. To develop user-friendly search tools that allow users to build search
queries based on the terms present in a medical ontology, without need
for the users to understand the actual structure of the ontology.
2. To exploit the semantic annotations of the underlying ontology in order
to enhance the quality and presentation of results.
3. To bridge across ontologies and intermix their results appropriately.
1.4 Thesis Organization
The present report is organized in a total of six chapters. Chapter 2
includes an introduction to ontologies and a brief description of some
notable biomedical ontologies. Chapter 3 presents the background needed for
understanding the different measures of lexical and semantic similarity.
Chapter 4 discusses interface design principles for user-centered search
applications. Chapter 5 presents the design considerations taken into
account for the implementation of the ontological search application, along
with the current implementation state. Finally, conclusions are drawn in
Chapter 6, along with possible future directions.
2In MedDRA, the synonym of a term may be a child node of the term itself.
Chapter 2
Ontologies
The term 'ontology' is an uncountable noun coined in philosophy by the
ancient Greek philosophers [Guarino, 1998]. It denotes the study of the
nature of existence, at a fairly abstract level. In computer science, the
word 'ontology' refers to the encoding of human knowledge in a format that
allows for computational use. This chapter includes an introduction to the
modern definition of ontology, along with a brief description of some of the
most notable biomedical ontologies.
2.1 Modern Ontology Definition
In Artificial Intelligence (AI), an ontology is commonly defined as a
specification of a (shared) conceptualization [Gruber et al., 1995]. A
conceptualization refers to an individual's knowledge about a specific
domain, acquired through "experience, observation or introspection"
[Huang et al., 2010]. Ontologies are shared conceptualizations, meaning that
multiple participants, usually domain experts, contribute to their
construction, maintenance and expansion. Conflicts are certain to arise
among the different participants, so an important aspect of ontology design
is bridging multiple views of the desired domain into a single concrete
representation. A specification, on the other hand, is a transformation of
this shared conceptualization into a formal representation language.
The outcome of a formal representation of a domain is a collection of
entities, expressions and axioms. Entities include:
• concepts or classes, which are sets of individuals (e.g., ‘Country’, which
contains all countries),
• individuals, which are specific instances of classes (e.g., ‘Greece’ as an
instance of ‘Country’),
• data types (e.g. string, integer),
• literals, which are specific values of a given data type (e.g. 1,2,3, or
string values),
• properties (e.g. hasDisease, hasAge).
Expressions refer to descriptions of entities in a formal representation
language. The standardized family of languages for formal ontology
representation is the Web Ontology Language (OWL), which builds on the
Extensible Markup Language (XML), Resource Description Framework (RDF) and
RDF Schema (RDF-S) standards to provide a highly expressive means for
representing knowledge [McGuinness et al., 2004]. The underlying format of
the resulting OWL document can vary among several types, with the most
common being RDF/XML.
Finally, axioms relate entities and expressions. The connection can be made
class-to-class (e.g., SubClassOf), individual-to-class (e.g.,
ClassAssertion) or property-to-property (e.g., SubPropertyOf), among others.
These relations can be asserted explicitly or inferred by a reasoner, based
on the logical relations of concepts. As an example of a simple inference, a
concept's ancestors can be inferred automatically, once its parent concept
is specified.
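The ancestor inference just described amounts to a transitive closure over
asserted parent links. The sketch below illustrates the idea in Python; the
tiny hierarchy is a made-up example, not taken from any real ontology.

```python
def ancestors(concept, parents):
    """Infer all ancestors of a concept by following asserted "is-a"
    (parent) links transitively, as a reasoner would."""
    result = set()
    stack = list(parents.get(concept, []))
    while stack:
        p = stack.pop()
        if p not in result:
            result.add(p)
            stack.extend(parents.get(p, []))
    return result

# Hypothetical mini-hierarchy: only direct parents are asserted.
parents = {
    "Non-small cell lung carcinoma": ["Lung carcinoma"],
    "Lung carcinoma": ["Carcinoma"],
    "Carcinoma": ["Neoplasm"],
}

print(ancestors("Non-small cell lung carcinoma", parents))
# all three ancestors are inferred from the single asserted parent link
```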
An ontology may be visualized as a graph, in which concepts are nodes
and relations are edges between nodes. Furthermore, if transitive hierarchical
relations are isolated (e.g. subsumption, also known as “is-a” relation or
hyponymy), the ontology can be viewed as a taxonomy. The geometrical
visualization of an ontology will be presented in more detail in chapter 3.
2.2 Ontology vs. Terminology
A terminology is a collection of term names associated with a given domain.
A term is a mapping of a concrete concept to natural language. This
term-to-concept mapping is usually not one-to-one, especially in the
biomedical domain, where term variation and term ambiguity arise
[Ananiadou and McNaught, 2006]. Term variation is a result of the richness
of natural language and refers to the existence of multiple terms for the
description of the same concept. For example, the terms 'Transmembrane 4
Superfamily Member 1', 'TM4SF1' and 'L6 Antigen' all point to the same
protein. Term ambiguity occurs when a term is mapped to more than one
distinct concept. This is common when new abbreviations are introduced
[Liu et al., 2002]. As an example, some of the concepts that the acronym
'CTX' may map to are 'Cardiac Transplantation', 'Clinical Trial Exemption'
and 'Conotoxin'. Their disambiguation is a matter of context.
A terminology is not constrained to being a simple list of terms. In fact,
most terminologies feature some kind of structure, where terms that map to
the same concept are grouped together and semantic relationships between
concepts are explicitly or implicitly stated. Semantic relationships between
terms include synonymy and antonymy, while semantic relationships between
concepts include hyponymy, hypernymy, meronymy and holonymy
[Jurafsky and Martin, 2000]. Synonymy exists when two terms are
interchangeable, while antonymy denotes that two terms have opposite
meanings. Hyponymy introduces a parent-child, or "is-a", relation between
concepts: a concept is a hyponym of another concept if the former derives
from the latter and represents a more granular concept. Hyponymy is
transitive; if concept a is a child of concept b, and concept b is a child
of concept c, then a is also a child of c. Hypernymy is the reverse relation
of hyponymy. Meronymy exists when a concept represents a part of another
concept; holonymy is the opposite relation, where a concept has some other
concept(s) as parts.
The difference between a terminology and an ontology is not always clear, as
terminologies continue to improve their state of organization in ways that
resemble ontologies. The initial scope and aim of the two, though, are
clearly different: the purpose of a terminology was initially, as the name
implies, to collect all terms associated with a specified domain, whereas
the target of an ontology has, from the start, been to provide a
machine-readable specification of a shared conceptualization. Despite their
many common characteristics, terminologies are not necessarily ontologies.
If treated as ontologies, they may lead to inconsistencies or wrong
inferences [Ananiadou and McNaught, 2006]. An illustrative example is the
case of MedDRA, which is discussed in Section 2.3.4.
2.3 Notable Biomedical Ontologies and Terminologies
Hundreds of biomedical ontologies and terminologies have been published
online. According to Bioportal1 statistics, the five most viewed ontologies
or terminologies are SNOMED Clinical Terms, National Drug File,
International Classification of Diseases (ICD), MedDRA and the NCI
Thesaurus. In this section, a brief introduction to these
ontologies/terminologies is given.
2.3.1 SNOMED CT
The Systematized Nomenclature of Medicine Clinical Terms (SNOMED CT) is a
biomedical terminology which covers most areas within medicine, such as
drugs, diseases, operations, medical devices and symptoms. It may be used
for the coding, retrieval and processing of clinical data. SNOMED CT is
written purely in a formal logic-based syntax, is distributed in the
so-called Release Format 2 (RF2), and is organized into multiple independent
hierarchies. It is the result of merging the UK National Health Service's
(NHS) Read Codes with the SNOMED Reference Terminology (SNOMED RT),
developed by the College of American Pathologists. The basic hierarchies, or
axes, are 'Clinical Finding' and 'Procedure'. The latest version contains
more than 400,000 concepts and over 1,000,000 relationships, rendering
SNOMED CT the most complete terminology in the medical domain. Only a few
definitions are present in the terminology. Each concept contains a unique
identifier and numerous synonymous terms that account for term variation.
Also, each concept is part of at least one hierarchy and may have multiple
"is-a" relationships with higher-level nodes. SNOMED CT is part of the
Unified Medical Language System (UMLS), a biomedical ontology and
terminology integration effort which comprises hundreds of resources.
1Bioportal is a biomedical ontology/terminology repository which provides
online ontology presentation and manipulation tools
(http://bioportal.bioontology.org/).
2.3.2 NDF-RT
The National Drug File Reference Terminology (NDF-RT) was introduced by the
U.S. Department of Veterans Affairs (VA) as a formalized representation of a
medication terminology, written in description logic syntax [VHA, 2012]. The
terminology is organized into concept hierarchies, where each concept is a
node comprising a list of term synonyms and a unique identifier. As
expected, top-level concepts are more general than lower-level ones. The
central hierarchy is named DRUG KIND and indicates the types of medications,
the preparations used in them and clinical VA drug products.
Other hierarchies include
• DISEASE KIND,
• INGREDIENT KIND,
• MECHANISM OF ACTION KIND,
• PHARMACOKINETICS KIND,
• PHYSIOLOGIC EFFECT KIND,
• THERAPEUTIC CATEGORY KIND,
• DOSE FORM and
• DRUG INTERACTION KIND.
Roles exist between different concepts and are specified only with
existential restrictions (i.e., the OWL equivalent of someValuesFrom).
Mappings to other terminologies are also available. Currently, NDF-RT
contains more than 45,000 concepts in hierarchies of maximum depth 12.
2.3.3 ICD-10
The International Statistical Classification of Diseases and Related Health
Problems (ICD) is a terminology which attempts to classify signs, symptoms
and causes of disease and morbidity [WHO, 1992]. It appeared in the
mid-19th century and is now maintained by the World Health Organization
(WHO). It is currently available in its 10th revision, although the 11th
revision is claimed to be at the final stage before release. As a taxonomy,
it has a relatively small maximum depth, equal to 6. The code assigned to
each concept ties it to a specific place in the taxonomy, with each code
having only a single parent. It is thus not a proper application of
ontological principles2, since, in reality, it is not unusual for concepts
to belong to more than one subsumer, and this is not modeled. In addition,
there exist categories such as "Not otherwise specified" or "Other", which
are not needed in an ontology; the open world assumption already covers the
fact that every ontology is incomplete, so stating it explicitly is
redundant and may interfere with the evolution of the ontology, as new terms
are not classified under their closest match.
2.3.4 MedDRA
The Medical Dictionary for Regulatory Activities (MedDRA) is a terminology
concerned with biopharmaceutical regulatory processes. It contains terms
associated with all phases of the drug development cycle. MedDRA is
organized in a hierarchical structure of fixed depth, as seen in Fig. 2.1.
System Organ Classes (SOCs) represent the 26 predefined overlapping
hierarchies to which terms belong. High Level Group Terms (HLGTs) and High
Level Terms (HLTs) are general term groupings, denoting disorders or
complications. Preferred Terms (PTs) denote the preferred name for a
concept, while Lowest Level Terms (LLTs) include terms of maximum
specificity. LLTs may be connected to their PTs by hyponymy, meronymy or
synonymy relationships. This is the main problem in trying to view MedDRA
as an ontology: in a formal ontology, a concept cannot be a child of itself,
yet in MedDRA this clearly happens when a PT and one of its LLTs share a
synonymy relation.
2Nor was it meant to be; its intent is classification.
Figure 2.1: The structure of the MedDRA terminology comprises a fixed-depth
hierarchy.
2.3.5 NCI Thesaurus
The National Cancer Institute Thesaurus (NCIT) is a controlled terminology
for cancer research. The thesaurus has been converted to formal OWL syntax
and is updated at fixed intervals. The conversion was not an easy one; many
inconsistencies and modeling dead-ends encountered in the conversion
procedure have been documented [Ceusters et al., 2005], along with some
clear violations of ontological principles [Schulz et al., 2010]. The NCIT
provides almost 100,000 concepts, with approximately 65% containing a
definition.
Chapter 3
Similarity Metrics
Similarity metrics aim at measuring the lexical or semantic similarity between
terms. Lexical similarity focuses on terms that contain similar character
or word sequences, while semantic similarity tries to determine how close
in meaning the terms are. Lexical similarity is simpler to calculate, since
string-based algorithms only require plain text to function. On the other
hand, semantic similarity requires extra information about the terms present
in plain text. This extra information is usually acquired with the help of
a knowledge base (e.g. ontology, terminology, etc.) or through statistical
analysis of corpora, i.e., large collections of text documents that resemble
real-world usage of words.
3.1 Similarity Metric vs. Distance Metric
It is common in the literature to come across the term 'semantic distance'
instead of 'semantic similarity'. A distance metric d(a, b), which compares
entities a and b, must satisfy the following properties:
1. d(a, b) = 0 if and only if a = b (zero property),
2. d(a, b) = d(b, a) (symmetry property),
3. d(a, b) ≥ 0 (non-negativity property),
4. d(a, c) ≤ d(a, b) + d(b, c) (triangle inequality).
On the other hand, the requirements for a similarity metric were formally
introduced not long ago [Chen et al., 2009]. The definition states that a
similarity metric s(a, b) must satisfy the following properties:
1. s(a, a) ≥ 0,
2. s(a, b) = s(b, a),
3. s(a, a) ≥ s(a, b),
4. s(a, b) + s(b, c) ≤ s(a, c) + s(b, b),
5. s(a, a) = s(b, b) = s(a, b) if and only if a = b.
The counter-intuitive 4th property can be proven using set theory. More
specifically, if \(|a \cap b|\) denotes the cardinality of the common
characteristics of a and b, and \(\bar{c}\) denotes the complement of c, the
following equality holds:

\[
|a \cap b| = |a \cap b \cap c| + |a \cap b \cap \bar{c}|. \tag{3.1}
\]

Then,

\[
|a \cap b| + |b \cap c| = |a \cap b \cap c| + |a \cap b \cap \bar{c}|
+ |a \cap b \cap c| + |\bar{a} \cap b \cap c|
\le |a \cap c| + |b|, \tag{3.2}
\]

since \(|a \cap b \cap c| \le |a \cap c|\) and
\(|a \cap b \cap c| + |a \cap b \cap \bar{c}| + |\bar{a} \cap b \cap c| \le |b|\).

Deduction of similarity from distance is a common procedure that requires
only simple operations. Similarity is, intuitively, a decreasing function of
distance. Conversion between the two can take many forms
[Chen et al., 2009]. In this thesis, all formulas will be presented as
similarity measures.
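For illustration, one common conversion of a non-negative distance into a
similarity in (0, 1] is shown below. The particular form 1/(1 + d) is an
assumption chosen for the example; it is only one of the many forms
mentioned above.

```python
def similarity_from_distance(d):
    """Convert a non-negative distance d into a similarity in (0, 1].
    A distance of 0 maps to similarity 1, and similarity decreases
    monotonically as distance grows."""
    if d < 0:
        raise ValueError("a distance metric is non-negative")
    return 1.0 / (1.0 + d)
```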
3.2 Lexical Similarity
String-based methods that calculate lexical similarity can be divided into
character-based and word-based. In this section, some of the most popular
metrics are presented. For a more complete survey of lexical similarity
measures, see [Gomaa and Fahmy, 2013] and [Navarro, 2001].
3.2.1 Character-based Similarity Measures
In character-based similarity, strings are viewed as character sequences,
and similarity is computed from the characters that the strings share.
Longest Common Substring
The Longest Common Substring algorithm [Gusfield, 1997] tries to find the
maximum number of consecutive characters that two strings share. It may
be implemented using a suffix tree or dynamic programming.
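A minimal dynamic-programming sketch of the algorithm, returning the length
of the longest shared run of consecutive characters:

```python
def longest_common_substring(a, b):
    """Length of the longest run of consecutive characters shared by a and
    b, via dynamic programming over character positions."""
    best = 0
    # prev[j] = length of the common suffix ending at a[i-1] and b[j-1]
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best
```

The suffix-tree implementation mentioned above runs in linear time, whereas
this dynamic program is quadratic but much simpler to write.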
Hamming Similarity
Hamming similarity is a metric that can be applied to strings of equal
length. It is a simple metric that counts the positions at which the two
strings share the same character. Given strings a and b, the similarity
formula can be written as follows:

\[
\mathrm{sim}_{ham}(a, b) = \frac{\sum_{i} \mathbf{1}(a_i = b_i)}{|a|}, \tag{3.3}
\]

where \(\mathbf{1}(\cdot)\) is the indicator function and \(|\cdot|\)
denotes string length, measured in characters.
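Eq. 3.3 transcribes directly into Python:

```python
def hamming_similarity(a, b):
    """Fraction of positions at which equal-length strings a and b share
    the same character (Eq. 3.3)."""
    if len(a) != len(b):
        raise ValueError("Hamming similarity requires equal-length strings")
    return sum(x == y for x, y in zip(a, b)) / len(a)
```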
Levenshtein Similarity
Levenshtein distance counts the number of character alterations that need to
be made in order to transform one string into another [Levenshtein, 1966].
This number is bounded by the length of the longer string, which is commonly
used as a normalizing factor that restrains the value of the distance to
[0, 1]. Mathematically, the normalized Levenshtein distance of terms a and b
is computed using the following formula:

\[
d_{lev}(a, b) = \frac{\mathrm{lev}_{a,b}(|a|, |b|)}{\max\{|a|, |b|\}}, \tag{3.4}
\]

where

\[
\mathrm{lev}_{a,b}(i, j) =
\begin{cases}
\max\{i, j\}, & \min\{i, j\} = 0 \\
\min
\begin{cases}
\mathrm{lev}_{a,b}(i-1, j) + 1 \\
\mathrm{lev}_{a,b}(i, j-1) + 1 \\
\mathrm{lev}_{a,b}(i-1, j-1) + [a_i \neq b_j]
\end{cases}, & \text{otherwise}
\end{cases}
\tag{3.5}
\]

and max{·}, min{·} denote the maximum and minimum functions, respectively.
Converting the normalized distance to similarity can be done as follows:

\[
\mathrm{sim}_{lev}(a, b) = 1 - d_{lev}(a, b). \tag{3.6}
\]
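The recurrence in Eq. 3.5 can be computed bottom-up with an iterative
dynamic program, avoiding the exponential cost of the naive recursion; the
sketch below combines Eqs. 3.4-3.6 into a single similarity function:

```python
def levenshtein_similarity(a, b):
    """Normalized Levenshtein similarity (Eqs. 3.4-3.6), computed with an
    iterative dynamic program over two rows of the edit-distance table."""
    if not a and not b:
        return 1.0
    prev = list(range(len(b) + 1))  # lev(0, j) = j, the base case
    for i in range(1, len(a) + 1):
        cur = [i] + [0] * len(b)    # lev(i, 0) = i
        for j in range(1, len(b) + 1):
            cur[j] = min(prev[j] + 1,                           # deletion
                         cur[j - 1] + 1,                        # insertion
                         prev[j - 1] + (a[i - 1] != b[j - 1]))  # substitution
        prev = cur
    distance = prev[len(b)] / max(len(a), len(b))
    return 1.0 - distance
```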
Jaro Similarity
Jaro similarity [Jaro, 1989, Jaro, 1995] takes into account both the number
and sequence of common characters present in the two strings. Let us
consider strings a = a_1 ... a_K and b = b_1 ... b_L. A character a_i is said to
be common with b if the same character occurs in b within a window of
min{|a|, |b|}/2 positions around position i. Let a′ = a′_1 ... a′_{K′} be those
characters in a that are common with b, and b′ = b′_1 ... b′_{L′} those characters
in b that are common with a. A transposition for a′, b′ is a position i at which
a′_i ≠ b′_i. The number of transpositions for a′, b′ divided by two is denoted
as T_{a′,b′}. Then, Jaro’s
formula for similarity is given by:
sim_{jaro}(a, b) = \frac{1}{3} \left( \frac{|a'|}{|a|} + \frac{|b'|}{|b|} + \frac{|a'| - T_{a',b'}}{|a'|} \right).    (3.7)
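A sketch of the matching procedure in Python. Note that the matching-window convention varies across sources; the version below uses the widely used max(|a|, |b|)/2 − 1 window rather than the min-based one quoted above, so treat it as illustrative:

```python
def jaro(a: str, b: str) -> float:
    if not a or not b:
        return 0.0
    # Matching window; conventions differ across sources.
    window = max(0, max(len(a), len(b)) // 2 - 1)
    matched_b = [False] * len(b)
    a_common = []
    for i, ch in enumerate(a):
        lo, hi = max(0, i - window), min(len(b), i + window + 1)
        for j in range(lo, hi):
            if not matched_b[j] and b[j] == ch:
                matched_b[j] = True
                a_common.append(ch)
                break
    b_common = [b[j] for j in range(len(b)) if matched_b[j]]
    m = len(a_common)
    if m == 0:
        return 0.0
    # Half the number of positions where the common sequences disagree.
    transpositions = sum(x != y for x, y in zip(a_common, b_common)) / 2
    return (m / len(a) + m / len(b) + (m - transpositions) / m) / 3
```

With this window, the classic pair “MARTHA”/“MARHTA” yields 17/18 ≈ 0.944.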
Jaro-Winkler Similarity
Jaro-Winkler similarity [Winkler, 1999] is a variation of Jaro similarity which
promotes strings with long common prefixes. The length of the longest prefix
common to both strings a and b is denoted as P. Then, if P′ = \min(P, 4),
Jaro-Winkler similarity is given by:
sim_{j\&w}(a, b) = sim_{jaro}(a, b) + \frac{P'}{10} \left(1 - sim_{jaro}(a, b)\right).    (3.8)
N-gram Similarity
A string can be split into n-grams, i.e. all possible consecutive character
sequences of length n in the string. As an example, the word “protein”
can be split into the 3-grams “pro”, “rot”, “ote”, “tei” and “ein”. When
comparing two strings, the number of common n-grams is computed and
normalized by the maximum number of n-grams. More specifically, given
strings a and b, similarity is given by:
sim_{ngram}(a, b) = \frac{N_{com}}{N_{max}},    (3.9)
where Ncom denotes the number of common n-grams and Nmax denotes the
maximum number of n-grams in either of the two strings.
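In Python, one reading of this measure treats each string as the set of its distinct n-grams (a multiset count is an equally valid reading of the formula); this sketch is ours:

```python
def ngram_similarity(a: str, b: str, n: int = 3) -> float:
    # Distinct n-grams of each string; common count over the larger count.
    grams_a = {a[i:i + n] for i in range(len(a) - n + 1)}
    grams_b = {b[i:i + n] for i in range(len(b) - n + 1)}
    if not grams_a or not grams_b:
        return 0.0
    return len(grams_a & grams_b) / max(len(grams_a), len(grams_b))
```

For the example above, “protein” and “proteins” share all five 3-grams of the shorter word out of six in the longer, giving 5/6.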
3.2.2 Word-based Similarity Measures
As the name implies, word-based measures view the string as a collection of
words. Similarity measures dictate how similar two terms are word-wise;
no weight is given to character-level similarity.
Dice Similarity
Dice similarity considers input strings a and b as sets of words A and B
respectively, and calculates similarity as follows:
sim_{dice}(a, b) = \frac{2|A \cap B|}{|A| + |B|},    (3.10)
where | · | denotes set cardinality in number of words.
Jaccard Similarity
Jaccard similarity counts the number of common words of the compared
strings and divides it by the number of distinct words in both strings, i.e.
sim_{jacc}(a, b) = \frac{|A \cap B|}{|A \cup B|}.    (3.11)
Cosine Similarity
In order to compute cosine similarity, the compared strings should be con-
verted to vectors. The dimension of the resulting vectors will be equal to
the total number of distinct words present in both. Therefore, each element
in the vector represents one word. The vector values for each string are
computed as follows: a vector holds the value one in every position whose
corresponding word appears in the respective string, and zero in every
position whose corresponding word does not. Given strings a and b, the
respective vectors a and b are computed. Cosine similarity is then given by:
sim_{cos}(a, b) = \frac{\mathbf{a} \cdot \mathbf{b}}{\|\mathbf{a}\|\, \|\mathbf{b}\|},    (3.12)
where || · || denotes the Euclidean norm function.
Manhattan Similarity
Taxicab geometry considers that distance between two points in a grid is
given by the sum of the absolute differences of their respective coordinates.
The grid resembles a uniform city road map, where diagonal movements are
not permitted. This is the reason why the distance metric in this space
is often called Manhattan distance or city block distance. Considering N-
dimensional string vectors a and b, Manhattan similarity can be computed as:
sim_{manh}(a, b) = 1 - \frac{\sum_{i=1}^{N} |a_i - b_i|}{N},    (3.13)
where N is a normalizing constant that represents the dimension of a and b.
Euclidean Similarity
Euclidean similarity also considers strings as vectors, and computes similarity
as:
sim_{eucl}(a, b) = 1 - \sqrt{\frac{\sum_{i=1}^{N} |a_i - b_i|^2}{N}}.    (3.14)
3.3 Ontological Semantic Similarity
An ontology is a collection of concepts and their inter-relationships. It may be
visualized as a graph, in which nodes represent concepts and edges represent
the relations between them. Usually, ontologies are viewed as taxonomies,
where “is-a” and “part-of” relations play the most important role. Viewing
the ontology as a taxonomy, one can apply semantic similarity metrics that
exploit the hierarchical structure. Probably the best-known testbed for se-
mantic similarity is the computational lexicon WordNet [Miller, 1995].
In WordNet, closely related terms are grouped together to form synsets.
These synsets, in turn, form semantic relations with other synsets. WordNet
is commonly referred to as a lexical ontology, due to an obvious mapping of
lexical hyponymy to ontological subsumption.
3.3.1 Intra-ontology Semantic Similarity
Intra-ontology semantic similarity metrics are meant to measure similarity
between concepts that reside within the same ontology. These metrics can be
roughly divided into distance-based, information-based and feature-based.
Distance-based Metrics
Distance-based metrics take advantage of the ontological topology to com-
pute the similarity between concepts. This method requires viewing the
ontology as a rooted Directed Acyclic Graph (DAG), in which nodes are
concepts and edges among them are restricted to hierarchical relationships,
with the most usual type being “is-a” relationships. At the top, there is a
single concept, the root. The graph is directed, starting from a low-level con-
cept and directed towards its ancestors through transitive relationships. The
graph is also acyclic, since a finite path from a source node to a destination
node cannot return to the source node. In other words, a node can never be
a child of one of its children.
A simple look at an ontology from a geometric perspective may reveal
important information about the similarity of concepts. As depth in the DAG
increases, concepts become increasingly specific, thus similarity is expected
to increase. Another important characteristic of the ontology DAG is that
the path between concepts is not always unique, therefore distance-based
similarity will depend on which path is chosen. Finally, the density of nodes
is a good indicator of similarity; as density increases, concepts approach each
other and similarity increases.
The accuracy of distance-based methods depends on the level of detail
that the ontology captures. A poorly structured ontology with many omis-
sions might yield misleading similarity results. Fortunately, a lot of effort has
been made to make biomedical ontologies as complete as possible, therefore
network density in biomedical ontologies is usually high.
The most straightforward way to measure the similarity of concept nodes
is given in [Rada et al., 1989]. In that work by Rada et al., all edges are
assigned a unitary weight and the distance between two concepts is equal to
the number of edges that are present in their shortest path. Let us consider
two distinct concepts c1 and c2 in the hierarchy. Each path i that connects
these two concept nodes may be represented as a set which includes all edges
ek present in the path, i.e.
path_i(c_1, c_2) = \{e_1, e_2, \ldots, e_K\},    (3.15)
with cardinality |pathi(c1, c2)| = K. The distance between concepts c1 and
c2 is, then, equal to the shortest path that connects them, i.e.,
d_{rada}(c_1, c_2) = \min_{\forall i} |path_i(c_1, c_2)|.    (3.16)
Note that in the literature, there are cases (e.g. [Al-Mubaid and Nguyen, 2006])
where Rada’s measure is used with node counting instead of edge counting.
In those cases, each path is represented as a set of the nodes that compose
it, including the end nodes. The minimum distance can be converted into a
similarity metric, as in [Resnik, 1995]:
sim_{rada}(c_1, c_2) = 2D - d(c_1, c_2),    (3.17)
where D is the maximum depth of the taxonomy. This method fails to
capture the intuition that concept nodes, which reside at the lower part of
the hierarchy and are separated by distance d, are more similar than higher-
level nodes with the same distance separation d. Also, its success highly
depends on the uniformity of edge distribution within the ontology. For
these reasons, other approaches have been proposed in order to achieve a
more representative score of similarity.
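The shortest-path computation behind Rada's distance can be sketched as a breadth-first search over the undirected view of the hierarchy. The concept names and edges below are an assumed toy example, not a real ontology:

```python
from collections import deque

# Hypothetical is-a edges: child -> list of parents.
PARENTS = {
    "disease": [],
    "cancer": ["disease"],
    "infection": ["disease"],
    "carcinoma": ["cancer"],
    "viral infection": ["infection"],
}

def rada_distance(c1: str, c2: str) -> int:
    # Edge-counting shortest path, treating hierarchical edges as undirected
    # [Rada et al., 1989].
    neighbours = {c: set(ps) for c, ps in PARENTS.items()}
    for child, parents in PARENTS.items():
        for p in parents:
            neighbours[p].add(child)
    queue, seen = deque([(c1, 0)]), {c1}
    while queue:
        node, dist = queue.popleft()
        if node == c2:
            return dist
        for nxt in neighbours[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    raise ValueError("concepts are not connected")
```

In this toy hierarchy, “carcinoma” and “viral infection” are four edges apart (via cancer, disease, infection).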
In [Wu and Palmer, 1994], the relative depth of the compared concepts
in the hierarchy is considered. In that work, Wu and Palmer introduce the
Least Common Subsumer (LCS) of the compared concepts. The LCS is the
deepest ancestor node common to both concepts, lying on the shortest
path between them. Similarity for concepts c1 and c2 is then given
as:
sim_{w\&p}(c_1, c_2) = \frac{2h}{N_1 + N_2 + 2h},    (3.18)
where N_1 is the number of nodes in the path between concept c_1 and the
LCS¹, N_2 is the number of nodes between concept c_2 and the LCS, and h is
the minimum depth of the LCS towards the root, measured again in number
of nodes.
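Following the node-counting convention above, the measure can be sketched on a hypothetical tree-shaped hierarchy fragment (concept names are illustrative):

```python
# Hypothetical tree: concept -> its single parent (None for the root).
PARENT = {
    "disease": None,
    "cancer": "disease",
    "infection": "disease",
    "carcinoma": "cancer",
}

def path_to_root(c: str) -> list:
    path = [c]
    while PARENT[path[-1]] is not None:
        path.append(PARENT[path[-1]])
    return path  # from c up to the root, endpoints included

def wu_palmer(c1: str, c2: str) -> float:
    p1, p2 = path_to_root(c1), path_to_root(c2)
    lcs = next(n for n in p1 if n in p2)   # deepest common ancestor
    h = len(path_to_root(lcs))             # depth of the LCS, in nodes
    n1 = p1.index(lcs) + 1                 # nodes from c1 to the LCS, inclusive
    n2 = p2.index(lcs) + 1
    return 2 * h / (n1 + n2 + 2 * h)
```

Note that under strict node counting the similarity of a concept with itself stays below one; edge-counting variants of the formula avoid this artifact.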
In [Li et al., 2003], the authors followed various strategies in their at-
tempt to calculate similarity as a function of the shortest path between the
compared concepts, the depth of their LCS and the local density of the on-
tology. They found that the best performance was obtained when they
used the following non-linear function:
sim_{li}(c_1, c_2) = e^{-\alpha\, d_{rada}(c_1, c_2)} \cdot \frac{e^{\beta h} - e^{-\beta h}}{e^{\beta h} + e^{-\beta h}},    (3.19)
where α, β are non-negative parameters and h = drada(LCS(c1, c2), root)
denotes the minimum depth of the LCS. Distances are measured in number
of edges.
Al-Mubaid and Nguyen attempt to combine path length and node depth
in one measure. In [Al-Mubaid and Nguyen, 2006], they view the DAG as
a composition of clusters, with each cluster having as root a child of the
ontology root. The usage of clusters aims to exploit local characteristics
of different branches. Given concepts c1 and c2, they first compute their
so-called common specificity:
CSpec(c_1, c_2) = D_c - h,    (3.20)
¹ Start and end nodes of the path are also included in the calculation.
where Dc denotes the depth of the specific cluster and h refers to the depth of
the LCS in the ontology, with both quantities measured in number of nodes.
Then similarity is computed as:
sim_{a\&n}(c_1, c_2) = \log\!\left((Path - 1)^{\alpha} \times (CSpec)^{\beta} + k\right),    (3.21)
where Path is a modified version of Rada’s distance measure which is adapted
according to the largest cluster, and α, β, k are constants, whose default
values are unitary.
Information-Based Metrics
One of the first attempts to focus on nodes in the similarity formula is that
of Leacock and Chodorow [Leacock and Chodorow, 1998]. This method uses
negative log likelihood in a way that resembles the formula of self-information
[Cover and Thomas, 2012], but does not really involve valid probability. In-
stead, a normalized form of the path length between the concepts is used:
sim_{l\&c}(c_1, c_2) = -\log(N_p / 2D),    (3.22)
where Np is the number of nodes in the shortest path between concepts c1
and c2. This variable also includes the end nodes.
Resnik, in [Resnik, 1995], continues down this path by replacing the nor-
malized path length with a probability measure P(·) to calculate the infor-
mation content (IC) of a concept. He considers all common subsumers CSi
of concepts c1 and c2 and calculates similarity as:
sim_{resn}(c_1, c_2) = \max_{\forall i} \left[-\log(P(CS_i))\right],    (3.23)
or, equivalently,
sim_{resn}(c_1, c_2) = -\log(P(LCS)).    (3.24)
Considering that the IC of a concept c is defined as the negative logarithm
of its probability, i.e. IC(c) = -\log(P(c)), equation (3.24) can also be written
as:
sim_{resn}(c_1, c_2) = IC(LCS(c_1, c_2)).    (3.25)
Probabilities are estimated with the help of a text corpus, i.e. a collection of
natural language excerpts, specifically chosen to provide a good representa-
tion of actual term usage. When dealing with biomedical ontology concepts,
collections of PubMed² abstracts are commonly used as corpora to determine
the probability of each concept.
Given a corpus, the occurrence of a term which corresponds to concept c
essentially implies the occurrence of each and every concept that subsumes
c within the ontological structure. Conversely, the number of occurrences
of a concept c depends not only on the number of appearances of c itself in
the corpus, but also on every occurrence of its descendants in the hierarchy.
Thus, the number of occurrences of concept c is given by:
occ(c) = \sum_{\forall n \in subsumed(c)} count(n),    (3.26)
where subsumed(c) represents c together with its descendant concept nodes,
and count(·) denotes the number of occurrences of the specific concept within
the given corpus. Converting occurrences to probability can be done using:
P(c) = \frac{occ(c)}{N},    (3.27)
where N is the total number of occurrences of ontology terms in the corpus.
This method results in higher probabilities for concepts residing at the top
part of the hierarchy, with the root having unitary probability. Therefore,
concepts whose LCS lies lower in the hierarchy are more similar, since their
LCS has low probability (i.e., high IC).
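The upward propagation of counts in eq. (3.26) and the conversion to IC can be sketched on an assumed toy hierarchy with invented corpus counts:

```python
import math

# Hypothetical hierarchy (child -> parent) and raw corpus counts.
PARENT = {"disease": None, "cancer": "disease", "carcinoma": "cancer"}
COUNTS = {"disease": 10, "cancer": 6, "carcinoma": 4}

def occ(c: str) -> int:
    # Occurrences propagate upwards: a concept counts every appearance
    # of itself and of all its descendants (eq. 3.26).
    return COUNTS[c] + sum(occ(child) for child, p in PARENT.items() if p == c)

def information_content(c: str) -> float:
    # The root subsumes everything, so P(root) = 1 and IC(root) = 0.
    total = occ("disease")
    return -math.log(occ(c) / total)
```

Here occ("disease") = 10 + 6 + 4 = 20, so the root has zero IC while "carcinoma" has IC = −log(4/20).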
A possible drawback of this method is that probabilities are tied to the
choice of corpus. So far, in the biomedical domain, there is no widely accepted
corpus that covers the domain needs [Al-Mubaid and Nguyen, 2006]. This
is due to the fact that thousands of new terms and abbreviations appear in
the literature every year, thus a stable corpus might not function well. Since
extensions of the corpus would need to be considered at fixed intervals, it
might not serve as a useful benchmark.
Alternatively, computation of IC can be performed without the use of
a corpus, by solely relying on the structure of the ontology DAG. Intrinsic
2http://www.ncbi.nlm.nih.gov/pubmed
computation of IC involves approximating the occurrence probability of a
concept as a function of multiple variables, such as number of descendant
nodes, number of subsumers or number of descendant nodes which are leaves
in the ontology. In [Seco et al., 2004], the IC of a concept c is given by:
IC_{seco}(c) = 1 - \frac{\log(descendants(c) + 1)}{\log(allConcepts)},    (3.28)
where descendants(c) returns the number of nodes that concept c subsumes,
and allConcepts denotes the number of all the available concepts in the
ontology.
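Eq. (3.28) needs only two structural counts per concept; a one-function Python sketch (ours):

```python
import math

def ic_seco(descendants: int, all_concepts: int) -> float:
    # Intrinsic information content from the ontology topology alone
    # [Seco et al., 2004]: leaves get IC = 1, the root approaches 0.
    return 1 - math.log(descendants + 1) / math.log(all_concepts)
```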
The IC function introduced by Seco et al. has the drawback that it assigns
an IC equal to one to every leaf node in the ontology, and that concepts
containing the same number of descendant nodes are always given the same
IC. An attempt to distinguish the IC between leaf concepts was made in
[Zhou et al., 2008], by also including the depth of the node in the calculation,
normalized by the maximum depth of the ontology. The proposed IC formula
is given by:
IC_{zhou}(c) = k\, IC_{seco}(c) + (1 - k) \frac{\log(depth(c) + 1)}{\log(maxDepth)},    (3.29)
where depth(c) represents the depth of concept c in the hierarchy, maxDepth
is the maximum depth of the ontology, measured in number of nodes, and k
is a weighting constant.
The authors in [Sanchez et al., 2011] further improve the modeling of the
IC function. In that work, the IC function can also distinguish concepts that
contain the same number of descendants, due to the fact that the number of
subsumers of a concept is also used. The IC is given as:
IC_{san}(c) = -\log\!\left( \frac{\frac{leaves(c)}{ancestors(c)} + 1}{allLeaves} \right),    (3.30)
where leaves(c) is the number of nodes that are descendants of c and have no
children, ancestors(c) refers to the number of concepts which subsume c and
allLeaves denotes the total number of leaf nodes in the ontology. The IC
functions of equations (3.28), (3.29) and (3.30) can be used in equation (3.25)
to compute the similarity between two concepts without using a corpus.
Lin uses IC in an alteration of the similarity metric presented in
[Wu and Palmer, 1994]. More specifically,
sim_{lin}(c_1, c_2) = \frac{2\, sim_{resn}(c_1, c_2)}{IC(c_1) + IC(c_2)}.    (3.31)
This approach aims to include the individual characteristics of the compared
nodes that Resnik’s approach neglected. Indeed, in Resnik’s measure, any
two pairs of nodes that have the same LCS produce the same similarity.
Jiang and Conrath follow an approach similar to [Wu and Palmer, 1994],
but avoid the scaling of similarity [Jiang and Conrath, 1997]. Instead, they
use a distance metric as follows:
d_{j\&c}(c_1, c_2) = IC(c_1) + IC(c_2) - 2\, sim_{resn}(c_1, c_2).    (3.32)
Various transformations have been applied to convert this distance to simi-
larity. Among these, the authors in [Seco et al., 2004] consider a linear trans-
formation and present the following formula of similarity normalized in the
interval [0,1]:
sim_{j\&c}(c_1, c_2) = 1 - \frac{d_{j\&c}(c_1, c_2)}{2}.    (3.33)
Another example can be found in [Zhu et al., 2009], in which an exponential
function is used for the similarity formula, along with a constant λ that
accounts for curve steepness:
sim_{j\&c}(c_1, c_2) = e^{-d_{j\&c}(c_1, c_2)/\lambda}.    (3.34)
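The distance of eq. (3.32) and both conversions can be sketched in a few lines; the exponential transform is written here as a decaying function of distance, which is the reading consistent with similarity shrinking as d grows:

```python
import math

def jiang_conrath_distance(ic1: float, ic2: float, ic_lcs: float) -> float:
    # ic_lcs plays the role of sim_resn(c1, c2) in eq. (3.32).
    return ic1 + ic2 - 2 * ic_lcs

def sim_linear(d: float) -> float:
    # Linear transform of eq. (3.33); assumes IC values normalized so d <= 2.
    return 1 - d / 2

def sim_exponential(d: float, lam: float = 1.0) -> float:
    # Exponential transform; lam controls the curve's steepness.
    return math.exp(-d / lam)
```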
Feature-Based Measures
Feature-based measures do not necessarily conform to the similarity met-
ric rules of [Chen et al., 2009], as they allow for similarity asymmetry. In
feature-based techniques, the two compared concepts are viewed as sets of
features, in contrast to the geometric view presented in previous sections.
To calculate similarity, not only the common features of the concepts are
taken into account, but also the differences between them. That way, com-
mon features improve similarity, while different features penalize its value
[Tversky et al., 1977]. Given concepts c1 and c2, let C1 and C2 denote the
sets that contain their features. Then, similarity between the two can be
given as:
sim_{tve}(c_1, c_2) = \frac{|C_1 \cap C_2|}{|C_1 \cap C_2| + \mu|C_1 - C_2| + (1 - \mu)|C_2 - C_1|},    (3.35)
where µ is a weight which takes values in [0, 1]. In [Rodríguez et al., 1999],
the µ parameter is computed as follows:
\mu =
\begin{cases}
\frac{d(c_1, LCS)}{d(c_1, c_2)}, & d(c_1, LCS) \leq d(c_2, LCS) \\[4pt]
1 - \frac{d(c_1, LCS)}{d(c_1, c_2)}, & \text{else}
\end{cases}    (3.36)
This asymmetric function stems from Tversky’s observation that similarity
might not be symmetric. In one of Tversky’s examples, North Korea was
said to be more similar to Red China than the reverse.
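Eq. (3.35) translates directly into Python over feature sets; the asymmetry appears as soon as µ ≠ 1/2 (or, as below, µ = 1):

```python
def tversky_similarity(C1: set, C2: set, mu: float = 0.5) -> float:
    # Common features raise the score; features unique to either concept
    # lower it, weighted by mu and (1 - mu) respectively (eq. 3.35).
    common = len(C1 & C2)
    return common / (common + mu * len(C1 - C2) + (1 - mu) * len(C2 - C1))
```

With mu = 1, comparing a feature-rich concept to a sparse one gives a lower score than the reverse comparison, mirroring Tversky's North Korea / Red China observation.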
3.3.2 Inter-ontology Semantic Similarity
Inter-ontology semantic similarity measures try to quantify the similarity
between concepts that belong to different ontologies. Fairly little research
has been documented in this area, due to the inherent difficulty of com-
paring heterogeneous structures. A common approach is to combine the
different ontologies into a single ontology through detailed concept mappings
[Gangemi et al., 1998]. It is clear that this is very challenging and requires
the help of a domain expert, as well as plenty of time and effort. Fur-
thermore, not all biomedical terminologies are consistent and their lack of
homogeneity is a major problem. Simpler approaches have been proposed in
the literature. A usual first step is to merge the different ontologies under
a dummy root. This approach is found in [Rodriguez and Egenhofer, 2003],
where the authors use a weighted version of Tversky’s similarity which also
takes into account geometrical features of the ontologies. A similar route
is followed by [Petrakis et al., 2006], where the authors substitute Tversky’s
similarity with a form of Jaccard similarity. The drawback of these cross-
similarity metrics is that they do not consider term overlap in both ontolo-
gies. Other methods rely on extensions of single ontology similarity metrics.
Examples of such work can be found in [Al-Mubaid and Nguyen, 2006] and
[Sanchez et al., 2012].
Chapter 4
Search Interfaces
Search has become one of the most commonly used tools for computer
users. It can be found everywhere, from stand-alone web-based search engines
to embedded search forms that appear in desktop applications and websites.
To a large extent, success of the search procedure depends on the users’
ability to formulate their information needs, transforming them into queries
that are highly likely to produce desired results. For this reason, a lot of
effort has been spent on improving the search interfaces and providing tools
that will enhance user experience. In this chapter, the basic characteristics
of successful search interface design are presented, with main focus on web-
search interfaces.
4.1 Information Seeking Models
Information seeking models attempt to recognize and describe the strategies
followed by humans from the moment they sense a search need until the
moment they acquire desired results. The search procedure may be viewed as
a repetition of actions. In [Sutcliffe and Ennis, 1998], the authors identify the
following four actions in what is considered the standard model of information
seeking:
1. Problem Identification
2. Articulation of Need
3. Query Formulation
4. Evaluation of Results
The first step refers to conceptualization of the search need, while the second
step involves expressing this need in words. The third step requires the user
to transform the articulated need into a format that will be accepted by the
underlying search system. Finally, the fourth step refers to the procedure
of judging the results critically, exploiting any relevant domain knowledge
and deciding whether the need is satisfied. A search may be characterized
as ‘ok’, ‘failed’ or ‘unsatisfactory’. An ‘ok’ search ends the cycle successfully.
An ‘unsatisfactory’ search may lead to reformulation of the query or re-
articulation of the need, while a completely ‘failed’ search might require
re-identification of the problem.
Sutcliffe and Ennis’s model assumes that the need does not change, unless
results are disappointing. It does not capture the fact that users learn as they
search. This dynamic aspect of information seeking was captured in an earlier
work by Bates [Bates, 1989]. In that study, the user’s needs are assumed to
change as the process advances. Furthermore, Bates claims that the success
of the search procedure does not only depend on the final list of results, but
on the selections made along the way. This model is referred to as the berry-
picking model, to denote that it does not result in a single set of results. A
simple example of the berry-picking model can be illustrated when a user
attempts a broad query such as “String similarity algorithms” and refines
the query to “Jaro similarity” after viewing this result in the initial result
list.
4.2 Query Specification
Queries are usually specified through rectangular entry forms, as in Fig. 4.1.
The width of these forms varies in size, with studies showing that wider
forms promote formulation of longer queries [Franzen and Karlgren, 2000,
Belkin et al., 2003]. It has been observed that around 88% of search queries
are composed of 1 to 4 words, with mean length equal to 2.8 words per query
[Jansen et al., 2007]. The actual search is executed by pressing the return
Figure 4.1: The Google search engine entry form.
Figure 4.2: Facebook uses grayed-out descriptive text to help in the formulation of user
queries.
key or mouse-clicking a specified button (e.g. magnifying glass in Bing). In
some cases, entry forms decorate their background with descriptive text that
provides guidance for the user. An example is Facebook’s search form, as
seen in Fig. 4.2. The text disappears once the user clicks inside the form.
This usually helps to narrow down the search domain.
After query submission, processing of the query takes place before any
attempt to retrieve results. This process may include removal of stopwords
(i.e. words with high appearance probability such as ‘the’, ‘a’), normalization
of words (e.g. plural to singular) and permutation of word order. Boolean
logic may also be used in the case of multiple words per query. Returning
results that contain all query words (i.e. Boolean AND operator) seems more
intuitive, although this might sometimes lead to overly specific queries that
return no results. The actual types of processing are often hidden from the
users, in an attempt to avoid confusion and promote transparency, while
hiding implementation details [Muramatsu and Pratt, 2001].
Most modern search interfaces are equipped with dynamic search sug-
gestion, also known as auto-completion (See Fig. 4.3). As the user starts
typing, a list of term suggestions appears under the entry form. The sugges-
tions contained in the list are usually queries whose prefix matches what has
Figure 4.3: Bing’s search interface features a powerful dynamic search suggestion, where
prefixes are highlighted with grayed-out font and the remaining text is in bold.
been typed so far, although there are cases where interior matches are also
included. The user can then mouse-click the most relevant query or navigate
through the list, using keyboard arrows. Studies have shown that approxi-
mately one third of all search attempts in the Yahoo Search Assist were per-
formed through a dynamically suggested query [Anick and Kantamneni, 2008].
The dynamic search suggestion technique attempts to minimize unneeded
typing from the user side and can alleviate spelling errors early. Most im-
portantly, though, it reassures the user that results are available, so there is
no frustration from empty result pages.
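The core of prefix-based suggestion can be sketched as a binary search over a sorted query log; the list contents below are illustrative only:

```python
import bisect

def prefix_suggestions(sorted_queries: list, typed: str, limit: int = 5) -> list:
    # Binary search delimits the contiguous run of entries that share the
    # typed prefix; "\uffff" acts as an upper sentinel for the range.
    lo = bisect.bisect_left(sorted_queries, typed)
    hi = bisect.bisect_right(sorted_queries, typed + "\uffff")
    return sorted_queries[lo:hi][:limit]
```

Real systems layer ranking, interior matching and personalization on top, but the prefix lookup itself stays this simple.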
An important point to consider is that searchers often return to their pre-
viously accessed information. In the empirical study undertaken by Tauscher
and Greenberg [Tauscher and Greenberg, 1997], it was found that there is a
58% chance that the next web page to be visited had been visited before.
A more recent study [Zhang and Zhao, 2011] about tabbed browsing, con-
ducted in 2010, also finds page revisitation to be around the same levels,
at 59.3%. Various tools exist to help users find their intended pages, in-
cluding URL history, bookmarking of pages, basic navigation buttons (e.g.
‘Back’ button for short term page revisit) and change of URL font color if
Figure 4.4: The Safari browser’s embedded search interface explicitly states which queries
are suggestions and which belong to the user’s recent search history.
Figure 4.5: The Firefox browser’s embedded search interface contains recent queries on
top, and separates them from suggestions using a solid line.
page has already been visited. Among other methods documented, users
may save whole webpages to their local disk or keep URLs in text docu-
ments, after enriching them with comments [Jones et al., 2002]. Interest-
ingly, a common approach to revisiting documents is actually re-searching
for them [Obendorf et al., 2007]. Users who adopt this strategy attempt to
re-create the conditions of their previous search, by trying to formulate the
exact same query. Another strategy requires past search queries to appear
Figure 4.6: Google’s search results page is a typical scrollable vertical list of captions.
Metadata facets, which restrict results to a particular type of information, are also present
in the interface (e.g. the ‘Images’ tab).
as the user types, along with regular dynamic term suggestion. Separation
between suggested queries and previously generated ones varies among inter-
faces, as can be seen in Figures 4.4 and 4.5.
4.3 Presentation of Search Results
Search applications usually present results as a vertical list of captions, dis-
tributed along multiple pages (see Fig. 4.6). Each caption is a clickable
entity which, as a minimum requirement, comprises a title and an excerpt of
the target document [Clarke et al., 2007]. Usually, the excerpt includes some
or all of the query terms, as highlighted text. In most cases, highlighting is
performed using bold font or colored term background. Many search applica-
tions tend to group similar results that originate from the same source into
the same caption. That way, result ‘pollution’ from a few sources is avoided
and diversity is promoted. The relevance of search results is reflected in
their order of appearance. Although relevance scores were formerly used to
grade the fit of the result to the query, they are usually not present anymore
in modern search applications. The reasons behind their omission might
be to avoid reverse-engineering of the ranking algorithms and to reduce re-
dundancy, since the ranking itself already reflects the importance of results
[Hearst, 2009].
It has been observed that users tend to click on the uppermost captions
[Joachims et al., 2005]. In the same study, it was found that the first cap-
tion received more attention than its successors, even if its relevance was
actually lower. Furthermore, the majority of users often remain on the first
page of results. The authors in [Jansen et al., 2007] observed that only 30%
continued to look for relevant results in the second page of the results, and
only 15% looked even further. Usually, the patience of a user is a function
of his/her experience in using the system. More experienced users tend to
be more patient than users who are not accustomed to the search procedure.
Inexperienced users, on the other hand, often prefer to refine their query
or simply accept that what they search for cannot be found by the search
application [Hearst, 2009].
Apart from plain lists of results, further organization of captions may be
performed, using some form of faceted browsing. Facets attempt to refine
search results, according to their characteristics. As an example, Amazon’s
search interface provides facets that correspond to the different departments
that might contain the desired item (see Fig. 4.7).
4.4 Query Reformulation
It is common that desired search results are not discovered with the first
try. Query reformulation is the procedure which attempts to transform the
original query to a format that will match the information retrieval sys-
tem’s vocabulary. Studies using query logs have shown that the number
of reformulated queries may reach up to 52% of all queries [Jansen et al., 2005].
It has been observed that, if no help for query reformulation is
given explicitly by the search application, users tend to provide simple alter-
ations of the initial query [Hertzum and Frøkjær, 1996]. This bias towards
initial queries is referred to as anchoring, a term coined by psychologists
[Tversky and Kahneman, 1975].
Figure 4.7: Amazon’s search interface provides facets as a left panel to the results page,
helping the user dynamically refine the initial search.
One of the most common sources of search failure is query mistyping
[Cucerzan and Brill, 2004]. A common approach, which aims to correct ty-
pographical errors, is using a dictionary and finding the most similar term
to the erroneous query [Kukich, 1992]. Among other techniques mentioned
in that work are heuristic rule-based corrections, probabilistic approaches
that determine how often specific sequences of characters are spelt wrong,
and neural network models that train the system to automatically identify
errors. The outcome of the reformulation procedure may be shown explicitly
on the interface as a suggested query (e.g. Google’s ‘Did you mean’), or be
implicitly shown in the results. The former approach is preferred, since it
gives users freedom to decide whether their intent is actually captured in
the proposed correction. More recently, distributional approaches that take
advantage of user query logs are preferred, especially by web-based search
engines [Li et al., 2006].
Another dimension of query reformulation is term expansion. Term ex-
pansion refers to the suggestion of queries that relate to the initial one in some
way. Choice of related queries might take the form of thesaurus-based term
substitution [Dennis et al., 1998] or attempt to extend the present query,
Figure 4.8: PubMed’s results page includes term expansion in two ways. On the right of
the screen, there is a ‘Related searches’ panel that preserves the initial query and adds a
new related term to it. Also, right below the entry form there is a ‘See also’ feature which
suggests complete or partial modifications of the initial query.
usually by adding single words (see Fig. 4.8). Query suggestion might also
be fetched from sessions of users who previously searched for the same infor-
mation. It has also been proposed that search applications ask the user to
provide relevance feedback [Ruthven and Lalmas, 2003]. Although theoreti-
cal studies approve of this feature, its appearance in commercial applications
is rare.
Chapter 5
Design
The main design requirement from AstraZeneca is a “Google-like” search ap-
plication that handles ontologies behind the scenes and provides visual
tools that help users choose the most appropriate term(s). In this
chapter, the previously used search application within AstraZeneca is briefly
described, along with examples and justifications of query failures. Further-
more, the methodology to be followed for improving the implementation is
analyzed.
5.1 AstraZeneca’s Search Application
The ontology search application used by AstraZeneca is integrated into a text
mining application. The user searches for terms which belong to ontologies
and the most relevant ones are used for searching medical documents. The
search application appears as a pop-up window once the user clicks on certain
fields of the text mining application. It includes a form, in which users
type their query, and a results page which lists result entries vertically. The
results page is organized as a two-column table. Each row entry includes
the preferred name for a concept (i.e., left column entry) and the list of
children or synonyms for the specific concept (i.e., right column entry). The
searcher can pick one or more of the table entries that correspond to ontology
concepts, and these terms are, in turn, fed back to the text mining application
for further processing.
Table 5.1: Documented failed queries and suggested reasons for their failure.

Query: Hepatotoxicity
Comments: Searcher did not find the term and decided to search online to find a synonym for it and reformulate the query as ‘Liver Disease’.
Suggested reason for failure: Wrong ontology choice by the user. The term is clearly in MedDRA. It is also a preferred name, so the application would have found it.

Query: NSCLC
Comments: The acronym refers to ‘Non-Small Cell Lung Carcinoma’, a concept which is listed in NCIT. Search returned no results.
Suggested reason for failure: Although the abbreviation ‘NSCLC’ is documented in NCIT, it is not a preferred name, so it was bypassed by the program.

Query: DIHS
Comments: Searcher expected the concept ‘Drug-induced hypersensitivity syndrome’ in MedDRA. No results were returned.
Suggested reason for failure: DIHS does not appear as an abbreviation in MedDRA, so this behavior was normal. The searcher needed to explicitly specify the preferred name, which is ‘Drug-induced hypersensitivity syndrome’.

Query: DRESS Syndrome
Comments: Refers to the same concept as DIHS. It was not found.
Suggested reason for failure: The term exists as an LLT in MedDRA. The application searched only for PTs in that ontology. The PT for ‘DRESS Syndrome’ is ‘Drug rash with eosinophilia and systemic symptoms’.

Query: VEGFR
Comments: Searcher came across multiple returned terms and did not know which one(s) to choose. Therefore, all were chosen.
Suggested reason for failure: The application does not help the user visualize possible relationships among results (e.g. hyponymy). If it did, the user would only choose the broader term and would not be confused.

Query: LHRH
Comments: The most relevant result was ‘Gonadotropin Releasing Hormone’. The searcher did not know that term, so again used internet search.
Suggested reason for failure: The preferred term for ‘LHRH’ is ‘Gonadotropin Releasing Hormone’. The searcher did not have the background knowledge to grade the relevance of the two.

Query: NMDA Antagonist
Comments: The searcher wanted to find a list of the different NMDA antagonists. What was returned was the definition of ‘NMDA Receptor Antagonist’.
Suggested reason for failure: This is an ontology organization characteristic. For example, the NMDA antagonist ‘Ketamine’ is listed in NCIT as a subclass of ‘Anesthetic Substance’, while ‘Aptiganel’ is listed as a subclass of ‘Neuroprotective Agent’.
Although no log file containing extensive lists of query failures is avail-
able for AstraZeneca’s search application, examples of failed queries have
been given. The reasons behind query failure are diverse; Table 5.1 lists
some of the most characteristic failed queries, along with given or deduced
justifications for the reason of failure. It is clear that failure of some queries
was due to the content of the ontologies, therefore inevitable. Changing the
interface would improve many cases of failure, though. Other causes of fail-
ure included wrong ontology chosen by the user, incomplete term coverage
by the search application, lack of help and guidance from the system (e.g.,
relevance feedback or result visualization).
5.2 Design Considerations
In this section the modeling choices for the search engine and its interface
are presented. The actual coding is performed in the Java programming
language, and Graphical User Interface (GUI) design is done entirely in Java Swing.
The main reasons behind this choice were the cross-platform nature of Java
and the wide availability of high-performance Application Programming In-
terfaces (APIs) (e.g. OWL Java API [1], Patricia Trie API [2], OntoCAT API [3]).
Furthermore, the goal of this thesis is not an enterprise-strength application,
but a proof of concept.
5.2.1 Ontology Access
There are two alternatives for accessing biomedical ontologies. The first
is to process each ontology’s file locally, while the second is to access
ontologies through the Bioportal [4] Representational State Transfer (REST)
services online.
The current state of implementation (see Section 5.5) uses a local copy
of the NCIT, in OWL format, and the Java OWL API is chosen to extract
useful information about its contents. The problem with this approach is
[1] http://owlapi.sourceforge.net/
[2] http://code.google.com/p/patricia-trie/
[3] http://www.ontocat.org/
[4] http://bioportal.bioontology.org/
that, when loading large ontologies with the Java OWL API, there is a long
time delay (approximately 30 seconds for the NCIT with 2 GB of RAM) and
the whole ontology remains in memory for the duration of the program.
Therefore, the extensibility of this approach to many large ontologies and the
usability of the application to a novice user are questioned. Various attempts
to provide a database backend for multiple ontologies have been documented
(e.g., see [Iordanov, 2010, Henß et al., 2009]), but loading ontologies into a
local database takes time, requires a private server and, if reasoning is to be
performed, whole ontologies must be brought back into main memory anyway.
Fortunately, we do not need to worry about providing a database backend,
since this work has been done already in Bioportal. Queries about ontologies
may be performed online, using the provided REST web API. Available
REST services include getting all parents, children or properties of a term,
getting all terms of an ontology and getting all paths between a term and
the root. Results are returned in XML format, which is easily parsable. A
Java API for accessing terms from Bioportal is also available, by the name
OntoCAT. The second phase of the implementation will consider using these
Bioportal services to access ontologies.
5.2.2 Ontology Manipulation
Manipulation of ontologies will be performed in multiple stages. Firstly, all
available terms of each ontology will be retrieved and all their word-wise per-
mutations will be saved in a Patricia trie structure, also known as a radix tree.
This type of structure allows for fast prefix-based retrieval. In
our implementation, it will be used for quick auto-completion. The Patricia
trie for a given ontology needs to be built only once and will be saved to
file for future reference. The second type of ontology manipulation is the
computation of semantic similarity for each pair of concepts. This procedure
will involve exploiting the ontology as a DAG and is expected to be com-
putationally intensive. Fortunately, it also needs to be performed once per
ontology and the results can be stored in one or more text files as a trian-
gular (i.e. due to symmetry) matrix. A final form of ontology manipulation
will be to access specific concepts. Once the user chooses a term from the
list of results, the contents of that term must be retrieved. In the case of
using a local OWL file, information about the class with the given IRI can
be fetched using the OWL API, while in the case of using Bioportal REST
services, information about the term can be retrieved by forming an HTTP
request which includes the term’s accession code.
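The permutation indexing described above can be sketched as follows. For illustration, a sorted set stands in for the Patricia trie (both support prefix retrieval; the trie is simply more compact), and all class and method names are assumptions rather than the thesis implementation. A real implementation would also need to bound the number of permutations, which grows factorially with term length.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;

// Sketch of the permutation index: every word-wise permutation of each
// ontology term is stored, so that a query matches regardless of word order.
// A TreeSet stands in for the Patricia trie; both support prefix retrieval.
public class PermutationIndex {

    private final SortedSet<String> index = new TreeSet<>();

    // Insert all word-order permutations of a term into the index.
    public void addTerm(String term) {
        permute(new ArrayList<>(List.of(term.toLowerCase().split("\\s+"))),
                new ArrayList<>());
    }

    private void permute(List<String> remaining, List<String> chosen) {
        if (remaining.isEmpty()) {
            index.add(String.join(" ", chosen));
            return;
        }
        for (int i = 0; i < remaining.size(); i++) {
            List<String> rest = new ArrayList<>(remaining);
            String word = rest.remove(i);
            List<String> next = new ArrayList<>(chosen);
            next.add(word);
            permute(rest, next);
        }
    }

    // Return every indexed permutation that starts with the given prefix.
    public List<String> prefixMatches(String prefix) {
        String p = prefix.toLowerCase();
        List<String> hits = new ArrayList<>();
        for (String s : index.tailSet(p)) {
            if (!s.startsWith(p)) break; // sorted order: past all matches
            hits.add(s);
        }
        return hits;
    }

    public static void main(String[] args) {
        PermutationIndex idx = new PermutationIndex();
        idx.addTerm("Lung Carcinoma");
        System.out.println(idx.prefixMatches("carcinoma lu")); // prints [carcinoma lung]
    }
}
```

Because every permutation is present, the query ‘carcinoma lu’ reaches the term regardless of the word order the ontology uses.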
5.2.3 Search Entry Form
As mentioned in Section 4.2, queries are usually at most four words long.
That result reflects query specification in web-based search engines, where
users can search for any topic they wish. In the more granular biomed-
ical domain, users usually attempt more targeted searches. Furthermore, the
application to be deployed in this thesis is aimed at term searching, instead
of document searching. Thus, users are aware that they are searching for
short-length terms instead of multi-page documents, and it is likely that
queries are even shorter than the average 2.8 words. Indeed, the example
queries given by AstraZeneca are comprised of at most two words. Also, an
auto-completion feature will be provided, so lengthy terms will not need to
be typed, but simply chosen from a dynamic list. Despite the fact that short
queries are expected, a wide entry form is chosen, to resemble a “Google-like”
experience and provide better visibility for the auto-completion feature.
5.2.4 Result Calculation
Result calculation is performed at each key press. If a user presses
keys at a fast pace, results are calculated only for the most re-
cently submitted query and processing of any previous queries is immediately
terminated. The query is viewed as a bag of words and is compared to all
ontology terms, which are also viewed as bags of words.
Given a query of n words, an ontology term appears in the results if it
shares n−1 words with the query and the remaining word of the query prefix-
matches one word from the term. For example, given a query ‘carcinoma lu’
and the term ‘lung carcinoma’, the word ‘carcinoma’ is contained in both,
while ‘lu’ prefix-matches ‘lung’; therefore, ‘lung carcinoma’ is included in the
result set. In effect, the Boolean AND operator is applied. The above
procedure is equivalent to searching through all possible word permutations
of ontology terms to find the query as a prefix match. So, if we look for the
query ‘carcinoma lu’ as a prefix in ‘lung carcinoma’ and ‘carcinoma lung’, it
will indeed prefix-match the second permutation of the same term. Since all
permutations of terms are already saved in a Patricia trie (see Section 5.2.2),
determining the results is trivial and fast.
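The matching rule above can alternatively be expressed directly as a predicate over word bags, without the trie. The following is a minimal sketch with illustrative names, not the thesis implementation:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Sketch of the bag-of-words matching rule: a term matches a query of n
// words if n-1 query words appear in the term exactly and the remaining
// query word prefix-matches one of the term's words.
public class TermMatcher {

    public static boolean matches(String query, String term) {
        List<String> queryWords = Arrays.asList(query.toLowerCase().split("\\s+"));
        List<String> termWords = Arrays.asList(term.toLowerCase().split("\\s+"));

        // Try each query word in turn as the incomplete (prefix) word.
        for (int i = 0; i < queryWords.size(); i++) {
            List<String> pool = new ArrayList<>(termWords);
            boolean allExact = true;
            for (int j = 0; j < queryWords.size(); j++) {
                if (j == i) continue;
                if (!pool.remove(queryWords.get(j))) { // exact word match (AND)
                    allExact = false;
                    break;
                }
            }
            if (!allExact) continue;
            String prefix = queryWords.get(i);
            for (String w : pool) {
                if (w.startsWith(prefix)) return true; // prefix match
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(matches("carcinoma lu", "lung carcinoma")); // true
        System.out.println(matches("liver lu", "lung carcinoma"));     // false
    }
}
```

In practice the trie-based lookup is preferable, since it avoids scanning every ontology term on each key press.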
After the results list has been completed, it is cleaned of duplicate terms
that correspond to the same concept. Only one representative term is chosen
for each concept; this is not always the preferred term, but the term that
best matches the given query lexically. The results list is maintained until the
user presses the next key or interface button.
5.2.5 Error Correction
If no matches are found, a lexical similarity measure is applied and the closest
match is either directly returned or proposed as a correction and its adoption
is left to the user’s discretion. For cases where error correction proposals pro-
duce very low similarity scores (e.g. < 0.5 with maximum lexical similarity
equal to 1), the Boolean AND operator may be dropped and Boolean OR
may act as a softer replacement.
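The two-step fallback described in this section — propose the closest term when its similarity clears a threshold, otherwise relax the Boolean AND to OR — can be sketched as follows. The 0.5 threshold comes from the text above; the class and method names, and the use of a plain normalised Levenshtein similarity, are illustrative assumptions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Sketch of the no-match fallback: propose the lexically closest term as a
// correction if it is similar enough; otherwise relax the Boolean AND to OR.
public class FallbackSearch {

    // Classic dynamic-programming Levenshtein distance.
    public static int levenshtein(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                                   d[i - 1][j - 1] + cost);
            }
        }
        return d[a.length()][b.length()];
    }

    // Normalised Levenshtein similarity in [0, 1].
    public static double similarity(String a, String b) {
        int max = Math.max(a.length(), b.length());
        return max == 0 ? 1.0 : 1.0 - (double) levenshtein(a, b) / max;
    }

    public static List<String> handleNoMatch(String query, List<String> terms) {
        String best = null;
        double bestSim = -1.0;
        for (String t : terms) {
            double s = similarity(query.toLowerCase(), t.toLowerCase());
            if (s > bestSim) { bestSim = s; best = t; }
        }
        if (bestSim >= 0.5) {
            return List.of(best); // propose as a 'Did you mean' correction
        }
        // OR fallback: return any term sharing at least one word with the query.
        List<String> hits = new ArrayList<>();
        String[] queryWords = query.toLowerCase().split("\\s+");
        for (String t : terms) {
            List<String> termWords = Arrays.asList(t.toLowerCase().split("\\s+"));
            for (String w : queryWords) {
                if (termWords.contains(w)) { hits.add(t); break; }
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> terms = List.of("hepatotoxicity", "liver disease");
        System.out.println(handleNoMatch("hepatotoxcity", terms)); // [hepatotoxicity]
    }
}
```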
5.2.6 Results Presentation
If the user presses any key except ‘return’, the auto-completion function is
triggered as a pop-up window below the search entry form. To fill the auto-
completion window, a list of the most relevant terms is selected from the
results list and presented to the user, along with the relevance score against
the query. The list is fixed in size, holding a maximum of 10 terms. Ranking
is performed by comparing the query with each of the result terms lexically.
This comparison is the maximum of a character-based and a word-based
lexical similarity (e.g., maximum of Levenshtein and Jaccard). For lexical
similarity, the Java version of the Simmetrics [5] library is used.
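The ranking score can be sketched as follows. The thesis uses the Simmetrics library for the lexical measures; the hand-rolled normalised Levenshtein and word-level Jaccard below are illustrative stand-ins, and the class name is an assumption.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Sketch of the ranking score: the maximum of a character-based similarity
// (normalised Levenshtein) and a word-based one (Jaccard over word sets).
public class RelevanceScore {

    // Normalised Levenshtein similarity in [0, 1].
    public static double levenshteinSim(String a, String b) {
        int n = a.length(), m = b.length();
        int[][] d = new int[n + 1][m + 1];
        for (int i = 0; i <= n; i++) d[i][0] = i;
        for (int j = 0; j <= m; j++) d[0][j] = j;
        for (int i = 1; i <= n; i++) {
            for (int j = 1; j <= m; j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                                   d[i - 1][j - 1] + cost);
            }
        }
        int max = Math.max(n, m);
        return max == 0 ? 1.0 : 1.0 - (double) d[n][m] / max;
    }

    // Jaccard similarity over the two strings' word sets.
    public static double jaccardSim(String a, String b) {
        Set<String> wa = new HashSet<>(Arrays.asList(a.toLowerCase().split("\\s+")));
        Set<String> wb = new HashSet<>(Arrays.asList(b.toLowerCase().split("\\s+")));
        Set<String> inter = new HashSet<>(wa);
        inter.retainAll(wb);
        Set<String> union = new HashSet<>(wa);
        union.addAll(wb);
        return union.isEmpty() ? 1.0 : (double) inter.size() / union.size();
    }

    public static double relevance(String query, String term) {
        return Math.max(levenshteinSim(query, term), jaccardSim(query, term));
    }

    public static void main(String[] args) {
        // 'lung carcinoma' vs 'carcinoma lung': identical word sets, so the
        // word-based Jaccard is 1.0 even though the character-based score is lower.
        System.out.println(relevance("lung carcinoma", "carcinoma lung")); // prints 1.0
    }
}
```

Taking the maximum lets a word-order change (caught by Jaccard) and a small typo (caught by Levenshtein) both score highly.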
If the user presses ‘return’ or clicks the ‘Search’ button of the interface,
[5] http://sourceforge.net/projects/simmetrics/
results are shown in a table. The subset of terms that were screened using
lexical similarity is grouped either by semantic similarity or simply by
determining ancestor-descendant relationships. Suppose that, after the initial
screening, the results list contains a term together with some of its
descendants. Those descendants will not hold their own positions in the
result table. Instead, they will be
listed as subsumed terms of their most distant ancestor present in the results.
This choice is expected to further separate the first result in the ranking from
the rest in terms of relevance, and to make choice easier and less ambiguous
for the user. For example, given the term ‘NSCLC’ and its child term ‘Stage 0
NSCLC’, the latter will be presented within the former, implying a hyponymy
relation.
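The grouping step can be sketched as follows, assuming a parent relation is available from the ontology. A single-parent simplification is used here for brevity, and all names are illustrative; the real NCIT allows multiple parents, which would require walking a DAG instead of a chain.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Sketch of the grouping step: descendants whose ancestor also appears in
// the results are folded under that ancestor rather than listed separately.
public class ResultGrouper {

    // child -> parent (single-inheritance simplification for illustration)
    private final Map<String, String> parent = new HashMap<>();

    public void addIsA(String child, String par) {
        parent.put(child, par);
    }

    // Group each result under its most distant ancestor that is itself a result.
    public Map<String, List<String>> group(List<String> results) {
        Set<String> present = new HashSet<>(results);
        Map<String, List<String>> groups = new LinkedHashMap<>();
        for (String r : results) {
            String top = r;
            // Walk up the hierarchy, remembering the farthest ancestor present.
            for (String p = parent.get(r); p != null; p = parent.get(p)) {
                if (present.contains(p)) top = p;
            }
            groups.computeIfAbsent(top, k -> new ArrayList<>());
            if (!top.equals(r)) groups.get(top).add(r);
        }
        return groups;
    }

    public static void main(String[] args) {
        ResultGrouper g = new ResultGrouper();
        g.addIsA("Stage 0 NSCLC", "NSCLC");
        System.out.println(g.group(List.of("NSCLC", "Stage 0 NSCLC")));
        // prints {NSCLC=[Stage 0 NSCLC]}
    }
}
```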
5.2.7 Concept Presentation
Information about a concept will, at a minimum, include its unique accession
code in the given ontology, its preferred name, any available definition(s),
a list of synonyms and a group of highly similar terms. The grouping of
similar terms will be computed using semantic similarity metrics.
5.2.8 Navigation
A basic form of navigation through pages will be provided through “Back” and
“Forward” buttons, permitting movement within a fixed-size window of
previously visited pages. Furthermore, typical keyboard shortcuts will be present
(e.g. traversing the auto-completion list with keyboard arrow keys).
5.2.9 History
In its final form, the auto-completion function will also host a history fea-
ture. Previously attempted queries will be presented right above the query
suggestions, with a line dividing the two. The history function will use an
independent Patricia trie, updated with every query and saved
to file for future reference.
5.2.10 Feedback
In this first phase, a percentage which reflects (lexical) relevance of results to
the query is also included. Responses from AstraZeneca have been positive
for including this numerical indicator. Once semantic grouping of result
terms is implemented, the utility of the percentage will be reassessed.
5.3 Related Work
The most relevant work is Bioportal, a public web-based repository of biomed-
ical ontologies and terminologies [Noy et al., 2009]. Bioportal features a pow-
erful web search application, which scans multiple ontologies at once. The
main differences between our implementation and Bioportal’s are the following:
• Bioportal’s main search form, which most users use, does not supply an
auto-completion function. Auto-completion is supported as a widget for
individual ontologies, but the feature is not immediately evident to a
novice user of the site.
• Bioportal’s auto-completion feature proposes only preferred names. Start-
ing to type ‘L6 Antigen’ in Bioportal’s NCIT widget will present ‘Trans-
membrane 4 Superfamily Member 1’ in the auto-completion list, a result
whose interpretation depends on the searcher’s ability to judge that
the two indeed refer to the same concept. Our approach will present
‘L6 Antigen’ in the auto-completion menu.
• After performing a search, traversal of concepts in Bioportal depends
on the browser’s ‘Back’ and ‘Forward’ buttons and the user can easily
get lost. Our approach has taken into account navigational aspects of
usability.
• Bioportal does not offer a search history in its search form. Our imple-
mentation will propose past queries.
• Bioportal provides a visual representation of the parents and children
of a concept, if the user selects visualization. Our application’s in-
terface will include a term suggestion visualization based not only on
parent/child relationships, but on general semantic similarity scores.
5.4 Evaluation
Evaluation of the search application will be performed by AstraZeneca.
The new system will be compared to the old one on the same queries,
and user satisfaction or dissatisfaction will be documented.
Evaluation may take the form of a simple questionnaire or a description
of positive and negative feedback provided by the users themselves.
5.5 Current Implementation State
The current implementation uses the OWL file representation of the NCIT,
which is stored locally. Using the OWL Java API, all classes are retrieved and
their annotations exploited. From all annotations, the following are used:
• preferred name,
• synonym list (if present),
• definition (if present).
All preferred names and their synonyms are inserted into a Patricia trie,
in every possible permutation of their word sequence. The main interface is
shown in Fig. 5.1. It includes a wide search entry form, ‘Back’ and ‘Forward’
buttons for navigation, a ‘Clear’ button which clears the interface and results,
and a ‘Search’ button in case the user prefers using the mouse. The auto-
completion feature is already fully functional, as can be seen in Fig. 5.2. To
compute the relevance score for each term in the results list, its Levenshtein
and Jaccard similarity to the query are first evaluated. The maximum of
these similarities is, then, chosen to be the relevance score for the term. An
initial form of error correction is already present, as shown in Fig. 5.3. It
is computed by taking the Levenshtein similarity of the query to each ontology
term and choosing the highest-scoring term. The presentation of results is
shown in Fig. 5.4. It is still at an initial stage, and only lexical similarity
metrics are used. Result grouping according to semantic similarity is not yet
implemented.
Figure 5.1: The main window of the search application.
Figure 5.2: The auto-completion function appears immediately after the user presses a key.
The function works independently of the query word order. The top 10 most relevant results
are shown, together with percentages that indicate lexical similarity to the query.
Figure 5.3: Basic error correction is shown as a proposal, in case the query produces no
results.
Figure 5.4: The results page is a table of entries. Each entry contains the matching term
name, a relevance score, the preferred name for the concept and the ontology source.
Figure 5.5: The term description page currently presents the preferred term name,
definitions and synonym terms. The term chosen by the user to reach this screen
is highlighted (i.e. ‘Liver Cancer’). Note also that the ‘Back’ button is no longer
disabled and can be used to return to the search results.
Chapter 6
Conclusions and Future Work
Ontologies are expected to play a major role in the discovery of new knowl-
edge within the biomedical sector. Providing user-friendly tools that help
researchers navigate efficiently through ontologies without requiring from
them to understand about ontological principles is more likely to help them
reach their final goals quickly, without confusion and frustration. In this
report, proposals were made for enhancing the user experience in ontological
search, through a simple search interface that features enhanced searching
tools such as auto-completion, semantic grouping of results, query reformula-
tion and similar concept suggestion. The current state of implementation will
be further improved to account for multiple ontology searching and semantic
grouping/ranking of results. The choice between local OWL files and web-based
REST services for ontology access will also be reconsidered. Further changes
will also be made on the visual aspects of the interface. Future extensions of
the final outcome may include transforming it into a web-based application
and providing tools that allow for its integration with other applications,
especially those involving text mining.
Bibliography
[Al-Mubaid and Nguyen, 2006] Al-Mubaid, H. and Nguyen, H. A. (2006). A
cluster-based approach for semantic similarity in the biomedical domain.
In Engineering in Medicine and Biology Society, 2006. EMBS’06. 28th
Annual International Conference of the IEEE, pages 2713–2717. IEEE.
[Ananiadou and McNaught, 2006] Ananiadou, S. and McNaught, J. (2006).
Text mining for biology and biomedicine. Artech House Boston, London.
[Anick and Kantamneni, 2008] Anick, P. and Kantamneni, R. G. (2008). A
longitudinal study of real-time search assistance adoption. In Proceedings
of the 31st annual international ACM SIGIR conference on Research and
development in information retrieval, pages 701–702. ACM.
[Bates, 1989] Bates, M. J. (1989). The design of browsing and berrypicking
techniques for the online search interface. Online Information Review,
13(5):407–424.
[Belkin et al., 2003] Belkin, N. J., Kelly, D., Kim, G., Kim, J.-Y., Lee, H.-
J., Muresan, G., Tang, M.-C., Yuan, X.-J., and Cool, C. (2003). Query
length in interactive information retrieval. In Proceedings of the 26th an-
nual international ACM SIGIR conference on Research and development
in information retrieval, pages 205–212. ACM.
[Ceusters et al., 2005] Ceusters, W., Smith, B., and Goldberg, L. (2005). A
terminological and ontological analysis of the NCI Thesaurus. Methods of
Information in Medicine, 44(4):498.
[Chen et al., 2009] Chen, S., Ma, B., and Zhang, K. (2009). On the sim-
ilarity metric and the distance metric. Theoretical Computer Science,
410(24):2365–2376.
[Clarke et al., 2007] Clarke, C. L., Agichtein, E., Dumais, S., and White,
R. W. (2007). The influence of caption features on clickthrough patterns in
web search. In Proceedings of the 30th annual international ACM SIGIR
conference on Research and development in information retrieval, pages
135–142. ACM.
[Cover and Thomas, 2012] Cover, T. M. and Thomas, J. A. (2012). Elements
of information theory. Wiley-Interscience.
[Cucerzan and Brill, 2004] Cucerzan, S. and Brill, E. (2004). Spelling cor-
rection as an iterative process that exploits the collective knowledge of web
users. In Proceedings of EMNLP, volume 4, pages 293–300.
[Davis et al., 1993] Davis, R., Shrobe, H., and Szolovits, P. (1993). What is
a knowledge representation? AI magazine, 14(1):17.
[Dennis et al., 1998] Dennis, S., Robert, M., and Bruza, P. (1998). Searching
the world wide web made easy? the cognitive load imposed by query refine-
ment mechanisms. In Proceedings of ADCS 98 Third Australian Document
Computing Symposium, page 65.
[Franzen and Karlgren, 2000] Franzen, K. and Karlgren, J. (2000). Verbosity
and interface design. SICS Research Report.
[Gangemi et al., 1998] Gangemi, A., Pisanelli, D., and Steve, G. (1998). On-
tology integration: Experiences with medical terminologies. In Formal
ontology in information systems, volume 46, pages 98–94. IOS Press, Amsterdam.
[Gomaa and Fahmy, 2013] Gomaa, W. H. and Fahmy, A. A. (2013). Article:
A survey of text similarity approaches. International Journal of Computer
Applications, 68(13):13–18. Published by Foundation of Computer Science,
New York, USA.
[Gruber et al., 1995] Gruber, T. R. et al. (1995). Toward principles for the
design of ontologies used for knowledge sharing. International journal of
human computer studies, 43(5):907–928.
[Guarino, 1998] Guarino, N. (1998). Formal Ontology in Information Sys-
tems: Proceedings of the 1st International Conference June 6-8, 1998,
Trento, Italy, volume 46. IOS Press, Inc.
[Gusfield, 1997] Gusfield, D. (1997). Algorithms on strings, trees and se-
quences: computer science and computational biology. Cambridge Univer-
sity Press.
[Hearst, 2009] Hearst, M. (2009). Search user interfaces. Cambridge Uni-
versity Press.
[Henß et al., 2009] Henß, J., Kleb, J., Grimm, S., and Bock, J. (2009). A
database backend for OWL. In OWL: Experiences and Directions (OWLED
2009), CEUR Workshop Proceedings. CEUR-WS. org.
[Hertzum and Frøkjær, 1996] Hertzum, M. and Frøkjær, E. (1996). Browsing
and querying in online documentation: a study of user interfaces and the
interaction process. ACM Transactions on Computer-Human Interaction
(TOCHI), 3(2):136–161.
[Huang et al., 2010] Huang, C.-r., Calzolari, N., Gangemi, A., Lenci, A.,
Oltramari, A., and Prevot, L. (2010). Ontology and the Lexicon: A Natural
Language Processing Perspective. Cambridge University Press Cambridge.
[Hustadt et al., 1994] Hustadt, U. et al. (1994). Do we need the closed-world
assumption in knowledge representation. Working Notes of the KI, 94:24–
26.
[Iordanov, 2010] Iordanov, B. (2010). HyperGraphDB: a generalized graph
database. In Web-Age Information Management, pages 25–36. Springer.
[Jansen et al., 2007] Jansen, B. J., Spink, A., and Koshman, S. (2007). Web
searcher interaction with the dogpile.com metasearch engine. Journal of
the American Society for Information Science and Technology, 58(5):744–
755.
[Jansen et al., 2005] Jansen, B. J., Spink, A., and Pedersen, J. (2005). A
temporal comparison of AltaVista web searching. Journal of the American
Society for Information Science and Technology, 56(6):559–570.
[Jaro, 1989] Jaro, M. A. (1989). Advances in record-linkage methodology
as applied to matching the 1985 census of Tampa, Florida. Journal of the
American Statistical Association, 84(406):414–420.
[Jaro, 1995] Jaro, M. A. (1995). Probabilistic linkage of large public health
data files. Statistics in medicine, 14(5-7):491–498.
[Jiang and Conrath, 1997] Jiang, J. and Conrath, D. (1997). Semantic sim-
ilarity based on corpus statistics and lexical taxonomy. In Proc. of the
Int’l. Conf. on Research in Computational Linguistics, pages 19–33.
[Joachims et al., 2005] Joachims, T., Granka, L., Pan, B., Hembrooke, H.,
and Gay, G. (2005). Accurately interpreting clickthrough data as implicit
feedback. In Proceedings of the 28th annual international ACM SIGIR
conference on Research and development in information retrieval, pages
154–161. ACM.
[Jones et al., 2002] Jones, W., Dumais, S., and Bruce, H. (2002). Once
found, what then? a study of keeping behaviors in the personal use of
web information. Proceedings of the American Society for Information
Science and Technology, 39(1):391–402.
[Jurafsky and Martin, 2000] Jurafsky, D. and Martin, J. H. (2000). Speech
& Language Processing. Pearson Education India.
[Kukich, 1992] Kukich, K. (1992). Techniques for automatically correcting
words in text. ACM Computing Surveys (CSUR), 24(4):377–439.
[Leacock and Chodorow, 1998] Leacock, C. and Chodorow, M. (1998). Com-
bining local context and WordNet similarity for word sense identification.
WordNet: An electronic lexical database, 49(2):265–283.
[Levenshtein, 1966] Levenshtein, V. I. (1966). Binary codes capable of cor-
recting deletions, insertions, and reversals. Technical Report 8.
[Li et al., 2006] Li, M., Zhang, Y., Zhu, M., and Zhou, M. (2006). Explor-
ing distributional similarity based models for query spelling correction. In
Proceedings of the 21st International Conference on Computational Lin-
guistics and the 44th annual meeting of the Association for Computational
Linguistics, pages 1025–1032. Association for Computational Linguistics.
[Li et al., 2003] Li, Y., Bandar, Z. A., and McLean, D. (2003). An approach
for measuring semantic similarity between words using multiple informa-
tion sources. Knowledge and Data Engineering, IEEE Transactions on,
15(4):871–882.
[Liu et al., 2002] Liu, H., Johnson, S. B., and Friedman, C. (2002). Auto-
matic resolution of ambiguous terms based on machine learning and con-
ceptual relations in the UMLS. Journal of the American Medical Informatics
Association, 9(6):621–636.
[McGuinness et al., 2004] McGuinness, D. L., Van Harmelen, F., et al.
(2004). OWL Web Ontology Language overview. W3C recommendation,
10(2004-03):10.
[Miller, 1995] Miller, G. A. (1995). WordNet: a lexical database for English.
Communications of the ACM, 38(11):39–41.
[Muramatsu and Pratt, 2001] Muramatsu, J. and Pratt, W. (2001). Trans-
parent queries: investigating users’ mental models of search engines. In
Proceedings of the 24th annual international ACM SIGIR conference on
Research and development in information retrieval, pages 217–224. ACM.
[Navarro, 2001] Navarro, G. (2001). A guided tour to approximate string
matching. ACM computing surveys (CSUR), 33(1):31–88.
[Noy et al., 2009] Noy, N. F., Shah, N. H., Whetzel, P. L., Dai, B., Dorf, M.,
Griffith, N., Jonquet, C., Rubin, D. L., Storey, M.-A., Chute, C. G., et al.
(2009). BioPortal: ontologies and integrated data resources at the click of
a mouse. Nucleic acids research, 37(suppl 2):W170–W173.
[Obendorf et al., 2007] Obendorf, H., Weinreich, H., Herder, E., and Mayer,
M. (2007). Web page revisitation revisited: implications of a long-term
click-stream study of browser usage. In Proceedings of the SIGCHI con-
ference on Human factors in computing systems, pages 597–606. ACM.
[Petrakis et al., 2006] Petrakis, E. G., Varelas, G., Hliaoutakis, A., and
Raftopoulou, P. (2006). X-similarity: computing semantic similarity be-
tween concepts from different ontologies. Journal of Digital Information
Management, 4(4):233.
[Rada et al., 1989] Rada, R., Mili, H., Bicknell, E., and Blettner, M. (1989).
Development and application of a metric on semantic nets. Systems, Man
and Cybernetics, IEEE Transactions on, 19(1):17–30.
[Resnik, 1995] Resnik, P. (1995). Using information content to evaluate se-
mantic similarity in a taxonomy. arXiv preprint cmp-lg/9511007.
[Rodriguez and Egenhofer, 2003] Rodriguez, M. A. and Egenhofer, M. J.
(2003). Determining semantic similarity among entity classes from dif-
ferent ontologies. Knowledge and Data Engineering, IEEE Transactions
on, 15(2):442–456.
[Rodríguez et al., 1999] Rodríguez, M. A., Egenhofer, M. J., and Rugg, R. D.
(1999). Assessing semantic similarities among geospatial feature class defi-
nitions. In Interoperating Geographic Information Systems, pages 189–202.
Springer.
[Ruthven and Lalmas, 2003] Ruthven, I. and Lalmas, M. (2003). A survey on
the use of relevance feedback for information access systems. The Knowl-
edge Engineering Review, 18(02):95–145.
[Sanchez et al., 2011] Sanchez, D., Batet, M., and Isern, D. (2011).
Ontology-based information content computation. Knowledge-Based Sys-
tems, 24(2):297–303.
[Sanchez et al., 2012] Sanchez, D., Sole-Ribalta, A., Batet, M., and Ser-
ratosa, F. (2012). Enabling semantic similarity estimation across multiple
ontologies: An evaluation in the biomedical domain. Journal of Biomedical
Informatics, 45(1):141–155.
[Schulz et al., 2010] Schulz, S., Schober, D., Tudose, I., and Stenzhorn, H.
(2010). The pitfalls of thesaurus ontologization – the case of the NCI The-
saurus. In AMIA Annual Symposium Proceedings, volume 2010, page 727.
American Medical Informatics Association.
[Seco et al., 2004] Seco, N., Veale, T., and Hayes, J. (2004). An intrinsic
information content metric for semantic similarity in WordNet. In ECAI,
volume 16, page 1089. Citeseer.
[Sutcliffe and Ennis, 1998] Sutcliffe, A. and Ennis, M. (1998). Towards a
cognitive theory of information retrieval. Interacting with computers,
10(3):321–351.
[Tauscher and Greenberg, 1997] Tauscher, L. and Greenberg, S. (1997). How
people revisit web pages: Empirical findings and implications for the design
of history systems. International Journal of Human-Computer Studies,
47(1):97–137.
[Tversky et al., 1977] Tversky, A. et al. (1977). Features of similarity. Psy-
chological review, 84(4):327–352.
[Tversky and Kahneman, 1975] Tversky, A. and Kahneman, D. (1975).
Judgment under uncertainty: Heuristics and biases. Springer.
[VHA, 2012] Veterans Health Administration (VHA) (2012). National Drug
File Reference Terminology (NDF-RT) Documentation. U.S. Department of Veterans Affairs.
[WHO, 1992] World Health Organization (WHO) (1992). International Statistical Classifica-
tion of Diseases and Related Health Problems, Tenth Revision: Introduc-
tion; list of three-character categories; tabular list of inclusions and four-
character subcategories; morphology of neoplasms; special tabulation lists
for mortality and morbidity; definitions; regulations. World Health Orga-
nization.
[Winkler, 1999] Winkler, W. E. (1999). The state of record linkage and
current research problems. In Statistical Research Division, US Census
Bureau. Citeseer.
[Wu and Palmer, 1994] Wu, Z. and Palmer, M. (1994). Verb semantics and
lexical selection. In Proceedings of the 32nd annual meeting on Association
for Computational Linguistics, pages 133–138. Association for Computa-
tional Linguistics.
[Zhang and Zhao, 2011] Zhang, H. and Zhao, S. (2011). Measuring web page
revisitation in tabbed browsing. In Proceedings of the 2011 annual confer-
ence on Human factors in computing systems, pages 1831–1834. ACM.
[Zhou et al., 2008] Zhou, Z., Wang, Y., and Gu, J. (2008). A new model of
information content for semantic similarity in WordNet. In Future Genera-
tion Communication and Networking Symposia, 2008. FGCNS’08. Second
International Conference on, volume 3, pages 85–89. IEEE.
[Zhu et al., 2009] Zhu, S., Zeng, J., and Mamitsuka, H. (2009). Enhancing
MEDLINE document clustering by incorporating MeSH semantic similarity.
Bioinformatics, 25(15):1944–1951.