How to Build a Biomedical Ontology
Success Stories
The Gene Ontology (GO)
SNOMED, ICD and other controlled vocabularies
Ontology Design Principles
Ontology Applications
Barry Smith
http://ontology.buffalo.edu/smith
Uses of ‘ontology’ in PubMed abstracts
2
3
By far the most successful: GO (Gene Ontology)
4
5
Hierarchical view of GO representing relations between represented types
6
Gene Ontology
$100 mill. invested in literature and database curation using the Gene Ontology (GO)
based on the idea of annotation
over 11 million annotations relating gene products (proteins) described in the UniProt, Ensembl and other databases to terms in the GO
multiple secondary uses – because the ontology was not built to meet one specific set of requirements
7
GO provides a controlled system of terms for use in annotating (describing, tagging) data
• multi-species, multi-disciplinary, open source
• contributing to the cumulativity of scientific results obtained by distinct research communities
• compare use of kilograms, meters, seconds in formulating experimental results
8
Sample Gene Array Data
9
where in the cell ?
what kind of molecular function ?
semantic annotation of data
what kind of biological process?
10
natural language labels
to make the data cognitively accessible to human beings
11
compare: legends for maps
12
compare: legends for diagrams
13
ontologies are legends for data
14
compare: legends for maps
15
ontologies are legends for images
16
what lesion ?
what brain function ?
17
ontologies are legends for databases
[Diagram: databases MouseEcotope, GlyProt, DiabetInGene, GluChem; shared term: sphingolipid transporter activity]
18
annotation using common ontologies yields integration of databases
[Diagram: databases MouseEcotope, GlyProt, DiabetInGene, GluChem; shared term: Holliday junction helicase complex]
19
annotation using common ontologies can support comparison of data
20
annotation with Gene Ontology
supports reusability of data
supports search of data by humans
supports comparison of data
supports aggregation of data
supports reasoning with data by humans and machines
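A minimal sketch of how annotation with a shared vocabulary supports search and aggregation across databases. The database names and the two terms are taken from the slides; the gene product identifiers and the annotation records themselves are invented for illustration and do not come from any real database.

```python
from collections import defaultdict

# Hypothetical annotation records: (database, gene product, ontology term).
# The identifiers gp1..gp4 are made up for this sketch.
ANNOTATIONS = [
    ("MouseEcotope", "gp1", "sphingolipid transporter activity"),
    ("GlyProt", "gp2", "sphingolipid transporter activity"),
    ("DiabetInGene", "gp3", "Holliday junction helicase complex"),
    ("GluChem", "gp4", "sphingolipid transporter activity"),
]


def index_by_term(annotations):
    """Group (database, gene product) pairs under the term that annotates them."""
    index = defaultdict(list)
    for database, product, term in annotations:
        index[term].append((database, product))
    return dict(index)


def databases_sharing(term, annotations):
    """Return the sorted names of all databases holding data annotated with term."""
    return sorted({db for db, _, t in annotations if t == term})
```

Because every database uses the same term string, a single lookup retrieves comparable data from all of them; with private vocabularies the same query would need one mapping per database.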
21
22
The goal: virtual science
• consistent (non-redundant) annotation
• cumulative (additive) annotation
yielding, by incremental steps, a virtual map of the entirety of reality that is accessible to computational reasoning
23
This goal is realizable if we have a common ontology framework
data is retrievable
data is comparable
data is integratable
only to the degree that it is annotated using a common controlled vocabulary
– compare the role of seconds, meters, kilograms … in unifying science
24
To achieve this end we have to engage in something like philosophy (?)
is this the right way to organize the top level of this portion of the GO?
how does the top level of this ontology relate to the top levels of other, neighboring ontologies?
25
Strategy for doing this
see the world as organized via types/universals/categories which are hierarchically organized
and in relation to which statements can be formulated which are universally true of all instances:
cell membrane part_of cell
26
Foundational Model of Anatomy Ontology
[Figure: graph of anatomical types linked by is_a and part_of. Node labels: Anatomical Structure, Anatomical Space, Organ, Organ Part, Organ Subdivision, Organ Component, Organ Cavity, Organ Cavity Subdivision, Tissue, Serous Sac, Serous Sac Cavity, Serous Sac Cavity Subdivision, Pleural Sac, Pleural Cavity, Pleura (Wall of Sac), Parietal Pleura, Visceral Pleura, Mediastinal Pleura, Mesothelium of Pleura, Interlobar recess]
27
[Figure: hierarchy labelled substance; species, genera: organism, animal, mammal, cat, siamese, frog; instances at the bottom]
28
29
with thanks to http://dbmotion.com
30
the problem of continuity of care: patients move around
31
synchronic and diachronic problems of semantic interoperability (across space and across time)
32
how can we link EHR 1 to EHR 2 in a reliable, trustworthy, useful way, which both systems can understand?
[Diagram: EHR 1, EHR 2]
33
the ideal solution: WHO International Classification of Diseases
[Diagram: EHR 1, ICD, EHR 2]
ICD
PRO:
De facto US billing standard
Multilanguage
CON:
De facto US billing standard (corrupts data)
No definitions of terms, and so difficult to judge accuracy of hierarchy and of coding
Inconsistent hierarchies
Hard to reason with results
Hence few secondary uses e.g. for research
34
ICD 11
The (ontology-based) plan
multiple views including
◦ billing
◦ public health statistics
◦ research
◦ SNOMED compatibility
35
36
the ideal solution: a single universal clinical vocabulary
[Diagram: EHR 1, SNOMED-CT, EHR 2]
SNOMED CT: Systematized Nomenclature of Medicine-Clinical Terms
PRO:
International standard (sort of)
Huge resource
Free for member countries
Multi-language (including Spanish)
37
SNOMED CT
CON:
Huge (but redundant ... and gappy)
Contains many examples of false synonymy
Still in need of work
◦ No consistent interpretation of relations
◦ Many erroneous relation assertions
◦ Many idiosyncratic relations
◦ Mixes ontology with epistemology
◦ It contains numerous compound terms (e.g., test for X) without the constituent terms (here: X), even where the latter are of obvious salience
38
Coding with SNOMED-CT is unreliable and inconsistent
Multi-stage multi-committee process for adding terms that follows intuitive rules and not formal principles
Does there exist a strategy for evolutionary improvement?
39
SNOMED CT
40
above all: SNOMED CT cannot solve the problem of continuity of care because it has too much redundancy
[Diagram: EHR 1, SNOMED-CT, EHR 2]
41
AND because it is used only in certain countries
[Diagram: EHR 1, SNOMED-CT, EHR 2]
42
link EHR 1 to EHR 2 through a snapshot of the patient’s condition which both systems can understand
[Diagram: EHR 1, Unified Medical Language System (UMLS), EHR 2]
Unified Medical Language System (UMLS)
UMLS is not unified, not a language, not a system (and not only medical); it is an aggregation
If we use something like UMLS as reference terminology, we will not solve the translation problem
43
EN
DE
New York State Center of Excellence in Bioinformatics & Life Sciences
R T U
UMLS approach to countering silo formation
– By ‘linking between different clinical or biomedical vocabularies’
– However: ‘… the Metathesaurus does not represent a comprehensive NLM-authored ontology of biomedicine or a single consistent view of the world. The Metathesaurus preserves the many views of the world present in its source vocabularies because these different views may be useful for different tasks.’
http://www.nlm.nih.gov/pubs/factsheets/umlsmeta.html
Prospective standardization is a good thing
Prospective standardization is the only thing which will work in mission critical domains
Prospective standardization means that certain limits to tolerance must be imposed
Need for top-down governance to ensure common architecture and resolution of border disputes in areas of overlap between domains
46
Principles of Best Practice in Ontology Development
47
Problem of ensuring sensible cooperation in a massively interdisciplinary community
Consider multiple uses of technical terms such as
− type
− concept
− instance
− model
− representation
− data
48
Three Levels
L3. Words, models (published representations, ontologies, databases ...)
L2. Ideas (concepts, thoughts, memories, ...)
L1. Things (cells, planets, processes of cell division ...)
49
Entity =def
anything which exists, including things and processes, functions and qualities, beliefs and actions, documents and software
(entities on levels 1, 2 and 3)
50
First basic distinction among entities
type vs. instance
(science text vs. diary)
(human being vs. Tom Cruise)
51
For ontologies
it is generalizations that are important = types, universals,
kinds, species
52
A 515287 DC3300 Dust Collector Fan
B 521683 Gilmer Belt
C 521682 Motor Drive Belt
Catalog vs. inventory
53
An ontology is a representation of types
We learn about types in reality from looking at the results of scientific experiments in the form of scientific theories
experiments relate to what is particular
science describes what is general
54
Ontology =def.
a representational artifact whose representational units (which may be drawn from a natural or from some formalized language) are intended to represent
1. types in reality
2. those relations between these types which obtain universally (= for all instances)
lung is_a anatomical structure
lobe of lung part_of lung
in accordance with our best current established science
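The definition above can be sketched as a set of type-level triples. This is only an illustration: the first two assertions come from the slide, the other two are assumed additions, and treating is_a and part_of as transitive is an assumption of this sketch rather than something stated here.

```python
# Type-level assertions as (subject, relation, object) triples.
ASSERTIONS = {
    ("lung", "is_a", "anatomical structure"),            # from the slide
    ("lobe of lung", "part_of", "lung"),                 # from the slide
    ("upper lobe of lung", "is_a", "lobe of lung"),      # assumed for illustration
    ("lobe of lung", "is_a", "anatomical structure"),    # assumed for illustration
}


def ancestors(term, relation, assertions):
    """All terms reachable from term by following relation one or more times."""
    found = set()
    frontier = [term]
    while frontier:
        current = frontier.pop()
        for subject, rel, obj in assertions:
            if subject == current and rel == relation and obj not in found:
                found.add(obj)
                frontier.append(obj)
    return found
```

For example, `ancestors("upper lobe of lung", "is_a", ASSERTIONS)` follows two is_a links, which is the kind of step a machine can take only because the relations are asserted to hold for all instances.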
55
[Figure: tree of object types: organism, animal, mammal, cat, siamese, frog; instances at the bottom]
56
Domain =def
a portion of reality that forms the subject-matter of a single science or technology or mode of study or administrative practice:
proteomics
epidemiology
C2
M&S
57
Representation =def
an image, idea, map, picture, name or description ... of some entity or entities.
58
Ontologies are representational artifacts
comparable to science texts
and subject to the same sorts of constraints (including need for update)
59
Representational units =def
terms, icons, alphanumeric identifiers ... which refer, or are intended to refer, to entities
and which are minimal (atoms)
60
Composite representation =def
representation
(1) built out of representational units
which
(2) form a structure that mirrors, or is intended to mirror, the entities in some domain
61
The Periodic Table
62
Ontologies are here
63
or here
64
Ontologies represent general structures in reality (leg)
65
Ontologies do not represent concepts in people’s heads
66
They represent types in reality
67
How do we know which general terms designate types?
Types are repeatables:
cell, electron, weapon, F16 ...
Instances are one-off:
Bill Clinton, this laptop, this handwave
68
Problem
The same general term can be used to refer both to types and to collections of particulars. Consider:
HIV is an infectious retrovirus
HIV is spreading very rapidly through Asia
69
Class =def
a maximal collection of particulars determined by a general term (‘cell’, ‘electron’ but also: ‘restaurant in Palo Alto’, ‘Italian’)
the class A = the collection of all particulars x for which ‘x is A’ is true
70
types vs. their extensions
types
{a,b,c,...} collections of particulars
71
Extension
=def The extension of a type is the class of its instances
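A small sketch of the difference between a type and its extension, assuming particulars are recorded together with the general terms that are true of them. The particulars are borrowed from examples elsewhere in these slides; the data structure and the function name are hypothetical.

```python
# Hypothetical records: each particular mapped to the general terms true of it.
PARTICULARS = {
    "Tom Cruise": {"human being"},
    "Bill Clinton": {"human being"},
    "this laptop": {"laptop"},
}


def extension(general_term, particulars):
    """The class of all particulars x for which 'x is <general_term>' is true."""
    return {x for x, terms in particulars.items() if general_term in terms}
```

The type itself is represented only by its name; what the function returns is a collection of particulars, which changes whenever the records change while the type stays the same.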
72
types vs. classes
types
{c,d,e,...} classes
73
types vs. classes
compare: ‘natural kinds’
types
extensions other sorts of classes
74
types vs. classes
types
populations, ...
the class of all diabetic patients in Leipzig on 4 June 1952
75
OWL is a good representation of classes
• F16s
• sibling of Finnish spy
• member of Abba aged > 50 years
76
types, classes, concepts
types
classes
‘concepts’ ?
77
types < classes < ‘concepts’ ?
Cases of ‘concepts’ which, some people say, do not correspond to classes:
‘Cancelled oophorectomy’
‘Absent nipple’
‘Unlocalized ligand’
A cancelled oophorectomy is not a special kind of conceptual oophorectomy
Use: Information Artifact Ontology (IAO)
78
Principle of Low Hanging Fruit
Include even absolutely trivial assertions (assertions you know to be universally true)
pneumococcal virus is_a virus
Computers need to be led by the hand
79
Example: MeSH
MeSH Descriptors
  Index Medicus Descriptor
    Anthropology, Education, Sociology and Social Phenomena (MeSH Category)
      Social Sciences
        Political Systems
          National Socialism
National Socialism is_a Political Systems
National Socialism is_a Anthropology ...
80
Principle of Singular Nouns
Terms in ontologies represent types
Goal: Each term in an ontology should represent exactly one type
Thus every term should be a singular noun
81
Principle: do not commit the use-mention confusion
mouse =def. common name for the species Mus musculus
swimming is healthy and has eight letters
82
Principle: do not commit the use-mention confusion
Avoid confusing words with things
Avoid confusing concepts in our minds with entities in reality
Recommendation: avoid the word ‘concept’ entirely
83
Trialbank
‘information’ = def. ‘a written or spoken designation of a concept’
84
‘Heparin therapy’ is an instance of ‘written or spoken designation of a concept’
What are the problems here?
1. misuse of quotation marks
2. confusion of instances and types
3. confusion of concept and reality
Trialbank
85
Principle: beware of terminological baggage
For the sake of interoperability with other ontologies, do not give special meanings to terms with established general meanings
(Don’t use ‘cell’ when you mean ‘plant cell’)
86
ICNP: International Classification for Nursing Practice (old version)
water =def. a type of Nursing Phenomenon of Physical Environment with the specific characteristics: clear liquid compound of hydrogen and oxygen that is essential for most plant and animal life influencing life and development of human beings.
87
Principle of definitions
Supply definitions for every term
1. human-understandable natural language definition
2. an equivalent formal definition
88
Principle: definitions must be unique
Each term should have exactly one definition
it may have both natural-language and formal versions
(issue with ontologies which exist with different levels of expressivity)
89
The Problem of Circularity
A Person =def. A person with an identity document
Hemolysis =def. The causes of hemolysis
90
Principle of non-circularity
The term defined should not appear in its own definition
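A rough sketch of an automated check for this principle, assuming definitions are stored as plain strings. The function name is hypothetical, and whole-phrase string matching is a simplification: it will miss inflected forms and synonyms.

```python
import re


def is_circular(term, definition):
    """True if the defined term occurs as a whole word or phrase in its definition."""
    pattern = r"\b" + re.escape(term.lower()) + r"\b"
    return re.search(pattern, definition.lower()) is not None
```

Applied to the two examples from the previous slide, `is_circular("person", "A person with an identity document")` and `is_circular("hemolysis", "The causes of hemolysis")` both flag the definition.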
91
Example: HL7
‘stopping a medication’ = def.
change of state in the record of a Substance Administration Act from Active to Aborted
92
Principle of Increase in Understandability
A definition should use only terms which are easier to understand than the term defined
Definitions should not make simple things more difficult than they are
93
Generalized Tarski principle (a good, general constraint on a theory of meaning)
For each linguistic expression ‘E’
‘E’ means E
‘snow’ means: snow
‘pneumonia’ means: pneumonia
94
HL7 Reference Information Model
‘medication’ does not mean: medication
rather it means:
the record of medication in an information system
‘disease’ does not mean: disease
rather it means:
the observation of a disease
95
Principle of Acknowledging Primitives
In every ontology some terms and some relations are primitive = they cannot be defined (on pain of infinite regress)
Examples of primitive relations:
identity
instance_of
96
Principle of Aristotelian Definitions
Use Aristotelian definitions
An A is a B which C’s.
A human being is an animal which is rational
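One way to hold definitions of the form ‘an A is a B which Cs’ is as structured records, so that the is_a parent can be read off the definition. This is a sketch under that assumption; the class and field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AristotelianDefinition:
    term: str         # the A being defined
    genus: str        # the B, the immediate is_a parent
    differentia: str  # the C, what marks out the As among the Bs

    def render(self):
        """Natural-language form of the definition."""
        return f"{self.term} =def. {self.genus} which {self.differentia}"

    def is_a_assertion(self):
        """The is_a link implied by the definition."""
        return (self.term, "is_a", self.genus)


HUMAN = AristotelianDefinition("human being", "animal", "is rational")
```

Since each definition names exactly one genus, a set of such records yields a hierarchy in which every defined term has a single asserted parent.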
97
Rules for Formulating Terms
Avoid abbreviations even when it is clear in context what they mean (‘breast’ for ‘breast tumor’)
Avoid acronyms
Avoid mass terms (‘tissue’, ‘brain mapping’, ‘clinical research’ ...)
Treat each term ‘A’ in an ontology as shorthand for a term of the form ‘the type A’
98
Univocity
Terms should have the same meanings on every occasion of use.
(= They should refer to the same types)
Basic ontological relations such as is_a and part_of should be used in the same way by all ontologies
99
Universality
Ontologies are made of relational assertions
They should include only those which hold universally
100
Universality
Often, order will matter:
We can assert
adult transformation_of child
but not
child transforms_into adult
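Why order matters can be sketched with instance-level data, assuming that a type-level assertion ‘A R B’ is read as: every instance of A stands in R to some instance of B. That reading, the record identifiers and the function name are assumptions of this sketch; it also ignores time and treats each life stage as a separate record.

```python
# Hypothetical instance records and the type each one instantiates.
INSTANCE_OF = {
    "adult1": "adult", "adult2": "adult",
    "child1": "child", "child2": "child", "child3": "child",
}

# Hypothetical instance-level links. child3 never became an adult.
LINKS = {
    ("adult1", "transformation_of", "child1"),
    ("adult2", "transformation_of", "child2"),
    ("child1", "transforms_into", "adult1"),
    ("child2", "transforms_into", "adult2"),
}


def holds_universally(type_a, relation, type_b, instance_of, links):
    """True if every instance of type_a bears relation to some instance of type_b."""
    a_instances = [x for x, t in instance_of.items() if t == type_a]
    b_instances = [x for x, t in instance_of.items() if t == type_b]
    return all(
        any((a, relation, b) in links for b in b_instances)
        for a in a_instances
    )
```

Every adult record has a child it is a transformation of, so the first assertion passes; child3 has no adult it transforms into, so the second fails.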
101
Universality
viral pneumonia caused by virus
but not
virus causes pneumonia
pneumococcal virus causes pneumonia
102
Principle of Universality
results analysis later_than protocol-design
but not
protocol-design earlier_than results analysis
103
Principle of Positivity
Complements of types are not themselves types.
Terms such as
non-mammal
non-membrane
other metalworker in New Zealand
do not designate types in reality
104
Generalized Anti-Boolean Principle
There are no conjunctive and disjunctive types:
anatomic structure, system, or substance
musculoskeletal and connective tissue disorder
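Violations of the positivity and anti-Boolean principles often show up in the term label itself, so a crude lint can catch many of them. The patterns below are assumptions chosen to match the examples on these slides; they are heuristics and will produce false positives on legitimate labels that happen to contain ‘and’ or ‘or’.

```python
import re

# (principle name, pattern that suggests a violation); both are heuristic.
RULES = [
    ("positivity", re.compile(r"^(non-|other )", re.IGNORECASE)),
    ("anti-Boolean", re.compile(r"\b(and|or)\b", re.IGNORECASE)),
]


def check_label(label):
    """Return the names of the principles that label appears to violate."""
    return [name for name, pattern in RULES if pattern.search(label)]
```

A flagged label is a prompt for a curator to look again, not proof that the term is badly formed.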
105
Objectivity
Which types exist in reality is not a function of our knowledge.
Terms such as
unknown
unclassified
unlocalized
arthropathies not otherwise specified
do not designate types in reality.
106
Keep Epistemology Separate from Ontology
If you want to say that
We do not know where A’s are located
do not invent a new class of
A’s with unknown locations
(A well-constructed ontology should grow linearly; it should not need to delete classes or relations because of increases in knowledge)
107
If you want to say
I surmise that this is a case of pneumonia
do not invent a new class of surmised pneumonias
Confusion of ‘findings’ in medical terminologies
Keep Sentences Separate from Terms
108
Single Inheritance
No kind in a classificatory hierarchy should be asserted to have more than one is_a parent on the immediate higher level
109
Multiple Inheritance
[Diagram: thing at the top; car and blue thing beneath it; blue car at the bottom, linked by is_a to both car and blue thing]
110
Multiple Inheritance
is a source of errors
encourages laziness
serves as obstacle to integration with neighboring ontologies
hampers use of Aristotelian methodology for defining terms
hampers use of statistical search tools
111
Multiple Inheritance
[Diagram: thing at the top; car and blue thing beneath it; blue car at the bottom, linked to car by is_a1 and to blue thing by is_a2]
112
Principle of asserted single inheritance
Each reference ontology module should be built as an asserted monohierarchy (a hierarchy in which each term has at most one parent)
Asserted hierarchy vs. inferred hierarchy
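Whether an asserted hierarchy is a monohierarchy can be checked mechanically. A minimal sketch, assuming the asserted is_a links are available as (child, parent) pairs; the edge list reproduces the car example from the Multiple Inheritance slides and the function name is hypothetical.

```python
from collections import defaultdict

# Asserted is_a links as (child, parent) pairs, from the blue car example.
EDGES = [
    ("car", "thing"),
    ("blue thing", "thing"),
    ("blue car", "car"),
    ("blue car", "blue thing"),
]


def multiple_parents(is_a_edges):
    """Map each term with more than one asserted is_a parent to its sorted parents."""
    parents = defaultdict(set)
    for child, parent in is_a_edges:
        parents[child].add(parent)
    return {c: sorted(p) for c, p in parents.items() if len(p) > 1}
```

An empty result means the asserted hierarchy is a monohierarchy; additional parents can still be computed later as part of the inferred hierarchy.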
113
Principle of normalization
Polyhierarchies should be decomposable into homogeneous disjoint monohierarchies
114
Principle of instantiability
A term should be included in an ontology only if there is evidence that instances to which that term refers exist or have existed or can exist in reality.
Fist
Crowd
115
Avoid mass nouns
Count nouns = an organism, a planet, a handshake
Mass nouns = tissue, information, discourse
Mass nouns almost always go hand in hand with ontological confusion
116
is_a Overloading
The success of ontology alignment demands that ontological relations (is_a, part_of, ...) have the same meanings in the different ontologies to be aligned.
117
Multiple Inheritance
[Diagram: thing at the top; car and blue thing beneath it; blue car at the bottom, linked to car by is_a1 and to blue thing by is_a2]
118
How to solve this problem
Create two ontologies:
of cars
of colors
Link the two together via cross-products
(= factoring, normalization, modularization)
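A sketch of the cross-product idea, assuming each of the two ontologies is stored as a child-to-parent map, so that every term has at most one asserted parent. The terms ‘sports car’ and ‘navy blue’, the class name and the function names are hypothetical additions for illustration.

```python
from dataclasses import dataclass

# Two asserted monohierarchies: each term has at most one is_a parent.
CAR_PARENT = {"sports car": "car"}
COLOR_PARENT = {"navy blue": "blue", "blue": "color"}


@dataclass(frozen=True)
class CrossProduct:
    genus: str  # a term from the car ontology
    color: str  # a term from the color ontology


def lineage(term, parent_of):
    """The term followed by its is_a ancestors, nearest first."""
    chain = [term]
    while chain[-1] in parent_of:
        chain.append(parent_of[chain[-1]])
    return chain


def subsumes(general, specific):
    """Inferred is_a: both components of specific fall under those of general."""
    return (general.genus in lineage(specific.genus, CAR_PARENT)
            and general.color in lineage(specific.color, COLOR_PARENT))
```

Neither ontology asserts a second parent for any term; that a navy blue sports car is a blue car is inferred from the two component hierarchies.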
119
Compositionality
The meanings of compound terms should be determined
1. by the meanings of component terms
together with
2. the rules governing syntax
120
User feedback principle
An ontology should evolve on the basis of feedback derived from those who are using the ontology, for example for purposes of annotation.
121