1
How to Build an Ontology
Barry Smith
http://ontology.buffalo.edu/smith
2
Ontology
A classification of entities and the relations between them.
Ontology is a list of types structured by relations
Defined by a scientific field's vocabulary and by the canonical formulations of its theories.
Scientific theories consist of generalizations.
What I will not be talking about: XML, OWL, ..., data(types), information models, file formats ...
3
Top-Level
GO
OBO, OBO Core
NCBO
FMA
NCBC Roadmap Centers
NCI EVS
NECTAR (National Electronic Clinical Trials and Research) Network
4
Instances are not included in an ontology
It is the generalizations that are important
(but instances must still be taken into account)
5
A 515287 DC3300 Dust Collector Fan
B 521683 Gilmer Belt
C 521682 Motor Drive Belt
6
Ontology Types Instances
7
Ontology = A Representation of Types
8
Ontology = A Representation of Types
Each node of an ontology consists of:
• preferred term (aka term)
• term identifier (TUI, aka CUI)
• synonyms
• definition, glosses, comments
9
Ontology = A Representation of Types
Nodes in an ontology are connected by relations:
primarily: is_a (= is subtype of) and part_of
designed to support search, reasoning and annotation
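A minimal sketch of how such a node and its relations might be held in code. The field names follow the list on the slide; the `EX:` identifiers, the synonym and the particular part_of link are illustrative assumptions, not taken from any real ontology file.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One ontology node: a representation of a type."""
    identifier: str                      # term identifier (hypothetical scheme)
    term: str                            # preferred term
    synonyms: list = field(default_factory=list)
    definition: str = ""
    # Outgoing relations, keyed by relation name, holding target identifiers.
    relations: dict = field(default_factory=dict)

    def relate(self, relation, target):
        """Record that this node stands in `relation` to `target`."""
        self.relations.setdefault(relation, []).append(target.identifier)


# Illustrative nodes only.
cell = Node("EX:0001", "cell")
membrane = Node("EX:0002", "plasma membrane", synonyms=["cell membrane"])
membrane.relate("part_of", cell)
```

Storing relations by identifier rather than by term keeps the links stable if a preferred term is later renamed.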
10
Rules for formatting terms
• Terms are names of types: if you prefix a term with ‘the type ___’, the term should still make sense
• Hence: terms should be in the singular
• Terms should be lower case
• Avoid abbreviations even when it is clear in context what they mean (‘breast’ for ‘breast tumor’)
11
Motivation: to capture reality
Inferences and decisions we make are based upon what we know of reality.
An ontology is a computable representation of this underlying bio(techno)logical reality.
Enables a computer to reason over the data in (some of) the ways that we do.
12
Biomedical ontology integration / interoperability
Will never be achieved through integration of meanings or concepts
The problem is precisely that different user communities use different concepts
What’s really needed is to have well-defined commonly used relationships
13
Concepts
Biomedical ontology integration will never be achieved through integration of meanings or concepts
The problem is precisely that different user communities use different concepts
14
Concepts
Concepts are in your head and will change as our understanding changes
Ontologies represent types: not concepts, meanings, ideas ...
Types exist, with their instances, in objective reality
– including types of experimental process, design, method, ...
15
Most ontologies are execrable
But some good ontologies do already exist
• as far as possible don’t reinvent
• use the power of combination and collaboration
• ontologies are like telephones: they are valuable only to the degree that they are used and networked with other ontologies
16
Why do we need rules/standards for good ontology?
Ontologies must be intelligible both to humans (for annotation) and to machines (for reasoning and error-checking): unintuitive rules for classification lead to errors
Intuitive rules facilitate training of curators and annotators
Common rules allow alignment with other ontologies
Logically coherent rules enhance harvesting of content through automatic reasoning systems
17
Rules on types
Don’t confuse types with concepts
Don’t confuse types with ways of getting to know types
Don’t confuse types with ways of talking about types
Don’t confuse types with data about types
18
First Rule: Univocity
Terms (including those describing relations) should have the same meanings on every occasion of use.
In other words, they should refer to the same types in reality
19
Second Rule: Positivity
There are no negative types
Terms such as ‘non-mammal’ or ‘non-membrane’ do not designate genuine types.
(There are also no conjunctive and disjunctive types: rabbit and nailfile; rabbit or nosewipe)
20
Third Rule: Objectivity
Which types exist is not a function of our biological knowledge.
Terms such as ‘unknown’ or ‘unclassified’ or ‘unlocalized’ do not designate biological natural kinds.
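The rules on positivity and objectivity, together with the lower-case rule for terms, lend themselves to a crude automatic check. The sketch below is only a string heuristic written for illustration: the function name `lint_term` and the messages are assumptions, and a real curation pipeline would need human review of whatever it flags.

```python
import re

# Words the Objectivity rule singles out as not designating natural kinds.
KNOWLEDGE_WORDS = {"unknown", "unclassified", "unlocalized"}


def lint_term(term):
    """Return a list of rule violations suspected in `term` (heuristic only)."""
    problems = []
    words = re.findall(r"[a-z]+", term.lower())
    if term != term.lower():
        problems.append("not lower case")
    if term.lower().startswith("non-"):
        problems.append("negative term (Positivity)")
    if "and" in words or "or" in words:
        problems.append("conjunctive or disjunctive term (Positivity)")
    if KNOWLEDGE_WORDS & set(words):
        problems.append("knowledge-dependent term (Objectivity)")
    return problems


report = {t: lint_term(t) for t in
          ["non-mammal", "rabbit or nosewipe", "plasma membrane"]}
```

A term that passes this check can of course still be a bad term; the check only catches the surface patterns named in the rules.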
21
Fourth Rule: Single Inheritance
No type in a classificatory hierarchy should have more than one is_a parent on the immediate higher level
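A short sketch of how the single inheritance rule could be checked mechanically, assuming the hierarchy is available as a list of (child, parent) is_a pairs. The node names A, B, C, D are placeholders.

```python
def multiple_inheritance(is_a_edges):
    """Map each type with more than one is_a parent to its set of parents."""
    parents = {}
    for child, parent in is_a_edges:
        parents.setdefault(child, set()).add(parent)
    return {child: ps for child, ps in parents.items() if len(ps) > 1}


# Placeholder hierarchy: A has two is_a parents, which breaks the rule.
EDGES = [("A", "B"), ("A", "C"), ("B", "D")]
violations = multiple_inheritance(EDGES)
```

Each entry in `violations` is a place where a curator should ask whether both links really mean is_a in the same sense.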
22
Rule of Single Inheritance
no diamonds:
[Diagram: nodes A, B and C linked by is_a1 and is_a2]
23
Problems with multiple inheritance
[Diagram: A is_a1 B; A is_a2 C]
‘is_a’ no longer univocal
24
‘is_a’ is pressed into service to mean a variety of different things
shortfalls from single inheritance are often clues to incorrect entry of terms and relations
the resulting ambiguities make the rules for correct entry difficult to communicate to human curators
25
is_a Overloading
serves as obstacle to integration with neighboring ontologies
The success of ontology alignment demands that ontological relations (is_a, part_of, ...) have the same meanings in the different ontologies to be aligned.
26
To the degree that the above rules are not satisfied, error
checking and ontology alignment will be achievable,
at best, only with human intervention and via force
majeure
27
Current Best Practice: The Foundational Model of Anatomy
28
[Figure: fragment of the FMA hierarchy around Pleural Sac, with is_a and part_of links. Nodes: Anatomical Structure, Anatomical Space, Organ, Organ Part, Organ Cavity, Organ Subdivision, Organ Component, Organ Cavity Subdivision, Tissue, Serous Sac, Serous Sac Cavity, Serous Sac Cavity Subdivision, Pleural Sac, Pleural Cavity, Pleura (Wall of Sac), Parietal Pleura, Visceral Pleura, Mediastinal Pleura, Mesothelium of Pleura, Interlobar recess]
29
Current Best Practice: The Foundational Model of Anatomy
Follows formal rules for definitions laid down by Aristotle.
When A is_a B, the definition of ‘A’ takes the form:
an A =def. a B which ...
a human being =def. an animal which is rational
30
FMA Example
Cell =def. an anatomical structure which consists of cytoplasm surrounded by a plasma membrane with or without a cell nucleus
Plasma membrane =def. a cell part that surrounds the cytoplasm
31
The FMA regimentation
Brings the advantage that each definition reflects the position in the hierarchy to which a defined term belongs.
The position of a term within the hierarchy enriches its own definition by incorporating automatically the definitions of all the terms above it.
The entire information content of the FMA’s term hierarchy can be translated very cleanly into a computer representation
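A minimal sketch of how a definition can inherit from the terms above it, assuming each defined term is stored as a (genus, differentia) pair. The human being entry is from the slides; the differentia given for animal is a made-up placeholder, and organism is treated as primitive.

```python
# term -> (genus, differentia); terms missing from the table are primitive.
DEFS = {
    "human being": ("animal", "is rational"),           # from the slides
    "animal": ("organism", "is capable of locomotion"),  # placeholder differentia
}


def lineage(term):
    """Return the chain of genera from `term` up to a primitive term."""
    chain = [term]
    while chain[-1] in DEFS:
        chain.append(DEFS[chain[-1]][0])
    return chain


def expanded_definition(term):
    """Unfold a definition by collecting every differentia above the term."""
    chain = lineage(term)
    differentiae = [DEFS[t][1] for t in reversed(chain[:-1])]
    if not differentiae:
        return chain[-1]          # primitive term: nothing to unfold
    return chain[-1] + " which " + " and which ".join(differentiae)


full = expanded_definition("human being")
```

The loop in `lineage` assumes the table contains no circular definitions; a circular entry would make it run forever.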
32
GO is now adopting structured definitions which contain both genus and differentiae
Essence = Genus + Differentiae
neuron cell differentiation =
Genus: differentiation (processes whereby a relatively unspecialized cell acquires the specialized features of..)
Differentiae: acquires features of a neuron
33
Ontology alignment
One of the current goals of GO is to align Cell Types in GO with Cell Types in the Cell Ontology:

Cell Types in GO                    Cell Types in the Cell Ontology
cone cell fate commitment           retinal_cone_cell
keratinocyte differentiation        keratinocyte
adipocyte differentiation           fat_cell
dendritic cell activation           dendritic_cell
lymphocyte proliferation            lymphocyte
T-cell homeostasis                  T_lymphocyte
garland cell differentiation        garland_cell
heterocyst cell differentiation     heterocyst
34
Alignment of the two ontologies will permit the generation of consistent and complete definitions
id: CL:0000062
name: osteoblast
def: "A bone-forming cell which secretes an extracellular matrix. Hydroxyapatite crystals are then deposited into the matrix to form bone." [MESH:A.11.329.629]
is_a: CL:0000055
relationship: develops_from CL:0000008
relationship: develops_from CL:0000375
GO + Cell type = New Definition
Osteoblast differentiation: Processes whereby an osteoprogenitor cell or a cranial neural crest cell acquires the specialized features of an osteoblast, a bone-forming cell which secretes extracellular matrix.
35
Other Ontologies to be aligned with GO
Chemical ontologies
– 3,4-dihydroxy-2-butanone-4-phosphate synthase activity
Anatomy ontologies
– metanephros development
GO itself
– mitochondrial inner membrane peptidase activity
OBO core
36
eventually to comprehend all of OBO
37
Top Level OBO-UBO
continuants: objects, characteristics, spatial regions
occurrents: processes, temporal regions, spatio-temporal regions
38
Definitions should be intelligible to both machines and humans
Machines can cope with the full formal representation
Humans need modularity
39
Fifth Rule: Terms and relations should have clear definitions
These tell us how the ontology relates to the world of biological instances, meaning the actual particulars in reality:
– actual cells, actual portions of cytoplasm, and so on
40
But
Some terms are primitive (cannot be defined)
AVOID CIRCULAR DEFINITIONS!
Avoid definitions of the forms:
An A is an A which is B (person = person with identity documents)
An A is the B of an A (heptolysis = the causes of heptolysis)
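Both forms of circularity share one surface symptom: the defined term reappears inside its own definition. A rough sketch of a check for that symptom follows; `is_circular` is a hypothetical helper, and it is only a whole-word text match, so it would also flag a definition that merely mentions a compound term containing the defined word.

```python
import re


def is_circular(term, definition):
    """True if `term` occurs as a whole word or phrase in its own definition."""
    pattern = r"\b" + re.escape(term.lower()) + r"\b"
    return re.search(pattern, definition.lower()) is not None


# The two examples from the slide, plus one non-circular definition.
checks = [
    is_circular("person", "person with identity documents"),
    is_circular("heptolysis", "the causes of heptolysis"),
    is_circular("plasma membrane", "a cell part that surrounds the cytoplasm"),
]
```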
41
[Figure: a hierarchy of types (substance, organism, animal, mammal, cat, siamese, frog), with the labels ‘types’, ‘leaf type’ and ‘instances’]
42
Benefits of well-defined relationships
If the relations in an ontology are well-defined, then reasoning can cascade from one relational assertion (A R1 B) to the next (B R2 C).
‘Find all DNA binding proteins’ should also find all transcription factor proteins, because
transcription factor is_a DNA binding protein
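The query above can be sketched as a search that follows is_a links downwards before matching annotations. The single is_a edge is from the slide; the protein names and the kinase annotation are hypothetical data made up for the example.

```python
def subtypes(is_a_edges, ancestor):
    """Return every type that reaches `ancestor` through a chain of is_a edges."""
    children = {}
    for child, parent in is_a_edges:
        children.setdefault(parent, set()).add(child)
    found, stack = set(), [ancestor]
    while stack:
        for child in children.get(stack.pop(), ()):
            if child not in found:
                found.add(child)
                stack.append(child)
    return found


def search(annotations, is_a_edges, query_type):
    """Find items annotated with `query_type` or with any of its subtypes."""
    wanted = {query_type} | subtypes(is_a_edges, query_type)
    return {item for item, typ in annotations.items() if typ in wanted}


IS_A = [("transcription factor", "DNA binding protein")]
# Hypothetical annotated data items.
ANNOTATIONS = {"protein_1": "transcription factor",
               "protein_2": "DNA binding protein",
               "protein_3": "kinase"}
hits = search(ANNOTATIONS, IS_A, "DNA binding protein")
```

The search is only as good as the is_a links it follows, which is why the next slide's mis-asserted links are so damaging.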
43
What happens when an ontology has no clear definition of A is_a B:
cancer documentation is_a cancer
disease prevention is_a disease
living subject is_a information object representing an animal or complex organism
individual allele is_a act of observation
44
[Figure: the FMA hierarchy fragment around Pleural Sac, shown again with its is_a and part_of links]
45
How to define A is_a B
A is_a B =def.
all instances of A are as a matter of biological science also instances of B
here A and B are names of types in reality
46
How to define A is_a B
A is_a B =def.
for all a if a instance_of A, then a instance_of B
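Read over a finite record of instances, this definition becomes a simple check. The sketch below assumes instance data is held as (particular, type) pairs; the particulars are invented. A recorded sample can refute an is_a claim but cannot establish one, since the definition speaks of all instances as a matter of biological science.

```python
def holds_is_a(instance_of, a_type, b_type):
    """True if every recorded instance of a_type is also recorded under b_type."""
    a_instances = {x for x, t in instance_of if t == a_type}
    return all((x, b_type) in instance_of for x in a_instances)


# Hypothetical particulars: one cat and one mammal that is not a cat.
INSTANCE_OF = {("tibbles", "cat"), ("tibbles", "mammal"), ("rex", "mammal")}
cat_is_a_mammal = holds_is_a(INSTANCE_OF, "cat", "mammal")
mammal_is_a_cat = holds_is_a(INSTANCE_OF, "mammal", "cat")
```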
47
Kinds of relations
Between types:– is_a, part_of, ...
Between an instance and a type– this explosion instance_of the type explosion
Between instances:– Mary’s heart part_of Mary
48
Part_of as a relation between types is more problematic than is standardly supposed
heart part_of human being ?
human heart part_of human being ?
human being has_part human testis ?
testis part_of human being ?
49
Definition of part_of as a relation between types
A part_of B =Def all instances of A are instance-level parts of some instance of B
human testis part_of adult human being
50
Instance level
this nucleus is adjacent to this cytoplasm
implies:
this cytoplasm is adjacent to this nucleus
Type level
nucleus adjacent_to cytoplasm
Not: cytoplasm adjacent_to nucleus
seminal vesicle adjacent_to urinary bladder
Not: urinary bladder adjacent_to seminal vesicle
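A small sketch of why a relation that is symmetric between instances need not be symmetric between types. It assumes the type-level relation is read in the all-some form (every instance of A is adjacent to some instance of B), and it uses invented particulars: one nucleus, and two portions of cytoplasm of which only one has a nucleus next to it.

```python
def all_some(instance_of, related, a_type, b_type):
    """True if every instance of a_type is related to some instance of b_type."""
    a_instances = [x for x, t in instance_of if t == a_type]
    b_instances = [x for x, t in instance_of if t == b_type]
    return all(any((a, b) in related for b in b_instances)
               for a in a_instances)


# Hypothetical particulars; c2 is cytoplasm in a cell without a nucleus.
INSTANCE_OF = [("n1", "nucleus"), ("c1", "cytoplasm"), ("c2", "cytoplasm")]
touching = {("n1", "c1")}
# Instance-level adjacency is symmetric, so add each pair in both directions.
ADJACENT = touching | {(y, x) for x, y in touching}

nucleus_to_cytoplasm = all_some(INSTANCE_OF, ADJACENT, "nucleus", "cytoplasm")
cytoplasm_to_nucleus = all_some(INSTANCE_OF, ADJACENT, "cytoplasm", "nucleus")
```

The second check fails only because of c2, which is enough to block the type-level claim in that direction.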
51
Definitions of the all-some form
allow cascading inferences
If A R1 B and B R2 C, then we know that
every A stands in R1 to some B, but we know also that, whichever B this is, it can be plugged into the R2 relation
52
[Figure: transformation_of. Along a time axis, the same instance c is an instance of C at t and of C1 at t1. Examples shown: pre-RNA and mature RNA; child and adult]
53
transformation_of
A transformation_of B =Def. Every instance of A was at some earlier time an instance of B
adult transformation_of child
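A sketch of this definition applied to recorded histories, assuming each particular carries a list of (time, type) observations. The two people and the times are invented for the example.

```python
def holds_transformation_of(history, a_type, b_type):
    """True if every recorded instance of a_type was earlier an instance of b_type.

    history maps each particular to a list of (time, type) observations.
    """
    for observations in history.values():
        for time, typ in observations:
            if typ != a_type:
                continue
            earlier = any(t < time and ty == b_type for t, ty in observations)
            if not earlier:
                return False
    return True


# Hypothetical histories: the same particular is first a child, then an adult.
HISTORY = {"mary": [(0, "child"), (20, "adult")],
           "john": [(5, "child"), (30, "adult")]}
adult_from_child = holds_transformation_of(HISTORY, "adult", "child")
child_from_adult = holds_transformation_of(HISTORY, "child", "adult")
```

The key point is that the identity of the particular is preserved across the change of type, which is what separates transformation_of from derives_from.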
54
[Figure: embryological development. The same instance c is an instance of C at t and of C1 at t1]
55
[Figure: tumor development. The same instance c is an instance of C at t and of C1 at t1]
56
[Figure: derives_from. Along a time axis, instance c of C and instance c' of C' exist at t, and a distinct instance c1 of C1 exists at t1. Example shown: zygote derives_from ovum, sperm]
57
One main obstacle to integrating biological and experiment-generated data
Most ontologies have no facility for dealing with time and instances
58
EXPO: Experiment Ontology
59
representational style part_of experimental hypothesis
experimental actions part_of experimental design
60
tool part_of experimental design
(confuses object with specification)
61
hypothesis driven is_a Galilean
62
physical is_a scientific experiment
(avoid abbreviations)
63
admin info about experiment is_a scientific experiment
64
where is the top level? objects, processes, characteristics
65
is_a and part_of never cross categorial divides
(cf. tripartite organization of GO)
if A is_a B
then A is an object type iff B is an object type
then A is a process type iff B is a process type
then A is a characteristic type iff B is a characteristic type
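A sketch of how this rule could be enforced, assuming every term has already been assigned to one of the three top-level categories. The category assignments and the third edge are invented for illustration; the third edge is deliberately wrong.

```python
def category_violations(category, edges):
    """Return is_a / part_of edges whose two ends lie in different categories."""
    return [(child, rel, parent) for child, rel, parent in edges
            if rel in ("is_a", "part_of") and category[child] != category[parent]]


# Hypothetical category assignments.
CATEGORY = {"cell": "object", "plasma membrane": "object",
            "differentiation": "process", "cell differentiation": "process"}
EDGES = [("plasma membrane", "part_of", "cell"),
         ("cell differentiation", "is_a", "differentiation"),
         ("cell differentiation", "is_a", "cell")]   # deliberately wrong edge
bad = category_violations(CATEGORY, EDGES)
```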
66
Some thoughts on time
continuants vs. occurrents
objects, characteristics vs. processes
time
timeline
day
daytime
menstrual cycle
high tide
67
What is time?
68
Top Level OBO-UBO
continuants: objects, characteristics, spatial regions
occurrents: processes, temporal regions, spatio-temporal regions
Space = the largest spatial region
Time = the largest temporal region
69
Relative time, subjective time
terms describing (regions of) time in special (qualitative, perspective-dependent, landmark dependent) ways
tomorrow, yesterday
uptown, downtown
phase A trial
Wednesday
70
Characteristics are continuants
many characteristics have realizations, applications or executions, which are processes
plan
design
method
menstrual cycle
function
71
GlaxoSmithKline*
What we need is “industrial-strength” ontologies with a consistent and rich representation formalism that are amenable for use as an integration framework, and support reasoning capabilities. We anticipate that pharma’s need to bring together mountains of data and information and to properly analyse that information all depend on having a stable, well-developed semantic framework that links information/data and that allows reasoning systems to perform some of our more "mundane" analysis work.
*Robin McEntire
72
OBO Relation Ontology
“Relations in Biomedical Ontologies”, Genome Biology, Apr. 2005
relations for continuants behave differently from relations for processes
73
part_of for component types is time-indexed
A part_of B =def. given any particular a and any time t, if a is an instance of A at t, then there is some instance b of B such that a is an instance-level part_of b at t
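A sketch of the time-indexed definition applied to finite records. It assumes instantiation is stored as (particular, type, time) triples and parthood as (part, whole, time) triples, and it reads ‘some instance b of B’ as an instance of B at the same time t, which is an interpretive assumption. The particulars and times are invented.

```python
def holds_part_of(instance_at, part_at, a_type, b_type):
    """Time-indexed, all-some check over finite records.

    instance_at: set of (particular, type, time) triples
    part_at:     set of (part, whole, time) triples
    """
    for a, a_typ, t in instance_at:
        if a_typ != a_type:
            continue
        # Candidate wholes: instances of b_type at the same time t.
        wholes = [b for b, b_typ, b_t in instance_at
                  if b_typ == b_type and b_t == t]
        if not any((a, b, t) in part_at for b in wholes):
            return False
    return True


# Hypothetical records: Mary's heart at two times.
INSTANCE_AT = {("heart_1", "human heart", 1), ("heart_1", "human heart", 2),
               ("mary", "human being", 1), ("mary", "human being", 2)}
PART_AT = {("heart_1", "mary", 1), ("heart_1", "mary", 2)}
in_body = holds_part_of(INSTANCE_AT, PART_AT, "human heart", "human being")
# Dropping the record for time 2 leaves heart_1 without a whole at that time.
removed = holds_part_of(INSTANCE_AT, {("heart_1", "mary", 1)},
                        "human heart", "human being")
```

The time index matters for continuants because the same particular can gain and lose parts while remaining the same particular.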
74
part_of for process types is not time-indexed
A part_of B =def. given any particular a, if a is an instance of A, then there is some instance b of B such that a is an instance-level part_of b
75
Main Upper Level Ontologies
CYC
Cycorp (Austin, TX)
human being = partially tangible thing
SUO (Suggested Upper Ontology)
IEEE
monkey, body covering
DOLCE (Descriptive Ontology for Linguistic and Cognitive Engineering)
BFO (Basic Formal Ontology)
76
SUO top level
Entity
– Physical
  • Object
    – SelfConnectedObject
      » Substance
      » CorpuscularObject
      » Food
    – Region
    – Collection
    – Agent
  • Process
– Abstract
  • SetOrClass
  • Relation
  • Quantity
    – Number
    – PhysicalQuantity
  • Attribute
  • Proposition
77
MIGS Specification Top Levels
Organism
Phenotype
Environment
Sample Process
Data Process